Monday, 22 August 2016

First Data Pipeline - from Corpus to Word Cloud

Following on from the previous post on the need for a text processing pipeline framework ... I've just implemented a simple one.

It's simple, but it powerfully illustrates the ideas discussed last time. It also starts to flesh out the framework, which will be provided as a Python package for easy reuse.

Simple Pipeline

To recap, the simple pipeline for creating word clouds is:
  1. get text (from data_sets/recipes/txt/??.txt)
  2. simplify whitespace (remove multiple spaces, change line feeds to whitespace)
  3. filter out any non-alphanumeric characters
  4. lowercase the text
  5. split text into words
  6. remove stop words (from stopwords/minimal-stop.txt)
  7. keep only words of a minimum length (5)
  8. count word frequency
  9. plot word cloud
This is a few more steps than before, added to deal with unruly source text. For example, the source text can contain multiple spaces, tabs, and new lines, all of which need to be simplified down to a single space.
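To make the steps above concrete, here is a minimal sketch of steps 2 to 8 using only the Python standard library. It is not the actual text_mining_toolkit code: apart from split_text_into_words(), the function names, the sample text and the tiny stop word list are assumptions made up for illustration. Step 1 (reading the files) and step 9 (plotting) are left out.

```python
import re
from collections import Counter


def simplify_whitespace(text):
    # collapse runs of spaces, tabs and new lines into a single space
    return re.sub(r"\s+", " ", text).strip()


def keep_only_alphanumeric_characters(text):
    # drop punctuation and symbols, but keep spaces so words stay separable
    return "".join(c for c in text if c.isalnum() or c == " ")


def lowercase_text(text):
    return text.lower()


def split_text_into_words(text):
    return text.split()


def remove_stop_words(words, stop_words):
    return [word for word in words if word not in stop_words]


def keep_words_of_minimum_length(words, minimum_length):
    return [word for word in words if len(word) >= minimum_length]


def count_word_frequency(words):
    return Counter(words)


# illustrative inputs only, not taken from the recipes corpus
sample_text = "Boil  the\tpasta;\nthen add the Tomato sauce, and more pasta!"
sample_stop_words = {"the", "and", "then"}

text = simplify_whitespace(sample_text)
text = keep_only_alphanumeric_characters(text)
text = lowercase_text(text)
words = split_text_into_words(text)
words = remove_stop_words(words, sample_stop_words)
words = keep_words_of_minimum_length(words, 5)
word_counts = count_word_frequency(words)

print(word_counts.most_common())
```

Each function takes an input and returns an output, so a step can be reordered, swapped or dropped without touching the others.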

Python Package

In starting to learn by doing, it became clear that a Python package was the right way to package up and provide the reusable framework we're developing. A fuller guide to doing this is here - but we'll start minimally.

It's available on github under the text_mining_toolkit directory:

Recipes Corpus to WordCloud

The following diagram shows the above pipeline steps, which take the recipes text corpus and emerge with a frequency word cloud.

Word Cloud Text Processing Pipeline

You can see the Python text_mining_toolkit package modules and functions being used. Feel free to explore the code at the above GitHub link.

The Python notebook is also on GitHub, and shows you the commands implementing this pipeline, making use of the package we're making, and the word cloud graphic itself - all very simple, as intended!

Organising the Package

The process of experimenting, and doing, helps us learn and raise questions which we might have missed otherwise.

As I implemented this simple pipeline, it became clear I needed to think about the structure of the text_mining_toolkit package. Here is a summary of these thoughts:
  • Make as much of the package as possible out of pure functions - ie functions that take an input and produce an output, with no reference to or dependency on any existing state elsewhere
  • The exception to this is the CorpusReader which is an object containing the corpus, and able to present it on request as individual documents, an aggregation of all documents, or just the names of the documents. This exception should be fine as it is the start of any data pipeline.
  • There are processing steps which make sense applied to the entire text, and others which make sense applied to a sequence/set of words. Therefore two modules are used: text_processing_steps and word_processing_steps to keep things clearer. It may be that some operations are implemented in both modules because they can be applied to both text and words (such as lowercase).
  • Visualisation steps are put into a separate visualisation module.
  • Function names should make absolutely clear what's going on. I dislike working with other frameworks where the function or object names don't make clear what's going to happen, or what is available. I have made a point of using long descriptive function names, and also verbs at the start to really help readers or coders understand the packages. For example, it is really obvious what the function split_text_into_words() does.
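As a rough illustration of the one stateful piece, here is a minimal sketch of what a CorpusReader could look like. This is an assumption for illustration and not the actual text_mining_toolkit code: the constructor arguments and method names are hypothetical, and the two tiny documents are invented and written to a temporary directory so the example runs on its own.

```python
import tempfile
from pathlib import Path


class CorpusReader:
    """Hypothetical reader that presents a directory of text files as a corpus."""

    def __init__(self, content_directory, file_pattern="*.txt"):
        self.content_directory = Path(content_directory)
        self.file_pattern = file_pattern

    def get_document_names(self):
        # just the names of the documents, in a stable sorted order
        return sorted(p.name for p in self.content_directory.glob(self.file_pattern))

    def get_text_by_document(self, document_name):
        # the text of one individual document
        return (self.content_directory / document_name).read_text(encoding="utf-8")

    def get_all_text(self):
        # an aggregation of all documents, joined by new lines
        return "\n".join(
            self.get_text_by_document(name) for name in self.get_document_names()
        )


# illustrative corpus only: two made-up documents in a temporary directory
with tempfile.TemporaryDirectory() as temporary_directory:
    Path(temporary_directory, "01.txt").write_text("Boil the pasta.", encoding="utf-8")
    Path(temporary_directory, "02.txt").write_text("Chop the basil.", encoding="utf-8")

    reader = CorpusReader(temporary_directory)
    document_names = reader.get_document_names()
    first_document = reader.get_text_by_document("01.txt")
    all_text = reader.get_all_text()

print(document_names)
print(all_text)
```

The reader is the only object holding state. Everything it hands out is plain text, which can then flow through the pure functions in text_processing_steps and word_processing_steps.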

Italian Recipes Word Cloud (from the pipeline)

Here's the output image from this pipeline, just for fun again.

Now that the pipeline has proven itself, it gives us a really clear and simple way to experiment and tweak things, without getting lost in code spaghetti, if you'll excuse the Italian food pun ;)
