Tuesday, 26 July 2016

Word Counts, Word Clouds .. and Stopwords

One of the simplest ways of understanding a bunch of text is to see which words occur most often.

Yes, there are many, many more sophisticated techniques, but we're at the beginning of our journey and we want to start super super simple.

Counting Words

What do we mean by counting words? It's simply going through the text, a word at a time, and keeping a tally of how often we see each word.

Let's look at a short piece of text:

She sells sea shells on the sea shore.
She also sells ice-cream by the sea shore.

The tally of each word looks like this:

    Word        Frequency  
    she            2       
    sells          2       
    sea            3       
    shells         1       
    on             1       
    the            2       
    shore          2       
    also           1       
    ice-cream      1       
    by             1       

You can see how the word "shells" only appears once. Similarly the word "on" only appears once. More interestingly, the word "sea" appears most often - 3 times. Perhaps it represents best what the text is about? Maybe it is an important theme of the text?

What about words that appear a medium number of times? The word "she" appears twice, and we can see she is an important part of the story. So do the words "sells" and "shore", both of which are also important parts of the story.

So word frequency seems to be a fairly good way of working out the most important themes in some text.
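
The tally above can be reproduced with a few lines of Python. This is only a sketch: the function name count_words and the rule for splitting text into words (lowercase everything, keep hyphenated words like "ice-cream" together) are my assumptions for illustration, not necessarily what the toolkit code does.

```python
import re
from collections import Counter

text = """She sells sea shells on the sea shore.
She also sells ice-cream by the sea shore."""

def count_words(text):
    # Lowercase the text, then pick out runs of letters or digits,
    # keeping hyphenated words such as "ice-cream" as one word.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return Counter(tokens)

counts = count_words(text)

# Print the tally, most frequent word first.
for word, freq in counts.most_common():
    print(f"{word:<12}{freq}")
```

The Counter does the tallying for us, and most_common() sorts the words by how often they were seen.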

Let's try it on a couple of bigger buckets of text. The most frequent words for the food recipes are shown here - only the most frequent, not all of them:

    Word        Frequency
    the            273
    and            203
    a              133
    with           79
    for            77
    in             72
    1              58
    to             56
    of             49
    2              44
    then           43
    until          41
    oil            41

What happened?! Almost all the words are totally boring and uninformative. The words "the", "and", "a" .. "then" and "until" are not very enlightening. It's the thirteenth most frequent word that is in any way useful, the word "oil".

So oil is a major feature of the recipes. But there were 12 more frequent but useless words above it. If we look at the words more closely, we can see why. Those words are just the words we use in English to connect other words together to make proper sentences.

Maybe we should ignore them by filtering them out?

In fact that is exactly what many people do .. and such useless words are called stop words. You can read much more about stop words here, but for now we'll keep it simple with a quite minimal but effective list of stop words:

If we filter out the stop words, the most frequent words now look like the following:

    Word        Frequency
    1              58
    2              44
    then           43
    oil            41
    tsp            38
    chopped        37

The word "oil" is now number four on the list. That's better. And we now have useful words like "tsp" (teaspoon) and "chopped" in the top six words.

So stop words have improved things .. and we did it using only a very simple idea.

We still have some words that aren't that useful (we think) .. and we can refine the stop words list again later if we want to, to remove the word "then" for example.
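
Filtering stop words out of a word count can be sketched like this. The stop word list here is a tiny made-up one just for illustration (it is not the list used for the results above), and the function name and sample words are hypothetical too.

```python
from collections import Counter

# Illustrative stop word list only, NOT the list used in the post.
STOP_WORDS = {"the", "and", "a", "with", "for", "in", "to", "of", "until"}

def count_without_stop_words(words, stop_words=STOP_WORDS):
    # Tally only the words that are not in the stop word list.
    return Counter(w for w in words if w not in stop_words)

# Made-up sample of recipe-like words.
sample = ["the", "oil", "and", "the", "chopped", "oil", "a", "tsp", "then", "oil"]

filtered = count_without_stop_words(sample)
top = filtered.most_common(3)
print(top)
```

Refining the list later, for example to drop "then", is just a matter of adding the word to the set.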

Word Clouds

We could look at top-10 lists for word frequency ... and that would be fine. Sometimes a more visual approach helps readers understand the most important themes quickly, and with a lot less mental effort.

One visual way is to take these words, and plot them on a diagram, sized bigger if they occur more frequently. Many will know these as word clouds or tag clouds:

Python has a nice module called wordcloud which does the job. Here's the kind of code you need ..  simple enough.

# imports needed by the snippet
import matplotlib.pyplot as plt
import wordcloud

# word cloud
wc = wordcloud.WordCloud(max_words=100, width=1600, height=800, background_color="white", margin=10, prefer_horizontal=1.0)

# plot wordcloud
plt.figure(dpi=300, figsize=(16,8))

Here's the word cloud for the recipes:

You can see that "oil", "chopped" and "olive" are the prominent themes. The words "1" and "2" are the most prominent and could be filtered out. Looking at the recipes, they come from the common use of 1 or 2 as quantities.

And here's the word cloud for the Hillary Clinton emails:

We're starting to get the sense that the words "State" and "Department" are key, as is the word "US". You might say .. well, we'd expect the Clinton emails to be about these words .. what's new? Well we can see "F-2014-20439" is prominent, and it is probably a document or case referred to often .. worth checking out ;) There are prominent dates too, like "06/30/2015" and "2009", which again are probably related to key events. Who are "Cheryl", "Abedin" and "Jacob" .. they seem to be referred to often enough?

And finally the word cloud for the Iraq Inquiry report:

This one is not so informative. Almost all the words should be stop words. We'll develop other methods later on our journey to help us get insight into the Iraq Report.

Word Frequency - Simple but Effective

What we've done here is very simple .. but actually quite powerful. The idea of using word frequency to imply importance can be applied to huge volumes of text ... without us having to read them manually .. and we can produce a nice visual representation of the most important themes.

Yes, there are imperfections ... and we can use stop words to make the results much better. Again this is a very simple idea. We've not needed to do anything very advanced at all .. all these ideas could be understood by a school student.

Minimum Word Length 5

Before we finish, let's try a rather brutal but effective method to improve the word clouds ... ignore any word that is fewer than 5 letters long. The idea is that most important words are longer than 4 letters, and most stop words will be short.

Here are the much more interesting results.... enjoy!
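
The minimum word length filter itself is tiny. Here's a sketch, with a hypothetical function name and a made-up sample list:

```python
def drop_short_words(words, min_length=5):
    # Keep only words with at least min_length characters.
    return [w for w in words if len(w) >= min_length]

# Made-up sample of recipe-like words.
sample = ["the", "oil", "then", "olive", "chopped", "tsp", "1"]

kept = drop_short_words(sample)
print(kept)
```

Notice that this is why the method is brutal: short but useful words like "oil" and "tsp" get thrown away along with "the" and "then".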


Lowercase All Text

After I published the post I started to think about cleaning up the words a bit more. It seemed to me that the following words:

Wordcount  WordCount  wordCount  WOrdcount

would be considered as separate words by our code, because .. well .. they are different. We humans might consider them to be the same, and we'd want any word count to treat them as the same word too.

One easy way to do that is to force all letters to be lowercase .. which will have some downsides (human names, code-names or case file identifiers might be case sensitive) .. but overall it will help improve our aim of distilling out the most important themes in some text.

Here's the code to do it in Python .. again, super easy:

# lowercase words
words[:] = [w.lower() for w in words]

And here are the resultant word clouds ... you can see some changes have happened. For example "olive" is much more prominent now, perhaps because before it was considered as smaller sets of different words. For the Iraq Inquiry, the words "military" and "security" are very much more prominent too, which you would expect.
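
To see the effect of lowercasing on the counts, here's a small sketch using the four spellings from above. It uses Python's Counter for the tally, which is my choice for illustration.

```python
from collections import Counter

words = ["Wordcount", "WordCount", "wordCount", "WOrdcount"]

# Before: each spelling is tallied as a different word.
before = Counter(words)

# After: all four spellings collapse into a single word.
after = Counter(w.lower() for w in words)

print(len(before), "different words before lowercasing")
print(len(after), "different word after lowercasing")
```

Four words with a count of 1 each become one word with a count of 4, which is how a word like "olive" can suddenly look much more prominent.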

Monday, 11 July 2016

Clinton Emails and Chilcot Iraq Inquiry Report in Plain Text

It's always much more interesting to explore data sets that are interesting themselves. 

So I've converted two "hot topic" data sets into plain text:

  • The Chilcot Iraq Inquiry Report into whether it was right to go to war, and whether the war and its aftermath could have been better planned for.
  • Hillary Clinton's use of a personal email server for official business led to controversy. A redacted set of emails was released, and a version is at Kaggle.

The Iraq Inquiry report is in PDF form which is not ideal for text analytics. I've extracted the text using the open source "pdftotext" utility, with an attempt to preserve the text flow layout.

The Clinton emails are provided as an sqlite database or as a CSV file. I've extracted the "RawText" because the provided ExtractedBodyText hasn't worked in some cases. The plain text files are named with the DocumentNumber.
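
Pulling the raw text out of the CSV file into one plain text file per email could look something like the sketch below. The function name is hypothetical, and the column headers "DocumentNumber" and "RawText" are assumptions taken from the description above, so check them against the header row of the actual file before using this.

```python
import csv
from pathlib import Path

# Some emails are long, so raise the csv module's per-field size limit.
csv.field_size_limit(10_000_000)

def export_raw_text(csv_path, out_dir,
                    id_column="DocumentNumber", text_column="RawText"):
    # Column names are assumptions; adjust them to match the real CSV header.
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            doc_id = row[id_column].strip()
            if not doc_id:
                continue  # skip rows with no document number
            # One plain text file per email, named after its document number.
            (out_dir / f"{doc_id}.txt").write_text(row[text_column], encoding="utf-8")
            written += 1
    return written
```

Calling export_raw_text("Emails.csv", "clinton_txt") would then write the files into a clinton_txt folder and return how many it wrote (both names here are placeholders).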

Here are the links on github:

I may update the Iraq Inquiry Report to also include the additional evidence documents.

Have fun!