Yes, there are many more sophisticated techniques, but we're at the beginning of our journey and we want to start super simple.
Counting Words

What do we mean by counting words? It's simply going through the text, a word at a time, and keeping a tally of how often we see each word.
Let's look at a short piece of text:
She sells sea shells on the sea shore.
She also sells ice-cream by the sea shore.
The tally of each word looks like this:
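As a concrete sketch, here's how that tally could be done in Python .. assuming we simply lowercase the text and pull out runs of letters (keeping hyphenated words like "ice-cream" together); the real code you use may differ:

```python
import collections
import re

text = ("She sells sea shells on the sea shore. "
        "She also sells ice-cream by the sea shore.")

# split the text into lowercase words, keeping hyphenated words whole
words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

# tally how often each word appears
tally = collections.Counter(words)
print(tally["sea"])  # prints 3 .. "sea" is the most frequent word
```

Python's collections.Counter does all the tallying work for us, and its most_common() method gives us the sorted top-10 style lists directly.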
You can see how the word "shells" only appears once. Similarly the word "on" only appears once. More interestingly, the word "sea" appears most often - 3 times. Perhaps it represents best what the text is about? Maybe it is an important theme of the text?
What about words that appear a medium number of times? The word "she" appears twice, and she is clearly an important part of the story. The same goes for the words "sells" and "shore", which are also important parts of the story.
So word frequency seems to be a fairly good way of working out the most important themes in some text.
Let's try it on a few bigger buckets of text:
- 22 food recipes: https://github.com/makeyourowntextminingtoolkit/makeyourowntextminingtoolkit/tree/master/data_sets/recipes
- The Iraq Inquiry Report: https://github.com/makeyourowntextminingtoolkit/makeyourowntextminingtoolkit/tree/master/data_sets/iraq_inquiry/
- The Clinton Emails: https://github.com/makeyourowntextminingtoolkit/makeyourowntextminingtoolkit/tree/master/data_sets/clinton_emails
What happened?! Almost all the most frequent words are totally boring and uninformative. Words like "the", "and", "a" .. "then" and "until" are not very enlightening. For the recipes, you have to go down to the thirteenth most frequent word before you find one that is in any way useful .. the word "oil".
So oil is a major feature of the recipes. But there were 12 more frequent but useless words above it. If we look at the words more closely, we can see why. Those words are just the words we use in English to connect other words together to make proper sentences.
Maybe we should ignore them by filtering them out?
In fact, that is exactly what many people do .. and such useless words are called stop words. You can read much more about stop words here, but for now we'll keep it simple with a minimal but effective list of stop words:
- Minimal stop words: https://github.com/makeyourowntextminingtoolkit/makeyourowntextminingtoolkit/blob/master/stopwords/minimal-stop.txt
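Filtering is then just a matter of dropping any word that appears in the stop word list. As a sketch .. the stop words here are a hypothetical handful, standing in for the full list linked above:

```python
# a hypothetical handful of stop words, standing in for the full list
stop_words = {"the", "and", "a", "in", "then", "until", "on", "by"}

words = ["heat", "the", "oil", "in", "a", "pan", "then", "add", "the", "onion"]

# keep only the words that are not stop words
filtered = [w for w in words if w not in stop_words]
print(filtered)  # ['heat', 'oil', 'pan', 'add', 'onion']
```

Using a Python set for the stop words makes each membership test fast, which matters once we're filtering big buckets of text.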
If we filter out the stop words, the most frequent words now look like the following:
The word "oil" is now number four on the list. That's better. And we now have useful words like "tsp" (teaspoon) and "chopped" in the top six words.
So stop words have improved things .. and we did it using only a very simple idea.
We still have some words that aren't that useful (we think) .. and we can refine the stop words list again later if we want to, to remove the word "then" for example.
Word Clouds

We could look at top-10 lists for word frequency ... and that would be fine. Sometimes a more visual approach helps readers understand the most important themes quickly, and with a lot less mental effort.
One visual way is to take these words, and plot them on a diagram, sized bigger if they occur more frequently. Many will know these as word clouds or tag clouds:
Python has a nice module called wordcloud which does the job. Here's the kind of code you need .. simple enough. The word_counts here is assumed to be the dictionary of word frequencies we tallied earlier.

import wordcloud
import matplotlib.pyplot as plt

# word cloud sized by word frequency
wc = wordcloud.WordCloud(max_words=100, width=1600, height=800, background_color="white", margin=10, prefer_horizontal=1.0)
wc.generate_from_frequencies(word_counts)

# plot wordcloud
plt.imshow(wc); plt.axis("off"); plt.show()
Here's the word cloud for the recipes:
You can see that "oil", "chopped" and "olive" are the prominent themes. The words "1" and "2" are the most prominent of all, and could be filtered out. Looking at the recipes, they come from the common use of 1 or 2 as quantities.
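One simple way to filter out those number words is to drop any token made up entirely of digits. A sketch, using a made-up handful of recipe words:

```python
words = ["1", "tbsp", "olive", "oil", "2", "chopped", "onions"]

# drop tokens made up entirely of digits
words = [w for w in words if not w.isdigit()]
print(words)  # ['tbsp', 'olive', 'oil', 'chopped', 'onions']
```

Note that str.isdigit() only drops pure numbers, so mixed tokens like "F-2014-20439" would survive .. which is what we want.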
And here's the word cloud for the Hillary Clinton emails:
We're starting to get the sense that the words "State" and "Department" are key, as is the word "US". You might say .. well, we'd expect the Clinton emails to be about these words .. what's new? Well, we can see "F-2014-20439" is prominent, and is probably a document or case referred to often .. worth checking out ;) There are prominent dates too, like "06/30/2015" and "2009", which again are probably related to key events. And who are "Cheryl", "Abedin" and "Jacob"? They seem to be referred to often enough to be worth a closer look.
And finally the word cloud for the Iraq Inquiry report:
This one is not so informative. Almost all the words should be stop words. We'll develop other methods later on our journey to help us get insight into the Iraq Report.
Word Frequency - Simple but Effective

What we've done here is very simple .. but actually quite powerful. The idea of using word frequency to imply importance can be applied to huge volumes of text ... without us having to read them manually .. and we can produce a nice visual representation of the most important themes.
Yes, there are imperfections ... and we can use stop words to make the results much better. Again this is a very simple idea. We've not needed to do anything very advanced at all .. all these ideas could be understood by a school student.
Minimum Word Length 5

Before we finish, let's try a rather brutal but effective method to improve the word clouds ... ignore any word that is less than 5 letters long. The idea is that most important words are longer than 4 letters, and most stop words will be short.
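In code, this minimum-length filter is a one-liner .. the words here are just an illustrative sample:

```python
words = ["she", "sells", "sea", "shells", "security", "the", "oil"]

# keep only words of at least 5 letters
long_words = [w for w in words if len(w) >= 5]
print(long_words)  # ['sells', 'shells', 'security']
```

It's crude .. notice it throws away "oil", which we know is a genuinely useful word .. but as a quick way of stripping short connecting words it does the job.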
Here are the much more interesting results.... enjoy!
Lowercase All Text

After I published the post I started to think about cleaning up the words a bit more. It seemed to me that the following words:
Wordcount WordCount wordCount WOrdcount
would be counted as separate words by our code, because .. well .. they are different strings. We humans would consider them to be the same word, and we'd want our word counts to treat them as the same word too.
One easy way to do that is to force all letters to be lowercase .. which will have some downsides (human names, code-names or case file identifiers might be case sensitive) .. but overall it will help improve our aim of distilling out the most important themes in some text.
Here's the code to do it in Python .. again, super easy:
# lowercase words
words[:] = [w.lower() for w in words]
And here are the resulting word clouds ... you can see some changes have happened. For example, "olive" is much more prominent now, perhaps because before it was split across several differently-cased variants. For the Iraq Inquiry, the words "military" and "security" are very much more prominent too, which you would expect.