Monday, 22 August 2016

N-Gram Word Clouds

We've just developed code to count n-grams, and plot their frequency as word clouds:

So explore some text data sets to see what happens.

Italian Recipes

We know this small corpus fairly well so these experiments are to see how well word clouds of n-grams work .. or don't ... we have to be open to when these algorithms don't work, otherwise we'll fall into the trap of blindly believing their results.

Here's the word cloud of 2-grams using a min word length of 4. Compared to previous 1-gram word clouds, this is really rather informative. We can see phrases which actually have meaning, or are things .. like bread crumbs, tomato sauce, grated cheese. You can also see that salt and pepper is prominent. The previous 1-grams wouldn't have captured these phrases, or in the case of salt and pepper, the fact that 2 things are closely related. Clearly these things are prominent in Italian cooking!

What about 3-grams? You can see below, that some additional insight is to be had, but not as much as the leap from 1-grams to 2-grams.

With 4-grams, there isn't much that is interesting or informative. It's as if the most interesting language snippets are 1 or 2 words long, sometimes 3.

Chilcott's Iraq War Report

The following shows the 2-gram word cloud for the Chilcott Iraq War Report.

Although the phrases are very relevant to the text corpus, they're not that informative because we know what the report was about and the main elements like the dates and people.

This is actually a useful prompt for an idea we'll explore later - that the most common phrases or words aren't the informative .. :)

Mystery Corpus!

In real life, we may be trying to work out what a set of text is about, without having seen it before. We'll actually be using some "mystery" text corpora as we journey through text mining .. but here's the first sample:
You could take a peek to see what it is .. but try not to. Instead plot the word cloud to see what the main themes or elements are.

Can you tell what the text was about? Yes - it's the story of Little Red Ridinghood. There are quite a few big clues in the cloud to work this out - world, grannie, child, basket ...

We're text analytics detectives now!

No comments:

Post a Comment