Tuesday, 22 November 2016

Exploring The Clinton Emails Through Document Similarity

In the last post we succeeded in finally improving performance to make calculating document similarity for the fairly large Clinton Email data set.

Let's explore it...

Top 800

There are 7944 documents (emails) in the Clinton data set. This turns into 7944 * 7942 / 2 = 31,668,756 doc1-doc2 combinations. Plotting all of these in one graph visualisation might not be useful to see structure.

So let's see what happens when we plot the top 800. By top we mean the 800 pairs of documents with the highest similarity scores.

Here's what happens:

You can see that even with only 800 out of 31 million plotted, the view is quite busy! Luckily the force-directed graph settles and the documents move apart, except where strong links keep things together. This does clarify some clusters of similar documents, but quite a lot moves out of view .. something we need to fix later.

Here's what it looks like after a settling a little:

What are some of these clusters? Let's look at the biggest one in that view, coloured pink above, ... and review the contents of the emails. Here are some of the document names, and what they're about:

  • C05767378.txt
  • C05767361.txt
  • C05767355.txt
  • C05773263.txt
  • C05773249.txt
  • C05773269.txt
  • C05767369.txt

Well - looking at the content of those emails .. it is encouraging that the subject of all of these email is the same ... "Subject: Re: Bravo! Brava! Issue your statement! Sid" .. which makes sense as they are all supposed to be related. This is good news, and suggests our document similarity calculations are working!

Once we've read the documents we realise there is a silly reason that the documents are so similar - they all quote the same guardian article, copying and pasting the text from it. That's ok .. we're interested in documents that are meaningfully related and having read them, we realise they are in fact part of the same conversation. 

Top 10

Let's limit the number of pairs that are drawn to the very top-most 10. This should show us those documents with the most similarity across the entire corpus.

There's only one cluster with more than 2 documents .. so let's take a look at it:

  • C05774734.txt
  • C05770127.txt
  • C05770137.txt

Again it's good to see that these 3 emails are all about the same subject .. with an email subject of "Subject: Re: The political ramifications of the Greek/European debt crisis". Again, there is a silly reason why these are so highly scored for similarity ... they copy and paste the same press release text. Despite this, the emails are actually related. 

Top 50

Increasing the number to 50 .. we still only really get document pairs .. no real clusters yet.

Let's try more pairs.

Top 2000

Let's plot the top 2000 pairs and let it settle. Here's what we get ...

Let's look at that cluster highlighted in blue. 

It's interesting because rather than having lots of connections between most/all the documents ... it is in fact a chain with one document similar to another, which in turn is similar to another .. and so on. 

The chain is:

  • C05767510.txt
  • C05773329.txt
  • C05773316.txt
  • C05767491.txt
  • C05767495.txt
  • C05767499.txt
  • and a branch to C05773315.txt

Reading these emails, we can again say they are linked because they all have the same email subject field "Subject Re: Mubarak Call Sheet" .. they're part of the same conversation. That is a pretty good achievement given we simply had a huge bucket of random emails to work with!

The chain also shows the email trail as it grows with replies to emails including the text of previous emails. Again it is pretty cool that we managed to visualise this on the graph! That branch on the graph to C05773315.txt is also reflected in the text ... it is a forked email conversation.

Not bad ... all this from a simple idea of similar documents!

Topics & Themes

This is all great .. and we can spend loads of time exploring these graphs for interested clusters, and chains .. and even key documents which link different clusters ...

But one thing we still had to do manually is read the documents to find out what the cluster theme was .. that is something we have started to explore when we started to look at reducing dimensions to distill out the key topics in a corpus.

So let's crack on with developing that further ...

Wednesday, 16 November 2016

Does Numba Help Improve Performance? .. Yes!

We saw last time that calculating document similarity between all the document-document combinations in a corpus takes a huge amount of time.

The core calculation is the cosine distance between two document vectors (in word relevance space).

After we've tried to avoid the usual pandas cell-level problems .. we're not left with many options. One of these few options is compiling the code.

Compiling Code?

Compiling code? What's that? When computers run your programs, they ultimately have to execute instructions on the metal-and-wire CPU. Those instructions are very low-level .. they used to be called assembly instruction but no-one seems to call them that anymore.

Back in the old days with slower smaller computers, people used to write some of their code in assembly language .. so that they squeezed every last bit of power out of their computers. The problems is .. assembly language programming is hard and slow and arduous. And prone to error.

These days, computers are fast enough for us to write in nicer, more readable and understandable languages, .. like Python! These languages are nicer for us .. but actually need to be translated into machine instructions so that a CPU can execute them.

Python is interpreted .. that means when you run a program, Python takes the nice easy to understand code and translates it into lower level code (called byte code) .. which itself eventually gets turned into machine instructions run on the metal and wires CPU.

This is fine in the vast majority of cases .. but you can see that it might not be efficient if we need tight code to manipulate an array of numbers ... instead of machine instructions to do just that .. what really happens is the overhead of these translations steps ...

.. and if we're in a loop .. as we often are ... then this translation gets done again and again, repeating the same translation work for no good reason ... because it gets translated to the same thing again and again.

Some languages like C are compiled deliberately before you run the translated code. It's not as fun, or easy, or interactive ... but it does generate efficient code. And because the translation to machine instructions - called compilation - is done once and looks at the whole code .. it can make judgements about optimising the code ... so leaving out code that doesn't get execute .. or moving things outside a loop if they are only calculated once ...

Numba is JIT :)

There are several ways of compiling Python .. but an easier to use way is to do what is called Just In Time (JIT) compilation.

That means, Python will recognise functions that you have marked as needing compilation .. and it will compile them the first time they are executed ... and that efficient compiled set of instructions is kept for the many more times that function is called in a program.

Numba is the name of the Python extension that does this ..

Applying Numba

Numba is really easy to use. Often all you have to do is signal a function to be JIT compiled by adding a decorator @jit just above the function declaration .. like ...

def myfunction(x,y):
    do something
    return z

For our own code we have the following function pulled out:

# create document similarity matrix, the relevance matrix needs to already exist
# sample_fraction is useful to select a subset of documents as the combinations get large quickly
# this version is jitted with numba
@numba.njit((numba.float64[:], numba.float64[:]))
def jit_cosine_distance(x,y):
    #z = 1.0 - numpy.dot(x,y) / (numpy.linalg.norm(x) * numpy.linalg.norm(y))
    z = numpy.dot(x, y)
    xn = numpy.sqrt(numpy.sum(numpy.square(x)))
    yn = numpy.sqrt(numpy.sum(numpy.square(y)))
    z = 1.0 - (z / (xn * yn))
    return  z

That's the core calculation of the cosine similarity using a dot product and then normalising. The @njit is enforcing strict rules because a normal @jit tries to be lenient with code that it can't make efficient. This way we either fail compilation .. or generate efficient code.

We call this function as we loop through all the document pair combinations:

# combinations (not permutations) of length 2, also avoid same-same combinations
    docs_combinations = itertools.combinations(docs_sample, 2)
    for doc1, doc2 in docs_combinations:
        # scipy cosine similarity function includes normalising the vectors but is a distance .. so we need to take it from 1.0
        doc_similarity_dict[doc2].update({doc1: jit_cosine_distance(relevance_index[doc1].values,relevance_index[doc2].values)})

So .. how does it perform?


The following shows how long it takes to make the document similarity matrix .. with the following variables:

  • varying document corpus size (fraction)
  • previous function using scipy.spatial.distance.cosine()
  • new function with calculation separated out and JIT turned on
  • new function with calculation separated out and JIT turned off

Oh dear.

It looks like ...

  • The previous simple function is the fastest.
  • Separating out the calculation into new code is slower.
  • Applying JIT makes it slower still!

You might look at that code and say that the new code is more lines of instructions ... dot product, length for each of the two vectors, division, subtraction from 1.0 ... but the reason for that is numba can't JIT code like numpy.linalg.norm or scipy.spatial.distance.cosine.

In fact. .. it turns out the actual code that is executed when you call scipy.spatial.distance.cosine is already very optimised .. so we've only added more layers ...

Try Again

I asked about the performance of this similarity calculation on stackoverflow .. and had great suggestions.

One suggestion was that I shouldn't just extract the cosine distance calculation into a JIT-able function but instead I should extract out the loop that works over the relevance matrix. The following shows the bigger chunk that has been extracted out to a JIT-able function:

That does need a bit of rework to the code .. we can't use the itertools anymore but have to find a new way to find the right combinations of documents for the similarity calculation. It's not that hard .. its every document in the list combined with every document beyond it. We also can't pass a pandas data frame to the JIT function as numba can't yet work with pandas data frames .. instead we pass the underlying numpy array .. which is easily accessed like df.values. We also have re-add the columns and index when the JIT functin returns a plain numpy array.

Here's the code for the JIT function:

def similarity_jit(m):
    l = m.shape[1]
    s = numpy.zeros((l-1,l-1))
    for i in range(l):
        for j in range(i+1,l):
            s[i,j-1]= numpy.dot(m[:,i], m[:,j])
    return s

Not that complicated at all ... and the funny business around s[I,j-1] just places the results into the right place for the returned similarity matrix.

Does this work better?

Results Take 2

The following shows how long a normal python calculation for similarity takes, compared with the new JIT function.

That vertical scale is plotted logarithmically (1, 10, 100, 1000, 10000 etc). You can see now that the numba JIT code is much faster than the plain python code. It looks like the suggestion to include the loop in the JIT extract was right!

The following shows the speedup factor:

That looks like a rough speedup of 2.5 time ...  not bad at all.

But that 5 days to process the entire Clinton email corpus would now take 2.5 days .. still impractical!

One More Idea

John Zwinck replied to my StackOverflow question with a brilliant idea.

He noticed that the core calculation is essentially the dot product of two vectors .. vectors which are columns in the relevance matrix. Each column represents a document .. with values being the relevance scores (or word count if we're being simple).

He suggested that the numpy array passed to the JIT function should be arranged so that picking out columns was more efficient .. because the computer memory holding that column was contiguous (and not broken over different locations).

A numpy array can be transformed into this column-oriented format as follows. The format is the way Fortran commonly stored arrays, as opposed to format='C' for example.

new_array = numpy.array(old_array, format='F')

Let's see the effect of this:


The effect of re-ordering the array is dramatic! Look at how badly the plain python gets... 20 minutes to calculate the similarities of just 0.05 of the Clinton emails! And we saw the performance gets exponentially worse ... with 5 days projected to complete the corpus.

The following graph shows the dramatic difference this makes.

Look at the speed up factor .. it keeps climbing from 1.3 to 18.5 .. to show how much better this approach is.. and it'll keep climbing with larger data sets.

This now makes analysing larger datasets practical .. when they weren't before.

On a Google cloud virtual machine (which I could run overnight), calculating the document similarities for the entire Clinton email dataset took only 4 hours, down from a projected 5 days! 

In the next post we'll work through the Clinton email data set fully compared for document similarity.

Tuesday, 15 November 2016

More Performance (Memory) Stuff to Fix

We've done loads of work already identifying and fixing performance issues, which had a previous finale here.

Most of those issues were related to CPU inefficiency ... doing lots of stuff in a way that wasn't the best in terms of actual steps to be done, or instructions to be executed on the processor (CPU).

In trying to index the bigger Clinton emails data set ... we hit a problem .. on MacOS the Python kernel crashes .. and a quick check on Linux seemed to show huge memory use leading to memory swapping ... yuk!

So let's look into this a bit more.

Watching Memory Use

The Clinton email data set has 7945 documents, totalling about 40Mb.

That's not big data but it's not trivially small either .. and can highlight code that isn't efficient.

The indexing of each document, through tmt.index_wordcount.create_wordcount_index_for_document(), is now very fast .. 200x faster than our early code.

We can use the MacOS visual tool called Activity Monitor for monitoring CPU and memory consumption .. here's how it looks while the Clinton emails were being indexed. You can see that the python process is only consuming about 128M .. less than the web browser on this laptop!

A more insightful tool is the command line vm_stat (vmstat on Linux). You can see that there is no swapping happening (the rightmost two columns). What's swapping? When normal memory runs out many computers swap out content from memory to disk/storage and swap it back in when needed. Swapping is slow, and using storage is way slower than normal super fast memory .. we we should avoid it as much as possible.

The merging of those per-document indices into a bigger corpus index is done by growing a data frame in memory using tmt.index_wordcount.merge_wordcount_indices_for_corpus(). We can see the memory consumption of Python grow beyond 300M .. towards 400M ... 600M ... 900M ... and it'll keep growing...

The following snapshot shows python's memory climbing past 1.21G ... and it'll keep climbing!

Here's the vmstat from Linux showing memory being swapped out one merging gets going .. baad!

Here's the Mac OS vm_stat output showing the memory overfilling into swapping ...

And here's the crash! Oh dear.

What Can We Do?

What can do .. that's a good question.

One thing we can't do is to reduce the size of data. Well, we could by stripping out content like stop-words .. but right now we want to work with the full text.

That text needs to contribute to a data frame which is the word-document matrix. So the size of that matrix won't go down.

So we have to think not about the full size of data but the efficiency with which is it created. Let's remind ourselves how merging indices works...
  • start with an empty corpus index pandas dataframe
  • for each document index ...
  • merge() it into the corpus data frame

This hits one of the weak spots of pandas.. the repeated extension of data frames .. rather than filling in a pre-allocated dataframe. Under the hood, merge() creates a new data frame in addition to the two it is merging ... you can see how memory consumption explodes!

Instead off merging heavy pandas dataframes as we work through each document index .. maybe we should defer the pandas dataframe .. and for each document index we instead merge a dictionaries instead? Let's say that again more clearly:

  • start with an empty dictionary
  • for each document index ...
  • ... turn the data frame into a dictionary, and update the empty dictionary
  • finally turn the dictionary into the corpus dataframe

Let's try it ...

Performance Improved Again

Here's the previous performance improvements we achieved:

Pandas Numpy Better Pandas
Index 474 2.22 2.29         
Merge 7.32   23.9 4.34         
Total 481.32 26.12 6.63         

The merging of the Iraq corpus took 7.32 seconds. Our new code takes ... 1.74 second! That's a speed up of over 4x !!

Pandas Numpy Better Pandas
Index 474 2.22 2.29         
Merge 7.32   23.9 1.74         
Total 481.32 26.12 4.03         

Overall .. from our initial code taking 481.32 seconds to our new code taking 4.03 .. we have an overall speedup of 120x !!

This validates out idea about deferring the use of pandas dataframes, avoiding their repeated updating.

Tough Test - Clinton Emails

Let's see how it works with the bigger Clinton emails corpus which crashed as we ran out of memory.

Crunching through the emails causes a warning ...

It took me a while to work out what was causing this ... the division by zero shouldn't happen .. because each vector has a non-zero length (because all the documents contain at least 1 word!) ...

It turned out the problem was an empty document C05762481.txt which was removed from the corpus.

That fixed .. we start to get results..

Actually we don't .. calculations seem to take forever! Remember that to do document similarity calculations we have to compare documents with every other document .. the combinations quickly explode! The following shows how ling it takes to calculate a number of combinations from 10 to 100,000. Remember it only takes 1000 documents to create half a million combinations!

The relationship seems linear .. 1200 seconds to crunch through 100,000 combinations..  at that rate, it would take 5 days to crunch through all the Clinton emails (31,549,596 combinations)

Ok ... even of we tweak the code, the fact that the algorithmic complexity is combinatorial will be difficult to mitigate. That is, shaving off a bit of time through quicker code will be negated by growing the dataset size by only a small amount.

So another approach is needed. We could just random sample the documents to create a smaller subset to work with .. it's not perfect as we might miss some key relationships .. but it's better than no insights. The new create_doc_similarity_matrix(content_directory, sample_fraction) function now takes a sample_fraction which determines how much of the documents to sample .. the fraction is between 0 and 1.

Let's try a small fraction of 0.01 .... hat took 50 seconds.
Let's try a fraction of 0.02 .. that took 169 seconds .... and the results look like:

We have a result! We can see clusters of documents... and if we run the process again so a different subset of documents are analysed for similarity we get ..

We could look at the documents in each cluster to determine what they're about .. but that's the subject of our other work .. automatically working out what the key topics are ...

Tuesday, 8 November 2016

So Many Dimensions! .. And How To Reduce Them - Part 1/2

This post is a gentle easing into a topic many people talk about but rarely explain .. and even if you try to find explanations .. they're not that good .. so I'll have a go.

The topic is called Latent Semantic Indexing .. or Latent Semantic Analysis .. or Dimensionality Reduction ..  all horrible phrases! That last one looks like a weapon from Dr Who!

Let's start with a super simple example...

Talking About Vehicles and Cooking - Without Using Those Words

Imagine we have some documents that talk about vehicles, and another set of documents which talk about cooking.

But - and this is important - imagine they never use the word "vehicle" or "cooking" ... instead they talk about specifics like wheel, seat, engine, slice, oven, boil ...

Here's a word-document matrix that shows how often those words occur in our documents.

You can see how 3 documents talk about vehicle stuff .. but don't mention the word vehicle. And you can see that the remaining 3 talk about cooking stuff ... but don't mention the word cooking.

And we know that we should be using word frequency to take account of bias from long or short documents:

That's all nice and easy so far ... we've done this loads in our previous posts.

Two Questions

Looking back at the documents and their words above .. we can ask 2 really good questions:

  • How do we visualise those documents .. with many dimensions (words) .. on a 2 dimensional plot .. so that I can see them clearly and usefully on my computer or on a book page?
  • How do I extract out the underlying topics (vehicle, cooking) ... especially when the documents don't actually mention these more general topic names specifically?

You can see that these are actually good questions - if we could do them, we'd have very powerful tools in our hands!

In fact .. the solution to these two questions is more or less the same! Let's keep going ...

Can't Plot A Billion Dimensions!

Let's step back a bit ... and look at that first question.

If we wanted to compare documents to see how near they were to another one based on their word content, we'd have some difficulty. Why? Because the axes of that plot would be the words (wheel, seat, boil, etc) ... and there are loads of those ... too many to plot meaningfully.

If we had a vocabulary of only two words (wheel, boil) then we could plot the documents on a normal 2-dimensional chart .. like the following, which shows 6 documents plotted in this word space.

By limiting ourselves to 2 dimensions .. 2 words (wheel, boil) we miss out on so much information about those documents. Documents which appear far apart might actually be close if we looked at other dimensions .. if we had chosen other words instead.

So brutally cutting out only 2 dimensions of the word-space we've severely limited the information that view contains .. and it could even be misleading. 

A 3-dimensional view isn't much better ... the following shows 2 dimensions, representing 3 words (wheel, slice, seat).

So what can we do? Even our simplified documents above have only a vocabulary of 6 words ... and we can't plot 6 dimensional charts on this display or paper.

We clearly need to reduce the number of dimensions ... but do so in a way that doesn't throw too much information away .. a way that preserves the essential features or relations in the higher dimensional space. 

Is this even possible? Well .. sort of, yes... but let's work through our own simple documents to illustrate it.

From 6 Dimensions .. Down to 2 

Look again at the 6 word vocabulary that our six documents are made of .. wheel, seat, engine, and slice, oven, boil.

You can see what we already knew .. there are actually 2 underlying topics ... vehicle and cooking.

The first topic vehicle is represented by the three words wheel, seat, engine .. similarly, the second topic cooking is represented by the words slice, oven, boil.

So .. maybe .. we can reduce those 6 dimensions down to 2 dimensions ... with those 2 axes representing the 2 topics  vehicle and cooking?

That would be a useful thing to do .. not just because we're making it easier to plot ... but because those 2 topics are a good generalisation of the topics represented by the many more words in those documents.

How do we do this? How do we use those word frequency numbers?

Well .. we could invent all sorts of ways to combine those 3 numbers for wheel, seat, engine into 1 for vehicle. As usual, if simple works we should prefer it.

Let's start with simply averaging those numbers:

We've cheated here because we know which words to associate with each topic .. we'll come back to that later .. for now we're illustrating a different point about reducing dimensions.

Let's see what the plot now looks like:

Here we have 2 clear topics ... which is a simpler, distilled, view of the 6-dimensional original data.

Topics, Not Words

Ok - so we've used a rather constructed example to show how we can reduce lots of dimensions down to a smaller number .. like 2 here.

That answers the first question we set ourselves .. how do we visualise high dimensional data.

That second question was about how we extract out topics ... topics which are represented by several or many other words in documents. We've just seen how that extraction can happen.

By combining words, by averaging for example, into a single number .. we can extract out topics. Those topics won't be named by this method .. but will tell us what that topic entails.

That's fantastic! And pretty powerful!

Imagine being able to extract out the top 5 topics from a corpus ... and even if we don't have names for each of those topics .. we will know what they're made of ... For example, a topic T made of words apple, banana, orange, kiwi .. probably tells us that the topic T is fruit.

Next Time

That's cool .. we've asked some interesting questions, shown the problem of visualising and understanding high dimensional data .. and one suggested way to reduce dimensions to extract out more general topics.

... but the only thing left to work out is how to choose which words form a topic .. that's the next post!

Friday, 4 November 2016

Grouping Similar Documents .. aka Clustering. Part 2/2

Last time talked about how useful it would be to automatically group similar documents - and we developed quite a simple idea for doing this.

And it worked quite well ..

For a test mixed corpus, made of documents from 4 different corpuses ... the method successfully separated out a cluster consisting only of recipes ... not bad at all for such a simple underlying idea that similar documents share common words ... nothing more complicated than that.

But we need to do more experimenting to see if we fix the big cluster which is a mix of document types ... we should ideally have 4 clusters - Clinton emails, Iraq report, Italian recipes and Macbeth.

The following reminds us of how we constructed the mixed courpus, from 13 documents from each of the source corpora, and our expectation that successful document similarity clustering well separate them again according to these types.

Experiment 1: Remove Stop Words

Perhaps the similarity scores are being skewed by those boring stop-words like a, the, in, it, on ... let's apply our filter to brutally chop them out.

The results, showing 150 links are ...


That works fantastically! There are 4 clear clusters .. and they are pure ... they only contain documents of the same type.

The following shows the same graph plot with the clusters shown more clearly.

This tells us that we do need to deal with boring words skewing the similarity scores. We previously tried to avoid the brutal method of removing stop-words directly and using measures of interestingness ... we could try that ...

Experiment 2: Normalise The Word Frequency Vectors

Let's undo our brutal stop-word removal and try a different approach.

Our current calculations are using word frequency, not raw word counts, to account for any bias from long or short documents .. which will have more or less words because of this. A short document can never have a very high word count.

But we didn't normalise the document vectors when we took the dot product. This wasn't intentional .. we simply forgot!

We can use the scipy.spatial.distance.cosine() function which is just (1 - similarity) that we want .. and includes normalising the vectors.

Here's the top of the results... looks promising because there doesn't appear to be any incorrect similarities ..

doc1 doc2 similarity
0 C05739561.txt C05739554.txt 1.000000
1 C05739565.txt C05739563.txt 0.999959
2 C05739561.txt C05739546.txt 0.997317
3 C05739554.txt C05739546.txt 0.994921
4 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-32.txt 0.994283
5 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-32.txt 0.992807
6 the-report-of-the-iraq-inquiry_section-32.txt the-report-of-the-iraq-inquiry_section-31.txt 0.992191
7 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-33.txt 0.991721
8 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-12.txt 0.989763
9 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-31.txt 0.987702
10 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-31.txt 0.981875
11 the-report-of-the-iraq-inquiry_section-32.txt the-report-of-the-iraq-inquiry_section-12.txt 0.978545
12 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-12.txt 0.975892
13 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-12.txt 0.968804
14 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-11.txt 0.965843
15 the-report-of-the-iraq-inquiry_section-12.txt the-report-of-the-iraq-inquiry_section-11.txt 0.965642
16 the-report-of-the-iraq-inquiry_section-20.txt the-report-of-the-iraq-inquiry_section-12.txt 0.964386
17 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-20.txt 0.960154
18 the-report-of-the-iraq-inquiry_section-15-1.txt the-report-of-the-iraq-inquiry_section-12.txt 0.956302
19 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-15-1.txt 0.953367

So normalising the vectors seems to help ..  let's see what happens if we plot the top 50 links:

Looks promising .. let's plot more... the top 150:

Oh no! ... things have broke down a bit .. yes, there are 2 good clusters that are pure .. and even a pure tiny cluster .. but the fourth cluster is a big mix of documents that should have been in different clusters.

So normalisation helps mitigate the effects of lots of boring words .. but why? Let's take this step by step ..

  • Our simple idea for document similarity says that documents are similar if they share common words ... documents in a cluster about cars won't mention food words, for example.
  • But the boring stop-words like .. is, a, an, on, to, the .. appear in all kinds of documents. They don't help identify which genre or cluster a document should be in .. that's why they're boring .. they're not relevant words.
  • These boring words occur lots, by far, and dominate the word frequency of more interesting words .. making documents seem similar when they aren't ... because they have lots stop words in common .. and the frequencies are higher than useful words.
  • Normalising the document vectors makes the very large stop-word frequencies smaller .. so that the vectors have a length of 1. The effect on large stop-word frequencies is significant .. but the effect on small interesting words is less so ...  the following diagram shows this.

You can see the effect on high boring-word frequencies is much more dramatic (bigger change) than it is on interesting word frequencies (smaller change).

That explains why normalising vectors seem to mitigate the effect of overly prominent stop-words. But it isn't perfect as we saw.

Word Interesting-ness Factor

What did we do before, when we were trying to deal with boring words? We developed a measure of interesting-ness or relevance .. by combining three factors:

  • term frequency (TF) - how often a word occurs in a document
  • inverse document frequency (IDF) - how often a word occurs across the corpus
  • .. and our own refinement penalising shorter words - because most stop-words are short

Maybe we should use the relevance scores to automatically weed out the boring words?

To do his is easy, we just read the relevance matrix that we created earlier, instead of the word count index. Also, we comment out the calculation of word frequency because that is already done for the relevance index.

The results look promising ... there aren't any incorrect looking similarity pairs in the top 20 list:

doc1 doc2 similarity
0 C05739561.txt C05739546.txt 1.000000
1 C05739561.txt C05739554.txt 0.997375
2 C05739565.txt C05739563.txt 0.992705
3 C05739554.txt C05739546.txt 0.991492
4 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-32.txt 0.950962
5 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-33.txt 0.926943
6 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-32.txt 0.925336
7 the-report-of-the-iraq-inquiry_section-32.txt the-report-of-the-iraq-inquiry_section-31.txt 0.916514
8 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-31.txt 0.887027
9 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-31.txt 0.858477
10 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-12.txt 0.845943
11 the-report-of-the-iraq-inquiry_section-32.txt the-report-of-the-iraq-inquiry_section-12.txt 0.788654
12 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-12.txt 0.758289
13 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-12.txt 0.742125
14 C05739567.txt C05739545.txt 0.706729
15 the-report-of-the-iraq-inquiry_section-34.txt the-report-of-the-iraq-inquiry_section-11.txt 0.700997
16 the-report-of-the-iraq-inquiry_section-31.txt the-report-of-the-iraq-inquiry_section-11.txt 0.686425
17 the-report-of-the-iraq-inquiry_section-12.txt the-report-of-the-iraq-inquiry_section-11.txt 0.665006
18 the-report-of-the-iraq-inquiry_section-32.txt the-report-of-the-iraq-inquiry_section-11.txt 0.651868
19 the-report-of-the-iraq-inquiry_section-33.txt the-report-of-the-iraq-inquiry_section-20.txt 0.647377

In fact extending the list to the top 100 also looks good ... very promising! This gives us hope that we can automate the mitigation of stop-words .. and not use the brutal method of filtering out predetermined stop-words.

Let' see the visualisation of the clusters. Here's the top 50 links:

Looks good .. clear distinct and pure clusters... let's look at the top 100:

So far so good, 4 clear clusters as we'd expect .. and they're all pure. The recipe's one is a little small with only 2 documents in it ... that must mean other document kinds ... like the Iraq report documents .. are more similar to each other than the recipes .. which kind of makes sense because the recipes are for different dishes, and so have different ingredients and cooking methods.

In fact, if we look at the two recipes shown 10.txt and 11.txt they are for white sauce and yellow sauce, with lots of similarities for preparation .. so no surprise that they should have a higher similarity score than other recipes.

Let's look at the top 200 now ..

Look at the above .. I've colour coded them to make it easier to see which documents are in each cluster.

Whoah! ... what happened? Five clusters?

The recipes seem to have been split into 2 clusters ... but if we examine the smaller cluster of 2 we see that they are both recipes for rice cakes, and are similar to each other .. but they must be sufficiently different to the other recipes for links not to appear in the top 200.

If you look closely at the Iraq report cluster you'll see one document hanging onto the rest by a tenuous link .. its the annex-4. That's just an appendix listing maps ... and doesn't contain much else that's topical .. so the graph visualisation really does show which documents, even in a cluster, are only there by the most minimal of connections.


Let's keep pushing the number of links to plot .. this is tej top 800:

Now that's less useful .. everything is starting to connect to everything now because the even weak similarity links are now shown . even so you can still see the clusters .. the Italian recipes and Clinton emails trying to stay together and away from the rest of the mass.

The Macbeth play and the Iraq report are most easily merging together  ... maybe there's is a message in there somewhere about both tragedies?

Let's stop there and sum up...


We've made fantastic progress:

  • Removing stop-words is effective, and results in clean clusters.
  • But if we don't want to use a predetermined stop-word list .. 
  • Normalisation of the document vectors helps mitigate overly high counts for boring words.
  • And relevance scores (TF/IDF and similar) work great to automatically and adaptively remove boring commonly occurring words.
  • The graph visualisation is really powerful for insights into the text data .. surfacing the natural number of clusters, tenuous links, and even clusters within clusters. 

The code for document similarity is at GitHub:

Great job!

Wednesday, 2 November 2016

Grouping Similar Documents .. aka Clustering. Part 1/2

Being able to group similar documents together is useful .. especially if it can be done automatically.

Why is this useful? Some reasons include ..

  • Finding useful documents based on one you know is useful.
  • Finding documents that are similar to one that you found bathing a keyword .. where the similar don't have that keyword, but are still similar.
  • Automatically organising a medium-to-large collection of documents to help quickly understand what the corpus is about, and maybe prioritise which documents you look at further.
  • ....

Ok that sounds good .... how do we go about doing that?

Word Similarity

Well, we've already done the pretty amazing work for finding words (not documents) that are related or similar.

We did that based on co-occurrence .. which meant that words could be related, even if the two words themselves didn't co-occur in the same document themselves. They'd be related because they shared other words which co-occurred. Have a look at the following to remind ourselves how car and van are related, even though they don't co-occur directly.

Do we want to do this for documents? Could we?

We can't directly. Documents don't have the same idea of co-occurrence with other documents. Well, not in a useful way.

So we have to roll back to first principles again.

Document Similarity?

How do we want to construct the concept of document similarity? Yes, we have the power to invent it ourselves!

Let's start with the most obvious idea ... similar documents will have common words.

That is simple - and makes intuitive sense. Documents about cars will all talk about wheels, seats, paint, etc. The more words they have in common, the more similar they are.

We could try to come up with even more sophisticated schemes like ...

  • documents are similar if the words they contain have a high co-occurrence score with words in the similar document .. so documents about cars are considered similar to documents about vans.
  • documents are similar if the distribution of word counts is similar ... so the distribution is kind of like a fingerprint
  • ...

But as usual, let's start simple first .. and only get more complicated if we really need to.

Simple Document Similarity

That simple idea of documents being similar if they contain common words should be easy to implement in code.

Remember, we've already got a matrix of word frequency per document. If we simplify those word frequencies to a binary yes/no .. a 0 or 1 .. we can then match up words that occur in more than one document.

The following shows this simplification of word count to yes/no or 1/0. You can see that any word count value above 0 gets turned into a 1, and 0 stays 0.

If you look at the above diagram for a few moments, you can start to see what the documents might be about:

  • doc1 and doc2 seem to be about cars as they both contain wheel and seat.
  • doc3 does't seem to be about cars.
  • doc3 seems to be about cake.
  • doc1 and doc2 are definitely not about cakes.
  • but doc3 mentions lights .. maybe candles on a cake, rather than car headlights?

You can start to see how doc1 and doc2 are similar.

More Similar?

But is doc3 similar to doc2 because they both mention lights? Surely not to the same extent as doc1 and doc1 are similar because they're about cars. How do we encode this notion that doc1 and doc2 are more similar than doc2 and doc3?

We could add up the number of "hits" or "matches" ... as shown below:

That certainly shows that doc1 and doc2 are more similar than doc2 and doc3. Job done!

Tiny Bit More Sophisticated?

Do we need to make this more complicated by using the actual word counts, rather than the simplified 1's and 0's?

Let's try it and see what happens ... if we use those word count numbers and multiply them ... kinda like we did with word similarity and co-occurrence scores.

That actually looks like an improvement. The larger numbers boost the similarity scores. We now have doc1 and doc2 as very similar ... and doc2 and doc3 and very slightly similar.

That's what we want - so let's keep this approach!

Those Vectors .. Again!

Those multiplications look very similar to the calculations we were doing for word-word similarity ... when we considered words as vectors in a space of co-occurrences.

This time we have documents (not words) in a space where the dimensions are the word counts (not co-occurences).

Look again at the above diagram .. imagine we have a new document ... it will sit in the vector space and be positioned according to the words it contains .. and how often those words occur. 

So that similarity calculation where we multiply the word counts ... we've already seen that before, it was just like taking the dot product of the two things being compared. So here we have:

$$\mathbf{similarity} = \mathbf{doc1 \cdot doc2}$$

Remember that this dot product is a measure of the angle between two vectors $\mathbf{a \cdot b} = |a| \  |b| \ \mathbf{cos}(\theta)$. If the vectors are normalised to length 1, this simplifies massively to  $\mathbf{a \cdot b} = \mathbf{cos}(\theta)$.

A smaller angle means the vectors are more aligned, and $\mathbf{cos}(\theta) \to 1$ ... a bigger angle means they are point in very different directions and are not similar, and $\mathbf{cos}(\theta) \to 0 \to -1$. Let's update our diagram to show this:

Just like with word-vectors ... you can see that similar documents will point in similar directions .. the direction being influenced directly by the word content of those documents.

Do we really want to normalise those vectors? What do we lose if we do?

If a document has a very large word count for cake, for example, then the vector will be stretched along that axis quite a long way, which will effect the angle it makes with a document with a smaller cake word count. So we don't lose the angle information. And anyway the angle doesn't change massively as word count changes. So we don't lose much and we gain much simpler calculations.

Cool - so let's crack on and see what the results are!

Mixed Corpus 

To find similar documents in a corpus, and to group them (also called clustering) .. we should have a corpus that does actually contain groups of different kinds of documents. At least for testing these ideas .. .in the real world we won't know what a corpus contains.

A mixed corpus can be made by taking a small set of documents from existing corpuses, each of which are centred around a theme .. e.g. recipes, clinton emails, shakespeare ..  That is a quick and easy approach ...

The new mixed corpus lives at:

A Little Refinement - Word Frequency not WordCount

Previously we've realised that word count on it's own can bias towards larger documents, and unfairly penalise shorter documents which will never have high word counts.

To mitigate this we used word frequency - that is word count divided by the number of words in a document. We'll do that here too.


Listing the document pairs in order of descending similarity we hope to see documents that come from the same original corpora as being similar to each other.

That is - we hope to see documents about recipes being most similar to other documents about recipes ... and sections of Macbeth similar to other sections of Macbeth ...

Let's see what we actually get. The following is a snipper of the full 1326 rows of the similarity pairing:

doc1 doc2 similarity
0 11.txt 10.txt 1.000000
1 10.txt 06.txt 0.843739
2 10.txt 08.txt 0.827247
3 10.txt 07.txt 0.810582
4 12.txt 10.txt 0.809347
5 12.txt 11.txt 0.791245
6 11.txt 06.txt 0.781809
7 the-report-of-the-iraq-inquiry_section_annex-5... 10.txt 0.778167
8 10.txt 05.txt 0.768322
9 11.txt 08.txt 0.754004
10 10.txt 03.txt 0.751363
11 12.txt 03.txt 0.744406
12 12.txt 06.txt 0.744156
13 12.txt 08.txt 0.742580
14 06.txt 05.txt 0.730759
15 macbeth_act_01_scene_01.txt 10.txt 0.717309
16 10.txt 02.txt 0.709200
17 11.txt 07.txt 0.701762
18 C05739567.txt C05739545.txt 0.700932
19 C05739567.txt C05739559.txt 0.688125
20 the-report-of-the-iraq-inquiry_section_annex-5... 11.txt 0.686571
21 11.txt 03.txt 0.684366
22 13.txt 12.txt 0.672441
23 08.txt 06.txt 0.671072
24 06.txt 03.txt 0.670434
25 11.txt 05.txt 0.665943
26 13.txt 10.txt 0.665068
27 10.txt 01.txt 0.662917
28 the-report-of-the-iraq-inquiry_section_annex-1... 10.txt 0.660124
29 11.txt 09.txt 0.652069
... ... ... ...
70 08.txt 01.txt 0.561320
71 the-report-of-the-iraq-inquiry_section_annex-1... 06.txt 0.560240
72 the-report-of-the-iraq-inquiry_section_annex-5... the-report-of-the-iraq-inquiry_section-11.txt 0.559821
73 09.txt 08.txt 0.558997
74 C05739559.txt C05739547.txt 0.556281
75 the-report-of-the-iraq-inquiry_section-12.txt 11.txt 0.556279
76 13.txt 08.txt 0.555599
77 09.txt 03.txt 0.554793
78 03.txt 02.txt 0.549816
79 the-report-of-the-iraq-inquiry_section_annex-5... the-report-of-the-iraq-inquiry_section-20.txt 0.548914
80 the-report-of-the-iraq-inquiry_section-11.txt 06.txt 0.548661
81 12.txt 02.txt 0.547816
82 the-report-of-the-iraq-inquiry_section_annex-5... macbeth_act_01_scene_01.txt 0.545395
83 the-report-of-the-iraq-inquiry_section-32.txt 10.txt 0.544826
84 07.txt 05.txt 0.544456
85 11.txt 04.txt 0.544003

Looking at the above, and the rest of the list, we can see:

  • The documents most similar to themselves are the recipes. That's because they share loads of words amongst each other.
  • You may notice that there are no documents most similar to themselves .. that's because the code uses itertools.combinations() which excludes same-same combinations .. which works for us! (we're not interested in the obvious fact that a document is perfectly similar to itself)
  • The results don't perfectly separate into neat groups, the documents we already know are from distinct corpora. For example the 8th row shows a document from the Iraq inquiry being similar to a recipe.

Let's try visualising the results .. which might help us see the big picture, rather than focus on the individual odd matches.

Visualising Document Similarity

The graph of connected nodes that we used before seems like a good starting point for visualising document similarity. 

We do need to change it so that the most similar documents are clustered together in the graph. That means the links need to attract the document nodes. We'll do more refinement in the next post part 2/2 .. but for now we can apply the following change to the visualisation javascript d3.js code:

    var simulation = d3.forceSimulation(graph.nodes)
        .force("link", d3.forceLink(graph.links).distance(function(d){return 10 / d.%%edge_attribute%%;}))
        .force("charge", d3.forceManyBody().strength(-20))
        .force("radius", d3.forceCollide(15))
        .force("center", d3.forceCenter(width / 2.0, height / 2.0));

Previously all the links in the graph had a constant "normal length" (before being subject to forces) .. we now use the similarity score to make strongly similar documents tend to have shorter links.

Let's see what happens if we visualise the top 7 similarities (excluding the Iraq document):

This is nice! It may not look like much but we have automatically .. without any human reading .. identified a bunch of documents from a mixed set .. which are similar to each other. We can confirm this worked because we actually prepared the mixed corpus .. and this first group .. or cluster... are all Italian recipes documents!


Let's add more documents to that visualisation .. let's up the number to .. say .. 50 similarities ..

There are two distinct clusters here. We couldn't easily see that by looking at that list of doc1-doc2 similarities. That's the power of visualisation!

But that second cluster is mixed .. it seems to have both Iraq report documents and Italian recipes in there... we need to investigate further.

Before we pause and investigate further ... let's have a look at what the top 150 links looks like.

There's a similar story here... 2 distinct clusters ... one is fairly pure .. that was the one with the top-most similarity scores .. and the second cluster is mixed ...

Conclusion and Next Steps

Some definitely cool progress:

  • we've developed a simple idea for document similarity, and it seems to work
  • visualisation is much better as showing clusters than a list of numbers
  • ... there similarity method doesn't work perfectly so needs some investigation ...
Next steps are:
  • investigate to see why the similarity method doesn't work better - e.g. why Iraq docs are calculated as similar to recipes
  • investigate to see if we can further improve visualisation