Wednesday, 31 August 2016

Improving on Simple Word Counts (TF-IDF)

Up to now we've used word counts - how often a word occurs in a document or an entire corpus - as a measure of how important or relevant that word is.

We did it early on when we looked at word clouds to visualise the most important themes. And we did it more recently, when we tried to organise search results, trying to put the most relevant ones first.

The assumption that a word that appears more often in a document (or corpus) is relevant or important makes intuitive sense ...

.. but it also has some down-sides .. so let's look at a few, and see if we can make simple improvements. I say "simple" because we want to avoid overcomplicating things as much as possible, as long as a simpler approach works well enough.

Problem 1: Boring Words

If you think about it, the most frequent words are boring words like "the", "and", "it", "is" .. and so on.

On their own, they're not informative, they don't tell us what a document is about. If I gave you a mystery document to analyse, and you could only see the top five words: "is", "the", "it", "and", "to" ... you wouldn't be able to say what the document was about.

If you remember the recipes example, the first 12 most frequent words were boring, uninformative words that had nothing to do with recipes at all.

We attempted to fix this by removing such stop words. But that approach is a little unsatisfactory, because we have to manually create such a list, and we humans might disagree on which words should be included and which shouldn't.

Also, we might spend ages crafting a perfect stop-word list for one set of documents, then find it was completely wasted effort for another set of documents, because we over-fitted them to the first set. So manual stop word lists are not adaptive - they don't adjust themselves to different sets of documents.

So let's think again about how we might automate identifying such boring words. Let's start by thinking about how they behave, their characteristics:

  1. they occur often, and in all documents
  2. they're often short words, long words tend to have significant meaning

That's all I can think of for now... but that's a powerful start already. Let's see why...

The first observation, that boring words occur often and across all documents, is pretty powerful. It allows us to automatically identify them, and that identification can be different for different sets of documents. The set of boring words for Shakespeare's plays might be different to those for medical reports.

But aren't we back to square one if a set of documents about sport all contain the word sport, frequently and in every document .. as you'd expect they might? Well, that second observation can help, because we know that many boring words are short.

Anyway ... how would we encode this idea as a practical algorithm? Well, let's think out loud about what we're trying to express ...

  1. boring words occur often, and in all documents ... so this suggests we note all the documents the word occurs in, and the fraction $\frac{documents\ with\ word}{total\ number\ of\ documents}$ is a measure of boring-ness. A low score means the word is interesting because it is not liberally peppered everywhere in all documents.
  2. boring words are often short words ... suggests we apply a moderating factor to the above measure which penalises shorter words and rewards longer ones. Maybe something as simple as dividing by word length is enough? Normalising word length could suppress the effect of normal words if there was an outlier, a very long word, in the text.
So a measure of boring-ness could be:

$$\frac{documents\ with\ word}{total\ number\ of\ documents} \cdot \frac{1}{(word\ length)}$$

That's boring-ness. If we want the opposite, interesting-ness, we flip both factors: subtract the document fraction from 1, and multiply by word length instead of dividing, to get:

$$\left \{ 1 - \frac{documents\ with\ word}{total\ number\ of\ documents} \right \} \cdot {(word\ length)}$$

... so that the first fraction ranges between 1 (interesting) and 0 (boring).

By the way, subtracting from 1 avoids the problem of division by zero if we, instead, inverted the fraction, which many people like to do.
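To make these two measures concrete, here's a minimal Python sketch. The function names and the example corpus statistics (10 documents, the words "the" and "cinnamon") are made up for illustration, not taken from a real corpus:

```python
# A sketch of the boring-ness and interesting-ness measures above.

def boringness(docs_with_word, total_docs, word_length):
    # fraction of documents containing the word, damped by word length
    return (docs_with_word / total_docs) * (1 / word_length)

def interestingness(docs_with_word, total_docs, word_length):
    # subtract the document fraction from 1, then reward longer words
    return (1 - docs_with_word / total_docs) * word_length

# imagine "the" appears in all 10 documents, "cinnamon" in just 2
print(boringness(10, 10, 3))       # 0.333...
print(interestingness(10, 10, 3))  # 0.0 - maximally boring
print(interestingness(2, 10, 8))   # 6.4 - rare and long, so interesting
```

Notice a word that appears in every document scores exactly 0 for interesting-ness, no matter how long it is.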

Ok, ok!  ... all this isn't precise mathematical derivation from axioms .. but that's the way many methods in text analytics have grown because natural language itself isn't a mathematically correct algebra. We can fix these early ideas later if we find they don't work so well.

Problem 2: Longer Documents Cheat Word Counts

This problem is easy to see.

Imagine a document about books. It will contain the word books, and we expect it will have a higher count of the word books than other documents that aren't focussed on the fascinating topic of books.

But imagine, someone wanted to cheat our system of using word frequency counts ... and took our document and simply doubled it. That is, copied and pasted the text at the end of the same document, to make a new document, double the length, but with the original repeated twice.

You can see in the diagram above that document 2 actually has the word book in every sentence, but the cheating document 1 doesn't. Should we reward this cheat? Should we really give this document double the word count for the word book?

You can start to see that actually, document length can have an unfair bias on relevance.

What to do about it? That's easy too! When trying to account for biasing factors, a very common approach is to normalise what we're interested in (frequency) with the factor that might be causing bias (document length).

Here's how we would counter the biasing (cheating) effects of longer documents:

$$normalised\ frequency = \frac{word\ count}{total\ words\ in\ document}$$

For the above illustrated example, we'd have an updated measure for the word book:

  • document 1 normalised frequency = 4/44 = 0.091
  • document 2 normalised frequency = 3/17 = 0.176

Now the second document comes up top, with almost double the relevance score .. we fixed the bias!

You can think of this normalised frequency as a way of measuring word density. Actually, if we're being pedantic, frequency is the right word .. we should have been using count, not frequency before.
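Here's a minimal Python sketch of normalised frequency; the example sentence is invented for illustration. It also shows why the copy-and-paste cheat no longer works, because doubling a document leaves its word density unchanged:

```python
# Normalised frequency (word density): word count divided by total words.

def normalised_frequency(word, document):
    words = document.lower().split()
    return words.count(word) / len(words)

doc = "the book about books is a good book because book lovers like it"
print(normalised_frequency("book", doc))  # 3 of 13 words

# doubling the document doesn't change the density - the cheat fails
doubled = doc + " " + doc
print(normalised_frequency("book", doubled))  # still 3/13
```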

Combined Measure of Relevance

We can combine the above two measures which aim to reflect word interestingness, and also counter the biasing effect of document length.

The combined measure for the relevance of a word could be:

$$\left \{ 1 - \frac{documents\ with\ word}{total\ number\ of\ documents} \right \} \cdot (word\ length) \cdot \frac{word\ count}{total\ words\ in\ document}$$

A Small Refinement

We should stop fiddling with that expression .. but there is just one more change that I want to make. Two of the three parts of that expression have values between 0 and 1. The only one that doesn't is $(word\ length)$.

A common "squishing" function for squeezing a value into the range 0 to 1 is the $tanh(x)$ function, shown below:

You can see that $tanh(word\ length)$ is 0 if the word length is 0. As word length grows, $tanh(word\ length)$ grows towards 1, but never quite reaches it.

We do need to think about scale because we don't want all words of length 1 to 8 to be mapped to some tiny 0.000001. We want a normal word, of say length 5 letters, to be mapped to about 0.8.

After playing about with different scaling factors, we find dividing word length by 5 seems to be a good setting, giving words of length 5 a decent score, and penalising words of length 2.

Here's a plot of $tanh(word\ length / 5)$ making clear how words of length 0 to 9 are mapped to the range 0 to 1.
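If you'd rather see the numbers than the plot, here's a short Python sketch tabulating $tanh(word\ length / 5)$ for word lengths 0 to 9:

```python
import math

# how tanh(word_length / 5) maps word lengths into the range 0 to 1
for length in range(10):
    print(length, round(math.tanh(length / 5), 3))

# a typical 5-letter word gets a decent score of about 0.76,
# while a 2-letter word is penalised down to about 0.38
```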

So our expression for relevance, which we'll stop fiddling with now, is ...

$$\left \{ 1 - \frac{documents\ with\ word}{total\ number\ of\ documents} \right \} \cdot tanh \left( \frac{word\ length}{5} \right) \cdot \frac{word\ count}{total\ words\ in\ document}$$

Looks complicated, but it's only three factors multiplied together, representing:

  • term frequency
  • inverse document frequency
  • penalising short words.
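Putting the three factors together, here's a minimal Python sketch of the full relevance measure. The example document and the corpus statistics (2 of 10 documents containing "cinnamon") are invented for illustration:

```python
import math

# the combined relevance measure: term frequency, our interesting-ness
# factor, and the tanh word-length penalty, multiplied together

def relevance(word, document, docs_with_word, total_docs):
    words = document.lower().split()
    term_frequency = words.count(word) / len(words)
    interestingness = 1 - docs_with_word / total_docs
    length_factor = math.tanh(len(word) / 5)
    return term_frequency * interestingness * length_factor

doc = "this recipe uses cinnamon and the cinnamon flavour is strong"
print(relevance("cinnamon", doc, 2, 10))  # rare word, high relevance
print(relevance("the", doc, 10, 10))      # 0.0 - in every document
```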

Using A Word Relevance Measure

How do we use a word relevance measure, like the one we've developed above?

There are at least two ways we can use it:

  • Use it to identify the most important words in a document or corpus .. like we previously used word count to drive how big words were plotted in word clouds.
  • Use it to rank search results .... like we used simple word count before. If the search query contains more than one word, we simply add the measures for each query word found in the matching document.

And there will be more, I'm sure ...
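As a sketch of the second use, ranking search results, here's how summing per-word measures might look in Python. To keep the example self-contained, plain word density stands in for the full combined measure, and the documents are made up:

```python
# ranking search results: for a multi-word query, add up the measure
# for each query word found in the document, then sort

def word_density(word, document):
    words = document.lower().split()
    return words.count(word) / len(words)

def rank(query, documents):
    scored = []
    for doc in documents:
        score = sum(word_density(w, doc) for w in query.lower().split())
        scored.append((score, doc))
    # highest scoring documents first
    return sorted(scored, reverse=True)

docs = ["books about old books", "a guide to garden birds", "bird books"]
for score, doc in rank("bird books", docs):
    print(round(score, 3), doc)
```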

Everyone Talks About TF-IDF

What we've arrived at by ourselves is very similar to the popular and widely used measure of interesting-ness called Term-Frequency Inverse-Document-Frequency (TF-IDF).

If you refer to most textbooks, or Wikipedia, you'll see that most variants of TF-IDF contain 2 of our 3 factors. Many implementations also use $log(\frac{total\ number\ of\ documents}{documents\ with\ word})$ for word interesting-ness - that's fine, because it is 0 when every document contains the word and grows as fewer documents do. But it doesn't do what our measure does, which is stay within the range 0 to 1.
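A quick Python comparison makes the difference visible; the corpus size of 10 documents is invented for illustration:

```python
import math

# comparing the common log-based IDF with our (1 - df/N) factor:
# both fall as more documents contain the word, but only ours
# stays within the range 0 to 1
total_docs = 10
for docs_with_word in [1, 2, 5, 10]:
    log_idf = math.log(total_docs / docs_with_word)
    ours = 1 - docs_with_word / total_docs
    print(docs_with_word, round(log_idf, 3), round(ours, 3))
```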

Next Time - Results and New Code

We'll have to try out these ideas to see how they work. The next post will cover this .. and also talk about a revised code implementation to do all these new things that the current very simple index based on word counts doesn't.


  1. Using "boring words" to improve TF-IDF is really interesting!

    1. Thanks Frank .. I think I will be coming back to this, as the most common way of doing this uses a log() function, and there is a good reason for that which I will research and talk about again.

      But for now I wanted to just start off with an intuitive explanation first.