Thursday, 23 June 2016

Letters, Words or Documents?

The Natural Language Challenge

Trying to extract meaning from unstructured natural language text is different from, and often more challenging than, working with structured numbers or labels.

There are several reasons for this.
  • There is no strict, clear structure in natural language that gives meaning - no columns or rows, lists or arrays, column headers or field labels.
  • Even the small amount of structure (grammar) that exists in natural language is often broken. Human languages have exceptions, ambiguity, multiple spellings, idiomatic phrases, regional expressions, ... and even more difficult things like sarcasm and irony.

That "I ain't done nuffin'" could mean the blue bird has not done nothing, that is, he's done everything!

Natural language wasn't designed to be an efficient, unambiguous and precise language that we could easily compute with.

It grew organically over time, with more chaos than people sometimes expect. It was only about 400 years ago that spelling for common words started to settle down for English.

Messy Ambiguity

Have a look at the following piece of natural language:

Fruit flies like a banana.

What does it mean? Fruit flies, a kind of insect, like a banana as a meal? Fruit, when thrown, has the aerodynamics of a banana ... and lands with a splat!? The meaning is ambiguous.

Do we give up? No! We accept the reality of natural language, and try to develop algorithms that are useful, good enough, even if occasionally the messiness of human language breaks them.

What Do We Compute With?

Ok, if we are going to forge ahead and try to compute with natural language, what are the things we manipulate and do calculations with?

As we said at the top, for structured data this is much easier. Numbers are easy to calculate with. We can sum them, find averages, cluster them, etc. Structured text is easy enough too - the names of candidates people vote for, the class names of flowers in the Iris dataset, the names of regions ... and so on. We can count them, create sets of them, and sometimes order them or group similar ones together. In all these cases, each item of structured data is a number or a label, and each has a precise meaning.

It is worth asking what the unit of computation for natural language text should be. Is it a letter? A word? A sentence? A paragraph? A document? A collection?

We could answer this by looking at what computers actually do themselves. Computers are nothing more than a bunch of electrical switches. The flow of electricity is either switched on or off - which naturally represents the 1 and 0 of binary numbers. You'll remember these from school: binary 001 is one, 010 is two, 101 is five.
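Here is a small sketch, in Python, of the binary numbers just mentioned. The helper name to_binary is made up for this illustration.

```python
def to_binary(n, width=3):
    # Format an integer as a binary string, padded with zeros to `width` digits.
    # (to_binary is a hypothetical helper name, not from any library.)
    return format(n, "0{}b".format(width))

# Convert the binary strings from the text into ordinary integers.
for bits in ["001", "010", "101"]:
    value = int(bits, 2)  # the 2 means "read this as base 2"
    print(bits, "is", value)

# And back again, from integer to binary string.
print(5, "is", to_binary(5))
```

Python's built-in int() and format() do the conversion for us, so there is nothing more to it than choosing base 2.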

The letters you see on your screen as you read this are represented inside your computer or smartphone using a character numbering code called ASCII, which has been in place since the 1960s. In this scheme A is 65, B is 66 and lowercase z is 122. You can look up all of the characters in any ASCII table. If you watch traffic on your network, you'll see these codes flying past as they travel to and from the internet. (Don't do this without permission on a network that isn't yours!)
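You can check these character codes yourself. The following is a minimal Python sketch using the built-in ord() and chr() functions. The word "cat" is just an example chosen for this illustration.

```python
# ord() gives the numeric code of a character, chr() goes the other way.
codes = {ch: ord(ch) for ch in "ABz"}
print(codes)      # {'A': 65, 'B': 66, 'z': 122}
print(chr(65))    # A

# A whole word, as the computer stores it: a sequence of numbers.
word = "cat"  # example word, chosen for illustration
as_numbers = list(word.encode("ascii"))
print(as_numbers)

# And back from numbers to text.
print(bytes(as_numbers).decode("ascii"))
```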

If ASCII characters are how computers store and transmit text, then maybe we should use letters as the basic unit for computing with natural language?

Let's remind ourselves of what we're trying to do. We're trying to extract meaning from natural language text. When we humans read text, we don't understand a word until we've seen all the letters of it. In fact, our minds can correct the spelling of a word because internally we refer back to an existing notion of what the whole word should be. This suggests it is whole words, not letters, that should be the unit of computation.

Doing simple tasks like changing all the letters to lower case can be done letter by letter. But that's because that task doesn't care what the words mean.
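To make that concrete, here is a rough sketch of lower-casing done one letter at a time. It assumes plain ASCII text and uses the fact that each lowercase letter sits 32 places after its uppercase partner in the ASCII table. The function name is made up for this illustration. In practice you would simply call Python's own str.lower().

```python
def lower_letter_by_letter(text):
    # Hypothetical helper: handles the ASCII letters A to Z only.
    result = []
    for ch in text:
        if "A" <= ch <= "Z":
            # In ASCII, 'a' is 97 and 'A' is 65, a gap of 32.
            ch = chr(ord(ch) + 32)
        result.append(ch)
    return "".join(result)

print(lower_letter_by_letter("Natural Language"))
```

Notice that the function never needs to know where one word ends and the next begins.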

So we're arriving at the conclusion that words are the smallest unit of computation for text mining.
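If words are our unit of computation, the first practical step is to chop text into words. Here is one simple way to sketch that in Python. The function name words and the rule for what counts as a word (runs of letters, with an optional apostrophe part as in "wasn't") are assumptions made for this illustration, and real text will need more care.

```python
import re

def words(text):
    # Lowercase the text, then pick out runs of letters.
    # An apostrophe followed by more letters is kept, so "wasn't" stays whole.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

print(words("Natural language wasn't designed to be efficient."))
```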

We said smallest unit quite carefully there. In some cases, we humans can't understand the meaning of a word without looking at the words around it. Have a look at the following, which illustrates homonyms - words which are written the same but mean different things.

We can only tell the meaning of the word saw from the words around it. They tell us that the first saw is a cutting tool for wood, and the second saw is the past tense of the verb to see. Similarly, the first branch is a part of a tree, and the second branch is a local establishment of a larger bank organisation.

So ... does all this mean we were wrong to think that the unit of computation is a word on its own?

No. Like many things in natural language text mining, the theories are good up to a point. So for us, using a word as the basic unit is good for many cases, but will break down when words need context. In that case, we need to take into account, somehow, the words around the word we're interested in.

In some cases, even a phrase, or even a whole sentence won't make sense on its own, and we'll need to reach further and look at the whole paragraph or even the whole document!
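Looking at the words around a word can be sketched very simply. The function below, whose name and window parameter are invented for this illustration, collects the few words either side of each occurrence of a target word. The sentence is made up too. Working out the meaning from that context is the hard part, and this sketch does not attempt it.

```python
def context(words, target, window=2):
    # For each occurrence of `target`, return the words up to `window`
    # positions before it and after it.
    found = []
    for i, w in enumerate(words):
        if w == target:
            before = words[max(0, i - window):i]
            after = words[i + 1:i + 1 + window]
            found.append((before, after))
    return found

# A made-up sentence for illustration.
sentence = "i saw the saw on the bench".split()
for before, after in context(sentence, "saw"):
    print(before, "saw", after)
```

Making the window bigger is the same idea as reaching out to the whole sentence, paragraph or document.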


  • The smallest unit of computation for text mining is a word, not a letter.
  • Sometimes we'll have to extend beyond a word to establish meaning, looking at the words around it.

Friday, 17 June 2016

Theory, Model and Method

This is my third book, and the experience makes it clear to me that text mining does not yet have as neat and coherent a conceptual framework as other fields.

Loads of Tools, But No Shed

A survey of the many guides will give you lots of methods for processing text to give you some insight into its meaning. There will be methods like

  • word frequency
  • document clustering
  • co-occurrence matrices
  • synonym searching
  • ... etc ... etc ..

It really feels like there are lots of methods and tools on offer, but there doesn't seem to be an overall idea or theory which ties them all together. To put this another way, there is no larger conceptual framework to help us place each of these tools within it - so we can see which is appropriate to use and when.

Natural Language is Messy

This isn't surprising - because the data, the natural language text, itself isn't a mathematically precise and consistent thing. Human language, and the way we use it, is a messy, organic, incomplete and inefficient scheme ... never designed for crisp complete perfectly precise computation.

This leaves us with conceptual frameworks which fall into two camps:

  • probabilistic - ignoring any underlying structure and simply working with the likelihood of an answer based on "counting" how often it has previously happened
  • structural - trying to make use of underlying structure - either known true, or suspected true - to find answers

Theory -> Model -> Method

With this book, which is focused firmly on being accessible to those new to the subject, we won't present methods like the above in an ad-hoc fashion. That would be unsettling, and wouldn't really give comfort that we know what we are doing.

Instead we'll try really hard to follow a pattern for each idea:

  • Start with a Theory - an idea that we think is true, or even know to be true.
  • Use this theory to define a Model which is useful in a computational sense.
  • Explain the Method, or algorithm, that we use to do calculations with that model.

This should be much better than simply throwing a load of methods at readers. Even if we fail to completely fit every tool into a nice perfect complete theory of natural language - we can at least explain what model a method is supposed to work with, and which theory (truth or assumption) that model is trying to reflect.

This transparency also allows you to improve a method for a model, and come up with a better one. Or have more than one method for a model (eg gradient descent vs random search). It even allows you to disagree with the model for a theory, so you can come up with your own (eg frequent words vs rare words). It is even possible to have more than one theory, each one humbly not trying to be all-encompassing but targeting a specific part of natural language.

The following shows these choices:


Here's a simple illustration of the above idea.
  • Theory: The key concepts are mentioned a lot in a document.
  • Model: Frequency of words indicates the key concepts.
  • Method 1: Count the occurrences of words. The most frequent words are the most meaningful concepts.
  • Alternative Method 2: Count the words, but count their synonyms too.

The second method is a refinement of the first method.
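The two methods can be sketched in a few lines of Python. Everything here is made up for illustration: the function names, the tiny list of words, and especially the SYNONYMS table, which is a hand-written stand-in for a proper synonym lookup.

```python
from collections import Counter

def word_counts(words):
    # Method 1: simply count how often each word occurs.
    return Counter(words)

def concept_counts(words, synonyms):
    # Method 2: replace each word by its main form first, then count.
    return Counter(synonyms.get(w, w) for w in words)

# Made-up data for illustration.
tokens = "car engine car automobile wheel vehicle engine".split()
SYNONYMS = {"automobile": "car", "vehicle": "car"}  # hypothetical table

print(word_counts(tokens).most_common(2))
print(concept_counts(tokens, SYNONYMS).most_common(2))
```

With Method 1, car and engine are tied. With Method 2, the mentions of automobile and vehicle are added to car, which pulls it clearly ahead.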

You may agree with the theory but disagree with the model, and say that word frequency is not the best way to model it. You may say that words like "the", "and" and "or" will occur the most frequently, and these words don't indicate any concept at all. This may lead you to an alternative model, such as words which appear rarely in a paragraph, but do appear in many paragraphs - this counteracts the negative effect of boring words like "and".
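The objection is easy to see on a tiny example. The sketch below uses two made-up paragraphs and counts two things: how often each word occurs overall, and how many paragraphs each word appears in. It only demonstrates the problem and the raw numbers an alternative model could draw on. It does not implement any particular alternative model.

```python
from collections import Counter

# Two made-up paragraphs for illustration.
paragraphs = [
    "the cat sat on the mat and the dog sat on the rug",
    "the dog chased the cat and the cat ran up the tree",
]

# Overall frequency of each word across all paragraphs.
counts = Counter(w for p in paragraphs for w in p.split())

# Number of paragraphs in which each word appears at least once.
paragraph_counts = Counter(w for p in paragraphs for w in set(p.split()))

print(counts.most_common(3))
print(paragraph_counts["the"], paragraph_counts["tree"])
```

The word "the" comes out on top by a long way, even though it tells us nothing about cats or dogs.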

You may even disagree with the theory ... :)


Instead of throwing a large number of methods willy-nilly at the reader, we'll try to be more disciplined and transparent about what model, and which theory a particular algorithm is intended for.

Wednesday, 8 June 2016

Hello World!

This is the blog that will follow the progress of Make Your Own Data Mining Toolkit.

Just like Make Your Own Neural Network and Make Your Own Mandelbrot, the aim is to take a very gentle journey through the ideas and mathematics, to make this very cool topic as accessible as possible.

We'll cover simple ideas like indexing for search, then maybe onto things like sentiment analysis and clustering, ... and we'll get to really powerful ideas such as searching for related information and even learning from text.

We're going to have fun! ...