Poor Data Noise Overwhelms Our Algorithms

We found that using our toolkit to mine the text ran into severe problems, because many of the original documents are of poor readability (handwritten notes, degraded copies of copies, etc). The OCR process that turns scans of these documents into plain text is also far from perfect. That means the text we get from the published PDFs contains a huge amount of noise. That is bad for text mining - at best it adds low-value data to the set, and at worst it overwhelms the algorithms which try to separate out meaningful words. The sheer volume of noise also overwhelmed the memory limits of my laptop.
We tried a few drastic things, like removing major contributors to the data set such as the Home Office (HO*) or South Yorkshire Police (SYP*), in an attempt to reduce the memory load. That's unfortunate, because these two make up much of the data set - about 3/4 of the files by number. Even that wasn't enough, so we ended up removing even more contributors...
We also tried brutal things like removing words which have 3 or more consecutive identical letters, in an attempt to cut the noise. That wasn't enough. We then tried removing words with 2 or more such consecutive letters... which removes valid English words too, and yet junk words like sflcpg, ctwvi and cyujw still remained.
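The repeated-letter filter described above can be sketched with a small regular expression - this is an illustrative reconstruction, not the toolkit's exact code:

```python
import re

def has_repeated_run(word, run_length=3):
    """True if the word contains run_length identical letters in a row."""
    # (.) captures a letter, \1{n-1} requires n-1 more copies of it
    return re.search(r"(.)\1{" + str(run_length - 1) + r"}", word) is not None

tokens = ["fine", "sflcpg", "committee", "aaaargh", "balloon"]
kept = [w for w in tokens if not has_repeated_run(w, run_length=3)]
print(kept)  # ['fine', 'sflcpg', 'committee', 'balloon']
```

Note that "aaaargh" is removed, but the junk word "sflcpg" survives because it has no repeated letters at all - which is exactly why this approach was not enough.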
A New Approach: An English Word List

A new approach is to directly address the problem of noise and junk from the poor-quality initial dataset: only include words that are actually English words, and filter everything else out.
This feels a little like defeat, as we didn't want to include manual steps like stop word lists in our pipeline, but in the face of such overwhelming noise it seems a reasonable thing to do.
There are a few sources of lists of English words; in fact most unix/linux and macOS systems have a system /usr/share/dict/words file. That is a good start, and serves most purposes. These days, some of the open source spell check tools include quite sophisticated dictionaries, with variants of words, and even proper names (like Stephen) and popular place names (like Sheffield).
Our approach uses the aspell spell checker to create a word list at http://app.aspell.net/create. The words are then processed with some additional steps:
- remove all 's from the end of any words that have it
- lowercase everything
- remove duplicates
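The three post-processing steps above can be sketched as follows - a minimal illustration, assuming the aspell output has been read into a list of words:

```python
def clean_word_list(raw_words):
    """Strip trailing 's, lowercase, and de-duplicate a word list."""
    cleaned = set()                      # a set removes duplicates
    for word in raw_words:
        word = word.strip()
        if word.endswith("'s"):          # remove trailing possessive 's
            word = word[:-2]
        cleaned.add(word.lower())        # lowercase everything
    return sorted(cleaned)

raw = ["Sheffield", "Sheffield's", "police", "Police", "ground"]
print(clean_word_list(raw))   # ['ground', 'police', 'sheffield']
```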
You can find this word list on the project's github page at
The resulting word relevance data frame, without any organisations excluded, is 6.8GB. That's a massive reduction from the 13+GB we had previously, even when major chunks were cut out.
Let's see what this cleaned up data looks like through the previous tools.
Occurrence - Simple Word Counts

Taking the top 10 most frequently occurring words (without normalisation) gives us:
That's the same as before, when we didn't filter by dictionary words, because all of the previous top 10 were dictionary words anyway. Again the main themes come through - police, ground, football, number... and a sense of regretful would and should regarding hindsight.
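A simple occurrence count like this can be sketched with Python's Counter - illustrative only, assuming tokens holds the dictionary-filtered words from the corpus:

```python
from collections import Counter

# toy stand-in for the full corpus token stream
tokens = ["police", "ground", "would", "police", "football",
          "number", "police", "should", "ground"]

# top 10 most frequently occurring words, without normalisation
top = Counter(tokens).most_common(10)
print(top)
```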
The word cloud is as follows - click to enlarge. It's very similar to our previous one.
Word Relevance (TF-IDF)

Calculating word relevance to reduce the prominence of meaningless words, we get the following top 20 most relevant words:
This list is different to the one we arrived at before. It should be more representative, as we have calculated relevance over all the documents, not just a subset as we did before to try to fit within memory.
We see very similar themes - but also some new ones. The second most relevant word is raised, which suggests a key action or verb in the document set - perhaps raised a complaint or concern. Again we see the, perhaps regretful in hindsight, would - what should and would have happened. The puzzling word is indexer - we need to look at the context in which indexer appears.
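For readers unfamiliar with TF-IDF, here is a hand-rolled sketch of the relevance calculation on a toy corpus; the toolkit's actual implementation may differ in weighting details:

```python
import math
from collections import Counter

docs = [
    ["police", "ground", "football"],
    ["police", "would", "should"],
    ["raised", "concern", "police"],
]

def tfidf_scores(docs):
    """Corpus-wide TF-IDF relevance score per word."""
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for word, count in tf.items():
            # term frequency x inverse document frequency
            scores[word] += (count / len(doc)) * math.log(len(docs) / df[word])
    return scores

scores = tfidf_scores(docs)
# "police" appears in every document, so its idf - and relevance - is zero
```

This is exactly why relevance weighting demotes ubiquitous low-information words that a raw occurrence count ranks highly.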
Let's look at the word cloud made from word relevance - click to enlarge.
Co-Occurrence

We can now look again at word co-occurrence, and this time the entire data set is in scope, not just a subset. Here are the top 20 pairs.
word1 word2 weight
0 police police 1.000000
1 there there 0.673316
2 ground ground 0.650661
3 south yorkshire 0.595835
4 would would 0.568905
5 should should 0.442588
6 material police 0.422234
7 police officers 0.418765
8 action action 0.403173
9 which which 0.402548
10 south police 0.400237
11 stand stand 0.397440
12 ground police 0.394455
13 number number 0.372084
14 there people 0.372017
15 people people 0.369749
16 yorkshire police 0.360681
17 police officer 0.352889
18 material material 0.352867
19 ground there 0.352135
This isn't as informative as the visualisation as a graph of connected words... but again we see the themes of should/would and a new theme around the word material, which suggests we should explore its context in the corpus.
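The pair weights in the table above can be sketched as counts of word pairs within each document, normalised so the top pair scores 1.0 - an illustrative reconstruction, as the toolkit may use a different window or weighting:

```python
from collections import Counter
from itertools import combinations

# toy stand-in documents
docs = [
    ["police", "ground", "police"],
    ["south", "yorkshire", "police"],
    ["police", "would", "should"],
]

pair_counts = Counter()
for doc in docs:
    # count every unordered pair of words within the document
    for w1, w2 in combinations(sorted(doc), 2):
        pair_counts[(w1, w2)] += 1

# normalise so the most frequent pair has weight 1.0
top_weight = max(pair_counts.values())
weights = {pair: n / top_weight for pair, n in pair_counts.items()}
```

Note that repeated words within one document produce self-pairs like (police, police), which is why such pairs appear in the table.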
Here's the co-occurrence graph - click to enlarge.
We can see that:
- the police are very central to much of the discussion
- the expected themes are prominent - grounds, ambulance, injury, witness, Sheffield, ...
- unexpected themes emerge - material, telephone, indexer, signature - suggesting avenues for further investigation
Filtering out lower co-occurrence scores gives us the most prominent themes:
We might look at these visualisations and not be that impressed. But if we didn't know what Hillsborough was about, and didn't already know the themes of the documents, these kinds of analysis would be really helpful.
Topic Extraction

We can try topic extraction again, this time over the entire dataset. Previously we were limited to a small subset (Home Office documents only), which limits the usefulness of topic extraction because of the lesser diversity of topics in a narrower dataset.
Before we dive in, it is useful to check the distribution of singular values from the SVD to see if any significant topics emerged. The following shows the overall view - we can see a strong peak as well as a very long tail.
Zooming in shows four really strong topics, but the overall view shows that the next few are still significant compared to the full set.
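This kind of check can be sketched with numpy - here a small random matrix stands in for the real word-document relevance matrix:

```python
import numpy as np

# toy stand-in for the (much larger) TF-IDF word-document matrix
rng = np.random.default_rng(42)
matrix = rng.random((50, 20))

# singular values come back sorted in descending order
singular_values = np.linalg.svd(matrix, compute_uv=False)

# a strong peak followed by a long tail suggests a few dominant topics
print(singular_values[:5])
```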
Here are the top 15 topics:
[top-weighted words for each of topics 0 to 14 - pandas Series output]
Let's look at these topics:
- topic 0 - seems to be about instructions issued to the police, their recipients, the actions, and statements about those actions
- topic 1 - seems to be about subsequent scrutiny or narrative from a more London-centric perspective
- topic 2 - is more about the supporters and the stands and their relation to the police
- topic 3 - is much more about the subsequent inquiry by Justice Taylor into the ground and stands
- topic 4 - is very much about the legal aspects - attorney, chambers, inquiry, tract, ...
- topic 6 - is more about witnesses, signed and signatures, visitors
Overall these topics don't appear to be as distinct as those extracted from the Iraq Report or the test mixed set. This is because the data is poorer in quality, and because the dominating themes of the Hillsborough dataset are very similar.
Further Investigation - Material, Indexer

The above has prompted us to look further at the context of the words material and indexer, identified above as significant.
We can even use our own search engine to find the best-matching documents.
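The idea can be sketched as ranking documents by their relevance score for a query word - the document names and scores here are entirely hypothetical, and the project's real search engine is more sophisticated:

```python
# hypothetical per-document relevance scores for a few words
doc_scores = {
    "witness_statement_a.txt": {"material": 0.91, "police": 0.40},
    "briefing_b.txt":          {"material": 0.12, "police": 0.55},
    "log_c.txt":               {"material": 0.48, "police": 0.60},
}

def search(word, doc_scores, top_n=3):
    """Return document names ranked by relevance score for `word`."""
    ranked = sorted(doc_scores.items(),
                    key=lambda item: item[1].get(word, 0.0),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]

print(search("material", doc_scores))
```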
Looking at the top few results, we see that the word material is used in witness statements, which explains why it is so prominent. The top document shows an example of material (CCTV video) being submitted:
Searching for indexer and looking at the results tells us the word is just part of a common form. The same applies to the word signature... as might now be expected!
Iterative Text Mining Process
A good text mining process is iterative: applying what we have learned, we would re-run the analysis with these words added to the stop list, and also re-introduce the word Hillsborough, as it is not in the English dictionary we used.
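That iteration step can be sketched as two small adjustments to the filters - the word sets here are illustrative subsets, not the real lists:

```python
# extend the stop list with words we now know are form boilerplate
stop_words = {"the", "and", "of"}            # illustrative existing stop list
stop_words |= {"indexer", "signature", "material"}

# re-introduce Hillsborough, which the English word list lacks
dictionary = {"police", "ground", "would"}   # illustrative dictionary subset
dictionary.add("hillsborough")

tokens = ["hillsborough", "indexer", "police", "signature"]
kept = [w for w in tokens if w in dictionary and w not in stop_words]
print(kept)   # ['hillsborough', 'police']
```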