Wednesday, 10 May 2017

Extracting Topics with SVD at PyData London 2017

I was lucky enough to do a talk at PyData London 2017 on topic extraction with SVD.

The talk was aimed at novices and tried to gently introduce the idea, its value, and also an intuition about how the SVD method works.

Here are the slides (or click the image):

And here's the video from the conference:


Thursday, 20 April 2017

Exploring the Hillsborough Disclosure - Part 2/2

In the last post we started to explore the Hillsborough Independent Panel's Disclosure of evidence, which we had prepared from the not-so-open data.

Poor Data Noise Overwhelms Our Algorithms

We found that using our toolkit for mining the text ran into some severe problems because many of the original documents were of poor readability (handwritten notes, degraded copies of copies, etc). The OCR process that turns scans of these documents into plain text is also far from perfect. That means the text we get from the published PDFs contains lots and lots of noise. That is bad for text mining - at best, it adds low-value data to the set, and at worst, it overwhelms the algorithms which try to separate out meaningful words. The size of the noise also overwhelmed the memory limits of my laptop.

We tried a few drastic things like chopping out major contributors to the data set, like the Home Office (HOM*) or South Yorkshire Police (SYP*), in an attempt to reduce the memory load. That's bad, because these two make up much of the data set - about 3/4 of the files by number. It didn't help enough, so we ended up removing even more contributors...

We also tried brutal things like removing words which have 3 or more consecutive identical letters, in an attempt to remove the noise. That wasn't enough. We then tried removing words with 2 or more such consecutive letters .. which removes valid English words too, and yet junk words like sflcpg, ctwvi, cyujw still remained.

A New Approach: An English Word List

A new approach is to directly address the problem of noise and junk from the poor initial dataset. This approach is to only include words that are actually English words, and filter everything else out.

This feels a little like defeat as we didn't want to include manual steps like "stop words" in our pipeline, but in the face of such overwhelming noise, it seems a reasonable thing to do.

There are a few sources of lists of English words, and in fact most unix/linux and macOS systems have a system /usr/share/dict/words file. That is a good start, and serves most purposes. These days, some of the open source spell check technologies include quite sophisticated dictionaries which include variants of words too, and even proper names (like Stephen) and popular place names (like Sheffield).
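Filtering against such a list is straightforward once it is loaded into a set; a minimal sketch (the path below is the usual system location, which varies between systems - the toy dictionary stands in for the real file):

```python
def load_dictionary(path='/usr/share/dict/words'):
    # read the system word list into a set for fast membership tests
    with open(path) as f:
        return set(line.strip().lower() for line in f)

def keep_english_words(words, dictionary):
    # drop anything that isn't a known dictionary word
    return [w for w in words if w in dictionary]

# toy dictionary standing in for the real system file
dictionary = {"police", "ground", "football"}
english = keep_english_words(["police", "sflcpg", "ground", "ctwvi"], dictionary)
# → ['police', 'ground']
```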

Our approach uses the aspell spell checker to create a word list. The words are then processed with additional steps:

  • remove all 's from the end of any words that have it
  • lowercase everything
  • remove duplicates
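The three steps above can be sketched as follows (the input here is a toy list standing in for the aspell output):

```python
def clean_word_list(raw_words):
    # strip a trailing 's, lowercase everything, and de-duplicate
    cleaned = set()
    for word in raw_words:
        if word.endswith("'s"):
            word = word[:-2]
        cleaned.add(word.lower())
    return sorted(cleaned)

words = clean_word_list(["Sheffield", "police's", "Police", "ground"])
# → ['ground', 'police', 'sheffield']
```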

You can find this word list on the project's github page.

The resultant word relevance data frame, without any organisations excluded, is 6.8Gb. That's a massive reduction from the 13+Gb we saw previously, even when major chunks were cut out.

Let's see what this cleaned up data looks like through the previous tools.

Occurrence - Simple Word Counts

Taking the top 10 most occurring words (without normalisation) gives us:

police 358160
there 251553
ground 243109
which 220052
other 188236
would 180488
should 136455
football 135171
people 134446
number 133143

That's the same as before, when we didn't filter by dictionary words, because all of the top 10 words are dictionary words anyway. Again the main themes come through - police, ground, football, number .. and a sense of regretful would, should regarding hindsight.

The word cloud is as follows - click to enlarge. It's very similar to our previous one.

Word Relevance (TF-IDF)

Calculating the relevance to reduce the prominence of meaningless words, we get the following top 20 most relevant words:

police        83.451991
raised        79.235280
ground        66.290401
actions       65.596721
document      63.932343
action        61.090630
material      57.589417
number        51.046630
indexer       50.493850
instructions  48.442706
statement     47.873629
other         46.347452
football      46.155356
indicated     44.902818
people        43.851017
there         42.416433
would         41.164988
sheffield     40.631458
supporters    40.336179
stand         40.262084

This list is different to the one we arrived at before. It should be more representative because we have calculated relevance over all the documents, not just a subset like we did before to try to fit within memory.

We see very similar themes - but also some new ones. The second most relevant word is raised - which suggests a key action or verb in the document set, perhaps raised a complaint or concern. Again we see the, perhaps regretful in hindsight, would - what should and would have happened. The puzzling word is indexer - we need to look at the context of indexer.

Let's look at the word cloud made from word relevance - click to enlarge.


Word Co-occurrence

We can now look again at word co-occurrence, and this time the entire data set is in scope, not just a subset. Here are the top 20 pairs.

        word1 word2 weight
0 police police 1.000000
1 there there 0.673316
2 ground ground 0.650661
3 south yorkshire 0.595835
4 would would 0.568905
5 should should 0.442588
6 material police 0.422234
7 police officers 0.418765
8 action action 0.403173
9 which which 0.402548
10 south police 0.400237
11 stand stand 0.397440
12 ground police 0.394455
13 number number 0.372084
14 there people 0.372017
15 people people 0.369749
16 yorkshire police 0.360681
17 police officer 0.352889
18 material material 0.352867
19 ground there 0.352135

This isn't as informative as the visualisation as a graph of connected words... but again we see the themes of should/would and a new theme around the word material, which suggests we should explore its context in the corpus.
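For reference, co-occurrence here just means counting how often two words appear near each other; a minimal sketch using a sliding window (the window size of 2 is an illustrative choice, not necessarily what the toolkit uses):

```python
from collections import Counter

def cooccurrence_counts(words, window=2):
    # count unordered word pairs that appear within `window` words of each other
    pairs = Counter()
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:i + 1 + window]:
            pairs[tuple(sorted((w1, w2)))] += 1
    return pairs

counts = cooccurrence_counts(["south", "yorkshire", "police", "ground"])
# ('south', 'yorkshire') and ('police', 'yorkshire') each co-occur once
```

The raw counts are then normalised so the largest pair scores 1.0, as in the table above.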

Here's the co-occurrence graph - click to enlarge.

We can see that:

  • the police are very central to much of the discussion
  • the expected themes are prominent - grounds, ambulance, injury, witness, Sheffield, ...
  • unexpected themes emerge - material, telephone, indexer, signature - suggesting avenues for further investigation

Filtering out lower co-occurrence scores gives us the most prominent themes:

We're looking at these visualisations and perhaps not being that impressed. But if we didn't already know what Hillsborough was about, or the themes of the documents, these kinds of analysis would be really helpful.

Topic Extraction

We can try topic extraction again, this time over the entire dataset. Previously we were limited to a small subset (Home Office documents only), which limits the usefulness of topic extraction because of the lesser diversity of topics in a narrower dataset.

Before we dive in, it is useful to check the distribution of eigenvalues from the SVD decomposition to see if any significant topics did emerge. The following shows the overall view - and we can see a strong peak as well as a very long tail.

Zooming in shows four really strong topics, but the above shows that the next few are still significant compared to the full set.

Here are the top 15 topics:

 topic # 0
raised          1.646954
actions         1.344083
document        1.159934
indexer         0.970770
instructions    0.953516
action          0.946094
indicated       0.901837
number          0.591718
statement       0.580177
receivers       0.508224
Name: 0, dtype: float64 

 topic # 1
reference     3.179260
extension     0.137096
telephone     0.114481
memorandum    0.113648
raised        0.110221
london        0.102715
scrutiny      0.097901
actions       0.089925
secretary     0.084932
queen         0.071101
Name: 1, dtype: float64 

 topic # 2
police        0.677746
ground        0.550280
material      0.545654
raised        0.396858
stand         0.350034
people        0.342326
south         0.333551
actions       0.311439
supporters    0.310529
there         0.298958
Name: 2, dtype: float64 

 topic # 3
secretary    0.572476
london       0.459885
justice      0.365538
material     0.355462
taylor       0.308684
ground       0.307226
inquiry      0.292526
disaster     0.246981
stand        0.236582
queen        0.227525
Name: 3, dtype: float64 

 topic # 4
chambers     1.377323
castle       0.226434
tract        0.062330
secretary    0.051054
attorney     0.030952
justice      0.028493
general      0.024988
taylor       0.024985
inquiry      0.024941
street       0.021848
Name: 4, dtype: float64 

 topic # 5
material     0.363232
inquiry      0.338313
property     0.252636
number       0.237683
officers     0.207103
people       0.204529
briefly      0.195419
message      0.194922
secretary    0.182719
justice      0.182356
Name: 5, dtype: float64 

 topic # 6
statement       0.307106
property        0.281948
signed          0.280987
visitors        0.274388
people          0.255564
witness         0.253254
signature       0.227993
midlands        0.203407
continuation    0.202329
court           0.199706
Name: 6, dtype: float64 

 topic # 7
visitors     1.186298
business     0.127914
property     0.100223
statement    0.078688
number       0.069357
witness      0.062938
signed       0.062578
exhibit      0.054888
court        0.053854
photo        0.052688
Name: 7, dtype: float64 

 topic # 8
property     0.295102
telephone    0.265546
yorkshire    0.262619
police       0.249554
subject      0.246568
south        0.221909
number       0.219346
inquiry      0.195477
halley       0.181322
constable    0.176241
Name: 8, dtype: float64 

 topic # 9
halley       1.168691
property     0.046086
telephone    0.041354
yorkshire    0.041154
subject      0.038612
police       0.038588
south        0.034380
number       0.034135
inquiry      0.031168
constable    0.027442
Name: 9, dtype: float64 

 topic # 10
secretary    0.380099
private      0.255705
message      0.247981
telephone    0.213645
action       0.209681
london       0.197113
material     0.163540
number       0.153276
premises     0.145124
street       0.144161
Name: 10, dtype: float64 

 topic # 11
secretary    0.462739
private      0.388733
telephone    0.303815
message      0.264320
sheffield    0.207366
inquiry      0.181807
signed       0.165700
street       0.162892
costs        0.143109
midlands     0.134161
Name: 11, dtype: float64 

 topic # 12
hammond      1.048282
commences    0.217295
secretary    0.062752
private      0.038406
london       0.031946
evidence     0.030058
midlands     0.027553
support      0.027198
signed       0.024466
inquiry      0.023950
Name: 12, dtype: float64 

 topic # 13
message      0.250191
subject      0.236295
street       0.224257
london       0.219849
people       0.205558
downing      0.199686
action       0.195125
signed       0.181644
telephone    0.164518
report       0.133955
Name: 13, dtype: float64 

 topic # 14
arrive       1.033862
secretary    0.092545
private      0.087563
subject      0.053544
sheffield    0.050590
justice      0.042844
london       0.042612
message      0.038389
general      0.037748
attorney     0.037600
Name: 14, dtype: float64 

Let's look at these topics:

  • topic 0 - seems to be about instructions issued to the police, their receivers, the actions, and statements about those actions
  • topic 1 - seems to be about subsequent scrutiny or narrative from a more London-centric perspective
  • topic 2 - is more about the supporters and the stands and their relation to the police
  • topic 3 - is much more about the subsequent inquiry by Justice Taylor into the ground and stands
  • topic 4 - is very much about the legal aspects - attorney, chambers, inquiry, tract
  • topic 6 - is more about witnesses, signed and signatures, visitors
  • ...

Overall these topics don't appear to be as distinct as those extracted from the Iraq Report or the test mixed set. This is because the data is poorer in quality, and because the dominating themes of the Hillsborough dataset are very similar.

Further Investigation - Material, Indexer

The above has prompted us to look further at the context of the following words, identified above as significant:

  • material
  • indexer
  • signature

We can even use our own search engine to find the most matching documents. 
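A search engine over this corpus can be as simple as ranking documents by the relevance score of the query word; a minimal sketch (not the toolkit's actual implementation):

```python
def search(query_word, relevance_scores):
    # relevance_scores: one {word: tf-idf score} dict per document
    # return document indices ranked by the query word's score, best first
    scored = [(doc_id, scores.get(query_word, 0.0))
              for doc_id, scores in enumerate(relevance_scores)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, score in scored if score > 0]

docs = [{"material": 0.9, "police": 0.1}, {"police": 0.5}, {"material": 0.2}]
results = search("material", docs)
# → [0, 2]
```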

Looking at the top few results, we see that the word material is used on witness statements, which explains why it is prominent. The top document shows an example of material (CCTV video) being submitted:

Searching for indexer, and looking at the results tells us the word is just part of a common form. The same applies to the word signature .. as might now be expected!

Iterative Text Mining Process

A good text mining process is iterative: learning from what we have, we would re-run the analysis, exclude these words via the stop list, and also re-introduce the word Hillsborough as it is not in the English dictionary we used.

Monday, 17 April 2017

Exploring the Hillsborough Disclosure - Part 1/2

In the last post we talked about extracting the raw text data from the PDFs made public by the Hillsborough Independent Panel as part of its collation and review of evidence about the disaster from 89 organisations.

The Data

That data is fairly chunky:

  • 19,217 text files
  • a total of 874,408K in size .. or 853Mb
  • an average size of 45.5K

Here's a breakdown of the documents from each organisation. You can see that most of the documents came from the South Yorkshire Police (SYP), the Home Office (HOM) and the Department for Culture, Media and Sport (CMS).

Organisation Count
SYP 10078
HOM 3816
CMS 1073
FFA 413
CPS 409
SYC 345
SPP 267
YAS 259
LCS 229
COO 200
AGO 191
Grand Total 19216

The following chart makes it easier to understand these numbers - click to enlarge.

Here we'll explore it with some of the tools we've developed.

Word Cloud - Simple Word Counts

A simple, and early, tool we developed was the word cloud, showing the most occurring words, to get an initial feel for what a text data set is about.

Before we create a word cloud chart, we need to clean the data. Here are the steps we're already familiar with:

  • simplify whitespace
  • keep only alphanumeric characters
  • lowercase
  • split into words
  • remove stop-words from a manual list
  • keep only words of minimum length 5 (assumes longer words will be more interesting)
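The steps above can be sketched as a single function (the stop-word list here is a tiny illustrative subset of the real manual list):

```python
import re

STOP_WORDS = {"the", "and", "that", "with"}  # illustrative subset only

def clean_text(text, min_length=5):
    text = re.sub(r'\s+', ' ', text)            # simplify whitespace
    text = re.sub(r'[^a-zA-Z0-9 ]+', '', text)  # keep only alphanumeric characters
    words = text.lower().split()                # lowercase, split into words
    return [w for w in words
            if w not in STOP_WORDS and len(w) >= min_length]

cleaned = clean_text("The  POLICE, and the ground...")
# → ['police', 'ground']
```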
The top 10 most occurring words are:

police 358160
there 251553
ground 243109
which 220052
other 188236
would 180488
should 136455
football 135171
people 134446
number 133143

There are words in there that aren't that informative. Remember why we moved instead to a measure of interesting-ness (TF-IDF) to ensure boring words aren't so prominent. Despite this, the list does give us a feel for the text corpus - police, ground, football, number .. all relevant themes. Here's the word cloud.

Again - there are informative words in there, which we know are relevant to the history and events of the disaster - police, stand, south, crowd, action, turnstiles, evidence, witness, supporters, pitch.

The word cloud is often derided - but is very simple and very effective.

Word Cloud - Relevance (TF-IDF)

Working out the relevance to reduce the effect of boring words blows up the memory of my computer - the dataset is too large - I'll need to improve the code in future, or shift to a non-memory based system like Python Dask.
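For reference, the relevance score used here is the standard TF-IDF idea: a word's frequency within a document, discounted by how many documents contain it. In miniature (a simplified weighting - the toolkit's exact formula may differ):

```python
import math
from collections import Counter

def tf_idf(documents):
    # documents: list of word lists; returns one {word: score} dict per document
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # how many documents contain each word
    n_docs = len(documents)
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
                       for w, c in counts.items()})
    return scores

docs = [["police", "ground", "police"], ["ground", "safety"]]
relevance = tf_idf(docs)
# "ground" appears in every document, so it scores zero everywhere;
# "police" is specific to the first document and scores highest there
```

Note that every document gets a full dict of float scores - which is exactly why this structure grows so large over thousands of documents.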

So for now, we'll chop the dataset up by selecting only documents that come from a specific organisation. This is easy because the text files have a prefix which identifies their origin. The Home Office files are prefixed with a HOM .. like HOM000049500001.txt, for example.

The selection is done using the corpus reader as follows:

cr = tmt.corpus_reader.CorpusReader(content_directory="data_sets/hillsborough/txt/", text_filename_pattern="HOM*.txt")

Here's the top 20 list of most relevant words:

police        13.652282
inquiry       12.031492
secretary     11.218956
justice       10.757810
football      9.901103
letter        9.803244
yorkshire     9.501367
evidence      9.079484
hillsborough  8.938162
london        8.917985
taylor        8.695437
would         8.640892
disaster      8.519620
scrutiny      8.362206
reference     8.135450
there         7.996140
authority     7.765933
should        7.601597
south         7.530626
report        7.525849

That's a much much better list of most relevant terms.

Let's visualise the word cloud of relevant terms - click to enlarge.

The words included here are much more relevant and we can judge this because it is a subject we're fairly familiar with. The police are a major theme, for instance, and for good reason.


Word Co-occurrence

Again focussing on the HOM subset (because the entire set breaks my computer's memory) we can apply our co-occurrence tool.

Here are the top 20 most co-occurring words:

word1 word2 weight
0 there there 1.000000
1 would would 0.974656
2 police police 0.866106
3 south yorkshire 0.681754
4 should should 0.671855
5 football football 0.666573
6 which which 0.566681
7 justice taylor 0.520375
8 police officers 0.503252
9 there which 0.494395
10 there people 0.492280
11 there would 0.490765
12 ground ground 0.486991
13 which would 0.468009
14 would there 0.465085
15 yorkshire police 0.445687
16 people people 0.439034
17 justice stuartsmith 0.432644
18 people there 0.425958
19 which there 0.418774

That contains word pairs where words are both the same .. I need to fix that! But also highlighted are pairs which are very informative about the data set.

It is interesting that there is a lot of use of the conditional future ... there would .. should ... which would. This suggests the material is discussing what should have happened, after the fact, in an almost apologetic way.

The graph of linked nodes representing co-occurring words should be interesting:

So what can we see here? We can see that:

  • The word police is at the centre of many relationships - so the police are a very pertinent and relevant theme of the evidence. This is in fact true of the disaster, where many of the inquiries have been into the role of the police. That's a powerful revelation by the chart, if we didn't know this before.
  • locations are important too .. Liverpool, Midlands, Sheffield, Yorkshire.
  • Again the word would and should are central, reflecting the regretful view of hindsight.

Let's take only the most co-occurring words, with normalised scores of over 0.2.

Colours have been added to the groupings to make them clearer. We can see some themes already:

  • Ground sports safety
  • Lord Justice Stuart-Smith and Justice Taylor inquiries and reports
  • Chief constable
  • Police control, authority and evidence.

We should really apply this to other organisations of the data set .. but first let's crack on with the other analyses and come back later.

Document Similarity

We'll skip over the document similarity for now because we're only looking at the Home Office documents. If we were doing a broader analysis across different organisations that would be much more interesting for a document similarity map.

Topic Extraction

The latest tool we developed was the extraction of topics using singular value decomposition (SVD). It worked rather well for the Iraq Report. Let's see how it does for the Home Office Hillsborough documents.
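The core of the method, stripped of the toolkit's plumbing, is a singular value decomposition of a word-document matrix; a toy sketch with numpy:

```python
import numpy as np

# toy word-document matrix: rows are words, columns are documents
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# U maps words to topics, S holds the topic strengths (singular values,
# i.e. the "eigenvalues" discussed below), Vt maps topics to documents
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# S comes out sorted largest-first, so S[0] is the most significant topic,
# and the top words of topic 0 are the largest-magnitude entries of U[:, 0]
top_words_topic0 = np.argsort(-np.abs(U[:, 0]))
```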

The first thing to check when extracting topics is the distribution of SVD eigenvalues:

Ok, there are a lot of eigenvalues here! Luckily the first few seem to be much more significant than the long tail. Let's zoom into the first few:

That's better. The first two eigenvalues are much larger than the rest. The next 2 are also significant. The next dozen or so are worth looking at, but beyond that we're into the long tail.

Let's see what the top 10 topics actually are:

 topic # 0
inquiry         0.238646
police          0.233339
secretary       0.228070
justice         0.213491
letter          0.200483
scrutiny        0.198494
yorkshire       0.195255
london          0.186311
reference       0.180697
hillsborough    0.175345
Name: 0, dtype: float64 

 topic # 1
rfctr6fltyj        9.754874e-01
statement          9.668067e-17
stand              9.368396e-17
report             7.629219e-17
people             6.985798e-17
private            6.167880e-17
ground             5.815512e-17
recommendations    5.815209e-17
recommendation     5.489136e-17
yorkshire          5.228198e-17
Name: 1, dtype: float64 

 topic # 2
chevf             7.613946e-01
superintendent    1.157154e-16
rover             1.024354e-16
leppings          8.931737e-17
aoorv             7.558747e-17
dspys             7.558747e-17
cuxjo             7.558747e-17
chapman           6.959515e-17
football          6.090088e-17
trapped           5.805032e-17
Name: 2, dtype: float64 

 topic # 3
cecic         7.613946e-01
submission    1.025629e-16
psl4310       8.894763e-17
heard         8.755130e-17
tickets       8.643075e-17
dated         7.646144e-17
april         7.373294e-17
tragedy       7.347410e-17
early         7.005352e-17
thank         6.976673e-17
Name: 3, dtype: float64 

 topic # 4
reference      0.268700
scrutiny       0.239448
midlands       0.193232
authority      0.158826
police         0.158758
costs          0.146980
yorkshire      0.136030
london         0.127953
stuartsmith    0.124732
south          0.113730
Name: 4, dtype: float64 

 topic # 5
reference    0.215533
midlands     0.204680
yorkshire    0.185533
costs        0.145715
south        0.136164
football     0.130660
authority    0.125776
scrutiny     0.106040
ground       0.101162
safety       0.084994
Name: 5, dtype: float64 

 topic # 6
ikwmmiwr      4.607135e-01
jmnmiir       4.425598e-01
liverpool     1.023169e-16
semifinal     7.846477e-17
manchester    7.286901e-17
paragraph     7.279151e-17
meeting       7.272110e-17
taylor        7.171484e-17
authority     7.121209e-17
submission    6.293662e-17
Name: 6, dtype: float64 

 topic # 7
reference    0.212675
inquiry      0.158190
secretary    0.134060
whalley      0.127025
scrutiny     0.116281
private      0.100031
london       0.094764
ground       0.092146
people       0.088364
letter       0.086487
Name: 7, dtype: float64 

 topic # 8
18aug1989       4.732789e-01
ifcrl           3.806973e-01
provide         3.862433e-17
stuartsmiths    3.757920e-17
states          3.699572e-17
bodies          3.634809e-17
central         3.534114e-17
community       3.329393e-17
lloyd           3.059613e-17
constabulary    3.041726e-17
Name: 8, dtype: float64 

 topic # 9
reference      0.252258
private        0.118237
football       0.117194
evidence       0.111125
extension      0.102815
scrutiny       0.091190
safety         0.087857
stuartsmith    0.085739
telephone      0.084538
memorandum     0.075057
Name: 9, dtype: float64

Let's look at these topics:

  • topic 0 seems to be about the inquiry and scrutiny into the police, involving the secretary of state, seeking justice as a theme. That's a good topic to extract!
  • topic 1 seems to be related to safety recommendations about the stands and grounds after the disaster
  • topic 2 seems to be about the role of the chief superintendent, and that role in people being trapped, in relation to the Leppings Lane stand.
  • .. 

These topics are somewhat concrete, but some seem to be similar, varying by a relatively small factor. This is likely because the Home Office documents are probably all about a similar set of themes - and as such it is difficult to extract very different topics .. because they aren't there!

A cross-organisation analysis would more likely extract different topics, just like we saw with the Iraq Report.

Note also that the topic words are polluted by non-English words which are there because of the process of optical character recognition (OCR) that tries to convert, often badly formed, scanned images into text.

Reduced Corpus To Ease Memory Pressure

We were forced to take only the Home Office documents because the entire set, and indeed just the South Yorkshire Police documents, broke the memory limits of my laptop with 16GB RAM!

Let's try a broader exploration by including all the documents except the HOM and SYP sets. The easiest way to do this is to move all SYP*.txt and HOM*.txt files to a subdirectory, because Python's glob() doesn't support patterns that exclude files.
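An alternative to moving files about is to glob everything and filter the results ourselves, since fnmatch can test each filename against an exclusion pattern; a sketch:

```python
import fnmatch
import os

def exclude_patterns(filenames, patterns=("SYP*.txt", "HOM*.txt")):
    # keep only files whose basename matches none of the exclusion patterns
    return [f for f in filenames
            if not any(fnmatch.fnmatch(os.path.basename(f), p) for p in patterns)]

kept = exclude_patterns(["SYP0001.txt", "HOM0002.txt", "CMS0003.txt"])
# → ['CMS0003.txt']
```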

Trying that, the memory explodes again, so we exclude the CMS* files too.

The top 20 relevant words make sense:

police 11.081260
football 10.203385
sheffield 9.313793
hillsborough 9.238009
would 9.190118
liverpool 8.361010
there 8.161893
letter 8.041335
meeting 7.611749
report 7.499404
which 7.458836
ground 7.424201
evidence 7.025812
should 6.927943
telephone 6.765532
authority 6.662008
coroner 6.653248
disaster 6.570584
secretary 6.516341
committee 6.341970

Aside from the expected words, there is an interesting word in there: telephone. Maybe telephony was an important aspect of the events?

A word cloud of the relevant words is interesting too - click to enlarge:

We see another aspect coming through - safety, coroner, director, street...

Trying to extract topics again blows up the memory so we exclude more subsets, this time the FFA* and CPS* documents. But that didn't work either .. memory still blew up!

So let's have another look at the data. We notice there are lots and lots of junk words with repeated characters, like AAAAA and 00000 and zzzzzz. A good filter would be to remove words which have n or more repeated characters. So here it is, added to the word_processing module:

import re

# remove words with n or more consecutively repeated characters
def remove_words_with_n_repeated_chars(input_words_list, n):
    # re.search matches repeats anywhere in the string (re.match only matches from the start)
    # n repeats means one character followed by (n-1) backreferences
    regex = re.compile(r'(.)\1{' + str(n - 1) + r',}')
    output_text = [word for word in input_words_list if not regex.search(word)]
    return output_text

There aren't many (if any) English words with 3 or more consecutively repeated characters, so let's add this to the filter at the top of the pipeline and see if that helps. The filter is applied as follows:

# remove words which have a character consecutively repeated n=3 times or more
gl = tmt.word_processing.remove_words_with_n_repeated_chars(fl, 3)

That works to an extent, but not by much! The memory used by the word count index is reduced from 13.8Gb to 13.4Gb .. so not a huge change.

Looking again at the data we see lots of numeric-only words .. so let's create a filter that removes numeric characters (to be applied deliberately, as numbers can sometimes be useful). Here it is, very similar to the keep_only_alphanumeric() function we've used before.

# keep only alpha (not numeric) characters
def keep_only_alpha(input_text):
    regex = re.compile('[^a-zA-Z ]+')
    output_text = regex.sub('', input_text)
    return output_text

That seems to work a bit better. The memory consumption of the word index is now reduced from 13.8Gb to 11.9Gb. Still not a massive drop. Let's combine this with the removal of repeating characters. That reduces it to 11.6Gb.

Time to get drastic!

Looking again at the data .. we still see nonsense words .. like:

aabac                              0.0                  0.0   
aabalanceaan                       0.0                  0.0   
aabalanceaanhanan                  0.0                  0.0   
aabaw                              0.0                  0.0   
aability                           0.0                  0.0   
aabiscf                            0.0                  0.0   
aabit                              0.0                  0.0   
aabjjivujtzuy                      0.0                  0.0   
aablt                              0.0                  0.0   
aabout                             0.0                  0.0   
aabove                             0.0                  0.0   
aabroad                            0.0                  0.0   
aabrook                            0.0                  0.0   
aabtt                              0.0                  0.0   
aabulance                          0.0                  0.0   
aabulancenanwoman                  0.0                  0.0   
aabulances                         0.0                  0.0   
aabulanoe                          0.0                  0.0   
aabularce                          0.0                  0.0   
aabulonco                          0.0                  0.0   
aaburo                             0.0                  0.0   
aacabd                             0.0                  0.0   
aacacxoiza                         0.0                  0.0   
aaccmpaappll                       0.0                  0.0   
aaccommodation                     0.0                  0.0   
aaccord                            0.0                  0.0   
aacdi                              0.0                  0.0   
aacede                             0.0                  0.0   
aacent                             0.0                  0.0  

Let's get brutal and remove all words with 2 consecutive letters that are the same. This will remove valid English words .. but right now we can't process such huge data.

Removing these words leads to a reduction to 8.2Gb. The SVD calculation takes a while but succeeds without crashing .. that's a testament to the quality of Python and its open source libraries that they can crunch through an 8Gb data frame to do a matrix decomposition.

Here's the resultant top 10 topics:

topic # 0
aunder        1.173536
newcastle     0.000598
august        0.000438
laboratory    0.000366
research      0.000347
services      0.000306
stephenson    0.000297
reference     0.000295
building      0.000281
telephone     0.000278
Name: 0, dtype: float64 

 topic # 1
police       0.201300
authority    0.151340
coroner      0.150925
report       0.149464
would        0.146609
disaster     0.143472
secretary    0.134563
there        0.131142
evidence     0.129285
which        0.129257
Name: 1, dtype: float64 

 topic # 2
visitors     0.970682
seiolisia    0.116385
seiojisia    0.116385
coroner      0.002195
police       0.002041
inquest      0.001848
would        0.001843
report       0.001822
evidence     0.001820
authority    0.001810
Name: 2, dtype: float64 

 topic # 3
sleisielie    9.637464e-01
stage         1.463056e-16
arvypftea     1.444467e-16
csbcjfcjl     1.444467e-16
jfdacjci      1.406117e-16
fictnizt      1.406117e-16
ctfyjzy       1.350711e-16
dstaqef       1.350711e-16
tsrigjt       1.350711e-16
sctenjs       1.350711e-16
Name: 3, dtype: float64 

 topic # 4
lcvtuow       8.850934e-01
acoruanw      2.947940e-16
jfythli       3.623346e-17
sjhtzl        3.411774e-17
joflpw        2.758480e-17
uvztl         2.520039e-17
lavlfrvm      1.748524e-17
while         1.747938e-17
confidence    1.738987e-17
ywjorim       1.679626e-17
Name: 4, dtype: float64 

 topic # 5
uircrmn       8.850934e-01
barclays      2.745948e-16
salmon        1.013610e-16
government    9.558924e-17
received      8.447307e-17
hours         8.078021e-17
court         7.419536e-17
travel        7.372786e-17
league        7.275761e-17
merseyside    7.198510e-17
Name: 5, dtype: float64 

 topic # 6
skflfr      8.334114e-01
lavlfrvm    4.561791e-16
ywjorim     4.382041e-16
cvwfmrj     4.382041e-16
krktkt      4.126167e-16
utaujt      4.126167e-16
lcftj       3.769504e-16
taxyi       3.769504e-16
cyujw       3.769504e-16
lsksy       3.769504e-16
Name: 6, dtype: float64 

 topic # 7
yjicyv           8.334114e-01
brighton         8.086643e-17
meyskens         5.765278e-17
white            4.939777e-17
emphasise        4.678263e-17
anxious          3.861623e-17
belgium          3.771096e-17
important        3.679099e-17
tickets          3.405656e-17
misunderstand    3.161114e-17
Name: 7, dtype: float64 

 topic # 8
sflcpg          8.334114e-01
budget          4.750198e-17
domes           4.653915e-17
crush           4.478878e-17
event           4.470186e-17
compensation    4.360769e-17
detailed        4.334802e-17
brain           4.158248e-17
which           4.130543e-17
arguments       4.114549e-17
Name: 8, dtype: float64 

 topic # 9
swrvja       8.334114e-01
vhtwefva     8.298114e-17
tfifiq       7.505693e-17
emfaj        6.856907e-17
cikiv        6.856907e-17
judgement    5.835493e-17
schemes      5.201091e-17
behalf       5.043767e-17
roger        4.794902e-17
defence      4.680745e-17
Name: 9, dtype: float64 

 topic # 10
kaovi        7.613720e-01
contd        3.137082e-17
ctwvi        2.709474e-17
raymond      2.668085e-17
carter       2.554545e-17
otherham     2.488737e-17
indemnity    2.377327e-17
children     2.289848e-17
direction    2.255600e-17
traynor      2.220917e-17
Name: 10, dtype: float64 

 topic # 11
ajudo         7.613720e-01
opdwd         2.482052e-16
persons       1.818546e-16
authority     1.079457e-16
royal         1.074050e-16
index         1.062596e-16
photcgraph    1.046602e-16
qiaslr        1.045700e-16
jaijo         9.553127e-17
luodt         8.268289e-17
Name: 11, dtype: float64 

 topic # 12
downing      0.579697
ytufa        0.287647
inister      0.134422
tvihvsv      0.078043
vcoas        0.067134
oatcr        0.067134
stmotf       0.064715
arocl        0.059121
secretary    0.034207
ambulance    0.032477
Name: 12, dtype: float64 

 topic # 13
ambulance    0.219331
control      0.177408
hospital     0.149227
ground       0.143579
incident     0.112497
station      0.108420
vehicle      0.094678
patients     0.084698
there        0.083653
downing      0.081564
Name: 13, dtype: float64 

 topic # 14
coroner      0.234570
inquest      0.161868
resolved     0.126806
services     0.109531
working      0.099156
digitised    0.097361
evidence     0.092663
council      0.086339
party        0.085621
sincerely    0.083311
Name: 14, dtype: float64

We can see some topics that make sense, for example:

  • topic 0 - laboratory, research, services...
  • topic 1 - police, authority, disaster, evidence, ..
  • topic 2 - visitors, inquest, evidence, ..
  • topic 8 - budget, domes, crush, event, compensation, brain, ..
  • topic 13 - ambulance, control, hospital, ground, incident, vehicle, patients
  • topic 14 - coroner, inquest, .. digitised, evidence, council ...

The good news is these topics are more varied now that we're looking across a wider more varied set of documents.

The bad news is that data quality is still causing problems with the analysis .. with words like sflcpg, ctwvi, cyujw and so on dominating the data.

So the lesson here is that we need to spend much more time cleaning the data. We'll do that next time.

Lesson - Data Quality

We've done a good job managing a huge data set, and seen how our text mining toolkit works well.

The main lesson here is that data quality matters. The poor quality of the original documents and the limited effectiveness of OCR combined to produce a dataset dominated by mostly meaningless text.

We need to deal with that next time.