On 12 September 2012 it published its report and simultaneously launched a website containing 450,000 pages of material collated from 85 organisations and individuals over two years.
It would be fascinating to analyse that large and varied body of textual data, data that bears on a matter of great public importance.
This post will describe getting and pre-processing the data. Future posts will discuss exploring the data itself.
Machine V Human

Those 450,000 pages of material from 85 organisations, a huge amount of information, were processed and analysed manually. This has advantages and disadvantages.
Human readers, especially domain experts, can make sense of that information. But humans are not machines: we make errors, get tired, skip over key information, fail to make connections across a vast dataset, and miss patterns that might be there.
So this is an experiment in applying automated machine text-mining methods to complement the human experts, to see if anything interesting emerges.
Partially Open OpenData

First we need to get the data.
The website for the Hillsborough Independent Disclosure lets us navigate the data set because the material has been tagged with labels such as originating organisation, date and so on. However, there is no direct and easy way to get all the text data.
The UK government's opendata site for this disclosure also fails to make the text content easily available. This goes against the principles of open data, which urge organisations to publish data in a readily computable form.
The best that is available is a catalogue of all the content, in the form of a CSV or XML file.
CSV Catalogue

The Hillsborough Disclosure site has a page offering access to all the data (here), but what is available is a catalogue of all the content, in CSV or XML format.
CSV is easier to work with so we'll download that (link at the site, or direct link).
Here are the top few lines of that CSV catalogue - there's too much to show it all (19,217 rows of data).
"Unique ID","Title","Description","Contributor Reference","Start Date","End Date","Names of The Deceased","Persons Involved","Organisations Involved","Contributor","Folder Title","Sub-folder Title","Original Format","Format","Disclosure Status","Reason Not Disclosed On Site","Duplicate","Document Landing Page URL","Document URL","Copyright"
"AGO000000010001","Letter dated 15 April 1992 from Dr S J Crosby to the Attorney General concerning Mr and Mrs J S Williams","Letter from Dr S J Crosby to the Attorney General concerning Mr and Mrs J S Williams",,"1992-04-14",,,,,"Attorney General's Office","Attorney General's Office (AGO) files","General Hillsborough files. Including requests for new inquests and the correspondence around the establishment of the Stuart-Smith scrutiny.","Paper","Document",,,,"http://hillsborough.independent.gov.uk/repository/AGO000000010001.html","http://hillsborough.independent.gov.uk/repository/docs/AGO000000010001.pdf","Hillsborough Fair Use Licence"
"AGO000000020001","Letter dated 24 April 1992 from Donna Carlile sister of Paul Carlile Deceased, to the Attorney General: request for a new inquest","Letter from Donna Carlile to the Attorney General",,"1992-04-23",,"Carlile, Paul William",,"Police Complaints Authority","Attorney General's Office","Attorney General's Office (AGO) files","General Hillsborough files. Including requests for new inquests and the correspondence around the establishment of the Stuart-Smith scrutiny.","Paper","Document",,,,"http://hillsborough.independent.gov.uk/repository/AGO000000020001.html","http://hillsborough.independent.gov.uk/repository/docs/AGO000000020001.pdf","Hillsborough Fair Use Licence"
You can see above that the very first row contains the field headers for the actual data. For example, the first item on each row is a unique identifier. There is other interesting material, such as the originating organisation, a description, date, etc .. but we want the content itself, not a wrapper describing it.
Look closely and you'll see there is a field called "Document URL", the penultimate field on each row. That's what we will use to get the source content: the URL points to the source document. We just need to download every one .. that's over 19,000 downloads!!
Here's a Python script to open that CSV and, for each entry, print out the "Document URL" field. Some entries are empty, so the script tests for this and skips them.
#!/usr/bin/env python3
import csv

with open("Hillsborough-Disclosure.csv", encoding='windows-1252') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        # "Document URL" is the penultimate (19th of 20) field
        url = row[18]
        if len(url) > 0:
            print(url)
We can use standard unix shell commands to pipe the output of that script to a file, which will become a file of URLs. Here the script is called extract_hip_files.py and the target file for the URLs is called hip_urls.txt.
./extract_hip_files.py > hip_urls.txt
That's quite a quick process on a modern laptop. Looking inside hip_urls.txt gives us a list of URLs as expected. Here are just a few from the top:

http://hillsborough.independent.gov.uk/repository/docs/AGO000000010001.pdf
http://hillsborough.independent.gov.uk/repository/docs/AGO000000020001.pdf

We actually deleted the very first line because that was just the header information. It was there because our Python script didn't have the extra complication of skipping it.
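Incidentally, the manual deletion can be avoided. Python's csv module has a DictReader, which consumes the header row itself and lets us refer to fields by name rather than position. Here's a minimal sketch, assuming the same filename and encoding as the script above:

```python
import csv

def extract_urls(path):
    """Yield every non-empty "Document URL" from the catalogue CSV."""
    with open(path, encoding='windows-1252') as csvfile:
        # DictReader reads the header row itself and maps each field
        # name to its value, so no manual deletion is needed
        for row in csv.DictReader(csvfile):
            url = row["Document URL"]
            if url:
                yield url

if __name__ == "__main__":
    for url in extract_urls("Hillsborough-Disclosure.csv"):
        print(url)
```

The named access also means the script keeps working if the catalogue's column order ever changes.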
Now we need to get these PDFs.
PDFs? What?

So the source content is not plain text .. it's provided as PDFs. Looking at a few, they appear to be scanned images, and the images are of very variable and often poor quality original documents. Extracting plain text will be hard.
We'll deal with the PDF issue later. Back to downloading the content.
Downloading 19,000 Files

That file hip_urls.txt contains the URLs pointing to the source documents. A quick unix shell command wc -l hip_urls.txt tells us there are 19217 URLs .. 19217 documents to download!
That's a huge amount to do manually. Let's use another cool and common open source tool to do the hard work. wget is a tool provided by most unix/linux systems. We can point it at a file containing a list of URLs to retrieve.
More importantly, this is a large amount to download and it will take time. And we might get interrupted, or have to stop and come back later. Luckily wget allows us to resume broken or interrupted downloads, and also not to re-download files it has already downloaded.
Here's the command with these options (from inside a subfolder called downloads):
wget -nc -c -i ../hip_urls.txt
That took just under 5 hours on my laptop over my home internet connection!
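After a download that long it's worth checking nothing was missed. Here's a hedged sketch, assuming the URL list and downloads folder are laid out as above, that compares the filenames we expected against what actually arrived:

```python
import os

def missing_downloads(url_file, download_dir):
    """Return the URLs whose target file is not present in download_dir."""
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    present = set(os.listdir(download_dir))
    # the expected filename is the last component of each URL
    return [u for u in urls if u.rsplit("/", 1)[-1] not in present]

if __name__ == "__main__":
    for url in missing_downloads("hip_urls.txt", "downloads"):
        print("missing:", url)
```

Any URLs it prints can be re-fetched simply by running the same wget command again: the -nc option means already-complete files are skipped.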
Extracting Text from PDFs

Luckily each of the PDFs already has a text layer, because the task of running OCR on the scanned images was already done. We need to extract that text.
There is a good utility commonly found on Linux systems called pdftotext (not the similarly named pdf2text or other variations). The good one is part of the poppler libraries, and is often packaged as poppler-utils. It can try to preserve the ordering of text as it appears in the PDFs with the -layout switch. Here's the command:
for i in *.pdf; do pdftotext -layout "$i"; done
That's much quicker and takes minutes, not hours.
A New Hillsborough Corpus

The new 19217 text files can now become an interesting new text corpus to explore. You can find it here on github:
It'll be really interesting to explore this new data set with the tools we've already built, and new ones we will develop in future too. Keep an eye out for new posts ...
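As a small taste of what comes next, here is a minimal sketch of loading the corpus into memory, assuming the extracted .txt files all sit in one folder (the folder name "downloads" is just an assumption from the steps above):

```python
import glob
import os
from collections import Counter

def load_corpus(folder):
    """Read every .txt file in folder into a dict of {document id: text}."""
    corpus = {}
    for path in glob.glob(os.path.join(folder, "*.txt")):
        doc_id = os.path.splitext(os.path.basename(path))[0]
        # OCR output can contain odd bytes, so replace anything undecodable
        with open(path, encoding="utf-8", errors="replace") as f:
            corpus[doc_id] = f.read()
    return corpus

def word_counts(corpus):
    """A very crude word frequency count across the whole corpus."""
    counts = Counter()
    for text in corpus.values():
        counts.update(text.lower().split())
    return counts

if __name__ == "__main__":
    corpus = load_corpus("downloads")
    print(len(corpus), "documents")
    print(word_counts(corpus).most_common(10))
```

This is deliberately naive, no tokenisation or stop-word handling yet, but it's enough to start poking at what 19,000 documents actually contain.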