Wednesday, 3 August 2016

Using the Humongous British National Corpus (BNC)

Many text mining models need a set of natural language text to learn from .. that set of text provides the examples for machine learning methods.

The word used for such a collection of example text is corpus. I know .. it sounds very grand!

You can see that a small set of text would provide very limited learning opportunities .. because no machine or human mind can learn from a paucity of examples. So a large corpus is a good thing ... it provides lots of examples of language use, including the odd variations that we humans like to put into our language.

Sometimes it is useful to have a corpus that is narrowly focused on a specific domain - like medical research, or horticulture, or Shakespeare's plays. That way the learning is also focused on that specific domain .. and adding additional text from another domain would dilute the example.

But there are cases where we actually do want a wide range of domains represented in a corpus .. to give us as general an understanding as possible of how language is used.

Finding Corpora

So given how useful large, and sometimes specialised, corpora are .. where do we find them? We don't want to make them ourselves as that would take huge amounts of effort.

Sadly, many of the best corpora are proprietary. They are not freely available, and even when they are for personal use, you have to agree to a scary looking set of terms. Almost always, you are prohibited from sharing the corpus onwards. This is a shame, because many of these corpora are publicly funded, or derived from publicly funded sources.

There are some out-of-date corpora if you look hard enough - the 20 NNTP newsgroups here (scikit-learn) and here (Apache Mahout) ... seriously?! And there is a tendency for too many researchers to use the same small set of corpora which happen to be freely available.

There are some notable good examples of freely available and usable text. Project Gutenberg publishes out of copyright texts in very accessible forms. It's a great treasure trove .. have a look: https://www.gutenberg.org
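As a sketch of how easy Gutenberg texts are to get at programmatically - note that the URL pattern and the book number used here are my assumptions, so do check the site for the actual links for a given book:

```python
import urllib.request

def gutenberg_url(book_id):
    # one common URL pattern for Project Gutenberg's UTF-8 plain text files
    # (assumption - some books use different file names)
    return "https://www.gutenberg.org/files/{0}/{0}-0.txt".format(book_id)

def fetch_book(book_id):
    # download the plain text of one book
    with urllib.request.urlopen(gutenberg_url(book_id)) as response:
        return response.read().decode("utf-8")

# book 11 is Alice's Adventures in Wonderland
print(gutenberg_url(11))
```

Because the texts are already plain, there's no extraction step needed - a nice contrast with the BNC below.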

Another good source is public data releases, such as the Clinton emails we used in previous posts. Similarly, public reports such as the Iraq Inquiry report are great sets of text, especially if you're interested in exploring a particular domain.


The British National Corpus

The British National Corpus (BNC) is a truly massive corpus of English language. It is a really impressive effort to collate a very wide range of domains and usage, including spoken and regional variations.

You can find out more about the BNC at http://www.natcorp.ox.ac.uk/corpus/index.xml but here are the key features:
  • 100 million words .. yes one hundred million words! 
  • 90% from written text including newspapers, books, journals, letters, fiction, ...
  • 10% from spoken text including informal chat, formal meetings, phone-ins, radio shows .. and from a range of social and regional contexts.

Sadly the BNC corpus is proprietary - you can't take it and do what you want with it. You can apply for a copy for personal use from http://ota.ox.ac.uk/desc/2554.

There is a smaller free sample of around 4 million words, called the BNC Baby, at http://ota.ox.ac.uk/desc/2553 which we will use to test our algorithms on first, as it is quicker and less resource intensive than working on the humongous full BNC.


Extracting the Text with Python

The BNC is not, apparently, available in plain text form. It is instead published in a rich XML format, which includes lots of annotation about the words, such as parts of speech (verb, noun, etc).
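To get a feel for what that annotation looks like, here is a tiny made-up fragment in the spirit of the BNC's XML - the real corpus files use richer attributes, so treat this as an illustration only - parsed with Python's standard library:

```python
import xml.etree.ElementTree as ET

# an illustrative fragment: a sentence element containing word elements,
# each annotated with a (simplified) part-of-speech attribute
fragment = """
<s n="1">
  <w pos="ADJ">large</w>
  <w pos="SUBST">corpora</w>
  <w pos="VERB">are</w>
  <w pos="ADJ">useful</w>
</s>
"""

sentence = ET.fromstring(fragment)

# the plain text is just the word content, stripped of the annotation
words = [w.text for w in sentence.iter("w")]
print(' '.join(words))
# large corpora are useful
```

That stripping-out of the annotation is exactly what the NLTK corpus reader below does for us at scale.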

We want to make our own text mining toolkit - so we want to start with the plain text. The following is simple Python code for accessing and extracting the plain text, in the form of sentences and words. You can see below how we can switch between the full BNC and the BNC Baby.

# code to convert the BNC XML to plain text words

# import NLTK BNC corpus reader
import nltk.corpus.reader.bnc


# full BNC text corpus
#a = nltk.corpus.reader.bnc.BNCCorpusReader(root='data_sets/bnc/2554/2554/download/Texts', fileids=r'[A-K]/\w*/\w*\.xml')


# smaller sample BNC Baby corpus

a = nltk.corpus.reader.bnc.BNCCorpusReader(root='data_sets/bnc/2553/2553/download/Texts', fileids=r'[a-z]{3}/\w*\.xml')


# how many sentences
len(a.sents())

280851

# how many words
len(a.words())

3540423

# print out first 50 words
a.words()[:50]
['BEING',
 'DRAWN',
 'TO',
 'AN',
 'IMAGE',
 'Guy',
 'Brett',
 'Why',
 'do',
 'certain',
 'images',
 'matter',
 ...


The snippet of code to write out a new plain text file is easy too:

# extract sentences and write them to a new plain text file,
# one sentence per line
with open("data_sets/bnc/txt/bnc_baby.txt", 'w') as nf:
    for s in a.sents():
        nf.write(' '.join(s) + '\n')



The BNC Baby sample turns into a 17 MB plain text file!
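If you want to double-check the round trip before running it over millions of words, here is a minimal sketch using a stand-in list of sentences instead of the real BNC Baby - the file name is just for illustration:

```python
# stand-in for a.sents(): a small list of tokenised sentences
sentences = [['Hello', 'world', '.'], ['Corpora', 'are', 'fun', '.']]

# write each sentence as one line of plain text
with open("sample.txt", 'w') as nf:
    for s in sentences:
        nf.write(' '.join(s) + '\n')

# read it back and check every word survived the round trip
with open("sample.txt") as f:
    text = f.read()

print(len(text.split()))
# 7
```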


Damned License

Sadly I can't put this plain text file on github to share it, because of the damned restrictive license http://www.natcorp.ox.ac.uk/docs/licence.html so you'll have to recreate it yourself using the code above.

Annoying, I know .. feel free to petition the University of Oxford (ota@it.ox.ac.uk) to change the license and make it #openscience
