Getting Useful Information Out of Unstructured Text
Let’s say that you’re interested in performing a basic analysis of the US M&A market over the last five years. You don’t have access to a database of transactions and don’t have access to tombstones (public advertisements announcing the minimal details of a closed deal, e.g. ABC acquires XYZ for $500mm). What you do have is access to is a large corpus of financial news articles that contain within them – somewhere – the basic transactional details of M&A deals.
What you need to do is design a system that takes in this large database and outputs clean fields containing M&A transaction details. In other words, map an excerpt like this:
“After much speculation and rumors, Calvin Klein announced yesterday that it had acquired Royal Dutch Shell for an unprecedented sum: the company will shell out $12,400 for Shell, the multinational oil and gas conglomerate.”
onto this ontology:
so that it becomes:
After we do this for all of the articles in our database, we should have a nice structured dataset that can be queried, aggregated, joined with other datasets, etc. to help us better analyze the M&A market.
This project describes a common task in natural language processing called information extraction (IE). So how do we get from news articles to identifying relationships between entities? In general, the process usually looks something like the diagram below:
We start with raw text, identify the sentences, identify the words in the sentences, identify the part of speech of the words (nouns, verbs, prepositions, etc.), identify entities within the text (called chunking, or shallow parsing), and finally identify relationships between those entities.
To me, chunking is one of the more interesting stages in the information extraction process. Effective chunking requires a good toolkit of software techniques and requires that you think carefully and critically about your data and your system design: rule-based approaches, machine learning approaches, exception handling, complexity vs. speed tradeoffs, domain specificity, and all of the nasty and exciting challenges that human language poses for automation.
In this post, we’ll focus on entity detection and chunking. First I’ll talk about chunking in the context of linguistics, describe some common problems and approaches, and then talk about named entity recognition. My hope is that by taking a medium-deep dive into chunking you’ll gain a better understanding of the challenges and techniques used in all stages of information extraction, and will have a few more ideas for your own NLP projects.
What is Chunking?
Chunking is somewhere between part of speech (POS) tagging and full language parsing, hence the name shallow parsing. If chunkers are an inbetween stage then why are they relevant? The answer comes down to utility and speed.
POS tagging is very fast but often doesn’t provide a ton of utility for information extraction. It’s helpful to know the POS tags, but when we try to derive information about our text we’re still swimming within the unstructured soup of words in a sentence. Knowing that word 1, 4 and 7 in our sentence are nouns won’t often won’t prove useful enough to help us reliably gain knowledge about what our sentence is actually saying; there’s too much room for mistake.
POS Tags:
On the other hand, full parsing is extremely useful: we’re able to understand the syntactic relationship details between the words in our text, and information extraction becomes much easier to define. However, full parsing takes a very long time and will often give you information you don’t necessarily need. Some degree of parsing helps structure our text, but knowing that the determiner in the middle of our sentence is four branches down from the root and part of a nested prepositional clause within a NP clause within the main VP clause…might be overkill, as is the other parse tree produced for our sentence because the syntactic ambiguity in the prepositional clause lends itself to two interpretations: the subject in pajamas shooting an elephant, or the subject shooting the elephant that is wearing his pajamas.
(Sort of correct) Parsing:
And let’s be honest, you’re only here because you want to understand what Twitter is saying about your company’s new line of designer sandwiches or whatever, so all of this extra information is unnecessary and confusing.
Chunking is the happy middle ground that gives you enough information about the syntactic structure to reliably extract meaning from language without burdening your system with unnecessary information.
Chunking:
How do we chunk?
Let’s start by importing the relevant libraries and writing a simple preprocessing function to tokenize and tag your text:
import nltk import pprint import re def preprocess(doc): sentences = nltk.sent_tokenize(doc) sentences = [nltk.word_tokenize(sent) for sent in sentences] sentences = [nltk.pos_tag(sent) for sent in sentences] return sentences
Take the sentence “The blogger taught the reader to chunk.” After POS-tagging, we have:
sentence = "The blogger taught the reader to chunk" preprocess(sentence) [('The', 'DT'), ('blogger', 'NN'), ('taught', 'VBD'), ('the', 'DT'), ('reader', 'NN'), ('to', 'TO'), ('chunk', 'VB'), ('.', '.')]
In case you’re rusty, a quick reference from your linguistics class back in college.
Chunking is an intermediate step towards full text parsing, and lets us take POS-tagged words and build them into noun phrases (NP), verb phrases (VP), prepositional phrases (PP), and whatever else we might want. Chunking will segment, label, and create a simple hierarchical syntactic structure out of our discrete tokens given some grammatical rules:
To produce the diagram above, we’ve built a simple noun phrase chunking grammar, where we simply search for and assign labels to patterns that we identify as possible NPs and disregard VPs and PPs. Specifically, our custom chunk grammar tags a noun phrase chunk whenever it finds:
- an optional determiner DT (“the,” “a,” etc.)
- followed by any number of adjectives
- followed by a noun
We’ll use regular expressions to formally define this grammar and a regex parser from NLTK:
grammar = "NP: {<DT>?<JJ>*<NN>}" NPChunker = nltk.RegexpParser(grammar) result = NPChunker.parse(sentence[0]) result.draw()
(Note that our preprocessor function returns a list of lists, hence “sentence[0]” above.)
The draw function creates a tree diagram like the one above, and “print result” returns the tree data structure:
Tree('S', [Tree('NP', [('The', 'DT'), ('blogger', 'NN')]), ('taught', 'VBD'), Tree('NP', [('the', 'DT'), ('reader', 'NN')]), ('to', 'TO'), ('chunk', 'VB'), ('.', '.')])
Clearly, we’re a very long way away from creating accurate syntactic representations, but you can probably see how we might expand the noun phrase definition, build up verb phrases, relative clauses, etc. by extension. Depending on your fluency with regular expressions and familiarity with linguistics you could spend more time beefing up our toy grammar. For example, by just looking at our sample sentence we can cheat a little bit and define a grammar that correctly (well, almost) parses our sentence by including a custom VP and infinitive (I) pattern:
NP: {<DT|PP\$>?*} I: {} VP: {<NP|PP>*}
Of course, this won’t generalize nicely outside of our sample sentence.
As a side note, look at what we did with the VP pattern: we nested a NP as an element inside of a VP pattern. Our grammar first identified and tagged all NP chunks, and only afterwards identified VP chunks which can include previously defined NP nodes. Recursion is necessary for correct chunking of natural language, and puts us on the road to full parsing. Note that the parser works through our grammar in sequential order…so what if we use a grammar with an NP pattern like this?
NP: {<DT|PP\$>?*} I: {} VP: {<NP|PP>*}
Our NP pattern looks for VP tags…but these haven’t been created yet! And if we push our VP pattern to the top of our grammar, then it won’t be able to find NP clauses defined in its grammar. To solve this we need to loop over our pattern so that we can identify and tag the nested patterns. NLTK’s regexparser lets us do so with a loop param:
cp = nltk.RegexpParser(grammar, loop=2)
Anyway, back to chunking: a pure chunker would be only concerned with creating a set of mutually exclusive, non-nested phrases like NP, VP, and PPs:
NP: {<DT>?<JJ>*<NN>} VP: {<VBD>?<TO>?<VB>?}
[NP The blogger ] [VP taught ] [NP the reader ] [VP to chunk ].
What is chinking?
A chink is a sequence of tokens that should be excluded from a chunk, and is another tool in our shallow parsing toolbox. Instead of creating a comprehensive inclusive grammar for accurate chunking, it might be easier to define the edge cases we want to exclude. To create a chinking pattern, create the tag but wrap it in open brackets }{ instead of closed ones. We won’t use it in our grammar, but if – for example – we wanted to exclude prepositions and determiners from our NP pattern, we could write something like this:
NP: }<IN|DT>+{ # chinking pattern: prepositions, determiners
Evaluating chunker performance
Now that we have the tools to make a chunker, we need a labeled test dataset against which we can see how well our chunker performs. We’ll start by importing the tagged and chunked Wall Street Journal corpus conll2000 from nltk, and then evaluating different chunking strategies against it.
nltk.download("conll2000") from nltk.corpus import conll2000
Chunk structures can be either represented in tree or tag format. We’ve already seen the tree format above, i.e. Tree(‘S’,[Tree(..)..]..) The widely used tag notation is known as IOB tag format: each token is tagged with either I (inside), O (outside), or B (begin). These tags refer to whether a token is the beginning of a chunk, somewhere inside a chunk, or outside a chunk:
In this format, a corpus can be represented in three columns: the word, the POS, the chunk tag (IOB) and chunk tag phrase.
He PRP B-NP reckons VBZ B-VP the DT B-NP current JJ I-NP account NN I-NP deficit NN I-NP will MD B-VP narrow VB I-VP to TO B-PP only RB B-NP # # I-NP 1.8 CD I-NP billion CD I-NP in IN B-PP September NNP B-NP . . O
Establishing Null Accuracy
Before we evaluate our classifier, it’s important to figure out the null accuracy – in other words, how well would our classifier do if we assigned NP at random, or assigned everything as NP, or nothing as NP? The classifier is only good insofar as it can outperform these random/uneducated assignment benchmarks.
We’ll start by looking only at NPs and will establish a base accuracy for any NP classifier that we create. To do this, we’ll input an empty string where our grammar should go:
grammar = "" cp = nltk.RegexpParser(grammar) test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP']) print cp.evaluate(test_sents)
ChunkParse score: IOB Accuracy: 43.4%% Precision: 0.0%% Recall: 0.0%% F-Measure: 0.0%%
This means that 43.4% of our NP-tagged words are tagged with O and not in an NP chunk. For clarification, note the chunk_types param in the code above. Setting the chunk_types param means we will only look at NP chunks:
print conll2000.chunked_sents('train.txt')[99]
(S (PP Over/IN) (NP a/DT cup/NN) (PP of/IN) (NP coffee/NN) ,/, (NP Mr./NNP Stone/NNP) (VP told/VBD) (NP his/PRP$ story/NN) ./.)
print conll2000.chunked_sents('train.txt', chunk_types = ['NP'])[99]
(S Over/IN (NP a/DT cup/NN) of/IN (NP coffee/NN) ,/, (NP Mr./NNP Stone/NNP) told/VBD (NP his/PRP$ story/NN) ./.)
Getting More Advanced
Now that we’ve established a baseline, let’s build a classifier-based chunker. What is a classifier-based chunker? We saw that with our regex chunker, every decision about what to chunk and not chunk was purely a grammatical function of the text’s POS tags, the tag sequences, the tag frequencies, etc. In a classifier-based approach, we treat the chunker more like a traditional machine learning classification algorithm: take in a feature set and create predictions. Clearly our feature set is larger than just the POS tags we used in the regex chunker; we can extract information about the text from the tokens, token sequences, frequencies, sentence lengths, punctuation occurrences, etc. In the classifier-based approach we’ll exploit this information to create new features for our classifier.
For this post we’ll be using the classifier code provided on the NLTK website and in the book Natural Language Processing with Python.
First we’ll introduce the main body of the classifier. In this case, we use a greedy sequence classifier. This classifier looks at the training data one sentence at a time, and for each sentence takes the first element, predicts a tag, records the first element features and prediction in a history and predicts the tag of the second element using the the second element features and the history, etc. until all elements of the sentence have been predicted using the contextual information of all the words and predictions which came before it.
The embedded tagger and model evaluation interfaces for NLTK require different formats. The ConsecutiveNPChunker class at the bottom simply converts between IOB tags and tree format with chunk.tree2conlltags, chunk.conlltags2tree, and list comprehensions.
For each sentence in the training set, the ConsecutiveNPChunkTagger class constructor builds a sequenced history of features composed of the actual tag and untagged features passed to a separate feature generation function npchunk_features. It then trains a maximum entropy classifier on this data, which the tag method uses to classify new data.
class ConsecutiveNPChunkTagger(nltk.TaggerI): def __init__(self, train_sents): train_set = [] for tagged_sent in train_sents: untagged_sent = nltk.tag.untag(tagged_sent) history = [] for i, (word, tag) in enumerate(tagged_sent): featureset = npchunk_features(untagged_sent, i, history) train_set.append( (featureset, tag) ) history.append(tag) self.classifier = nltk.MaxentClassifier.train( train_set, algorithm='IIS', trace=0) def tag(self, sentence): history = [] for i, word in enumerate(sentence): featureset = npchunk_features(sentence, i, history) tag = self.classifier.classify(featureset) history.append(tag) return zip(sentence, history) class ConsecutiveNPChunker(nltk.ChunkParserI): def __init__(self, train_sents): tagged_sents = [[((w,t),c) for (w,t,c) in nltk.chunk.tree2conlltags(sent)] for sent in train_sents] self.tagger = ConsecutiveNPChunkTagger(tagged_sents) def parse(self, sentence): tagged_sents = self.tagger.tag(sentence) conlltags = [(w,t,c) for ((w,t),c) in tagged_sents] return nltk.chunk.conlltags2tree(conlltags)
The feature generation function is the most interesting part of our classifier and has been separated from the main body so that we can easily iterate on it. Our first version of the feature generator simply returns the POS tag for each word – nothing special.
def npchunk_features(sentence, i, history): word, pos = sentence[i] return {"pos" : pos}
We then call and evaluate our classifier like so:
chunker = ConsecutiveNPChunker(train_sents) print chunker.evaluate(test_sents)
ChunkParse score: IOB Accuracy: 92.9%% Precision: 79.9%% Recall: 86.8%% F-Measure: 83.2%%
Not bad. Let’s see if we can improve our performance by adding more features. In the next modification of npchunk_features, we’ll create a feature for the token itself, as well as a lag feature for the POS of the previous token.
def npchunk_features(sentence, i, history): word, pos = sentence[i] if i == 0: prevword, prevpos = "<START>", "<START>" else: prevword, prevpos = sentence[i-1] return {"pos": pos, "word": word, "prevpos": prevpos}
ChunkParse score: IOB Accuracy: 94.6%% Precision: 84.6%% Recall: 89.8%% F-Measure: 87.1%%
With the addition of just a tiny selection of features, our score has improved quite significantly. Let’s add a feature for the next POS tag, bigram POS tags, and a feature that aggregates the POS tags we’ve seen since the last encounter with a determiner:
def npchunk_features(sentence, i, history): word, pos = sentence[i] if i == 0: prevword, prevpos = "<START>", "<START>" else: prevword, prevpos = sentence[i-1] if i == len(sentence)-1: nextword, nextpos = "<END>", "<END>" else: nextword, nextpos = sentence[i+1] return {"pos": pos, "word": word, "prevpos": prevpos, "nextpos": nextpos, "prevpos+pos": "%s+%s" % (prevpos, pos), "pos+nextpos": "%s+%s" % (pos, nextpos), "tags-since-dt": tags_since_dt(sentence, i)} def tags_since_dt(sentence, i): tags = set() for word, pos in sentence[:i]: if pos == 'DT': tags = set() else: tags.add(pos) return '+'.join(sorted(tags))
ChunkParse score: IOB Accuracy: 96.0%% Precision: 88.3%% Recall: 91.1%% F-Measure: 89.7%%
Let’s add a few more features: previous word, next word, and trigrams for both word and POS:
def npchunk_features(sentence, i, history): word, pos = sentence[i] if i == 0: prevword, prevpos = "<START>", "<START>" else: prevword, prevpos = sentence[i-1] if i == len(sentence)-1: nextword, nextpos = "<END>", "<END>" else: nextword, nextpos = sentence[i+1] return {"pos": pos, "word": word, "prevpos": prevpos, "nextpos": nextpos, "prevword": prevword, "nextword": nextword, "prevpos+pos": "%s+%s" % (prevpos, pos), "pos+nextpos": "%s+%s" % (pos, nextpos), "prevpos+pos+nextpos": "%s+%s+%s" % (prevpos, pos, nextpos), "prevword+word+nextword": "%s+%s+%s" % (prevword, word, nextword), "tags-since-dt": tags_since_dt(sentence, i)} def tags_since_dt(sentence, i): tags = set() for word, pos in sentence[:i]: if pos == 'DT': tags = set() else: tags.add(pos) return '+'.join(sorted(tags))
ChunkParse score: IOB Accuracy: 96.5%% Precision: 90.2%% Recall: 92.5%% F-Measure: 91.3%%
A small but not insignificant improvement. As you can see, by adding more and more context – whether POS, tokens, sequences, etc. – we help improve chunking accuracy.
Named Entity Recognition (NER)
So, why have we spent all this time identifying verb phrases in WSJ data when we’re really interested in M&A deal data? The answer is that the same methods used in linguistic chunking and parsing don’t just apply to VPs, NPs, and PPs…they can be generalized to identification of other custom chunks like place names, company names, people, dates, dollar amounts, etc.
At first pass, this sounds like overkill. If we wanted to identify every location mentioned in a Wall Street Journal article, for example, why don’t we just index the words in the article against a database of place names? Well, let’s give our naive lookup system a shot:
Even if we take some basic measures to update our system like requiring capitalization of place names (excludes “books”) and exclusion of non-nouns (excludes “popular”) , we’ll still misclassify lots of place name instances. Our toy example from the beginning of this post presents us with similar (somewhat contrived) challenges:
“After much speculation and rumors, Calvin Klein announced yesterday that it had acquired Royal Dutch Shell for an unprecedented sum: the company will shell out $12,400 for Shell, the multinational oil and gas conglomerate.”
Supposing our NER system wants to identify companies, people, places, etc., how can it tell that Calvin Klein is a company and not a person? Or which instance of “shell” refers to the company?
The solution is to incorporate information about our linguistic chunks and apply new chunking-like approaches to the identification of new entities. NLTK provides a classifier, nltk.ne_chunk, that has been pre-trained to recognize entities like people, organizations, and locations. “Nick taught the Google employee from Illinois to chunk.” becomes
(S (PERSON Nick/NNP) taught/VBD the/DT (ORGANIZATION Google/NNP) employee/NN from/IN (GPE Illinois/NNP) to/TO chunk/VB ./.)
where people, organizations, and locations have been correctly tagged.
Although NLTK’s off-the-shelf chunker worked well in this case, I should mention that chunkers are notoriously brittle; they tend not to generalize well outside of the domain-specific text for which their grammars have been written or their classifiers have been trained on.
Fortunately, after our lesson on chunking, you now have two approaches that you can use to build your own entity recognition chunker.
- If you have a training dataset of chunked text then lucky you – you can follow the classifier-based approach to develop a chunker.
- If you don’t happen to have a labeled training set then you can develop a regex grammar on top of the POS-tagged corpus, the chunk-tagged corpus, even the ne_chunk-tagged corpus.