Getting started with OpenNLP (Natural Language Processing)
I found a great set of tools for natural language processing. The Java package includes a sentence detector, a tokenizer, a parts-of-speech (POS) tagger, and a treebank parser. It took me a little while to figure out where to start, so I thought I'd post my findings here. I'm no linguist and I don't have previous experience with NLP, but hopefully this will help someone get set up with OpenNLP.
What do these tools do?
Given the input sentences below, we'll take a look at what these tools actually do.
This isn't the greatest example sentence in the world because I've seen better. Neither is this one. This one's not bad, though.
Sentence Detector
Straightforward: it detects sentences. This is more complicated than it sounds, since sentences don't only end with periods and dialogue can also complicate things. Fortunately for us, all of this is handled by the OpenNLP code and we just grab the resulting sentences. Just like the examples in the README, you'll probably start most processing with this because the other tools deal with one sentence at a time.
The sentence detector returns an array of strings. In our example the first element would be:
This isn't the greatest example sentence in the world because I've seen better.
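If you just want to see what sentence splitting looks like before downloading any models, here's a rough sketch using the JDK's java.text.BreakIterator. To be clear, this is a rule-based stand-in, not OpenNLP's trained detector, and the class and method names are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in for illustration only: BreakIterator is rule-based,
// not OpenNLP's model-driven sentence detector.
class SentenceSplitSketch {

    // Returns the sentences found in the text, trimmed of surrounding whitespace.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String paragraph = "This isn't the greatest example sentence in the world "
                + "because I've seen better. Neither is this one. "
                + "This one's not bad, though.";
        // Print one sentence per line.
        for (String s : split(paragraph)) {
            System.out.println(s);
        }
    }
}
```

Simple rules like these tend to trip over abbreviations and quoted dialogue, which is exactly where a trained model earns its keep.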
Tokenizer
The POS tagger and treebank parser both need to have sentences broken down into tokens separated by spaces. Tokens are usually words, but I noticed that some words get split into multiple tokens. For example, "don't" gets split into "do" and "n't", after its uncontracted form, "do not." Some punctuation also gets split into separate tokens. Here's what it does to our sentence:
This is n't the greatest example sentence in the world because I 've seen better .
Note that the "n't" has become a separate token. The same happened to the contracted "have" and the period has also become a token.
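To make the splitting rules concrete, here's a toy tokenizer built from a few regular expressions. It's only a sketch under my own assumptions (a short hard-coded list of contractions and punctuation marks); the real tokenizer uses a trained model, and the class name here is hypothetical.

```java
// Toy illustration of Penn Treebank style token splitting.
// Assumption: a few regex rules are enough for the example sentence;
// this is NOT how OpenNLP's tokenizer is implemented.
class ToyTokenizerSketch {

    static String[] tokenize(String sentence) {
        String spaced = sentence
                // "isn't" -> "is n't", "don't" -> "do n't"
                .replaceAll("n't\\b", " n't")
                // "I've" -> "I 've", "one's" -> "one 's"
                .replaceAll("'(ve|s|re|ll|d|m)\\b", " '$1")
                // put spaces around common punctuation marks
                .replaceAll("([.,!?;:])", " $1 ");
        return spaced.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String sentence = "This isn't the greatest example sentence in the world "
                + "because I've seen better.";
        System.out.println(String.join(" ", tokenize(sentence)));
    }
}
```

Rules like these break down quickly (think decimal numbers or abbreviations such as "U.S."), so treat this as a picture of the output format rather than a replacement.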
POS Tagger
The tagger uses a dictionary of tags and a trained model to apply parts of speech tags (verb, adverb, personal pronoun) to each token in a sentence. The tagging output conforms to the "Penn Treebank Style." Here's the result of tagging the tokenized sentence:
This/DT is/VBZ n't/RB the/DT greatest/JJS example/NN sentence/NN in/IN the/DT world/NN because/IN I/PRP 've/VBP seen/VBN better/RB ./.
As you can see, each token has been appended with a slash followed by a POS tag. I found this parts of speech reference useful for understanding the tags.
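If you end up with the tagged text as a plain string, pulling it back apart is easy. Here's a minimal sketch that splits each item on its last slash; the class and method names are hypothetical, and it assumes the token/TAG format shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: read "token/TAG" output back into (token, tag) pairs.
class TaggedTokenSketch {

    // Each element is a two-element array: {token, tag}.
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            // Split on the LAST slash so a token like "./." still works.
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // no tag (or empty token): skip it
            }
            pairs.add(new String[] {item.substring(0, slash), item.substring(slash + 1)});
        }
        return pairs;
    }

    // Collects the tokens carrying exactly the given tag, in order.
    static List<String> tokensWithTag(String tagged, String tag) {
        List<String> result = new ArrayList<>();
        for (String[] pair : parse(tagged)) {
            if (pair[1].equals(tag)) {
                result.add(pair[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ n't/RB the/DT greatest/JJS example/NN "
                + "sentence/NN in/IN the/DT world/NN because/IN I/PRP 've/VBP "
                + "seen/VBN better/RB ./.";
        System.out.println(tokensWithTag(tagged, "NN"));
    }
}
```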
While the tagger can give you quite a bit of information about the sentence, it doesn't tell you very much about the sentence structure.
Treebank Chunker
The chunker goes a little further in showing sentence structure by breaking the sentence into simple chunks. Noun phrases and verb phrases are recognized and tagged appropriately. Taking our example sentence, we get something like this:
[NP This/DT ] [VP is/VBZ ] n't/RB [NP the/DT greatest/JJS example/NN sentence/NN ] [PP in/IN ] [NP the/DT world/NN ] [SBAR because/IN ] [NP I/PRP ] [VP 've/VBP seen/VBN ] [ADVP better/RB ] ./.
This is pretty useful output. Although it doesn't provide as much information as the parser, it does load up a lot quicker and doesn't require as much memory.
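As an example of putting the chunked output to work, here's a small sketch that pulls out all the phrases with a given label (say, the noun phrases) from the bracketed string. It assumes the exact text format shown above, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract phrases such as NP or VP from bracketed chunker output.
class ChunkOutputSketch {

    // Matches "[LABEL token/TAG token/TAG ]".
    private static final Pattern CHUNK = Pattern.compile("\\[(\\S+)\\s+([^\\]]*)\\]");

    // Returns the words of every chunk with the given label, tags stripped.
    static List<String> phrases(String chunked, String label) {
        List<String> result = new ArrayList<>();
        Matcher m = CHUNK.matcher(chunked);
        while (m.find()) {
            if (!m.group(1).equals(label)) {
                continue;
            }
            StringBuilder words = new StringBuilder();
            for (String item : m.group(2).trim().split("\\s+")) {
                int slash = item.lastIndexOf('/');
                if (slash <= 0) {
                    continue; // not a token/TAG item
                }
                if (words.length() > 0) {
                    words.append(' ');
                }
                words.append(item, 0, slash); // keep the token, drop the tag
            }
            result.add(words.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String chunked = "[NP This/DT ] [VP is/VBZ ] n't/RB "
                + "[NP the/DT greatest/JJS example/NN sentence/NN ] [PP in/IN ] "
                + "[NP the/DT world/NN ] [SBAR because/IN ] [NP I/PRP ] "
                + "[VP 've/VBP seen/VBN ] [ADVP better/RB ] ./.";
        System.out.println(phrases(chunked, "NP"));
    }
}
```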
Treebank Parser
This is the big kahuna. You can tell by the resources it consumes. The parser tags tokens and groups phrases into a hierarchy, building sentence trees of Parse objects. Each of the possible trees for the sentence is also given a probability, which indicates the likelihood that this is the correct way to interpret the sentence. The parser uses the models in the parser models directory, takes about thirty seconds to start up on my machine, and ends up using around 300MB of memory. Once loaded, however, the actual text parsing happens pretty quickly.
Here's the tree generated for our example sentence:
(TOP
  (S
    (NP (DT This))
    (VP
      (VBZ is)
      (RB n't)
      (NP
        (NP
          (DT the)
          (JJS greatest)
          (NN example)
          (NN sentence)
        )
        (PP
          (IN in)
          (NP
            (DT the)
            (NN world)
          )
        )
      )
      (SBAR
        (IN because)
        (S
          (NP (PRP I))
          (VP
            (VBP 've)
            (VP
              (VBN seen)
              (ADVP (RB better))
            )
          )
        )
      )
    )
    (. .)
  )
)
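If you only have the bracketed text and want to walk it yourself, a tiny recursive reader is enough. The sketch below is independent of OpenNLP's Parse class; the Node type and method names are hypothetical, and it assumes well-formed input (it doesn't validate anything).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: read a bracketed tree like "(S (NP (PRP I)) ...)" into nodes.
class BracketTreeSketch {

    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    private final String[] tokens;
    private int pos = 0;

    private BracketTreeSketch(String text) {
        // Pad the parentheses with spaces so they become tokens of their own.
        tokens = text.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    static Node parse(String text) {
        return new BracketTreeSketch(text).readNode();
    }

    private Node readNode() {
        String tok = tokens[pos++];
        if (!tok.equals("(")) {
            return new Node(tok); // a bare word is a leaf
        }
        Node node = new Node(tokens[pos++]); // the label follows "("
        while (!tokens[pos].equals(")")) {
            node.children.add(readNode());
        }
        pos++; // consume the closing ")"
        return node;
    }

    // Collects the words at the bottom of the tree, left to right.
    static List<String> leaves(Node node) {
        List<String> out = new ArrayList<>();
        if (node.children.isEmpty()) {
            out.add(node.label);
        }
        for (Node child : node.children) {
            out.addAll(leaves(child));
        }
        return out;
    }

    public static void main(String[] args) {
        String tree = "(TOP (S (NP (DT This)) (VP (VBZ is) (RB n't) "
                + "(NP (NP (DT the) (JJS greatest) (NN example) (NN sentence)) "
                + "(PP (IN in) (NP (DT the) (NN world)))) "
                + "(SBAR (IN because) (S (NP (PRP I)) "
                + "(VP (VBP 've) (VP (VBN seen) (ADVP (RB better))))))) "
                + "(. .)))";
        System.out.println(String.join(" ", leaves(parse(tree))));
    }
}
```

Reading the leaves back out should give you the tokenized sentence again, which is a handy sanity check on any tree you build.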
Some code to get you going
    String paragraph = "...";

    // the sentence detector and tokenizer constructors
    // take paths to their respective models
    SentenceDetectorME sdetector =
        new SentenceDetector("models/sentdetect/EnglishSD.bin.gz");
    Tokenizer tokenizer = new Tokenizer("models/tokenize/EnglishTok.bin.gz");

    // the parser takes the path to the parser models
    // directory and a few other options
    boolean useTagDict = true;
    boolean useCaseInsensitiveTagDict = false;
    int beamSize = ParserME.defaultBeamSize;
    double advancePercentage = ParserME.defaultAdvancePercentage;
    ParserME parser = TreebankParser.getParser(
        "models/parser", useTagDict, useCaseInsensitiveTagDict,
        beamSize, advancePercentage);

    // break a paragraph into sentences
    String[] sents = sdetector.sentDetect(paragraph);
Now we feed each of the sentences to the tokenizer, and pass the output to the parser.
    String sent = sents[0];

    // tokenize brackets and parentheses by putting a space on either side.
    // this makes sure it doesn't get confused with output from the parser
    // (untokenizedParenPattern1/2 and convertToken are helpers that are
    // not defined in this snippet)
    sent = untokenizedParenPattern1.matcher(sent).replaceAll("$1 $2");
    sent = untokenizedParenPattern2.matcher(sent).replaceAll("$1 $2");

    // get the tokenizer to break apart the sentence
    String[] tokens = tokenizer.tokenize(sent);

    // build a string to parse as well as a list of tokens
    StringBuffer sb = new StringBuffer();
    List<String> tokenList = new ArrayList<String>();
    for (int j = 0; j < tokens.length; j++) {
        String tok = convertToken(tokens[j]);
        tokenList.add(tok);
        sb.append(tok).append(" ");
    }
    // drop the trailing space
    String text = sb.substring(0, sb.length() - 1);
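The snippet above uses untokenizedParenPattern1, untokenizedParenPattern2, and convertToken without showing them. Here's one plausible way to define them. The regexes and the -LRB-/-RRB- style replacements (the usual Penn Treebank bracket tokens) are assumptions on my part for illustration, so check them against whatever your OpenNLP version actually does.

```java
import java.util.regex.Pattern;

// Sketch of the bracket-handling helpers. All definitions here are assumed.
class ParenPrepSketch {

    // A non-space character followed directly by a bracket: "b)" -> "b )".
    static final Pattern UNTOKENIZED_PAREN_1 = Pattern.compile("([^ ])([({)}])");
    // A bracket followed directly by a non-space character: "(b" -> "( b".
    static final Pattern UNTOKENIZED_PAREN_2 = Pattern.compile("([({)}])([^ ])");

    // Puts a space on either side of parentheses and braces.
    static String spaceOutParens(String sentence) {
        String s = UNTOKENIZED_PAREN_1.matcher(sentence).replaceAll("$1 $2");
        return UNTOKENIZED_PAREN_2.matcher(s).replaceAll("$1 $2");
    }

    // Swaps bracket tokens for placeholder names so they can't be
    // mistaken for the parentheses in the parser's tree output.
    static String convertToken(String token) {
        if (token.equals("(")) return "-LRB-";
        if (token.equals(")")) return "-RRB-";
        if (token.equals("{")) return "-LCB-";
        if (token.equals("}")) return "-RCB-";
        return token;
    }

    public static void main(String[] args) {
        String spaced = spaceOutParens("I saw it (twice) today");
        StringBuilder out = new StringBuilder();
        for (String tok : spaced.split(" ")) {
            out.append(convertToken(tok)).append(' ');
        }
        System.out.println(out.toString().trim());
    }
}
```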
The parser takes a two-layer tree of Parse objects. The parent Parse object holds a list of child Parse objects—one for each token in the sentence.
    // the parent parse instance spans the entire sentence
    Parse p = new Parse(text, new Span(0, text.length()), "INC", 1, null);

    // create a parse object for each token and add it to the parent
    int start = 0;
    for (String tok : tokenList) {
        p.insert(new Parse(text, new Span(start, start + tok.length()),
            ParserME.TOK_NODE, 0));
        start += tok.length() + 1; // skip the space after the token
    }

    // fetch multiple possible parse trees
    // (numParses, the number of trees to request, is not defined in this snippet)
    Parse[] parses = parser.parse(p, numParses);
At this point, parses holds the different possible parse trees for the sentence. Now, figure out what you want to do with it, you must!
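One detail worth a second look is the offset bookkeeping in the token loop: each token's span is its character range within the space-joined text, so the start advances by the token length plus one for the space. Here's that arithmetic on its own, as a plain-Java sketch with no OpenNLP types (the class and method names are hypothetical).

```java
import java.util.Arrays;
import java.util.List;

// Sketch: compute [start, end) character offsets for tokens that
// have been joined with single spaces.
class TokenSpanSketch {

    // Returns one {start, end} pair per token; end is exclusive.
    static int[][] spans(List<String> tokens) {
        int[][] result = new int[tokens.size()][2];
        int start = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            result[i][0] = start;
            result[i][1] = start + tok.length();
            start += tok.length() + 1; // +1 skips the separating space
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("This", "is", "n't");
        String text = String.join(" ", tokens);
        for (int[] span : spans(tokens)) {
            // Each span should cut the original token back out of the text.
            System.out.println(span[0] + "-" + span[1] + ": "
                    + text.substring(span[0], span[1]));
        }
    }
}
```

If the spans ever drift out of step with the text (for example, because tokens were joined with something other than a single space), the substrings stop matching the tokens, so this is a cheap thing to check.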


