Hacker News new | past | comments | ask | show | jobs | submit login
Parsing English with 500 lines of Python (2013) (honnibal.wordpress.com)
285 points by adamnemecek on Apr 28, 2014 | hide | past | web | favorite | 61 comments

It's worth checking out the OP's previous post on this: A good POS tagger in about 200 lines of Python: http://honnibal.wordpress.com/2013/09/11/a-good-part-of-spee...

The OP says his PyGreedyAP gets 96.8% accuracy in 12s (vs NLTK's 94% at 236s)

How does this relate/compare to NLTK?


The truth is nltk is basically crap for real work, but there's so little NLP software that's put proper effort into documentation that nltk still gets a lot of use.

You can work your way down the vast number of nltk modules, and you'll find almost none of them are useful for real work, and those that are, ship a host of alternatives that are all much worse than the current state-of-the-art.

nltk makes most sense as a teaching tool, but even then it's mostly out of date. The chapter on "Parsing" in the nltk book doesn't even really deal with statistical parsing. The dependency parsing work referenced in this post is almost all 1-3 years old, so obviously it isn't covered either.

As an integration layer, nltk is so much more trouble than it's worth. You can use it to compute some of your scoring metrics, or read in a corpus, but...why bother?

I'm slowly putting together an alternative, where you get exactly one tokeniser, exactly one tagger, etc. All these algorithms have the same i/o, so we shouldn't ask a user to choose one. We should just provide the best one.

My previous post, on POS tagging, shows that ntlk's POS tagger is incredibly slow, and not very accurate: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-spe... . nltk scores 94% in 3m56s, my 200 line implementation scores 96.8% in 12s.

I used to use nltk for tokenisation and sentence-boundary detection, but this library seems better for that: https://code.google.com/p/splitta/

As an aside, OpenCV seems to have the same problem. Widely used and at least a std deviation away from OK.

Totally. OpenCV was my first case of encountering and having then to work out a way around bugs in something that's supposed to "just work". Not many other options, though.

The ones I know are: http://simplecv.org and http://libccv.org/post/an-elephant-in-the-room/ (introduced on HN 2 yrs ago: https://news.ycombinator.com/item?id=4180979 )

Anyone have thoughts on how they or other libs compare to openCV?

OpenCV is the OpenSSL of vision libraries.

Since NLTK does enjoy some popularity, would it make sense to try to get some of your code into the project? I'm sure a lot of people could benefit from using/learning about more current algorithms.

My code is BSD licensed, so nltk can take and adapt it in any way they choose. However, I won't do any of the work to integrate it into their class hierarchies, or to write the tests and documentation in the way that they require.

I expect a successful patch request would take a lot of work. That's not a criticism; it's just how it is, given the aims, history and status of their project.

I was thinking about working through the NLTK book once I'm finished with Bishop's Pattern Recognition, would you be able to recommend an alternative?

check out Michael Collins's NLP course: https://www.coursera.org/course/nlangp

and notes: http://www.cs.columbia.edu/~mcollins/

he talks about averaged perceptron at the end (Lectures on generalized log-linear models - GLM http://www.cs.columbia.edu/~cs4705/)

perceptron tagger code (hw4 solution) can be found here https://github.com/emmadoraruth/Perceptron_Tagger.git

I think the book "Speech and Language Processing, 2nd Edition" from Prof. Daniel Jurafsky is very good http://www.amazon.com/Speech-Language-Processing-Daniel-Jura...

You can also check out the great online NLP course taught by the author and Prof. Chris Manning from Stanford: https://www.youtube.com/watch?v=nfoudtpBV68&list=PL6397E4B26...

Dependency Parsing by Nivre et al was a good source for catching up from an NLP course to state-of-the-art http://www.amazon.com/Dependency-Synthesis-Lectures-Language...

It's still tough to recommend that, imo. If you could choose to beam it straight into your head? Yeah, go ahead. But working through a book takes a lot of time...It only gives you dependency parsing, and then you have to catch up on the last five years of dependency parsing.

There's no alternative NLP book I can really recommend, no --- sorry. It's moving too fast, the books are going out of date too quickly.

You could try Hal Daume's blog, and Bob Carpenter's blog.

Here is a review [0] by Bob Carpenter of Foundations of Statistical Natural Language Processing [1].

[0] http://www.amazon.com/review/R2FUAZHGUOERHV

[1] http://www.amazon.com/Foundations-Statistical-Natural-Langua...

I know that book well. It's too dated.

Are there any feasible production-ready NLP solutions? My default approach has been to browse the codebases of different computational linguistics labs, but something more central would be very handy. As it stands I just say something to the tune, "please don't use nltk, use opennlp, cleannlp, redshift, etc".

If you prefer C, you can try SENNA: http://ml.nec-labs.com/senna/. It includes POS tagging, chunking, NER, constituency parsing and semantic role labeling. But dependency parsing is not there yet.

It's super fast (thanks to C) and very accurate (thanks to Deep Learning approach). The license is not for commercial usage though.

SENNA can be used with NLTK: http://pydoc.net/Python/nltk/2.0.2/nltk.tag.senna/

ClearNLP looks the best imo, especially you're already using Java. If you're using Python...well, Redshift isn't production-ready, but if you needed to, it'd be the best thing to base your work off.

ClearNLP has a lot of nifty bells and whistles that would make a big difference. In particular, it selects the model for you, based on your text, and its similarity to various subsets of the training data. So, you get a model more matched to your data, which will improve real-world accuracies a lot.

If that overhead comes from Python then have you tried running either of the implementations on PyPy?

Thanks that was enlightening.

NLTK is a toolkit which includes a bunch of different tokenizers, parsers, taggers, stemmers, lemmatizers, and corpora. The Stanford parser which he references in this post is a parser which can be integrated to work with NLTK, but it seems the methods which he discusses are more accurate and much faster than anything included in NLTK.

Also, the work he describes requires that the text already be tokenize and POS-tagged which will add some time, but not too much (and are also functions which can be performed by NLTK).

It's also worth noting that the Stanford parser is a Java software, so using it from within NLTK involves calling Java from Python.

As a Python newbie, just curious:

Python3 came out in 2008. Right now the year is 2014. Assuming this is pretty new code, what reason could there possibly be for not using Python3 for this?

You're free to try and see if the code also works in Python3.

Python3 used to come without a lot of the "batteries" that make Python a useful language for science-y stuff (numpy, matplotlib, Cython),and the 'improvements' that Python3 brings are not big enough that people would switch over.

Contrast this to C++11, which brings real improvements to pain points that existed before (i.e., areas of the STL that ought to have been standardized but were not).

Contrast this to Java 6 (Generics) and Java 8 (Lambdas) which solve actual perceived pain points that many people who program in Java are feeling.

The biggest pain point in Python2-the-language isn't any missing language feature -- most people are happy with those since at least 2.6. Instead, it's speed, and people are indeed transitioning part of their programs from Python2 to Cython. Python3 doesn't do anything for speed.

No need to try, even a beginner can see that it won't.

I should really have written the code to be Python 2/3 compatible.

I've not really worked with Python 3 yet, and I've regretted not basing a recent project, http://cloze.it , on it. I was getting unicode problems that Python 3 addresses pretty well.

I'm not a python user myself but from what I understand some people are rather partial to particular versions of python. I think 2.7 is very popular. I think there also may have been some backward compatibility issues in python 3 that some people were unhappy with. Perhaps someone with more knowledge on the matter can chime in.

That question is best asked and answered in other forums. OT.

On a related note, parsing Chinese could be much harder since the first step, identifying word boundaries [1], is already hard enough...

1. https://en.wikipedia.org/wiki/Text_segmentation

State of the art research indicates that joint processing (segmentation, tagging and syntactic analysis) achieves better results than a pipeline model. For an example, see Hatori et al: http://aclweb.org/anthology//P/P12/P12-1110.pdf

Also, OPs model only runs unlabeled dependency parsing. Most applications require labeled dependency parsing, which is much harder. State of the art results for English are currently ~93% established by Joakim Nivre and Yue Zhang in http://www.sutd.edu.sg/cmsresource/faculty/yuezhang/acl11j.p... and based on the zpar parser framework (see http://www.cl.cam.ac.uk/~sc609/pubs/cl11_early.pdf ).

zpar ( http://sourceforge.net/projects/zpar/ ) is the fastest dependency parser I am aware of, and it achieves lower parsing rates.

In all papers, note how many more feature templates are specified. More recent work contains yet another order of magnitude more feature templates. I'm betting python (w/ or w/o Cython) won't last very long as competition.

All that being said, the most significant problem in this part of NLP is that the best corpus files required for training modern accurate models are very expensive to license for both research and commercial purposes (tens if not hundreds of thousands of $s).

The Cython parser is basically an implementation of zpar, and achieves slightly faster run-times, and slightly better accuracy (I've added some extra features, and made some nips and tucks).

Note that the Stanford label set has 40 labels, so there are about 80 classes to evaluate. The Penn2Malt scheme has 20 labels, so you need to be careful which dependency scheme is being referenced when run-time figures are reported.

The way the run-time cost works is, if you extract f features per transition, for c classes, with a beam of size k and n words, you make O(cfkn) feature-lookups, which is the main cost.

For the parser.py implementation, most of the speed is coming from greedy search (k=1), and low number of classes (c=3, instead of c=80). Number of feature templates, f, is similar between this parser and zpar. We could add some more templates for label features, and do labelled parsing, and gain about 1% accuracy here, at the cost of being about 40x slower. The only reason I didn't was that it complicates the implementation and presentation slightly. The implementation was all about the blog post.

The Cython parser does everything with C data structures, which are manually memory managed. I don't think I'm paying any language overhead compared to a C++ implementation. So you're absolutely right that as more feature templates stack on, and you use more dependency labels, speed goes down. But, the Cython parser has no problem relative to zpar in this respect.

You can do better than O(cfkn). You can precompute the results of some features seen together (i.e. S0w/S0t/S0wt), and store a ref to them on the words. This saves feature lookups per feature. If you store them as vectors for the classes, you can get a nice perf. boost from SIMD.. you can knock off at least an order of magnitude if not two. something like O(c'f'kn), where c' is the # of SIMD ops to aggregate pre computed weight vectors and f' is the number of feature template groups. Also Yoav Goldberg mentioned feature signatures in one of his papers which can do a bit better.

A significant problem with unlabeled dep. parsing is that you can't differentiate important things like subject vs object dependents. In the sentence "They ate the pizza with anchovies.", how would a program distinguish between 'they' as the subject and 'pizza' as the object? In other words, who ate what?

Yes, you're probably aware of this paper, right?

Goldberg et al, "Efficient Implementation of Beam-Search Incremental Parsers". ACL 2013. http://www.aclweb.org/anthology/P13-2111

I haven't been able to work out how to do the feature caching in a way that won't ruin my implementation when I need to add more features.

I also get substantial benefit at high k from hashing the "kernel tokens" and memoising the score for the state.

I did try the tree-structured stack that they recommend, but I didn't find any run-time benefits from it, and the implementation kept confusing me. I might have made a mistake, but I suspect it's because my state arrays are copied with low-level malloc/free/memcpy, where they pay Python overhead on their copies.


I didn't see noticeable improvements from TSS either. I did some performance tuning - much more time goes to feature extraction and scoring. Can you elaborate on what you mean by 'hashing the "kernel tokens" and memoising the score for the state'? Are the kernel tokens something like the head of stack/queue?

For feature caching, I went with a generic model for a feature template as a combination of feature elements (for features like S0t+Q0t+Q1t) that have a closed set, so the feature template is limited to a set that is a cartesian product of the elements' sets. When you initialise parsing for a new sentence, you can select a subset of the possibilities to generate a "submodel" for only that sentence. That way you need much less memory. If you can pack it properly you can get a lot of it into the lower level caches which should allow for significant speed up.

Thanks for the explanation of how your cache works. Will you be at ACL this year?

The memoisation I refer to is called here:


What happens is, I extract the set of token indices for S0, N0, S0h, S0h2, etc, into a struct SlotTokens. SlotTokens is sufficient to extract the features, so I can use its hash to memoise an array of class scores. Cache utilisation is about 30-40% even at k=8.

While I'm here...


The big enum names all of the atomic feature values that I extract, and places their values into an array, context. So context[S0w] contains the word of the token on top of the stack.

I then list the actual features as tuples, referring to those values. So I can write a group of features with something like new_features = ((S0w, S0p), (S0w,), (S0p,)). That would add three feature templates: one with the word plus the POS tag, one with just the word, one with just the POS tag.

A bit of machinery in features.pyx then takes those Python feature definitions, and compiles them into a form that can be used more efficiently.

I won't be at ACL this year, maybe next. My advisor will be there though (Reut Tsarfaty).

First things first: hats off on a concise, accurate and efficient implementation in python.

But the Cython parser is not doing labeled parsing.. a significant number of feature templates are relevant to labeled parsing.

To reach Nivre and Zhang's 2011 results I think you'll find labeled parsing will degrade performance significantly and you'll need more complex code

1. you have more than an order of magnitude more transitions (80 vs 4)

2. you need to add beam search, with early update for training. beam size = 64 to be consistent with zpar

3. an order of magnitude more feature templates, with more complex features

So now you'll be evaluating 20x more transitions, on ~64x more candidates, with ~10x more feature templates. You can't multithread properly in python, you can maybe get some more juice by using some form of SIMD (w/ or w/o GPU). Your model is much larger in memory, so any luck you had in L2/L3 caching is probably gone. Python has limits and this is the point you start hitting them.

Edit: I see you've updated your comment, I guess we're on the same page. It's nice to see a concise implementation. I wrote one in golang for my masters thesis. I achieved perfect parity with zpar, but it took two order of magnitude more LOC.

Sorry for the lack of clarity. The Cython parser results that I quoted uses labelled parsing with beam search. There are various versions in the repo, but there's definitely a configuration that runs the Zhang and Nivre (2011) experiment verbatim, with exactly the same feature templates, and the same results that they get. Replicating those results was essential to implementing the parser correctly.

Apologies, could you add yet more clarity? According to the blog post your results are:

Parser Accuracy Speed (w/s) Language LOC

Stanford 89.6% 19 Java > 50,000[1]

parser.py 89.8% 2,020 Python ~500

Redshift 93.6% 2,580 Cython ~4,000

Are these the labeled parsing results you are referring to? How many sents/sec? Using same PTB data sets as Zhang and Nivre '11?

Those are Unlabelled Accuracy score results, but Redshift is quietly computing the dependency labels, while parser.py does not. Running Redshift in unlabelled mode gives very fast parse times, but about 1% less accuracy. The labels are really good, both as features and as a way to divide the problem space.

The data sets are the _Stanford_ labels, where the main results in Zhang and Nivre refer to MALT labels. Z&N do provide a single Stanford accuracy in their results, of 93.5% UAS.

Sentences per second should be just over 100. I use k=8 with some extra features referring to words further down the stack, where Z&N use k=64. Right at the bottom of the post, you can find the commit SHA and the commands for the experiment.

Hi, how does this parser compare to clearnlp? it's supposed to be also super fast and very accurate to?

Segmentation is a big issue for Chinese, but actually incremental parsers have been doing promising things here, because you can do the segmentation jointly with the parsing.

I'd start with Yue Zhang's papers for work on this: http://www.sutd.edu.sg/yuezhang.aspx

We've been working on this problem for a while - and have a pretty solid solution.

Check us out https://www.repustate.com

We also do this for Arabic, which is equally challenging.

>Repustate will tell extract the opinions contained within your text...

What does that mean, "tell extract"? Not familiar with this terminology.

I couldn't find any publication on the RedShift system that is part of the author's current work. Any pointers?

If you're looking for the academic papers, my Google Scholar page is here: http://scholar.google.com.au/citations?user=FXwlnmAAAAAJ&hl=... . The 2013 paper with Yoav Goldberg and Mark Johnson, and the 2014 paper with Mark Johnson are the two main things I've published with the system. I have another paper almost ready for publication, too.

Yes, that was what i was looking for, thanks. I was interested in details on the evaluation and why the Stanford parser is still widely accepted as state-of-art, especially by non-NLP researchers that want to try some NLP features to work with. By the way, what is a typical use-case for unlabled parses?

Well, the subtle point is that the Stanford parser really _is_ a fine choice for a lot of experiments...Even while it's far from state-of-the-art!

For researchers outside of NLP, it's often actually worse to have your parser be 2% better than the previous work, for reasons your readers don't care about and you can't easily explain. If your readers have heard of the Stanford parser, and previous work has used it, it's likely a good choice for your experiment.

Basically, if people are always using the new hotness outside of NLP, then those non-NLP researchers have to keep learning the new hotness! Ain't nobody got time for that.

I do think we're at a good "save point", though, where we should get people updated to the new technologies. Hence the blog post :)

As for use-cases, mostly people will use labelled dependency parses, because why not? And they're mostly used inside other NLP research, for instance I've been working on detecting disfluencies in conversational speech, there's increasing work on using this stuff in translation, information extraction, etc.

Really interesting.

Does it work on Windows too / does it rely on Unix-only constructs?

It should work fine.

...for some value of "parsing"

How many LOC in the libraries used to be able to write the implementation in 500 lines of Python?

Why not include some experiments with Lua and lpeg?

It would probably be faster than Java or Python.

And arguably Lua is easier to learn.

Maybe the work required (one-time cost) would be rewarded with significant gains.

As you can easily check yourself, it uses no external library whatsoever besides Python standard library. So the answer to the first question is zero.

The standard libraries count as an external library.

Why wouldn't they?

In most cases, the Python standard library is bundled with Python.

When I install Python I get about 14MB of stuff.

The interpreter is about 8K.

The libpython shared library is about 1.5MB.

If what you are referring to as the standard library is in that 1.5MB, then disregard my comment on LOC.

If it's in that remaining 12MB or so of stuff, then I'm wondering if LOC counts should include what is in there that is required for these programs to run.

Look at it this way. If I download 12MB of code and then I write 500 lines, does that mean I am a master of writing small, compact code?

Sure, if you ignore the 12MB I had to download first.

I'm not singling out Python. Perl, Ruby, etc. are equally large.

The point is you are downloading 1000's of LOC to enable you to write "short" programs.

Nothing wrong with that. But those 12MB that were needed beforehand... should we just ignore all that when we count LOC?

Maybe one has to do embedded work to have an appreciation for memory and storage limitations and thus the sheer size of these scripting libraries.

Yes, we should ignore library code LOC as there's no associated cognitive/maintenance overhead, which is what we are really trying to count. I have happily used (C)Python for a decade without peaking at the source. Same goes for, say, math.h

Interesting opinion.

The "overhead" I'm concerned with is based in hardware, not my own creativity.

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact