
Parsing English with 500 lines of Python (2013) - adamnemecek
http://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
======
danso
It's worth checking out the OP's previous post on this: _A good POS tagger in
about 200 lines of Python_ : [http://honnibal.wordpress.com/2013/09/11/a-good-
part-of-spee...](http://honnibal.wordpress.com/2013/09/11/a-good-part-of-
speechpos-tagger-in-about-200-lines-of-python/)

The OP says his PyGreedyAP gets 96.8% accuracy in 12s (vs NLTK's 94% at 236s)

------
voltagex_
How does this relate/compare to NLTK?

[http://www.nltk.org/](http://www.nltk.org/)

~~~
syllogism
The truth is nltk is basically crap for real work, but there's so little NLP
software that's put proper effort into documentation that nltk still gets a
lot of use.

You can work your way down the vast number of nltk modules, and you'll find
that almost none of them are useful for real work, and those that are ship a
host of alternative implementations that are all much worse than the current
state of the art.

nltk makes most sense as a teaching tool, but even then it's mostly out of
date. The chapter on "Parsing" in the nltk book doesn't even really deal with
statistical parsing. The dependency parsing work referenced in this post is
almost all 1-3 years old, so obviously it isn't covered either.

As an integration layer, nltk is so much more trouble than it's worth. You can
use it to compute some of your scoring metrics, or read in a corpus, but...why
bother?

I'm slowly putting together an alternative, where you get exactly one
tokeniser, exactly one tagger, etc. All these algorithms have the same i/o, so
we shouldn't ask a user to choose one. We should just provide the best one.
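
A rough sketch of the shape I have in mind (every name below is an
illustrative stand-in, not code from a released package): one callable per
task, one pipeline, no menu of alternatives.

    # Hypothetical sketch of a "one best tool per task" API -- stand-in
    # names only. The user never chooses between competing tokenisers or
    # taggers; there is exactly one of each, behind a single entry point.

    def tokenize(text):
        # stand-in for "the" tokeniser: naive whitespace split
        return text.split()

    def tag(tokens):
        # stand-in for "the" tagger: calls everything a noun
        return [(tok, "NN") for tok in tokens]

    def parse(tagged):
        # stand-in for "the" parser: attaches each word to the previous one
        return [(i, i - 1) for i, _ in enumerate(tagged)]

    def nlp(text):
        """Single entry point: no alternatives to choose between."""
        tokens = tokenize(text)
        tagged = tag(tokens)
        return tokens, tagged, parse(tagged)

    print(nlp("Parsing English with Python"))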

My previous post, on POS tagging, shows that nltk's POS tagger is incredibly
slow, and not very accurate:
[https://honnibal.wordpress.com/2013/09/11/a-good-part-of-
spe...](https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-
tagger-in-about-200-lines-of-python/) . nltk scores 94% in 3m56s; my 200-line
implementation scores 96.8% in 12s.

I used to use nltk for tokenisation and sentence-boundary detection, but this
library seems better for that:
[https://code.google.com/p/splitta/](https://code.google.com/p/splitta/)

~~~
shabadoop
I was thinking about working through the NLTK book once I'm finished with
Bishop's Pattern Recognition. Would you be able to recommend an alternative?

~~~
habeanf
_Dependency Parsing_ by Nivre et al. was a good source for catching up from an
NLP course to state-of-the-art [http://www.amazon.com/Dependency-Synthesis-
Lectures-Language...](http://www.amazon.com/Dependency-Synthesis-Lectures-
Language-
Technologies/dp/1598295969/ref=sr_1_1?ie=UTF8&qid=1398699731&sr=8-1&keywords=dependency+parsing)

~~~
syllogism
It's still tough to recommend that, imo. If you could choose to beam it
straight into your head? Yeah, go ahead. But working through a book takes a
lot of time...It only gives you dependency parsing, and then you have to catch
up on the last five years of dependency parsing.

------
natch
As a Python newbie, just curious:

Python3 came out in 2008. Right now the year is 2014. Assuming this is pretty
new code, what reason could there possibly be for not using Python3 for this?

~~~
sqrt17
You're free to try and see if the code also works in Python3.

Python3 used to come without a lot of the "batteries" that make Python a
useful language for science-y stuff (numpy, matplotlib, Cython), and the
'improvements' that Python3 brings are not big enough that people would switch
over.

Contrast this to C++11, which brings real improvements to pain points that
existed before (e.g., areas of the STL that ought to have been standardized
but were not).

Contrast this to Java 5 (generics) and Java 8 (lambdas), which solve actual
perceived pain points that many people who program in Java are feeling.

The biggest pain point in Python2-the-language isn't any missing language
feature -- most people have been happy on that front since at least 2.6.
Instead, it's speed, and people are indeed moving parts of their programs from
Python2 to Cython. Python3 doesn't do anything for speed.

~~~
natch
No need to try; even a beginner can see that it won't.

------
xiaq
On a related note, parsing Chinese could be much harder since the first step,
identifying word boundaries [1], is already hard enough...

1.
[https://en.wikipedia.org/wiki/Text_segmentation](https://en.wikipedia.org/wiki/Text_segmentation)
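
For instance, the same character string can often be cut more than one way (a
stock textbook example, not from the linked article):

    # -*- coding: utf-8 -*-
    # Toy illustration of Chinese segmentation ambiguity (a stock example,
    # not from the article). The same four characters segment two ways:
    #   新西兰 / 花   -> "New Zealand" + "flower(s)"
    #   新 / 西兰花   -> "new" + "broccoli"
    # A segmenter has to commit to one before tagging or parsing can start.
    candidates = [
        [u"新西兰", u"花"],   # New Zealand flowers
        [u"新", u"西兰花"],   # new broccoli
    ]
    for segmentation in candidates:
        print(u" / ".join(segmentation))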

~~~
habeanf
State-of-the-art research indicates that joint processing (segmentation,
tagging and syntactic analysis) achieves better results than a pipeline model.
For an example, see Hatori et al.:
[http://aclweb.org/anthology//P/P12/P12-1110.pdf](http://aclweb.org/anthology//P/P12/P12-1110.pdf)

Also, the OP's model only does unlabeled dependency parsing. Most applications
require labeled dependency parsing, which is much harder. State-of-the-art
results for English are currently ~93%, established by Joakim Nivre and Yue
Zhang in
[http://www.sutd.edu.sg/cmsresource/faculty/yuezhang/acl11j.p...](http://www.sutd.edu.sg/cmsresource/faculty/yuezhang/acl11j.pdf)
and based on the zpar parser framework (see
[http://www.cl.cam.ac.uk/~sc609/pubs/cl11_early.pdf](http://www.cl.cam.ac.uk/~sc609/pubs/cl11_early.pdf)
).

zpar (
[http://sourceforge.net/projects/zpar/](http://sourceforge.net/projects/zpar/)
) is the fastest dependency parser I am aware of, and even it achieves lower
parsing rates than the OP reports.

In all these papers, note how many more feature templates are specified. More
recent work uses yet another order of magnitude more feature templates. I'm
betting Python (with or without Cython) won't last very long as competition.

All that being said, the most significant problem in this part of NLP is that
the best corpora required for training modern, accurate models are very
expensive to license for both research and commercial purposes (tens if not
hundreds of thousands of dollars).

~~~
syllogism
The Cython parser is basically an implementation of zpar; it achieves
slightly faster run-times and slightly better accuracy (I've added some extra
features and made some nips and tucks).

Note that the Stanford label set has 40 labels, so there are about 80 classes
to evaluate (each label yields both a left-arc and a right-arc transition, on
top of shift and reduce). The Penn2Malt scheme has 20 labels, so you need to
be careful which dependency scheme is being referenced when run-time figures
are reported.

The way the run-time cost works is, if you extract f features per transition,
for c classes, with a beam of size k and n words, you make O(cfkn) feature-
lookups, which is the main cost.

For the parser.py implementation, most of the speed is coming from greedy
search (k=1) and the low number of classes (c=3, instead of c=80). The number
of feature templates, f, is similar between this parser and zpar. We could add
some more templates for label features, do labelled parsing, and gain about 1%
accuracy here, at the cost of being about 40x slower. The only reason I didn't
was that it complicates the implementation and presentation slightly. The
implementation was all about the blog post.
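
To make the arithmetic concrete, here's a back-of-the-envelope count of
feature lookups per sentence (f=100 templates is an illustrative guess, not a
measured figure; c, k and n are as above):

    # Rough feature-lookup counts per sentence: O(c * f * k * n).
    # f = 100 templates is an assumed, illustrative figure.
    def lookups(c, f, k, n):
        return c * f * k * n

    n = 20   # words in the sentence
    f = 100  # feature templates (assumption)

    greedy_unlabelled = lookups(c=3, f=f, k=1, n=n)    # parser.py: 6,000
    greedy_labelled = lookups(c=80, f=f, k=1, n=n)     # 160,000 (~27x; extra
                                                       # label templates push
                                                       # this toward ~40x)
    beam_labelled = lookups(c=80, f=f, k=8, n=n)       # zpar-style: 1,280,000

    print(greedy_unlabelled, greedy_labelled, beam_labelled)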

The Cython parser does everything with C data structures, which are manually
memory-managed. I don't think I'm paying any language overhead compared to a
C++ implementation. So you're absolutely right that as more feature templates
stack up, and you use more dependency labels, speed goes down. But the Cython
parser has no problem relative to zpar in this respect.

~~~
habeanf
You can do better than O(cfkn). You can precompute the results of some
features that are always seen together (e.g. S0w/S0t/S0wt) and store a
reference to them on the words, which saves lookups per feature. If you store
them as vectors over the classes, you can get a nice performance boost from
SIMD; you can knock off at least an order of magnitude, if not two. Something
like O(c'f'kn), where c' is the number of SIMD ops needed to aggregate the
precomputed weight vectors and f' is the number of feature template groups.
Yoav Goldberg also mentioned feature signatures in one of his papers, which
can do a bit better.
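
Roughly, in NumPy terms (shapes and names here are mine, just to show the
idea of folding several features' weights into one per-word vector):

    # Sketch of precomputing combined-feature weight vectors (illustrative
    # shapes/names). Instead of looking up S0w, S0t and S0wt separately for
    # every class at every transition, fold them once, per word, into one
    # dense vector of length n_classes; scoring then becomes a few vector
    # additions, which NumPy/SIMD handles cheaply.
    import numpy as np

    n_classes = 80
    rng = np.random.RandomState(0)

    # Pretend model weights: one vector over classes per (feature, value).
    weights = {
        ("S0w", "pizza"): rng.randn(n_classes),
        ("S0t", "NN"): rng.randn(n_classes),
        ("S0wt", ("pizza", "NN")): rng.randn(n_classes),
    }

    def precompute(word, tag):
        """Fold the word's S0w/S0t/S0wt weights into one vector, once."""
        return (weights[("S0w", word)]
                + weights[("S0t", tag)]
                + weights[("S0wt", (word, tag))])

    cached = precompute("pizza", "NN")   # stored on the token up front

    # At parse time, scoring aggregates cached vectors (plus those for N0,
    # S0h, ... in a real parser) and takes the argmax over classes.
    scores = cached
    best_class = int(np.argmax(scores))
    print(best_class)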

A significant problem with unlabeled dependency parsing is that you can't
differentiate important things like subject vs. object dependents. In the
sentence "They ate the pizza with anchovies.", how would a program distinguish
between 'they' as the subject and 'pizza' as the object? In other words, who
ate what?
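
As plain data, the difference looks like this (the arcs are my reading of the
sentence; the labels follow the usual Stanford-style nsubj/dobj convention):

    # Unlabelled vs labelled arcs for "They ate the pizza with anchovies."
    # (head, dependent) pairs only say two words are connected; the label
    # is what records who did the eating and what got eaten.
    sentence = ["They", "ate", "the", "pizza", "with", "anchovies"]

    unlabelled = [(1, 0), (1, 3)]          # ate->They, ate->pizza
    labelled = [(1, 0, "nsubj"),           # They  = subject of ate
                (1, 3, "dobj")]            # pizza = object of ate

    print(unlabelled)
    for head, dep, label in labelled:
        print("%s --%s--> %s" % (sentence[head], label, sentence[dep]))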

~~~
syllogism
Yes, you're probably aware of this paper, right?

Goldberg et al, "Efficient Implementation of Beam-Search Incremental Parsers".
ACL 2013.
[http://www.aclweb.org/anthology/P13-2111](http://www.aclweb.org/anthology/P13-2111)

I haven't been able to work out how to do the feature caching in a way that
won't ruin my implementation when I need to add more features.

I also get substantial benefit at high k from hashing the "kernel tokens" and
memoising the score for the state.

I did try the tree-structured stack that they recommend, but I didn't find any
run-time benefits from it, and the implementation kept confusing me. I might
have made a mistake, but I suspect it's because my state arrays are copied
with low-level malloc/free/memcpy, whereas they pay Python overhead on their
copies.

~~~
habeanf
Yes.

I didn't see noticeable improvements from TSS either. I did some performance
tuning -- much more time goes to feature extraction and scoring. Can you
elaborate on what you mean by 'hashing the "kernel tokens" and memoising the
score for the state'? Are the kernel tokens something like the head of
stack/queue?

For feature caching, I went with a generic model where a feature template is a
combination of feature elements (for features like S0t+Q0t+Q1t) that each have
a closed set of values, so the feature template is limited to the Cartesian
product of the elements' sets. When you initialise parsing for a new sentence,
you can select a subset of the possibilities to generate a "submodel" for only
that sentence. That way you need much less memory. If you can pack it
properly, you can get a lot of it into the lower-level caches, which should
allow for a significant speed-up.
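
Roughly, in code (the tag set and weights below are toy stand-ins):

    # Toy sketch of the per-sentence "submodel" idea. A template like
    # S0t+Q0t+Q1t ranges over a closed set -- the Cartesian product of the
    # elements' tag sets -- so before parsing a sentence we can keep only
    # the combinations its tags can actually produce and pack them into a
    # small, cache-friendly table.
    from itertools import product

    TAGS = ["NN", "VB", "DT", "IN"]          # closed POS tag set (toy)
    full_model = {combo: hash(combo) % 97    # fake weight per combination
                  for combo in product(TAGS, TAGS, TAGS)}

    def submodel_for(sentence_tags):
        """Keep only the S0t+Q0t+Q1t combinations this sentence can use."""
        present = set(sentence_tags)
        return {combo: w for combo, w in full_model.items()
                if all(t in present for t in combo)}

    sent_tags = ["DT", "NN", "VB", "DT", "NN"]
    small = submodel_for(sent_tags)
    print(len(full_model), "->", len(small))   # 64 -> 27 combinations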

~~~
syllogism
Thanks for the explanation of how your cache works. Will you be at ACL this
year?

The memoisation I refer to is called here:

[https://github.com/syllog1sm/redshift/blob/segmentation/reds...](https://github.com/syllog1sm/redshift/blob/segmentation/redshift/parser.pyx#L183)

What happens is, I extract the set of token indices for S0, N0, S0h, S0h2,
etc., into a struct, SlotTokens. SlotTokens is sufficient to extract the
features, so I can use its hash to memoise an array of class scores. Cache
utilisation is about 30-40% even at k=8.
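
In plain Python the pattern is roughly this (the real version hashes a C
struct; the state dicts and scorer below are stand-ins):

    # Rough Python equivalent of the memoisation. The real code hashes a C
    # struct (SlotTokens); this stand-in keys the cache on a plain tuple.
    score_cache = {}

    def kernel_tokens(state):
        # The token indices the features depend on (S0, N0, S0h, S0h2, ...).
        return (state["S0"], state["N0"], state["S0h"], state["S0h2"])

    def class_scores(state, score_fn):
        key = kernel_tokens(state)
        if key not in score_cache:
            # Only pay for feature extraction + scoring on a cache miss.
            score_cache[key] = score_fn(state)
        return score_cache[key]

    # Two beam candidates with identical kernel tokens share one score array.
    dummy_scorer = lambda state: [0.1, 0.7, 0.2]
    s1 = {"S0": 3, "N0": 4, "S0h": 1, "S0h2": 0}
    s2 = dict(s1)
    class_scores(s1, dummy_scorer)
    class_scores(s2, dummy_scorer)      # cache hit: scorer not called again
    print(len(score_cache))             # 1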

While I'm here...

[https://github.com/syllog1sm/redshift/blob/segmentation/reds...](https://github.com/syllog1sm/redshift/blob/segmentation/redshift/_parse_features.pyx)

The big enum names all of the atomic feature values that I extract, and places
their values into an array, context. So context[S0w] contains the word of the
token on top of the stack.

I then list the actual features as tuples, referring to those values. So I can
write a group of features with something like new_features = ((S0w, S0p),
(S0w,), (S0p,)). That would add three feature templates: one with the word
plus the POS tag, one with just the word, one with just the POS tag.

A bit of machinery in features.pyx then takes those Python feature
definitions, and compiles them into a form that can be used more efficiently.
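
In miniature, the scheme looks like this (a simplified stand-in for what
_parse_features.pyx and features.pyx actually do with C arrays):

    # Simplified stand-in for the enum + tuple-template scheme. The indices
    # name atomic values; templates are tuples of indices into the context
    # array; "extraction" just gathers the named values for each template.
    S0w, S0p, N0w, N0p = 0, 1, 2, 3          # indices into the context array
    CONTEXT_SIZE = 4

    new_features = ((S0w, S0p),              # word + POS of stack top
                    (S0w,),                  # just the word
                    (S0p,))                  # just the POS tag

    def fill_context(context, stack_top, buffer_front):
        context[S0w], context[S0p] = stack_top
        context[N0w], context[N0p] = buffer_front

    def extract(context, templates):
        return [tuple(context[i] for i in template) for template in templates]

    context = [None] * CONTEXT_SIZE
    fill_context(context, ("pizza", "NN"), ("with", "IN"))
    print(extract(context, new_features))
    # [('pizza', 'NN'), ('pizza',), ('NN',)]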

~~~
habeanf
I won't be at ACL this year, maybe next. My advisor will be there though (Reut
Tsarfaty).

------
bjoernbu
I couldn't find any publication on the RedShift system that is part of the
author's current work. Any pointers?

~~~
syllogism
If you're looking for the academic papers, my Google Scholar page is here:
[http://scholar.google.com.au/citations?user=FXwlnmAAAAAJ&hl=...](http://scholar.google.com.au/citations?user=FXwlnmAAAAAJ&hl=en)
. The 2013 paper with Yoav Goldberg and Mark Johnson, and the 2014 paper with
Mark Johnson are the two main things I've published with the system. I have
another paper almost ready for publication, too.

~~~
bjoernbu
Yes, that was what I was looking for, thanks. I was interested in details on
the evaluation and why the Stanford parser is still widely accepted as state
of the art, especially by non-NLP researchers who want some NLP features to
work with. By the way, what is a typical use-case for unlabeled parses?

~~~
syllogism
Well, the subtle point is that the Stanford parser really _is_ a fine choice
for a lot of experiments... even though it's far from state-of-the-art!

For researchers outside of NLP, it's often actually worse to have your parser
be 2% better than the previous work, for reasons your readers don't care about
and you can't easily explain. If your readers have heard of the Stanford
parser, and previous work has used it, it's likely a good choice for your
experiment.

Basically, if people are always using the new hotness outside of NLP, then
those non-NLP researchers have to keep learning the new hotness! Ain't nobody
got time for that.

I do think we're at a good "save point", though, where we should get people
updated to the new technologies. Hence the blog post :)

As for use-cases, mostly people will use labelled dependency parses, because
why not? And they're mostly used inside other NLP research. For instance, I've
been working on detecting disfluencies in conversational speech, and there's
increasing work on using this stuff in translation, information extraction,
etc.

------
nymph
...for some value of "parsing"

------
frik
Really interesting.

Does it work on Windows too / does it rely on Unix-only constructs?

~~~
adamnemecek
It should work fine.

------
uuid_to_string
How many LOC are in the libraries used to make it possible to write the
implementation in 500 lines of Python?

Why not include some experiments with Lua and lpeg?

It would probably be faster than Java or Python.

And arguably Lua is easier to learn.

Maybe the work required (one-time cost) would be rewarded with significant
gains.

~~~
sanxiyn
As you can easily check yourself, it uses no external library whatsoever
besides the Python standard library. So the answer to the first question is
zero.

~~~
uuid_to_string
The standard library counts as an external library.

Why wouldn't it?

~~~
inportb
In most cases, the Python standard library is bundled with Python.

~~~
uuid_to_string
When I install Python I get about 14MB of stuff.

The interpreter is about 8K.

The libpython shared library is about 1.5MB.

If what you are referring to as the standard library is in that 1.5MB, then
disregard my comment on LOC.

If it's in that remaining 12MB or so of stuff, then I'm wondering if LOC
counts should include what is in there that is required for these programs to
run.

Look at it this way. If I download 12MB of code and then I write 500 lines,
does that mean I am a master of writing small, compact code?

Sure, if you ignore the 12MB I had to download first.

I'm not singling out Python. Perl, Ruby, etc. are equally large.

The point is you are downloading 1000's of LOC to enable you to write "short"
programs.

Nothing wrong with that. But those 12MB that were needed beforehand... should
we just ignore all that when we count LOC?

Maybe one has to do embedded work to have an appreciation for memory and
storage limitations and thus the sheer size of these scripting libraries.

~~~
kylebgorman
Yes, we should ignore library code LOC, as there's no associated
cognitive/maintenance overhead, which is what we are really trying to count. I
have happily used (C)Python for a decade without peeking at the source. Same
goes for, say, math.h.

~~~
uuid_to_string
Interesting opinion.

The "overhead" I'm concerned with is based in hardware, not my own creativity.

