
Dead Code Should Be Buried – Why I Didn't Contribute to NLTK - Smerity
http://spacy.io/blog/dead-code-should-be-buried/
======
stevenbird
NLTK has an active and growing developer community. We're grateful to Matthew
Honnibal for permission to port his averaged perceptron tagger, and it's now
included in NLTK 3.1.

Note that NLTK includes reference implementations for a range of NLP
algorithms, supporting reproducibility and helping a diverse community to get
into NLP. We provide interfaces for standard NLP tasks, and an easy way to
switch from using pure Python implementations to using wrappers for external
implementations such as the Stanford CoreNLP tools. We're adding "scaling up"
sections to the NLTK book to show how this is done.

[https://github.com/nltk/nltk](https://github.com/nltk/nltk) |
[https://pypi.python.org/pypi/nltk](https://pypi.python.org/pypi/nltk) |
[http://www.nltk.org/book_2ed/ch05.html#scaling-up](http://www.nltk.org/book_2ed/ch05.html#scaling-up)

------
ben336
I hate this genre of post that basically follows the line: "I went to
<established project> and attempted to educate them. When they didn't listen I
went and built something better. Now it's clear they should have listened to
me, and you should all abandon their software"

Almost always the scope of the new project is much smaller, different or much
less mature than the project being bashed. Open source projects are not
required to make changes to please any arbitrary user that wants to make
changes, even if it's to bring technical improvements.

In NLTK's case, they have a whole book written around their project.
Presumably significant changes to project structure and function would mean
heavy documentation/writing work, and might not fit the goals of their
project. Bashing them as a result just shows a complete lack of understanding
of how/why people write and maintain software.

~~~
Smerity
I disagree strongly. This is the same split as Linux vs Minix: Minix refused
additions because it was said to be for educational purposes, even though that
didn't reflect how it was actually used.

The author points out that whilst the stated aim of NLTK is for education,
it's used for far more than that in industry and academia. You'll see it used
in papers, you'll see it as the basis of real world projects, etc. This
presents a problem if the aims of the project are different from how the
project is used.

The biggest flag for me is, as pointed out in the blog post, when the project
doesn't even know how the part of speech (POS) model was trained[1]. That
means a lack of reproducibility[2]. Given POS tags are the first level of
almost any NLP task, this is deeply troubling.

[1]: "Where did the NLTK pos_tag model come from?"
[https://github.com/nltk/nltk/issues/1063](https://github.com/nltk/nltk/issues/1063)

[2]: The POS tags from NLTK are used for many papers and research - see
[https://scholar.google.com/scholar?as_ylo=2015&q=nltk&hl=en&...](https://scholar.google.com/scholar?as_ylo=2015&q=nltk&hl=en&as_sdt=0,5)

~~~
wisty
NLTK in research is probably mostly used as glue: its corpus interface and its
standard wrappers for common libraries. Everyone using it for research will do
something like "I used data from NLTK, pushed it through my custom parser, and
here's how it compares to the wrapped parsers that NLTK also interfaces
with".

That's why the maintainer said, basically, "nope, we only implement the
standard algorithms". Most of the researchers want to get standard data, and
compare their new algorithm to the standard algorithms that every other
researcher uses.

~~~
syllogism
My blog post does explain the standard algorithms! Just, the ones that are
standard _now_. That reply actually made no sense. I guess the maintainer
thought that my post described novel research. It didn't.

There's now a ticket to implement the dynamic oracle, as I recommended:
[https://github.com/nltk/nltk/issues/905](https://github.com/nltk/nltk/issues/905)

------
Radim
Very relatable post. Isn't NLTK primarily a teaching / demonstration tool
though?

I just checked their website, and the claim that _"NLTK is a leading platform
for building Python programs to work with human language data... a suite of
libraries for classification, tokenization, stemming, tagging, parsing, and
semantic reasoning"_ does sound a little odd. But I think everyone in the
industry knows NLTK's place and purpose -- you practically cannot avoid
finding out quickly. NLTK's scope is clearly too broad for it to be
meaningfully cutting edge at any one thing.

New libraries and implementations will always have an advantage. It's easier
to tout "simplicity and leanness" when you don't have to carry over all the
baggage and backward compatibility accumulated over the years.

For that reason, an occasional "complexity reset" is expected, and if a
library would not or can not do it, another library will. Will SpaCy's fate be
different, 10 years down the road?

~~~
syllogism
> Will SpaCy's fate be different, 10 years down the road?

Yes --- because I consolidate my algorithms and delete dead code. I've
probably written five or six times as much code as currently lives in spaCy.

I hope by then spaCy will be smaller, not bigger, as we reach a more concise
understanding of how to actually solve the problem. For instance, it's
reasonable to expect the boundary between the POS tagger, parser and entity
recogniser to disappear, in the same way that spaCy doesn't feature a separate
chunker or sentence boundary detector. I read these annotations off the parse
tree.

~~~
bobbyi_settv
A library that is constantly changing and removing "dead" code is a library
nobody can rely on for production applications.

At first the complaint about NLTK was that it was too academic and not
appropriate for real-world code, but no real-world code is going to rely on an
unreliable library that keeps changing how it works.

~~~
syllogism
You can maintain the API while overhauling the models underneath. spaCy so far
has had almost no API breakages.

For instance, you get sentences as follows:

    
    
    doc = nlp(u'Hello world. This is a document.')
    for sent in doc.sents:
        for word in sent:
            ...

It doesn't matter to users whether behind the scenes, the sentence boundaries
are being calculated from character heuristics, or from the syntactic parse.
It was the former, now it's the latter. Similarly, part-of-speech tags are
currently predicted in their own processing step. In future they may be
predicted jointly with the parsing. The API won't change.

Other libraries ask users to choose between a variety of different statistical
models, e.g. they ask you to specify that you want the "neural network
dependency parser", or the "probabilistic context-free grammar parser", or
whatever. By doing this they tie the API to those models.

spaCy just picks the best one and gives it to you. The benefit is that you
don't need to be informed when a new model is implemented, even if the change
is quite drastic. The modelling is a transient implementation detail, not
exposed in the API.
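
That "model as implementation detail" design reads roughly like this sketch
(hypothetical names throughout; this is not spaCy's actual internals):

```python
class HeuristicSentencizer:
    """Old approach: crude character heuristics for sentence boundaries."""
    def sents(self, text):
        return [s.strip() for s in text.split('. ') if s.strip()]

class ParserSentencizer(HeuristicSentencizer):
    """Stand-in for reading boundaries off the syntactic parse;
    same interface, better model underneath."""

# The library swaps this when a better model ships; callers never see it.
_BEST_SENTENCIZER = ParserSentencizer

def sentences(text):
    """Public API: stable even as the model behind it changes."""
    return _BEST_SENTENCIZER().sents(text)
```

Swapping `_BEST_SENTENCIZER` changes the model but not the call site, which is
the property being claimed for `doc.sents` above.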

------
Osiris
Regarding dead (as in unused) code, I keep noticing the guys on my UI
development team commenting out code and then committing it to Git. I remind
them periodically that they can just delete the code and if they ever need it,
they can use Git to pull up historical versions of the file for reference.

~~~
a_bonobo
I think some people prefer commenting over deleting under version control -
that way, they can see how the code used to look.

It would be nice to have an emacs or vim plugin in which you select a block of
code, and it slowly walks back the graph of commits, showing each commit for 5
seconds. That way you could nicely see how your code-block evolved over
time... (of course in many cases the code-block itself is useless without
context)
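
Something close to that walk-through already exists at the command line:
`git log -L` traces a line range through history. A minimal wrapper a plugin
could build on (assuming a git checkout and `git` on the PATH) might look
like:

```python
import subprocess

def block_log_cmd(path, start, end):
    """Build the `git log -L` invocation for lines start..end of path.
    Git follows the block across commits even as edits move it around."""
    return ["git", "log", f"-L{start},{end}:{path}"]

def block_history(path, start, end):
    # One patch per commit that touched the block, newest first;
    # a plugin could page through these on a timer.
    result = subprocess.run(block_log_cmd(path, start, end),
                            capture_output=True, text=True, check=True)
    return result.stdout
```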

~~~
kyllo
Time-traveling git blame visualization could be rad too.

------
desilinguist
As someone who contributed to NLTK quite a bit, I found it quite useful back
in the day, especially when I had to teach NLP/CL to linguistics (non-CS)
graduate students. I agree with Radim that NLTK has a purpose - and it's not to
implement the latest and the greatest NLP algorithms. I'm glad NLTK exists and
although it is not what I use today, I'm pretty sure whatever I do use today
(CoreNLP, gensim, etc.) will all be superseded by the next best thing a decade
from now.

------
stevenbird
I've updated the NLTK issue tracker with information about how the model for
NLTK's built-in POS tagger was trained:
[https://github.com/nltk/nltk/issues/1063#issuecomment-138005...](https://github.com/nltk/nltk/issues/1063#issuecomment-138005116)

The second edition of the book will include a "scaling up" section in most
chapters, which shows how to transition from NLTK's pure Python
implementations to NLTK's wrappers for the Stanford tools.

------
z92
I put all dead code in a file called "deadcode.c" and get done with it. If I
need it again, I can always copy from there. Easier than searching through git
history.

~~~
mahmud
Ouch, this sounds terrible. What if two functions/classes/top-level-constructs
have the same name (and signature)? Now you have a conflict and your project
won't build anymore.

Use version control, develop new features in branches, merge to master + tag.
There.

~~~
marvy
> What if two functions [...] have the same name?

Presumably, the compiler never sees deadcode.c. Or did I misunderstand the
question?

~~~
z92
deadcode.c is excluded from make. It's deadcode anyway.

------
skrebbel
I like the gist of this post, but it feels somewhat incomplete: NLTK is Apache
licensed and spaCy is a dual-licensed (AGPL or money) commercial product. It's
a good idea and an honest business, and I hope he succeeds, but I think it
would've been more honest if the article had reflected that.

------
elliptic
Can someone explain the following comments, for someone with some knowledge of
ML but none of NLP? "First, it's really much better to use Averaged
Perceptron, or some other method which can be trained in an error-driven way.
You don't want to do batch learning. Batch learning makes it difficult to
train from negative examples effectively, and this makes a very big difference
to accuracy" I thought that it was typical for suitably regularized batch
methods to modestly outperform or at least match (in terms of accuracy) online
methods, whose main advantage is their speed.

~~~
syllogism
Reading it back, my comment wasn't the best explanation of the issue.

The reason is that what we're really doing here is predicting a structure (a
parse tree), but we've encoded the problem as a series of local steps. Think
of this like, what we want to do is navigate to a goal, and we'll do this by
predicting a series of local actions.

Try stepping through the decision process.[1] This should give you a feel for
the local decisions, and how they build the larger structure.

If we use an online learner, we can take advantage of an analytic method
introduced in 2012 of calculating the global loss of a local action (the
"dynamic oracle"), to do imitation learning.

Specifically, during training we generate examples with the parser, and label
them with this "dynamic oracle". A large batch size means we're generating the
examples with a model that's "out of date".

[1]
[http://spacy.io/displacy/?manual=Shift%20words%20onto%20the%...](http://spacy.io/displacy/?manual=Shift%20words%20onto%20the%20stack.%20Create%20left%20and%20right%20arcs%20between%20the%20word%20on%20top%20of%20the%20stack%20and%20the%20word%20at%20the%20start%20of%20the%20buffer.%20Pop%20words%20from%20the%20stack).
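
For the error-driven training mentioned up-thread, the averaged perceptron
core is small enough to sketch in full. This is a toy illustration of the
update and averaging rules only, not NLTK's or spaCy's code:

```python
from collections import defaultdict

class AveragedPerceptron:
    def __init__(self):
        self.weights = defaultdict(lambda: defaultdict(float))
        self._totals = defaultdict(float)  # accumulated weight * steps alive
        self._stamps = defaultdict(int)    # step at which a weight last moved
        self.i = 0                         # examples seen

    def predict(self, features):
        scores = defaultdict(float)
        for f in features:
            for cls, w in self.weights[f].items():
                scores[cls] += w
        return max(scores, key=scores.get) if scores else None

    def update(self, truth, guess, features):
        # Error-driven: weights move only on a mistake, rewarding the
        # true class and penalising the wrong guess.
        self.i += 1
        if truth == guess:
            return
        for f in features:
            self._bump(f, truth, 1.0)
            if guess is not None:
                self._bump(f, guess, -1.0)

    def _bump(self, f, cls, delta):
        # Credit the old value for every step it survived unchanged.
        self._totals[(f, cls)] += (self.i - self._stamps[(f, cls)]) * self.weights[f][cls]
        self._stamps[(f, cls)] = self.i
        self.weights[f][cls] += delta

    def average(self):
        # Replace each weight by its mean over training, damping the
        # thrash of late updates.
        for f, classes in self.weights.items():
            for cls, w in classes.items():
                total = self._totals[(f, cls)] + (self.i - self._stamps[(f, cls)]) * w
                self.weights[f][cls] = total / self.i
```

The dynamic-oracle part is what supplies `truth` for each local parser action;
the update rule itself stays this simple.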

------
firebones
The theoretically "best" algorithm is not necessarily the one that best fits a
particular task or set of constraints. It is presumptuous of the author to
assume he knows what's best for every user of the toolkit.

I suggest that the author, being so wise in the ways of NLP science, channel
this outrage and write "NLTK: The Good Parts" to save the rest of the world
from stumbling blindly in the dark wilderness of ignorance.

------
analognoise
So rather than jump in and start adding documentation you blast the
developers, who are offering this stuff free and without warranty or implied
fitness for any purpose?

You can contribute by adding documentation where you see it lacking,
especially if you have domain specific knowledge that would help others.

Or you can blast the entire project, not help, and go write your own. The
thing that bothers me is that if you know enough, and it's mostly a teaching
tool (my understanding from other comments), you could greatly improve the
situation for the next guy by providing your enlightened input on the subject
in the form of documentation. So the whole damn community loses out on your
hard-earned understanding.

Meanwhile, 10 years from now, your project will be replaced, and if NLTK is
really a teaching tool, you won't even be a footnote (because teaching tools
don't die unless a whole field dies).

This smacks of the kind of "bubble" Silicon Valley entitlement that I can't
quite wrap my head around (I know, author isn't in SV, I just see this kind of
crap coming from there).

~~~
argonaut
The author clearly states why they didn't choose to contribute to NLTK: "You
can't contribute to a project if you believe that the first thing that they
should do is throw almost all of it away."

Whether or not you think that's actually true, if someone does believe that,
that's a good reason not to contribute to a project.

~~~
analognoise
Good point.

So I take umbrage at his belief that the whole thing (which is apparently
well and actively used, and could benefit from his input) should be thrown
out, and think he's petulant and certainly not a good collaborator (or
community actor).

------
snoitavla
NLTK will soon include a state-of-the-art implementation that is openly and
"nicely" licensed:
[https://github.com/nltk/nltk/issues/1110](https://github.com/nltk/nltk/issues/1110)

------
retreatguru
New to NLP, we tried NLTK first for a toy project, and it was very slow and
inaccurate. Luckily we found spaCy; switching to it sped things up 10x with
better accuracy, and it was easier to use. Based on this experience I tend to
agree with the author.

------
latenightcoding
NLTK = education

OpenNLP = production

I thought that was a known fact

~~~
nkozyra
OpenNLP has never struck me as anywhere near as robust as NLTK.

~~~
danieldk
We use OpenNLP in production and it is very stable/robust (though, not exactly
cutting-edge anymore). We regularly push large corpora (e.g. German Wikipedia
or 20 years of newspaper text) through some OpenNLP-based services, without
any problems. This in contrast to some other tools, which I won't name, that
have horrible concurrency issues, etc.

~~~
brendano
It would be helpful if you named the other ones -- always useful to hear
examples of what works and doesn't.

