Dead Code Should Be Buried – Why I Didn't Contribute to NLTK (spacy.io)
140 points by Smerity on Sept 5, 2015 | 62 comments

NLTK has an active and growing developer community. We're grateful to Matthew Honnibal for permission to port his averaged perceptron tagger, and it's now included in NLTK 3.1.

Note that NLTK includes reference implementations for a range of NLP algorithms, supporting reproducibility and helping a diverse community to get into NLP. We provide interfaces for standard NLP tasks, and an easy way to switch from using pure Python implementations to using wrappers for external implementations such as the Stanford CoreNLP tools. We're adding "scaling up" sections to the NLTK book to show how this is done.
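
For readers who haven't used it, the basic interface looks something like this (a minimal sketch; it assumes the tokenizer and tagger models have already been fetched with nltk.download()):

    import nltk

    # One-time model downloads (kept separate from the library code):
    # nltk.download('punkt')
    # nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("NLTK provides interfaces for standard NLP tasks.")
    print(nltk.pos_tag(tokens))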

https://github.com/nltk/nltk | https://pypi.python.org/pypi/nltk | http://www.nltk.org/book_2ed/ch05.html#scaling-up

I hate this genre of post that basically follows the line: "I went to <established project> and attempted to educate them. When they didn't listen I went and built something better. Now it's clear they should have listened to me, and you should all abandon their software"

Almost always the scope of the new project is much smaller, different, or much less mature than the project being bashed. Open source projects are not required to make changes to please any arbitrary user that wants to make changes, even if it's to bring technical improvements.

In NLTK's case, they have a whole book written around their project. Presumably significant changes to project structure and function would mean heavy documentation/writing work, and might not fit the goals of their project. Bashing them as a result just shows a complete lack of understanding of how/why people write and maintain software.

I disagree strongly. This is the same difference as Linux vs Minix. Minix didn't want anything added as it was said to be for educational purposes even though that didn't reflect the use cases.

The author points out that whilst the stated aim of NLTK is education, it's used for far more than that in industry and academia. You'll see it used in papers, you'll see it as the basis of real world projects, etc. This presents a problem if the aims of the project are different from how the project is actually used.

The biggest red flag for me, as pointed out in the blog post, is that the project doesn't even know how the part of speech (POS) model was trained[1]. That means a lack of reproducibility[2]. Given that POS tags are the first step of almost any NLP pipeline, this is deeply troubling.

[1]: "Where did the NLTK pos_tag model come from?" https://github.com/nltk/nltk/issues/1063

[2]: The POS tags from NLTK are used for many papers and research - see https://scholar.google.com/scholar?as_ylo=2015&q=nltk&hl=en&...

NLTK in research is probably mostly used as glue, its corpus interface, and its standard wrappers to common libraries. Everyone using it for research will do something like "I used data from NLTK, pushed it through my custom parser, and here's how it compares to the wrapped parsers that NLTK also interfaces with".

That's why the maintainer said, basically, "nope, we only implement the standard algorithms". Most of the researchers want to get standard data, and compare their new algorithm to the standard algorithms that every other researcher uses.

My blog post does explain the standard algorithms! Just, the ones that are standard now. That reply actually made no sense. I guess the maintainer thought that my post described novel research. It didn't.

There's now a ticket to implement the dynamic oracle, as I recommended: https://github.com/nltk/nltk/issues/905

Surely there's a place for technical criticism? If you think I've over-stepped that and made comments which reflect personally on the NLTK maintainers I'll apologise and revise. That wasn't my intention.

For a long time we've been in a situation where everyone experienced in NLP knows, but nobody says, that you should not use NLTK. That's not a healthy situation.

Having a book as baggage explains but does not excuse how out-of-date and low-quality NLTK's software is. The bottom-line is that in 2015 you can't go to NLTK and:

a) Learn how modern NLP is done;

b) Access a convenient toolkit of reliable, basic NLP components.

That's the mission statement, right? Well I think they don't achieve that, and that they do a disservice by pretending they do.

I have to disagree with the GP and agree with you... NLTK is a deservedly popular library, and an attack on it is going to seem like an attack on its many hard-working volunteers... but good on you for putting the time into building an open source project that follows your vision of an alternative... that's far from just griping, and it's often the way that software in a field improves. I'm definitely going to try out spaCy.

OP did not "go to <established project> and attemt to educate them". He wrote spaCy after looking at NLTK and noticed a difference in philosophy that could not be overcome by talking it over.

The NLTK project has grown a community that values choice and, well, history I guess? The author knows that and does not want to be part of it and that is totally fine. He seems to understand the project goals, he just doesn't agree with them, so started from scratch and wrote a post about it, presenting a different way to do things.

All of this is to say, I agree with some points you made but IMO this post is not part of the genre you mentioned.

Very relatable post. Isn't NLTK primarily a teaching / demonstration tool though?

I just checked their website and the claim of "NLTK is a leading platform for building Python programs to work with human language data... a suite of libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning" does sound a little odd. But I think everyone in the industry knows NLTK's place and purpose -- you practically cannot avoid finding out quickly. NLTK's scope is clearly too broad to be meaningfully cutting edge at any one thing.

New libraries and implementations will always have an advantage. It's easier to tout "simplicity and leanness" when you don't have to carry over all the baggage and backward compatibility accumulated over the years.

For that reason, an occasional "complexity reset" is expected, and if a library will not or cannot do it, another library will. Will SpaCy's fate be different, 10 years down the road?

As you note from the website, the stated purpose is education, but NLTK is used (and advertised) as far more than that. Due to this mixed message, NLTK is used in places it shouldn't be - the author's main concern.

For me, NLTK also has issues in education. Teaching requires clarity - a complex codebase rarely allows for that. The author wrote an article "Parsing English in 500 lines of Python"[1] which does a great job of explaining, by being simple and lean, how to parse. Additionally, it achieved the same level of accuracy as the Stanford NLP parser - a larger and more complex parser.

That to me is the pinnacle of an educational objective - clear, concise, and practical.

[1]: http://spacy.io/blog/parsing-english-in-python/

> Will SpaCy's fate be different, 10 years down the road?

Yes --- because I consolidate my algorithms and delete dead code. I've probably written five or six times as much code as currently lives in spaCy.

I hope by then spaCy will be smaller, not bigger, as we reach a more concise understanding of how to actually solve the problem. For instance, it's reasonable to expect the boundary between the POS tagger, parser and entity recogniser to disappear, in the same way that spaCy doesn't feature a separate chunker or sentence boundary detector. I read these annotations off the parse tree.
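
For example, base noun phrases come straight out of the dependency parse (a minimal sketch, assuming an English pipeline has already been loaded as nlp):

    doc = nlp(u'The quick brown fox jumped over the lazy dog.')
    # No separate chunker: noun phrases are read off the parse tree.
    for np in doc.noun_chunks:
        print(np)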

A library that is constantly changing and removing "dead" code is a library nobody can rely on for production applications.

At first the complaint about NLTK was that it was too academic and not appropriate for real-world code, but no real-world code is going to rely on an unreliable library that keeps changing how it works.

You can maintain the API while overhauling the models underneath. spaCy so far has had almost no API breakages.

For instance, you get sentences as follows:

    # nlp is a loaded spaCy pipeline object
    doc = nlp(u'Hello world. This is a document.')
    for sent in doc.sents:
        for word in sent:
            print(word)
It doesn't matter to users whether, behind the scenes, the sentence boundaries are calculated from character heuristics or from the syntactic parse. It was the former; now it's the latter. Similarly, part-of-speech tags are currently predicted in their own processing step. In future they may be predicted jointly with the parsing. The API won't change.

Other libraries ask users to choose between a variety of different statistical models, e.g. they ask you to specify that you want the "neural network dependency parser", or the "probabilistic context-free grammar parser", or whatever. By doing this they tie the API to those models.

spaCy just picks the best one and gives it to you. The benefit is that you don't need to be informed when a new model is implemented, even if the change is quite drastic. The modelling is a transient implementation detail, not exposed in the API.

> It's easier to tout "simplicity and leanness" when you don't have to carry over all the baggage and backward compatibility accumulated over the years.

Well, why do we have to build large, clunky (NLP) libraries to start with? Build lean and mean components as UNIX programs or easily bindable libraries and use a reasonable input/output format. E.g. nearly every statistical dependency parser uses CoNLL-X for input/output. You'll have no trouble swapping out MaltParser, Turbo Parser, or my neural net dependency parser. They all use the same, boring, tabular format.
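
For anyone who hasn't seen it: CoNLL-X is just one token per line, ten tab-separated columns, and a blank line between sentences, so a reader is a few lines of Python. A rough sketch:

    def read_conllx(lines):
        """Yield sentences as lists of (id, form, lemma, cpostag, postag,
        feats, head, deprel, phead, pdeprel) field tuples."""
        sentence = []
        for line in lines:
            line = line.rstrip('\n')
            if not line:                 # blank line ends a sentence
                if sentence:
                    yield sentence
                sentence = []
            else:
                sentence.append(tuple(line.split('\t')))
        if sentence:
            yield sentence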

Sure, this could be more work for a beginner. So, a project could make a curated list / meta-package of components that are robust, state-of-the-art and work together.

We're in violent agreement.

It makes me sad to see neat, focused libraries have their mission blurred, API expanded, code base obfuscated... Until they satisfy everybody's use case, which is to say, they're useless. Some features are best left to user-land.

It's a non-trivial tradeoff, obviously: "Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can." (Zawinski's law)

I actually wrote a rant on this topic a while ago: http://rare-technologies.com/data-streaming-in-python-genera...

I definitely agree that the NLP library shouldn't have an opinion about your machine learning, shouldn't give you every accuracy measure under the sun, etc. My original vision for spaCy was smaller. But there are two problems.

1) If I ship you a statistical model that sits late in the pipeline, like a parser, the earlier components in the pipeline are not swappable. If you change the tokenization, POS tagging, lemmatization etc., the parser model will give you worse output.

This isn't obvious to people, and the problem can be subtle. For instance, some NER models use POS tag features, others don't.

2) The output format isn't actually that convenient. It sucks that everyone has to write this tree-processing code, and aligning the tokenized output back to the original string is a pain if you want to calculate mark-up.
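
This is the kind of thing the spaCy API tries to make trivial: every token keeps its character offset into the original string. A rough sketch of producing mark-up from per-token offsets (assumes a loaded pipeline as nlp, and the token.idx / doc.text attributes of current spaCy):

    doc = nlp(u'Google bought DeepMind for a reported $400m.')
    text = doc.text
    out, last = [], 0
    for tok in doc:
        if tok.pos_ == 'NOUN':           # any per-token predicate works here
            start, end = tok.idx, tok.idx + len(tok)
            out.append(text[last:start])
            out.append('<b>' + text[start:end] + '</b>')
            last = end
    out.append(text[last:])
    print(''.join(out))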

Could not have said this better. New developers should not discount the benefits of hindsight, and humility goes a long way in the open source community.

NLTK is expansive, which seems to be the real, primary complaint. But it's also not $5,000 a year. I applaud the effort, and the product looks very good, but I don't think there was enough here to say, definitively, "don't use NLTK."

If you missed it, SpaCy is AGPLv3 as well. $5,000 a year doesn't seem so bad if you are doing something that can't work with the AGPLv3.

No, I saw that ... my point is that there's a profit motive that would make utilization of NLTK a non-starter regardless of either platform's merits.

There's a pretty large gap between "$5,000 a year is reasonable" and "AGPLv3 is usable" (or any GPL variant).

Speaking personally, the overwhelming majority of projects I work on can't ever go near GPL, because the project itself does not want to catch the awful GPL virus. Even open-source projects, nearly all of the ones I touch are not GPL but are some other, much more sane, license like Apache, BSD, or MIT.

No one here is interested in reading inflammatory posts about tiresome old flame wars. You call it an awful virus, I call proprietary licenses a distasteful cancer. You call it insane, I call it an entitled attitude to other people's work.

At work and in private life I don't include other people's copyrighted work if I don't have a license for it. It's one simple rule. If I don't want to agree to the license because it's too expensive or it puts requirements on me which I refuse (like NDAs), then I have a simple choice to make. I can choose not to use it, I can implement it myself, or I can hire someone else to do it.

You're arguing the wrong point. I'm not advocating for ignoring licenses, and I have no idea why you think I would. My point is entirely that the OP's claim, that $5,000 a year is reasonable if you can't use the AGPL, is absolutely fucking stupid. There's tons of reasons why you can't use the AGPL and can't afford $5,000 a year. So go take your sanctimony elsewhere.

If you don't want to pay the price and refuse to accept the conditions under which it is given to you for free, then tough luck! Don't come to me crying about how you want it for free while adding your own license on top because your work is special.

You really suck at reading comprehension. Please stop with this nonsense, it has no bearing whatsoever to any of my comments.

You seem very angry at people who offer you free code to use. Please don't post inflammatory comments in HN in that state.

Please stop posting complete drivel. You seem to be out to intentionally twist and discredit what I'm saying, and I can't figure out why.

You are angry and lashing out, which is clouding your ability to comprehend what people are saying to you. This is a common trait among people who have a strong religious attitude towards licenses, and it demonstrates why flame war topics never produce anything productive on HN.

Regarding dead (as in unused) code, I keep noticing the guys on my UI development team commenting out code and then committing it to Git. I remind them periodically that they can just delete the code and if they ever need it, they can use Git to pull up historical versions of the file for reference.

The thing is that "you can always find it back with Git" only holds when you're pretty proficient with Git (or know where to click in a UI like SourceTree, and even that takes a while to figure out). Many people basically don't get much further than commit, push, pull. For them, deleting means they need to go and ask someone like you for help to get things back. Commenting keeps them in control.

A rule I like that is 100% enforceable: commented code is allowed, but there must be a comment above the code explaining why it was commented out. The effect in practice of this rule is that

    - People delete code more and comment it less
    - Readers know whether they need to pay attention to the commented code or not.
Point 1 happens the most. I begin to write down something like "keeping this code around because I'm not sure yet whether the new code is better" and then I backspace and delete the whole thing because I realise it clearly is.
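
A made-up illustration of what the rule looks like in practice (Python, names hypothetical):

    # Commented out 2015-09: the regex splitter mishandles abbreviations like
    # "e.g.". Keeping it until the new parser-based splitter has been checked
    # against the production corpus; delete it if no regressions turn up.
    # def split_sentences(text):
    #     return re.split(r'(?<=[.!?])\s+', text)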

To be fair, I think commenting code in a sense "works better" than "git revert" because you don't need to solve conflicts; you just keep writing code around the commented one.

I rationally know that I can remove and re-add it later and that commit/revert is better, but the pain associated with solving conflicts is strong enough that I subconsciously want to avoid it.

I think some people prefer commenting over deleting on version control - that way, they can see what this code looked like in the past.

It would be nice to have an emacs or vim plugin in which you select a block of code, and it slowly walks back the graph of commits, showing each commit for 5 seconds. That way you could nicely see how your code-block evolved over time... (of course in many cases the code-block itself is useless without context)

Time-traveling git blame visualization could be rad too.

You can basically do this in vim with https://github.com/tpope/vim-fugitive

:Gblame to show a git blame sidebar, and P to open the file at that commit.

I achieve the same effect regularly with hgview /path/to/file; you can watch how the file changes commit by commit. Surely there's something similar for git? http://www.hgview.org/

Well I think it's an out of sight, out of mind kind of thing. Something commented out is generally something you know you'll want to be referencing again. Coming back to something in the future you/your teammates might not even remember/realize that the deleted code was ever there.

Eh, think of it like a cache. "You shouldn't keep that in memory, you can get it from disk." You can lean in that direction, but it doesn't make sense as a hard rule.

I use cache style metrics for killing code too.

E.g. I might start a rewrite with:

  #if BLEEDING_EDGE
  ...new code...
  #else
  ...old code...
  #endif
Eventually moving the old code into a further removed #if 0 as the remaining bugs taper off (i.e. I'll just fix them instead of reverting back to the old code for a milestone) before deleting the old code outright.

Much like evicting data from L1, L2, L3, and finally main memory.

Sometimes I'll even write explicit comments to delete code by a certain date.

I can understand where they're coming from but, for the life of me, I can't recall the last time I needed to restore such code after a long gap. If the code's that interesting, they can either leave it on a feature branch or refactor it out into a reusable module or the like.

Second that; usually I see worthless snippets commented out. Those are good as a reference for a couple of hours while you work on the code, but when you are done the commented code is a lie and the new code is the truth. Also, I cannot imagine working on a piece of code without looking at the git history to get context (commit messages, linked issues in the tracker); in that light, commented code from the past has no value for me at all.

The main problem I find is: when you decide you need to recall that deleted code, where will you find it?

With Git, I find the pickaxe tool ("git log -S") to be useful for this purpose. From the man page:


    Look for differences that change the number of occurrences of the specified string (i.e. addition/deletion) in a file. Intended for the scripter’s use.

    It is useful when you’re looking for an exact block of code (like a struct), and want to know the history of that block since it first came into being: use the feature iteratively to feed the interesting block in the preimage back into -S, and keep going until you get the very first version of the block.

In the repo.

As someone who did contribute to NLTK quite a bit, it was quite useful back in the day especially when I had to teach NLP/CL to linguistics (non-CS) graduate students. I agree with Radim that NLTK has a purpose - and it's not to implement the latest and the greatest NLP algorithms. I'm glad NLTK exists and although it is not what I use today, I'm pretty sure whatever I do use today (CoreNLP, gensim, etc.) will all be superseded by the next best thing a decade from now.

I've updated the NLTK issue tracker with information about how the model for NLTK's built-in POS tagger was trained: https://github.com/nltk/nltk/issues/1063#issuecomment-138005...

The second edition of the book will include a "scaling up" section in most chapters, which shows how to transition from NLTK's pure Python implementations to NLTK's wrappers for the Stanford tools.

I put all dead code in a file called "deadcode.c" and get done with it. If I need it again, I can always copy from there. Easier than searching through git history.

Ouch, this sounds terrible. What if two functions/classes/top-level-constructs have the same name (and signature)? Now you have a conflict and your project won't build anymore.

Use version control, develop new features in branches, merge to master + tag. There.

> What if two functions [...] have the same name?

Presumably, the compiler never sees deadcode.c. Or did I misunderstand the question?

deadcode.c is excluded from make. It's deadcode anyway.

I like the gist of this post, but it feels somewhat incomplete: NLTK is Apache licensed and spaCy is a dual-licensed (AGPL or money) commercial product. It's a good idea and an honest business, and I hope he succeeds, but I think it would've been more honest if the article had reflected that.

Can someone explain the following comments, for someone with some knowledge of ML but none of NLP? "First, it's really much better to use Averaged Perceptron, or some other method which can be trained in an error-driven way. You don't want to do batch learning. Batch learning makes it difficult to train from negative examples effectively, and this makes a very big difference to accuracy." I thought it was typical for suitably regularized batch methods to modestly outperform, or at least match (in terms of accuracy), online methods, whose main advantage is their speed.

Reading it back, my comment wasn't the best explanation of the issue.

The reason is that what we're really doing here is predicting a structure (a parse tree), but we've encoded the problem as a series of local steps. Think of this like, what we want to do is navigate to a goal, and we'll do this by predicting a series of local actions.

Try stepping through the decision process.[1] This should give you a feel for the local decisions, and how they build the larger structure.

If we use an online learner, we can take advantage of an analytic method introduced in 2012 of calculating the global loss of a local action (the "dynamic oracle"), to do imitation learning.

Specifically, during training we generate examples with the parser, and label them with this "dynamic oracle". A large batch size means we're generating the examples with a model that's "out of date".

[1] http://spacy.io/displacy/?manual=Shift%20words%20onto%20the%....
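
To make "error-driven" concrete: the parser produces an example with its current weights, the dynamic oracle says which action was actually best, and the weights are nudged immediately when the guess was wrong. A bare-bones sketch of that kind of update (illustrative only; a real averaged perceptron also keeps running totals of every weight and predicts with their average):

    from collections import defaultdict

    class PerceptronSketch(object):
        def __init__(self, classes):
            self.classes = classes
            self.weights = defaultdict(lambda: defaultdict(float))

        def predict(self, features):
            # Score each class by summing the weights of the active features.
            scores = dict.fromkeys(self.classes, 0.0)
            for f in features:
                for clas, w in self.weights[f].items():
                    if clas in scores:
                        scores[clas] += w
            return max(self.classes, key=lambda c: scores[c])

        def update(self, truth, guess, features):
            # Error-driven: weights move only when this example was mis-predicted,
            # immediately after the example is generated.
            if truth == guess:
                return
            for f in features:
                self.weights[f][truth] += 1.0
                self.weights[f][guess] -= 1.0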

The theoretically "best" algorithm may not necessarily be the one that fits a particular task or set of constraints the best. It is presumptuous of the author to know what's best for every user of the toolkit.

I suggest that the author, being so wise in the ways of NLP science, channel this outrage and write "NLTK: The Good Parts" to save the rest of the world from stumbling blindly in the dark wilderness of ignorance.

So rather than jump in and start adding documentation you blast the developers, who are offering this stuff free and without warranty or implied fitness for any purpose?

You can contribute by adding documentation where you see it lacking, especially if you have domain specific knowledge that would help others.

Or you can blast the entire project, not help, and go write your own. The thing that bothers me is that if you know enough, and it's mostly a teaching tool (my understanding from other comments), you could greatly improve the situation for the next guy by providing your enlightened input on the subject in the form of documentation. So the whole damn community loses out on your hard-earned understanding.

Meanwhile, 10 years from now, your project will be replaced, and if NLTK is really a teaching tool, you won't even be a footnote (because teaching tools don't die unless a whole field dies).

This smacks of the kind of "bubble" Silicon Valley entitlement that I can't quite wrap my head around (I know, author isn't in SV, I just see this kind of crap coming from there).

The author clearly states why they didn't choose to contribute to NLTK: "You can't contribute to a project if you believe that the first thing that they should do is throw almost all of it away."

Whether or not you think that's actually true, if someone does believe that, that's a good reason not to contribute to a project.

Good point.

So I take umbrage with his belief that the whole thing (which is apparently well and actively used, and could benefit from his input) should be thrown out, and think he's petulant and certainly not a good collaborator (or community actor).

NLTK will include a state-of-the-art, openly and "nicely" licensed implementation soon: https://github.com/nltk/nltk/issues/1110

New to NLP we tried NLTK first for a toy project and it was very slow and inaccurate. Luckily we found spaCy, switched to it and sped things up 10x with better accuracy and it was easier to use. Based on this experience I tend to agree with the author.

NLTK = education

OpenNLP = production

I thought that was a known fact

OpenNLP has never struck me as anywhere near as robust as NLTK.

We use OpenNLP in production and it is very stable/robust (though, not exactly cutting-edge anymore). We regularly push large corpora (e.g. German Wikipedia or 20 years of newspaper text) through some OpenNLP-based services, without any problems. This in contrast to some other tools, which I won't name, that have horrible concurrency issues, etc.

It would be helpful if you named the other ones -- always useful to hear examples of what works and doesn't.
