Note that NLTK includes reference implementations for a range of NLP algorithms, supporting reproducibility and helping a diverse community to get into NLP. We provide interfaces for standard NLP tasks, and an easy way to switch from using pure Python implementations to using wrappers for external implementations such as the Stanford CoreNLP tools. We're adding "scaling up" sections to the NLTK book to show how this is done.
https://github.com/nltk/nltk | https://pypi.python.org/pypi/nltk
Almost always, the scope of the new project is much smaller, different, or much less mature than the project being bashed. Open source projects are not required to make changes to please any arbitrary user who wants them, even if the changes would bring technical improvements.
In NLTK's case, there is a whole book written around the project. Presumably, significant changes to its structure and function would mean heavy documentation and writing work, and might not fit the goals of the project. Bashing the maintainers for that just shows a complete lack of understanding of how and why people write and maintain software.
The author points out that while the stated aim of NLTK is education, it's used for far more than that in industry and academia. You'll see it used in papers, you'll see it as the basis of real-world projects, and so on. This presents a problem when the aims of the project differ from how the project is actually used.
The biggest red flag for me is, as pointed out in the blog post, that the project doesn't even know how its part-of-speech (POS) model was trained. That means a lack of reproducibility, and given that POS tags are the first layer of almost any NLP task, it is deeply troubling.
: "Where did the NLTK pos_tag model come from?" https://github.com/nltk/nltk/issues/1063
: The POS tags from NLTK are used for many papers and research - see https://scholar.google.com/scholar?as_ylo=2015&q=nltk&hl=en&...
That's why the maintainer said, basically, "nope, we only implement the standard algorithms". Most researchers want standard data, and want to compare their new algorithm against the standard algorithms that every other researcher uses.
There's now a ticket to implement the dynamic oracle, as I recommended: https://github.com/nltk/nltk/issues/905
For a long time we've been in a situation where everyone experienced in NLP knows, but nobody says, that you should not use NLTK. That's not a healthy situation.
Having a book as baggage explains, but does not excuse, how out-of-date and low-quality NLTK's software is. The bottom line is that in 2015 you can't go to NLTK and:
a) Learn how modern NLP is done;
b) Access a convenient toolkit of reliable, basic NLP components.
That's the mission statement, right? Well I think they don't achieve that, and that they do a disservice by pretending they do.
The NLTK project has grown a community that values choice and, well, history I guess? The author knows that and does not want to be part of it, and that is totally fine. He seems to understand the project's goals; he just doesn't agree with them, so he started from scratch and wrote a post about it, presenting a different way to do things.
All of this is to say, I agree with some points you made but IMO this post is not part of the genre you mentioned.
I just checked their website and the claim of "NLTK is a leading platform for building Python programs to work with human language data... a suite of libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning" does sound a little odd. But I think everyone in the industry knows NLTK's place and purpose -- you practically cannot avoid finding out quickly. NLTK's scope is clearly too broad to be meaningfully cutting edge at any one thing.
New libraries and implementations will always have an advantage. It's easier to tout "simplicity and leanness" when you don't have to carry over all the baggage and backward compatibility accumulated over the years.
For that reason, an occasional "complexity reset" is expected, and if a library will not or cannot do it, another library will. Will spaCy's fate be different, 10 years down the road?
For me, NLTK also has issues as an educational tool. Teaching requires clarity, and a complex codebase rarely allows for that. The author wrote an article, "Parsing English in 500 lines of Python", which does a great job of explaining how to parse precisely by being simple and lean. On top of that, it achieved the same level of accuracy as the Stanford NLP parser, a larger and more complex system.
That to me is the pinnacle of an educational objective - clear, concise, and practical.
Yes --- because I consolidate my algorithms and delete dead code. I've probably written five or six times as much code as currently lives in spaCy.
I hope by then spaCy will be smaller, not bigger, as we reach a more concise understanding of how to actually solve the problem. For instance, it's reasonable to expect the boundary between the POS tagger, parser and entity recogniser to disappear, in the same way that spaCy doesn't feature a separate chunker or sentence boundary detector. I read these annotations off the parse tree.
At first the complaint about NLTK was that it was too academic and not appropriate for real-world code, but no real-world code is going to rely on an unreliable library that keeps changing how it works.
For instance, you get sentences as follows:
from spacy.en import English   # 2015-era spaCy API: one import, one model
nlp = English()
doc = nlp(u'Hello world. This is a document.')
for sent in doc.sents:
    for word in sent:
        print(word)
Other libraries ask users to choose between a variety of different statistical models, e.g. they ask you to specify that you want the "neural network dependency parser", or the "probabilistic context-free grammar parser", or whatever. By doing this they tie the API to those models.
spaCy just picks the best one and gives it to you. The benefit is that you don't need to be informed when a new model is implemented, even if the change is quite drastic. The modelling is a transient implementation detail, not exposed in the API.
Well, why do we have to build large, clunky (NLP) libraries to start with? Build lean and mean components as UNIX programs or easily bindable libraries and use a reasonable input/output format. E.g. nearly every statistical dependency parser uses CoNLL-X for input/output. You'll have no trouble swapping out MaltParser, Turbo Parser, or my neural net dependency parser. They all use the same, boring, tabular format.
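For reference, CoNLL-X is just one token per line with tab-separated columns: ID, form, lemma, coarse POS, POS, morphological features, head index, and dependency relation (plus two optional projective columns). A toy fragment (tags illustrative, columns space-aligned here for readability):

1   Dead   dead   ADJ    JJ    _   2   amod    _   _
2   code   code   NOUN   NN    _   3   nsubj   _   _
3   rots   rot    VERB   VBZ   _   0   root    _   _
4   .      .      PUNCT  .     _   3   punct   _   _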
Sure, this could be more work for a beginner. So, a project could make a curated list / meta-package of components that are robust, state-of-the-art and work together.
It makes me sad to see neat, focused libraries have their mission blurred, their API expanded, their code base obfuscated... until they satisfy everybody's use case, which is to say, they're useless. Some features are best left to user-land.
It's a non-trivial tradeoff, obviously:
"Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can." (Zawinski's law)
I actually wrote a rant on this topic a while ago:
1) If I ship you a statistical model that sits late in a pipeline, like a parser, the earlier components in the pipeline are not swappable. If you change the tokenization, POS tagging, lemmatization, etc., the parser model will give you worse output.
This isn't obvious to people, and the problem can be subtle. For instance, some NER models use POS-tag features, and others don't.
2) The output format isn't actually that convenient. It sucks that everyone has to write this tree-processing code, and aligning the tokenized output back to the original string is a pain if you want to compute mark-up.
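To make the alignment pain concrete, here is a minimal sketch of my own (not from any particular library) that maps tokens back to character offsets, assuming the tokenizer only splits text and never rewrites it:

def align_tokens(tokens, text):
    # Map each token to its (start, end) character offsets in `text`.
    # Assumes tokens appear left to right and are verbatim substrings;
    # tokenizers that rewrite tokens (e.g. '(' -> '-LRB-') will break this.
    spans, idx = [], 0
    for tok in tokens:
        start = text.index(tok, idx)   # ValueError if the assumption fails
        spans.append((start, start + len(tok)))
        idx = start + len(tok)
    return spans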
Speaking personally, the overwhelming majority of projects I work on can't ever go near the GPL, because the project itself does not want to catch the awful GPL virus. Even among open-source projects, nearly all of the ones I touch are not GPL but use some other, much saner license like Apache, BSD, or MIT.
At work and in private life, I don't include other people's copyrighted work if I don't have a license for it. It's one simple rule. If I don't want to agree to the license, because it's too expensive or it puts requirements on me that I refuse (like NDAs), then I have a simple choice to make: I can choose not to use it, I can implement it myself, or I can hire someone else to do it.
A rule that I like, and that is 100% enforceable: commented-out code is allowed, but there must be a comment above it explaining why the code was commented out. The effect of this rule in practice is that
- People delete code more and comment it out less.
- Readers know whether they need to pay attention to the commented-out code.
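For instance, it might look like this (the names and date are made up):

# 2015-09-20: disabled while the new tokenizer is validated against the
# full test suite; delete this block once the next release is tagged.
# tokens = legacy_tokenize(text)
tokens = unicode_tokenize(text)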
I rationally know that I can remove and re-add it later and that commit/revert is better, but the pain associated with solving conflicts is strong enough that I subconsciously want to avoid it.
It would be nice to have an emacs or vim plugin in which you select a block of code, and it slowly walks back the graph of commits, showing each commit for 5 seconds. That way you could nicely see how your code-block evolved over time... (of course in many cases the code-block itself is useless without context)
With vim-fugitive, :Gblame shows a git blame sidebar, and P opens the file at that commit.
E.g. I might start a rewrite with:
#if BLEEDING_EDGE
...new code...
#else
...old code...
#endif
Much like evicting data from L1, L2, L3, and finally main memory.
Sometimes I'll even write explicit comments to delete code by a certain date.
Look for differences that change the number of occurrences of the specified string (i.e. addition/deletion) in a file. Intended for the scripter’s use.
It is useful when you’re looking for an exact block of code (like a struct), and want to know the history of that block since it first came into being: use the feature iteratively to feed the interesting block in the preimage back into -S, and keep going until you get the very first version of the block.
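In practice the pickaxe workflow looks something like this (the string and path are made up):

# commits that added or removed occurrences of "align_tokens" in one file
git log -S'align_tokens' --oneline -- src/align.py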
The second edition of the book will include a "scaling up" section in most chapters, which shows how to transition from NLTK's pure Python implementations to NLTK's wrappers for the Stanford tools.
Use version control, develop new features in branches, merge to master + tag. There.
Presumably, the compiler never sees deadcode.c. Or did I misunderstand the question?
The reason is that what we're really doing here is predicting a structure (a parse tree), but we've encoded the problem as a series of local steps. Think of it this way: what we want to do is navigate to a goal, and we do that by predicting a series of local actions.
Try stepping through the decision process. This should give you a feel for the local decisions, and how they build the larger structure.
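To give a feel for what those local decisions look like, here is a minimal sketch of a greedy arc-standard transition system. It's purely illustrative, not spaCy's actual code, and the scoring function is a stub standing in for a learned model:

SHIFT, LEFT_ARC, RIGHT_ARC = 'shift', 'left-arc', 'right-arc'

def valid_actions(stack, buf):
    actions = []
    if buf:
        actions.append(SHIFT)
    if len(stack) >= 2:
        actions.append(RIGHT_ARC)      # attach stack[-1] under stack[-2]
        if stack[-2] != 0:             # the root may never become a dependent
            actions.append(LEFT_ARC)   # attach stack[-2] under stack[-1]
    return actions

def parse(n_words, score):
    # Greedily parse word indices 1..n_words; index 0 is the artificial root.
    stack, buf, arcs = [0], list(range(1, n_words + 1)), []
    while buf or len(stack) > 1:
        action = max(valid_actions(stack, buf),
                     key=lambda a: score(stack, buf, a))
        if action == SHIFT:
            stack.append(buf.pop(0))
        elif action == LEFT_ARC:
            arcs.append((stack[-1], stack.pop(-2)))   # (head, dependent)
        else:
            arcs.append((stack[-2], stack.pop()))     # (head, dependent)
    return arcs

Each call to `score` is one local decision, and the arcs accumulate into the larger structure.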
If we use an online learner, we can take advantage of an analytic method introduced in 2012 for calculating the global loss of a local action (the "dynamic oracle") to do imitation learning.
Specifically, during training we generate examples with the parser, and label them with this "dynamic oracle". A large batch size means we're generating the examples with a model that's "out of date".
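Roughly, the training loop looks like the sketch below. All the names here (dynamic_oracle, model.score, model.update, the state object) are assumed interfaces for illustration, not any particular library's API:

def train_on_sentence(model, state, gold_tree, dynamic_oracle):
    while not state.is_final():
        valid = state.valid_actions()
        guess = max(valid, key=lambda a: model.score(state, a))
        # the dynamic oracle returns the actions with zero global loss
        zero_cost = dynamic_oracle(state, gold_tree)
        if guess not in zero_cost:
            best = max(zero_cost, key=lambda a: model.score(state, a))
            model.update(state, bad=guess, good=best)
        state.apply(guess)   # crucially, follow the model's own prediction

Because training follows the parser's own (possibly wrong) predictions, the examples come from the states the model will actually visit at run-time, which is why generating them with an out-of-date model hurts.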
I suggest that the author, being so wise in the ways of NLP science, channel this outrage and write "NLTK: The Good Parts" to save the rest of the world from stumbling blindly in the dark wilderness of ignorance.
You can contribute by adding documentation where you see it lacking, especially if you have domain specific knowledge that would help others.
Or you can blast the entire project, not help, and go write your own. The thing that bothers me is that if you know enough, and it's mostly a teaching tool (my understanding from other comments), you could greatly improve the situation for the next guy by providing your enlightened input on the subject in the form of documentation. So the whole damn community loses out on your hard-earned understanding.
Meanwhile, 10 years from now, your project will be replaced, and if NLTK is really a teaching tool, you won't even be a footnote (because teaching tools don't die unless a whole field dies).
This smacks of the kind of "bubble" Silicon Valley entitlement that I can't quite wrap my head around (I know, author isn't in SV, I just see this kind of crap coming from there).
Whether or not you think that's actually true, if someone does believe that, that's a good reason not to contribute to a project.
So I take umbrage at his belief that the whole thing (which is apparently alive, actively used, and could benefit from his input) should be thrown out, and think he's being petulant and certainly not a good collaborator (or community actor).
OpenNLP = production
I thought that was a known fact