
SpaCy: Industrial-strength NLP with Python and Cython - syllogism
http://honnibal.github.io/spaCy
======
jonstewart
NLTK has always seemed like a bit of a toy when compared to Stanford CoreNLP.
I'd be very curious to see performance/accuracy charts on a number of corpora
in comparison to CoreNLP.

The Cython implementation makes it somewhat believable that it's faster than
CoreNLP, but I'd also like to hear a deep-dive on why it's several times
faster, beyond the fact that control over memory layout is the best way to
win performance (stipulated). In particular, it would be good to know whether
CoreNLP is doing more processing than spaCy or otherwise handling more
concerns.

Finally, I'd really love to see a feature table comparing spaCy with CoreNLP.

Compelling work!

~~~
danieldk
_The Cython implementation makes it somewhat believable that it's faster than
CoreNLP, but I'd also like to hear a deep-dive on why it's several times
faster,_

Time complexity. The Stanford parser is a phrase-structure parser that
creates dependencies as a post-processing step. So, assuming they use some
variation of CKY, the time complexity is O(N^3 |G|), where |G| is the size of
the grammar. spaCy uses Nivre-style greedy parsing, which is O(N).
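
To make the contrast concrete, here is a minimal sketch of a Nivre-style
greedy transition loop. It is illustrative only (the real transition system
also has a RIGHT-ARC action and a trained scoring model), but it shows why
the approach is linear: each word is shifted once and popped at most once, so
a sentence of N words needs at most 2N - 1 transitions.

```python
def greedy_parse(words, policy):
    """Return (head index per word, number of transitions taken)."""
    heads = [None] * len(words)
    stack = []                        # word indices awaiting a head
    buffer = list(range(len(words)))  # unread input
    transitions = 0
    while buffer or len(stack) > 1:
        action = policy(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and len(stack) >= 2:
            dep = stack.pop(-2)       # second-from-top becomes a dependent
            heads[dep] = stack[-1]    # ...of the top of the stack
        else:
            break
        transitions += 1
    return heads, transitions

def dummy_policy(stack, buffer):
    # Placeholder for a trained classifier: shift while input remains,
    # then attach everything leftward.
    return "SHIFT" if buffer else "LEFT-ARC"

heads, n_trans = greedy_parse("A form of asbestos".split(), dummy_policy)
print(n_trans)  # 7 transitions for 4 words: linear, not cubic
```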

So, a slightly fairer comparison would be e.g. the Malt parser. spaCy will
probably still come out ahead in accuracy, since last time I checked the Malt
parser didn't yet use dynamic oracles and doesn't integrate Brown clusters or
word embeddings by default (though you could do that yourself). I do wonder a
bit about feature set construction, because in my experience perceptrons are
far more sensitive to adding 'wrong' features than e.g. SVM classifiers. This
becomes especially interesting when you train a model for another language or
another dependency annotation scheme, since which features are relevant
differs per set-up.

~~~
syllogism
That's not true --- I'm comparing against their neural-network shift-reduce
dependency parser, which is very fast. Actually I don't know of a faster
parser than theirs, other than spaCy.

~~~
cf
I'm curious how this parser compares to ClearNLP
[http://www.clearnlp.com/](http://www.clearnlp.com/) which is similarly a
shift-reduce parser.

------
danieldk
First of all: great work! Even though I think OpenNLP should also be
mentioned, since it's released under the commercially-friendly Apache 2
license, it's great that you provide this. Also, I think it is a shame that
science lost you (I assume) ;).

I think the other interesting problem to tackle currently is training data.
The situation for English is OK if you have a couple of thousand dollars to
spare for a commercial license (which may be problematic for bootstrapping).
But for many other languages there aren't even treebanks available that can
be used for commercial purposes.

It would be great if some annotation project started that aimed to provide
annotations under a liberal license.

(Ps. I have a statistical dependency parser written in Go, which I will
probably release soon in case anyone is interested ;).)

~~~
syllogism
Well, I'm distributing trained models with this. Users shouldn't need to
retrain unless they're doing research, in which case they should have access
to the data.

I agree that the data situation is troubling, though. I don't understand why
Google gave the English Web Treebank to the LDC. Why not just distribute it
themselves?

The LDC is really more of a problem than a help now. For instance, the
OntoNotes corpus costs $0 for non-commercial use. Great! How do you get it?
Send the LDC a fax, and when they get around to it, they send you a login to
their ancient website.

It used to be a valuable service to host and distribute this data. Now, this
is no longer really the case, but it's still standard to distribute via them.

~~~
danieldk
_Well, I'm distributing trained models with this. Users shouldn't need to
retrain unless they're doing research, in which case they should have access
to the data._

I haven't read LDC's license on Penn Treebank recently, but AFAIR you cannot
just redistribute models that were trained on the Penn Treebank. Or put
differently, you can distribute the model, but any users still have to obtain
a license for the treebank. That's why we are still stuck with the Brown
corpus and such.

_I don't understand why Google gave the English Web Treebank to the LDC. Why
not just distribute it themselves?_

Indeed.

~~~
danieldk
_I haven't read LDC's license on Penn Treebank recently,_

Just checked the license and my memory seems to be correct:

[https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf](https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf)

:(

~~~
syllogism
This is the non-commercial agreement --- I'm licensing the resources
commercially.

------
laGrenouille
The idea of implementing cutting-edge NLP algorithms is fantastic and greatly
needed. However, I believe the multi-licensing will not be sustainable in the
long term. It limits others' ability (and interest) in contributing to the
library, because you'll have to get a copyright transfer for any pull request
in order to merge it into the commercial branch. It seems difficult to
imagine one person being able to develop and maintain a library of this
scope. This is particularly true when they depend on it for income, rather
than being some tenured academic who can invest all of their time into the
project without much risk or need for short-term gains.

~~~
syllogism
The idea of a tenured academic spending all day coding their non-research
library is...let's just call it unrealistic. If you find an hour to write
code, it will not be during business hours.

I'm curious why you think there's a market failure here. If a library like
this can produce N salaries of value, then it should be able to earn N
salaries of revenue. Maybe it takes >N salaries of work to produce it --- in
which case, okay. That means this project is more costly than it is useful.

I definitely believe that market failures exist, and are quite common. But the
service I'm trying to sell is trying to create economic value very directly.
I'm not writing poetry here.

~~~
laGrenouille
I'm not saying that tenured faculty working on these projects is the norm;
that's not at all the case. It's just that the few instances where I know of
someone successfully building and maintaining a large low-level library like
this on their own come from dedicated academics who can de facto make it the
majority of their job. The best example I know of is Tim Davis' SuiteSparse C
library for sparse matrix algebra.

Aside from that, the market failure I foresee for your project is the
following: say you keep building this out and write a fantastic,
state-of-the-art general-purpose NLP Python library. Now an academic like
myself comes along and forks the AGPL version of your repository,
contributing additional functionality to the parts of the pipeline I am most
familiar with. You cannot re-incorporate my work into your commercial license
(unless I sign those rights away, which I won't), so now you're stuck trying
to license an inferior version of my fork. Meanwhile, since mine truly is
just open source, my version can freely accept both bug fixes and added
functionality from one-off contributors who are using said fork. Better yet,
unless you change your business model, I can continue to fold in any changes
you make upstream, as well as include parts of other GPL libraries that build
up in the intervening time.

Now, I'm not saying this is a perfect argument. Perhaps enough people are
still interested in paying you for a commercial version of the original
software, but I think in the long run, as the two versions diverge, that's
unlikely to generate sufficient revenue for you.

~~~
syllogism
My thinking is that there will be commercial contexts where an AGPL license is
a non-starter --- it's incompatible with the business model. If so, I think it
makes sense for them to buy a commercial license from me, even if it lacks
features in your fork. If they can't use AGPL code, your fork may as well not
exist, except as this tantalizing something-I-can-never-have that makes the
main library look worse.

I'd also note that you gain absolutely nothing from maintaining your separate
fork, other than the principle of the thing. I'm compelled to distribute the
code under the AGPL, just as you are. If your features are compelling you
could instead negotiate with me for a cut of the license fee.

------
krick
In fact, I would wish for more thematic articles from people like OP, who
know the topic. It is easy to find an introductory course on NLP, but
introductory is introductory, and as OP states, there's a visible gap between
what Google does in 2015 and what GPL/MIT/BSD-licensed projects do in 2015 in
that area. While there's a relatively large amount of material on DSP or CV,
all these linguistics-related areas seem to have quite a barrier to entry
even for those willing to learn.

~~~
Veratyr
I have a feeling that's because a great deal of NLP use is either proprietary
or academic. It's either buried away in a combination of academic papers and
academic heads or hidden away in a corporate codebase the world doesn't have
access to.

The best way I'm aware of to get into NLP is to take a course at a
university. My university (University of Melbourne) was lucky to have an
undergraduate subject taught by one of the lead authors of NLTK (Steven
Bird), and that was a great help. You can even take the subject on its own
without enrolling in a full course.

Plus there are books on it. One of the main ones I'm aware of (by Bird) is
[http://www.chegg.com/textbooks/natural-language-processing-with-python-1st-edition-9780596516499-0596516495](http://www.chegg.com/textbooks/natural-language-processing-with-python-1st-edition-9780596516499-0596516495)

~~~
a_bonobo
FYI, the book is available for free here under cc-by-nc-nd:
[http://www.nltk.org/book/](http://www.nltk.org/book/)

A Python 3 edition will be released next year.

------
fdb
So, how does this compare to Pattern
([http://www.clips.ua.ac.be/pages/pattern](http://www.clips.ua.ac.be/pages/pattern))
-- a (in my opinion) very high-quality, BSD-licensed data mining and NLP
library coming from the academic world?

~~~
syllogism
From a quick speed test on my laptop, Pattern is 48x slower at POS tagging,
and 8x slower at parsing. I last benchmarked its accuracy in 2013, where I
found it got 93.5% on the WSJ corpus, vs 97% for the state-of-the-art taggers
--- so twice as many errors. It was also more domain-dependent. Its parser
doesn't produce exactly the same representations as mine, so I can't easily
evaluate its accuracy. But, I doubt it's very high.

Pattern doesn't really use machine learning, just some pre-computed statistics
from the annotated data, and some hand-crafted rules. Machine learning is
good. It's really the right way to build these systems.

------
homarp
Only English, right? Any plans for other languages (Spanish, French, ...)?

~~~
danieldk
One of the great things about modern NLP is that a system can easily be
trained for another language (assuming you have training data; see my other
comment). Since this uses statistical NLP techniques, it should be easy to
add languages.

------
IanCal
This looks fantastic, best of luck with it!

Sorry you're dealing with so many licensing questions here but a quick
clarification:

> If their company is acquired, the license will be transferred to the company
> acquiring them. However, to use spaCy in another product, they will have to
> buy a second license.

Is the second license only required because they sold the company on (and the
license along with it), or is a license per product generally required? In
other words, if I buy a single license, can I make and sell two different
products?

~~~
syllogism
Actually I appreciate the license questions, because it makes it easier to
know how to re-write the docs to clarify.

One license allows you to develop one product. If you stop work on one thing
you can re-use the license on something else, though --- it would be silly to
ask at what point a change of focus becomes a different product.

This seemed the sanest way to do it. I think per-site, per-user etc licenses
are really stupid. The license then impinges on your technical decisions.

------
_glass
This is great, thanks! A few weeks ago I had the problem of not being able to
get a passive-verb parser in CoreNLP to work fast enough. Does spaCy support
reduced passives?

~~~
syllogism
You can write rules to find them in the dependency parse, although the parse
tree won't necessarily be correct.

I've thought a lot about passive reduced relative clauses over the years ---
they were a big part of my PhD thesis. So I happen to know that the first one
in the WSJ data is wsj_0003.1. This isn't in the training or development data,
but it's in the same data set --- so, this is a fair but optimistic spot-
check. The sentence is:

> A form of asbestos once used to make Kent cigarette filters has caused a
> high percentage of cancer deaths among a group of workers exposed to it more
> than 30 years ago, researchers reported.

There are two reduced passives here --- "used" and "exposed", and a potential
(but unlikely) false positive in "reported".

spaCy correctly attached "exposed" to "workers", but didn't attach "used"
correctly --- it attached it to "reported" instead of "form". This doesn't
really make syntactic sense, but that's what it did --- the system's entirely
statistical; there's no grammar licensing certain attachments.

To see the parse, run:

    from spacy.en import English
    nlp = English()
    tokens = nlp(u'A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago, researchers reported.')
    for word in tokens:
        print word.orth_, word.tag_, word.dep_, word.head.orth_
~~~
_glass
Thanks! spaCy seems to be really easy to use, but CoreNLP still yields better
results, cf.
[http://nlp.stanford.edu:8080/corenlp/process](http://nlp.stanford.edu:8080/corenlp/process)

------
jesuslop
Glad to see the progress and will pass the word. Maybe you want to comment on
semantic parsing: whether your parser can help with it in a pipeline, whether
one can factor it through your parser, or whether, as in task 8 of SemEval
'14 [1], you need to rethink your structure (dependency tree vs. dependency
graph).

[1]
[http://alt.qcri.org/semeval2014/task8](http://alt.qcri.org/semeval2014/task8)
(with 2015 rerun)

------
Fede_V
This is a really neat project OP, and I hope you can make enough money to
sustain development. I really, really wish academia did a better job of
sponsoring people to maintain high quality software libraries, so you didn't
have to 'strike out' on your own though.

~~~
avinassh
I am willing to donate, and I guess many people would be. OP should add a
Flattr or Bitcoin wallet address; at least that would help this project.

edit: noticed OP offers trial license at $1:
[http://honnibal.github.io/spaCy/license.html](http://honnibal.github.io/spaCy/license.html)

~~~
syllogism
Better donation targets can be found at
[http://www.givewell.org](http://www.givewell.org) :)

I appreciate the thought, but if this isn't useful to support itself, then
obviously I was wrong, and I should find a more valuable project. But I don't
think that's the case --- I think this will help a lot of people build useful
products, so the commercial license should fund its development quite
adequately.

~~~
sanxiyn
Then don't consider it a donation, but as a separate pricing tier. Currently,
you are charging $5000 to those who don't want AGPL, and $0 to those who are
fine with AGPL. You can charge, say, $100 to those who are fine with AGPL but
want to pay.

If you want to increase sales, you could include donors' names in the
documentation in return for $100, for example.

------
ilyaeck
How about semantic parsing? It's a much less developed area than syntactic
parsing and ultimately that's where the real need is.

------
gpsarakis
Great work! Do you consider PyPy support also?

~~~
syllogism
I've considered it, yes --- but it's hard. Currently I segfault under PyPy.
I've got the learner and hash-table code working, but I need to debug the NLP.
I suspect it's the way I'm interning my strings.

------
wwwhizz
I didn't know what NLP was; maybe you should explain it (or simply write the
full words instead of the abbreviation) once at the beginning of your
website.

~~~
andreasvc
Natural Language Processing.

~~~
semi-extrinsic
I clicked this post because I assumed Non-Linear Programming. Was
disappointed.

------
Bill_Cosby
For anyone who has used this, how does this compare to TextBlob or Gensim?

~~~
syllogism
TextBlob is a wrapper around NLTK and Pattern. Those libraries don't use very
sophisticated statistics, so in 2013 I wrote a small Python POS tagger for
TextBlob, which performs much better:
[https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/](https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/)

spaCy's POS tagger works like the one in the blog post, but it's implemented
in Cython, and has some extra features.
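
For the curious, the heart of such a tagger is a simple perceptron update.
This is an illustrative sketch only: the names are made up, and the weight
averaging that gives the 'averaged' perceptron its name is omitted for
brevity.

```python
from collections import defaultdict

class PerceptronTagger:
    def __init__(self, tags):
        self.tags = tags
        self.weights = defaultdict(float)   # (feature, tag) -> weight

    def predict(self, features):
        scores = {t: sum(self.weights[(f, t)] for f in features)
                  for t in self.tags}
        return max(self.tags, key=lambda t: scores[t])

    def update(self, features, gold):
        # If the guess is wrong, boost the gold tag's feature weights
        # and penalize the guessed tag's.
        guess = self.predict(features)
        if guess != gold:
            for f in features:
                self.weights[(f, gold)] += 1.0
                self.weights[(f, guess)] -= 1.0
        return guess

tagger = PerceptronTagger(["NN", "VB"])
feats = ["word=run", "prev_tag=TO"]   # "to run" should come out a verb
for _ in range(3):
    tagger.update(feats, "VB")
print(tagger.predict(feats))  # VB
```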

Pattern has some nice morphological processing features. I don't do
morphological generation, for instance, and I haven't hooked up the
morphological analysis to the Python API yet.

TextBlob also gives you a few extra bits and pieces, like a wrapper of the
Google translate API.

I'm really targeting the situation where you want to build a product around
some NLP. In this use-case, you need the NLP to be fast, you need it to be as
accurate as anyone knows how to make such a system, and you need it to be
entirely in your control.

As far as GenSim goes: it's good. It does different things from spaCy, though
--- topic modelling, etc. It would be nice to interoperate between the two
libraries. I have no plans to implement topic models.

~~~
Radim
...and I have no plans to add NLP tools in gensim. The connection between
gensim and tokenizing/tagging/parsing libs is intentionally loose and
flexible.

I'm a fan of "do one thing, do it well".

Having said that, it would be great to facilitate "spaCy + gensim" pipelines
for users.

For example, the "word vector representations" can be trained easily with
gensim, on arbitrary user-specified corpora, whereas spaCy loads something
pre-trained, in a specific format. Maybe room for some interoperability there?
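
One plausible interchange point (an assumption on my part, not a documented
feature of either library) is the common plain-text vector format, one word
plus its floats per line, which only needs a small loader on the consuming
side:

```python
def load_vectors(lines):
    """Parse 'word f1 f2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        word, *values = line.split()
        vectors[word] = [float(v) for v in values]
    return vectors

# Two made-up 3-dimensional vectors in the plain-text format.
dump = ["cat 0.1 0.2 0.3", "dog 0.2 0.1 0.4"]
vecs = load_vectors(dump)
print(vecs["cat"])  # [0.1, 0.2, 0.3]
```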

------
ForHackernews
> But the academic code is always GPL, undocumented, unuseable, or all three.

I'm not sure why this author is further propagating FUD that suggests GPL code
is unsuitable for commercial use. Just because companies are irrationally
afraid of the GPL doesn't make it true.

~~~
syllogism
This was my understanding --- actually I designed the licensing structure of
this project around the assumption that companies would not want to use GPL
licensed code commercially. I offer an AGPL license, and offer a commercial
license for a fixed fee.

My understanding is that if you link to the library, your code must also be
GPL, which means that anyone linking to your code must be GPL, etc.

This is a problem if you're trying to sell your code. Probably you don't want
to make it GPL, and you probably don't want to force your customers to make
_their_ code GPL.

~~~
ForHackernews
> This is a problem if you're trying to sell your code.

I think relatively few tech companies are trying to sell their code directly
these days. Most are hoping to build a product and/or service and then charge
customers for access to it.

~~~
syllogism
An API or a web service counts as distribution under the AGPL. If you run such
a service, and you use spaCy --- either by linking the binary, or using it as
a network service --- you'll have to AGPL your code. Which introduces
equivalent restrictions on anyone who uses your service.

