
SpaCy v1.0: Deep Learning with custom pipelines and Keras - syllogism
https://explosion.ai/blog/spacy-deep-learning-keras
======
nl
I use Spacy quite a lot for work and pleasure. I also regularly use (deep
breath) NLTK/CoreNLP/OpenNLP/Rosette/OpenIE, and I've written bits and pieces
in the area, so I have some perspective.

When I want to do something, I always use Spacy first. It's the 95% solution -
it does 95% of the things you are ever likely to do. The only downside is that
I haven't worked out how to avoid the slow startup time.
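For what it's worth, the usual workaround is to pay the load cost once and
keep the pipeline object around for the life of the process. A minimal
stdlib-only sketch of that pattern; `load_model` here is a hypothetical
stand-in for an expensive loader like `spacy.load`, not spaCy's actual code:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def load_model(name):
    # Hypothetical stand-in for an expensive one-time load such as
    # spacy.load("en"), which reads large vocab/weight files from disk.
    return {"name": name, "ready": True}

# The first call pays the full startup cost; every later call with the
# same name returns the exact same cached object instantly.
nlp = load_model("en")
assert load_model("en") is nlp
```

In a long-running service this amortizes the startup time to a one-off cost;
it doesn't help for short-lived scripts that start a fresh process each time.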

NLTK has more stuff. Some of it is crap, but some is useful. For example, if
you want WordNet then NLTK is an easy way to get it.

CoreNLP gives you high quality POS tagging and named entity extraction. It's a
Java library (but can run as a REST service) so that is good if you are on the
JVM.

OpenNLP... has an Apache license.

Rosette is expensive, and tuned entirely differently from any other text
processing library, but it has great multilingual support.

OpenIE is by far the best open information extraction toolkit.

Obviously this misses the 1000-pound gorilla: SyntaxNet/Parsey McParseface.
It isn't exactly trivial to run outside of Docker yet. I'm slowly getting my
stuff together on this.

~~~
dirtyaura
May I ask what kind of work you do, and what kind of NLP projects you work on
in your free time?

~~~
nl
Work is running a "data science" R&D program across a few universities and an
engineering team.

Fun is lots of stuff. I have a question-answering thing which I'm slowly
making progress on. I've written various (old) open source text things which
still seem to be used. I do Kaggle. I wrote a contextual advertising server
for a cycling website I ran (and ended up licensing some of the tech). I have
a vague home research project trying to combine unstructured knowledge with
structured knowledge graphs. Probably other stuff I've forgotten about.

~~~
dirtyaura
Thanks! I've been playing with spaCy lately. I have a product idea that
requires NLP, and I'm unfamiliar with NLP research and techniques. The idea
itself is rather complex to implement, so I'd like to have a couple of
interesting toy/side projects to hone my skills first. I hadn't thought of
Kaggle; that could provide a few good projects for understanding NLP tech
better.

------
butterm
What I love about spaCy is their dependency parsing visualization tool[0].
It's so much better than what Stanford offers.

Other than that, I find spaCy's philosophy of "one (best) way of doing
everything" a bit stifling. I don't think there is a "best" parser or "best"
named entity recognizer. A certain parser may perform very well in one domain
(for example, TweeboParser [1] performs well on tweets) and perform very
badly in another. This is true for almost everything in NLP, and NLTK embraces
this diversity quite well. This is why NLTK is my go-to tool when I want to do
something cutting edge in NLP.

[0]
[https://demos.explosion.ai/displacy/](https://demos.explosion.ai/displacy/)
[1]
[https://github.com/ikekonglp/TweeboParser](https://github.com/ikekonglp/TweeboParser)

~~~
syllogism
I definitely agree that the same weights won't be optimal for different
domains. If you need to parse tweets, you should have a tweet-trained model.
The tweet model probably shouldn't be thinking about Jane Austen novels. We
want to open a model store where you can buy language and domain specific
models.

I think 99% of the time there's one best algorithm, and even one best
implementation of it. It's the weights, and sometimes the features, that need
to vary.

Finally — I love displaCy too. Ines does great work :). Have you seen that we
open-sourced this recently? It's now very easy to run locally, and connect up
to the model you're developing. You can use this with any other parser, too.
[https://explosion.ai/blog/displacy-js-nlp-visualizer](https://explosion.ai/blog/displacy-js-nlp-visualizer)

~~~
butterm
I am so glad that you guys open sourced displaCy. I would love to give it a
spin on my system. Kudos for all the great work you are doing!

------
visarga
There's also the python library Polyglot, featuring NLP for many languages
(depending on the task, from 15 to 135 languages):

[https://polyglot.readthedocs.io/en/latest/index.html](https://polyglot.readthedocs.io/en/latest/index.html)

------
grej
Very cool! I'm familiar with NLTK, but not SpaCy. It looks like the speed
benefits of SpaCy could help greatly in processing large volumes of
unstructured text.

I'm curious what type of feedback you've gotten from users that have migrated
from NLTK to SpaCy.

~~~
Smerity
I'm not the author but can speak to some of this. Some amount of this was
answered by the author himself in his blog post "Dead Code Should Be Buried –
Why I Didn't Contribute to NLTK"[1].

While there are portions of NLTK that are useful, it suffers from having too
many things under the hood and a lack of consistent maintenance. Many of its
components are superglued external libraries, while others have not been
maintained or have never been made to work well with the rest of NLTK.

The biggest example of this was that, for some time, it wasn't certain how the
NLTK part-of-speech tagger was trained[2]. While that has since been remedied,
the remedy uses part-of-speech tagging code inspired by earlier work by the
author of spaCy[3][4].
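For context, the tagger in question is an averaged perceptron: a plain
multi-class perceptron whose final weights are averaged over all training
steps to reduce overfitting. A toy sketch of the core idea, with made-up
feature names and tags for illustration (this is not spaCy's or NLTK's actual
code):

```python
from collections import defaultdict

class AveragedPerceptron:
    """Toy multi-class perceptron with weight averaging: the idea behind
    the tagger described in [4], heavily simplified."""

    def __init__(self, classes):
        self.classes = classes
        self.weights = defaultdict(lambda: defaultdict(float))  # feat -> class -> w
        self._totals = defaultdict(lambda: defaultdict(float))
        self._stamps = defaultdict(lambda: defaultdict(int))
        self.i = 0  # update counter, used for averaging

    def predict(self, features):
        scores = {c: 0.0 for c in self.classes}
        for f in features:
            for c, w in self.weights[f].items():
                scores[c] += w
        return max(self.classes, key=lambda c: scores[c])

    def update(self, truth, guess, features):
        self.i += 1
        if truth == guess:
            return
        for f in features:
            for c, delta in ((truth, 1.0), (guess, -1.0)):
                # Accumulate the old weight over the steps it was live,
                # so the final average reflects the whole training run.
                self._totals[f][c] += (self.i - self._stamps[f][c]) * self.weights[f][c]
                self._stamps[f][c] = self.i
                self.weights[f][c] += delta

    def average(self):
        for f, cw in self.weights.items():
            for c in cw:
                self._totals[f][c] += (self.i - self._stamps[f][c]) * cw[c]
                cw[c] = self._totals[f][c] / self.i

# Tiny demo: learn that the (made-up) feature "suffix=ing" signals VBG.
p = AveragedPerceptron(["NN", "VBG"])
for _ in range(5):
    guess = p.predict(["suffix=ing"])
    p.update("VBG", guess, ["suffix=ing"])
p.average()
```

The real taggers add richer features (word shape, surrounding tags) and a
greedy left-to-right decoding loop, but the update rule is the same.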

spaCy has been built from the ground up to support more modern deep learning
workflows (easy integration of Keras is quite cool, but I'm really referring
to the default inclusion of word vectors, which has existed since early spaCy)
while having a consistent and clean API that covers many sane use cases.
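The word-vector support mentioned above boils down to a dense vector per
token, compared with cosine similarity. A stdlib-only illustration with toy
3-d vectors I made up for the example (spaCy's real vectors are hundreds of
dimensions, learned from large corpora):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot(u, v) / (|u| * |v|), in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy "word vectors"; real ones place related words close together
# because they appear in similar contexts.
vec = {
    "cat": (0.9, 0.1, 0.0),
    "dog": (0.8, 0.2, 0.1),
    "car": (0.0, 0.1, 0.9),
}
assert cosine(vec["cat"], vec["dog"]) > cosine(vec["cat"], vec["car"])
```

In spaCy this is what sits behind the per-token `.vector` and `.similarity`
machinery; having it there by default is what makes the deep-learning
integrations straightforward.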

The API is highly Pythonic, but the underlying performance-critical components
are written in Cython, meaning very few parsers are actually able to beat
spaCy on speed (especially on CPU rather than GPU). Multi-threading is also
relatively trivial, since Cython releases the GIL around the syntactic parser.
As a reference point for speed, they used it to tag and parse every Reddit
comment from 2015 and then generate "sense2vec"[5], at over 100,000 words per
second using 4 threads.
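The batching side of that is the ordinary fan-out pattern sketched below with
only the standard library; `parse` is a dummy stand-in for the real pipeline.
Note the actual parallel speedup depends on the worker releasing the GIL
(which spaCy's Cython parser does, and which pure Python like this does not):

```python
from concurrent.futures import ThreadPoolExecutor

def parse(batch):
    # Dummy stand-in for the parser: whitespace "tokenization" only.
    return [text.split() for text in batch]

def batches(texts, size):
    # Stream the corpus through in fixed-size chunks.
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

texts = ["spaCy parses text fast"] * 10
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [doc for out in pool.map(parse, batches(texts, 3)) for doc in out]
```

spaCy exposes this pattern directly through its streaming `pipe` API, so
users don't have to manage the thread pool themselves.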

Finally, and most importantly, spaCy remains competitive with state-of-the-art
results, and all of it is under an MIT license.

As you might have guessed, I'm a fan :)

Note: I was in the same NLP lab as Matthew Honnibal for my undergraduate so
will admit positive bias ;) This was long before spaCy was a thing however -
he was simply a fascinating and interesting guy for other reasons ^_^

[1]: [https://explosion.ai/blog/dead-code-should-be-buried](https://explosion.ai/blog/dead-code-should-be-buried)

[2]:
[https://github.com/nltk/nltk/issues/1063](https://github.com/nltk/nltk/issues/1063)

[3]:
[https://github.com/nltk/nltk/pull/1143](https://github.com/nltk/nltk/pull/1143)

[4]: [https://explosion.ai/blog/part-of-speech-pos-tagger-in-pytho...](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python)

[5]: [https://explosion.ai/blog/sense2vec-with-spacy](https://explosion.ai/blog/sense2vec-with-spacy)

