
NLP concepts with spaCy tutorial - jxub
https://gist.github.com/aparrish/f21f6abbf2367e8eb23438558207e1c3
======
JPKab
I love spaCy, and highly recommend it to anyone who needs to build production
NLP software.

It is truly groundbreaking, and a major improvement over NLTK. I also
recommend gensim, another phenomenal library for NLP.

~~~
wodenokoto
I've always understood NLTK to be a teaching tool, and am accordingly surprised
when I see people use it in production.

spaCy really fills an important gap.

------
Xeoncross
I would love to see more tutorials explaining how to use these basic building
blocks (also covered by the great documentation) to find semantic meaning:
comparing sentences, understanding basic intent as in
[https://explosion.ai/blog/chatbot-node-js-spacy](https://explosion.ai/blog/chatbot-node-js-spacy),
or some other useful task beyond finding all the nouns in a document.
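
The farthest I've gotten on my own is spaCy's built-in vector similarity,
which feels like a start but not real intent detection. A minimal sketch
(assuming the `en_core_web_md` model is installed, since the small model
ships without word vectors):

    import spacy

    # The medium model ships with word vectors; the small one does not.
    nlp = spacy.load("en_core_web_md")

    doc1 = nlp("Where can I find a good pizza place?")
    doc2 = nlp("I'm looking for somewhere to eat Italian food.")
    doc3 = nlp("The stock market closed lower today.")

    # Doc.similarity is the cosine similarity of averaged word vectors --
    # a crude signal, fine for ranking candidates, not for precise intent.
    print(doc1.similarity(doc2))  # relatively high
    print(doc1.similarity(doc3))  # relatively low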

------
wyldfire
I think spaCy is pretty great. I'm curious to see more about how Prodigy
(Montani's annotation/training tool) turns out. [1]

Oh, I see now that it's quite a lot further along than when I last checked in.
~$400 for an individual license seems pretty fair IMO.

[1] [https://prodi.gy/](https://prodi.gy/)

~~~
syllogism
Thanks! I'll answer this, since Ines doesn't have an account here (I'm Matt).

We're really happy with how Prodigy's being received. It's only been on sale
two months, so I'm looking forward to hearing more success stories as people
finish their projects (and of course, feedback to change what needs to be
changed!).

You can read how FullFact used it to train claim identification models for
fact-checking here:
[https://fullfact.org/blog/2018/feb/how-we-customised-prodigy-ai/](https://fullfact.org/blog/2018/feb/how-we-customised-prodigy-ai/)

Probably the best place to follow the progress is the support forum:
[https://support.prodi.gy/](https://support.prodi.gy/)

We're also working on more tutorial videos. This video shows the workflow for
training a new entity type:
[https://www.youtube.com/watch?v=l4scwf8KeIA](https://www.youtube.com/watch?v=l4scwf8KeIA).
This is one of the bits of the tool we're particularly proud of --- you can
start off with a couple of seed terms, use word vectors to build up a larger
terminology list, and then turn that list into a set of pattern rules to start
bootstrapping a classifier. Prodigy will suggest phrases that match the
patterns as entities, and your answers are used to train the statistical
model. As you keep annotating, the model will start suggesting phrases too,
which you'll say yes or no to. Eventually the model basically takes over, and
you're mostly correcting its suggestions.
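
The pattern-rule step maps directly onto spaCy's Matcher, for anyone curious
what it looks like underneath. This isn't Prodigy's code, just a sketch of
the underlying idea (the terminology list here is a stand-in for one built up
from seed terms and word vectors):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # A terminology list turned into token patterns; multi-word terms
    # become one token pattern per word.
    terms = ["heroin", "methadone", "crystal meth"]
    patterns = [[{"LOWER": tok} for tok in term.split()] for term in terms]
    matcher.add("DRUG", None, *patterns)  # spaCy v2 signature

    doc = nlp("He was prescribed methadone after using crystal meth.")
    for match_id, start, end in matcher(doc):
        # Each match is a candidate entity; in Prodigy, your accept/reject
        # decisions on these candidates train the statistical model.
        print(doc[start:end].text)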

------
f00_
After reading "Backprop as Functor: A compositional perspective on supervised
learning", it came to my attention that the spaCy backend (Thinc) is built
with higher-order functions instead of a computational graph (unlike
TensorFlow, Chainer, or PyTorch).

[https://github.com/explosion/thinc#no-computational-graph--just-higher-order-functions](https://github.com/explosion/thinc#no-computational-graph--just-higher-order-functions)

[https://arxiv.org/abs/1711.10455](https://arxiv.org/abs/1711.10455)

Could someone give me some more detail?

~~~
syllogism
You know, I'm still not 100% certain whether there's a substantive difference
between the "computational graph" perspective and this "functor" approach.
Actually the feeling is sort of eerily familiar, because I spent most of my
PhD confused about whether these grammar formalisms I was working with were
really just notational variants, or whether there were significant
differences. About the grammar formalisms, I ended up deciding that in theory
there weren't, in practice there sort of were.

About these neural networks, I think it's "just" implementation. Here's the
linear layer implementation in Chainer:
[https://github.com/chainer/chainer/blob/master/chainer/funct...](https://github.com/chainer/chainer/blob/master/chainer/functions/connection/linear.py)

We have the forward and backward pass organized as class methods here, and the
intermediate state from the forward pass is saved into attributes in the
instance. So on each call to the network, we make an instance of this
LinearFunction class.

In terms of what's being computed, there's really no difference between this
and what happens when you call a layer in Thinc. It's just that the state gets
captured in the outer scope of the closure. Maybe Thinc's way has a little
less overhead, if there are fewer levels of indirection. Thinc uses the
Chainer folks' GPU library --- so, unsurprisingly, if you define the same
network, the benchmarks are very similar.
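
To make the contrast concrete, here's a toy numpy sketch of the closure style
(the idea, not Thinc's actual API): the forward pass returns the output
together with a backprop callback that closes over the input, instead of
stashing it on an instance:

    import numpy as np

    def linear_layer(n_in, n_out):
        # Parameters live in the enclosing scope of the returned function.
        W = np.random.randn(n_out, n_in) * 0.1
        b = np.zeros(n_out)

        def forward(X):
            Y = X @ W.T + b

            def backprop(dY):
                # The input X needed for the weight gradient is captured
                # by this closure -- no instance attributes, no graph.
                dW = dY.T @ X
                db = dY.sum(axis=0)
                dX = dY @ W
                return dX, (dW, db)

            return Y, backprop

        return forward

    # Composing layers is ordinary function composition, so control flow
    # stays in plain Python rather than inside a framework object.
    layer = linear_layer(4, 3)
    Y, backprop = layer(np.ones((2, 4)))
    dX, grads = backprop(np.ones((2, 3)))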

On the other hand...I do think the implementation matters! Here's a difference
for you: if the library approaches it as "we're going to build a computational
graph, and execute it", then the library is going to steal the control flow.
If the library tells you "here are some functions, and some higher-order
functions to compose them", you have more access. PyTorch and Chainer don't
steal the control flow to nearly the extent that TensorFlow does, but they
still build up and tear down the state in their objects, and that makes it
harder to intrude.

(I'm the author of spaCy and Thinc)

~~~
jph00
To clarify the grandparent comment: PyTorch also supports a functional
approach rather than a computational graph. There's even a PyTorch functional
model zoo :)
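
For example, the same linear layer through `torch.nn.functional`, with
parameters as plain tensors rather than a Module (a minimal sketch):

    import torch
    import torch.nn.functional as F

    # Parameters as plain tensors; no Module, no layer objects.
    W = torch.randn(3, 4, requires_grad=True)
    b = torch.zeros(3, requires_grad=True)

    x = torch.ones(2, 4)
    y = F.linear(x, W, b)   # same computation as nn.Linear(4, 3)(x)
    y.sum().backward()      # autograd still tracks the functional call
    print(W.grad.shape)     # torch.Size([3, 4])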

------
artpar
This is a Docker image exposing spaCy (+ WordNet + neuralcoref) over HTTP:

[https://hub.docker.com/r/artpar/languagecrunch/](https://hub.docker.com/r/artpar/languagecrunch/)

[https://github.com/artpar/languagecrunch](https://github.com/artpar/languagecrunch)

------
the_duck

        There is no such thing as a sentence, or a phrase, or a part
        of speech, or even a "word"---these are all pareidolic
        fantasies occasioned by glints of sunlight we see reflected
        on the surface of the ocean of language; fantasies that we
        comfort ourselves with when faced with language's infinite
        and unknowable variability.
    

'Pareidolic' is my new favourite word.

[https://en.wikipedia.org/wiki/Pareidolia](https://en.wikipedia.org/wiki/Pareidolia)

~~~
ramblenode
This is a kind of tautology. Words and phrases exist not because of the whims
of grammarians but because they are psychological realities for people. They
are concepts represented in the mind and brain which exist whether or not
people are explicitly aware of their existence--and this is demonstrated in a
large psycho- and neurolinguistic literature. This is different from the man
in the moon, where we are talking about the anthropomorphism of bits of rock
and dust--whether the ontology of such a thing is a human face or just bits
of rock and dust.

~~~
posterboy
I don't know, the moon face seems real enough at the moment of seeing it, too.

Linking this to signal theory and the Fourier transform, one point to consider
is that solutions are only true in the infinite limit, so a word or a phrase
is never enough to represent reality. A sense of continuity is real enough,
but so is discontinuity, although I can't position that in a psychological
frame. Or a neurological one. But speaking with the Y combinator in mind, I
don't think words are the fixed points of thought; feelings are. Maybe
onomatopoetic names are, and familiar faces are close enough.

------
wyldfire
Is there anyone out there who has tried linking named entities with an
ontology? All I've seen is research, but I'm curious whether anyone's done
practical work in this area. Even if it were a narrowly-scoped ontology, it
might be pretty interesting.

~~~
Radim
There are a number of tools that do _entity linking_ (that's the phrase
you're looking for), including some open source ones.

We've evaluated and used quite a few over the years: there's Dexter [0] by
Diego Ceccarelli, Semanticizer by UvA [1] and DBpedia Spotlight [2] and a few
others. We've used them for various linking tasks, such as detecting "work
skills" in plain text (HR domain) or detecting drug names (medical domain).

The extent to which these tools allow customization (ease of plugging in your
own ontology, support for your input format and disambiguation signals)
differs. Either way, even though this research includes open source code, the
code is more of the "research prototype" kind. Don't expect a plug&play,
optimized production tool.

[0] [http://dexter.isti.cnr.it/](http://dexter.isti.cnr.it/)

[1]
[https://github.com/semanticize/semanticizer](https://github.com/semanticize/semanticizer)

[2] [https://github.com/dbpedia-spotlight/dbpedia-spotlight](https://github.com/dbpedia-spotlight/dbpedia-spotlight)
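
If you just want a quick taste, DBpedia Spotlight exposes a hosted demo
endpoint (the URL and JSON field names below are the service's current ones
as far as I know; for real work, run your own instance from the repo above):

    import requests

    # Hosted demo endpoint; it may move or rate-limit.
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": "Berlin is the capital of Germany.",
                "confidence": 0.5},
        headers={"Accept": "application/json"},
    )
    for res in resp.json().get("Resources", []):
        # Each resource links a surface form in the text to a DBpedia URI.
        print(res["@surfaceForm"], "->", res["@URI"])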

~~~
vadimberman
Would you be interested in a plug&play optimised production tool if it
required licensing?

------
amirouche
I don't understand the hype around spaCy, especially when it relies on
(closed-source) annotated corpora to do all of its job. The default models
fail in simple cases, and it takes a lot to train a new model on a new
corpus.

I am wondering how people solve actual problems with spaCy. What are the use
cases? Is spaCy used in question answering systems or summarization
pipelines? Maybe conceptual search?

I prefer the approach of link grammar / relex, which are based on
dictionaries / grammars; it seems easier and less error-prone.

Prodigy is genius! More tools like that will be built in the next few years...

~~~
bosie
> I prefer the approach of link grammar / relex, which are based on
> dictionaries / grammars; it seems easier and less error-prone.

Can you link to something open source that outperforms spacy on the basic NLP
tasks (NER, POS, dependency parsing)?

~~~
amirouche
For all of them at the same time, I agree that there is no equivalent; that
said, for each task there are better tools.

