Industrial-Strength Natural Language Processing in Python

syllogism · on March 15, 2017

Hi all,

Ironic timing here! We're just preparing the 1.7 release, which has a lot of nice changes, including the option of a much smaller model for English (50mb), to help people test faster.

This means that if you install the library right now, you'll have to redownload the data once the new version is released.

So, maybe wait until tomorrow to get started? Definitely our most ambivalent front-paging yet!

nickdavidhaynes · on March 15, 2017

Not sure exactly why this was posted today, since spaCy has been around at least a couple years, but - spaCy is a great tool, and I have a ton of respect for Matthew Honnibal, the main developer.

Coincidentally, I wrote a blog post [1] that went up just this morning that, in part, compares spaCy with the other giant in the Python NLP ecosystem, NLTK. TLDR - I think that, right now, the majority of users are better served by spaCy than NLTK.

[1] https://automatedinsights.com/blog/the-python-nlp-ccosystem-...

alando46 · on March 15, 2017

Nice write up!

est · on March 15, 2017

It only supports English and German. However you can try add other languages here https://spacy.io/docs/usage/adding-languages

syllogism · on March 15, 2017

We've got tokenizers and language data (e.g. stop lists) for a number of languages now. The statistical models for these will be coming soon.

bryanrasmussen · on March 16, 2017

but not tomorrow? If there are nltk packages for a language how difficult is it to move to Spacy?

I'm mainly interested in the Danish language right now, although I might also have a use for Italian so it would be nice to know so as to budget my time.

zeratul · on March 15, 2017

Ask HN: Could you suggest a fast library for converting documents into a sparse matrix representation (e.g., COO or CSR) in any programming language? I'm guessing C beats most of the implementation? But there is also the issue of efficient n-gram hashing/indexing.

syllogism · on March 15, 2017

Scikit-Learn's text vectorizer stuff is good. In spaCy you can do:

    import spacy
    import numpy
    from spacy.attrs import LOWER, IS_STOP

    nlp = spacy.load('en')
    doc = nlp(u'The quick brown fox...')
    array = doc.to_array([LOWER, IS_STOP])
    content = array[1, numpy.nonzero(array[0])]

Personally I normally work in Cython when it needs to be fast. I find this more productive and more readable than trying to guess what numpy operations will be fast. So I would be doing:

    cdef void get_tokens(uint64_t* content, Doc doc) nogil:
        for i in range(doc.length):
            token = &doc.c[i] 
            if Lexeme.c_check_flag(token.lex.flags, IS_STOP):
                content[i] = token.lex.lower

mkesper · on March 15, 2017

Spacy seems quite fast while being reasonably accurate: https://spacy.io/docs/api/#benchmarks

nreece · on March 15, 2017

Does spaCy have a C# .NET wrapper, or can it be used from other languages/frameworks through a REST API?

I'm using the CoreNLP C# wrapper, so I'm wondering if something similar (.NET Core compatible) is available/doable for spaCy?

deepnotderp · on March 15, 2017

NLTK is pretty good as well.

Vaskivo · on March 15, 2017

I find That SpaCy is much more "plug and play" or "fire and forget".

NLTK is better as a learning tool and for messing around. SpaCy is better if you just want something that "just works".

snackai · on March 15, 2017

Really great tool, I currently work on a project that makes use of Spacy. Can't wait to push it into production.