
Industrial-Strength Natural Language Processing in Python - federicoponzi
https://spacy.io/
======
syllogism
Hi all,

Ironic timing here! We're just preparing the 1.7 release, which has a lot of
nice changes, including the option of a much smaller model for English (50mb),
to help people test faster.

This means that if you install the library right now, you'll have to
redownload the data once the new version is released.

So, maybe wait until tomorrow to get started? Definitely our most ambivalent
front-paging yet!

------
nickdavidhaynes
Not sure exactly why this was posted today, since spaCy has been around at
least a couple years, but - spaCy is a great tool, and I have a ton of respect
for Matthew Honnibal, the main developer.

Coincidentally, I wrote a blog post [1] that went up just this morning that,
in part, compares spaCy with the other giant in the Python NLP ecosystem,
NLTK. TLDR - I think that, right now, the majority of users are better served
by spaCy than NLTK.

[1] [https://automatedinsights.com/blog/the-python-nlp-ccosystem-...](https://automatedinsights.com/blog/the-python-nlp-ccosystem-a-short-and-very-opinionated-guide)

~~~
alando46
Nice write up!

------
est
It only supports English and German. However, you can try adding other
languages here: [https://spacy.io/docs/usage/adding-languages](https://spacy.io/docs/usage/adding-languages)

~~~
syllogism
We've got tokenizers and language data (e.g. stop lists) for a number of
languages now. The statistical models for these will be coming soon.

~~~
bryanrasmussen
But not tomorrow? If there are NLTK packages for a language, how difficult is
it to move to spaCy?

I'm mainly interested in Danish right now, although I might also have a use
for Italian, so it would be nice to know so I can budget my time.

------
zeratul
Ask HN: Could you suggest a fast library for converting documents into a
sparse matrix representation (e.g., COO or CSR), in any programming language?
I'm guessing C beats most implementations? But there is also the issue of
efficient n-gram hashing/indexing.

~~~
syllogism
Scikit-Learn's text vectorizer stuff is good. In spaCy you can do:

    
    
        import spacy
        import numpy
        from spacy.attrs import LOWER, IS_STOP
    
        nlp = spacy.load('en')
        doc = nlp(u'The quick brown fox...')
        # One row per token; column 0 holds the LOWER id, column 1 the IS_STOP flag
        array = doc.to_array([LOWER, IS_STOP])
        # Lower-form ids of the tokens with the IS_STOP flag set
        content = array[array[:, 1] != 0, 0]
    

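The scikit-learn vectorizers mentioned above hand back scipy CSR matrices directly, which answers the sparse-representation question. As an illustration of the n-gram hashing/indexing side, here is a toy hashing vectorizer built on scipy's CSR constructor (a minimal sketch with made-up names, not scikit-learn's actual implementation; unigrams only, whitespace tokenization):

```python
import numpy as np
from scipy.sparse import csr_matrix

def hashed_bow(docs, n_features=2**10):
    """Toy hashing vectorizer: map tokens into a fixed-size feature
    space with the hashing trick and build a CSR document-term matrix."""
    indptr = [0]          # row boundaries into indices/data
    indices = []          # column (feature) indices
    data = []             # counts
    for doc in docs:
        counts = {}
        for tok in doc.lower().split():
            j = hash(tok) % n_features  # hashing trick: no vocabulary needed
            counts[j] = counts.get(j, 0) + 1
        indices.extend(counts.keys())
        data.extend(counts.values())
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr),
                      shape=(len(docs), n_features), dtype=np.int64)

X = hashed_bow(["the quick brown fox", "the lazy dog"])
```

The CSR constructor takes the three flat arrays directly, so no intermediate dense matrix is ever built; real implementations use a stable hash (Python's `hash` is randomized across processes) and sign-hashing to reduce collision bias.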
Personally I normally work in Cython when it needs to be fast. I find this
more productive and more readable than trying to guess what numpy operations
will be fast. So I would be doing:

    
    
        cdef void get_tokens(uint64_t* content, Doc doc) nogil:
            for i in range(doc.length):
                token = &doc.c[i]
                # Keep the lower-form id of tokens flagged as stop words
                if Lexeme.c_check_flag(token.lex.flags, IS_STOP):
                    content[i] = token.lex.lower

------
nreece
Does spaCy have a C# .NET wrapper, or can it be used from other
languages/frameworks through a REST API?

I'm using the CoreNLP C# wrapper, so I'm wondering if something similar (.NET
Core compatible) is available/doable for spaCy?

------
deepnotderp
NLTK is pretty good as well.

~~~
Vaskivo
I find that spaCy is much more "plug and play" or "fire and forget".

NLTK is better as a learning tool and for experimenting. spaCy is better if
you just want something that "just works".

------
snackai
Really great tool. I'm currently working on a project that makes use of
spaCy. Can't wait to push it into production.

