Industrial-Strength Natural Language Processing in Python (spacy.io)
169 points by federicoponzi on March 15, 2017 | hide | past | favorite | 13 comments



Hi all,

Ironic timing here! We're just preparing the 1.7 release, which has a lot of nice changes, including the option of a much smaller model for English (50 MB), to help people test faster.

This means that if you install the library right now, you'll have to redownload the data once the new version is released.

So, maybe wait until tomorrow to get started? Definitely our most ambivalent front-paging yet!


Not sure exactly why this was posted today, since spaCy has been around at least a couple years, but - spaCy is a great tool, and I have a ton of respect for Matthew Honnibal, the main developer.

Coincidentally, I wrote a blog post [1] that went up just this morning that, in part, compares spaCy with the other giant in the Python NLP ecosystem, NLTK. TL;DR: I think that, right now, the majority of users are better served by spaCy than by NLTK.

[1] https://automatedinsights.com/blog/the-python-nlp-ccosystem-...


Nice write up!


It only supports English and German. However, you can try adding other languages here: https://spacy.io/docs/usage/adding-languages


We've got tokenizers and language data (e.g. stop lists) for a number of languages now. The statistical models for these will be coming soon.


But not tomorrow? If there are NLTK packages for a language, how difficult is it to move to spaCy?

I'm mainly interested in the Danish language right now, although I might also have a use for Italian so it would be nice to know so as to budget my time.


Ask HN: Could you suggest a fast library for converting documents into a sparse matrix representation (e.g., COO or CSR), in any programming language? I'm guessing C beats most implementations? But there is also the issue of efficient n-gram hashing/indexing.


Scikit-Learn's text vectorizer stuff is good. In spaCy you can do:

    import spacy
    import numpy
    from spacy.attrs import LOWER, IS_STOP

    nlp = spacy.load('en')
    doc = nlp(u'The quick brown fox...')
    # One row per token, one column per attribute: col 0 = LOWER, col 1 = IS_STOP
    array = doc.to_array([LOWER, IS_STOP])
    # Keep the lower-cased lexeme IDs of the content (non-stop) tokens
    content = array[array[:, 1] == 0, 0]
Personally I normally work in Cython when it needs to be fast. I find this more productive and more readable than trying to guess what numpy operations will be fast. So I would be doing:

    cdef void get_tokens(uint64_t* content, Doc doc) nogil:
        # Write the lower-cased lexeme ID of each content (non-stop) token
        for i in range(doc.length):
            token = &doc.c[i]
            if not Lexeme.c_check_flag(token.lex.flags, IS_STOP):
                content[i] = token.lex.lower
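For the n-gram hashing/indexing part of the question above, the "hashing trick" that vectorizers typically use can be sketched in plain Python, building the three CSR arrays directly. This is an illustrative sketch, not spaCy or scikit-learn API; the function names are made up.

    from collections import defaultdict

    def ngrams(tokens, n):
        # All contiguous n-grams of a token sequence
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def hash_docs_to_csr(docs, n=2, n_buckets=1 << 10):
        # CSR representation: data[k] is the count at column indices[k];
        # row i spans data[indptr[i]:indptr[i + 1]]
        indptr = [0]
        indices = []
        data = []
        for tokens in docs:
            counts = defaultdict(int)
            for gram in ngrams(tokens, n):
                # Hash each n-gram into a fixed number of buckets
                counts[hash(gram) % n_buckets] += 1
            for col in sorted(counts):
                indices.append(col)
                data.append(counts[col])
            indptr.append(len(indices))
        return data, indices, indptr

The point of hashing into a fixed bucket count is that you never need to hold a vocabulary in memory, at the cost of occasional collisions; a C implementation of the same loop (or the Cython approach above) mostly buys you faster hashing and no interpreter overhead.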


spaCy seems quite fast while being reasonably accurate: https://spacy.io/docs/api/#benchmarks


Does spaCy have a C# .NET wrapper, or can it be used from other languages/frameworks through a REST API?

I'm using the CoreNLP C# wrapper, so I'm wondering if something similar (.NET Core compatible) is available/doable for spaCy?
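I'm not aware of an official C# wrapper, but putting spaCy behind a small HTTP service is doable with just the Python standard library. A hedged sketch: here `analyze()` is a whitespace-splitting stand-in for a real pipeline (in practice you would call `nlp = spacy.load('en')` and run `nlp(text)`), and the port is arbitrary.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def analyze(text):
        # Placeholder analysis: whitespace tokenization stands in for nlp(text)
        return {"tokens": text.split()}

    class NLPHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read the raw request body as the text to analyze
            length = int(self.headers.get("Content-Length", 0))
            text = self.rfile.read(length).decode("utf-8")
            body = json.dumps(analyze(text)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    # To serve: HTTPServer(("127.0.0.1", 8080), NLPHandler).serve_forever()

From .NET you would then POST text with HttpClient and parse the JSON response, the same way the CoreNLP server wrappers work.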


NLTK is pretty good as well.


I find that spaCy is much more "plug and play" or "fire and forget".

NLTK is better as a learning tool and for messing around. spaCy is better if you just want something that "just works".


Really great tool. I'm currently working on a project that makes use of spaCy. Can't wait to push it into production.



