Ironic timing here! We're just preparing the 1.7 release, which has a lot of nice changes, including the option of a much smaller model for English (50 MB), to help people test faster.
This means that if you install the library right now, you'll have to redownload the data once the new version is released.
So, maybe wait until tomorrow to get started? Definitely our most ambivalent front-paging yet!
Not sure exactly why this was posted today, since spaCy has been around at least a couple years, but - spaCy is a great tool, and I have a ton of respect for Matthew Honnibal, the main developer.
Coincidentally, I wrote a blog post [1] that went up just this morning that, in part, compares spaCy with the other giant in the Python NLP ecosystem, NLTK. TLDR - I think that, right now, the majority of users are better served by spaCy than NLTK.
But not tomorrow? If there are NLTK packages for a language, how difficult is it to move to spaCy?
I'm mainly interested in the Danish language right now, although I might also have a use for Italian, so it would be nice to know in order to budget my time.
Ask HN: Could you suggest a fast library for converting documents into a sparse matrix representation (e.g., COO or CSR) in any programming language? I'm guessing C beats most other implementations? But there is also the issue of efficient n-gram hashing/indexing.
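For concreteness, here's a minimal pure-Python sketch of what the CSR conversion amounts to: one pass over tokenized documents, a token-to-column dictionary, and the three flat arrays (data, indices, indptr). A real implementation would hand this to scipy.sparse or a vectorizer; the function name and the token-list input format are just assumptions for illustration.

```python
from collections import Counter

def docs_to_csr(docs):
    """Build CSR arrays (data, indices, indptr) plus a vocabulary
    from an iterable of token lists. Row i of the implied matrix
    spans indices[indptr[i]:indptr[i+1]]."""
    vocab = {}                      # token -> column index
    data, indices, indptr = [], [], [0]
    for tokens in docs:
        for tok, count in Counter(tokens).items():
            col = vocab.setdefault(tok, len(vocab))  # assign columns on first sight
            indices.append(col)
            data.append(count)
        indptr.append(len(indices))  # close the current row
    return data, indices, indptr, vocab

docs = [["the", "cat", "sat"], ["the", "dog"]]
data, indices, indptr, vocab = docs_to_csr(docs)
```

The arrays drop straight into `scipy.sparse.csr_matrix((data, indices, indptr))` if you want an actual matrix object.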
Personally I normally work in Cython when it needs to be fast. I find this more productive and more readable than trying to guess what numpy operations will be fast. So I would be doing:
    cdef void get_tokens(uint64_t* content, Doc doc) nogil:
        cdef int i
        cdef const TokenC* token
        for i in range(doc.length):
            token = &doc.c[i]
            # Skip stop words; otherwise record the lowercased form's hash ID.
            if not Lexeme.c_check_flag(token.lex, IS_STOP):
                content[i] = token.lex.lower
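On the n-gram hashing/indexing part of the question: you can skip the token-to-id dictionary entirely with feature hashing, mapping each n-gram straight to a column index modulo a fixed table size. A stdlib-only sketch, using crc32 as a stand-in for the faster non-cryptographic hashes (e.g. MurmurHash) a real implementation would use; the function name and bucket count are assumptions:

```python
from zlib import crc32

N_FEATURES = 2 ** 20  # fixed number of hash buckets (columns)

def hashed_ngram_columns(tokens, n=2):
    """Map each n-gram of `tokens` directly to a column index,
    with no vocabulary dictionary. Collisions are accepted as noise."""
    cols = []
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        cols.append(crc32(ngram.encode("utf-8")) % N_FEATURES)
    return cols
```

This is the same trick scikit-learn's HashingVectorizer and Vowpal Wabbit use: constant memory, no fit pass over the corpus, at the cost of occasional bucket collisions.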