SpaCy 3.0 (github.com/explosion)
484 points by syllogism on Feb 1, 2021 | 81 comments

I have been using Spacy3 nightly for a while now. This is game changing.

Spacy3 practically covers 90% of NLP use-cases with near SOTA performance. The only reason to not use it would be if you are literally pushing the boundaries of NLP or building something super specialized.

Hugging Face and Spacy (also Pytorch, but duh) are saving millions of dollars in man hours for companies around the world. They've been a revelation.

Everything in the above paragraph sounds like a hyped overstatement. None of it is.

As someone that's worked on some rather intensive NLP implementations, Spacy 3.0 and HuggingFace both represent the culmination of a technological leap in NLP that started a few years ago with the advent of transfer learning in NLP. The level of accessibility to the masses these libraries offer is game-changing and democratizing.

Can you help, please?

I want to use AI to translate (localize) messages for free software, in my case into Ukrainian. My plan for improving the quality of automated translation is to translate from similar languages in parallel, i.e. give the same message in English, Russian, and Polish, and expect the message in Ukrainian as output.

Where should I start? Which libraries should I use? How do I connect them? How do I train them?

MarianMT has a lot of pre-trained models from one language to another, here is one for Polish to Ukrainian


and English to Ukrainian


You can test the models with the form input on the right-hand side

And here's a site with docs and code examples for basic usage in your app


You can also use Marian's training code on PyTorch with custom source and target texts.

Translating from English alone creates a lot of ambiguity. The idea is to use two or more input languages to reduce that ambiguity.

Currently, volunteers manually translate from English to all other languages. It's possible to use AI to help, but the quality of translation is too low. A dumb Translation Memory database is much more helpful than an AI translator.

In the new system, one volunteer will translate messages from English to his native language, while all others will be able to use this translation as the additional constraint, to improve quality of translation to their native language. More input languages -> more constraints -> better quality of translation.

My best idea, so far, is to use a multilingual tokenizer (SpaCy looks good) and a linear transformer, because linear transformers are able to accept large inputs, with thousands of tokens. IMHO, I can input multiple translations of messages (they are short), and define loss as expected Ukrainian translation.

However, I'm completely new to this field. I have completed just one AI project so far: recognition of animals in video. I have no idea how to start.

For example, I have no idea how to hint to the transformer that the input messages are the same, just in 4 different languages. Should I interleave the messages, like "Cannot Не можу open відкрити file файл : : ", or put them side by side with a separator: "\0Cannot open file: \0Не можу відкрити файл: \0", or create several independent inputs? Or use a memory and improve the quality of the output message incrementally?
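
To make the separator option concrete, here is a minimal sketch in plain Python. The `<sep>` token, the `<en>`-style language tags and the function name are all assumptions for illustration; a real model would need such tokens added to its vocabulary, and whether this layout beats the alternatives is an empirical question.

```python
# Hypothetical separator token; not part of any existing tokenizer.
SEP = "<sep>"


def build_multisource_input(translations, target_lang):
    """Join parallel translations of one message into a single input string.

    `translations` maps a language code to the message text in that language.
    Each segment is prefixed with a language tag so the sources can be told
    apart; the target language tag goes last as the generation prompt.
    """
    # Sort by language code so the same message always yields the same layout.
    segments = [f"<{lang}> {text}" for lang, text in sorted(translations.items())]
    return f" {SEP} ".join(segments) + f" {SEP} <{target_lang}>"


example = build_multisource_input(
    {
        "en": "Cannot open file:",
        "ru": "Не могу открыть файл:",
        "pl": "Nie można otworzyć pliku:",
    },
    target_lang="uk",
)
print(example)
```

The training target for this input would then be the Ukrainian message, so adding another source language only means adding another tagged segment.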

Any hints where to start?

I've been using LASER from Facebook Research via https://github.com/yannvgn/laserembeddings to accept multi-lingual input in front of the domain-specific models for recommendations and stuff (that are trained on English annotated examples).

This sounds interesting. Can you share more please? It sounds like there is some multilingual input text on the basis of which you make recommendations, but I think you would have called that a search engine rather than recommender.

That's true, I'm making recommendations based on Multinomial Naive Bayes (and SGDClassifier) over custom TF-IDF bags of words, so it is like search plus text classification. And some endpoints do just check the cosine or Jaccard distance between things. There is a lot of overlap between search and NLP.
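
For readers unfamiliar with the two measures, here is a rough stdlib-only sketch. It uses raw token counts and a naive whitespace tokenizer rather than the custom TF-IDF weighting mentioned above, and the example sentences are made up; the corresponding distances are simply one minus each similarity.

```python
import math
from collections import Counter


def tokens(text):
    # Naive lowercase whitespace tokenizer, a stand-in for a real one.
    return text.lower().split()


def jaccard_similarity(a, b):
    # Overlap of the two vocabularies, ignoring how often each word occurs.
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    # Angle between the two bag-of-words count vectors.
    vec_a, vec_b = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = "the tenant must pay rent"
doc_b = "the tenant must pay damages"
print(round(jaccard_similarity(doc_a, doc_b), 3))  # 4 shared words out of 6 distinct
print(round(cosine_similarity(doc_a, doc_b), 3))
```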

My approach to AI is somewhat conservative because of working in a law-adjacent field where explainability is paramount. When it comes to getting "smart" I prefer forward-chaining logic over facts, and facts include predictions from models too. But at least there is a "judge"/engine to coordinate how the predictions from the ensemble of models map to actions. I love me some pretrained neural nets, but use them more as black box appliances.
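
As an illustration of forward chaining over facts that include model predictions, here is a toy engine in plain Python. It is not the API of any particular rules library, and the fact and rule names are invented; the point is only that each derived action carries an audit trail.

```python
def forward_chain(facts, rules):
    """Apply rules until no new fact can be derived.

    `facts` is a set of strings; `rules` is a list of (premises, conclusion)
    pairs, where `premises` is a set of facts that must all hold.
    Returns the closed fact set plus a trail of which premises led to what.
    """
    known = set(facts)
    trail = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                trail.append((sorted(premises), conclusion))
                changed = True
    return known, trail


# Hypothetical facts: two come from model predictions, one from metadata.
facts = {"model:is_contract", "model:mentions_deadline", "meta:unsigned"}
rules = [
    ({"model:is_contract", "meta:unsigned"}, "action:request_signature"),
    ({"action:request_signature", "model:mentions_deadline"}, "action:flag_urgent"),
]

known, trail = forward_chain(facts, rules)
for premises, conclusion in trail:
    print(premises, "=>", conclusion)
```

The models stay black boxes here; the explainable part is the chain from their outputs to an action.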

Which forward-chaining engine do you use? Something based on prolog?

Currently https://github.com/nilp0inter/experta but https://github.com/noxdafox/clipspy seems nice, I just shied away from using it due to uneasiness about FFI and debugging, even though the original CLIPS is still awesome and has a very interesting manual.

There's also https://github.com/jruizgit/rules but I haven't tried it yet.

Interesting. I guess your driving factor was that you can use those directly from Python. How's the performance (with many many rules)?

Yes exactly, I want to be able to do rich auditing of the predictions. Not sure about performance yet, still prototyping!

I use HuggingFace for most of my NLP modeling work. It seems like Spacy 3.0 provides a framework for many different libraries, including HuggingFace, StreamLit, Ray and WnB. But if you know how to use these individually, what does Spacy add to this?

I gave some thoughts earlier in the thread about how we see spaCy fitting in to the NLP ecosystem. Here’s another answer that’s maybe more direct, through an example.

Let’s say you wanted to build a system where you used an efficient bag-of-words text classifier to select paragraphs that might have information of interest, and then you wanted to run an entity recognizer and recognise relation triples between predicates and pairs of entities. When extracting the triples, you want to use the lemma of the relation word, so you want to map “dove” to “dive” when it’s a verb etc. It’s possible to build this system directly using PyTorch modules for the various model parts, but you’ll need to write the various bits of logic to string together the model predictions yourself, and for tasks like lemmatization that are pretty easy, you’ll struggle to find existing systems that you’ll actually want to use. A lemmatization system that’s published for PyTorch will probably be designed for languages where lemmatization is really hard, but for English it’s really easy.

spaCy has a good architecture and API for this system level stuff, where you’re putting together models into practical solutions. It also has a Doc object that makes it really easy to actually work with the system outputs, especially to relate multiple levels of annotation to each other.

Partly because orchestrating a number of models is kind of a hassle in lower-level frameworks, a lot of guides will encourage you to take entirely joint approaches to this type of system. In theory you can bypass problems like the lemmatization and NER if you take a sequence-to-sequence approach, and generate the relation triples as just arbitrary data. But this has a lot of limitations. It’s difficult to express structural constraints that you know should hold about the triples, the system will be much much slower, and might require vastly more training data. It’s also difficult to divide the task up between different people, and it’ll be difficult to analyse the system errors, iterate on individual parts, or inject rule logic to ensure certain invariants about the output. All these facts about joint approaches increase the risk of the project failing; they add large uncertainties that keep projects from getting out of the prototype stage.

I am curious, what kind of project are you working on?

Man I wish they could be compensated remotely in proportion to that. Matthew Honnibal and team are wizards who have been working really hard for a really long time.

Maybe there's a comp strategy I don't know about, but they've created SO much value for the world.

Thanks for the love :). For the record yes we've been working hard, but also yes, we've been doing well from it.

I will say that people are using spaCy for free because that is what we asked them to do. I chose to make the library free and open-source when I first released it because I had the idea that I would be able to make that work out for me, if I could make this thing that would be useful to people and if they could be convinced to adopt it. And in order to convince people to adopt it, we've been telling people that spaCy will stay free and that we'll continue to work on it. So everything's going to plan here. Even if things weren't working out well for us (and they are), the fault would be entirely ours. I don't think we'd have any right to suddenly say, "Oh none of you jerks are paying, how unfair".

(For the record, we make money from sales of our annotation tool, Prodigy: https://prodi.gy . If you're reading this and you like spaCy, check it out ;)

I want to personally thank you for your work, and let you know I couldn’t have done an important project of mine if spaCy didn’t exist, and if it were not a free resource.

Your project was 1 of the 2 instrumental tools in my project to structure the transcripts of every word said on the floor of the New York State Senate over the past ~30 years in order to develop a topic-based “proximity” heuristic (based on CorEx, the 2nd instrumental tool) for which state senators were focused on which issues, based on the things they actually said on the record, not based on their press statements or their voting records (the latter of which doesn’t capture all the information you’d hope it would due to procedural nuances too obscure to detail here).

Thank you. Thank you, thank you, thank you.

This sounds super cool, is there a public link to your work? I'd love to check it out

I haven’t written about it or made results publicly available yet, but I do intend to. I can make a note of the email in your profile and ping you with a link when it is available if you’d like.

How about posting it to HN?

Oh wow, Prodigy looks amazing. I need a tool for audio annotation, and looks like you guys built just the right thing for me.

Please invest more in SEO; I didn't find you guys two weeks ago when I researched different options for audio annotation :D.

The only thing that Prodigy is missing is a team-based workflow. We've been on the beta list for it for a while and are excited for it to come out, but without a concept of users we've had to use other tools that aren't as polished on the annotation side but which meet our compliance needs.

This. Wholly agree. Currently running a large labelling task with 12 labelers.

Using Amazon Ground Truth, which works fine (although it seems quite MVP outside the core functionality, e.g. wrt reporting or user creation).

What tool have you had success with?

Does it have a question/answering component that's trainable/finetune-able on custom datasets?

Ok so I evaluated spacy a few years ago, but nowadays we’re using huggingface’s transformers / tokenizers / etc to train our own language models + fine tuned models. I see there’s now transformer based pipeline support, how do the two relate?

Phrased differently, how does spacy fit in with today’s world of transformers? Would it still be interesting for me?

I have lots of experience with both, and I use both together for different use cases. SpaCy fills the need of predictable/explainable pattern matching and NER - and is very fast and reasonably accurate on a CPU. Huggingface fills the need for task based prediction when you have a GPU.

> Huggingface fills the need for task based prediction when you have a GPU.

With model distillation, you can make models that annotate hundreds of sentences per second on a single CPU with a library like Huggingface Transformers.

For instance, one of my distilled Dutch multi-task syntax models (UD POS, language-specific POS, lemmatization, morphology, dependency parsing) annotates 316 sentences per second with 4 threads on a Ryzen 3700X. This distilled model has virtually no loss in accuracy compared to the finetuned XLM-RoBERTa base model.

I don't use Huggingface Transformers directly; I ported some of their implementations to Rust [1]. That should not make a big difference, since all the heavy lifting happens in C++ in libtorch anyway.

tl;dr: it is not true that transformers are only useful for GPU prediction. You can get high CPU prediction speeds with some tricks (distillation, length-based bucketing in batches, using MKL, etc.).
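
Of these tricks, length-based bucketing is the easiest to show in isolation. The sketch below is plain Python with made-up sentences and function names, not code from the Rust project: it sorts sentences by length before batching so that each batch needs little or no padding, and keeps the original indices so predictions can be put back in input order.

```python
def length_bucketed_batches(sentences, batch_size):
    """Group tokenized sentences into batches of similar length.

    Returns a list of batches; each batch is a list of (original_index,
    tokens) pairs so results can be restored to input order afterwards.
    """
    # sorted() is stable, so sentences of equal length keep their input order.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    return [
        [(i, sentences[i]) for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]


def padding_tokens(batches):
    # Count the pad positions needed to make every batch rectangular.
    total = 0
    for batch in batches:
        longest = max(len(toks) for _, toks in batch)
        total += sum(longest - len(toks) for _, toks in batch)
    return total


sentences = [s.split() for s in [
    "Yes .",
    "This is a considerably longer sentence with many more tokens in it .",
    "No .",
    "Another fairly long sentence that also needs quite a lot of padding .",
]]

indexed = list(enumerate(sentences))
naive = [indexed[i:i + 2] for i in range(0, len(indexed), 2)]
bucketed = length_bucketed_batches(sentences, batch_size=2)
print("padding without bucketing:", padding_tokens(naive))
print("padding with bucketing:", padding_tokens(bucketed))
```

Less padding means fewer wasted multiply-adds per batch, which matters most on a CPU.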

[1] https://github.com/tensordot/syntaxdot/tree/main/syntaxdot-t...

Is there a standard template for creating a distilled model? I didn’t see a public hugging face implementation, just the models

Interesting. Did you start from a Distilled base model (like DistilRoBerta), or did you distill your fine-tuned model?

Sorry for the late reply. I distilled from my own finetuned XLM-RoBERTa model.

The improved transformers support is definitely one of the main features of the release. I'm also really pleased with how the project system and config files work.

If you're always working with exactly one task model, I think working directly in transformers isn't that different from using spaCy. But if you're orchestrating multiple models, spaCy's pipeline components and Doc object will probably be helpful. A feature in v3 that I think will be particularly useful is the ability to share a transformer model between multiple components, for instance you can have an entity recogniser, text classifier and tagger all using the same transformer, and all backpropagating to it.

You also might find the projects system useful if you're training a lot of models. For instance, take a look at the project repo here: https://github.com/explosion/projects/tree/v3/benchmarks/ner.... Most of the readme there is actually generated from the project.yml file, which fully specifies the preprocessing steps you need to build the project from the source assets. The project system can also push and pull intermediate or final artifacts to a remote cache, such as an S3 bucket, with the addressing of the artifacts calculated based on hashes of the inputs and the file itself.
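
To illustrate the general idea of hash-based addressing (this is a sketch of the concept, not spaCy's actual scheme; the function name, the path layout and the 12-character truncation are assumptions), an artifact's address can be derived from how it was created and from its own contents:

```python
import hashlib


def short_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()[:12]


def artifact_address(output_name, output_bytes, command, input_blobs):
    """Build a cache path of the form name/creation_hash/content_hash.

    `input_blobs` maps an input file name to its bytes. The creation hash
    changes when the command or any input changes; the content hash changes
    when the produced file itself changes.
    """
    creation = hashlib.sha256(command.encode("utf8"))
    for name in sorted(input_blobs):
        creation.update(name.encode("utf8"))
        creation.update(hashlib.sha256(input_blobs[name]).digest())
    return f"{output_name}/{creation.hexdigest()[:12]}/{short_hash(output_bytes)}"


inputs = {"train.json": b"raw training data", "config.cfg": b"[training]"}
address = artifact_address("model.bin", b"weights-v1", "train --config config.cfg", inputs)
print(address)
```

With an address like this, a step whose inputs are unchanged can be skipped by pulling the cached output instead of recomputing it.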

The config file is comprehensive and extensible. The blocks refer to typed functions that you can specify yourself, so you can substitute any of your own layer (or other) functions in, to change some part of the system's behaviour. You don't _have_ to specify your models from the config files like this --- you can instead put it together in code. But the config system means there's a way of fully specifying a pipeline and all of the training settings, which means you can really standardise your training machinery.
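
The idea of config blocks that refer to registered functions can be mimicked in a few lines of plain Python. This is a toy stand-in, not spaCy's implementation: the `@function` key, the registry and the function names are all hypothetical, and unlike the real system it does no type validation.

```python
REGISTRY = {}


def register(name):
    """Decorator that makes a function addressable by name from a config."""
    def wrap(func):
        REGISTRY[name] = func
        return func
    return wrap


def resolve(block):
    """Recursively build objects from a config block.

    A dict containing the key "@function" is replaced by the result of
    calling the registered function with the block's other resolved values.
    """
    if not isinstance(block, dict):
        return block
    resolved = {k: resolve(v) for k, v in block.items() if k != "@function"}
    if "@function" in block:
        return REGISTRY[block["@function"]](**resolved)
    return resolved


@register("constant_rate.v1")
def constant_rate(rate: float):
    return lambda step: rate


@register("warmup_rate.v1")
def warmup_rate(rate: float, warmup_steps: int):
    # Linear warmup to `rate`, then constant.
    return lambda step: rate * min(1.0, (step + 1) / warmup_steps)


config = {
    "training": {
        "max_steps": 1000,
        "learn_rate": {"@function": "warmup_rate.v1", "rate": 0.001, "warmup_steps": 100},
    }
}

settings = resolve(config)
schedule = settings["training"]["learn_rate"]
print(schedule(0), schedule(99))
```

Swapping in a different schedule then only means changing the name in the config block, which is what makes the training setup fully specified by one file.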

Overall the theme of what we're doing is helping you to line up the workflows you use during development with something you can actually ship. We think one of the problems for ML engineers is that there's quite a gap between how people are iterating in their local dev environment (notebooks, scrappy directories etc) and getting the project into a state that you can get other people working on, try out in automation, and then pilot in some sort of soft production (e.g. directing a small amount of traffic to the model).

The problem with iterating in the local state is that you're running the model against benchmarks that are not real, and you hit diminishing returns quite quickly this way. It also introduces a lot of rework.

All that said, there will definitely be usage contexts where it's not worth introducing another technology. For instance, if your main goal is to develop a model, run an experiment and publish a paper, you might find spaCy doesn't do much that makes your life easier.

SpaCy and HuggingFace fulfill practically 99% of all our needs for NLP projects at work. Really incredible bodies of work.

Also, my team chat is currently filled with people being extremely stoked about the SpaCy + FastAPI support! Really hope FastAPI replaces Flask sooner rather than later.

As someone who never used NLP, what is it used for? Or better what is SpaCy used for - I know that it can generate Texts... but how would I use it in a business?

Obviously, it depends, but assuming you do have an NLP use case already, there are certain things that you will almost certainly have to do in your preprocessing regardless of your task. For example, sentence parsing. Writing your own basic sentence parser is fairly easy. Writing your own _good_ sentence parser is a nightmare akin to trying to parse HTML with regex. SpaCy provides a very good one for you. Down the line it will help you train a model on various tasks in a compute efficient manner as well. This is just a small example.
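
To see why the basic version falls short, here is a deliberately naive splitter in plain Python (the example sentence is made up). It splits after every period, question mark or exclamation mark that is followed by whitespace, so abbreviations break it immediately.

```python
import re


def naive_split(text):
    # Split wherever sentence-final punctuation is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "Dr. Smith arrived at 5 p.m. on Friday. He left early."
for piece in naive_split(text):
    print(repr(piece))
```

The text contains two sentences, but the splitter returns four pieces because "Dr." and "p.m." look like sentence ends. Patching each such case by hand is where the nightmare starts.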

Thanks for the explanation. Guess I never had an NLP use case. So SpaCy can take a text or sentences apart and knows what the parts "mean", but I can't think of a problem to solve with this. Maybe summarize a text or something, but I guess I am not imaginative enough.

There's some native SpaCy + FastAPI integration being created?!

This sounds mindblowing, off to Google I go...

With the new "spaCy project" it's easy to generate the boilerplate for many spaCy related projects, including a FastAPI app to serve models.

But apart from that, there's actually not really much to integrate, as it all just works.

Both libraries are actually (well) designed to be decoupled, so that you can mix them independently with anything else.

Note: I'm the creator of FastAPI, and also a developer at Explosion (spaCy's home).

The author of FastAPI https://twitter.com/tiangolo is a Spacy employee

Yep, that's me, I work at Explosion (spaCy's home)!

Big fan my dude! While FastAPI is amazing, the docs for it are a work of art. I know a few people that have just used the FastAPI docs to learn what APIs are and how they work, never mind how to use FastAPI itself.

Thanks for saying that! :)

Thank you Matthew, Ines, Sofie and Adriane for spaCy. It is a fundamental piece for me, both for work in Academia and in Industry.

I'm curious what sort of NLP use cases people are solving. How are people finding business value in these models and pipelines? We have looked at a number of uses and have found it hard to make a case for ROI. Wondering what's been working for folks.

We use it in healthcare to pluck out named entities, which kicks off an email to the person mentioned.

It helps to be able to deliver value at a low bar. This stuff takes a lot of time and energy to improve upon, and it won't "just work". Search is usually a good place to start. Just make one little process a tiny bit better. Make sure you're tracking data very well, because if you're not already at a place to be making data-driven decisions you can't possibly take advantage of machine learning. After all that, just iterate until you've found a better problem to tackle with your newfound capabilities :)

Agreed. The larger and deeper the model, the more the use cases seem to become negative ROI. I think for business use there will have to be an effort to compress the architecture considerably

I stumbled over SpaCy when looking for something to extract key words and numbers from sentences, however it looked a bit daunting and/or overkill. Think recipes or similar, turning "take three tablespoons of sugar" into [3, 'tablespoons', 'sugar'] or similar.

Should I give it another shot or are there libraries more suited for this than just plain regexp galore?

I did that years ago for some project. For recipes you can probably get away with regular expressions.

But with Spacy you could tokenize the sentence, then tag each word with its part of speech, and then find patterns, e.g. Verb Number Noun Preposition Noun.

Then match the first noun against a list of measurements (tablespoon, teaspoon, tbsp...) and extract the rest of the components.
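
A rough stdlib-only version of that pattern, with a regex tokenizer and small lookup tables standing in for spaCy's tokenizer and part-of-speech tags (the tables and function names are made up for illustration):

```python
import re

# Hypothetical, deliberately small lookup tables.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "half": 0.5}
UNITS = {"tablespoon", "tablespoons", "tbsp", "teaspoon", "teaspoons", "tsp", "cup", "cups"}


def parse_quantity(token):
    # Accept number words as well as digits such as "2" or "2.5".
    if token in NUMBER_WORDS:
        return NUMBER_WORDS[token]
    try:
        return float(token) if "." in token else int(token)
    except ValueError:
        return None


def extract_ingredient(sentence):
    """Return [quantity, unit, ingredient], or None if the pattern is absent."""
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", sentence.lower())
    for i in range(len(tokens) - 2):
        quantity = parse_quantity(tokens[i])
        if quantity is None or tokens[i + 1] not in UNITS:
            continue
        rest = tokens[i + 2:]
        if rest[0] == "of":  # skip the preposition
            rest = rest[1:]
        if rest:
            return [quantity, tokens[i + 1], " ".join(rest)]
    return None


print(extract_ingredient("take three tablespoons of sugar"))
```

This covers the simple cases; a tagger becomes worthwhile once the ingredient phrases get longer or the word order varies.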

This is exactly what you want: https://github.com/facebook/duckling

That looks great, time to dip my feet in the Haskell pond then. Thanks!

Thanks to the SpaCy team! I spent a lot of time over about 20 years working on my own NLP tools. I stopped doing that and mostly now just use SpaCy (and sometimes Huggingface and Apple’s NLP models).

Have you compared SpaCy with Apple's NLP on M1? I presume GPU is not being used by SpaCy on M1.

Given that SpaCy uses PyTorch, that is being worked on.

I have used both on M1, but haven't compared them.

I think I read somewhere that spaCy was going to have named entity disambiguation at some point, with named entities having links to knowledge bases like Wikidata or DBpedia. That’s something that paid NER services offer but that I haven’t found in open source libs, and it would be really interesting IMO.

There's a component for Entity Linking available in spaCy, but you have to train it yourself, as the use-cases (type of entities, type of knowledge base etc) can vary greatly. See more here: https://spacy.io/api/entitylinker

there is also BLINK by Facebook


I'll add my hat's off to @ines and the spaCy team. It's super impressive. There's also a (free) orientation course I'd recommend at https://course.spacy.io/

Please note that Explosion does not like redistribution of SpaCy, they expect everyone to only use the builds they produce, so it would not be a good idea to package it for your favourite distro.

I'm sorry that this conflicted with your plans, but I feel strongly that distributing Python libraries via system package managers such as apt is very bad for users. The pain is felt especially by users who are relatively new to Python, who will end up with their system Python in a confusing state that is difficult to correct.

We of course encourage anyone to clone the repo or install from an sdist if they want to compile from source. In fact you can do the following:

    git clone https://github.com/explosion/spaCy
    cd spaCy

This will build you a standalone executable file, in the pex format, that only depends on your system Python and does not install any files into your system. You can copy this artifact into your bin and use it as a command-line application.

That is sad, I would use SpaCy more if it had Debian packages (in particular in Debian). Python stuff packaged by Debian seems to work very well for me and has for years.

At $work we have internal packages of SpaCy and dependencies, not yet updated to 3.0 though. It was relatively easy to package using py2dsc from stdeb.

Pretty poor choice of license if they wanted me to care about their builds, tbh.

Their concern is poor user experience when the documentation on the web doesn't match what versions are being redistributed by other folks.

@hannibal Would love to discuss how we could extend spacy as a powerful engine to also support processing of documents with layout. Just imagine how powerful it would be if you could throw a PDF document into it and it would preprocess it to text + layout (e.g. paragraphs) and then perform the next steps, like extracting the right paragraph or a date, address, etc. I would love to provide/support that transformation. I have been doing similar things using rasa_nlu+spacy.

Excited for this release and I will start integrating this in my own information extraction pipelines immediately. Thanks, Explosion team, got your stickers on my notebook!

The new configuration approach looks similar to AllenNLP's, and that's great. Loose coupling of model submodules with flexible config should be standard in NLP. I am happy that more libraries are integrating these concepts.

I wonder if it is sheer coincidence that SpaCy is pronounced the way the russian word "спасай" is. It means "rescue" (v.)

This sort of thing could be used to create a "plain language" CLI, right? So you could have the usual flags etc. and then a separate version for less tech-literate people that allows something like "list hidden files" (I know `ls -a` isn't particularly hard to remember, I just need a contrived example).

That's really cool to see how accuracy of pre-trained models is improving by simply switching to v3.0

I've been using the v3 nightly version for 2 months and it works like a charm. I'm now training models with v3 and using them in production without any issue.

Great job!

I am not sure if SpaCy does it but is there some free and open-source framework capable of speech synthesis comparable to the level of AWS Polly or alike?

So with SpaCy 3.0 and HuggingFace, do we still have a reason to use NLTK? Or do they complement each other? Right now, I have lost track of the progress in NLP.

NLTK is showing its age. In my information extraction pipelines, the heavy lifting for modelling is done by SpaCy, AllenNLP, and Huggingface (and Pytorch or TF ofc).

I only use NLTK since it has some base tools for low-resource languages for which no one has pretrained a transformer model, or for specific NLP-related tasks. I still use their agreement metrics module, for instance. But that's about it. Dep parsing, NER, lemmatising and stemming are all better with the above-mentioned packages.

Is there any framework similar to SpaCy or Hugging Face but for images?

I haven't had a situation to use it, but I think Kornia looks cool: https://github.com/kornia/kornia

Super excited to see improvement in NER accuracy in SpaCy 3.0.

Has anyone tried it on Raspberry Pi, will it work well?

> spaCy is a library for advanced Natural Language Processing in Python and Cython.

I actually submitted this with (Python Natural Language Processing) after it, but it got edited away. I always find it hard to predict the preferred title style here...

I see that by default the trf model is roberta_base https://spacy.io/models/en#en_core_web_trf

Is there an easy way to use xlnet (from transformers) for pos tagging, dep parsing, etc? Btw it would have been a smarter default as it scores more sota results on paperswithcode.com

Yes you can easily train with xlnet instead of roberta-base --- just write the different model name in the config file (or pass a different string value, if doing it from Python). You can find an example config file here: https://github.com/explosion/projects/blob/v3/benchmarks/ner...

I actually didn't see a performance improvement when using XLNet over roberta-base. I always wondered about this; ages ago I looked into it and I wasn't sure that the preprocessing details in the transformers version were entirely correct.

Given very similar accuracies from XLNet and RoBERTa, I preferred RoBERTa for the following reasons:

* I've never been able to understand the XLNet paper :(. I spent some time trying when it was released, but I just didn't really get it, not anything close to the level where I'd be able to implement it, anyway.

* Standardising on BERT architecture has some advantages. If we mostly use BERT, we have a better chance of using faster implementations. Mostly nobody is training new XLNet models, whereas many new BERT models are being trained.
