spaCy 3 practically covers 90% of NLP use cases with near-SOTA performance. The only reason not to use it would be if you are literally pushing the boundaries of NLP or building something super specialized.
Hugging Face and spaCy (also PyTorch, but duh) are saving companies around the world millions of dollars in man-hours. They've been a revelation.
As someone who's worked on some rather intensive NLP implementations, I'd say spaCy 3.0 and Hugging Face both represent the culmination of a technological leap in NLP that started a few years ago with the advent of transfer learning. The level of accessibility these libraries offer the masses is game-changing and democratizing.
I want to use AI to translate (localize) messages for free software, in my case into Ukrainian. My plan for improving the quality of automated translation is to translate from similar languages in parallel, i.e. give the same message in English, Russian, and Polish, and expect the message in Ukrainian as output.
Where should I start? Which libraries should I use? How do I connect them? How do I train them?
There are pretrained Marian models for translating from Ukrainian to English, and English to Ukrainian.
You can test the models with the form input on the right-hand side
And here's a site with docs and code examples for basic usage in your app
You can also use Marian's training code with PyTorch on custom source and target texts.
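For example, loading one of those pretrained checkpoints through Hugging Face Transformers takes a few lines. A minimal sketch, assuming the English-to-Ukrainian OPUS-MT checkpoint is published on the Hub under the name used below (check the model hub for the exact name):

    from transformers import MarianMTModel, MarianTokenizer

    # Assumed checkpoint name for the English->Ukrainian OPUS-MT model
    model_name = "Helsinki-NLP/opus-mt-en-uk"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    batch = tokenizer(["Cannot open file:"], return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))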
Currently, volunteers manually translate from English into all the other languages. It's possible to use AI to help, but the quality of the translation is too low; a dumb Translation Memory database is much more helpful than an AI translator.
In the new system, one volunteer will translate messages from English into their native language, while everyone else will be able to use that translation as an additional constraint to improve the quality of translation into their own language. More input languages -> more constraints -> better translation quality.
My best idea so far is to use a multilingual tokenizer (spaCy looks good) and a linear transformer, because linear transformers can accept large inputs with thousands of tokens. IMHO, I can feed in multiple translations of a message (they are short) and define the loss against the expected Ukrainian translation.
However, I'm completely new to this field. I've completed just one AI project so far: recognizing animals in video. I have no idea how to start.
For example, I have no idea how to hint to the transformer that the input messages are the same, just in 4 different languages. Should I interleave the messages word by word, like "Cannot Не можу open відкрити file файл : :", put them side by side with separators, like "\0Cannot open file: \0Не можу відкрити файл: \0", or create a few independent inputs? Or use a memory and improve the quality of the output message incrementally?
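For what it's worth, a common convention in multilingual NMT is the separator approach with an explicit language tag per segment, rather than interleaving. A minimal sketch (the tag format and separator token here are illustrative assumptions, not a fixed standard):

    def build_multisource_input(translations):
        # translations: {language code: the same message in that language}
        # Prefix each segment with a language tag so the model can tell
        # which language each span is in; join with a separator token.
        parts = [f"<{lang}> {text}" for lang, text in sorted(translations.items())]
        return " </s> ".join(parts)

    print(build_multisource_input({
        "en": "Cannot open file:",
        "pl": "Nie można otworzyć pliku:",
        "ru": "Не удаётся открыть файл:",
    }))
    # <en> Cannot open file: </s> <pl> Nie można otworzyć pliku: </s> <ru> Не удаётся открыть файл: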
Any hints where to start?
My approach to AI is somewhat conservative because I work in a law-adjacent field where explainability is paramount. When it comes to getting "smart", I prefer forward-chaining logic over facts, and facts include predictions from models too. But at least there is a "judge"/engine to coordinate how the predictions from the ensemble of models map to actions. I love me some pretrained neural nets, but I use them more as black-box appliances.
There's also https://github.com/jruizgit/rules, but I haven't tried it yet.
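To make the forward-chaining idea concrete, here's a rough sketch in the style of that library's (durable_rules) documented Python API; I haven't verified it against the current release, and the rule content is made up:

    from durable.lang import *

    with ruleset('review'):
        # A model's prediction is just another fact; a rule promotes it...
        @when_all((m.kind == 'prediction') & (m.score > 0.9))
        def flag(c):
            c.assert_fact({'kind': 'flagged', 'doc': c.m.doc})

        # ...and a downstream rule decides the action, keeping the
        # model itself a black-box appliance.
        @when_all(m.kind == 'flagged')
        def act(c):
            print('route {0} to human review'.format(c.m.doc))

    assert_fact('review', {'kind': 'prediction', 'doc': 'doc-17', 'score': 0.95})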
Let’s say you wanted to build a system where you used an efficient bag-of-words text classifier to select paragraphs that might have information of interest, and then you wanted to run an entity recognizer and recognise relation triples between predicates and pairs of entities. When extracting the triples, you want to use the lemma of the relation word, so you want to map “dove” to “dive” when it’s a verb etc.
It’s possible to build this system directly using PyTorch modules for the various model parts, but you’ll need to write the various bits of logic to string together the model predictions yourself, and for tasks like lemmatization that are pretty easy, you’ll struggle to find existing systems that you’ll actually want to use. A lemmatization system that’s published for PyTorch will probably be designed for languages where lemmatization is really hard, but for English it’s really easy.
spaCy has a good architecture and API for this system level stuff, where you’re putting together models into practical solutions. It also has a Doc object that makes it really easy to actually work with the system outputs, especially to relate multiple levels of annotation to each other.
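As a concrete sketch of those pieces, here's roughly what the lemma and entity levels look like through the Doc object (relation extraction itself would be a custom component, so it's omitted; the small English pipeline needs to be downloaded first):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
    doc = nlp("The senator dove into negotiations with Acme Corp.")

    # Lemmatization: "dove" maps to "dive" when it's tagged as a verb
    for tok in doc:
        if tok.pos_ == "VERB":
            print(tok.text, "->", tok.lemma_)

    # Entities from the same Doc, aligned with the tokens above
    for ent in doc.ents:
        print(ent.text, ent.label_)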
Partly because orchestrating a number of models is kind of a hassle in lower-level frameworks, a lot of guides will encourage you to take entirely joint approaches to this type of system. In theory you can bypass problems like the lemmatization and NER if you take a sequence-to-sequence approach, and generate the relation triples as just arbitrary data. But this has a lot of limitations. It’s difficult to express structural constraints that you know should hold about the triples, the system will be much much slower, and might require vastly more training data. It’s also difficult to divide the task up between different people, and it’ll be difficult to analyse the system errors, iterate on individual parts, or inject rule logic to ensure certain invariants about the output. All these facts about joint approaches increase the risk of the project failing; they add large uncertainties that keep projects from getting out of the prototype stage.
Maybe there's a comp strategy i don't know about - but they've created SO much value for the world.
I will say that people are using spaCy for free because that is what we asked them to do. I chose to make the library free and open-source when I first released it because I had the idea that I would be able to make that work out for me, if I could make this thing that would be useful to people and if they could be convinced to adopt it. And in order to convince people to adopt it, we've been telling people that spaCy will stay free and that we'll continue to work on it. So everything's going to plan here. Even if things weren't working out well for us (and they are), the fault would be entirely ours. I don't think we'd have any right to suddenly say, "Oh none of you jerks are paying, how unfair".
(For the record, we make money from sales of our annotation tool, Prodigy: https://prodi.gy . If you're reading this and you like spaCy, check it out ;)
Your project was one of the two instrumental tools I used to structure the transcripts of every word said on the floor of the New York State Senate over the past ~30 years. The goal was a topic-based "proximity" heuristic (based on CorEx, the second instrumental tool) for which state senators were focused on which issues, based on the things they actually said on the record, not on their press statements or their voting records (the latter of which doesn't capture all the information you'd hope it would, due to procedural nuances too obscure to detail here).
Thank you. Thank you, thank you, thank you.
Please invest more in SEO; I didn't find you guys two weeks ago when I researched different options for audio annotation :D.
We're using Amazon Ground Truth, which works fine (although it seems quite MVP outside the core functionality, e.g. with respect to reporting or user creation).
What tool have you had success with?
Phrased differently, how does spaCy fit into today's world of transformers? Would it still be interesting for me?
With model distillation, you can make models that annotate hundreds of sentences per second on a single CPU with a library like Huggingface Transformers.
For instance, one of my distilled Dutch multi-task syntax models (UD POS, language-specific POS, lemmatization, morphology, dependency parsing) annotates 316 sentences per second with 4 threads on a Ryzen 3700X. This distilled model has virtually no loss in accuracy compared to the finetuned XLM-RoBERTa base model.
I don't use Hugging Face Transformers myself, but I ported some of their implementations to Rust; that shouldn't make a big difference, since all the heavy lifting happens in C++ in libtorch anyway.
tl;dr: it is not true that transformers are only useful for GPU prediction. You can get high CPU prediction speeds with some tricks (distillation, length-based bucketing in batches, using MKL, etc.).
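A hedged sketch of what the bucketing/threading tricks look like with Hugging Face Transformers (the checkpoint name is a placeholder for whatever distilled model you've trained):

    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    torch.set_num_threads(4)  # match the CPU cores you want to use

    name = "my-org/distilled-tagger"  # placeholder, not a real checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForTokenClassification.from_pretrained(name).eval()

    sentences = ["Dit is een zin .", "Nog een iets langere voorbeeldzin ."]
    sentences.sort(key=len)  # length-based bucketing: batches pad less
    with torch.no_grad():
        for i in range(0, len(sentences), 32):
            batch = tokenizer(sentences[i:i + 32], return_tensors="pt",
                              padding=True, truncation=True)
            logits = model(**batch).logits  # per-token label scores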
If you're always working with exactly one task model, I think working directly in transformers isn't that different from using spaCy. But if you're orchestrating multiple models, spaCy's pipeline components and Doc object will probably be helpful. A feature in v3 that I think will be particularly useful is the ability to share a transformer model between multiple components, for instance you can have an entity recogniser, text classifier and tagger all using the same transformer, and all backpropagating to it.
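In config terms, the sharing looks roughly like this (an abbreviated excerpt in the style of the spacy-transformers docs; the surrounding required blocks are omitted):

    [components.transformer]
    factory = "transformer"

    [components.ner]
    factory = "ner"

    [components.ner.model.tok2vec]
    @architectures = "spacy-transformers.TransformerListener.v1"
    grad_factor = 1.0

    [components.tagger]
    factory = "tagger"

    [components.tagger.model.tok2vec]
    @architectures = "spacy-transformers.TransformerListener.v1"
    grad_factor = 1.0

Both listeners read from, and backpropagate to, the single shared transformer component.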
You also might find the projects system useful if you're training a lot of models. For instance, take a look at the project repo here: https://github.com/explosion/projects/tree/v3/benchmarks/ner. Most of the readme there is actually generated from the project.yml file, which fully specifies the preprocessing steps you need to build the project from the source assets. The project system can also push and pull intermediate or final artifacts to a remote cache, such as an S3 bucket, with the addressing of the artifacts calculated based on hashes of the inputs and the file itself.
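The day-to-day workflow with that repo looks something like this (assuming the default template repository; the remote for push has to be configured in project.yml):

    python -m spacy project clone benchmarks/ner
    cd ner
    python -m spacy project assets   # fetch the source data listed in project.yml
    python -m spacy project run all  # run the workflow steps in order
    python -m spacy project push     # cache artifacts to the configured remote (e.g. S3)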
The config file is comprehensive and extensible. The blocks refer to typed functions that you can specify yourself, so you can substitute any of your own layer (or other) functions in, to change some part of the system's behaviour. You don't _have_ to specify your models from the config files like this; you can instead put it together in code. But the config system means there's a way of fully specifying a pipeline and all of the training settings, which means you can really standardise your training machinery.
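For instance, a custom layer just needs to be registered under a name the config can refer to; a minimal sketch ("my_org.TinyClassifier.v1" and its argument are made up for illustration):

    import spacy
    from thinc.api import Linear, Model

    @spacy.registry.architectures("my_org.TinyClassifier.v1")
    def tiny_classifier(nO: int) -> Model:
        # Any Thinc model can be returned here; the config supplies nO
        return Linear(nO)

In the config you'd then point at it with @architectures = "my_org.TinyClassifier.v1" and set nO alongside it.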
Overall the theme of what we're doing is helping you to line up the workflows you use during development with something you can actually ship. We think one of the problems for ML engineers is that there's quite a gap between how people are iterating in their local dev environment (notebooks, scrappy directories etc) and getting the project into a state that you can get other people working on, try out in automation, and then pilot in some sort of soft production (e.g. directing a small amount of traffic to the model).
The problem with iterating only in that local state is that you're running the model against benchmarks that aren't real, and you hit diminishing returns quite quickly this way. It also introduces a lot of rework.
All that said, there will definitely be usage contexts where it's not worth introducing another technology. For instance, if your main goal is to develop a model, run an experiment and publish a paper, you might find spaCy doesn't do much that makes your life easier.
Also, my team chat is currently filled with people being extremely stoked about the spaCy + FastAPI support! Really hope FastAPI replaces Flask sooner rather than later.
This sounds mindblowing, off to Google I go...
But apart from that, there's actually not really much to integrate, as it all just works.
Both libraries are actually (well) designed to be decoupled, so you can mix them independently with anything else.
Note: I'm the creator of FastAPI, and also a developer at Explosion (spaCy's home).
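To illustrate the "it all just works" point, here's a minimal sketch of serving a spaCy pipeline behind a FastAPI endpoint (the model and route names are arbitrary):

    import spacy
    from fastapi import FastAPI

    app = FastAPI()
    nlp = spacy.load("en_core_web_sm")

    @app.get("/ents")
    def ents(text: str):
        doc = nlp(text)
        return [{"text": e.text, "label": e.label_} for e in doc.ents]

Run it with uvicorn and you get typed, documented JSON endpoints around the pipeline for free.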
Should I give it another shot or are there libraries more suited for this than just plain regexp galore?
But with spaCy you could tokenize the sentence, then tag each word with its part of speech, and then find patterns, e.g. Verb Number Noun Preposition Noun.
Then match the first noun against a list of measurements (tablespoon, teaspoon, tbsp...) and extract the rest of the components.
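Hedged sketch of that idea with spaCy's rule-based Matcher (the measurement list and sentence are just examples):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    # Verb Number Noun Preposition Noun, e.g. "Add 2 tablespoons of sugar"
    matcher.add("INGREDIENT", [[
        {"POS": "VERB"}, {"POS": "NUM"}, {"POS": "NOUN"},
        {"POS": "ADP"}, {"POS": "NOUN"},
    ]])

    MEASURES = {"tablespoon", "teaspoon", "tbsp", "tsp", "cup"}
    doc = nlp("Add 2 tablespoons of sugar")
    for _, start, end in matcher(doc):
        span = doc[start:end]
        unit = span[2]
        if unit.lemma_.lower() in MEASURES:
            print("quantity:", span[1].text, "unit:", unit.text,
                  "ingredient:", span[-1].text)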
We of course encourage anyone to clone the repo or install from an sdist if they want to compile from source. In fact you can do the following:
git clone https://github.com/explosion/spaCy
cd spaCy && pip install .  # standard build-and-install from source
The new configuration approach looks similar to AllenNLP's, and that's great. Loose coupling of model submodules with flexible config should be standard in NLP. I'm happy that more libraries are integrating these concepts.
Is there an easy way to use XLNet (from transformers) for POS tagging, dep parsing, etc.?
Btw, it would have been a smarter default, as it scores more SOTA results on paperswithcode.com.
I actually didn't see a performance improvement when using XLNet over roberta-base. I always wondered about this; ages ago I looked into it and I wasn't sure that the preprocessing details in the transformers version were entirely correct.
Given very similar accuracies from XLNet and RoBERTa, I preferred RoBERTa for the following reasons:
* I've never been able to understand the XLNet paper :(. I spent some time trying when it was released, but I just didn't really get it, not anything close to the level where I'd be able to implement it, anyway.
* Standardising on the BERT architecture has some advantages. If we mostly use BERT, we have a better chance of using faster implementations. Almost nobody is training new XLNet models, whereas many new BERT models are being trained.
I've been using the v3 nightly version for 2 months and it works like a charm. I'm now training models with v3 and using them in production without any issue.
I only use NLTK because it has some base tools for low-resource languages for which no one has pretrained a transformer model, or for specific NLP-related tasks; I still use their agreement-metrics module, for instance. But that's about it. Dep parsing, NER, lemmatising, and stemming are all better with the above-mentioned packages.