EDIT: On a related note, when I was an undergrad there was a group on campus that was doing research on how humans repair garden-path sentences when their first reading is incorrect. They were measuring ERPs to see if something akin to a backtracking algorithm was used + eye-tracking to see which word/words triggered the repair. I graduated before the work was complete, but I might go digging for it to see if it was ever published.
EDIT 2: Apparently this is better researched than I realized: https://en.wikipedia.org/wiki/P600_(neuroscience)
The solutions to these low-level problems may seem unimportant for high-level tasks, but errors propagate: an error in part-of-speech tagging will carry over into dependency parsing that uses that information, and eventually affect NER, entity/relationship extraction, and similar downstream tasks.
e.g. spaCy has just finished their Chinese models, currently in beta.
Hopefully Stanza supports supplemental training of their models.
I looked a bit for a workaround but finally decided to just do the work in NLTK.
That's one disadvantage of spaCy.
# download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm
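Once the model is downloaded, loading and running it is only a couple of lines; a minimal sketch (the example sentence is just an illustration):

# load the downloaded model and run the full pipeline on a sentence
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Stanza and spaCy are both NLP toolkits.")
for token in doc:
    print(token.text, token.pos_, token.dep_)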
Edit: And they also have a pipeline abstraction, so you could say that HuggingFace is also a full toolkit.
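For reference, a minimal sketch of that pipeline abstraction, assuming the transformers package (the task name and the default model it downloads are the only moving parts here):

# HuggingFace pipeline: one call wires up tokenizer, model, and post-processing
from transformers import pipeline

ner = pipeline("ner")  # downloads a default NER model on first use
print(ner("Hugging Face is based in New York City."))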
I was asked by a friend about Stanza in a private DM; I'll paste the answer here as I think others might find it helpful:
Q: Are Stanza models more accurate and consistent than spaCy, as this tweet claims?

A: Yeah definitely, our models are quite a bit behind state-of-the-art atm because we're still optimized for CPU. We're hoping to have a spacy-nightly up soon that builds on the new version of Thinc.

The main thing we want to do differently is having shared encoding layers across the pipeline, with several components backpropping into at least some of those shared layers. That took a fair bit of redesign, especially to make sure that people could customize it well.

We never released models built on wide and deep BiLSTM architectures because we see that as an unappealing speed/accuracy trade-off. It also makes the architecture hard to train on few examples, and it's very hyper-parameter intensive, which is bad for Prodigy.

Their experiments do undercount us a bit, especially since they didn't use pretrained vectors for our models, while they did use pretrained vectors for their own and Flair's models. We also perform really poorly on the CoNLL-03 task. I've never understood why; I hate that dataset. I looked at it and it's like, these soccer match reports, and the dev and test sets don't correlate well. So I've never wanted to figure out why we do poorly on that data specifically.
I hope we can get back to them with some updates for specific figures, or perhaps some datasets can be shown as missing values for spaCy. Running experiments with a bunch of different software and making sure it's all 100% compatible is pretty tedious, and it won't add much information. The bottom line anyone should care about is, "Am I likely to see a difference in accuracy between Stanza and spaCy on my problem?" At the moment I think the answer is "yes" (although spaCy's default models are still cheaper to run on large datasets).
We're a bit behind the current research atm, and the improvements from that research are definitely real. We're looking forward to releasing new models, but in the meantime you can also use the Stanza models with very little change to your spaCy code, to see if they help on your problem.
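Presumably this refers to the spacy-stanza wrapper; a rough sketch, assuming spacy-stanza is installed and exposes load_pipeline (as recent versions do):

# wrap a Stanza pipeline so it can be driven through the familiar spaCy Doc API
import stanza
import spacy_stanza

stanza.download("en")                    # one-time download of the Stanza English models
nlp = spacy_stanza.load_pipeline("en")   # spaCy-compatible pipeline backed by Stanza

doc = nlp("Barack Obama was born in Hawaii.")
for ent in doc.ents:
    print(ent.text, ent.label_)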
I'd be curious how language model-based tokenizers fare on domain-specific documents, since language models don't have context for unknown tokens.
Are language model-based tokenizers any better at identifying abbreviations than rule-based ones?
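One way to get a quick empirical answer on your own documents is to compare the sentence splits directly; a rough sketch, assuming both the spaCy and Stanza English models are available (the test sentence is made up):

# compare sentence segmentation around abbreviations in spaCy and Stanza
import spacy
import stanza

text = "Dr. Smith visited Washington D.C. on Jan. 5. He met Prof. Jones there."

nlp_spacy = spacy.load("en_core_web_sm")
print("spaCy :", [sent.text for sent in nlp_spacy(text).sents])

nlp_stanza = stanza.Pipeline(lang="en", processors="tokenize")
print("Stanza:", [" ".join(tok.text for tok in sent.tokens) for sent in nlp_stanza(text).sentences])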