
Stanza: A Python natural language processing toolkit for many human languages - BerislavLopac
https://arxiv.org/abs/2003.07082
======
mylampisawesome
I was rather curious to see how this handled Garden Path Sentences[0]. For
"The old man the boat.", Stanza interprets "man" as a noun rather than a verb.
Similarly, for "The complex houses married and single soldiers and their
families." "houses" is also interpreted as a noun rather than a verb. These
sentences are mostly corner-cases, but was an interesting little experiment
nonetheless.

[0] [https://en.wikipedia.org/wiki/Garden-path_sentence](https://en.wikipedia.org/wiki/Garden-path_sentence)

~~~
eindiran
Most humans struggle when reading garden-path sentences, so I would be quite
impressed if an NLP toolkit handled them easily out-of-the-box.

EDIT: On a related note, when I was an undergrad there was a group on campus
that was doing research on how humans repair garden-path sentences when their
first reading is incorrect. They were measuring ERPs to see if something akin
to a backtracking algorithm was used + eye-tracking to see which word/words
triggered the repair. I graduated before the work was complete, but I might go
digging for it to see if it was ever published.

EDIT 2: Apparently this is better researched than I realized:
[https://en.wikipedia.org/wiki/P600_(neuroscience)](https://en.wikipedia.org/wiki/P600_\(neuroscience\))

~~~
felixyz
Very interesting! Thanks for the link, and if you find out more about the
research the group at your uni was doing, please post!

------
dcsan
This does use spaCy for English tokenization, but has somewhat wider language
support in general.

E.g. spaCy has just finished their Chinese models in beta. Discussion:
[https://github.com/howl-anderson/Chinese_models_for_SpaCy/is...](https://github.com/howl-anderson/Chinese_models_for_SpaCy/issues/26)

~~~
rpedela
You can use spaCy for English tokenization, or you can use their neural
model. The neural model will generally do better, especially at sentence
segmentation, but will be slower.
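The rule-based option needs no statistical model at all. A small sketch, assuming only that spaCy itself is installed: a blank English pipeline tokenizes using the language's rules and exceptions, with no model download (though, as noted, it won't do sentence segmentation the way the neural model does):

```python
import spacy

# A blank pipeline: tokenizer only, no statistical model to load
nlp = spacy.blank("en")

doc = nlp("Dr. Smith went to Washington. He arrived at 5 p.m.")
print([token.text for token in doc])
```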

------
sireat
Nice to see an alternative, or rather a complementary library, to the
well-worn NLTK.

Hopefully Stanza provides for supplemental training for their models.

~~~
JPKab
NLTK, for my uses, has been dethroned by spaCy for years now. I'm very
curious to see how Stanza compares. It looks like it's built on PyTorch, so
I'm very interested to check it out.

~~~
zwaps
Interestingly, I had trouble using spaCy, as it requires an internet
connection to AWS to load its models before it can work. This was blocked by
deep packet inspection in my use case.

I looked a bit for a workaround but finally decided to just do the work in
NLTK.

That's one disadvantage of Spacy.

~~~
dcsan
you can absolutely download the models locally:

    # download best-matching version of a specific model for your spaCy installation
    python -m spacy download en_core_web_sm
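If even that download is blocked, there's a fully offline route: fetch the model archive on a machine with internet access, copy it over, and install it with pip. A sketch (the filename and version here are illustrative, not a specific release):

```shell
# install a model archive that was copied over from another machine;
# pip treats it as a regular Python package
pip install ./en_core_web_sm-2.2.5.tar.gz
```

After that, `spacy.load("en_core_web_sm")` works without any network access.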

------
sdan
This is basically hugging face with more languages?

~~~
scribu
The BERT multilingual models from HuggingFace [1] have even more languages
than Stanza (100+ vs. 66).

Edit: And they also have a pipeline abstraction [2], so you could say that
HuggingFace is also a full toolkit.

[1]:
[https://huggingface.co/transformers/multilingual.html](https://huggingface.co/transformers/multilingual.html)

[2]:
[https://huggingface.co/transformers/main_classes/pipelines.h...](https://huggingface.co/transformers/main_classes/pipelines.html)
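For context, the pipeline abstraction in [2] reduces a full task to a couple of lines. A sketch, assuming `transformers` is installed (the first call downloads a default model for the task, so it needs network access once):

```python
from transformers import pipeline

# "ner" builds a named-entity recognition pipeline with a default model
ner = pipeline("ner")
print(ner("Stanford University is in California."))
```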

------
techwizrd
How does this differ from stanfordnlp (the previous version)? Is this simply a
rebranding of stanfordnlp?

------
axegon_
From a first glance at the examples in the repo (not in depth by any means),
I'm getting a lot of spaCy vibes.

------
kernelsanderz
Do these libraries use deep learning models? It's not entirely clear from the
docs.

------
syllogism
You can also try out Stanza in spaCy --- Ines updated the spacy-stanfordnlp
wrapper to use the new version pretty much immediately:
[https://github.com/explosion/spacy-stanza](https://github.com/explosion/spacy-stanza)

I was asked by a friend about Stanza in a private DM, I'll paste the answer
here as I think others might find it helpful:

    
    
        Q: are stanza models more accurate and consistent than
        spacy as this tweet claims?
    
        A: Yeah definitely, our models are quite a bit behind 
        state-of-the-art atm because we're still optimized for 
        CPU. We're hoping to have a spacy-nightly up soon that
        builds on the new version of Thinc.
    
        The main thing we want to do differently is having 
        shared encoding layers across the pipeline, with several
        components backproping to at least some shared layers of
        that. So that took a fair bit of redesign, especially to
        make sure that people could customize it well.
    
        We never released models that were built on wide and 
        deep BiLSTM architectures because we see that as an 
        unappealing speed/accuracy trade-off. It also makes the
        architecture hard to train on few examples, it's very 
        hyper-parameter intensive which is bad for Prodigy.
    
        Their experiments do undercount us a bit, especially 
        since they didn't use pretrained vectors, while they 
        did use pretrained vectors for their own and Flair's 
        models. We also perform really poorly on the CoNLL-03 
        task. I've never understood why --- I hate that dataset.
        I looked at it and it's like, these soccer match 
        reports, and the dev and test sets don't correlate well.
        So I've never wanted to figure out why we do poorly on 
        that data specifically.
    

As an example of what I mean by "undercounting", we can get to 78% on the
GermEval data, while their table has us on 68%; FLAIR and Stanza are on
85%. So we're still behind, but by less. The thing is, the difference between
85 and 78 is actually quite a lot -- probably more than most people would
intuit.

I hope we can get back to them with some updates for specific figures, or
perhaps some datasets can be shown as missing values for spaCy. Running
experiments with a bunch of different software and making sure it's all 100%
compatible is pretty tedious, and it won't add much information. The bottom-
line anyone should care about is, "Am I likely to see a difference in accuracy
between Stanza and spaCy on my problem". At the moment I think the answer is
"yes". (Although spaCy's default models are still cheaper to run on large
datasets).

We're a bit behind the current research atm, and the improvements from that
research are definitely real. We're looking forward to releasing new models,
but in the meantime you can also use the Stanza models with very little change
to your spaCy code, to see if they help on your problem.

------
justlexi93
I'd like to know:

How language-model-based tokenizers fare on domain-specific documents,
since language models don't have context for unknown tokens.

And whether language-model-based tokenizers are any better at identifying
abbreviations than rule-based ones.

