Stanza: A Python natural language processing toolkit for many human languages (arxiv.org)
186 points by BerislavLopac 11 months ago | 22 comments

I was rather curious to see how this handled garden-path sentences[0]. For "The old man the boat.", Stanza interprets "man" as a noun rather than a verb. Similarly, for "The complex houses married and single soldiers and their families.", "houses" is also interpreted as a noun rather than a verb. These sentences are mostly corner cases, but it was an interesting little experiment nonetheless.

[0] https://en.wikipedia.org/wiki/Garden-path_sentence
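
For anyone who wants to repeat the experiment, here is a minimal sketch of how the tags could be inspected. It assumes `pip install stanza` and uses the default English pipeline; the helper names `build_pipeline` and `upos_tags` are made up for illustration, and the tags you get may differ from the ones reported above depending on the model version.

```python
def build_pipeline(lang="en"):
    """Download (once) and load Stanza's default pipeline for `lang`."""
    import stanza  # imported lazily so the helper below has no hard dependency

    stanza.download(lang)        # needs network access the first time
    return stanza.Pipeline(lang)


def upos_tags(nlp, text):
    """Return (word, universal POS tag) pairs for every word in `text`."""
    doc = nlp(text)
    return [(word.text, word.upos)
            for sentence in doc.sentences
            for word in sentence.words]


# Usage (not executed here because it downloads a model):
#   nlp = build_pipeline("en")
#   print(upos_tags(nlp, "The old man the boat."))
#   print(upos_tags(nlp, "The complex houses married and single soldiers and their families."))
```

A correct reading would tag "man" and "houses" as VERB in these two sentences.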

Most humans struggle when reading garden-path sentences, so I would be quite impressed if an NLP toolkit handled them easily out-of-the-box.

EDIT: On a related note, when I was an undergrad there was a group on campus that was doing research on how humans repair garden-path sentences when their first reading is incorrect. They were measuring ERPs to see if something akin to a backtracking algorithm was used + eye-tracking to see which word/words triggered the repair. I graduated before the work was complete, but I might go digging for it to see if it was ever published.

EDIT 2: Apparently this is better researched than I realized: https://en.wikipedia.org/wiki/P600_(neuroscience)

Very interesting! Thanks for the link, and if you find out more about the research the group at your uni was doing, please post!

I think my eyes always track back when I realize something is wrong in how I have interpreted the sentence.

I think that's too high a bar. I didn't parse either one of those sentences correctly the first time I read it, either. It would be obtuse to expect even a "human-level" AI to get these right. Though you could get it right by backtracking to see whether alternate tag choices lead to a complete parse.
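
To make the backtracking idea concrete, here is a toy sketch (not how Stanza works). It assumes a tiny hand-written grammar and lexicon, both invented for this example, and a recursive-descent parser that tries the preferred reading first and backtracks when it cannot finish the sentence.

```python
# Toy grammar and lexicon, invented for illustration only.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Adj", "N"], ["Det", "N"]],  # preferred reading listed first
    "VP": [["V", "NP"]],
}
LEXICON = {
    "the": ["Det"],
    "old": ["Adj", "N"],   # "the old" can be a noun phrase
    "man": ["N", "V"],     # "to man a boat"
    "boat": ["N"],
}


def parse(symbol, words, i):
    """Yield (tree, next_index) for every way `symbol` can start at words[i]."""
    if symbol not in GRAMMAR:  # part-of-speech tag: must match the next word
        if i < len(words) and symbol in LEXICON.get(words[i], []):
            yield (symbol, words[i]), i + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each rule; later rules are the backtrack options
        yield from expand(symbol, rhs, words, i, [])


def expand(symbol, rhs, words, i, children):
    """Match the remaining right-hand-side symbols one by one."""
    if not rhs:
        yield (symbol, *children), i
        return
    for child, j in parse(rhs[0], words, i):
        yield from expand(symbol, rhs[1:], words, j, children + [child])


def complete_parses(sentence):
    """Return all parse trees that consume the whole sentence."""
    words = sentence.lower().rstrip(".").split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]


for tree in complete_parses("The old man the boat."):
    print(tree)
```

The first attempt reads "the old man" as Det Adj N and then fails to find a verb at "the"; the parser backs up, reads "the old" as the noun phrase, and takes "man" as the verb, which is the only complete parse under this grammar.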

But still, this kind of analysis (part-of-speech tagging, dependency parsing, etc.) has been deemed useless now that we have NN transformer models.

The solutions to these low-level problems seem to be unimportant for high-level tasks. Not to mention that errors propagate: an error in part-of-speech tagging propagates to the dependency parser that uses that information, and eventually it affects NER, entity/relationship extraction, and similar tasks.
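
A back-of-the-envelope sketch of the propagation argument, under a deliberately crude assumption: each stage's accuracy is measured on correct input, and any upstream mistake also ruins the downstream result. The stage accuracies below are hypothetical numbers chosen for illustration, not measurements of any toolkit.

```python
from math import prod


def pipeline_accuracy(stage_accuracies):
    """Chance that every stage is right, assuming upstream errors always propagate."""
    return prod(stage_accuracies)


# Hypothetical per-stage accuracies (not real benchmark figures).
stages = {"pos_tagging": 0.97, "dependency_parsing": 0.92, "ner": 0.90}

end_to_end = pipeline_accuracy(stages.values())
print(f"end-to-end accuracy under this model: {end_to_end:.3f}")  # about 0.803
```

Real pipelines are not this pessimistic, since downstream components can sometimes recover from upstream errors, but the product shows why three individually decent stages can still add up to a noticeably weaker whole.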

This does use spaCy for English tokenization, but has somewhat wider support for languages in general.

E.g., spaCy has only just finished their Chinese models, in beta. Discussion: https://github.com/howl-anderson/Chinese_models_for_SpaCy/is...

You can use spaCy for English tokenization, or you can use their neural model. The neural model will generally do better, especially at sentence segmentation, but will be slower.

Nice to see an alternative, or rather a complementary library, to the well-worn NLTK.

Hopefully Stanza supports supplemental training of its models.

NLTK, for my uses, has been dethroned by Spacy for years now. I'm very curious to see how Stanza compares. It looks like it's built on PyTorch, so very interested to check it out.

For a lot of things, yes, but NLTK still has a much bigger and better variety of tokenizers.

Interestingly, I had trouble using spaCy, as it requires an internet connection to AWS to load its models and therefore to work at all. This was blocked by deep packet inspection in my use case.

I looked a bit for a workaround but finally decided to just do the work in NLTK.

That's one disadvantage of Spacy.

You can absolutely download the models locally:

# download best-matching version of specific model for your spaCy installation

  python -m spacy download en_core_web_sm

If I recall correctly, you can pre-download the models you need, which might be suitable for your use case.
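
As a rough sketch of the offline workflow: spaCy models are installed as ordinary Python packages, so once one is installed (for example from a wheel copied onto the restricted machine), loading it does not need a network connection. The helper names below are invented for this example.

```python
import importlib.util


def model_is_installed(name="en_core_web_sm"):
    """True if the spaCy model package `name` is importable in this environment."""
    return importlib.util.find_spec(name) is not None


def load_local_model(name="en_core_web_sm"):
    """Load an already-installed spaCy model without touching the network."""
    if not model_is_installed(name):
        raise RuntimeError(
            f"{name} is not installed; install it first, e.g. with "
            f"`python -m spacy download {name}` on a machine with access"
        )
    import spacy  # imported lazily; requires `pip install spacy`

    return spacy.load(name)


print(model_is_installed("en_core_web_sm"))
```

If the check prints False, the model still has to be brought onto the machine some other way before `load_local_model` will work.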

gensim and spacy have been in that space for a while now as well.

This is basically hugging face with more languages?

The BERT multilingual models from HuggingFace [1] have even more languages than Stanza (100+ vs. 66).

Edit: And they also have a pipeline abstraction [2], so you could say that HuggingFace is also a full toolkit.

[1]: https://huggingface.co/transformers/multilingual.html

[2]: https://huggingface.co/transformers/main_classes/pipelines.h...

How does this differ from stanfordnlp (the previous version)? Is this simply a rebranding of stanfordnlp?

At first glance at the examples in the repo (not in depth by any means), I'm getting a lot of spaCy vibes.

Do these libraries use deep learning models? It's not entirely clear from the doco.

You can also try out Stanza in spaCy --- Ines updated the spacy-stanfordnlp wrapper to use the new version pretty much immediately: https://github.com/explosion/spacy-stanza

I was asked by a friend about Stanza in a DM; I'll paste the answer here as I think others might find it helpful:

    Q: are stanza models more accurate and consistent than
    spacy as this tweet claims?

    A: Yeah definitely, our models are quite a bit behind 
    state-of-the-art atm because we're still optimized for 
    CPU. We're hoping to have a spacy-nightly up soon that
    builds on the new version of Thinc.

    The main thing we want to do differently is having 
    shared encoding layers across the pipeline, with several
    components backproping to at least some shared layers of
    that. So that took a fair bit of redesign, especially to
    make sure that people could customize it well.

    We never released models that were built on wide and 
    deep BiLSTM architectures because we see that as an 
    unappealing speed/accuracy trade-off. It also makes the
    architecture hard to train on few examples, it's very 
    hyper-parameter intensive which is bad for Prodigy.

    Their experiments do undercount us a bit, especially 
    since they didn't use pretrained vectors, while they 
    did use pretrained vectors for their own and Flair's 
    models. We also perform really poorly on the CoNLL-03 
    task. I've never understood why --- I hate that dataset.
    I looked at it and it's like, these soccer match 
    reports, and the dev and test sets don't correlate well.
    So I've never wanted to figure out why we do poorly on 
    that data specifically.

As an example of what I mean by "undercounting", we can get to 78% on the GermEval data, while their table has us at 68%, and FLAIR and Stanza are at 85%. So we're still behind, but by less. The thing is, the difference between 85 and 78 is actually quite a lot -- probably more than most people would intuit.
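
One common way to see why a 7-point gap is bigger than it looks is to compare error rates instead of scores. This is a sketch of that arithmetic using the three figures quoted above; the function name is made up, and this framing is only one possible reading of the point.

```python
def relative_error_reduction(baseline_score, improved_score):
    """Fraction of the baseline's errors removed when moving to the improved score.

    Scores are percentages (0-100); error is taken as 100 - score.
    """
    baseline_error = 100 - baseline_score
    improved_error = 100 - improved_score
    return (baseline_error - improved_error) / baseline_error


# 78 -> 85: errors drop from 22 to 15, roughly a third of the mistakes gone
print(f"{relative_error_reduction(78, 85):.1%}")  # 31.8%

# 68 -> 78: errors drop from 32 to 22
print(f"{relative_error_reduction(68, 78):.1%}")  # 31.2%
```

Seen this way, going from 78 to 85 removes about as large a share of the remaining errors as going from 68 to 78 does.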

I hope we can get back to them with some updates for specific figures, or perhaps some datasets can be shown as missing values for spaCy. Running experiments with a bunch of different software and making sure it's all 100% compatible is pretty tedious, and it won't add much information. The bottom line anyone should care about is, "Am I likely to see a difference in accuracy between Stanza and spaCy on my problem?" At the moment I think the answer is "yes". (Although spaCy's default models are still cheaper to run on large datasets.)

We're a bit behind the current research atm, and the improvements from that research are definitely real. We're looking forward to releasing new models, but in the meantime you can also use the Stanza models with very little change to your spaCy code, to see if they help on your problem.

I'd like to know:

How the language model-based tokenizers fare on domain-specific documents, since language models don't have context for unknown tokens.

Are language model-based tokenizers any better at identifying abbreviations than rule-based ones?
