Show HN: Languagecrunch – NLP server Docker image (docker.com)
152 points by artpar 7 months ago | 24 comments


Relevant links for anyone interested:

* spaCy on GitHub: https://github.com/explosion/spacy

* NER demo: https://demos.explosion.ai/displacy-ent/

* Neural coref by HuggingFace: https://huggingface.co/coref/

* Accuracy of built-in spaCy models: https://spacy.io/usage/facts-figures

Last time I calculated, the lowest-cost way to run spaCy in the cloud was on Google Compute Engine n1-standard preemptible instances. It should be over 100x cheaper per document than using Google's, Amazon's or Microsoft's cloud NLP APIs. Accuracy will depend on your problem, but if you have your own training data, performance should be similar.

I've run spaCy on small $10/month Linux VPS instances. What do you mean by cheapest? I'm sure you're right, but what volume are you referring to?

I'm referring to best price per word when the service is continually active. Like, if you want to parse a web dump, what type of instance do you provision a bunch of?
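A back-of-the-envelope way to compare instance types on price per word, as a minimal sketch. The instance names, hourly prices and throughput figures below are placeholders rather than measured numbers; substitute your own benchmarks.

```python
def cost_per_million_words(hourly_price_usd, words_per_second):
    """Cost in USD to parse one million words on a continually busy instance."""
    if words_per_second <= 0:
        raise ValueError("words_per_second must be positive")
    words_per_hour = words_per_second * 3600
    return hourly_price_usd / words_per_hour * 1_000_000


# Hypothetical instances: (hourly price in USD, words parsed per second).
# These values are illustrative assumptions, not real quotes or benchmarks.
candidates = {
    "small-preemptible": (0.01, 10_000),
    "large-on-demand": (0.40, 80_000),
}

# Print the candidates from cheapest to most expensive per million words.
ranked = sorted(candidates.items(), key=lambda kv: cost_per_million_words(*kv[1]))
for name, (price, throughput) in ranked:
    cost = cost_per_million_words(price, throughput)
    print(f"{name}: ${cost:.6f} per million words")
```

The comparison only holds when the instance is kept busy; idle time raises the effective price per word.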

This coref thing [1] is great, the visualization is handy. I'd love to see it combined with NER [2].

[1] https://goo.gl/JnS95x

[2] https://goo.gl/k5YRjN

GCE's preemptible instances are so much easier to use and manage than AWS's spot instances. I made a rule to build only stateless services just to leverage these.

What do you need on the docker host machine to run this? Any specific docker version? GPU?

Also, it would be useful to see the Dockerfile or script that generated this img.

No special requirements. I run it on an Ubuntu 16 server in production and locally on OS X.

Adding a GPU would probably improve the performance of both spaCy and neuralcoref.

> Also, it would be useful to see the Dockerfile or script that generated this img.

Will put it on GitHub shortly.

Seems interesting! How can it be used with languages other than English?

spaCy only works with English, German, Spanish, Portuguese, French, Italian and Dutch.

FastText for example has pretrained embeddings for 294 languages: https://github.com/facebookresearch/fastText/blob/master/pre...

Google's Parsey McParseface handles POS tagging for 53 languages: https://github.com/tensorflow/models/blob/f87a58cd96d45de73c...

So spaCy has support for these languages [1] and WordNet has support for these [2], but neuralcoref (the pronoun resolution endpoint) is available only for English.

The current Docker image does not expose those other languages, but I can expose them in an update if it would help a lot of people.

[1] https://spacy.io/usage/models

[2] http://compling.hss.ntu.edu.sg/omw/

Thanks for the insights. Could you please share the Dockerfile so that one can make the other languages work?

spaCy models for different languages and how to use them: https://spacy.io/usage/models
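For reference, loading a non-English pipeline in spaCy itself looks roughly like the sketch below. It is independent of the Docker image, and the mapping is an assumption: it lists the small pretrained packages published by spaCy for the languages mentioned above, each of which must be installed separately (for example with `python -m spacy download de_core_news_sm`). The helper names `model_for` and `load_pipeline` are made up for illustration.

```python
# Assumed mapping from language code to a small pretrained spaCy model package.
SMALL_MODELS = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
    "it": "it_core_news_sm",
    "nl": "nl_core_news_sm",
    "pt": "pt_core_news_sm",
}


def model_for(lang):
    """Return the model package name configured for a language code."""
    try:
        return SMALL_MODELS[lang]
    except KeyError:
        raise ValueError(f"no model configured for language {lang!r}") from None


def load_pipeline(lang):
    """Load the spaCy pipeline for a language (the package must be installed)."""
    import spacy  # imported lazily so the mapping works without spaCy installed

    return spacy.load(model_for(lang))


# Usage (requires spaCy and the German model):
#   nlp = load_pipeline("de")
#   doc = nlp("Angela Merkel besuchte im Juli Paris.")
#   print([(ent.text, ent.label_) for ent in doc.ents])
```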

The demo examples are wrong or don't make much sense.

"Donald Trump's administration" is not a person.

In the following example, "The currency" is not a subject and "India" is not an object.

I don't know how much useful information is extracted by this system.

That example is a tweet, which the syntax and NER models haven't been trained on. You can make calls to `nlp.update()` to improve it on your own data. We also have an annotation tool, https://prodi.gy , to more quickly create training data.

(I'm the author of spaCy, not this Docker container.)
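The `nlp.update()` suggestion above can be sketched as follows. This assumes the spaCy v3 training API (`Example.from_dict`, `nlp.resume_training()`, `nlp.select_pipes()`), which differs from the API available when this thread was written. The example sentence, the helper names and the choice to update only the `ner` component are illustrative assumptions, not the author's recipe.

```python
import random


def find_entity(text, phrase, label):
    """Return a (start, end, label) character-offset span for phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)


def build_training_data():
    """Build (text, annotations) pairs in the dict format spaCy accepts."""
    # Hypothetical tweet-like example; real data would come from annotation.
    text = "Trump administration announces new tariffs on steel"
    return [(text, {"entities": [find_entity(text, "Trump", "PERSON")]})]


def fine_tune(model_name, train_data, epochs=10):
    """Update the NER component of an installed pipeline on your own data."""
    import spacy  # imported lazily; requires spaCy v3 and the named model
    from spacy.training import Example

    nlp = spacy.load(model_name)
    optimizer = nlp.resume_training()
    data = list(train_data)  # copy so shuffling does not reorder the caller's list
    with nlp.select_pipes(enable=["ner"]):  # leave the other components untouched
        for _ in range(epochs):
            random.shuffle(data)
            losses = {}
            for text, annotations in data:
                example = Example.from_dict(nlp.make_doc(text), annotations)
                nlp.update([example], sgd=optimizer, losses=losses)
    return nlp


# Usage (requires spaCy v3 and an installed model):
#   nlp = fine_tune("en_core_web_sm", build_training_data())
```

A handful of examples will not fix a model on their own; in practice you would mix in many annotated texts from the target domain.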

spaCy is wonderful. I've used it a lot over the years and I have high confidence in its output.

I just wish the author of this Docker container had chosen demo sentences that advertised it better.

> "The currency" is not a subject and "India" is not an object.

But aren't "subject" and "object" here meant to indicate the Subject-Verb-Predicate (object) structure of the sentence, rather than a literal object?

"India" is neither the predicate nor the object of the sentence.

You are correct. That is clearly a wrong example. Will change that.

Also, that issue is in my code (a poor naming choice). I will put the code up on GitHub soon. Hope that helps.

@artpar, can you post some docs on the endpoints and how they should be used? I want to tie this into a speech-to-text system, but I need more API info.

Added docs at the bottom of the README.


Cool! What corpus was this trained on?

It uses the "en_core_web_lg" model for spaCy [1] and neuralcoref with the pre-trained models on GitHub [2].

[1] https://spacy.io/models/en

[2] https://github.com/huggingface/neuralcoref/tree/bee05b1b55e3...
