
Prodigy: A new tool for radically efficient machine teaching - Young_God
https://explosion.ai/blog/prodigy-annotation-tool-active-learning
======
plusepsilon
I don't think (some) people understand; a slick data annotation tool like this
is vastly more useful than the 20th variant of GAN that DeepMind produces :)

~~~
transcranial
Totally, I think people have this weird sense of entitlement when it comes to
high-quality datasets without the commensurate respect for how they're created
or the level of effort that goes into them.

Fei-Fei Li gives a good sense for this in her history of ImageNet [1][2].

[1] [https://qz.com/1034972/the-data-that-changed-the-direction-o...](https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/)

[2] [http://image-net.org/challenges/talks_2017/imagenet_ilsvrc20...](http://image-net.org/challenges/talks_2017/imagenet_ilsvrc2017_v1.0.pdf)

------
AndrewKemendo
Looks promising and definitely a needed tool. I signed up for the beta and I
used the demo version and have a couple of thoughts.

1. This seems closer to a reinforcement learning system than a pure
annotation system. That seems to be by design; however, based on the demo, I
am not able to change or add to the annotations as I go, which is a big
limitation. It's just yes, no (no feedback), ignore and undo. This is in
contrast to something like the VGG annotation system:
[http://www.robots.ox.ac.uk/~vgg/software/via/via.html](http://www.robots.ox.ac.uk/~vgg/software/via/via.html)

2. I don't see an actual annotation capability for images in the demo. Not
sure if that is just a pretotype page, but IMO image
classification/segmentation is the place where this tool would really benefit
the community.

3. It's unclear to me how or if I retrieve my trained model or even just the
annotated structure (.csv? .json?) from this system. Do I get a .pb somehow
that I can import into TF, or am I locked into an API with my new model served
from Prodigy? My guess would be the latter.

I think what this wants to be is a human validation system for training, one
that also improves the Prodigy nets through crowdsourcing. Definitely a
win-win in the short term, but it is limited by the initial model and by how
much the user/client can tweak the system and export the results.

Matroid is doing something similar here, but I have been unimpressed with
their offering so far.

~~~
syllogism
Thanks for the engaging questions! Reading between the lines, I think there's
an important point that hasn't come across. Prodigy isn't SaaS --- it's a
library you download and run. You can extend and customise every aspect of it,
and there's definitely no lock-in. The model (and annotations) never have to
leave your servers.

For the specific questions:

1. The built-in web views all have binary annotation interfaces. This is more
of a design choice than a fundamental limitation, and the front-end is
extensible --- you can add your own web views if you need to.

The binary interface is sort of a position statement. We think this is The
Way, so we want you to try it. We'll have more input components in future, but
at the start we want to guide people towards the intended workflow.

2. The beta focuses on NLP support, but there's a front-end for image
classification, and a workflow page:
[https://prodi.gy/docs/workflow-image-classification](https://prodi.gy/docs/workflow-image-classification)

3. You can usually get some accuracy improvement by retraining once all the
annotations are available. I've not found a streaming SGD algorithm that works
as well as the simple iterate-and-shuffle batch process. Batch training also
lets you tune the hyper-parameters. You can read more about this here:
[https://prodi.gy/docs/workflow-named-entity-recognition#trai...](https://prodi.gy/docs/workflow-named-entity-recognition#training)

I would suggest writing a Prodigy recipe to do the batch training. That way
you can pass in the dataset ID, instead of exporting the annotations. There's
no problem with exporting the annotations and running a script, though. Again
--- it's all on your computer. You can run it however you like.
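As a plain-Python sketch of that iterate-and-shuffle batch process (a toy
perceptron-style update on made-up 1-D data, not Prodigy's actual training
code):

```python
import random

def batch_train(examples, update, n_epochs=10, batch_size=8):
    """Iterate-and-shuffle batch training: reshuffle the full dataset
    every epoch, then apply updates in small minibatches."""
    examples = list(examples)
    for _ in range(n_epochs):
        random.shuffle(examples)
        for i in range(0, len(examples), batch_size):
            update(examples[i:i + batch_size])

# Toy usage: learn the sign of 1-D points with a perceptron-style update.
weight = [0.0]  # boxed in a list so the closure can mutate it

def update(batch):
    for x, y in batch:
        pred = 1 if x * weight[0] > 0 else 0
        weight[0] += 0.1 * (y - pred) * x

data = [(x / 10.0, 1 if x > 0 else 0) for x in range(-50, 51) if x != 0]
batch_train(data, update)
```

The point of the reshuffle is that each epoch sees the data in a fresh order,
which is what a purely streaming update can't do.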

~~~
AndrewKemendo
_I think there's an important point that hasn't come across. Prodigy isn't
SaaS --- it's a library you download and run._

You're right, I totally missed that. Re-reading, that comes through, but it is
definitely different from what I would have expected.

Thanks for the response, I'll dig in further.

------
gh1
I love how the SpaCy related websites are always so well designed. Their
dependency graph visualizer is just amazing. I know that Ines is behind that
one, but don't know about the other stuff.

Now coming back to the topic, I have so far just used Jupyter Notebooks and
spreadsheets to do annotations and by golly, it is an extremely boring and
tedious process. This looks like a fun tool to try out for my next NLP related
project. Might spice things up!

But I hope that like all SpaCy related ideas, it doesn't assume too much about
the problem at hand. I usually use NLTK instead of SpaCy because it allows me
to be very flexible, except for the sentence tokenizer, where SpaCy's accuracy
is hard to beat.

~~~
syllogism
Explosion is just me and Ines -- so yes, the pages are all made by Ines (with
the great illustrations by Frederique Matti). Ines also wrote the bulk of the
code for Prodigy itself.

~~~
gh1
Hats off to you Ines and Frederique. You guys do really great work.

------
infinitone
So this isn't OSS? Seems atypical in the ML community.

For those looking for alternative OSS solutions: BRAT and labelImg are decent.

------
theincredulousk
Guess the radical efficiency didn't carry over to their web server

~~~
blueyes
> spaCY, the leading open-source NLP tool?

Sounds like marketing BS. What about OpenNLP and Stanford's CoreNLP?

~~~
kafkaesq
_spaCY, the leading open-source NLP tool?_

Agreed, the description is definitely cringe-worthy.

As if whoever wrote that wasn't aware that these are _language geeks_ they're
marketing to.

~~~
retainingwall
Self-respecting language geeks keep up with the times. What's your case for
"leading open-source"? Here's a look at Spacy blowing Stanford Core NLP out of
the water (via github stars, you can take a look at commits and more from the
same tool):
[https://www.datascience.com/trends?trends=4812,7214,7165&tre...](https://www.datascience.com/trends?trends=4812,7214,7165&trend_names=spacy-
io/spacy+explosion/spacy+stanfordnlp/corenlp&avg=7&scaling=absolute&metric=Stars)

~~~
kafkaesq
Actually, I don't follow any of these tools closely enough to know whether
they're currently "leading" or not.

It's just that, wording-wise, "the leading open source X" exudes
marketing-speak, which I find language geeks tend to have robust antibodies
against.

This kind of lingo works (sort of) for the market, say, MongoDB is in. But for
the users of these tools, I suspect not so much.

------
rayuela
That's a nice UX but the flurry of initial upvotes on this looks kinda fishy,
especially given that it's just annotation software.

~~~
Gimpei
I'm a data scientist, and getting annotations for our data is one of our most
onerous issues. I upvoted this. If it works well, I could see myself using it
all the time. Making a model that gets you most of the way there is the easy
part; it's getting clean, annotated data that's hard. Uggh.

~~~
IanCal
Yep agreed, we've had to build similar things internally.

Getting labelled data is a pain.

------
imh
Since syllogism is participating in this thread, what kind of active learning
are you using? I'm always hesitant to use anything except for IWAL since most
of the more common ones aren't actually consistent. Even then, the payoff
tends to be kinda disappointing.

(But I'm definitely not an expert)

~~~
syllogism
Yes, it uses importance weighted active learning. You can set the priorities
yourself, but the default built-in sorter just uses distance from 0.5.
There's a random component to help make sure the model doesn't get stuck
asking the wrong questions.
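As a rough illustration (hypothetical code, not Prodigy's actual sorter),
preferring uncertain examples with a small random component might look like:

```python
import random

def prefer_uncertain(scored_stream, jitter=0.05, threshold=0.4, seed=0):
    """Yield examples whose score is close to 0.5 (the model is unsure),
    perturbed by a small random component so the queue doesn't get stuck
    asking about one region of the score space."""
    rng = random.Random(seed)
    for example, score in scored_stream:
        # Distance from the decision boundary, plus a little noise.
        uncertainty = abs(score - 0.5) + rng.uniform(-jitter, jitter)
        if uncertainty < threshold:
            yield example

stream = [("ex%d" % i, i / 10.0) for i in range(11)]  # scores 0.0 .. 1.0
queued = list(prefer_uncertain(stream))
```

Confidently scored examples (near 0.0 or 1.0) are skipped; ambiguous ones
(near 0.5) are queued for the annotator.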

~~~
imh
Thanks for the answer :)

Great work here, btw. It's refreshing to see emphasis on "label some damned
data" and work towards making that easy.

------
michaelbarton
This looks interesting because it adds the ability to put the user in the loop
of fixing/annotating the problematic observations relatively easily. I like
the example of Tinder for data.

Are the examples picked those that have the highest objective function error
rate, or something similar?

Does this apply only to text classification problems? Are there examples where
this could be applied to tabular data?

------
dustinkirkland
This could be a headline from 1987 :-) (cue the dialup modem sound)

------
38kkdiu
Looks very nice, although it always takes me a bit to figure out what they're
talking about with these sorts of things because I have to remind myself that
most ML/DL stuff is supervised. What I research is unsupervised.

They kind of have this weird dissing of unsupervised scenarios, though. It's
not like supervised or unsupervised is better or worse; they just address
different problems. They can talk up their product without needing to
criticize a problem domain.

It's like if you were making motors for boats, and then started talking about
"these crazy people who think it's better to fly." ???

~~~
syllogism
I see how that came across as obnoxious, so thanks for the perspective.

I do think there's a pretty common failure mode, though: teams who don't have
much experience with ML often take "We don't have much data" as a parameter of
their problem, and don't see that this is something they can decide to change.
This can lead to a lot of time spent experimenting with different unsupervised
approaches that are a poor fit for what they're trying to do.

------
visarga
How many languages are supported? I see many more languages in Google's
Syntaxnet library. What's keeping you from having the same list of 40
languages for POS tagging?

[https://github.com/tensorflow/models/blob/master/syntaxnet/g...](https://github.com/tensorflow/models/blob/master/syntaxnet/g3doc/universal.md)

~~~
syllogism
The UD treebanks have made it very easy to offer lots of POS tagging and
dependency parsing models under a CC BY-NC license. We'll be putting up more
of these for download as spaCy 2 stabilises.

We're mostly worried about saying we "support" a language when we've just
trained a tagger on a UD treebank, though. We like at least having the stop
words and tokenizer exceptions filled in by a native speaker, so the usual
flow has been that someone needs the functionality, and they make a pull
request.

If you just need the UD model for, say, Bulgarian, you can do:

    python -m spacy train xx /path/to/output_model /path/to/bulgarian-train.conllu /path/to/bulgarian-dev.conllu --no-entities

We don't have a spacy.bg.Bulgarian language class yet, so you can either add
one, or use the multi-language class, which usually works OK.

------
Gallactide
I've been pulling my hair out and losing sleep over a specific problem I need
to solve for a client. This tool, along with the linked spaCy lib, has not
only reduced the complexity of the task to something manageable, but has also
drastically reduced the projected completion time. In other words: Holy shit,
thank you OP.

------
technologia
I like the simpler annotation UI; you can get more of your team active with
annotation in a Mechanical Turk fashion.

------
syllogism
[http://mirror.explosion.ai/blog/prodigy-annotation-tool-acti...](http://mirror.explosion.ai/blog/prodigy-annotation-tool-active-learning)

Sorry about the poor performance on the site! We got complacent because all of
our sites are 100% static.

------
Xeoncross
To me, Matthew and Ines are to NLP as Bernstein & co are to cryptography.

------
sgt101
This one is good too:
[https://hazyresearch.github.io/snorkel/](https://hazyresearch.github.io/snorkel/)

------
En_gr_Student
So this is just fluff?

~~~
madisonmay
Nope, not just fluff. New ML model architectures get too much hype, while it's
relatively simple tools like this that actually make the difference in whether
or not ML can be applied to industry problems. The low-hanging fruit in the ML
industry is in workflow tools rather than novel model architectures. I have a
huge amount of respect for the folks at explosion.ai, largely because their
solutions are consistently good in practice rather than good in theory.

~~~
aub3bhat
You might be interested in Deep Video Analytics, it's a visual data analytics
platform that I am building. [1]

[1]
[https://github.com/AKSHAYUBHAT/DeepVideoAnalytics](https://github.com/AKSHAYUBHAT/DeepVideoAnalytics)

------
SubiculumCode
I work with data as a neuroscientist, but I haven't used ML. What is an
annotation in this context?

~~~
stephengillie
I have an NLP bot as a hobby, but use old-fashioned statistics instead of ML.
It looks like annotations here mean manually training the bot by seeding the
learning data with hints.

~~~
SubiculumCode
A utility for labeling training data for supervised learning then?

~~~
IanCal
Yes, with a bit of a twist. As I understand it, it'll keep retraining the
model and asking you to label the examples it's least sure about. This is a
lot faster and better than randomly labelling your data or trying to do it
all.
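A toy sketch of that retrain-and-ask loop (all names here, and the
threshold "model", are made up for illustration; a real tool retrains a
proper model between batches):

```python
import math

def active_learning_loop(pool, train, predict_proba, label_fn,
                         rounds=3, per_round=2):
    """Repeatedly retrain, then ask for labels on the unlabelled
    examples the model is least sure about (probability nearest 0.5)."""
    labelled = []
    unlabelled = list(pool)
    for _ in range(rounds):
        model = train(labelled)
        unlabelled.sort(key=lambda x: abs(predict_proba(model, x) - 0.5))
        batch, unlabelled = unlabelled[:per_round], unlabelled[per_round:]
        labelled.extend((x, label_fn(x)) for x in batch)
    return labelled

# Toy problem: points on a line; the "model" is a decision threshold.
def train(labelled):
    pos = [x for x, y in labelled if y == 1]
    neg = [x for x, y in labelled if y == 0]
    if not pos or not neg:
        return 0.0  # no information yet: default threshold
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_proba(threshold, x):
    return 1 / (1 + math.exp(-(x - threshold)))

pool = [i / 10.0 for i in range(-10, 11)]
labelled = active_learning_loop(pool, train, predict_proba,
                                label_fn=lambda x: 1 if x > 0.5 else 0)
```

The annotator's effort goes to the points nearest the decision boundary,
rather than being spread uniformly over the pool.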

------
erik14th
Wow, been a while since I've touched UX this good, loved the themes, is this
open source?

------
ipunchghosts
link dead already

~~~
syllogism
Working for me?

~~~
ipunchghosts
working now

------
egor598
Site down?

~~~
syllogism
It's a 100% static site, but Apache is still struggling :(. Should have used a
bigger droplet...Sorry!

~~~
carbocation
Turn off KeepAlive, if it's on.

~~~
user5994461
Bad advice. That's likely to make it a lot worse.

~~~
carbocation
I disagree with you based on experience, but you don't have to take my word
for it. 'patio11 also has had some experience here:
[http://www.kalzumeus.com/2010/06/19/running-apache-on-a-
memo...](http://www.kalzumeus.com/2010/06/19/running-apache-on-a-memory-
constrained-vps/)

~~~
user5994461
Also disagree with you based on experience ;)

patio11's blog is HTTP-only, but this blog is HTTPS.

HTTPS without keepalive is likely to kill any cheap VPS; establishing an HTTPS
connection is expensive.

That being said, the core of the issue is that they should use nginx (or
Apache with mpm_event). And they should be behind Cloudflare.
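For reference, the Apache directives being debated (illustrative values only;
the right settings depend on the worker model and traffic):

```apache
# Reuse each client's TLS handshake across requests, but time idle
# connections out quickly so they don't pin worker slots for long.
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 2
```

With a prefork worker model and long timeouts, keepalive ties a whole process
to each idle client, which is the failure mode patio11's post describes.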

~~~
patio11
_establishing HTTPS connection is intensive_

More intensive than adding numbers together, sure, but computers are pretty
fast. If you're doing 100k connections a second you might have to give some
thought to that. Meanwhile, if you have KeepAlive on, 2~5 _clients_ per second
will kill you.

------
caycep
this brings back dialup/bbs memories...

------
zhte415
Very wordy.. not very efficient..

