
Snorkel AI: Putting Data First in ML Development - polm23
https://www.snorkel.ai/07-14-2020-snorkel-ai-launch.html
======
mactournier
They totally ignored the open source package on their site.
[https://github.com/snorkel-team/snorkel](https://github.com/snorkel-
team/snorkel) Moreover, based on the commit history on GitHub and their
README.md, my understanding is that they will stop supporting their open
source repo. Best of luck to them.

~~~
kmax12
If anyone is looking for an open source library in this space, I work on one
called Compose
([https://github.com/FeatureLabs/compose](https://github.com/FeatureLabs/compose)).

With Compose, a user defines a labeling function, and then Compose scans the
historical data looking for training examples to train a machine learning
model.

The library has evolved as we apply it to more and more real-world use cases,
but it is based on the approach in this paper from 2016:
[https://dai.lids.mit.edu/wp-
content/uploads/2016/08/07796929...](https://dai.lids.mit.edu/wp-
content/uploads/2016/08/07796929.pdf).
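A minimal pure-Python sketch of that scan (illustrative only -- these function and field names are made up, NOT the actual composeml API): a labeling function computes a label from a window of historical records, and a search loop slides that window over the data to generate training examples.

```python
# Illustrative sketch of label generation by scanning history with a
# labeling function (made-up names; NOT the composeml API).

def total_spent_over_100(window):
    """Labeling function: did this window of transactions exceed 100?"""
    return sum(txn["amount"] for txn in window) > 100

def search(transactions, labeling_function, window_size):
    """Slide a fixed-size window over time-ordered records and emit
    (cutoff_index, label) training examples."""
    examples = []
    for start in range(len(transactions) - window_size + 1):
        window = transactions[start:start + window_size]
        examples.append((start, labeling_function(window)))
    return examples

history = [{"amount": a} for a in [20, 90, 30, 5, 10]]
labels = search(history, total_spent_over_100, window_size=2)
# labels == [(0, True), (1, True), (2, False), (3, False)]
```

In Compose itself the analogous pieces are the labeling function you hand to its LabelMaker and its search over cutoff times.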

~~~
eggie5
can you do multi-label w/ Compose? Snorkel only supports single-label.

~~~
jeff-hernandez
Yes, you can represent the labeling function as a class and use its methods to
represent each label individually.
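A tiny sketch of that class-based pattern (hypothetical names, plain Python -- not a specific Compose API):

```python
# Hypothetical class-based labeler: one method per label, so a single
# example can receive several independent binary labels (multi-label).

class TransactionLabeler:
    def high_spend(self, window):
        return sum(t["amount"] for t in window) > 100

    def frequent(self, window):
        return len(window) >= 3

    def labels(self, window):
        """Multi-label output: one independent binary label per method."""
        return {"high_spend": self.high_spend(window),
                "frequent": self.frequent(window)}

window = [{"amount": 60}, {"amount": 70}]
print(TransactionLabeler().labels(window))
# {'high_spend': True, 'frequent': False}
```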

------
ZeroCool2u
A few colleagues and I attended the _absolutely packed_ presentation the
Snorkel folks did at ODSC in Boston last year and came away so convinced by
this approach that we actually built our intern's summer project around
using Snorkel for multi-label classification on complex, very domain-specific
financial documents. The project was successful and both we and our intern
were very happy.

The main lessons learned from this exercise helped us identify where our
efforts would shift when using Snorkel. Of course there's never any free
lunch, but Snorkel makes what I believe is a very reasonable and effective
trade-off. It provides a huge decrease in overall costs, but critically it
shifts those costs towards the front of the development process. Writing a
good set of labeling functions is a non-trivial piece of work. It requires
the data scientist to have deep domain experience, or a few fairly large
blocks of time collaborating with and learning from a business user who is
already a domain expert. It has the upside, though, of forcing the data
scientist to build a solid foundation of domain knowledge, which I feel is
often underestimated in many ML projects.

Anyways, congrats to the team! Looking forward to checking out your future
work.

~~~
eggie5
can you educate me on how you do multi-label w/ snorkel? As far as I can
understand, that's one of its largest drawbacks.

~~~
ZeroCool2u
Fair question. I didn't personally supervise our last intern, it was my turn
the summer before, so I'm not as deeply familiar with it. Now that you bring
this up though, I think perhaps I may have misspoken. When I said multi-label,
I think that was our goal originally, but because of the constraints of
Snorkel you mentioned, we ended up reframing the problem into many single
class models instead. They would both work, but because of how our business
users worked, multi-label wasn't super important. For example, not all
business users are interested in every label, so I think what happened was
more than one model was trained, one for each label, and then ensembled based
on the business users' interests. Our final output allowed users to effectively
sort, filter, and search documents based on any combination of these labels.
Keep in mind too, some of these labels are fairly abstract, so just one of
them was fairly powerful by itself and could perhaps power an entire team in
some cases. I hope that helps, I'm sorry I can't go into too much more detail.

~~~
eggie5
yeah, you can do single-label w/ snorkel, but not multi-label. Multi-label
snorkel would be the killer feature bc making the negatives (ie for a softmax)
is very hard especially when you work w/ user-interaction systems with an
unknown negative distribution.

~~~
ajratner
You can always do multi-label as a multi-task learning model (or just a set of
binary models), which is something we (and many others) have explored before!
A lot of the adjustments for mainline Snorkel have to do with (A) the
semantics of the labeling functions (need to be able to express that something
is _not_ class A and/or have a general per-class prior) and (B) all the infra
to support what is just now a bunch of independent per-label binary tasks, at
base

~~~
eggie5
Snorkel has a label mutual exclusion assumption right?

My core problem is a multi-label problem, but my snorkel data, from the
LabelModel is inherently single-label (mutually exclusive). What is the
prevailing recommendation to do multi-label w/ Snorkel? Is the below what you
are currently recommending?

For a given, k-wise multi-label problem:

1\. Generate k binary datasets w/ LabelModel

2\. Train k separate binary classifiers, one for each respective dataset

3\. At inference/prediction time, pass the input through the k classifiers and
get scores.

Is this what the current recommendation is? Create a set of binary
classifiers?
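That reduction can be sketched in plain Python (the `label_model` below is a crude stand-in for training a per-label Snorkel LabelModel, not its real API):

```python
# Sketch of the k-binary reduction above (pure Python; `label_model` is a
# crude stand-in for a per-label Snorkel LabelModel).

def label_model(binary_lf_votes):
    """Combine binary LF votes (1 = positive, 0 = negative, -1 = abstain)
    by majority; abstain (-1) if every LF abstained."""
    votes = [v for v in binary_lf_votes if v != -1]
    return int(sum(votes) * 2 > len(votes)) if votes else -1

def build_binary_datasets(points, lfs_per_label):
    """Step 1: one (noisily) labeled binary dataset per label."""
    return {label: [(x, label_model([lf(x) for lf in lfs])) for x in points]
            for label, lfs in lfs_per_label.items()}

# Toy example: two labels over integer "documents".
lfs_per_label = {
    "big":  [lambda x: int(x > 10), lambda x: int(x > 8)],
    "even": [lambda x: int(x % 2 == 0), lambda x: -1],  # second LF abstains
}
datasets = build_binary_datasets([4, 9, 12], lfs_per_label)
# Steps 2-3: train one binary classifier per entry of `datasets`, then
# score a new input with all k classifiers at inference time.
```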

------
nobodywillobsrv
What does it actually do? I know "label data" but anything can do that.

Is it just a pipeline system with some helpers for running a couple of ML
related functions?

Is it UI based?

Where do you run it?

I know these areas well and got nearly nothing from reading the splash page on
the site.

~~~
gas9S9zw3P9c
I also have problems understanding what exactly it does. I just briefly
skimmed the paper, but it seems like the idea is as follows. Assuming you
don't have ground truth labels for your data:

1\. Generate many different noisy labels for your data by writing functions.
These don't need to be correct, but they should make uncorrelated errors. They
are basically the domain knowledge you have of your data.

2\. Snorkel takes the output of these functions and, based on their
(dis)agreement, builds a generative probabilistic model to decorrelate your
labels, which may have had some overlap in their errors.

3\. You train your final discriminative model on the output of that
probabilistic model.

So, the main idea is to create many noisy labels instead of relying on a
single high-quality label, and Snorkel does the hard work of figuring out how
to smartly combine these labels so you can train on something clean.
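Steps 1-2 look roughly like this in miniature (pure Python; this agreement-weighted vote is only a toy -- Snorkel's real generative model is far more principled):

```python
# Toy version of steps 1-2: noisy LF votes are combined into soft labels,
# weighting each LF by how often it agrees with the per-example majority.
# (Illustrative only; NOT Snorkel's actual generative model.)

def majority(votes):
    return int(sum(votes) * 2 >= len(votes))

def estimate_weights(vote_matrix):
    """Crude stand-in for the generative model: score each LF by its
    agreement rate with the per-example majority vote."""
    agree = [0] * len(vote_matrix[0])
    for votes in vote_matrix:
        m = majority(votes)
        for j, v in enumerate(votes):
            agree[j] += (v == m)
    return [a / len(vote_matrix) for a in agree]

def soft_labels(vote_matrix):
    w = estimate_weights(vote_matrix)
    return [sum(wi * v for wi, v in zip(w, votes)) / sum(w)
            for votes in vote_matrix]

# Step 1: three noisy binary LFs voting on four examples.
L = [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
probs = soft_labels(L)  # soft training labels for step 3
# The third LF disagrees often, so it gets less weight.
```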

~~~
ajratner
Yup! The LFs can also actually be correlated- just not _too_ correlated
(roughly: you need at least three mostly uncorrelated cliques).

Part of the high level description, though, is that a lot of different parts
and lines of work are integrated into Snorkel Flow beyond just this original
programmatic labeling idea. So also programmatic operators for data
augmentation, "slicing" or partitioning of data, and the overall end-to-end
platform (UI + SDK) supporting iterative development of ML models via this
paradigm of programmatic training data.

------
ca_parody
Forgive me - but how does this avoid the chicken&egg problem here. Without
digging through the promo copy, why would one programmatically label training
data to do ML on if they have such a program to label data...

~~~
sriku
Labels are knowledge about data. If you already know some rules that work
reasonably well based on your domain experience, then Snorkel lets you capture
those as "labeling functions" that may not cover the whole ground or can be
"noisy". Snorkel can then build a model to label your data accounting for the
"noise". Combining that with some "gold" labels (done by humans), you can use
the generated labels on a large data set to build a higher quality model that
generalizes better. This is similar to how you can take several low quality
models and by virtue of them having expertise over different parts of the
data, build an "ensemble" model that performs better than any of them.

Imho, tools like Snorkel ("weak supervision") are game changers for ML ..
though the biggies get all the press. So I'm excited to see this end-to-end
direction taken by the team.

~~~
master_yoda_1
Isn't this what has been done for years and called synthetic data generation,
simulation, etc.?

~~~
sriku
Not data generation. Label generation. .. but the charitable interpretation of
your question is valid - we've been doing such ensembling to make higher
quality models for some time now. What I feel it's getting now is some good
structure, practice, and tooling around it.

~~~
master_yoda_1
Yeh, then advertise it as a tool rather than AI. The problem is that Snorkel
is trying to sell snake oil on the name of Stanford and AI. Under the hood it
is just a data generation pipeline. Remember, you can't put labels on random
data. So "Not data generation. Label generation" totally does not make sense
and sounds to me like "brown sugar".

------
eggie5
anyone figure out how to do multi-label classification w/ snorkel? It seems
like its current formulation only supports single-label, ie softmax.

I find in practice, especially w/ user-interaction systems, most problems are
not single-label but multi-label. Also, in the single-label setting it's
often necessary to define a negative "OTHER" class, which is very difficult to
define w/ snorkel in my experience.

~~~
ajratner
Single label has been the applicable one for most of the applications we've
tackled to date, but agreed that multi-label is also very important! More
coming here soon...

~~~
eggie5
Thanks Alex. I'm sure you can relate that when you have an unbounded input
distribution (like w/ user-interaction systems), defining that other class w/
current snorkel is difficult/impossible.

~~~
ajratner
Yeah definitely- and would love to chat sometime, as this is a space I've at
least had less direct hands-on interaction with. There's a line of work in the
ML literature on "Positive unlabeled (PU) learning"\--basically, a setting
where there are only positive labels or abstains--with a lot of theoretical
ties to what our stuff rests on. I think a tie-in here is interesting. Of course, most
of these approaches rely on some (to varying degrees) hidden and very strong
distributional assumption... anyway looking forward to a chat!
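For readers unfamiliar with PU learning, the classic Elkan & Noto (2008) correction is tiny; here is a pure-Python sketch with a toy stand-in scorer (hypothetical names, illustrative only):

```python
# Sketch of the Elkan & Noto PU-learning correction: train g(x) to predict
# "labeled vs unlabeled", estimate c = E[g(x) | x is a labeled positive]
# on held-out positives, then p(y=1 | x) ~= g(x) / c.
# (Toy stand-in scorer; illustrative only.)

def pu_correct(g, held_out_positives, x):
    scores = [g(xp) for xp in held_out_positives]
    c = sum(scores) / len(scores)   # estimated label frequency
    return min(1.0, g(x) / c)

g = lambda x: 0.5 * x               # stand-in for a trained classifier
p = pu_correct(g, held_out_positives=[1.0, 0.8], x=0.6)
# p ~= 0.667
```

The "very strong distributional assumption" mentioned above is, in this particular method, that positives are labeled completely at random.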

~~~
eggie5
Thanks for the lead on PU Learning.

I signed up for a demo of the new platform, looking forward to chatting. Me
and a colleague from work spoke w/ Henry last year about a potential
partnership but I guess it got lost in the mix...

------
tadkar
One of the things I’m really curious about is how Snorkel deals with poor
labelling functions. More generally, labelling functions are another data
source for the model and are just as susceptible to corruption and other
real-world issues like completeness, bias, counterfactual issues, and
repetition.
Perhaps even more so because these are manually constructed. For example, you
can imagine that the person writing labelling functions writes effectively the
same rule many times over. My understanding of the paper is that Snorkel would
then weight this repeated labelling very heavily. I think weak supervision
techniques (at least the ones under the Hazy Research umbrella) require a
degree of skill in machine learning that is easy to underestimate if you just
think about the problem as an issue of domain understanding (or of writing
labelling functions, in Snorkel terms).

~~~
sanxiyn
My understanding is different. I think you are talking about correlated data
sources. Snorkel's surprising innovation is that it does NOT over-weight
correlated data sources.

~~~
tadkar
So, I am specifically talking about the scenario where all your _labelling
functions_ are highly correlated and there is little or no ground truth data
to come up with empirical weights for each of the labelling functions. An
example is the scenario where you have the label functions: x>5, x>4.99,
x>5.01 for some feature x. I am really struggling to see how Snorkel can
correct for the correlation, especially given the relatively simple generative
model in section 2.2 of the paper.
[https://arxiv.org/pdf/1711.10160.pdf](https://arxiv.org/pdf/1711.10160.pdf)

~~~
polm23
The Snorkel paper doesn't cover this in depth, the math is all in this paper:

[https://arxiv.org/abs/1703.00854](https://arxiv.org/abs/1703.00854)

I can't say I followed all the proofs, but it seems that under certain
limited assumptions about the labelling functions they prove their generative
model can do well.

Reading about Snorkel, it initially sounded like magic in the bad way, but
this does make it clear that if your labelling functions are garbage or have
certain kinds of problems, there's nothing it can do about that.

Even leaving aside the generative model I think the focus on function-based
data bootstrapping is great, which is why I've been following Snorkel's
projects for a while.

------
nl
I spent some time trying Snorkel (the open source version) and its predecessor
DeepDive.

It was extremely complicated to get it to do anything beyond the demos, and I
was never successful in getting it to do anything useful.

I ended up implementing some of the ideas myself, but I can't say I had any
great success.

~~~
eggie5
That's too bad. My colleagues and I have had good success using the current
snorkel package.

------
woeirua
I can see how this would work for tabular and text data, where the labeling
functions are well-defined. I don't understand, _at all_ , how this would work
with computer vision tasks where heuristics are pretty much impossible to
define.

That said, I don't see anything here that would prevent you from using a pre-
trained conv net as a labeling function, but I expect that multiple conv nets
trained on a small corpus of data would be biased and make correlated errors,
which violate their assumptions.

This looks super powerful in some cases, but I'm just not seeing how it can
possibly generalize to every ML problem.
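Wrapping a pre-trained model as a labeling function is typically done with an abstain threshold, so the label model can discount it when it is unsure; a sketch (pure Python, toy stand-in scorer, illustrative names):

```python
# Sketch: wrap any external model as a labeling function that votes only
# when confident and abstains otherwise. (Toy stand-in model; illustrative.)

ABSTAIN = -1

def model_as_lf(predict_proba, threshold=0.9):
    def lf(x):
        p = predict_proba(x)    # model's probability of the positive class
        if p >= threshold:
            return 1            # confident positive
        if p <= 1 - threshold:
            return 0            # confident negative
        return ABSTAIN          # let other LFs decide
    return lf

toy_model = lambda x: x         # stand-in for a pre-trained conv net
lf = model_as_lf(toy_model)
votes = [lf(x) for x in (0.95, 0.5, 0.05)]
# votes == [1, -1, 0]
```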

~~~
ajratner
First: Snorkel Flow _absolutely_ does not generalize to every ML problem :).
IMO defining where different systems and approaches do and don't work best is
one of the most important and most challenging problems in ML systems
research- as noted, we've worked to give detail on this for Snorkel over the
years... no perfect answers, but some notes below:

\- As you imply, a lot has to do with the available sources of input signal-
whether labeling functions, or 'transformation functions' for data
augmentation, or other ops we've worked on... the input is obviously key.

\- For data modalities like image, video, etc: Often the most successful
approach is to (A) rely on some pre-processed features or "primitives" and
write labeling functions over these- as my co-founder Paroma in particular has
published about over the years- and/or (B) use metadata

\- External models are definitely expressible as labeling functions, and we've
worked on exactly that problem of modeling (local) biases and correlations!

~~~
woeirua
Does it work for semantic segmentation? That's really where I'm struggling to
see how this could work.

~~~
ajratner
More advanced structured prediction tasks are still definitely on the cutting
edge- mainly IMO down to defining the semantics of the programmatic user input
like labeling functions for these kinds of tasks. Some recent work
([http://cs.brown.edu/people/sbach/files/safranchik-
aaai20.pdf](http://cs.brown.edu/people/sbach/files/safranchik-aaai20.pdf)) has
extended these semantics for sequence tagging, as an example- so some exciting
moves in this direction!

------
ianhorn
I have a question for the snorkel folks if they're lurking here. With noisy
label functions for large amounts of data, I could easily see cases where
the label functions' failures correlate with classes that are already having a
hard time (disadvantaged/underrepresented/marginalized groups, etc). Has there
been any work on using these tools while making sure to avoid dangerous
biases? It seems like the kind of tool that could amplify problems while still
being really useful for an average case.

~~~
ajratner
Lurking for a few more min... great question! Dealing with both class
imbalance and issues of pernicious biases in both underlying data
distributions and training labels is an extremely important topic. Our
underlying theory deals with local biases (e.g. individual labeling functions
or sources of training signal being biased) but systemic biases (e.g. the user
driving the system being biased) are certainly tougher.

One important and practical answer that we've found: with an approach like in
Snorkel Flow, you can _inspect_ the source of the training data and _correct_
it if biased- which you just can't do with e.g. a million hand labeled
training data points. So in practice this is a big advantage we've found.

On the theory / research side, this is definitely an area we want to pursue
further!

~~~
ianhorn
Thanks for the answer! The local vs systemic bias thing is particularly
interesting for a system like snorkel. I have a clarifying question.

I'm imagining a dumb example like recipes, where "1 tsp salt" is a common
format for an ingredient. I'd imagine that the majority of ingredients follow
that format, so it'd be a natural function to write. I'd also imagine that
there's a correlation between following that format and being a recipe with a
european background.

Generalize that a little bit, and almost by definition the simplest N rules
that get the most coverage will cover the majority cases best. Being outside
the majority cases is probably correlated with most "human issues," defined
however you want. Being an artifact of the properties of what the simplest N
rules cover, I'm not clear whether it'd be defined as local or systemic in the
sense you've worked on.

I'm curious whether this falls under the theory you've worked on already or
the theory you're talking about pursuing in the future. If it's something
you've worked on already, I'd be very interested in reading what you have.

~~~
ajratner
Great question! Let me rephrase so you can confirm my understanding: I have
some labeling functions (LFs) that are far more accurate on a majority subset
of the data than on one or more minority subsets or "slices" of the data...
and these subsets are not necessarily correlated with the class labels, so
this isn't a traditional class imbalance problem...

We've actually done some recent work on this
([https://papers.nips.cc/paper/9137-slice-based-learning-a-
pro...](https://papers.nips.cc/paper/9137-slice-based-learning-a-programming-
model-for-residual-learning-in-critical-data-slices)) where we have users
define these critical "slices" approximately so that the model being trained
can pay special attention to them (extra representation layers) so they don't
get drowned out by the majority subsets/slices. But definitely a lot more to
do in this area!

~~~
ianhorn
Cool idea, and thanks for the answer! I'll have to look more closely at the
paper :)

------
jeff-hernandez
As another option, Compose is an ML tool for labeling data. It structures the
labeling process and integrates easily with Featuretools, which automates
feature engineering. Definitely worth checking out!

[1]
[https://github.com/FeatureLabs/compose](https://github.com/FeatureLabs/compose)
[2]
[https://github.com/FeatureLabs/featuretools](https://github.com/FeatureLabs/featuretools)

~~~
eggie5
can you do multi-label w/ Compose? Snorkel only supports single-label.

~~~
jeff-hernandez
Yes, you can represent the labeling function as a class and use its methods to
represent each label individually.

------
abrazensunset
Anyone here with practical experience using [Flying
Squid]([https://github.com/HazyResearch/flyingsquid](https://github.com/HazyResearch/flyingsquid))
over the open-source Snorkel library? I'm curious if this platform re-uses
some of that line of research or if it's not practical for some reason.

------
swyx
for those who want more on Snorkel, I recommend checking out Chris Re's prior
talks:

\-
[https://www.youtube.com/watch?v=yu15Nf5eJEE](https://www.youtube.com/watch?v=yu15Nf5eJEE)
(14 min)

\- [https://www.cs.ucla.edu/upcoming-events/cs-201-jon-postel-
di...](https://www.cs.ucla.edu/upcoming-events/cs-201-jon-postel-
distinguished-lecture-deepdive-and-snorkel-dark-data-systems-to-answer-
macroscopic-questions-christopher-re-stanford-university/) (1hr talk -
snorkel's predecessor was deepdive)

I went down the rabbit hole in this space about 3-4 years ago, and really got
the message around "Dark Data". he's onto something huge and I regret not
pursuing it further due to self doubt. hedge funds should be eating this up as
well.

------
ipsum2
Does anyone use Snorkel ([https://github.com/snorkel-
team/snorkel](https://github.com/snorkel-team/snorkel))? From what I can tell,
it seems like a research product. Not sure if any company uses them in a
production environment.

~~~
gas9S9zw3P9c
There are some case studies on the website [0], but I would be wary of them.
My experience from working in these kinds of research environments is that the
case studies are consulting jobs where a few PhD students do work for a
company, and in return you can use the company name and write on your website
whatever you want. They are probably not actually using the product. I'd be
interested to hear about actual users and their experience.

[0] [https://www.snorkel.ai/case-studies](https://www.snorkel.ai/case-studies)

------
ajratner
Hi all, this is Alex from the Snorkel team- thanks for all the great comments!
Excited to respond to a few questions directly, but first highlighting some up
here:

\- _Where to find more about the core Snorkel concepts:_ We've published 36+
peer-reviewed papers, along with blog posts, talks, office hours, etc over the
years (see
[https://www.snorkel.ai/technology](https://www.snorkel.ai/technology) and
[https://www.snorkel.ai/case-studies](https://www.snorkel.ai/case-studies)),
so I'll defer somewhat to those... but of course, academic papers can be
painful to read (even when you wrote them!), so happy to also answer questions
here.

\- _What Snorkel Flow is:_ Snorkel Flow is an end-to-end ML development
platform based around the core idea that training data is the most important
(and often ignored) part of ML systems today, and that you can label, build,
and manage it programmatically with the right supporting techniques. This is
based on our research at Stanford, where we spent several years exploring the
basic question: can we enable subject matter expert users to train ML models
with things like rules, heuristics, and other noisy sources of signal,
expressed as "labeling functions" and other types of programmatic ops (ex:
'label this document X if it overlaps with dictionary Y'), _instead_ of having
to hand-label training data? This type of input, often termed "weak
supervision", ends up requiring a lot of work to deal with as it is much
noisier than hand-labeled data (eg the labeling functions can be inaccurate,
differ in coverage and expertise, have tangled correlations, etc) but can be
very powerful if you model it right! And Snorkel Flow specifically is focused
on actually making the broader end-to-end process of building and managing ML
with programmatic training data usable in production, rather than just on
exploring the algorithmic and theoretical ideas as was the goal of our
research/OSS code over the years!

\- _Why train a model if you have a programmatic way to label the data:_ In
Snorkel, the basic idea is to label some portion of the data with labeling
functions (usually it's hard to label all of the data- hence the need for ML),
and then use ML to generalize beyond the LFs. In this sense Snorkel is an
attempt to bridge rules-based approaches (high precision but low recall) and
stats learning-based approaches (good at generalizing). This is also useful in
"cross-modal" cases where you can write LFs over one feature set not available
at inference time, but use them to train a model that does work on the
servable/inference time features (e.g. text to image is one recent example
[https://www.cell.com/patterns/fulltext/S2666-3899(20)30019-2](https://www.cell.com/patterns/fulltext/S2666-3899\(20\)30019-2)).
But, of course, we believe in an empirical process all the way, which is
another reason we like the Snorkel approach: if you can write a perfect set of
labeling functions, then great- you don't need a fancy ML model, stop there!

\- _Does Snorkel work??:_ As an ML systems researcher, I'm always a bit
perplexed by this question... the relevant questions for any system or
approach are usually 'When/where might it be expected to be useful, and what
are the relevant tradeoffs?' We've done our best to answer these questions
over the years with theory, empirical studies, etc (see links above), and of
course it's very case-specific. But one thing I'll note is that Snorkel is not
a push-button automagic approach that takes in garbage and produces gold. It's
our attempt to define a new input / development paradigm for ML--one which
we've shown can often be orders of magnitude more efficient--but like any
development process, it requires effort and infrastructure to use most
successfully! Which is a big part of why we've built Snorkel Flow- to support
and accelerate this new kind of ML development process.

\- _Who uses Snorkel?_ A few that have a published record: Google, Intel,
Microsoft, Grubhub, Chegg, IBM... and many others at very large and smaller
orgs that are not public

\- _What is going to happen with the OSS:_ The OSS project will remain up and
open under Apache 2.0, same as all of the other research work we've put out
over the years! See our community spectrum chat for more.

------
king_magic
Yeaaaah, I’m skeptical. Especially since they seem to dance around what
Snorkel actually does at pretty much every opportunity.

~~~
eggie5
Snorkel does weak supervision for you. It takes your unlabeled data and
user-defined labeling functions (LFs), maps the LFs over the data, and then
de-correlates everything to give you a dataset that you can use for
multi-class, single-label supervision.

It's very powerful.

------
edshiro
I presume this does not apply to computer vision datasets? Frankly I am still
confused at what exactly Snorkel does.

~~~
eggie5
you have a dataset of images and you write code (labeling functions LF) to
label the images. Snorkel handles the pipeline but more importantly corrects
the conflicts/correlations between the LFs. The output is a supervised dataset
w/ mutually exclusive labels a la softmax classification.

the labels are noisy, but you get a quantity that you could not get from
human annotators, AND at a faster/cheaper rate. They provide analysis arguing
that, for discriminative models, quantity CAN outweigh quality.

to your point, it's not typically used w/ the image-only modality. It's
mostly used where there is some metadata attached.
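A sketch of what "metadata attached" means in practice (hypothetical field names, pure Python): the LFs never look at pixels, only at data riding along with the image.

```python
# Hypothetical metadata-based LFs for an image dataset: vote from
# filenames, alt text, etc. rather than from pixels. (Illustrative names.)

ABSTAIN = -1

def lf_filename_cat(meta):
    """Vote "cat" (1) if the filename suggests it."""
    return 1 if "cat" in meta["filename"].lower() else ABSTAIN

def lf_alt_text_dog(meta):
    """Vote "not cat" (0) if the alt text mentions a dog."""
    return 0 if "dog" in meta.get("alt_text", "").lower() else ABSTAIN

meta = {"filename": "IMG_cat_001.jpg", "alt_text": "a sleeping pet"}
votes = [lf(meta) for lf in (lf_filename_cat, lf_alt_text_dog)]
# votes == [1, -1]; the label model then combines such votes across LFs.
```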

------
andrewmutz
I don't understand why this approach is valuable. If you can label your
examples using traditional software, why would you build a model? Why not just
use the label function in whatever context you intended the model to exist?

------
master_yoda_1
Looks like they're selling snake oil on the name of Stanford and AI :)

~~~
eggie5
Why do you think that? It's a very powerful and useful technique. You can get
supervised labels in a day (ostensibly for free) vs paying humans to do it and
waiting...

