
Ask HN: What Is the Point of Snorkel? - willj
I’ve read a bit about Snorkel, and I listened to an episode of Software Engineering Daily [1] this week, and this question has been bugging me. In Snorkel, you define labeling functions to provide labels to lots of data, rather than manually label it, and then use this data to train machine learning models. My question is, why even train a machine learning model if you have the functions to classify a dataset? You essentially have a decision tree, and it seems silly to train a model from scratch.
======
btown
While I haven't used the Snorkel library proper, I've used automated features
in machine learning models in a similar way before.

The key insight is: _labeling functions are assumed to be noisy and
inaccurate._

[https://www.snorkel.org/use-cases/01-spam-tutorial](https://www.snorkel.org/use-cases/01-spam-tutorial) is worth
rereading with this in mind (IMO, it buries the lede a bit on _why_ this
distinction matters). It's all about diminishing
returns: if you're writing a manual labeling function for production, you're
going to quickly get to something that may work 50-80% of the time, and you'd
spend a LOT more time on the edge cases. So let a machine learning model
figure out the edge cases for you!
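To make that concrete, here's a toy sketch in plain Python (not the real Snorkel API; the heuristics and label values are made up for illustration): each labeling function is a cheap, noisy heuristic that votes Spam, NotSpam, or abstains, and none of them is reliable on its own.

```python
# Toy labeling functions for a spam classifier. Convention (made up for
# this sketch): -1 = abstain, 0 = not spam, 1 = spam.
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    # Links correlate with spam, but plenty of legit messages have them.
    return SPAM if "http" in text else ABSTAIN

def lf_short_message(text):
    # Very short comments are usually fine -- a weak, noisy signal.
    return NOT_SPAM if len(text.split()) < 4 else ABSTAIN

def lf_check_out(text):
    # "check out" is classic spam phrasing, right maybe 80% of the time.
    return SPAM if "check out" in text.lower() else ABSTAIN

lfs = [lf_contains_link, lf_short_message, lf_check_out]

def label_matrix(texts):
    # One row per example, one column per labeling function.
    return [[lf(t) for lf in lfs] for t in texts]

votes = label_matrix([
    "check out http://spam.example",              # spam, but LF 2 disagrees
    "nice post",                                  # only the weak LF fires
    "great analysis, thanks for sharing the data" # everything abstains
])
```

Note the first example: two functions correctly vote spam while the short-message heuristic wrongly votes not-spam, and the third example gets no coverage at all. That disagreement and sparsity is exactly what the downstream model has to cope with.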

Snorkel ensures your labeling functions aren't taken as gospel; the training
process will do its best to follow their guidance, but it's willing to say
"this NotSpam label provided by the function was probably wrong in this case,
because the rest of the text in this example looks a LOT like all the other
messages the functions labeled as Spam, and boosting (literally or
figuratively) my conviction in my deep-text-analysis insight, which says that
horrible grammar is more spammy than not, will make me perform better on the
test set overall, even if I sacrifice this example."

And there are formalisms for treating these as labels with uncertainty
attached, and for feeding that information through the pipeline.
[https://ajratner.github.io/](https://ajratner.github.io/) has a lot of peer-
reviewed research on how this works, and
[https://link.springer.com/article/10.1007/s00778-019-00552-1...](https://link.springer.com/article/10.1007/s00778-019-00552-1?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst)
has a number of figures that may be illustrative as well.
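As a toy illustration of the uncertainty part (this is a simple log-odds vote with hand-assumed accuracies, not Snorkel's actual label model, which estimates each function's accuracy from agreement patterns without ground truth):

```python
import math

# -1 = abstain, 0 = not spam, 1 = spam (same toy convention as above).
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def soft_label(votes, accuracies):
    """Combine noisy votes into P(spam) via weighted log-odds voting.

    Each non-abstaining function contributes evidence weighted by how much
    better than chance it is assumed to be. The accuracies here are made
    up; a real label model would learn them.
    """
    score = 0.0
    for v, acc in zip(votes, accuracies):
        if v == ABSTAIN:
            continue
        weight = math.log(acc / (1 - acc))  # log-odds of being correct
        score += weight if v == SPAM else -weight
    return 1 / (1 + math.exp(-score))  # sigmoid -> probability

# Two fairly accurate spam votes outweigh one weak not-spam vote:
# the result is a soft label (probably spam, but not certainty),
# which is what you'd train the end model on.
p = soft_label([SPAM, NOT_SPAM, SPAM], accuracies=[0.8, 0.6, 0.8])
```

When every function abstains, the score stays at zero and the example gets an uninformative 0.5, which is the honest answer: the labeling functions simply didn't cover it.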

~~~
willj
That makes a whole lot of sense; I hadn't thought about the
"noisiness"/imperfection of the labels. That's a great point. Thanks!

