Snorkel AI: Putting Data First in ML Development (snorkel.ai)
81 points by polm23 on July 16, 2020 | 66 comments



They totally ignored the open source package on their site: https://github.com/snorkel-team/snorkel. Moreover, based on the commit history on GitHub and their README.md, my understanding is that they will stop supporting their open source repo. Best of luck to them.


If anyone is looking for an open source library in this space, I work on one called Compose (https://github.com/FeatureLabs/compose).

With Compose, a user defines a labeling function, and then Compose scans the historical data looking for training examples with which to train a machine learning model.

The library has evolved as we apply it to more and more real-world use cases, but it is based on the approach in this paper from 2016: https://dai.lids.mit.edu/wp-content/uploads/2016/08/07796929....


can you do multi-label w/ Compose? Snorkel only supports single-label.


Yes, you can represent the labeling function as a class and use its methods to represent each label individually.
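A minimal sketch of that pattern, assuming composeml's LabelMaker API as of 2020 (the dataframe, column names, labels, and window settings here are all made up for illustration, and parameter names may differ in newer releases):

    import composeml as cp
    import pandas as pd

    # Hypothetical transaction log; the columns are placeholders.
    df = pd.DataFrame({
        "customer_id": [1, 1, 2, 2],
        "transaction_time": pd.to_datetime(
            ["2020-01-01", "2020-01-15", "2020-01-03", "2020-01-20"]),
        "amount": [20.0, -5.0, 100.0, 40.0],
    })

    class CustomerLabels:
        """One labeling method per label of interest."""
        def total_spent(self, ds):
            return ds["amount"].sum()

        def made_return(self, ds):
            return bool((ds["amount"] < 0).any())

    labels = CustomerLabels()

    # One LabelMaker (and one resulting label-times table) per label.
    label_times = {}
    for name, fn in [("total_spent", labels.total_spent),
                     ("made_return", labels.made_return)]:
        lm = cp.LabelMaker(
            target_entity="customer_id",
            time_index="transaction_time",
            labeling_function=fn,
            window_size="30d",
        )
        label_times[name] = lm.search(df, num_examples_per_instance=-1)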



A few colleagues and I attended the _absolutely packed_ presentation the Snorkel folks did at ODSC in Boston last year and came away so convinced by this approach that we actually built our intern's summer project around using Snorkel for multi-label classification on complex, very domain-specific financial documents. The project was successful and both we and our intern were very happy.

The main lessons learned from this exercise helped us identify where our efforts would shift when using Snorkel. Of course there's never a free lunch, but Snorkel makes what I believe to be a very reasonable and effective trade-off. It provides a huge decrease in overall costs, but critically it shifts those costs towards the front of the development process. Writing a good set of labeling functions is a non-trivial piece of work. It requires the data scientist to have deep domain experience, or a few fairly large blocks of time collaborating with and learning from a business user who is already a domain expert. It has the upside, though, of forcing the data scientist to build a solid foundation of domain knowledge, which I feel is often underestimated in many ML projects.

Anyways, congrats to the team! Looking forward to checking out your future work.


Thanks for this kind note! Would love to chat sometime and hear about your findings working with Snorkel: pros, cons, feature requests, etc. Multi-label is on our list and under development (see comments below and elsewhere); would be interested to trade notes!


Can you educate me on how you do multi-label w/ Snorkel? As far as I can tell, that's one of its largest drawbacks.


Fair question. I didn't personally supervise our last intern (it was my turn the summer before), so I'm not as deeply familiar with it. Now that you bring this up, though, I think I may have misspoken. When I said multi-label, I think that was our goal originally, but because of the constraints of Snorkel you mentioned, we ended up reframing the problem into many single-class models instead. Both would work, but because of how our business users worked, multi-label wasn't super important. For example, not all business users are interested in every label, so I think what happened was that more than one model was trained, one for each label, and then they were ensembled based on the business users' interests. Our final output allowed users to effectively sort, filter, and search documents based on any combination of these labels. Keep in mind, too, that some of these labels are fairly abstract, so just one of them was fairly powerful by itself and could perhaps power an entire team in some cases. I hope that helps; I'm sorry I can't go into much more detail.


Yeah, you can do single-label w/ Snorkel, but not multi-label. Multi-label Snorkel would be the killer feature, because making the negatives (i.e., for a softmax) is very hard, especially when you work w/ user-interaction systems with an unknown negative distribution.


You can always do multi-label as a multi-task learning model (or just a set of binary models), which is something we (and many others) have explored before! A lot of the adjustments for mainline Snorkel have to do with (A) the semantics of the labeling functions (you need to be able to express that something is not class A, and/or have a general per-class prior) and (B) all the infra to support what is, at base, just a bunch of independent per-label binary tasks.
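For concreteness, a minimal sketch of the set-of-binary-models decomposition using the open-source library (the LFs, dataframe, and label names are placeholders, and this is not built-in multi-label support, just the per-label binary decomposition described above):

    from snorkel.labeling import PandasLFApplier
    from snorkel.labeling.model import LabelModel

    # Hypothetical inputs: lfs_per_label maps each label name to its own list
    # of binary labeling functions (1 = label applies, 0 = does not,
    # -1 = abstain); df_train is the unlabeled dataframe.
    def fit_per_label_models(df_train, lfs_per_label):
        models = {}
        for label, lfs in lfs_per_label.items():
            L = PandasLFApplier(lfs=lfs).apply(df=df_train)
            lm = LabelModel(cardinality=2)  # one independent binary task per label
            lm.fit(L_train=L, n_epochs=500, seed=0)
            models[label] = (lfs, lm)
        return models

    def predict_proba_multilabel(models, df):
        # For each label, P(label applies) per row; no mutual exclusion imposed.
        probs = {}
        for label, (lfs, lm) in models.items():
            L = PandasLFApplier(lfs=lfs).apply(df=df)
            probs[label] = lm.predict_proba(L=L)[:, 1]
        return probs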


Snorkel has a label mutual exclusion assumption right?

My core problem is a multi-label problem, but my Snorkel data, from the LabelModel, is inherently single-label (mutually exclusive). What is the prevailing recommendation for doing multi-label w/ Snorkel? Is the below what you are currently recommending?

For a given, k-wise multi-label problem:

1. Generate k binary datasets w/ the LabelModel
2. Train k separate binary classifiers, one for each dataset
3. At inference/prediction time, pass the input through the k classifiers and get scores

Is this what the current recommendation is? Create a set of binary classifiers?


What does it actually do? I know "label data" but anything can do that.

Is it just a pipeline system with some helpers for running a couple of ML related functions?

Is it UI based?

Where do you run it?

I know these areas well and got nearly nothing from reading the splash page on the site.


I also have problems understanding what exactly it does. I just briefly skimmed the paper, but it seems like the idea is as follows. Assuming you don't have ground truth labels for your data:

1. Generate many different noisy labels for your data by writing functions. These don't need to be correct, but they should make uncorrelated errors. They are basically domain knowledge you have of your data.

2. Snorkel takes the output of these functions and, based on their (dis)agreement, builds a generative probabilistic model to decorrelate your labels, which may have had some overlap in their errors.

3. You train your final discriminative model on the output of that probabilistic model.

So, the main idea is to create many noisy labels instead of relying on a single high-quality label and Snorkel does the hard work of figuring out how to smartly combine these labels so you can train on something clean.
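If my reading is right, the open-source library's workflow maps onto those three steps roughly like this (a minimal sketch with made-up labeling functions and data, assuming the 0.9.x snorkel API):

    import pandas as pd
    from snorkel.labeling import (labeling_function, PandasLFApplier,
                                  filter_unlabeled_dataframe)
    from snorkel.labeling.model import LabelModel

    SPAM, HAM, ABSTAIN = 1, 0, -1  # placeholder task

    # Step 1: noisy heuristics written as labeling functions
    @labeling_function()
    def lf_contains_link(x):
        return SPAM if "http" in x.text.lower() else ABSTAIN

    @labeling_function()
    def lf_very_short(x):
        return HAM if len(x.text.split()) < 5 else ABSTAIN

    df_train = pd.DataFrame({"text": [
        "check out http://example.com for a free prize",
        "thanks, see you tomorrow",
        "win money now http://example.org",
    ]})

    # Step 2: fit the generative label model on the matrix of LF outputs
    L_train = PandasLFApplier(lfs=[lf_contains_link, lf_very_short]).apply(df=df_train)
    label_model = LabelModel(cardinality=2)
    label_model.fit(L_train=L_train, n_epochs=500, seed=123)
    probs_train = label_model.predict_proba(L=L_train)

    # Step 3: train any discriminative model on the probabilistic labels,
    # dropping points that no LF covered
    df_covered, probs_covered = filter_unlabeled_dataframe(
        X=df_train, y=probs_train, L=L_train)
    # e.g. fit a sklearn classifier on features of df_covered against probs_covered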


Yup! The LFs can also actually be correlated- just not too correlated (roughly, think of needing at least three mostly uncorrelated cliques, to be precise).

Part of the high-level description, though, is that a lot of different parts and lines of work are integrated into Snorkel Flow beyond just this original programmatic labeling idea: programmatic operators for data augmentation, "slicing" or partitioning of data, and the overall end-to-end platform (UI + SDK) supporting iterative development of ML models via this paradigm of programmatic training data.


The description reminds me a bit of learning classifier systems.

Edit: and a bit of fuzzy rule systems. Which just goes to suggest that I am probably well out of my depth.


The former project is a Python package that labels your data using a weak supervision technique. It's not just a pipeline; it's a sophisticated algorithm that combines multiple competing labeling functions by reweighing them based on their correlations, rather than using a naive majority-voting scheme (rough sketch at the end of this comment).

When you look at ML models as commodities, and at the fact that you spend most of your time getting, cleaning, or labeling data, it leads to what they call Data Programming. I imagine this will be a UI where you can manage your dataset by monitoring something they call critical slices.
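To make the contrast with naive majority voting concrete, here is a minimal sketch assuming the 0.9.x open-source API (the toy label matrix is made up; in practice it comes from applying your labeling functions, with -1 meaning abstain):

    import numpy as np
    from snorkel.labeling.model import LabelModel, MajorityLabelVoter

    # Toy matrix of labeling-function outputs: rows = data points,
    # columns = labeling functions, -1 = abstain.
    L_train = np.array([
        [ 1,  1, -1],
        [ 0, -1,  0],
        [ 1,  0,  1],
        [-1,  1,  1],
    ])

    # Naive baseline: every labeling function gets one equal vote.
    preds_majority = MajorityLabelVoter(cardinality=2).predict(L=L_train)

    # Snorkel's generative label model: estimates LF accuracies and
    # correlations from their agreement/disagreement pattern and
    # reweights them accordingly.
    label_model = LabelModel(cardinality=2)
    label_model.fit(L_train=L_train, n_epochs=100, seed=123)
    preds_weighted = label_model.predict(L=L_train)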


Forgive me, but how does this avoid the chicken-and-egg problem here? Without digging through the promo copy: why would one programmatically label training data to do ML on, if they already have such a program to label the data?


That's a really good question. I took a class with one of the professors who started Snorkel.

The way he broke it down was that you can incorporate rules either into your data or into your model. Because we want our model to be as general-purpose as possible, it turns out you can squeeze out some extra performance by putting "bronze/copper" quality data, labeled with handwritten rules, into your dataset.

You can think of the model getting an extra boost from the latent knowledge within the rules.


Their paper explains it - https://arxiv.org/abs/1711.10160

Snorkel itself has been an open source package for a while - https://github.com/snorkel-team/snorkel

This new announcement is about Snorkel Flow


Labels are knowledge about data. If you already know some rules that work reasonably well based on your domain experience, then Snorkel lets you capture those as "labeling functions" that may not cover the whole ground or can be "noisy". Snorkel can then build a model to label your data accounting for the "noise". Combining that with some "gold" labels (done by humans), you can use the generated labels on a large data set to build a higher quality model that generalizes better. This is similar to how you can take several low quality models and by virtue of them having expertise over different parts of the data, build an "ensemble" model that performs better than any of them.

Imho, Snorkel-style tools ("weak supervision") are game changers for ML, though the biggies get all the press. So I'm excited to see this end-to-end direction taken by the team.


Hasn't this been done for years under names like synthetic data generation, simulation, etc.?


Not data generation; label generation. But the charitable interpretation of your question is valid: we've been doing this kind of ensembling to make higher-quality models for some time now. What's new, I feel, is that it's getting some good structure, practice, and tooling around it.


Yeah, then advertise it as a tool rather than AI. The problem is that Snorkel is trying to sell snake oil on the name of Stanford and AI. Under the hood it is just a data generation pipeline. Remember, you can't put labels on random data. So "not data generation, label generation" makes no sense to me and sounds like "brown sugar".


I saw a talk on Snorkel a few years back, so I don't remember perfectly, but it seemed to be an iterative process. It's a tool for you to build and refine simple rules. If you're labeling ingredients, a simple heuristic "<number> <units> <ingredient>" will get a lot of them, but there are tons of edge cases. With more heuristics, you might get lots of those, and so on. I think it was a tool to help you explore and iterate on those heuristic labeling functions quickly. Then you can label the stuff that's hard in a more expensive way or something. I thought of it as noisily hand-labeling sets of examples at a time rather than single examples at a time. This is all from memory of a random conference talk or paper from years ago, so take it with some big grains of salt. I do clearly remember thinking it seemed really cool at the time.
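As a toy version of that ingredient heuristic (just my guess at how such a rule might look as a Snorkel labeling function; the unit list, label names, and the x.text field are made up):

    import re
    from snorkel.labeling import labeling_function

    INGREDIENT, ABSTAIN = 1, -1

    # Rough "<number> <units> <ingredient>" pattern; the unit list is illustrative.
    UNITS = r"(tsp|tbsp|teaspoons?|tablespoons?|cups?|g|kg|ml|l|oz|lbs?)"
    PATTERN = re.compile(rf"^\s*\d+([./]\d+)?\s+{UNITS}\b\s+\w+", re.IGNORECASE)

    @labeling_function()
    def lf_number_unit_ingredient(x):
        # Catches the easy majority of lines; edge cases like "a pinch of salt"
        # or "salt to taste" fall through and abstain.
        return INGREDIENT if PATTERN.match(x.text) else ABSTAIN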


A human will label data according to hand rules or heuristics. What's the difference if a program labels data according to hand rules or heuristics?

The downstream discriminative model's goal is to generalize via supervision.


Anyone figure out how to do multi-label classification w/ Snorkel? It seems like its current formulation only supports single-label, i.e., softmax.

I find that in practice, especially w/ user-interaction systems, most problems are not single-label but multi-label. Also, in the single-label setting it's often necessary to define a negative "OTHER" class, which is very difficult to do w/ Snorkel in my experience.


Single label has been the applicable one for most of the applications we've tackled to date, but agreed that multi-label is also very important! More coming here soon...


Thanks, Alex. I'm sure you can relate: when you have an unbounded input distribution (like w/ user-interaction systems), defining that OTHER class w/ current Snorkel is difficult/impossible.


Yeah, definitely, and would love to chat sometime, as this is a space I've had less direct hands-on interaction with. There's a line of work in the ML literature on "positive-unlabeled (PU) learning" (basically, a setting where there are only positive labels or abstains) with a lot of theoretical ties to what our stuff rests on, and I think a tie-in here is interesting. Of course, most of these approaches rely on some (to varying degrees) hidden and very strong distributional assumptions... anyway, looking forward to a chat!


Thanks for the lead on PU Learning.

I signed up for a demo of the new platform; looking forward to chatting. A colleague from work and I spoke w/ Henry last year about a potential partnership, but I guess it got lost in the mix...


One of the things I'm really curious about is how Snorkel deals with poor labelling functions. More generally, labelling functions are another data source for the model and are just as susceptible to corruption and other real-world issues like completeness, bias, counterfactual issues, and repetition. Perhaps even more so, because these are manually constructed. For example, you can imagine that the person writing labelling functions writes effectively the same rule many times over. My understanding of the paper is that Snorkel would then weight this repeated labelling very heavily. I think weak supervision techniques (at least the ones under the Hazy Research umbrella) require a degree of skill in machine learning that is easy to underestimate if you just think about the problem as an issue of domain understanding (or writing labelling functions, in Snorkel terms).


My understanding is different. I think you are talking about correlated data sources. Snorkel's surprising innovation is that it does NOT overweight correlated data sources.


So, I am specifically talking about the scenario where all your labelling functions are highly correlated and there is little or no ground truth data to come up with empirical weights for each of the labelling functions. An example is the scenario where you have the label functions: x>5, x>4.99, x>5.01 for some feature x. I am really struggling to see how Snorkel can correct for the correlation, especially given the relatively simple generative model in section 2.2 of the paper. https://arxiv.org/pdf/1711.10160.pdf


The Snorkel paper doesn't cover this in depth; the math is all in this paper:

https://arxiv.org/abs/1703.00854

I can't say I followed all the proofs, but it seems that under certain limited assumptions about the labelling functions, they prove their generative model can do well.

Reading about Snorkel, it initially sounded like magic in the bad way, but this does make it clear that if your labelling functions are garbage or have certain kinds of problems, there's nothing Snorkel can do about it.

Even leaving aside the generative model I think the focus on function-based data bootstrapping is great, which is why I've been following Snorkel's projects for a while.


I spent some time trying Snorkel (the open source version) and its predecessor DeepDive.

It was extremely complicated to get it to do anything beyond the demos, and I was never able to get it to do anything useful.

I ended up implementing some of the ideas myself, but I can't say I had any great success.


That's too bad. My colleagues and I have had good success using the current Snorkel package.


I can see how this would work for tabular and text data, where the labeling functions are well-defined. I don't understand, at all, how this would work with computer vision tasks where heuristics are pretty much impossible to define.

That said, I don't see anything here that would prevent you from using a pre-trained conv net as a labeling function, but I expect that multiple conv nets trained on a small corpus of data would be biased and make correlated errors, which violate their assumptions.

This looks super powerful in some cases, but I'm just not seeing how it can possibly generalize to every ML problem.


First: Snorkel Flow absolutely does not generalize to every ML problem :). IMO defining where different systems and approaches do and don't work best is one of the most important and most challenging problems in ML systems research- as noted, we've worked to give detail on this for Snorkel over the years... no perfect answers, but some notes below:

- As you imply, a lot has to do with the available sources of input signal- whether labeling functions, or 'transformation functions' for data augmentation, or other ops we've worked on... the input is obviously key.

- For data modalities like image, video, etc: Often the most successful approach is to (A) rely on some pre-processed features or "primitives" and write labeling functions over these- as my co-founder Paroma in particular has published about over the years- and/or (B) use metadata

- External models are definitely expressible as labeling functions, and we've worked on exactly that problem of modeling (local) biases and correlations!


Does it work for semantic segmentation? That's really where I'm struggling to see how this could work.


More advanced structured prediction tasks are still definitely on the cutting edge- mainly IMO down to defining the semantics of the programmatic user input like labeling functions for these kinds of tasks. Some recent work (http://cs.brown.edu/people/sbach/files/safranchik-aaai20.pdf) has extended these semantics for sequence tagging, as an example- so some exciting moves in this direction!


At the ODSC presentation I went to last year, the team actually used a vision problem as their canonical example. It's hard to grasp without a concrete example, but it makes a lot of sense the way they explained it.

For example, let's assume you want to identify something like a lung tumor. You have many MRI images and they're all largely the same template of image. Using traditional image processing software like OpenCV, it's surprisingly easy to do more coarse-grained tasks programmatically, like, say, search this image for any circle that's brighter than the surrounding tissue and has a radius greater than, say, x.yz mm. If you find one, that function returns True, otherwise False. That x.yz mm number is what you get from the radiologists you work with to help you develop the labeling functions, and this is just _one_ of the labeling functions. But basically it turns out that if you construct a few of these functions with the help of domain experts and then use them all together with the information-theory research the Snorkel folks do, you get pretty damn good performance!
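Very roughly, and with me making up all the numbers and the exact OpenCV calls, the shape of such a labeling function might look like this (a sketch, not what they actually showed):

    import cv2
    import numpy as np

    ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

    MIN_RADIUS_MM = 3.0      # the "x.yz mm" threshold a radiologist would supply
    PIXELS_PER_MM = 2.0      # assumed image resolution
    BRIGHTNESS_FACTOR = 1.2  # "brighter than the surrounding tissue", crudely

    def lf_bright_circle(img_gray: np.ndarray) -> int:
        # POSITIVE if any detected circle of at least the minimum radius is
        # brighter than the image average; NEGATIVE if circles exist but none
        # qualify; abstain if no circles are found at all.
        circles = cv2.HoughCircles(
            cv2.medianBlur(img_gray, 5), cv2.HOUGH_GRADIENT,
            dp=1, minDist=20, param1=100, param2=30,
            minRadius=int(MIN_RADIUS_MM * PIXELS_PER_MM), maxRadius=0,
        )
        if circles is None:
            return ABSTAIN
        for x, y, r in np.round(circles[0]).astype(int):
            mask = np.zeros_like(img_gray)
            cv2.circle(mask, (int(x), int(y)), int(r), 255, thickness=-1)
            if img_gray[mask == 255].mean() > BRIGHTNESS_FACTOR * img_gray.mean():
                return POSITIVE
        return NEGATIVE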


Yeah, but for some tasks, like recognizing a car or something else, you'd have to write pretty sophisticated code to even get a reasonable result. It's far easier in those cases to use supervised learning and have a NN learn to do it for you.


Agreed! As noted in other answer, Snorkel certainly does not work for everything :) And indeed, in many cases it may be easier to express what you know extensionally (label examples) vs intensionally (write functions). A lot comes down to the unit cost per label over time- and whether it's more economical to label a bunch of data by hand vs. write LFs or similar.

That's btw why a lot of examples of ML today are ones where data is (i) simple for non-experts to label, (ii) non-private and therefore easy to outsource for labeling, and (iii) low rate of change (e.g. images for self-driving, basic NLP stuff for chat bots, etc)- this kind of data can be labeled cheaply and once, so hand-labeled training sets are (barely) economically feasible to build manually. However, most data is not that easy or cheap to label, needs to be relabeled constantly to adapt to change, and thus the investment in a programmatic approach is often far better even if certainly not push-button!


I have a question for the snorkel folks if they're lurking here. With noisy label functions for large amounts of data, I could easily see the cases where the label functions fail correlating with classes that are already having a hard time (disadvantaged/underrepresented/marginalized groups, etc). Has there been any work on using these tools while making sure to avoid dangerous biases? It seems like the kind of tool that could amplify problems while still being really useful for an average case.


Lurking for a few more min... great question! Dealing with both class imbalance and issues of pernicious biases in both underlying data distributions and training labels is an extremely important topic. Our underlying theory deals with local biases (e.g. individual labeling functions or sources of training signal being biased) but systemic biases (e.g. the user driving the system being biased) are certainly tougher.

One important and practical answer that we've found: with an approach like in Snorkel Flow, you can inspect the source of the training data and correct it if biased- which you just can't do with e.g. a million hand labeled training data points. So in practice this is a big advantage we've found.

On the theory / research side, this is definitely an area we want to pursue further!


Thanks for the answer! The local vs systemic bias thing is particularly interesting for a system like snorkel. I have a clarifying question.

I'm imagining a dumb example like recipes, where "1 tsp salt" is a common format for an ingredient. I'd imagine that the majority of ingredients follow that format, so it'd be a natural function to write. I'd also imagine that there's a correlation between following that format and being a recipe with a European background.

Generalize that a little bit, and almost by definition the simplest N rules that get the most coverage will cover the majority cases best. Being outside the majority cases is probably correlated with most "human issues," defined however you want. Since this is an artifact of what the simplest N rules cover, I'm not clear whether it'd be classified as local or systemic in the sense you've worked on.

I'm curious whether this falls under the theory you've worked on already or the theory you're talking about pursuing in the future. If it's something you've worked on already, I'd be very interested in reading what you have.


Great question! Let me rephrase so you can confirm my understanding: I have some labeling functions (LFs) that are far more accurate on a majority subset of the data than on one or more minority subsets or "slices" of the data... and these subsets are not necessarily correlated with the class labels, so this isn't a traditional class imbalance problem...

We've actually done some recent work on this (https://papers.nips.cc/paper/9137-slice-based-learning-a-pro...) where we have users define these critical "slices" approximately so that the model being trained can pay special attention to them (extra representation layers) so they don't get drowned out by the majority subsets/slices. But definitely a lot more to do in this area!


Cool idea, and thanks for the answer! I'll have to look more closely at the paper :)


As another option, Compose is an ML tool for labeling data. It structures the labeling process and integrates easily with Featuretools, which automates feature engineering. Definitely worth checking out!

[1] https://github.com/FeatureLabs/compose
[2] https://github.com/FeatureLabs/featuretools


can you do multi-label w/ Compose? Snorkel only supports single-label.


Yes, you can represent the labeling function as a class and use its methods to represent each label individually.


Anyone here with practical experience using [Flying Squid](https://github.com/HazyResearch/flyingsquid) over the open-source Snorkel library? I'm curious if this platform re-uses some of that line of research or if it's not practical for some reason.


for those who want more on Snorkel, I recommend checking out Chris Re's prior talks:

- https://www.youtube.com/watch?v=yu15Nf5eJEE (14 min)

- https://www.cs.ucla.edu/upcoming-events/cs-201-jon-postel-di... (1hr talk - snorkel's predecessor was deepdive)

I went down the rabbit hole in this space about 3-4 years ago, and really got the message around "Dark Data". He's onto something huge, and I regret not pursuing it further due to self-doubt. Hedge funds should be eating this up as well.


Does anyone use Snorkel (https://github.com/snorkel-team/snorkel)? From what I can tell, it seems like a research product. Not sure if any company uses them in a production environment.


There are some case studies on the website [0], but I would be wary of them. My experience from working in these kinds of research environments is that the case studies are consulting jobs where a few PhD students do work for a company, and in return you can use the company name and write on your website whatever you want. They are probably not actually using the product. I'd be interested to hear about actual users and their experience.

[0] https://www.snorkel.ai/case-studies


I can say Grubhub and Chegg use it.


Hi all, this is Alex from the Snorkel team- thanks for all the great comments! Excited to respond to a few questions directly, but first highlighting some up here:

- Where to find more about the core Snorkel concepts: We've published 36+ peer-reviewed papers, along with blog posts, talks, office hours, etc over the years (see https://www.snorkel.ai/technology and https://www.snorkel.ai/case-studies), so I'll defer somewhat to those... but of course, academic papers can be painful to read (even when you wrote them!), so happy to also answer questions here.

- What Snorkel Flow is: Snorkel Flow is an end-to-end ML development platform based around the core idea that training data is the most important (and often ignored) part of ML systems today, and that you can label, build, and manage it programmatically with the right supporting techniques. This is based on our research at Stanford, where we spent several years exploring the basic question: can we enable subject matter expert users to train ML models with things like rules, heuristics, and other noisy sources of signal, expressed as "labeling functions" and other types of programmatic ops (ex: 'label this document X if it overlaps with dictionary Y'), instead of having to hand-label training data. This type of input, often termed "weak supervision", ends up requiring a lot of work to deal with as it is much noisier than hand-labeled data (eg the labeling functions can be inaccurate, differ in coverage and expertise, have tangled correlations, etc) but can be very powerful if you model it right! And Snorkel Flow specifically is focused on actually making the broader end-to-end process of building and managing ML with programmatic training data usable in production, rather than just on exploring the algorithmic and theoretical ideas as was the goal of our research/OSS code over the years!

- Why train a model if you have a programmatic way to label the data: In Snorkel, the basic idea is to label some portion of the data with labeling functions (usually it's hard to label all of the data- hence the need for ML), and then use ML to generalize beyond the LFs. In this sense Snorkel is an attempt to bridge rules-based approaches (high precision but low recall) and stats learning-based approaches (good at generalizing). This is also useful in "cross-modal" cases where you can write LFs over one feature set not available at inference time, but use them to train a model that does work on the servable/inference time features (e.g. text to image is one recent example https://www.cell.com/patterns/fulltext/S2666-3899(20)30019-2). But, of course, we believe in an empirical process all the way, which is another reason we like the Snorkel approach: if you can write a perfect set of labeling functions, then great- you don't need a fancy ML model, stop there!

- Does Snorkel work??: As an ML systems researcher, I'm always a bit perplexed by this question... the relevant questions for any system or approach are usually 'When/where might it be expected to be useful, and what are the relevant tradeoffs?' We've done our best to answer these questions over the years with theory, empirical studies, etc (see links above), and of course it's very case-specific. But one thing I'll note is that Snorkel is not a push-button automagic approach that takes in garbage and produces gold. It's our attempt to define a new input / development paradigm for ML--one which we've shown can often be orders of magnitude more efficient--but like any development process, it requires effort and infrastructure to use most successfully! Which is a big part of why we've built Snorkel Flow- to support and accelerate this new kind of ML development process.

- Who uses Snorkel? A few that have a published record: Google, Intel, Microsoft, Grubhub, Chegg, IBM... and many others at very large and smaller orgs that are not public

- What is going to happen with the OSS: The OSS project will remain up and open under Apache 2.0, same as all of the other research work we've put out over the years! See our community spectrum chat for more.


Yeaaaah, I’m skeptical. Especially since they seem to dance around what Snorkel actually does at pretty much every opportunity.


Snorkel does weak supervision for you. It takes your unlabeled data and your user-defined labeling functions (LFs), maps the LFs over the data, and then decorrelates everything to give you a dataset that you can use for multi-class, single-label supervision.

It's very powerful.


I presume this does not apply to computer vision datasets? Frankly I am still confused at what exactly Snorkel does.


You have a dataset of images and you write code (labeling functions, LFs) to label the images. Snorkel handles the pipeline but, more importantly, corrects for the conflicts/correlations between the LFs. The output is a supervised dataset w/ mutually exclusive labels, a la softmax classification.

The labels are noisy, but you get a quantity that you could not get from humans, AND at a faster/cheaper rate. They provide analysis arguing that, for discriminative models, quantity CAN outweigh quality.

To your point, it's not typically used w/ the image-only modality. It's mostly used where there is some metadata attached.


I don't understand why this approach is valuable. If you can label your examples using traditional software, why would you build a model? Why not just use the labeling function in whatever context you intended the model to exist in?


Looks like they're selling snake oil on the name of Stanford and AI :)


Why do you think that? It's a very powerful and useful technique. You can get supervised labels in a day (ostensibly for free) vs paying humans to do it and waiting...


Take a look at case studies: https://www.snorkel.ai/case-studies



