Hi all, this is Alex from the Snorkel team. Thanks for all the great comments! Excited to respond to a few questions directly, but first I'll highlight some common ones up here:
- Where to find more about the core Snorkel concepts: We've published 36+ peer-reviewed papers, along with blog posts, talks, office hours, etc. over the years (see https://www.snorkel.ai/technology and https://www.snorkel.ai/case-studies), so I'll defer somewhat to those... but of course, academic papers can be painful to read (even when you wrote them!), so I'm happy to also answer questions here.
- What Snorkel Flow is: Snorkel Flow is an end-to-end ML development platform built around the core idea that training data is the most important (and often most neglected) part of ML systems today, and that you can label, build, and manage it programmatically with the right supporting techniques. This is based on our research at Stanford, where we spent several years exploring a basic question: can we enable subject matter experts to train ML models with rules, heuristics, and other noisy sources of signal, expressed as "labeling functions" and other types of programmatic ops (e.g. 'label this document X if it overlaps with dictionary Y'), instead of having to hand-label training data? This type of input, often termed "weak supervision", requires a lot of work to deal with because it is much noisier than hand-labeled data (e.g. the labeling functions can be inaccurate, differ in coverage and expertise, have tangled correlations, etc.), but it can be very powerful if you model it right! And Snorkel Flow specifically is focused on making the broader end-to-end process of building and managing ML with programmatic training data usable in production, rather than just exploring the algorithmic and theoretical ideas, as was the goal of our research/OSS code over the years.
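To make the labeling-function idea concrete, here's a toy sketch in plain Python. This is not the actual Snorkel API; the dictionary, label names, and example text are all made up for illustration. Each LF votes on a label or abstains, and the noisy votes are then aggregated, here with a naive majority vote, whereas Snorkel's label model actually learns LF accuracies and correlations from the vote patterns:

```python
# Toy sketch of "labeling functions": small programmatic rules that each
# vote on a label or abstain. Plain Python for illustration only; this is
# NOT the real Snorkel API, and the dictionary/data are hypothetical.
from collections import Counter

ABSTAIN, SPAM, HAM = -1, 1, 0

SPAM_WORDS = {"free", "winner", "prize"}  # hypothetical dictionary Y

def lf_dictionary(doc):
    # 'label this document SPAM if it overlaps with dictionary Y'
    return SPAM if SPAM_WORDS & set(doc.lower().split()) else ABSTAIN

def lf_short(doc):
    # Another (noisy) heuristic: very short messages tend to be ham
    return HAM if len(doc.split()) < 4 else ABSTAIN

def majority_vote(doc, lfs):
    # Simplest possible aggregation of the noisy LF votes; Snorkel's
    # label model instead estimates LF accuracies and correlations.
    votes = [v for v in (lf(doc) for lf in lfs) if v != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

lfs = [lf_dictionary, lf_short]
print(majority_vote("claim your free prize now", lfs))  # dictionary LF votes SPAM
```

The point of the aggregation step is exactly the "modeling it right" mentioned above: real LFs disagree, overlap, and vary in quality, so a learned label model beats naive voting in practice.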
- Why train a model if you have a programmatic way to label the data: In Snorkel, the basic idea is to label some portion of the data with labeling functions (usually it's hard to label all of the data, hence the need for ML), and then use ML to generalize beyond the LFs. In this sense, Snorkel is an attempt to bridge rules-based approaches (high precision but low recall) and statistical learning-based approaches (good at generalizing). This is also useful in "cross-modal" cases where you write LFs over one feature set that isn't available at inference time, but use them to train a model that works on the servable, inference-time features (text-to-image is one recent example: https://www.cell.com/patterns/fulltext/S2666-3899(20)30019-2). But, of course, we believe in an empirical process all the way, which is another reason we like the Snorkel approach: if you can write a perfect set of labeling functions, then great, you don't need a fancy ML model; stop there!
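The "generalize beyond the LFs" step can also be sketched in a few lines of plain Python. Again this is a made-up toy, not Snorkel itself: the dictionaries, documents, and the trivial word-score "model" are all hypothetical, chosen only to show that LFs cover a subset of the data (high precision, low recall) while a model trained on that subset can label examples where every LF abstains:

```python
# Toy illustration of LFs -> training set -> generalizing model.
# Everything here (dictionaries, docs, the word-score "model") is
# hypothetical and for illustration only; it is not the Snorkel API.
from collections import Counter

ABSTAIN, POS, NEG = -1, 1, 0
POS_WORDS = {"great", "excellent"}   # high-precision, low-recall rules
NEG_WORDS = {"terrible", "awful"}

def lf_pos(doc):
    return POS if POS_WORDS & set(doc.split()) else ABSTAIN

def lf_neg(doc):
    return NEG if NEG_WORDS & set(doc.split()) else ABSTAIN

def label(doc):
    votes = [v for v in (lf_pos(doc), lf_neg(doc)) if v != ABSTAIN]
    return votes[0] if votes else ABSTAIN

train = ["great friendly staff", "excellent tasty food",
         "terrible slow service", "awful rude waiter"]

# Step 1: LFs label only the subset of docs they cover.
labeled = [(d, label(d)) for d in train if label(d) != ABSTAIN]

# Step 2: "train" a trivial model: signed per-word counts by class.
word_scores = Counter()
for doc, y in labeled:
    for w in doc.split():
        word_scores[w] += 1 if y == POS else -1

def model(doc):
    # The model weighs ALL words it saw in training, so it can label
    # docs containing no dictionary word, i.e. docs the LFs abstain on.
    return POS if sum(word_scores[w] for w in doc.split()) > 0 else NEG

print(model("friendly staff and tasty food"))  # LFs abstain; model generalizes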
- Does Snorkel work??: As an ML systems researcher, I'm always a bit perplexed by this question... the relevant questions for any system or approach are usually 'When and where might it be expected to be useful, and what are the relevant tradeoffs?' We've done our best to answer these questions over the years with theory, empirical studies, etc. (see the links above), and of course it's very case-specific. But one thing I'll note is that Snorkel is not a push-button, automagic approach that takes in garbage and produces gold. It's our attempt to define a new input/development paradigm for ML, one which we've shown can often be orders of magnitude more efficient, but like any development process, it requires effort and infrastructure to use most successfully! That's a big part of why we've built Snorkel Flow: to support and accelerate this new kind of ML development process.
- Who uses Snorkel?: A few users with a published record: Google, Intel, Microsoft, Grubhub, Chegg, IBM... and many others, at both very large and smaller orgs, that are not public.
- What is going to happen with the OSS: The OSS project will remain up and open under Apache 2.0, same as all of the other research work we've put out over the years! See our community Spectrum chat for more.