Launch HN: Lightly (YC S21): Label only the data which improves your ML model
88 points by isusmelj 73 days ago | 25 comments
Hi HackerNews! We’re Matt and Igor from Lightly (https://www.lightly.ai/). Most companies that do machine learning at scale label only 1% of their data because it's too expensive to label all of it. We built Lightly to help companies pick the most valuable 1% to be labeled.

If you're wondering what data labeling looks like for images, think of those captchas that ask you to tag the images containing objects such as a bus or a person. When we were working on training machine learning (ML) models from scratch, we often had to do this labeling ourselves. But there was always far too much data for us to label all of it. We talked with more than 250 ML teams, ranging from small groups of 2-3 people to large teams at Apple and Google, and they all face the same problem: they have too much data to label.

Not only that, but there wouldn’t be a lot of value in labeling everything. For example, if you have billions of images, it's a waste of time to get humans to label every one of them, because most of those labels wouldn't add useful information to the model you’re hoping to train. Most of the images are probably similar enough to other images that have already been labeled and they have nothing new to tell your model. Spending more labeling effort on those would be a bit like labeling the same image over and over again—quite wasteful.

As soon as your ML model surpasses the initial prototype stage, you're most interested in the edge cases in your dataset: the ones that represent rare events. For example, a few days ago there was a Twitter thread about failure cases for Tesla vehicles. One Tesla mistook a yellow moon for a yellow traffic light: https://twitter.com/JordanTeslaTech/status/14184133078625853.... Another edge case is a truck full of traffic lights: https://twitter.com/haltakov/status/1400797882891091970. Finding and labeling such rare cases is key to a robust system that works in difficult situations.

Rather than labeling everything, a better approach is to first discard all the redundant images and keep only the ones worth spending time and money to label. Let's call those "interesting" images. If you could spend labeling effort only on the "interesting" images, you'd get the same value for a fraction of the cost.
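The core idea can be sketched in a few lines of numpy. This is purely illustrative, not Lightly's actual algorithm; the function name and the 0.95 similarity threshold are stand-ins, and the embeddings would come from some image model:

```python
import numpy as np

def select_interesting(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedily keep an image only if its cosine similarity to every
    image kept so far is below `threshold` (i.e. it is not redundant)."""
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Three near-duplicate embeddings and one distinct one: only two survive.
emb = np.array([[1.0, 0.0], [0.99, 0.01], [1.0, 0.001], [0.0, 1.0]])
print(select_interesting(emb))  # -> [0, 3]
```

The greedy pass keeps the first of each group of near-duplicates, so labeling effort is spent once per visual concept rather than once per image.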

Many ML companies at a more advanced stage have had to tackle this problem. One approach is to pay people to go through the images and discard the "boring" (nothing-new-to-tell-me) images, leaving the "interesting" (worth-spending-resources-to-label) ones. That can save you money if it's on average cheaper to answer the question "boring or interesting?" about an image than it is to label it. But this solution only scales if your human labeling workforce keeps growing: ML data roughly doubles every year on average, so your labeling capacity would need to double too.

Much better than that — the holy grail — would be for a computer to do the work of discarding the "boring" images. Compared to paying humans to do it, you'd get the "interesting" subset of your billion images almost for free. You would have much less work to do (or money to spend) on labeling, and you'd get just as good a model after training. You could split the savings with whoever knew how to make a computer do this for you, and you'd both come out ahead. That’s basically our intention with Lightly.

My co-founder Matt and I have worked on many machine learning projects ourselves, where we also had to manage tooling and annotation budgets. Dealing with data in a production environment is different from academia: in academia you get well-balanced, manually curated datasets; in production you don't, and that is, as some of you know, a huge pain. Solving it boils down to working with unlabeled data.

Luckily, in recent years a new subfield of deep learning has emerged called self-supervised learning: a technique for training models to understand data without any labels. In natural language processing (NLP), modern models like BERT and GPT all rely on it, and in computer vision we have had a similar breakthrough in the last year with models such as SimCLR and MoCo. Back in 2020, we started experimenting with self-supervised learning to better understand unlabeled data and improve our software. However, there was no easy-to-use framework for working with the latest models, so we built our own to make the power of self-supervised learning easily accessible. Since we want to foster research in this domain and grow a bigger community around the topic, we decided to open-source the framework in fall 2020 (https://github.com/lightly-ai/lightly). It is now used by universities and research labs all over the world. We realized that the ability to understand and visualize unlabeled data is also valuable to other ML teams and decided to offer our solution as a SaaS platform. The platform builds on the open-source framework and helps you work with the most valuable data.
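To give a flavor of what contrastive self-supervised learning optimizes, here is a compact numpy version of the NT-Xent loss used by SimCLR. This is an illustrative sketch, not the Lightly implementation, and a real training loop would compute it on GPU over augmented image batches:

```python
import numpy as np

def nt_xent(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """NT-Xent loss: z1[i] and z2[i] are embeddings of two augmented views
    of the same image; every other embedding in the batch is a negative."""
    z = np.concatenate([z1, z2])                      # 2N x D
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # dot products = cosine sims
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = len(z1)
    # The positive partner of row i is row i+n (and vice versa).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return float(-log_prob.mean())
```

When the two views of each image map to identical embeddings the loss is low, and it grows as positives drift apart relative to the negatives; training pushes the encoder toward the first situation, which is what produces useful embeddings without labels.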

Here are some examples where Lightly can help you:

Analyze the quality and diversity of your datasets. Our platform can also use metadata or labels if available. Uncover class distributions, dataset gaps, and representation biases before labeling to save time and money. You can do this manually or automatically through our data selection algorithms, which pick the most diverse subset of your dataset (https://docs.lightly.ai/).

Once you have a labeled dataset and a trained model, our active-learning algorithms let you iteratively select the next batch of data to add to your training set. Label only the best data for model training until you reach your target accuracy: https://docs.lightly.ai/getting_started/active_learning.html
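The simplest form of such an active-learning step is uncertainty sampling. The sketch below is a generic, illustrative version (Lightly's selection additionally accounts for diversity, per the docs):

```python
import numpy as np

def least_confident(probs: np.ndarray, n_label: int) -> np.ndarray:
    """Indices of the n_label samples whose top class probability is lowest."""
    return np.argsort(probs.max(axis=1))[:n_label]

# Model predictions over 3 classes for 4 unlabeled images.
probs = np.array([
    [0.98, 0.01, 0.01],  # confident -> skip
    [0.40, 0.35, 0.25],  # uncertain -> label
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],  # most uncertain -> label first
])
print(least_confident(probs, 2))  # -> [3 1]
```

Each labeling round then retrains the model on the enlarged set and re-scores the remaining pool.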

Check it out yourself with this quick demo video https://www.youtube.com/watch?v=38kwv0xEIz4

Lightly integrates directly into your pipeline via an API, is available on-prem, and can process up to 100M samples within hours.

We're excited we get to show Lightly to you all. Thank you for reading! Please let us know your thoughts and questions in the comments.

Having built a model to identify sensitive data, I can say a solid data labeling solution would be awesome. I can attest this is a real problem. Here's the library we built:


In this space, Prodigy really dominates:


We actually built our own internal system, which integrates with our pipeline and can export the labels (does predictive labeling, etc.). Of course, we've only focused on text data at the moment.

All that being said, this is going to become a crowded and highly competitive space. Plus, once the data is labeled, companies often drop their labelers. I would recommend ensuring some consistent use, potentially by hosting their models off-prem or something to lock companies in.

The library looks great!

Prodi.gy is great but focuses heavily on NLP and speeding up the labeling process. Our goal is really to help you reduce what you label before you use any labeling tool.

We are working with labeling tools as well as providers to streamline the workflow.

Doesn't Prodi.gy also claim to do active learning, which essentially reduces the instances to label too?

Haven't used Prodi.gy, so I don't know how its active learning algorithm works. Could you share the difference?

Most active learning frameworks just use model predictions to find the images where the model has the lowest confidence, ignoring the image diversity aspect. For example, suppose the model struggles with bicycles at night. The problem with this approach is that you might end up adding many new images to your labeling pipeline that are very similar to each other.

However, with Lightly you can additionally make sure you only select images that are visually different from each other. And you always get visual feedback on the selected data in our web platform. The additional control and feedback mechanisms allow for a more focused workflow.
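To make the contrast concrete, here is a toy numpy sketch that ranks by uncertainty but skips near-duplicates. It is purely illustrative (the function name, embeddings, and distance threshold are stand-ins, and Lightly's actual selection is more involved):

```python
import numpy as np

def uncertain_and_diverse(probs: np.ndarray, embeddings: np.ndarray,
                          n_label: int, min_dist: float = 0.5) -> list[int]:
    """Visit images from least to most confident, keeping one only if it is
    at least `min_dist` away (Euclidean) from every image already kept."""
    order = np.argsort(probs.max(axis=1))  # most uncertain first
    chosen: list[int] = []
    for i in order:
        if all(np.linalg.norm(embeddings[i] - embeddings[j]) >= min_dist for j in chosen):
            chosen.append(int(i))
        if len(chosen) == n_label:
            break
    return chosen

# Images 0 and 1 are uncertain near-duplicates (say, two frames of the same
# bicycle at night); only one of them makes the cut.
probs = np.array([[0.50, 0.50], [0.51, 0.49], [0.90, 0.10], [0.60, 0.40]])
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [3.0, 0.0]])
print(uncertain_and_diverse(probs, emb, n_label=2))  # -> [0, 3]
```

Plain uncertainty sampling would have picked images 0 and 1, spending two labels on essentially the same scene.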

How do models trained with Lightly compare with other approaches wrt adversarial robustness?

Can using Lightly introduce additional bias in the model, since only a select few inputs are being labeled? This may be a concern for publicity purposes.

By the way, I thought ETH spinoff requirements were incompatible with YC requirements - nice to see it can be made to work.

Thanks for the interest and great questions. Responses are below:

>How do models trained with Lightly compare with other approaches wrt adversarial robustness?

We have no benchmark available, but the two approaches can be combined: use Lightly to pick a diverse subset, label it, then check for adversarial robustness while training/evaluating the model, and re-iterate.

>Can using Lightly introduce additional bias in the model, since only a select few inputs are being labeled? This may be a concern for publicity purposes.

If we remove bias, we automatically introduce bias. But we want the introduced bias to be controlled and known.

Bias typically comes from the way we collect data. For example, in autonomous driving, more data is collected during the day than at night, more during sunny weather than during rain or snow, and more from cities like San Francisco than from places like New Mexico. Most of our datasets are biased.

> By the way, I thought ETH spinoff requirements were incompatible with YC requirements - nice to see it can be made to work.

From what we know, we are the first ETH spin-off to be part of the YC program. We hope they don't abandon us.

An obvious trick to speed up supervised learning is to label and import into the training set only the images for which the model makes wrong predictions. For most of the images, the human then only needs to approve the automatic predictions, and only occasionally label one by hand.

Are there any libraries to facilitate a workflow like this?

We are currently working to support exactly this workflow with Lightly. The biggest challenge is to quickly and reliably find the images with wrong predictions. To tackle this, Lightly can leverage the strong representations from contrastive learning.

For example: A simple workflow for a classification task would be to train a self-supervised model on the whole dataset and find samples with a different annotation than their nearest neighbors. These can be identified quickly either in a colored scatter plot or by simply measuring disagreement.
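A toy version of that nearest-neighbor disagreement check might look like this in numpy (illustrative only, not Lightly's implementation; a real dataset would need an approximate nearest-neighbor index instead of the full distance matrix):

```python
import numpy as np

def disagrees_with_neighbors(embeddings: np.ndarray, labels: np.ndarray,
                             k: int = 3) -> list[int]:
    """Flag samples whose label differs from the majority label of their
    k nearest neighbors in embedding space: likely wrong predictions
    or annotation errors."""
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    flagged = []
    for i in range(len(labels)):
        neighbors = np.argsort(dists[i])[:k]
        majority = np.bincount(labels[neighbors]).argmax()
        if labels[i] != majority:
            flagged.append(i)
    return flagged

# Two tight clusters labeled 0 and 1; sample 3 sits inside the "0" cluster
# but carries label 1, so it gets flagged for review.
emb = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5], [5.1, 5]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
print(disagrees_with_neighbors(emb, labels))  # -> [3]
```

Because self-supervised embeddings cluster visually similar images together, a label that disagrees with its neighborhood is a cheap, model-free signal that something is worth a second look.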

Any plans for TensorFlow support?

All the active learning features and interaction with the platform already work with different frameworks such as Keras, TensorFlow, or JAX. The self-supervised training part is currently only available for PyTorch. We haven't focused on bringing it to TensorFlow yet, but it's definitely something we should look at!

How does it differentiate from modAL? I think at a glance they try to achieve roughly the same thing: give human labelers only the datapoints most relevant to the problem at hand.


modAL indeed has a similar goal of choosing the best subset of data to be labeled. However, there are some notable differences:

modAL is built on scikit-learn, which is also evident from the suggested workflow. Lightly, on the other hand, was built specifically for deep learning applications, supporting active learning not only for classification but also for object detection and semantic segmentation.

modAL provides uncertainty-based active learning. However, it has been shown that uncertainty-based AL fails at batch-wise AL for vision datasets and CNNs (see https://arxiv.org/abs/1708.00489). Furthermore, it only works with an initially trained model and thus a labeled dataset. Lightly offers self-supervised learning to learn high-dimensional embeddings through its open-source package (https://github.com/lightly-ai/lightly), which can be used through our API to choose a diverse subset. Optionally, this sampling can be combined with uncertainty-based AL.
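For reference, the core-set approach from that paper boils down to k-center greedy selection over embeddings. A minimal numpy sketch (illustrative only, and seeded here from the dataset mean, which is one of several common choices):

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, n_select: int) -> list[int]:
    """Repeatedly pick the point farthest from everything selected so far,
    so the selection "covers" the dataset in embedding space."""
    # Seed with the point farthest from the dataset mean.
    first = int(np.linalg.norm(embeddings - embeddings.mean(axis=0), axis=1).argmax())
    selected = [first]
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    while len(selected) < n_select:
        nxt = int(min_dist.argmax())  # farthest from all selected points
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Two duplicate pairs plus one outlier: the three picks land in three
# different regions instead of on duplicates.
emb = np.array([[0, 0], [0.1, 0], [10, 0], [10.1, 0], [5, 8]], dtype=float)
print(k_center_greedy(emb, 3))  # -> [4, 3, 0]
```

Unlike uncertainty sampling, this needs no trained model at all, only embeddings, which is why it composes well with self-supervised learning on fully unlabeled data.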

Thanks for the reply. In my case I have (hobby) problems that fall well into scikit-learn capabilities.

Since I don't do machine learning I don't know whether your product is good or not, but I automatically appreciate you calling this "machine learning" and not "AI".

Thanks, we also try to avoid using AI as much as possible :)

In the video you posted you say "I don't want to work with blurry images". Isn't that an image a human could work with (drive), maybe by reducing the driving speed to 33-50% and having more time to inspect their surroundings?

Yes, depending on the kind of data you want to work with you might want to explicitly get the blurry ones. That depends on the task of the ML model you train and requires domain expertise.

If the dataset mostly contains edge cases, model performance on the dataset is going to be poor, but it's not an issue I think.

But how could the real-world accuracy be computed? Is a separate dataset needed for that purpose?

When learning ML at university, one assumes that the available data represents the environment well. We do the famous train/validation/test split and train our model.

However, in practice we see that it is very hard to collect a good dataset. There is a great Twitter thread from Abubakar (CEO of Gradio) about this topic: https://twitter.com/abidlabs/status/1423067498862219267

Thanks for your answer. I'm seeing a tweet, but not a thread. Is it expected?

Yes, sorry. I meant he started a good conversation with his tweet.

I'll admit I'm not an expert in this area, but can't this introduce a pernicious sort of bias on the downstream model, since the input data is being curated by your technology?

Whether you introduce or reduce bias using Lightly depends on how you use the software.

What we ideally want is that the data we collect for training, validating, and testing our model represents the environment the trained model will operate in. However, collecting this data can be super difficult. E.g. for autonomous driving systems, that would mean collecting data from every corner of the world, at every time of the day, during all kinds of weather conditions, in all 4 seasons...

In practice, you will very likely end up with more data collected from one city than another (e.g. because your fleet is much bigger in city A than in city B).

How does this compare to existing tools like Scale AI?

Scale is one of the biggest companies in the data labeling space. They recently introduced Scale Nucleus, which goes in a similar direction as Lightly. However, whereas Nucleus works well with already-labeled datasets, we designed Lightly from the beginning to focus on unlabeled data. With Lightly's combination of self-supervised learning and embedding visualization, you can easily work with datasets where you don't have model predictions or labels at hand. Note that Lightly is also available on-prem, which is a must for some of our customers, since there is usually 100x more unlabeled data than labeled data.
