
Launch HN: Aquarium (YC S20) – Improve Your ML Dataset Quality - pgao
Hi everyone! I’m Peter from Aquarium (<a href="https:&#x2F;&#x2F;www.aquariumlearning.com&#x2F;" rel="nofollow">https:&#x2F;&#x2F;www.aquariumlearning.com&#x2F;</a>). We help deep learning developers find problems in their datasets and models, then help fix them by smartly curating their datasets. We want to build the same high-power tooling for data curation that sophisticated ML companies like Cruise, Waymo, and Tesla have and bring it to the masses.<p>ML models are defined by a combination of code and the data that the code trains on. A programmer must think hard about what behavior they want from their model, assemble a dataset of labeled examples of what they want their model to do, and then train their model on that dataset. As they encounter errors in production, they must collect and label data for the model to train on to fix these errors, and verify they&#x27;re fixed by monitoring the model’s performance on a test set with previous failure cases. See Andrej Karpathy’s Software 2.0 article (<a href="https:&#x2F;&#x2F;medium.com&#x2F;@karpathy&#x2F;software-2-0-a64152b37c35" rel="nofollow">https:&#x2F;&#x2F;medium.com&#x2F;@karpathy&#x2F;software-2-0-a64152b37c35</a>) for a great description of this workflow.<p>My cofounder Quinn and I were early engineers at Cruise Automation (YC W14), where we built the perception stack + ML infrastructure for self driving cars. Quinn was tech lead of the ML infrastructure team and I was tech lead for the Perception team. We frequently ran into problems with our dataset that we needed to fix, and we found that most model improvement came from improvement to a dataset’s variety and quality. Basically, ML models are only as good as the datasets they’re trained on.<p>ML datasets need variety so the model can train on the types of data that it will see in production environments. In one case, a safety driver noticed that our car was not detecting green construction cones. Why? 
When we looked into our dataset, it turned out that almost all of the cones we had labeled were orange. Our model had not seen many examples of green cones at training time, so it was performing quite badly on this object in production. We found and labeled more green cones into our training dataset, retrained the model, and it detected green cones just fine.<p>ML datasets need clean and consistent data so the model does not learn the wrong behavior. In another case, we retrained our model on a new batch of data that came from our labelers and it was performing much worse on detecting “slow signs” in our test dataset. After days of careful investigation, we realized it was due to a change to our labeling process that caused our labelers to label many “speed limit signs” as “slow signs,” which was confusing the model and causing it to perform badly on detecting “slow signs.” We fixed our labeling process, did an additional QA pass over our dataset to fix the bad labels, retrained our model on the clean data, and the problems went away.<p>While there’s a lot of tooling out there to debug and improve code, there’s not a lot of tooling to debug and improve datasets. As a result, it’s extremely painful to identify issues with variety and quality and appropriately modify datasets to fix them. ML engineers often encounter scenarios like:<p>Your model’s accuracy measured on the test set is at 80%. You abstractly understand that the model is failing on the remaining 20% and you have no idea why.<p>Your model does great on your test set but performs disastrously when you deploy it to production and you have no idea why.<p>You retrain your model on some new data that came in, it’s worse, and you have no idea why.<p>ML teams want to understand what’s in their datasets, find problems in their dataset and model performance, and then edit &#x2F; sample data to fix these problems. Most teams end up building their own one-off tooling in-house that isn’t very good. 
This tooling typically relies on naive methods of data curation that are really manual and involve “eyeballing” many examples in your dataset to discover labeling errors &#x2F; failure patterns. This works well for small datasets but starts to fail as your dataset size grows above a few thousand examples.<p>Aquarium’s technology relies on letting your trained ML model do the work of guiding what parts of the dataset to pay attention to. Users can get started by submitting their labels and corresponding model predictions through our API. Then Aquarium lets users drill into their model performance - for example, visualize all examples where we confused a labeled car for a pedestrian from this date range - so users can understand the different failure modes of a model. Aquarium also finds examples where your model has the highest loss &#x2F; disagreement with your labeled dataset, which tends to surface many labeling errors (ie, the model is right and the label is wrong!).<p>Users can also provide their model&#x27;s embeddings for each entry, which are an anonymized representation of what their model “thought” about the data. The neural network embeddings for a datapoint (generated by either our users’ neural networks or by our stable of pretrained nets) encode the input data into a relatively short vector of floats. We can then identify outliers and group together examples in a dataset by analyzing the distances between these embeddings. We also provide a nice thousand-foot-view visualization of embeddings that allows users to zoom into interesting parts of their dataset. (<a href="https:&#x2F;&#x2F;youtu.be&#x2F;DHABgXXe-Fs?t=139" rel="nofollow">https:&#x2F;&#x2F;youtu.be&#x2F;DHABgXXe-Fs?t=139</a>)<p>Since embeddings can be extracted from most neural networks, this makes our platform very general. 
We have successfully analyzed datasets + models operating on images, 3D point clouds from depth sensors, and audio.<p>After finding problems, Aquarium helps users solve them by editing or adding data. After finding bad data, Aquarium integrates into our users’ labeling platforms to automatically correct labeling errors. After finding patterns of model failures, Aquarium samples similar examples from users’ unlabeled datasets (green cones) and sends those to labeling.<p>Think about this as a platform for interactive learning. By focusing on the most “important” areas of the dataset that the model is consistently getting wrong, we increase the leverage of ML teams to sift through massive datasets and decide on the proper corrective action to improve their model performance.<p>Our goal is to build tools to reduce or eliminate the need for ML engineers to handhold the process of improving model performance through data curation - basically, Andrej Karpathy’s Operation Vacation concept (<a href="https:&#x2F;&#x2F;youtu.be&#x2F;g2R2T631x7k?t=820" rel="nofollow">https:&#x2F;&#x2F;youtu.be&#x2F;g2R2T631x7k?t=820</a>) as a service.<p>If any of those experiences speak to you, we’d love to hear your thoughts and feedback. We’ll be here to answer any questions you might have!
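To make the "highest loss / disagreement" idea concrete, here's a toy numpy sketch (illustrative only, not our actual implementation): rank every example by the model's loss on its assigned label, and likely label errors float to the top.

```python
import numpy as np

def rank_by_disagreement(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """probs: (n, k) softmax outputs; labels: (n,) integer class ids.
    Returns example indices sorted from most to least suspicious."""
    eps = 1e-12
    # Loss on the *labeled* class: high when the model is confident the label is wrong.
    losses = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return np.argsort(-losses)

# Toy example: example 1 is labeled class 0, but the model strongly says class 1.
probs = np.array([[0.9, 0.1], [0.05, 0.95], [0.6, 0.4]])
labels = np.array([0, 0, 0])
order = rank_by_disagreement(probs, labels)
# order[0] == 1: the example the model disagrees with most - a likely bad label.
```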
======
tmshapland
I'm an Aquarium user. There are two ways Aquarium provides value to my
company. First, we improved our model performance. Second, curating my
dataset took less time and fewer clicks.

Regarding model performance, I used Aquarium to improve the AUC for my model
by 18 percentage points (i.e., comparing the AUC for the first model trained
on my new dataset to the AUC for my production model).

Regarding dataset curation efficiency, I spent much less time curating my
dataset using Aquarium than I would have spent using our own in-house tooling.
For example, the embedding-based point cloud allowed me to identify lots of
images with an issue at once, rather than image by image, click by click.

This thread has been mostly focused on improving model performance (i.e., my
first point), but Aquarium is also valuable for improving dataset curation labor
efficiency (i.e., my second point). For the business owner, dataset curation
labor efficiency means less money wasted on having some of your most expensive
employees, ML data scientists, clicking around and writing ad-hoc scripts. For
the ML practitioner, dataset curation labor efficiency means fewer clicks and
less wear on your carpal tunnels.

The founders, Peter and Quinn, didn't ask me to write this. I chose to write
it because it's a great product that I think can help a lot of businesses and
people.

~~~
__sy__
To second your comment, I think non-ML folks don't understand how much of an
impact dataset curation can have on model performance. More high-quality data
will outshine clever network architectures with less data. I've seen it again
and again. But the thing is, it's so hard to really curate your data once the
dataset has a lot of "dimensionality" to it (sorry, couldn't think of a better
word...). To be honest, if I were to pick an area of dev-tooling I'm most
excited about over the next 5 years, this area is probably it.

~~~
__sy__
Btw, for anyone interested, here's a good/quick talk by Andrej Karpathy on what
it will take to build the next software stack.
[https://www.youtube.com/watch?v=y57wwucbXR8&t=3s](https://www.youtube.com/watch?v=y57wwucbXR8&t=3s)

------
ishcheklein
Hey! DVC maintainer and co-founder here. First of all, congrats, and let me
know if we can help you or if you have some collaboration in mind! A few
questions: what does the workflow look like? Do you expect users to upload all
their data to your service? How can the data then be consumed from the platform?

~~~
pgao
Thanks!

We don't expect users to upload all data to our service - the type of data
we're interested in is "metadata": URLs to the raw data, labels, inferences,
embeddings, and any additional attributes for their dataset. Users can POST
this to our API and we'll ingest it that way.

If users don't provide their own embeddings, we need access to the raw data so
we can run our pretrained models on the data to generate embeddings.

However, if users do provide their own embeddings, we would never need access
to the raw data - Aquarium operates on embeddings, so the raw data URLs would
be purely for visualization within the UI. This is really nice because it
means that we can access-restrict the URLs so only customers can visualize the
raw data (via URL signing endpoints, authorizing only IP addresses within
customer VPNs, Okta integration), while Aquarium operates on relatively
anonymized embeddings and metadata.
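To make that concrete, here's a hypothetical metadata record (illustrative only - this is not our actual API schema, and the URL and field names are made up); note that only the URL travels, never the raw data itself:

```python
import json

# Hypothetical shape of one dataset entry sent to an ingest API.
record = {
    "data_url": "https://customer-bucket.example.com/frames/000123.png",  # access-restricted
    "label": {"class": "cone", "bbox": [120, 200, 40, 60]},
    "inference": {"class": "cone", "confidence": 0.87},
    "embedding": [0.12, -0.53, 0.98, 0.04],   # truncated for illustration
    "attributes": {"time_of_day": "dusk"},
}
payload = json.dumps(record)   # what a client would POST
```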

------
stev3
Thanks for all the hard work and congrats on your launch!

I will definitely check this service out for a side project I'm working on
that combines basketball and AI
([https://www.myshotcount.com/](https://www.myshotcount.com/))

~~~
pgao
Yup, I saw your form submission through our site! I reached out to you over
email, I'm confident we can help out :)

------
masio12
I think this is a great idea because, as you mentioned, dataset quality can
make or break your model. However, this doesn't address the elephant in the
room: no matter how much you curate or clean the data, you are limited to the
dataset that you have. The big question is how you can get more and better
datasets. I think tooling is super important, but the big difference will be
how to collect/generate/capture reliable, defensible datasets going forward. I
think your idea is complementary to this other project:
[https://delegate.dev](https://delegate.dev)

~~~
pgao
I absolutely, 120% agree on the importance of adding the right data. Aquarium
helps you with: "what data should I be collecting to improve my model" and
"where do I find that data?"

For the latter, Aquarium treats the problem of smart data sampling as a search
and retrieval problem. You want to find more examples of a "target" from a
large stream of unlabeled data. Aquarium does this by comparing embeddings of
the unlabeled data to your "target set" and then sending examples to labeling
if they're within a defined distance threshold in embedding space. We don't
actually do the labeling, but we wrap around common labeling providers and can
integrate into in-house flows with our API.
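A toy numpy sketch of that sampling loop (illustrative only, not our production code): keep any unlabeled example whose embedding falls within a distance threshold of some example in the target set.

```python
import numpy as np

def sample_near_targets(unlabeled: np.ndarray, targets: np.ndarray,
                        threshold: float) -> np.ndarray:
    """unlabeled: (n, d) embeddings of the unlabeled stream; targets: (m, d)
    embeddings of the pattern you want more of. Returns indices to send to labeling."""
    # Pairwise Euclidean distances, shape (n, m).
    dists = np.linalg.norm(unlabeled[:, None, :] - targets[None, :, :], axis=-1)
    # An example qualifies if it is close to *any* target.
    return np.flatnonzero(dists.min(axis=1) < threshold)

targets = np.array([[1.0, 0.0], [0.9, 0.1]])               # e.g. known green cones
unlabeled = np.array([[1.0, 0.05], [0.0, 1.0], [0.8, 0.2]])
idx = sample_near_targets(unlabeled, targets, threshold=0.3)
# idx -> [0, 2]: the two points near the target cluster get sent to labeling.
```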

~~~
quinnhj
Other founder here! For a high level overview of this framing of the problem,
I recommend reading this Waymo blog post [1].

One nice feature is that by using embeddings produced by a user's model, which
has been trained in the context of their domain, we can do this sort of smart
sampling in domains we've never seen before. Embeddings are also naturally
anonymized, so we can do this without access to a user's potentially private
raw data streams.

[1] [https://blog.waymo.com/2020/02/content-search.html](https://blog.waymo.com/2020/02/content-search.html)

------
TuringNYC
Dear @pgao thank you for the long intro with references and explanations. I
went to your website and noticed the "getting started" is a contact form.
Curious -- are you making a product to do this, or is it more
consulting/advisory? I'm currently creating some fun datasets for public usage
and I'd love to be a test rat for your software.

~~~
pgao
Hey there, it's a product right now! Our goal is to make it self serve, but
we're currently onboarding people one-by-one manually until we can streamline
the onboarding flow and build out a self serve process. Feel free to DM me or
fill out the form and I can send you our public demo!

------
hughpeters
Thanks for sharing @pgao! This tool looks really valuable.

> Since embeddings can be extracted from most neural networks, this makes our
> platform very general. We have successfully analyzed dataset + models
> operating on images, 3D point clouds from depth sensors, and audio.

Are there any types of datasets/models that this tool would not work well with
that you're aware of?

~~~
pgao
Thanks a bunch!

I think the biggest issue with this approach is the requirement for
embeddings. It's sometimes hard for a customer to understand which layer to
pull out of their net to send to us, so sometimes we just use a pretrained net
to generate embeddings. One net for audio, one net for imagery, one net for
point clouds, etc.
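For intuition on "which layer to pull," here's a toy numpy MLP where the embedding is just the last hidden layer's activation - the layer right before the classification head. The weights are random stand-ins, not a real trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(128, 32)), np.zeros(32)   # hidden layer
W2, b2 = rng.normal(size=(32, 10)), np.zeros(10)    # classification head

def forward(x: np.ndarray, return_embedding: bool = False):
    hidden = np.maximum(x @ W1 + b1, 0.0)   # ReLU hidden activations
    if return_embedding:
        return hidden                        # the "embedding" to export
    return hidden @ W2 + b2                  # class logits

x = rng.normal(size=(1, 128))
emb = forward(x, return_embedding=True)      # shape (1, 32)
```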

I'd say that it's harder for this tool to work with structured/tabular data
for a few reasons.

One, most structured datasets are domain-specific, so it's not easy to pull a
pretrained model off the shelf to generate embeddings - typically we would
need a customer to give us the embeddings from their own model in these cases.

Two, neural nets actually aren't the best for certain structured data tasks.
Tree-based techniques often get better performance on simpler tasks, which
means there's no obvious embedding to pull from the model.

Three, an alternate interpretation is that a feature vector input for
structured data tasks is already an embedding! When the input data is low
dimensional, you can do anomaly detection and clustering just by histogramming
and other basic population statistics on your data, so it's a lot easier than
dealing with unstructured data like imagery.

So I wouldn't say that our tooling wouldn't work for structured data, but more
that in those types of cases, maybe there's something simpler that works just
as well.
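For example, a simple z-score pass over the feature columns already surfaces outliers in low-dimensional tabular data - no learned embedding needed (toy numpy sketch):

```python
import numpy as np

def zscore_outliers(X: np.ndarray, cutoff: float = 3.0) -> np.ndarray:
    """X: (n, d) feature matrix. Flags rows where any column lies more
    than `cutoff` standard deviations from that column's mean."""
    z = np.abs((X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12))
    return np.flatnonzero((z > cutoff).any(axis=1))

rng = np.random.default_rng(1)
X = np.vstack([rng.uniform(-1.0, 1.0, size=(200, 3)),   # inliers
               [[0.0, 0.0, 50.0]]])                     # one injected outlier
outliers = zscore_outliers(X)   # -> [200], the injected row
```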

------
fractionalhare
If I understand correctly, it sounds like your platform is primarily intended
for improving awareness and understanding of the data a team has, so they know
which features to focus on and emphasize.

Do you think you'll get into synthetic data generation as well? In other
words, improving dataset quality additively, not just curatively.

~~~
pgao
Yes, your interpretation is correct. I don't think we're going to get into
synthetic data generation in the near term, mainly due to the amount of effort
required + questions about domain transfer. However, we do improve dataset
quality additively by sampling the best data to label + retrain on to get the
best performance.

Said another way: once you've found "I do badly on green cones," we use
similarity search on the embeddings of known green cone examples to find more
instances of green cones in the wild. We pick the right examples from streams
of unlabeled data, then send them to labeling + add them to your dataset so it
does better the next time you retrain.

~~~
mlthoughts2018
I like this much better than synthetic data augmentation actually. I think
synthetic augmentation, like with GANs is actually a failed concept.

There have long been theoretical limits on how much you can gain by
ensembling with a model of known limitations, and that is all that synthetic
training data is at root.

You can’t “make up” training data that lets you escape the performance
ceiling implied by whatever generator process you use for the synthetic
data, any more than you can learn a better regression just by
bootstrapping a large sample of data from your existing training set.

Algorithmic synthetic data is a big type of fool’s gold.

------
jononor
I have tested the tool a little bit for audio, and I see potential here. It's
especially useful for anyone who has a relatively large amount of unlabeled
data and wants to be efficient about which samples to spend labeling resources
on.

~~~
pgao
Thanks for the shoutout! We got connected to jononor through our previous
r/machinelearning launch:
[https://www.reddit.com/r/MachineLearning/comments/hjbl4h/p_l...](https://www.reddit.com/r/MachineLearning/comments/hjbl4h/p_launching_aquarium_understand_and_improve_ml/)

