
Cleaning algorithm finds 20% of errors in major image recognition datasets - groar
https://deepomatic.com/en/how-we-improved-computer-vision-metrics-by-more-than-5-percent-only-by-cleaning-labelling-errors/
======
CydeWeys
Why aren't these data sets editable instead of static? Treat them like a
collaborative wiki or something (OpenStreetMap being the closest fit) and
allow everyone to submit improvements so that all may benefit.

I hope the people in this article had a way to contribute back their
improvements, and did so.

~~~
6gvONxR4sf7o
The datasets serve as benchmarks. You get an idea for a new model that solves
a problem current models have. Most of these ideas don't pan out, so you need
empirical evidence that yours works. To show that your model does better than
previous models, you need some task that your model and the previous models
can share for both training and evaluation. It's more complicated than that,
but that's the gist.

It would be so wasteful to have to retrain a dozen models, each requiring a
month of GPU time, just so they can serve as baselines for your new model...

~~~
barkingcat
That's not wasteful. That's correction.

Is it wasteful to throw away a batch of food when 20% of it has been found to
contain the wrong substance, one that ends up causing disease?

Isn't it even more wasteful to continue using unedited and unverified
datasets just because all the previous models were trained on them, to the
point that we can no longer advance the state of the research? It's a case of
garbage in, garbage out.

~~~
6gvONxR4sf7o
> By one estimate, the training time for AlphaGo cost $35 million [0]

How about XLNet, which cost something like $30k-60k to train [1]? GPT-2 is
estimated to have cost around the same [2], while thankfully BERT only costs
about $7k [3], unless of course you do any new hyperparameter tuning on those
models, which you will of course do on your own. Who cares about
apples-to-apples comparisons?

We're not talking about spending an extra couple hours and a little money on
updated replication. We're talking about an immediate overhead of tens to
hundreds of thousands of dollars per new paper.

Tasks are updated over time already to take issues into account, but not
continuously as far as I know.

[0] [https://www.wired.com/story/deepminds-losses-future-artifici...](https://www.wired.com/story/deepminds-losses-future-artificial-intelligence/)

[1] [https://twitter.com/jekbradbury/status/1143397614093651969](https://twitter.com/jekbradbury/status/1143397614093651969)

[2] [https://news.ycombinator.com/item?id=19402666](https://news.ycombinator.com/item?id=19402666)

[3] [https://syncedreview.com/2019/06/27/the-staggering-cost-of-t...](https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/)

~~~
visarga
BERT is trained on unlabelled data. It's not the same kind of model the
article talks about.

------
rathel
Nothing is said, however, about _how_ the errors are detected. Can an ML
expert chime in?

~~~
thibaut-duguet
I'm a Product Manager at Deepomatic and I have been leading the study in
question here. To detect the errors, we trained a model (with a different
neural network architecture from the 6 listed in the post), and we then ran a
matching algorithm that highlights all bounding boxes that were either
annotated but not predicted (False Negative) or predicted but not annotated
(False Positive). Those potential errors are also sorted by an error score so
that the most obvious ones come first. Happy to answer any other questions
you may have!
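
To make that concrete, here is a toy sketch of one way such a matching step
could work. The greedy matching, the 0.5 IoU threshold, and using prediction
confidence as the error score are simplifying assumptions, not a description
of the exact production algorithm:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def flag_potential_errors(annotations, predictions, iou_thresh=0.5):
    """Greedily match annotated boxes to predicted boxes of the same label.
    Unmatched annotations -> candidate false negatives (the model missed it,
    or the annotation is spurious); unmatched predictions -> candidate false
    positives (the model hallucinated, or the annotation is missing)."""
    matched_ann, matched_pred = set(), set()
    for i, ann in enumerate(annotations):
        for j, pred in enumerate(predictions):
            if j in matched_pred:
                continue
            if (ann["label"] == pred["label"]
                    and iou(ann["box"], pred["box"]) >= iou_thresh):
                matched_ann.add(i)
                matched_pred.add(j)
                break
    false_negatives = [a for i, a in enumerate(annotations)
                       if i not in matched_ann]
    # Sort by model confidence so the most blatant candidates surface first.
    false_positives = sorted(
        (p for j, p in enumerate(predictions) if j not in matched_pred),
        key=lambda p: p["score"],
        reverse=True,
    )
    return false_negatives, false_positives
```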

~~~
liquidify
I'm curious whether you could find errors by comparing the results from the
different models. Places where the models disagree with each other most often
would be the areas I would want to target for error checking.
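
A crude, hypothetical way to do that ranking, assuming each model returns a
list of detections per frame (a fuller version would match boxes across
models with IoU rather than just counting them):

```python
def disagreement_score(per_model_detections):
    """Proxy for inter-model disagreement on one frame: the spread in the
    number of detections each model produced."""
    counts = [len(dets) for dets in per_model_detections]
    return max(counts) - min(counts)

def frames_to_review(frames, top_k=50):
    """frames: {frame_id: [detections_from_model_1, ..., from_model_n]}.
    Return the frames the models disagree on most, for manual checking."""
    ranked = sorted(frames, key=lambda fid: disagreement_score(frames[fid]),
                    reverse=True)
    return ranked[:top_k]
```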

~~~
thaumasiotes
> Places where models disagree with each other more often would be areas that
> I would want to target for error checking.

This is a great idea if your goal is to maximize the rate at which things you
look at turn out to be errors. (On at least one side.)

But it's guaranteed to miss cases where every model makes the same
inexplicable-to-the-human-eye mistake, and those cases would appear to be
especially interesting.

------
kent17
20% annotation error is huge, especially since those datasets (COCO, VOC)
are used for basically every benchmark and piece of state-of-the-art
research.

~~~
rndgermandude
And people wonder why I am still a bit skeptical of self-driving cars....

~~~
s1t5
In one of his fastai videos, Jeremy Howard makes the point that wrong labels
can act as regularization and that you shouldn't worry too much about them.
I'm a bit skeptical as to how far you can push this, but you certainly don't
need _perfect_ labelling.
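
For what it's worth, a deliberate version of this idea is label smoothing,
where you soften the one-hot targets on purpose. A small numpy sketch:

```python
import numpy as np

def smoothed_targets(labels, num_classes, eps=0.1):
    """Keep 1 - eps on the true class and spread eps over the others,
    i.e. inject a controlled dose of 'wrong label' mass as a regularizer."""
    t = np.full((len(labels), num_classes), eps / (num_classes - 1))
    t[np.arange(len(labels)), labels] = 1.0 - eps
    return t

# smoothed_targets([2], num_classes=4) -> [[0.0333, 0.0333, 0.9, 0.0333]]
```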

~~~
groar
That is true up to a certain point (for instance, in my experience, having
bounding boxes that are not pixel-perfect acts as a regularizer), but there
is also a good chance that you are mislabelling edge cases, i.e. situations
that happen rarely, and that definitely hurts the neural network's ability to
make correct predictions in these difficult, uncommon scenarios.

------
magicalhippo
> Create an account on the Deepomatic platform with the voucher code “SPOT
> ERRORS” to visualize the detected errors.

Nice ad.

~~~
thibaut-duguet
Our platform is actually designed for enterprise companies, so unfortunately
we don't provide open access.

~~~
scribu
I signed up and still couldn't see the errors.

I just see 3 datasets with generic annotations.

~~~
thibaut-duguet
The process is actually a bit complicated, but let me explain. Once you are
on a dataset, click on the label you want and use the slider at the top right
corner of the page to switch modes (we call it smart detection). You should
then be able to access three tabs; the errors are listed in the False
Positive and False Negative tabs (I've added a screenshot to the blog post so
that you can make sure you're in the right place). Let me know if you have
any problems, thanks!

~~~
scribu
Thanks, I can see them now.

------
fwip
The title here seems wrong. Suggested change:

"Cleaning algorithm finds 20% of errors in major image recognition datasets"
-> "Cleaning algorithm finds errors in 20% of annotations in major image
recognitions."

We don't know if the found errors represent 20%, 90% or 2% of the total errors
in the dataset.

~~~
groar
Yes, agreed! Unfortunately I can't change the title.

------
kent17
> We then used the error spotting tool on the Deepomatic platform to detect
> errors and to correct them.

I'm wondering whether those errors are selected based on how much they
impact performance?

Anyway, this is probably a much better way of gaining accuracy on the cheap
than launching 100+ models for hyperparameter tuning.

------
frenchie4111
As best I can tell, they are using the ML model to detect the errors. Isn't
this a bit of an ouroboros? The model will naturally get better, because you
are only correcting cases where it was right but the label was wrong.

That's not necessarily evidence of a better model, just of a better test set.

~~~
groar
If I understand correctly they actually did not change the test set.

~~~
frenchie4111
Ah, I guess I missed that

------
benibela
These things are why I stopped doing computer vision after my master's thesis

------
jontro
Weird behaviour on pinch-to-zoom (MacBook). It scrolls instead of zooming,
and when swiping back nothing happens.

Another example of why you should never mess with the defaults unless strictly
necessary.

------
groar
Using simple techniques, they found that popular open source datasets like
VOC or COCO contain up to 20% annotation errors. By manually correcting those
errors, they got an average error reduction of 5% for state-of-the-art
computer vision models.

~~~
jessermeyer
Garbage in, garbage out.

------
m0zg
An idea on how this could work: repeatedly re-split the dataset (so that all
of it is covered), re-train a detector on each split, and at the end of each
training cycle surface the validation frames with the highest computed loss
(or some other metric more directly derived from bounding boxes, such as the
number of high-confidence "false" positives, which could be instances of
under-labeling). That's what I do on noisy, non-academic datasets, anyway.
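
A rough sketch of that loop, with `train_fn` and `loss_fn` as stand-ins for
whatever training routine and per-frame error metric you use:

```python
import random

def surface_suspect_frames(dataset, train_fn, loss_fn, n_splits=5, top_k=100):
    """K-fold style sweep: every frame is held out exactly once, so the whole
    dataset gets scored by a detector that never trained on it. Frames with
    the highest per-frame metric (loss, or e.g. a count of high-confidence
    unmatched detections) are the best candidates for label review."""
    frames = list(dataset)
    random.shuffle(frames)
    folds = [frames[i::n_splits] for i in range(n_splits)]
    suspects = []
    for i, val_fold in enumerate(folds):
        train_split = [f for j, fold in enumerate(folds) if j != i
                       for f in fold]
        model = train_fn(train_split)
        suspects.extend((loss_fn(model, frame), frame) for frame in val_fold)
    suspects.sort(key=lambda t: t[0], reverse=True)
    return [frame for _, frame in suspects[:top_k]]  # hand these to a reviewer
```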

