
Thoughts On Machine Learning Accuracy - deegles
https://aws.amazon.com/blogs/aws/thoughts-on-machine-learning-accuracy/
======
komali2
I'm disappointed in the ACLU for misinterpreting the results, not least
because I usually have a lot of faith in their competency, and they
absolutely should have seen this response coming from Amazon.

What'd they think Amazon was going to do, roll over and be like "turns out our
facial recognition software is racist, whoops!"? Now instead of a meaningful
dialog, we've got a line in the sand where on one side, the ACLU, champion of
the people's rights, doesn't understand technology, and on the other side,
facial recognition software is Bad and Evil. Shit's so polarizing these days it
seems there's no room for negotiation, as much as I'd like there to be.

~~~
confounded
> _I'm not only disappointed in the ACLU for misinterpreting the results..._

Did they? This blog post describes some implementation choices which could
make their false positive rate lower.

It reveals no information about how Rekognition is being deployed by LEAs, and
there’s no meaningful regulation or oversight about how that happens.

We can’t tell what the assumptions and implementation choices of local police
departments are, because that’s top secret information. Most people acting on
the predictions have no idea what they are, let alone the implications.

It’s fine to ding the ACLU study on methods; you can, _because they
published some_.

No-one would argue that a toy model based on members of Congress is
representative of the public at large. But it’s more representative than a
press release from a company that are trying to normalize and monetize mass
surveillance via facial recognition.

~~~
l9k
The original ACLU title is: _Amazon’s Face Recognition Falsely Matched 28
Members of Congress With Mugshots_. It was spread on the news with the same
conclusion.

It puts the blame directly on Amazon and its technology, and doesn't mention
any configuration they used.

~~~
confounded
Why would a title mention API/service configuration? This is not common
practice, even in academic applied ML papers.

They used the default settings, and a training set which seems extremely
likely to be used in ‘LEA-production’.

Any user, including LEAs, is free to use whichever configuration options they
like.

The criticism with the most statistical implications is the confidence level
used. The ACLU were clear that they used the defaults. I can take at face
value that moving from 80% to 99% confidence on a sample of 500 faces could
produce 0 false positives. However, on the faces of the tens/hundreds of
thousands of people that might move through the center of town on any given
day, the implications for causing ordinary people serious harm are large.
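
A back-of-the-envelope sketch of that scaling, with purely hypothetical
numbers (the 0.1% per-search false positive rate and the 50,000 faces/day are
assumptions for illustration, not measurements):

    # Hypothetical illustration, not Rekognition data.
    fpr = 1 / 1000        # assume a 0.1% per-search false positive rate at 99%
    daily_faces = 50_000  # people passing cameras in a city center per day

    print(f"expected false matches per day: {fpr * daily_faces:.0f}")  # -> 50

Even a rate that rounds to "0 out of 535" in a one-off test can flag dozens
of innocent people every day at city scale.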

I am personally shocked that Amazon are willing to “recommend” law enforcement
actions at any level of confidence, from an unintrospectable machine learning
system.

~~~
cptskippy
The title is akin to saying "We bought <gun maker>'s hunting rifle, pointed it
at Congress, pulled the trigger, and it shot 28 members of Congress". It's a
tool; you're responsible for how it is used, not the manufacturer.

I would imagine the confidence level would be tweaked by Law Enforcement or
anyone else based on the results. If you're monitoring cameras for missing
persons or shooting suspects and getting 0 hits, you might lower it because
it's better to dismiss false positives than the alternative. Conversely, you
might increase it if you're overwhelmed by the number of false positives.

~~~
d4l3k
Since you're comparing facial recognition to guns, and guns are regulated by
law and have strict rules for use, perhaps facial recognition should be as
well?

I think there should be a discussion about laws/regulations about facial
recognition use. Even if Amazon removes their service there will always be
companies that don't care about the ethics as much. Strict rules, rather than
public shaming, seem like the better approach.

~~~
RhodesianHunter
If you screw up with a gun someone dies.

If you screw up with facial recognition a human double-checks and nothing
happens.

Scissors are dangerous but we don't regulate those either.

~~~
syshum
ummm no...

First off, it is not a given that if you "screw up with a gun someone dies";
that is not the only possible result.

Further, it is laughable to believe that with facial recognition a "human
double-checks." We have seen time and time again that new technology entering
the realm of criminal justice can be used to wrongfully convict people at an
alarming rate; see the hair and bite-mark analysis controversies, or the
number of labs that have been compromised in recent years.

The fact is that the "humans" who are supposed to be "double-checking" do not
do so in reality. If the computer spits out "the computer says it is person
X," that is what the jury will hear, and the jury will not question the
technology. The person will end up in prison until someone like the Innocence
Project comes along to invalidate either the technology as a whole or the
application of the technology.

------
alexandercrohde
For those who are emotionally invested in defending Amazon on this, could you
speak a little bit to what your horse is in this race?

I have stock in Amazon, but also caution about law enforcement (and a desire
for technology to increase accountability of government to citizens rather
than vice-versa). For me, the fear of going even one tiny step closer to China
outweighs everything else.

For those who are arguing on behalf of Amazon, what's your emotional calculus
here?

~~~
cwalv
Not defending anybody, but I think the point in the article is sound. If my
child were missing, I would want every technical resource possible to be
available to find her. Having "big brother" be able to identify me quickly in
a crowd doesn't seem that bad on balance ... if I'm in public, I expect to be
viewed.

~~~
jazzyjackson
I think maybe you've never been pulled over by an officer because you 'fit
the description of a suspect'.

You, as a bystander, can be caused great inconvenience (where the line of
injustice gets crossed is up for debate) through someone else's expectation of
safety.

~~~
cwalv
You're right, I haven't (I wonder what percent of the population has?). I find
it hard to believe that law enforcement using facial recognition would really
make that problem much worse than it already is though.

~~~
deelowe
China is already experimenting with pre-empting crime by attempting to monitor
their vast network of cameras using ML systems.

------
zhobbs
A single accuracy rate is not that useful when looking at ML problems. There
is always a tradeoff between false positives and false negatives depending on
the threshold selected. Amazon is guilty of this in this article, claiming a
0% false positive rate when the threshold is set to 99%. However, the article
doesn't say how many faces were successfully matched; it's easy to have a 0%
false positive rate if you just say "no match" for every photo.

The best practice is to show a ROC curve, which shows the tradeoffs of
selecting a given threshold:
[https://en.wikipedia.org/wiki/Receiver_operating_characteris...](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
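
A minimal sketch of that tradeoff using synthetic scores (the labels and
score distribution below are entirely made up; only the shape of the tradeoff
is the point):

    import numpy as np
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, 1000)  # 1 = genuine match, 0 = non-match
    # Genuine matches score higher on average than non-matches.
    scores = np.clip(0.3 * y_true + rng.normal(0.5, 0.2, 1000), 0, 1)

    fpr, tpr, thresholds = roc_curve(y_true, scores)
    for t in (0.80, 0.99):
        i = np.argmin(np.abs(thresholds - t))
        print(f"threshold={t:.2f}  FPR={fpr[i]:.3f}  TPR={tpr[i]:.3f}")

Raising the threshold drives the false positive rate toward zero, but the
true positive rate (matches actually found) falls with it, which is exactly
what a lone "0% false positives" figure hides.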

~~~
smittywerben
After Pearl Harbor, some wanted to get rid of early-warning radar systems. The
military created the ROC to screen for operators who might misidentify enemy
planes. The radar did fine; the interpretation was wrong.

~~~
joshgel
Interesting, do you have a source? I'm interested in misjudgment and cognitive
errors, so I'd like to read about this.

~~~
smittywerben
It's mostly Wikipedia - the history is a bit scattered. Probably better
sources out there.

[0]
[https://en.wikipedia.org/wiki/Receiver_operating_characteris...](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#History)

[1]
[https://en.wikipedia.org/wiki/SCR-270](https://en.wikipedia.org/wiki/SCR-270)

[2]
[https://archive.org/stream/Vol1PlansAndEarlyOperations#page/...](https://archive.org/stream/Vol1PlansAndEarlyOperations#page/n335/)

------
sheeshkebab
Amazon, let’s be honest - nobody reads documentation about how to set a knob
to get a good result: not police, the ACLU, or most developers working for
them. They’ll use 80% confidence happily and “sort out things later”.

So cut it out with your justifications about ovens and other nonsense, and
acknowledge that you are just a corporation releasing a product to make a
buck and/or avoid getting left behind by some competitor.

~~~
natalyarostova
The idea that the best solution to the future challenges of ML is to shame
companies into not selling ML products is so inane. Pandora's box was opened
the moment the computer was built, and really, the first time a human used a
club to bludgeon another human.

~~~
sgt101
Well, it depends what you are selling. If aircraft manufacturers had stuck to
selling hydrogen-filled, aluminium-painted airships, I think that air travel
would not be the thing it is today. Amazon's technology is not fit for this
purpose; an ethical and responsible stance would be not to sell it for this
purpose.

------
simonhughes22
I love this quote at the end of the Google-cached result: "we should not
throw away the oven because the temperature could be set wrong and burn the
pizza."

~~~
bvc35
When the oven will burn you if used incorrectly, and Amazon gives police
departments (who don't know how ovens work) the controls over the oven
regardless of their provable technical ability and with no oversight, we
should definitely throw away the oven.

~~~
joshgel
Maybe yes, maybe no. But realistically, with the hype out there about AI
solving _all_ problems, do we think it's not going to get used?

It seems like a more productive conversation could be had over how to use it
than over blanket dismissals. The proverbial 400-lb man in his basement can
code this up relatively easily with existing tools, to near state-of-the-art
results. So the guy that works for the FBI will too. Let's come up with
appropriate, targeted safety mechanisms so the oven doesn't light the whole
building on fire.

------
srinivasan
> The ACLU has not published its data set, methodology, or results in detail

This is my biggest gripe with how the ACLU has conducted this. I find it hard
to distinguish their "test" from clickbait.

~~~
confounded
Ironically, neither did Dr Matt Wood, the author of this AWS post.

He attempts to refute one unverifiable Rekognition configuration with another.

------
barbarr
Can someone explain this line to me?

>In addition to setting the confidence threshold far too low, the Rekognition
results can be significantly skewed by using a facial database that is not
appropriately representative and therefore is itself skewed. In this case, the
ACLU used a facial database of mugshots that may have had a material impact on
the accuracy of Rekognition findings.

~~~
l9k
What I find problematic is that no one can say that people with darker skin
(or from any ethnicity in particular) are harder to identify precisely,
without offending people and being accused of racism/bias.

But we have no objective proof that everyone is as recognizable as anyone
else.

~~~
Fomite
I know a number of researchers in ML who have stated this and been fine.

------
untangle
OK, maybe ACLU stumbled a bit by using a low (?) 80% confidence threshold. But
I wonder how much over-fitting occurs at the AMZN-recommended 99%? Is a 99%
fit scalable? What's the false-negative rate? I'm not satisfied that either
side has made a convincing argument.

------
zeroxfe
Page is 404 right now. Google cache:

[https://webcache.googleusercontent.com/search?q=cache:wzZHRC...](https://webcache.googleusercontent.com/search?q=cache:wzZHRC4q8uQJ:https://aws.amazon.com/blogs/aws/thoughts-
on-machine-learning-accuracy/+&cd=1&hl=en&ct=clnk&gl=ca)

This blog shares some brief thoughts on machine learning accuracy and bias.

Let’s start with some comments about a recent ACLU blog in which they ran a
facial recognition trial. Using Rekognition, the ACLU built a face database
using 25,000 publicly available arrest photos, and then performed facial
similarity searches of that database using public photos of all current
members of Congress. They found 28 incorrect matches out of 535, using an 80%
confidence level; this is a 5% misidentification (sometimes called ‘false
positive’) rate, and a 95% accuracy rate. The ACLU has not published its data
set, methodology, or results in detail, so we can only go on what they’ve
publicly said. But here are some thoughts on their claims:

1\. The default confidence threshold for Rekognition is 80%, which is good for
a broad set of general use cases (such as identifying objects, or celebrities
on social media), but it’s not the right one for public safety use cases. The
80% confidence threshold used by the ACLU is far too low to ensure the
accurate identification of individuals; we would expect to see false positives
at this level of confidence. We recommend 99% for use cases where highly
accurate face similarity matches are important (as indicated in our public
documentation).

To illustrate the impact of the confidence threshold on false positives, we
ran a test where we created a face collection, using a dataset commonly used
in academia of over 850,000 faces. We then used public photos of US Congress
members (the Senate and House) to search against this collection in a similar
way to the ACLU blog.

When we set the confidence threshold at 99% (as we recommend in our
documentation), our misidentification rate dropped to 0% despite the fact that
we are comparing against a larger corpus of faces (30x larger than ACLU’s
tests). This illustrates our point that developers should pick the appropriate
confidence threshold best suited for their application and their tolerance for
false positives.
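
(A concrete illustration of picking the threshold per request; this sketch is
not from the AWS post, and the collection ID and image file are placeholders:)

    import boto3

    client = boto3.client("rekognition")

    with open("probe_photo.jpg", "rb") as f:
        image_bytes = f.read()

    # FaceMatchThreshold defaults to 80 if omitted; set it explicitly.
    response = client.search_faces_by_image(
        CollectionId="my-face-collection",  # hypothetical collection
        Image={"Bytes": image_bytes},
        FaceMatchThreshold=99,
        MaxFaces=5,
    )
    for match in response["FaceMatches"]:
        print(match["Face"]["FaceId"], match["Similarity"])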

2\. In real world public safety and law enforcement scenarios, Amazon
Rekognition is almost exclusively used to help narrow the field and allow
humans to expeditiously review and consider options using their judgement (and
not to make fully autonomous decisions), where it can help find lost children,
fight against human trafficking, or prevent crimes. Rekognition is generally
only the first step in identifying an individual. In other use cases (such as
social media), there isn’t the same need to double check, so confidence
thresholds can be lower.

3\. In addition to setting the confidence threshold far too low, the
Rekognition results can be significantly skewed by using a facial database
that is not appropriately representative and therefore is itself skewed. In
this case, the ACLU used a facial database of mugshots that may have had a
material impact on the accuracy of Rekognition findings.

4\. The beauty of a cloud-based machine learning application like Rekognition
is that it is constantly improving as we continue to improve the algorithm
with more data. Our customers immediately get the benefit of those
improvements. We continue to focus on our mission of making Rekognition the
most accurate and powerful tool for identifying people, objects, and scenes –
and that certainly includes ensuring that the results are free of any bias
that impacts accuracy. We’ve been able to add a lot of value for customers and
the world at large already with Rekognition in the fight against human
trafficking, reuniting lost children with their families, reducing fraud for
mobile payments, and improving security, and we’re excited about continuing to
help our customers and society at large with Rekognition in the future.

5\. There is a general misconception that people can match faces to photos
better than machines. In fact, the National Institute of Standards and
Technology (“NIST”) recently shared a study of facial recognition technologies
that are at least two years behind the state of the art used in Rekognition
and concluded that even those older technologies can outperform human facial
recognition abilities.

A final word about the misinterpreted ACLU results. When there are new
technological advances, we all have to be careful to be calm, thoughtful, and
reasoned about what’s real and what’s not. There’s a difference between using
machine learning to identify a food object and whether a face match should
warrant considering any law enforcement action. The latter is serious business
and requires much higher confidence levels. We continue to recommend that
customers not use less than 99% confidence levels for law enforcement matches,
and then to only use the matches as one input across others that make sense
for each agency. But machine learning is a very valuable tool to help law
enforcement agencies, and while we should be concerned that it is applied
correctly, we should not throw away the oven because the temperature could be
set wrong and burn the pizza.

~~~
derf_
If you set the confidence threshold to 80% in the test you perform in (1), how
many misidentifications do you see, if any?

~~~
didibus
It would probably be less than 5%, since they used a better dataset. The ACLU
got 5% misidentification using the biased mugshot database at 80%.

~~~
didibus
I'd appreciate it if someone would correct my data if it's wrong. I too would
like to know.

------
ucaetano
404? Was it deleted?

~~~
simonhughes22
I am guessing legal and/or PR had a fit and removed it, as any publicity
around this sort of thing is very sensitive these days, even though they were
pointing out legitimate errors in the ACLU's work.

------
danielvf
TLDR:

1) When the ACLU tested mugshot photos vs. congress member photos, the ACLU
set it to use an 80% confidence level threshold, which should result in a 5%
false positive rate. There are 535 members of congress, which at this setting
should have resulted in 26.75 misidentifications. The ACLU got 28
misidentifications, which is pretty darn close (a quick check of this
arithmetic follows the list).

2) The ACLU used a comparatively small dataset of photos. Using a different,
30x bigger dataset and a 99% confidence level resulted in no
misidentifications.

3) The police are only supposed to be using this system to narrow down
choices and then have a human sort out possible matches, which this should
do very well.
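
(The quick arithmetic check promised in point 1; plain Python, using nothing
from the article beyond the 535 and 28 figures:)

    members = 535
    print(f"observed rate:  {28 / members:.1%}")     # ~5.2%
    print(f"expected at 5%: {0.05 * members:.2f}")   # 26.75, close to 28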

\---

My thoughts: The system appears to be functioning exactly according to
specifications. The eternal problem of systems is that the specifications do
not match what users are actually expecting/needing the system to do. If the
ACLU is confused about how it works, the police probably will be as well.

~~~
teraflop
Nitpick:

> the ACLU set it to use an 80% confidence level threshold, which should
> result in a 5% false positive rate.

I'm not sure where the "should" in this sentence is coming from. As far as I
know, Amazon's documentation doesn't claim any particular relationship between
confidence and false positive rate. The 5% number in the article was simply
talking about the _actual_ false positive rate that the ACLU observed.

Not to say the conclusions of the article are wrong, though. It certainly
seems unreasonable for someone to assume that "80% confidence" would give an
FPR of _less_ than 5%.

~~~
shriphani
Possibly some internal metric they have? 80% might be the pairwise similarity
for a pair of images, and across a whole dataset this might have a 5% error
rate.

------
fatjokes
I'm sure this rebuttal will get as much coverage as the original headline.
"Confidence intervals"? "Misclassification rates"? Get out of here with your
science-talk, nerd!

Sadly, these days I see too many ML practitioners and "data scientists"
without the necessary prob/stats foundation. Misinterpreting data happens even
to experts, so not surprisingly it's going to happen even more so to amateurs.
Suggesting more foundational knowledge is considered elitist. Why shouldn't a
month-long bootcamp be as good as an MS/PhD? In this case, the original ACLU
results fit a certain narrative and hit all the hot topics, so they were bound
to be picked up.

------
larkinrichards
Unlike a DNA test, a match via facial recognition would have trouble holding
up in court without other significant evidence.

Facial recognition is likely to help rapidly match known photos of a suspect
with security footage and imagery gathered nearby to help pin down movements
of a suspect prior to and after a crime, in order to focus the search for
further evidence.

This is a tool to catch stupid criminals; it seems easily fooled by those who
approach crime in a more systematic manner.

There should be laws that prevent using facial recognition to track the
locations of people who aren’t the subject of investigation. I am sure that
technology would be very desirable to advertisers, and I don’t want them to
have it.

------
mlthoughts2018
The ultimate test of any predictive model is whether it works in the intended
stakeholder setting. Other types of diagnostics like ROC curves, performance
on benchmark data sets, etc., are of course valuable, but ultimately do not
necessarily reflect the true, run-time distribution of inputs or constraints
that will be relevant for a stakeholder’s usage scenario.

I’ve said this many times. What it means is that even if you think it’s great
to outsource your machine learning model to a third party like Amazon, and
consume it like a service, you really need in-house expertise in machine
learning anyway, to help you interpret the accuracy, diagnostics, and some
“integration test” notion of model performance on a sample that is agreed upon
as truly representative of the stakeholder runtime conditions.

So either way, you still have to pony up the dough to employ adequate in-house
expertise in modeling and domain machine learning.

As a result, there actually are not that many use cases when it would make
sense not to build your own tailored model in-house. If you already must hire
most of the staff with the necessary skills to evaluate a third party, the
incremental effort to train a good-enough fine-tuning of some ImageNet-based
model is really not that high. And you can customize the training and
diagnostics.

This is especially critical when there are also stakeholder-specific latency
or throughput constraints.

For face detection in particular, I know a great deal about this as it
relates directly to AWS Rekognition, because my team extensively evaluated it
for the possibility of outsourcing face detection calls in an in-browser
image editing application.

Not only could Rekognition not meet our runtime requirements, but it also had
poor accuracy and coverage (often capping out at detecting a small number of
faces per image), and the Rekognition-reported confidence scores aligned
horribly with our own ground-truth bounding-box data, which allowed us to
quantify the overlap of each detection using intersection-over-union scores.
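
(For reference, intersection-over-union is simple to compute; a minimal
sketch with illustrative boxes, not our actual data:)

    def iou(box_a, box_b):
        """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union else 0.0

    # A detection covering half of the ground-truth box:
    print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # ~0.33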

We tested a handful of fine-tuned prebuilt models in Keras: Resnet, Inception,
and our own port of MTCNN, and had drastically better face detection (on a
very large training corpus) than Rekognition in about a week.

And our cost per request, even deploying to AWS, was much lower than
Rekognition even at our high volume of traffic and even inclusive of the cost
to spin up GPU instances for training.

What’s more, we tightly controlled how the web service layer wrapping it was
written, how the image preprocessing steps were optimized, and so on, none of
which we have any visibility into with a third party.

I think there is a conceptual gap with this stuff where people just think that
because you could commoditize a web service around a prediction algorithm, it
must mean it would be economically valuable.

But that part is the absolute least meaningful part. The whole enchilada is
diagnosing how the model performs on the exact stakeholder use case, inclusive
of performance constraints.

For this reason, I think the only market for “ML tools for people who don’t
know ML” is going to be just like vapid corporate IT consulting.

Don’t get me wrong: big tech companies will profit from this. But not because
it solves any actual prediction problems or saves anyone money compared with
in-house ML development. It’ll just be the standard Dilbert-y politics that
has always yielded profits to IT consulting services.

