
Deep learning outperformed dermatologists in melanoma image classification task - EndXA
https://www.ejcancer.com/article/S0959-8049(19)30221-7/fulltext
======
0xab
I do research in computer vision and this paper is so bad it's beyond words.

* They give the network a huge advantage: they teach it that it should say "no" 80% of the time. The training data is unbalanced (80% no vs 20% yes), as is the test data. Of course it does well! I don't care what they do at training time, but the test data should be balanced, or they should correct for this in the analysis.

* They measure the wrong things, in ways that reward the network. Because the dataset is imbalanced you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve (see the sketch after this list). This is machine learning and stats 101.

* They measure the wrong thing about humans. What a doctor actually does is decide how confident they are and then refer you for a biopsy. They don't eyeball it and go "looks fine" or "it's bad". They should measure how often this leads to a referral, and they'd see totally different results. There's a long history in papers like this of defining a bad task and then saying that humans can't do it.

* They have a biased sample of doctors that is highly skewed toward people with no experience. Look at figure 1. A lot of those doctors have about as much experience detecting melanoma as you do. They just don't do this task.

* "Electronic questionnaire"s are a junk way of gathering data for this task. Doctors are busy. What tells the authors that they're going to be as careful for this task as with a real patient? Real patients also have histories, etc.

I could go on. The list of problems with this paper is interminable (54% of their images were labeled non-cancer just because a bunch of people looked at them. If people are so often wrong, why trust these labels? I would only trust biopsies).

This isn't coming to a doctor's office anywhere near you. It's just a
publicity stunt by clueless people. Please collaborate with some ML folks
before publishing work like this! There are so many of us!

~~~
plus
Since this is a journal focused on cancer and not machine learning, I can understand why the editors would see this paper as worthy of publication. Unfortunately, many of the readers will read the paper uncritically.

If possible, you should write a critical response to this paper, focusing on its methodological flaws, and send it to the editors. It doesn't have to be long; critical responses are usually a couple of pages at most. This is likely the most effective way of removing (or at the very least, heavily qualifying) bad science from research journals.

~~~
0xab
This is a huge problem throughout science, not just ML. As scientists, we're rewarded for publishing cool new things that work, not for pointing out things that don't, or for pointing out flaws in existing papers. If the point is just to get people not to read one bad paper, it's a waste of my time. Most papers are false and a lot of them should never have passed review.

If the authors actually wanted to do good ML research, they could have reached out to a decent ML researcher, who could have told them all of this. There's no shortage of us. The journal could have reached out to an ML reviewer. Why wouldn't they? But no one did, because the results look good, so they sent it off to press, and it's good for both the authors and the journal to have something that is hype-worthy. It's just the sad reality of modern science.

~~~
mehrdadn
> Most papers are false and a lot of them should never have passed review.

Do you mean this literally or is this a metaphor to illustrate the point? If you actually mean that most papers are false, it'd be nice to see a link for that!

~~~
michaelhoffman
John Ioannidis claims that "most published research is false" based on some
rather dubious assumptions.

https://www.annualreviews.org/doi/abs/10.1146/annurev-statistics-060116-054104

~~~
DataWorker
I agree with him although the accuracy of that statement is partially based on
how “published research” is defined. Operational definitions and measurement
are themselves much of the problem.

------
hprotagonist
As always, let's see how well it does on live images. This system outperformed dermatologists on its own validation set of 100 images, which I would encourage you to interpret as "heartening preliminary evidence" but not much more.

Posting high scores on your validation set is only as informative as your val set is representative of the real world. 70% specificity, 84% sensitivity looks OK on paper (maybe -- as another poster noted, it's equally fair to say it's good evidence that image-only diagnosis is bad no matter what is doing it), but it doesn't always feel that way in practice. As a cheap example, the word error rate for a speech recognition system has to be extremely low in order for that system to be nice to use -- way lower than most otherwise acceptable-looking scores.

This analogy only gets you so far, and I don't mean to impugn this study's test set, but another example is that just because you can post 99.9% on MNIST doesn't mean your system will approach that level of accuracy on digit recognition in the wild.

~~~
learntoplay
Isn't DeepMind about to release a medical product that will do something very similar to this? At this point I wouldn't doubt that these systems can perform as well as trained specialists who rely on their eyes, even for reading test results.

~~~
sgt101
I wonder if these products will have to go through proper trials like drugs
do? If not, why not?

~~~
hprotagonist
https://www.fda.gov/medical-devices/digital-health/software-medical-device-samd

https://www.fda.gov/medical-devices/ivd-regulatory-assistance/overview-ivd-regulation

------
yumraj
I wonder if the results would be similar were the dermatologists to see the actual patient in person and then diagnose, with a photo then taken to be diagnosed by the CNN.

In other words, while dermatologists may have been outperformed by deep learning at image classification, it is not evident that deep learning could do the same against dermatologists diagnosing in person.

Also, it's not clear what the overall ratio of false negatives/positives was in each case.

Also, unless I missed it in the paper, I'd be curious to learn whether the cases in Fig. 4 where the majority of humans and the CNN disagreed were also the ones where the humans disagreed among themselves.

~~~
ModernMech
Maybe? It's hard to say. I had a question about a mole once, and what they did was take a photo of it with a special camera apparatus and send the image off to be diagnosed while I waited around. The doctor looked at it personally, but it seems like the actual diagnosis was made by someone who never even saw me.

~~~
sgt101
Same for me: the doctor I saw did say (after 3 seconds) "it's not cancer, but we'll take a picture anyway to be sure." I wonder if the same sort of thing would have happened had there been any real question in his mind.

------
leelin
Maybe a dumb question from a non-medical guy: are medical images considered
"stationary" from a stats viewpoint?

That is, will medical images of diseases we diagnose in the next 20 years look a lot like the ones from the past 20 years, or is there a danger of overfitting on an evolving data set? Could either the technology or the biology of the disease evolve?

In a prior life I was a quant trader, and financial market data is notorious
for having the non-stationary problem. On top of market rules and structures
changing all the time, once someone discovers a profitable trading idea, their
own actions change what the data looks like for everyone else from that point
forward.

~~~
savagedata
There are always potential issues when a machine learning algorithm is applied
over time.

Example #1: Let's say that cancer rates are increasing over time and cameras
are improving over time. You might end up with a weird artifact in your model
that higher resolution images are more likely to indicate cancer.

Example #2: Let's say that cancer-detecting algorithms are widely successful
and so someone makes an app that lets you upload images of skin and the app
tells you the probability of you having cancer. Suddenly a model that was
trained on suspicious lesions is being used on normal freckles that people
uploaded for fun. You end up with a lot of false positives. Maybe you try to
combat that by including images uploaded to the app (that you somehow obtain
labels for). But now you have a model that predicts that photos taken in
brightly lit medical offices are likely to be cancer and blurry images taken
in bathroom mirrors are not cancer.

You could argue that Example #2 is more about the difference between training data and data to be scored, but the fact remains that outside of tightly controlled scenarios, the way data is collected nearly always changes over time and ends up affecting model performance in unexpected ways.
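
A hypothetical sketch of Example #1 (made-up numbers, not any real dataset): let a nuisance feature like camera resolution track the label during the training era, then watch the model fall apart once everyone has the same cameras.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def make_era(n, resolution_tracks_label):
        y = rng.random(n) < 0.2                    # 20% cancer rate
        lesion_signal = y + rng.normal(0, 2.0, n)  # weak genuine signal
        if resolution_tracks_label:
            # training era: malignant lesions were shot on newer cameras
            resolution = np.where(y, 12.0, 3.0) + rng.normal(0, 1.0, n)
        else:
            # deployment era: everyone has new cameras
            resolution = rng.normal(12.0, 1.0, n)
        return np.column_stack([lesion_signal, resolution]), y

    X_old, y_old = make_era(10000, resolution_tracks_label=True)
    X_new, y_new = make_era(10000, resolution_tracks_label=False)

    clf = LogisticRegression(max_iter=1000).fit(X_old, y_old)
    print("training-era accuracy:", clf.score(X_old, y_old))  # near-perfect
    # The model leans on resolution, so at deployment it calls almost
    # everything cancer and accuracy collapses toward the 20% base rate.
    print("deployment accuracy:  ", clf.score(X_new, y_new))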

------
michaelhoffman
Reporting only sensitivity/specificity/ROC metrics and not reporting
precision/positive predictive value is a bad sign. Especially since the latter
is what health systems will want to look at before deciding on implementation.

The fact that they fiddled with the balance of classes in the test set makes
the above even worse.
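
To see why PPV is the number a health system cares about, here's the back-of-envelope calculation using the 84% sensitivity / 70% specificity quoted elsewhere in this thread; the 2% prevalence is purely an assumed figure for illustration, not something from the paper.

    # Sensitivity/specificity quoted elsewhere in this thread; the 2%
    # prevalence is an assumption for illustration, not from the paper.
    sensitivity = 0.84
    specificity = 0.70
    prevalence = 0.02

    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    print(f"PPV: {ppv:.1%}")  # ~5.4%: most positive calls are false alarms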

------
bonyt
I'm always wary of claims like these, since it is difficult to get a clean dataset of medical information, and it is difficult or impossible to know for sure exactly what the classifier is looking at to classify the image:

> For example, Roberto Novoa, a clinical dermatologist at Stanford University
> in the US, has described a time when he and his colleagues designed an
> algorithm to recognize skin cancer – only to discover that they’d
> accidentally designed a ruler detector instead, because the largest tumours
> had been photographed with rulers next to them for scale.

Source: https://physicsworld.com/a/neural-networks-explained/

------
gigantum
Interesting; however, there is no indication from the publisher or researchers of how this result can be reproduced. It's nice that they put in some of the training data, but imagine how much more impactful to the community this could be if those interested could reproduce - and iterate on - this...

At Gigantum (https://github.com/gigantum/gigantum-client), making this process as simple as possible is literally our raison d'être.

------
JoshTko
It's hard to imagine a narrow image classification task in which humans will be able to beat NNs.

~~~
sgt101
Considering that the ground truth comes from humans, I would say that humans always outperform NNs, and that results which show otherwise are demonstrating the limitations of the data set or testing process.

~~~
leesec
It's been proven that, using humans as ground truth, you can ultimately build an NN off that data which outperforms the humans.

~~~
sgt101
What is meant by proven? What is meant by "the humans"? And how can any human
say that is so?

~~~
leesec
I mean, for example, you take 5 expert radiologists and average their assessments when scoring an image, and you train an NN to predict those averaged scores; with enough data the NN will beat any single selected doctor in accuracy.
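
A toy simulation of that claim (it leans on the assumption that the raters' errors are independent and unbiased, which is the contestable part): the 5-rater average is a strictly less noisy target than any one rater, so a model that fits it well inherits the smaller error.

    import numpy as np

    rng = np.random.default_rng(0)
    n_images, n_raters = 100000, 5

    truth = rng.normal(0, 1, n_images)                # latent true score
    raters = truth[:, None] + rng.normal(0, 1, (n_images, n_raters))
    consensus = raters.mean(axis=1)                   # the training target

    print("single rater MSE: ", np.mean((raters[:, 0] - truth) ** 2))  # ~1.0
    print("5-rater mean MSE: ", np.mean((consensus - truth) ** 2))     # ~0.2
    # A model that fits the consensus well inherits roughly 1/5 the
    # label noise of any individual expert.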

------
mikehollinger
I find it most interesting to use tools like this to augment - not replace - diagnosticians and specialists.

Of course, the key here is that the training set is crucial to building a high quality model - which in turn needs a set of specialists to give their consensus on the diagnosis of the patient based on the images.

Presuming those folks can agree, the technology becomes a force multiplier for good. If they disagree or label things problematically, it becomes a force multiplier for bad.

~~~
sgt101
So long as the system is implemented in such a way as to stop it becoming a crutch or a default. If it's a tool in the flow, then I think such things (though not this one, given the expert review in this thread?) might be very valuable.

------
georgeek
Every doctor tends to have very static sensitivity-specificity preferences (true positive rate, aka recall, and true negative rate, respectively). One of the interesting consequences of using an automated diagnostic tool (already mentioned in Esteva et al.'s 2017 Nature article) is that the sensitivity level can be chosen dynamically, depending on additional risk factors.
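
Here's a rough sketch of what choosing the sensitivity dynamically could look like (synthetic scores, not the paper's model; the target sensitivities are made-up examples): sweep the ROC curve and pick the threshold that meets whatever sensitivity the patient's risk profile demands.

    import numpy as np
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(0)
    y = rng.random(5000) < 0.2                          # 20% melanoma
    scores = np.where(y, rng.normal(0.7, 0.2, y.size),  # model outputs
                         rng.normal(0.4, 0.2, y.size))

    fpr, tpr, thresholds = roc_curve(y, scores)

    def operating_point(target_sensitivity):
        """First point on the ROC curve meeting the target sensitivity."""
        i = np.argmax(tpr >= target_sensitivity)  # tpr is non-decreasing
        return thresholds[i], tpr[i], 1 - fpr[i]

    # e.g. demand 95% sensitivity for a high-risk patient, 84% otherwise
    for target in (0.84, 0.95):
        thr, sens, spec = operating_point(target)
        print(f"target {target:.0%}: threshold={thr:.2f} "
              f"sensitivity={sens:.2f} specificity={spec:.2f}")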

------
sonnyblarney
Despite the issues listed elsewhere in this thread, I believe this will be the future of such classification.

In the future your doctor will have an image scanner in their office and you'll get a 'really cheap' diagnosis to back up the doctor's opinion.

Then you'll go for a biopsy, etc.

------
gbronner
MelaFind did this 15 years ago, and had a database of 50k lesions. Its
classifier did about as well as derms, and much better than GPs. It got FDA
approval, but, by that time, they had run out of money.

------
weaklearner
Well, the answer is a bit more complicated than just replacing dermatologists with a CNN. I am pretty convinced the better approach is something like the one in this paper: use the CNN on easy cases, and have the CNN tell a human which instances are hard to classify. Many images are easy to classify, but some are hard (even for the CNN), and humans should give those images more study.

[https://arxiv.org/abs/1903.12220](https://arxiv.org/abs/1903.12220)
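
Something like the following minimal sketch (my own illustration, not the linked paper's exact method): auto-classify only when the CNN's confidence clears a threshold, and route the rest to a dermatologist.

    import numpy as np

    def triage(probs, confidence_threshold=0.95):
        """probs: (n_images, n_classes) softmax outputs from the CNN."""
        confidence = probs.max(axis=1)
        auto = confidence >= confidence_threshold  # False -> defer to a human
        return probs.argmax(axis=1), auto

    probs = np.array([[0.99, 0.01],    # easy case: auto-classified
                      [0.55, 0.45]])   # hard case: deferred
    decisions, auto = triage(probs)
    print(decisions, auto)             # [0 0] [ True False]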

------
mrosett
Given all of the other concerns raised by commenters in this thread, I
wouldn't be surprised to find that there's some sort of data leakage as well.

------
StreamBright
Deep learning outperformed humans in a random pattern recognition task, using a dataset that is somewhat similar to live data.

------
assblaster
I've always thought that diagnostic-oriented specialties would be most at risk
(pathology, dermatology, radiology, ophthalmology).

As long as you have procedures, you will have a need for an extremely competent clinician who can synthesize all the information and coordinate the use of hands or devices.

~~~
rscho
All the specialties you cite entail various manual procedures.

Other than that, yes: robots are not capable of replacing manual work in medicine yet.

~~~
assblaster
Exactly. As long as those specialities hold onto procedures, they'll be
relatively ok.

------
jmpman
Any place to upload an image?

~~~
bobowzki
If this is regarding yourself, just make an appointment with a dermatologist. Source: I'm an MD.

------
canada_dry
This is a great example of where we need to get humans out of the equation
when (if) a machine is conclusively proven to perform consistently better.

It was justified (cost-wise) to replace many human labourers on auto assembly lines, since machines don't get tired, need breaks, or have off days. It could certainly be argued that this is even more important in the field of health care (to reduce costs and improve outcomes) for all forms of image scanning.

~~~
assblaster
Machines will never replace dermatologists; they will only make them more efficient.

~~~
feral
If you make a dermatologist 5x more efficient, don't you replace 80% of them?

Or even better, allow them to spend more time on the hardest cases. And allow
people with no access to a dermatologist now, access to a machine almost as
good?

~~~
sgt101
Actually, when you make a knowledge worker / service worker in a business process 5x more efficient, the experience is that they spend 400% more time on the cases they have left. These are the cases that you can't automate and that, before automation, you couldn't service properly or economically. Now you can, so the workers do.

