Deep learning outperformed dermatologists in melanoma image classification task (ejcancer.com)
267 points by EndXA on April 30, 2019 | 89 comments



I do research in computer vision and this paper is so bad it's beyond words.

* They give the network a huge advantage: they teach it that it should say "no" 80% of the time. The training data is unbalanced (80% no vs 20% yes), as is the test data. Of course it does well! I don't care what they do at training time, but the test data should be balanced or they should correct for this in the analysis.

* They measure the wrong things that reward the network. Because the dataset is imbalanced you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve. This is machine learning and stats 101.

* They measure the wrong thing about humans. What a doctor does is decide how confident they are and then refer you for a biopsy. They don't eyeball it and go "looks fine" or "it's bad". They should measure how often this leads to a referral, and they'll see totally different results. There's a long history in papers like this of defining a bad task and then saying that humans can't do it.

* They have a biased sample of doctors that is highly skewed toward people with no experience. Look at figure 1. A lot of those doctors have about as much experience to detect melanoma as you do. They just don't do this task.

* "Electronic questionnaire"s are a junk way of gathering data for this task. Doctors are busy. What tells the authors that they're going to be as careful for this task as with a real patient? Real patients also have histories, etc.

I could go on. The list of problems with this paper is just interminable (54% of their images were labeled non-cancer just because a bunch of people looked at them; if people are so often wrong, why trust those labels? I would only trust biopsies).

This isn't coming to a doctor's office anywhere near you. It's just a publicity stunt by clueless people. Please collaborate with some ML folks before publishing work like this! There are so many of us!


Since this is a journal focused on cancer and not machine learning, I can understand why the editors would see this paper as worthy of publication. Unfortunately, many of the readers will read the paper uncritically.

If possible, you should write a critical response to this paper, focusing on its methodological flaws, and send it to the editors. It doesn't have to be long; critical responses are usually a couple of pages at most. This is likely the most effective way of removing (or at the very least, heavily qualifying) bad science from research journals.


This is a huge problem throughout science, not just ML. As scientists, we're rewarded for publishing cool new things that work, not for pointing out things that don't or for pointing out flaws in existing papers. If the point is to get people to not read one bad paper, it's just a waste of my time. Most papers are false and a lot of them should never have passed review.

If the authors actually wanted to do good ML research, they could always have reached out to a decent ML researcher who could have told them all of this. There's no shortage of us. The journal could have reached out to an ML reviewer. Why wouldn't they? But no one did, because the results look good, so it gets sent off to press, and it's good for both the authors and the journal to have something that is hype-worthy. It's just the sad reality of modern science.


It's amazing that a similar concern was raised/discussed here just a couple of hours ago: https://news.ycombinator.com/item?id=19788088

Any chance we could connect over email or something?


> Most papers are false and a lot of them should never have passed review.

Do you mean this literally or is this a metaphor to illustrate the point? If you actually mean most papers are false it'd be nice to see a link on that!


John Ioannidis claims that "most published research is false" based on some rather dubious assumptions.

https://www.annualreviews.org/doi/abs/10.1146/annurev-statis...


I agree with him although the accuracy of that statement is partially based on how “published research” is defined. Operational definitions and measurement are themselves much of the problem.


How to Publish a Scientific Comment in 1 2 3 Easy Steps

http://frog.gatech.edu/Pubs/How-to-Publish-a-Scientific-Comm...

I agree that a formal comment is best although not necessarily easy. A comment on PubPeer is easier but it will probably only be seen by those with the PubPeer extension.

I do machine learning in computational biology and cancer. The issues described in the parent comment are known among experts. It’s too bad so many others don’t know or care.


Thank you for that link, it was a joy to read


I mean, if it's an interdisciplinary study, you may want to get advisers from all sides to look at it before you publish, no?


Why would you ever balance your test data? If 80/20 is the actual population distribution, the sample that forms your test set should conform to that. Balance all you want in train/validation sets, but never the test set.

Not balancing and using ROC is a terrible combo, but the metric is the problem, not the lack of artificial balance.


I agree, they should do one or the other.

The imbalance is totally artificial and objectionable, though. Where's the evidence that doctors see an 80/20 split in real life? If there is going to be an imbalance, they should make it reflect the actual statistics of the task that the doctors perform, not some artificial number. It doesn't even reflect the statistics of the dataset they started with (which is 90/10 unbalanced).

Admittedly, the correct analysis for when the data is unbalanced is more annoying, and ROC curves are easier to interpret. That's why in something like ImageNet, even though the training set is imbalanced, the test set is balanced.

Comparisons against humans are also harder when the data is imbalanced in a way that reflects the training set, not the task. Humans don't know they are supposed to say "no" 80% of the time. That rewards the machine and that isn't easy to correct (you can correct what you think about the machine results with respect to a baseline, but not what biases the humans had).


> Where's the evidence that doctors see a 80/20 split in real life?

Cause they definitely don’t. Even in a select subpopulation - say, people going to a derm for screening - you’d expect one melanoma per 620 persons screened (as per the SCREEN trial). Since most people have more than one mole for evaluation, and even those with melanoma will have multiple innocent moles... a mole count >50 triggers a referral for screening, though in more cautious docs, possibly as few as 25...

If you wanna be really generous and consider our hypothetical high risk group to have an average of 10 moles per person, that’s 6209:1, not 80:20.


Another reason to balance the test set when the train set is unbalanced is to check if lack of training data for certain classes is a problem. You would use cross-validation, but do different splits for each class. It might well turn out that certain classes are just "easy", and you don't need to find more training samples for them to get the overall accuracy up.
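A rough sketch of that kind of check (toy data via scikit-learn, not the paper's setup): fit on growing slices of the training set and watch per-class recall; a class whose recall plateaus early is probably "easy" and doesn't need more samples.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for image features; 20% positives, like the test split here.
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.8, 0.2], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)

    for frac in (0.1, 0.25, 0.5, 1.0):
        n = int(frac * len(X_tr))
        clf = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
        pred = clf.predict(X_te)
        print(f"{frac:.0%} of training data: "
              f"benign recall={recall_score(y_te, pred, pos_label=0):.2f}, "
              f"melanoma recall={recall_score(y_te, pred, pos_label=1):.2f}")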


80/20 is not the actual population distribution though.


Do you have an explanation of why ROC is bad for unbalanced datasets? Isn't ROC unaffected by dataset imbalance?


Agreed, I have a hard time believing this person does CV research (though I suppose it could just be a hobby for them) with a statement like that. Especially calling out that they didn't balance the test set, ummm... what?


I would say your criticism is way off base. I've developed and fielded ML-based medical devices and this looks like a reasonable study that suggests they have a system worthy of further testing. There's nothing wrong with using an ROC curve here, and they document the experience of the doctors, so they weren't hiding that, and around 60 or so doctors had greater than 5 years of experience. Also, studies like this generally don't use only biopsy-proven negatives, since that would bias the negatives towards those that were suspicious enough to biopsy.

Without knowing more details than what the paper provides, I cannot say the results are valid, but I also don't see any terrible errors after a quick scan. The main weakness is probably the fact that the test set came from the same image archive used for development. As a result, there can be all sorts of biases the CNN is using to inflate its performance unbeknownst to the developers. The best way to eliminate that concern is to use a test set gathered through a different data collection effort using different clinics, but that is expensive and time consuming and not something I would do initially. This looks like a good first step and I would encourage the developers to carry it further.

EDIT: I'll add that the ratio of positives to negatives in the training set is irrelevant and in no way invalidates the study. As far as testing goes, there is always a balance you must strike in a reader study involving doctors. Ideally, you would have the exact ratio a doctor would encounter in practice, but for a screening study, that is typically impractical as you would need a huge number of cases and doctor time is expensive. A ratio of 1 positive to 4 negatives is entirely reasonable, although the doctors (particularly the less experienced ones) will almost certainly have an elevated sensitivity and reduced specificity since they will know it is an enriched set, but this is reasonable for ROC comparison purposes as it mostly just selects a different point on the doctor's personal ROC curve. Note that some studies even tell the doctors beforehand what percentage of cases are positive.


Thank you for posting this; I can see that this evaluation came very easily to you because of your experience and expertise but to me it shows how much knowledge is required to evaluate something like this. There really should be a protocol defined around this kind of study that encodes the criticisms that you make here (and others) and stops publication of this kind of thing in its tracks.


Agreed. See this paper for a reputable reference in this space: https://www.nature.com/articles/nature21056


Why can't you use ROC with an imbalanced dataset?

My understanding is the PR curve is preferable to ROC since the ROC can make it difficult to discern differences between models on imbalanced data; but the ROC is still a valid way to compare/measure models.


I work as an ML engineer, some thoughts:

The train/test data being imbalanced in the same way does give the model an advantage, but I don't think that making the test set 50% would solve the issue completely either. Doctors have been "trained" on the true distribution, which is not 50% (I'd guess that the true distribution is actually extremely unbalanced).

The model isn't simply learning to predict no 80% of the time, it is learning the distribution of the data with respect to the input features. For example, let's say that we have a simple model with only 3 binary features. It may learn that when features X_0, X_1 and X_2 are 1, the probability of cancer is 70%. This isn't a simple multiplication of the true probability by the upscaling factor though--it depends on the percent of negative samples with this feature vector and the percent of positive samples with this feature vector.

If we are to change the test set to be 50% positive and keep the same train distribution, the model no longer has the correct information about cancer rates with respect to feature distributions, but neither does the dermatologist. The specificity and sensitivity continue to not be interpretable as predicted specificity and sensitivity in the real world.

There is no issue with reporting specificity/sensitivity if they had used the true distribution of cases. Yes, the curves/AUCs will look better than the precision/recall rates, but they do not mis-represent what the doctors are interested in (what percent of people will be missed, and what percent of healthy people will be subjected to unnecessary procedures).
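If both prevalences are known, there's also a middle ground: keep the model trained on the enriched data and re-weight its output probabilities for the deployment prevalence. A minimal sketch of that standard prior-correction step (the 20% figure is the enriched test split discussed here; the 1-in-620 screening rate is the number another commenter quotes, used purely for illustration):

    # Re-weight a model's predicted probability for a different class prior
    # (Bayes' rule; a standard "prior shift" correction).
    def correct_prior(p_model, pi_train=0.20, pi_deploy=1/620):
        w = (pi_deploy / pi_train) * ((1 - pi_train) / (1 - pi_deploy))
        return (p_model * w) / (p_model * w + (1 - p_model))

    # A "70% melanoma" call on the enriched 80/20 set becomes ~1.5%
    # at a 1-in-620 screening prevalence.
    print(correct_prior(0.70))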

Anyways, the classifier doesn't actually seem to be that good; there are doctors who were better than the classifier if you check the paper.


Sensitivity and recall are two names for the same thing, Mr Stats 101 :)

Also, please explain the problem with using ROC here. The probabilistic interpretation of ROC's AUC is the probability of correctly ranking a random mixed pair (i.e. ranking the positive example higher than a negative one). How is that metric affected by the 80/20 split of the test data? Genuinely curious here...
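For what it's worth, that ranking interpretation is easy to check numerically (numpy and scikit-learn, toy scores): the AUC from roc_auc_score matches the brute-force fraction of correctly ranked positive/negative pairs, whatever the class ratio is.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    pos = rng.normal(1.0, 1.0, 200)   # scores for 200 positives
    neg = rng.normal(0.0, 1.0, 800)   # scores for 800 negatives (80/20 split)

    y = np.r_[np.ones_like(pos), np.zeros_like(neg)]
    scores = np.r_[pos, neg]

    # Fraction of positive/negative pairs ranked correctly (Mann-Whitney statistic).
    pairwise = (pos[:, None] > neg[None, :]).mean()

    print(roc_auc_score(y, scores), pairwise)   # the two numbers agree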


It does not matter whether the data is balanced or not when you report ROC (AUC), sensitivity and specificity for the purpose of comparison of two ways of image interpretation (e.g. humans vs. machines) as long as the evaluation is done on the same dataset with the same methodology. Obviously, the absolute numbers would not mean much outside of the study.


> test data should be balanced or they should correct for this in the analysis.

Why should it be balanced? It should be the expected natural clinical class distribution, no? The humans have priors about this too. If anything, it should be more imbalanced, as I would guess (I would hope!) that less than 20% of scans are malignant.


Very useful, thanks for this level of critique.

I wish they added this context in the limitations section. The paper only says:

"There are some limitations to this system. It remains an open question whether the design of the questionnaire had any influence on the performance of the dermatologists compared with clinical settings. Furthermore, clinical encounters with actual patients provide more information than that can be provided by images alone. Hänßle et al. showed that additional clinical data improve the sensitivity and specificity of dermatologists slightly [5]. Machine learning techniques can also include this information in their decisions. However, even with this slight improvement, the CNN would still outperform the dermatologists."

Your points hit on validity issues. Where would it fit on the errors of omission/commission scale?


While I agree that there are problems with the paper, I think you are confused about suitability of ROC, PR and how test set class imbalance affects them.

Your first two suggestions combined together are very wrong. If you made the test dataset balanced and then measured the PR curve, the precision would be way too optimistic, as it is directly affected by the class imbalance. The ROC curve, on the other hand, is invariant to the test set imbalance.

You can find interesting this short article I have written about this problem: https://arxiv.org/abs/1812.01388
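A toy illustration of the point (synthetic scores, scikit-learn assumed): hold the classifier fixed, artificially balance the test set by dropping negatives, and precision at a fixed threshold inflates while ROC AUC barely moves.

    import numpy as np
    from sklearn.metrics import precision_score, roc_auc_score

    rng = np.random.default_rng(1)
    pos = rng.normal(1.5, 1.0, 1000)   # classifier scores for positives
    neg = rng.normal(0.0, 1.0, 4000)   # ... and for negatives (80/20 test set)

    def report(pos, neg, thresh=1.0):
        y = np.r_[np.ones_like(pos), np.zeros_like(neg)]
        s = np.r_[pos, neg]
        print(f"prevalence={len(pos) / len(y):.2f}  "
              f"ROC AUC={roc_auc_score(y, s):.3f}  "
              f"precision@{thresh}={precision_score(y, s > thresh):.3f}")

    report(pos, neg)          # the natural 80/20 test set
    report(pos, neg[:1000])   # "balanced" test set: precision inflates, AUC doesn't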


> * They measure the wrong things that reward the network. Because the dataset is imbalanced you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve. This is machine learning and stats 101.

I'd originally written that a ROC curve can be misleading for an imbalanced dataset but that the AUC is still okay for selecting models. Edit: that is incorrect; a PR curve + PR AUC should be used for model selection if the data is imbalanced. I agree it would be really misleading if they (say) just reported accuracy (since the null classifier of always guessing negative would give 80% overall accuracy). I had also thought that the AUC of the ROC curve should still be a valid measure since it shows how much better the model performs than random guessing, but per the edit above, that isn't the right tool for model selection here.

How do you usually handle imbalanced data? I've had some success with SMOTE or weighted loss for imbalanced datasets, but I'm embarrassed to say I've been using AUC with ROC curves as the default - if this gives worse model selection than the AUC of a PR curve, I'll have to start doing that instead.
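For context, a minimal sketch of the weighted-loss route, reporting both ROC AUC and PR AUC (scikit-learn, synthetic data; average_precision_score as the usual PR-AUC stand-in):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic 95/5 imbalance as a stand-in for a rare-positive problem.
    X, y = make_classification(n_samples=20000, n_features=20,
                               weights=[0.95, 0.05], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for cw in (None, "balanced"):   # "balanced" ~ weighting the loss by class frequency
        clf = LogisticRegression(max_iter=1000, class_weight=cw).fit(X_tr, y_tr)
        p = clf.predict_proba(X_te)[:, 1]
        print(f"class_weight={cw}:  ROC AUC={roc_auc_score(y_te, p):.3f}  "
              f"PR AUC={average_precision_score(y_te, p):.3f}")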


Thanks for the comments, this is a great summary. Curious what you'd think of a Kappa score given the imbalance?

https://en.wikipedia.org/wiki/Cohen%27s_kappa
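For intuition, a tiny example (scikit-learn) of how kappa punishes the trivial "always say benign" strategy that raw accuracy rewards on an 80/20 set:

    from sklearn.metrics import accuracy_score, cohen_kappa_score

    y_true = [0]*80 + [1]*20            # 80/20 imbalance
    y_lazy = [0]*100                    # classifier that always says "benign"

    print(accuracy_score(y_true, y_lazy))     # 0.80 -- looks fine
    print(cohen_kappa_score(y_true, y_lazy))  # 0.0  -- no better than chance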


>you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve

But sensitivity and recall are the same thing...


There is nothing wrong with using ROC for imbalanced data. It is also perfectly reasonable to use an enriched dataset for a reader study, this is the standard practice.


It's almost as if publishing the thing was more important for the authors than the scientific value of the content.


As always, let's see how well it does in live images. This system outperformed dermatologists on its own validation set of 100 images, which I would encourage you to interpret as "heartening preliminary evidence" but not much more.

Posting high scores on your validation set is only as informative as your val set is representative of the real world. 70% specificity, 84% sensitivity looks OK on paper (maybe -- as another poster noted, it's equally fair to say it's good evidence that image-only diagnosis is bad no matter what is doing it), but it doesn't always feel that way in practice. As a cheap example, your word error rate for a speech recognition system has to be extremely low in order for that system to be nice to use -- way lower than most otherwise acceptable-looking scores.

This analogy only gets you so far, and I don't mean to impugn this study's test set, but another example is that just because you can post 99.9% on MNIST doesn't mean your system will approach that level of accuracy on digit recognition in the wild.


First off, if I'm reading correctly, it outperformed on its test set. This is different, as the model doesn't get to see that data at any point before the final evaluation.

If the authors have done a diligent job here, that should be good evidence of its accuracy. It's also encouraging to see that they do multiple training runs, getting similar accuracy, and that their ROC is generally better than not just the average physician, but almost all of them.


Remember the test set is derived from the same source as the training set. This is not the case in the wild.


Yeah, this is the ML equivalent of "it works in vitro".


That's a lovely way of putting it. You're exactly right.


Look, I'm not a domain expert in the medical side of this, but the paper says they used a dataset, described in its referenced paper as:

"This challenge leveraged a database of dermoscopic skin images from the ISIC Data Archive1 , which at the time of this publication contains over 10,000 images collected from leading clinical centers internationally, acquired from a variety of devices used at each center. The images are screened for both privacy and quality assurance. The associated clinical metadata has been vetted by recognized melanoma experts. Broad and international participation in image contribution ensures that the dataset contains a representative clinically relevant sam- ple". [Gutman et al.]

This paper selects a subset relevant to a certain condition (beyond my expertise).

But that sounds pretty good to me. If the test set in the paper is carefully randomly selected, as the authors described (and I've no reason to disbelieve them), then performance on the test set should be a good proxy for actual performance on unseen data, as the underlying dataset is designed to be representative of clinical practice.

Of course that's not the same as having a clinically useful product.

But commenters here are knocking these cancer researchers as if they are idiots, and saying very harsh things about their methods, when it seems like 3 minutes reading their Dataset section gives reasons to think their setup is actually pretty ok?


> 10,000 images collected from leading clinical centers internationally

Let's say the data was collected from 5 different clinical centers. One risk is basically that when you deploy the model, it only works at those clinical centers due to idiosyncrasies specific to those centers. Or suppose certain doctors were more likely to take melanoma images, and certain doctors were more likely to take non-melanoma images, and both sets of doctors used different techniques to take images. These are just some ideas; there could be any number of confounding factors.

Basically, one should be default suspicious of most research papers working with small datasets (the test set is only 100 images - this is very small) that have not been deployed in the real world or otherwise validated independently. The root comment of this thread is basically saying this (a different comment is the one calling the researchers idiots).


> These are just some ideas, there could be any number of confounding factors.

There _could_ be.

But when the source dataset was carefully gathered for a competition and the academics are saying "Broad and international participation in image contribution ensures that the dataset contains a representative clinically relevant sample", talking about multiple different equipment and labs, things look pretty promising.

A lot of ML systems are built with much less rigorous datasets and do a good job when you put them in production.

10k such images were gathered from this process.

The authors then randomly select 100 images from it to use as a test dataset. 100 is a small number. But that smallness is not relevant to the _selection_ issues here. It's only relevant to whether the measurements of performance are statistically significant (i.e. that we didn't end up with a sample that by chance is particularly favorable to the ML approach). 100 data points is enough that that's unlikely (though one should check).

Additionally, the authors talk about using a much larger validation set, and performing multiple runs and checking that the validation accuracy is similar. Unless they deliberately left out the damning fact that their accuracy was a lot _higher_ on their small test set than on their validation set, it's even more unlikely that the test set is a sample that is particularly favorable to the ML approach.

You could argue it happens to be particularly _unfavorable_ to the humans, but that seems a stretch. Perhaps they should have created multiple test sets, and a couple of different batches of human raters etc. But honestly, their setup seems pretty good to me.

> (a different comment is the one calling the researchers idiots).

That's fair.


Their validation set consisted of 210 positive images. The test set consisted of 20 positive images.

These are very small evaluation sets for deep learning. My point is the work is promising but should be viewed with healthy skepticism (by default).

I would really not read anything in particular into "Broad and international participation... ...sample." That's just a claim in a paper, it's not "the truth".


> These are very small evaluation sets for deep learning.

Evaluation is a statistics question, and it doesn't matter that the deep learning model used is high capacity and needs a lot of training data.

There's nothing inherently wrong with validating a complex model on a small amount of data.

The paper has a section 4.2 that gives a statistical analysis. Granted, it'd be nicer if they had enough data to show statistically significant differences.
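To put a rough number on the uncertainty that comes with only 20 positives (statsmodels; the 17/20 count is made up to roughly match the ~84% sensitivity quoted elsewhere in the thread):

    from statsmodels.stats.proportion import proportion_confint

    # Say the CNN caught 17 of the 20 melanomas in the test set (85% sensitivity).
    lo, hi = proportion_confint(count=17, nobs=20, alpha=0.05, method="wilson")
    print(f"sensitivity 0.85, 95% CI ({lo:.2f}, {hi:.2f})")   # roughly (0.64, 0.95)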


Isn’t DeepMind about to release a medical product that will do something very similar to this? Right now I wouldn’t doubt how well these systems can perform as compared to trained specialists that rely on their eyes even for reading test results.


The question is whether we really measure things that are relevant and specific to the diagnosis. As of today the answer is largely no, which is why you cannot replace doctors yet and the reason why "90% of diagnoses are made on patient history".


I wonder if these products will have to go through proper trials like drugs do? If not, why not?



I wonder if the results would be similar were the dermatologists to see the actual patient, in person, and then diagnose, with a photo then taken to be diagnosed by the CNN.

In other words, while dermatologists may have been outperformed by deep learning in image classification, it is not evident if deep learning could do the same against dermatologist diagnosing in person.

Also, it's not clear what the overall ratio of false negatives/positives was in each case.

Also, unless I missed it in the paper, I'd be curious to learn whether the cases in Fig. 4 where the majority of humans and the CNN disagreed were also the cases where the humans disagreed among themselves.


This is extremely relevant. Factors like whether the patient has freckles or red hair often drive doctors to resect anything out of an abundance of caution. It doesn't look like these additional demographic factors were made available to clinicians.

That said, the results here are still impressive.

Source: I worked in clinical melanoma research for 2 years.


Maybe? It's hard to say. I had a question about a mole once and what they did was take a photo of it with a special camera apparatus and they sent the image off to be diagnosed while I waited around. The doctor looked at it personally but it seems like the actual diagnosis was made by someone who never even saw me.


Same for me, the doctor I saw did say (after 3 seconds) "it's not a cancer but we'll take a picture anyway to be sure." I wonder if the same sort of thing would have happened if there was any real question in his mind?


Maybe a dumb question from a non-medical guy: are medical images considered "stationary" from a stats viewpoint?

That is, will medical images of diseases we diagnose in the next 20 years look a lot like the ones from the past 20 years, or is there a danger of over-fitting on an evolving data set? Could either the technology or the biology of the disease evolve?

In a prior life I was a quant trader, and financial market data is notorious for having the non-stationary problem. On top of market rules and structures changing all the time, once someone discovers a profitable trading idea, their own actions change what the data looks like for everyone else from that point forward.


There are always potential issues when a machine learning algorithm is applied over time.

Example #1: Let's say that cancer rates are increasing over time and cameras are improving over time. You might end up with a weird artifact in your model that higher resolution images are more likely to indicate cancer.

Example #2: Let's say that cancer-detecting algorithms are widely successful and so someone makes an app that lets you upload images of skin and the app tells you the probability of you having cancer. Suddenly a model that was trained on suspicious lesions is being used on normal freckles that people uploaded for fun. You end up with a lot of false positives. Maybe you try to combat that by including images uploaded to the app (that you somehow obtain labels for). But now you have a model that predicts that photos taken in brightly lit medical offices are likely to be cancer and blurry images taken in bathroom mirrors are not cancer.

You could argue that Example #2 is more about the difference between training data and data to be scored, but the fact remains that outside of tightly controlled scenarios, the way data is collected nearly always changes in time and ends up affecting model performance in unexpected ways.


Yup, this thread has a nice overview of ways performance on a validation set can overestimate clinical performance:

https://twitter.com/IAmSamFin/status/1122271463170564100

Another example of change over time:

> One difficulty in such a comparison is that Gleason grading standards have shifted over time, so that scores below six are now rarely assigned, and assigning a higher grade has become more common

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3775342/


Stationarity isn't really the issue here, as you don't typically analyse this data as measurements from a single stochastic process.

However - you have hit on a very real problem. Imaging systems have gotten better over time, and imaging quality even on nominally the same system can differ between sites. Image coverage can change with both policy and system capabilities, etc.

It's worse the more sophisticated the imaging systems are. Consider MRI, which is perhaps better thought of as equipment to perform physics experiments than as an imaging device. In that case, nominally equivalent scans from different vendors (even different generations from the same vendor) can have significantly different characteristics. And there is a ton of processing going on; there is no such thing as "raw" data here - even the vendors themselves may no longer be able to really (or at least easily) characterize what is being done.

So yes, in any machine learning applied to these data sets, you have a very real risk of learning odd characteristics of the sample data and hurting your generalization.

Biology isn't as likely to be a problem I think, but biological response to changing treatment protocols, sure.


My guess is largely no. The diagnoses and actions taken are in isolated environments, e.g. a cancer in patient B is not impacted by a diagnosis and treatment in patient A. This is not the case in trading, where everyone is analyzing and influencing a single messy environment.


Overfitting to people with certain complexions of healthy skin seems like a potentially much bigger problem than evolution of ailments.


This isn't time series data, so you don't need to worry about stationarity.


Reporting only sensitivity/specificity/ROC metrics and not reporting precision/positive predictive value is a bad sign. Especially since the latter is what health systems will want to look at before deciding on implementation.

The fact that they fiddled with the balance of classes in the test set makes the above even worse.
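For a back-of-the-envelope sense of what the missing PPV looks like, plug in the ~84% sensitivity and ~70% specificity quoted elsewhere in the thread, once at the enriched 20% prevalence and once at the 1-in-620 screening rate another commenter mentioned (illustrative numbers, not from the paper):

    def ppv(sens, spec, prevalence):
        # P(disease | positive test) via Bayes' rule.
        tp = sens * prevalence
        fp = (1 - spec) * (1 - prevalence)
        return tp / (tp + fp)

    print(ppv(0.84, 0.70, 0.20))    # ~0.41 on an enriched 80/20 test set
    print(ppv(0.84, 0.70, 1/620))   # ~0.005 at a screening-like prevalence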


I'm always wary of claims like these, since it is difficult to get a clean dataset of medical information, and it is difficult or impossible to know for sure what exactly what the classifier is looking at to classify the image:

> For example, Roberto Novoa, a clinical dermatologist at Stanford University in the US, has described a time when he and his colleagues designed an algorithm to recognize skin cancer – only to discover that they’d accidentally designed a ruler detector instead, because the largest tumours had been photographed with rulers next to them for scale.

Source: https://physicsworld.com/a/neural-networks-explained/


Interesting; however, there is no indication from the publisher or researchers of how this result can be reproduced. It's nice that they put in some of the training data, but imagine how much more impactful to the community this could be if those interested could reproduce - and iterate - on this...

At Gigantum (https://github.com/gigantum/gigantum-client), making this process as simple as possible is literally our raison d'être.


It's hard to imagine a narrow image classification task on which humans will be able to beat NNs.


It's not clear that melanoma diagnosis is fairly described as a static image classification task.

Certainly it's a component, but (for example) patient history is, too.


Considering that the ground truth comes from humans I would say that humans always outperform NN's and that results which show otherwise are demonstrating the limitations of the data set or testing process.


It's been proven that using humans as ground truth you can ultimately build a NN off that data which outperforms the humans.


What is meant by proven? What is meant by "the humans"? And how can any human say that is so?


I mean, for example, you take 5 expert radiologists, average their assessments when scoring an image, and train a NN to predict those averaged scores; with enough data the NN will beat any single selected doctor in accuracy.


Because an ensemble of humans can usually outperform a single human (quite like in artificial neural nets). One way to do it would be to setup a sufficiently large majority-voting ensemble of human classifiers.
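A toy simulation of that intuition (numpy; the 70% per-rater accuracy is invented): majority vote over five independent, equally noisy raters is noticeably more accurate than any single rater, so a model trained on the consensus labels has a better target than any one human.

    import numpy as np

    rng = np.random.default_rng(0)
    truth = rng.integers(0, 2, 100_000)      # ground-truth labels
    per_rater_acc = 0.70                     # each rater is right 70% of the time
    raters = np.where(rng.random((5, truth.size)) < per_rater_acc, truth, 1 - truth)

    single = (raters[0] == truth).mean()
    majority = ((raters.sum(axis=0) >= 3) == truth).mean()
    print(single, majority)                  # ~0.70 vs ~0.84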


I find it most interesting to use tools like this to augment - not replace - diagnosticians and specialists.

Of course the key here is that the training set is crucial to building a high quality model - which of course needs a set of specialists to give their consensus on the diagnosis of the patient based on the images.

Presuming those folks can agree - the technology becomes a force multiplier for good. If they disagree or label things problematically - they become a force multiplier for bad.


So long as the system is implemented in such a way as to stop it from becoming a crutch or a default. If it's a tool in the flow then I think such things (not this one, given the expert review information in this thread?) might be very valuable.


Every doctor tends to have very static sensitivity-specificity preferences (true positive rate, aka recall, and true negative rate, respectively). One of the interesting consequences of using an automated diagnostic tool (already mentioned in Esteva et al.'s 2017 Nature article) is that the sensitivity level can be chosen dynamically, depending on additional risk factors.
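Mechanically, choosing that operating point is just a threshold lookup on held-out scores. A sketch (scikit-learn; the 95% target and the variable names are placeholders):

    import numpy as np
    from sklearn.metrics import roc_curve

    def threshold_for_sensitivity(y_true, scores, target_tpr=0.95):
        # roc_curve returns thresholds from high to low, with TPR non-decreasing;
        # take the first (highest) threshold whose TPR meets the target.
        fpr, tpr, thresholds = roc_curve(y_true, scores)
        idx = int(np.argmax(tpr >= target_tpr))
        return thresholds[idx], tpr[idx], fpr[idx]

    # Usage (placeholder names): demand a stricter target for a high-risk patient.
    # thr, tpr, fpr = threshold_for_sensitivity(y_val, model.predict_proba(X_val)[:, 1], 0.99)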


MelaFind did this 15 years ago, and had a database of 50k lesions. Its classifier did about as well as derms, and much better than GPs. It got FDA approval, but, by that time, they had run out of money.


Well, the answer is a bit more complicated than just replacing dermatologists with a CNN. I'm pretty convinced the better approach is something like the one in this paper: use the CNN on easy cases and have the CNN tell a human which instances are hard to classify. Many images are easy to classify, but some are hard (even for the CNN), and humans should give those images more study.

https://arxiv.org/abs/1903.12220
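A minimal sketch of that kind of triage rule (plain confidence thresholds; the cutoffs and names are placeholders, and a real system would calibrate the probabilities first):

    import numpy as np

    def triage(probs, low=0.10, high=0.90):
        # probs: model's P(melanoma) per image. Confident cases get a label,
        # everything in between is deferred to a human.
        out = np.full(len(probs), "refer to human", dtype=object)
        out[probs <= low] = "benign"
        out[probs >= high] = "melanoma"
        return out

    print(triage(np.array([0.03, 0.55, 0.97])))
    # ['benign' 'refer to human' 'melanoma']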


Given all of the other concerns raised by commenters in this thread, I wouldn't be surprised to find that there's some sort of data leakage as well.


Deep learning outperformed humans in a random pattern recognition task, using a dataset that is somewhat similar to live data.


I've always thought that diagnostic-oriented specialties would be most at risk (pathology, dermatology, radiology, ophthalmology).

As long as you have procedures, you will have a need for an extremely competent clinician that can synthesize all information and coordinate with use of hands or devices.


All the specialties you cite entail various manual procedures.

Other than that, yes, robots are not capable of replacing manual work in medicine yet.


Exactly. As long as those specialities hold onto procedures, they'll be relatively ok.


Any place to upload an image?


If this is regarding yourself, just make an appointment with a dermatologist. Source: I'm an MD.


Despite the issues listed below, I believe this will be the future of such classifications.

In the future your doctor will have an image scanner in their office and you'll get a 'really cheap' diagnosis of it to back up the doctor's opinion.

Then you'll go for biopsy etc..


This is a great example of where we need to get humans out of the equation when (if) a machine is conclusively proven to perform consistently better.

It was justified (cost-wise) to replace many human labourers on auto assembly lines since machines don't get tired, need breaks, or have off days. It could certainly be argued it is even more important in the field of health care (to reduce costs and improve outcomes) for all forms of image scanning.


> This is a great example of where we need to get humans out of the equation now that a machine performs consistently better.

No, it's not.

The system is an image classifier with a HUGE false positive rate (and false negative rate). When false positive rates exceed actual incidence rates in the population (or, as in this case, far exceed them by orders of magnitude), then it's practically worthless. This is something that the medical community just does not get about statistics (edit: I'm not making a wild generalization here, there are articles out there about this issue).

What this study actually shows is that an image-only diagnosis of melanoma sucks and should never be used. It doesn't matter if it out-performs doctors, because in either case the diagnosis is garbage.


I think it could be useful as a tool for MDs.

When booking an appointment with your doctor, you could take a photo and submit it. If the NN detects something abnormal, it could be useful in prioritizing your appointment or at least flagging it for review by the MD.


Machines will never replace dermatologists, machines will only make them more efficient.


Interesting point.

The authors also seem to note this in their conclusion: “Our findings suggest that artificial intelligence algorithms may successfully assist dermatologists with melanoma detection in clinical practice which needs to be carefully evaluated in prospective trials.”

They use the verb assist. I think this will be the case, tools like this will be made available to board certified dermatologists to enhance their work. It won’t replace them anytime soon.

This field continues to be among the most competitive for physicians in the US. The barriers to entry are not only high; the number of residency spots also gets smaller each year relative to total applicants. If interested, here are some stats on med school rankings and derm match results: https://escholarship.org/uc/item/59p3z80r.

But over many years I wonder what will happen. Will there be paradigm shifts that make us rethink all these specialties and subspecialties? Should we combine them or do something else?


What is your efficiency metric here? Number of patients serviced? If so, it's not that simple.

Imaging specialties might end up becoming more "efficient" (such as radiology), but medical patient data which is used to diagnose and triage is some of the worst there is (source: I worked with it). The only positive outcome I see from ML for the medical field for the foreseeable future might be to reduce the number of misdiagnoses if all things go very well.


If you make a dermatologist 5x more efficient, don't you replace 80% of them?

Or even better, allow them to spend more time on the hardest cases. And allow people with no access to a dermatologist now, access to a machine almost as good?


Actually, when you make a knowledge worker / service worker in a business process 5x more efficient, the experience is that they spend 400% more time on the cases that they have left. These are the cases that you can't automate and that, before automation, you couldn't service properly/economically. Now you can, so the workers do.


Well... Never say never. If you said machines would not replace dermatologists in the next 1 or 2 decades, I would agree with you. After that, all bets are off.



