
International evaluation of an AI system for breast cancer screening - superfx
https://www.nature.com/articles/s41586-019-1799-6.epdf?shared_access_token=kVvZMoNQY1E60a9V4HRgz9RgN0jAjWel9jnR3ZoTv0M5zwPVx5jT4z_z-YkUZTBTE1cippy6E3F3qd68BjOQxslFJ4XPytLp4iDhuNLXcChKCLDbaOud4l8DoxozA37iL2-mcZ1zUHf9b_cMFME3CA%3D%3D
======
samuel
A bit unrelated, but I'm surprised by how second readings are performed.

_In the UK, each mammogram is interpreted by two readers, and in cases of
disagreement, an arbitration process may invoke a third opinion. These
interpretations occur serially, such that each reader has access to the
opinions of previous readers._

I wonder if it's sensible to condition the second radiologist's opinion this
way. My first impression is that this should be done blindly; otherwise the
temptation to simply confirm the previous reading, whether out of fatigue,
laziness, or a desire to avoid confrontation, could severely affect the
results.

Similarly, a future in which the radiologist gets too much support could
impair their skills and turn them into a sort of mechanical worker who simply
accepts what the AI overlord has decided.

------
arglemargle
I’m not a huge fan of turning the BI-RADS classification scheme into an ROC
curve. From what I’ve seen, BI-RADS is something like a yes / no / maybe
scheme for mammograms. I don’t think it was designed to be treated like a test
score, so using it to generate an ROC curve feels like an unfair comparison
between the AI system and current clinical practice.

What they’re doing is interesting, but it’s still very academic. I have little
doubt that eventually some sort of AI system will benefit clinical practice,
but based on the sheer number of studies that fail to make it over the line,
I’m not sure I have high hopes for this one. What they’ve done so far is the
equivalent of “it works in vitro...”

~~~
sanxiyn
As I understand it, for comparison they _also_ converted the AI system's
output into a BI-RADS class and then derived an ROC curve from that.

~~~
arglemargle
Huh, where does it say that in the article? I don’t think I spotted that.

All the same, it feels somewhat beside the point to me. It just doesn’t feel
right to take a medical diagnostic tool - whose intended purpose is
communication among doctors - and treat it as a test score. That’s just... not
what it was designed for.

~~~
sanxiyn
On page 3: "Readers rated each case using the forced BI-RADS scale, and BI-
RADS scores were compared to ground-truth outcomes to fit an ROC curve for
each reader. The scores of the AI system were treated in the same manner (Fig.
3)."

This isn't as clear as I would like, but Fig. 3 shows both an "AI system" and
an "AI system (non-parametric)" ROC curve. My understanding is that the former
is fit from the discrete BI-RADS classes, and the latter from the "raw"
output.
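
For what it's worth, here is a toy illustration of the two constructions as I
read them. Everything below is made-up data and a hypothetical scoring model,
not anything from the paper:

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=200)        # ground-truth cancer labels
    # Hypothetical continuous malignancy score, mildly separated by class.
    ai_raw = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, 200), 0.0, 1.0)

    # Reader-style ROC: the forced BI-RADS scale is ordinal (here 1..5),
    # so the curve has only a handful of operating points, one per cutoff.
    bi_rads = np.digitize(ai_raw, bins=[0.2, 0.4, 0.6, 0.8]) + 1
    fpr_d, tpr_d, _ = roc_curve(y_true, bi_rads)

    # "Non-parametric" ROC from the raw score: one operating point per
    # distinct threshold, hence a much smoother curve.
    fpr_c, tpr_c, _ = roc_curve(y_true, ai_raw)

    print("AUC from discrete BI-RADS classes:", auc(fpr_d, tpr_d))
    print("AUC from raw continuous score:    ", auc(fpr_c, tpr_c))

The discretized curve throws away within-class ranking information, so the two
AUCs generally differ a little, which would explain why Fig. 3 shows two
slightly different curves for the same system.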

------
sanxiyn
> Notably, the additional cancers identified by the AI system tended to be
> invasive rather than in situ disease.

This is probably due to invasive cancers being more common (~80%) than in situ
cancers. I am not sure why this natural explanation was not suggested.
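
A quick back-of-the-envelope check of the base-rate explanation. All counts
below are hypothetical, not the paper's numbers:

    # If ~80% of cancers are invasive, the AI's extra detections should
    # skew invasive even with no special affinity for invasive disease.
    from scipy.stats import binomtest

    extra_cancers = 25    # hypothetical count of additional AI-only detections
    invasive_found = 22   # hypothetical invasive cases among them
    base_rate = 0.80      # approximate population share of invasive cancers

    print("Expected invasive by base rate alone:", base_rate * extra_cancers)
    print("p-value vs base rate:",
          binomtest(invasive_found, extra_cancers, base_rate).pvalue)

Unless the observed invasive fraction differs significantly from the base
rate, "tended to be invasive" needs no further explanation.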

------
stoicShell
> Screening mammography aims to identify breast cancer at earlier stages of
> the disease, when treatment can be more successful. Despite the existence of
> screening programmes worldwide, the interpretation of mammograms is affected
> by high rates of false positives and false negatives. Here we present an
> artificial intelligence (AI) system that is capable of surpassing human
> experts in breast cancer prediction. [...]

> In an independent study of six radiologists, the AI system outperformed all
> of the human readers: the area under the receiver operating characteristic
> curve (AUC-ROC) for the AI system was greater than the AUC-ROC for the
> average radiologist by an absolute margin of 11.5%. We ran a simulation in
> which the AI system participated in the double-reading process that is used
> in the UK, and found that the AI system maintained non-inferior performance
> and reduced the workload of the second reader by 88%. This robust assessment
> of the AI system paves the way for clinical trials to improve the accuracy
> and efficiency of breast cancer screening.

So, there you have it: AI not "either/or" humans, but _both_, in conjunction,
as a _composition_ of the best of both worlds.

At the very least, that's how civilization will massively and intimately
introduce true assistant AI.
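
To make "both, in conjunction" concrete, here is a minimal sketch of what an
AI-in-the-loop double-reading protocol might look like. This is only my guess
at the shape of the simulation from the abstract, with stub readers and
made-up recall rates:

    import random

    def double_read(case, first_reader, ai_system, second_reader, arbitrate):
        human1 = first_reader(case)
        if ai_system(case) == human1:
            return human1, False      # AI agrees: second human not consulted
        human2 = second_reader(case)  # disagreement: usual second read
        if human2 == human1:
            return human1, True
        return arbitrate(case), True  # still split: third opinion decides

    # Stub readers for illustration only (~10% recall rate each).
    random.seed(0)
    reader = lambda case: random.random() < 0.1
    results = [double_read(c, reader, reader, reader, reader)
               for c in range(1000)]
    consulted = sum(needed for _, needed in results)
    print(f"Second reader consulted on {consulted / 10:.1f}% of cases")

The better the AI agrees with the first reader, the less often the second
human is needed; under this sketch, the paper's reported 88% workload
reduction would mean agreement on about 88% of cases.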

It's also somewhat counter-intuitive that the most specialized tasks are the
low-hanging fruit; i.e. that what is "difficult" for us, the culmination of
years of training and experience for humans (e.g. how to read a medical
scan), may be "easy" for the machine, given its natural advantages (like
speed and parallelism).

That space (where machine expertise is cheaper than human expertise) roughly
maps to the immense value attributed to the rise of industrial-age narrow AI;
therein lies not a way to replace humans (we never did that in history; we
merely destroyed _jobs_ to create ever more) but rather a way to _augment_
ourselves once more, to whole new levels of performance.

Anything more than this is AGI-level, science fiction so far; there's not
even a shred of evidence that it's a sure thing, or theoretically possible in
the first place. Which is not to say that AI safety research isn't _extremely
important_ even for the narrow kind (manipulation comes to mind), but we
shouldn't go so far as to bet future economic growth on its existence. Like
fusion or interstellar travel, we just don't know; not yet, and not for the
foreseeable future, given the scale involved.

~~~
rvz
Exactly this. This is where I see AI possibly going: being a complementary
tool, or second pair of eyes, that speeds up the work of professionals rather
than replacing them. I also see this research as a very positive step forward
for using AI for good, especially in producing highly accurate results that
can be used as an aid by health professionals.

However, given that this research applied a deep learning (DL) based AI
system in the medical industry, there are still open questions about whether
this AI system can explain itself and its internal decision process for the
sake of transparency. That aspect will almost certainly be ignored by news
sites, which will focus only on the accuracy. DL-based AI systems will remain
a concern for both patients and clinicians, and I would expect transparency
to be a focus point in the future, despite the welcome (and still very
interesting) results.

Other than the transparency issues behind the AI system, I'd say this is a
great start to the new decade for AI.
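
For context, the most basic of the usual transparency tools is an
input-gradient saliency map. A minimal sketch, assuming a generic
differentiable classifier; nothing here is from the paper:

    import torch

    # `model` is a stand-in for any differentiable image classifier that
    # emits a single malignancy logit, not the paper's actual system.
    def saliency_map(model, image):
        """image: tensor of shape (1, C, H, W); returns a (1, H, W) map of
        which pixels most influenced the prediction."""
        image = image.clone().requires_grad_(True)
        model(image).squeeze().backward()          # d(logit) / d(pixel)
        return image.grad.abs().max(dim=1).values  # collapse channel axis

Saliency maps are far from a full explanation, but they at least let a
clinician see where the model was looking.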

~~~
xarope
Agreed. The ability of someone (or an AI) to explain their decision-making
process is critical in determining whether such a decision has been
adequately thought out or not. If a PhD candidate must go through a viva,
surely it is also incumbent on anybody pushing "AI" to be able to "survive"
such a viva. Otherwise, we might as well just go back to the days of reading
entrails, flipping coins, etc.

------
sergers
There were something like 80 different vendors showing assistive AI across
many different spaces at RSNA 2019.

It's well known that every major PACS vendor is pursuing assisted findings
(some have been available for ages, e.g. iCAD's BI-RADS findings that tell
the rad what to check) or even just case prioritization for radiologists
(e.g. Aidoc has an algorithm that flags brain bleeds for case prioritization,
not diagnosis).

They are all really employing machine learning (Zebra Medical claims 30
million scans processed).

Medical "AI"/"algorithm" companies have grown vastly over the past few years.

~~~
ariehkovler
> e.g. Aidoc has an algorithm that flags brain bleeds for case
> prioritization, not diagnosis.

Eh... well, it's more complicated than that. These systems CAN diagnose, but
their regulatory approval is only for use as an aid, not as a diagnostic
tool.

------
sytelus
Paywalled scientific research, no source code, and reviewers who don't mind
run-of-the-mill CNNs being called "our AI system".

How far Nature has fallen. How long before it is merely a PR agency for big
tech?

~~~
sanxiyn
Not only is this "no code", it is also "no data"! At least a data
availability statement is now required, so we know the data is not publicly
available...

~~~
1_over_n
[https://twitter.com/drhughharvey/status/1212646954477539328](https://twitter.com/drhughharvey/status/1212646954477539328)

------
dang
Related from 2 days ago:
[https://news.ycombinator.com/item?id=21917747](https://news.ycombinator.com/item?id=21917747).
This looks like different work though?

~~~
sanxiyn
Yes, it is different work. The current one, published in Nature, is from
DeepMind.

It is interesting to note the differences. For example, DeepMind notes "In our
reader study, all of the radiologists were eligible to interpret screening
mammograms in the USA, but did not uniformly receive fellowship training in
breast imaging." whereas DeepHealth notes "All readers were fellowship trained
in breast imaging", so +1 to DeepHealth.

On the other hand, DeepMind says "Where data were available, readers were
equipped with contextual information typically available in the clinical
setting, including the patient’s age, breast cancer history, and previous
screening mammograms." while DeepHealth says "Radiologists did not have any
information about the patients (such as previous medical history, radiology
reports, and other patient records)", so +1 to DeepMind. And so on. These
differences make direct comparison between studies very difficult.

~~~
nl
This "+1" thing is damaging and incorrect.

Depending on the context the model ends up being used in, something that
appears good may not be. For example, the fellowship-training point: these
non-fellowship-trained radiologists are doing this task now, so it is
absolutely reasonable to assess against them to test real-world performance.

It would be interesting to see if the fellowship-trained radiologists
actually did perform better in all circumstances (in some fields, the
better-trained radiologists end up not using their skills on as broad a range
of patients, so their performance is actually worse on some subsets of data).

~~~
sanxiyn
+1 was mostly to indicate whether you should adjust the reported result up or
down to make it comparable with other studies. I didn't mean to imply
anything about clinical relevance.

~~~
nl
Yeah that is fair.

