
Again, check that table. It says a lot:

https://stanfordmlgroup.github.io/competitions/mura/

On just about every test set, the model is beaten by radiologists. Even the mean performance is underwhelming.



I was referring mainly to this one (from the same group and it actually surpassed humans on average):

https://stanfordmlgroup.github.io/projects/chexnet/

In their paper they even used the "weaker" DenseNet-121 rather than the DenseNet-169 they used for MURA (bones). The DenseNet-BC I tried is another refinement of the same approach.
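For reference, the CheXNet setup is basically DenseNet-121 with its classifier swapped for a 14-way sigmoid head over the ChestX-ray14 findings. A minimal sketch of that kind of model (not the authors' code; the loss and the dummy batch here are just for illustration):

  import torch
  import torch.nn as nn
  from torchvision import models

  NUM_FINDINGS = 14  # ChestX-ray14 pathology labels

  # DenseNet-121 backbone, ImageNet-pretrained, with a new multi-label head
  model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
  model.classifier = nn.Linear(model.classifier.in_features, NUM_FINDINGS)

  # Multi-label objective: sigmoid + binary cross-entropy per finding
  criterion = nn.BCEWithLogitsLoss()

  x = torch.randn(8, 3, 224, 224)                         # fake batch of frontal X-rays
  targets = torch.randint(0, 2, (8, NUM_FINDINGS)).float() # fake multi-hot labels
  loss = criterion(model(x), targets)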


Those are some sketchy statistics. The evaluation procedure is questionable (F1 against the other 4 as ground truth? A mean of means?), and the 95% CIs overlap pretty substantially. Even if their bootstrap sampling said the difference is significant, I don't believe them.

Basically, I see this as "everyone sucks, but the AI maybe sucks a little less than the worst of our radiologists, on average."
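To make concrete what I think that evaluation looks like: each reader (the model included) gets an F1 against the majority vote of the other four, and a case-level bootstrap gives a CI on the model-minus-mean-radiologist difference. Here's a rough sketch of that procedure as I read it, not the authors' code; the reader count, tie-breaking rule, and number of replicates are my assumptions:

  import numpy as np
  from sklearn.metrics import f1_score

  rng = np.random.default_rng(0)
  n_cases = 420

  # Placeholder binary labels: 4 radiologists plus the model make 5 "readers".
  radiologists = rng.integers(0, 2, size=(4, n_cases))
  model_preds = rng.integers(0, 2, size=n_cases)
  readers = np.vstack([radiologists, model_preds])

  def majority(labels):
      # Majority vote across readers; ties broken toward positive (arbitrary choice)
      return (labels.mean(axis=0) >= 0.5).astype(int)

  def f1_vs_rest(readers, i):
      # Score reader i against the majority vote of the other readers
      others = np.delete(readers, i, axis=0)
      return f1_score(majority(others), readers[i])

  diffs = []
  for _ in range(10_000):
      idx = rng.integers(0, n_cases, size=n_cases)   # resample cases with replacement
      boot = readers[:, idx]
      rad_mean_f1 = np.mean([f1_vs_rest(boot, i) for i in range(4)])
      model_f1 = f1_vs_rest(boot, 4)
      diffs.append(model_f1 - rad_mean_f1)

  lo, hi = np.percentile(diffs, [2.5, 97.5])
  print(f"95% bootstrap CI, model F1 minus mean radiologist F1: [{lo:.3f}, {hi:.3f}]")

The thing to notice is how much of the result hinges on those choices: the "ground truth" is itself a vote among noisy readers, and averaging per-radiologist F1s before differencing hides the spread between the best and worst radiologist.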



