My stack trace of investigation:
- The model is good, we just need to get the doctors to trust the model.
- The model is good, we need to figure out how to build an informed trust in the model (so as to avoid automation bias).
- The model is good, we need informed trust, but we can't tackle the trust issue without first figuring out a realistic deployment scenario.
- The model is good, we need informed trust, we need a realistic deployment scenario, but there are some infrastructural issues that make deployment incredibly difficult.
After painstaking work with a real-life EHR system, I sanity-checked model inference against a realistic deployment scenario.
- Holy crap, the model is bad and not at all suitable for deployment. 0.95 AUC, subject of a previous publication, and fails on really obvious cases.
My summary so far of "why?": the assumptions going into model training are wildly out of sync with the assumptions of deployment. It's "Hidden Technical Debt in Machine Learning Systems" on steroids.
Underspecification Presents Challenges for Credibility in Modern Machine Learning, D'Amour et al.
"The difference between practice and theory is greater in practice than in theory."
- Benjamin Brewster, 1882
Yes, it's a challenge, especially with vision models, but it's doable. Health care models I've worked on have been put into production, and they just need to be monitored to remain effective.
I had it drilled into my brain that I really shouldn't trust anything except the average validation score of a K-fold cross-validated model (preferably with a high K) when trying to get an idea of how well my ML algorithm performs on unseen data. Apparently most people in my field (NLP) did not have this drilled into their heads. This is partly why NLP is filled with unreproducible scores: the magical score, if it was ever real, was only found on the train/test split from seed #3690398.
As far as I'm concerned, if you didn't cross-validate, the test set score is basically useless.
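The point about averaging over folds rather than trusting one lucky split can be sketched in a few lines. This is a hypothetical illustration: the dataset, the majority-vote "model", and the fold count are all stand-ins, not anything from the thread.

```python
# Toy K-fold cross-validation: shuffle once, split into k folds, score each
# fold, and report the average rather than any single split's score.
import random
import statistics

def kfold_scores(data, labels, train_and_score, k=10, seed=0):
    """Return one held-out score per fold."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = set(folds[i])
        train = [(data[j], labels[j]) for j in idx if j not in held_out]
        test = [(data[j], labels[j]) for j in folds[i]]
        scores.append(train_and_score(train, test))
    return scores

def majority_baseline(train, test):
    """Stand-in 'model': predict the most common training label."""
    guess = statistics.mode([y for _, y in train])
    return sum(1 for _, y in test if y == guess) / len(test)

data = list(range(100))
labels = [x % 2 for x in data]  # balanced toy labels
scores = kfold_scores(data, labels, majority_baseline, k=10)
print(f"mean={statistics.mean(scores):.2f}, spread={min(scores):.2f}-{max(scores):.2f}")
```

The spread across folds is the part a single train/test split hides: a score that only exists on one seed's split shows up here as high fold-to-fold variance.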
My view is that all high value production models should include out of distribution detection, uncertainty quantification, and other model specific safeguards (like self-consistency) to confirm that the model is only being used to predict on data it is competent to handle.
But it's an interesting point, e.g. if a CNN works better on in-distribution data, but a Gaussian process is better at providing a confidence estimate for OOD points (if it does), a hybrid model is possible.
A skilled modeller can reduce variance using domain specific tricks that are more powerful at variance reduction than cross validation. But still cross val is usually good to use as well.
"Ng was responding to a question about why machine learning models trained to make medical decisions that perform at nearly the same level as human experts are not in clinical use"
"It turns out," Ng said, "that when we collect data from Stanford Hospital, then we train and test on data from the same hospital, indeed, we can publish papers showing [the algorithms] are comparable to human radiologists in spotting certain conditions."
But, he said, "It turns out [that when] you take that same model, that same AI system, to an older hospital down the street, with an older machine, and the technician uses a slightly different imaging protocol, that data drifts to cause the performance of AI system to degrade significantly. In contrast, any human radiologist can walk down the street to the older hospital and do just fine."
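The drift Ng describes (a different machine, a slightly different protocol) is exactly what input-statistics monitoring is for. Here is a hedged sketch: a real deployment would use proper two-sample tests (KS, PSI) over many features, while this toy version just flags a batch whose mean strays from the training baseline. All numbers and the threshold are illustrative assumptions.

```python
# Toy drift alarm: compare live batch statistics to a training baseline.
import statistics

def drift_alarm(baseline, live_batch, z_max=2.0):
    """True if the live batch mean is more than z_max baseline stdevs away."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(live_batch) - mu) / sigma > z_max

baseline = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50]  # stand-in intensity stats
same_protocol = [0.50, 0.49, 0.51]
older_scanner = [0.80, 0.85, 0.78]               # different imaging protocol
print(drift_alarm(baseline, same_protocol))  # False
print(drift_alarm(baseline, older_scanner))  # True
```

The point is that the degradation Ng mentions is detectable before the model silently fails: the inputs themselves announce that the deployment no longer matches training.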
The only slight consolation I find is that this time, when the fog clears, it's not going to be a complete waste: we're going to have much improved data engineering processes and data gathering methodologies, some scale improvements, improved data parallelism, and a further sizeable portion of the research field will be cut off, put into the "that's not real AI" category, and used in production software. There will be doom mongers, but if we come out of this with a much more professionalized interaction between software and the physical world, then it was all still worth it.
Come back again in 15 years and we'll find a new generation taking yet another crack at this building on the missteps of the now.
We're also working in a domain where it's obvious that this should work. We aren't even using the latest and greatest ML algorithms; I'm pretty sure what we are using is a cobbled-together stack from a few years ago, probably the latest and greatest from half a decade ago when ML was just kicking off.
But holy shit, there are so many interconnecting and annoying bits in the non-ML part of the stack (where I am). Our codebase has gotten rather messy (for understandable reasons) trying to negotiate leaky abstractions between different clients' needs and international standards (and we're only in 3 countries). And we have a very broken data pipeline (it works well enough to get the job done, but I don't sleep well at night) for making sure there are good pulls for the ML engineers to deal with -- and this is code written by folks who should know better about concepts like data gravity; when you're doing it hastily on startup timescales with startup labor, it's (understandably) not going to come out pretty. And all of this is why I haven't even had time to poke into the AI bits, not even stand up an instance for local dev.
Supposedly our competitors aren't even using real AI, just Mechanical Turked stuff. Yeah. Of course. The messy domain of dealing with these human systems is bad enough to sink a ton of money before you even get to the point where you have enough left to buy some expensive data scientists and ML engineers.
I am still not convinced about an ML winter, as ML has found its killer app in advertising (previous AI generations didn't find an equivalent cash cow).
Also: why don't they just specify that this ML model has been trained with this type of medical equipment? Couldn't they make it part of the SLA to use the same type of equipment in the field as was used for the training images?
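The SLA idea above could be enforced mechanically at inference time: refuse to run unless the image metadata matches the equipment and protocol the model was trained on. The sketch below is an assumption-heavy illustration; the field names loosely mimic DICOM equipment attributes but are invented here, as are the allowed values.

```python
# Illustrative equipment/protocol gate for the SLA idea: inference is only
# allowed when metadata matches what the model was trained on.
TRAINED_ON = {
    "manufacturer": {"Acme Imaging"},          # hypothetical vendor
    "model": {"Scanner-3000"},                 # hypothetical scanner model
    "protocol": {"chest-pa-standard"},         # hypothetical imaging protocol
}

def check_equipment(metadata):
    """Return a list of SLA violations; an empty list means in-spec."""
    violations = []
    for field, allowed in TRAINED_ON.items():
        value = metadata.get(field)
        if value not in allowed:
            violations.append(f"{field}={value!r} not in {sorted(allowed)}")
    return violations

ok = {"manufacturer": "Acme Imaging", "model": "Scanner-3000",
      "protocol": "chest-pa-standard"}
bad = {"manufacturer": "Acme Imaging", "model": "Scanner-1000",
       "protocol": "chest-pa-standard"}
print(check_equipment(ok))   # [] -> safe to run inference
print(check_equipment(bad))  # scanner model violation -> block or escalate
```

This doesn't solve drift within the approved equipment, but it turns the "same equipment as training" clause from a contract sentence into a hard precondition.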
In the research world we are still working on how to get broad optimizations, and we treat problems like they are short-tailed. Real-world datasets (especially in medicine, which is what Ng works on a lot) are long-tailed. Look at common datasets like CelebA and you'll find a ton of light-skinned people and other limited attributes (I pick this example because it's obvious, but most datasets have these attribute problems, and bias doesn't necessarily have to do with race or even human attributes; it can be words or just types of events. Race is simply easy to understand and easier to talk about). For the most part we don't really care about that in the research world because we're still trying to optimize performance on these small-world datasets (there are people working on long-tail problems, and this work is getting more popular in generative face models because we've gotten good at the restricted problem: see StyleGAN2-ADA). But there seem to be plenty of people who think they can just take a CelebA-pretrained model and use it for production in the real world. The data is already bad to begin with. And since we haven't tested on better datasets, who knows whether our model itself will exaggerate those biases, even if we did have better data. (There are many places bias can appear in the pipeline.)
The problem is that in the ML community we're just at the stage where we're bad at performance, and so we limit how complex our environment is (e.g. if you operate on mostly white people, your world is less complex than one with a more diversified spectrum of skin tones). We're getting there, though. That's what this talk about algorithmic bias is often about. But if you just take a model from a small world, you shouldn't expect similar results in anything but the exact same small world that the researchers tested on.
TLDR: research is small world, production is large world. Don't expect a small world model to work as effectively in the large world.
But the headline "Artificial intelligence fails to adapt" is more outrageous.
> “All of AI, not just healthcare, has a proof-of-concept-to-production gap,” he says. “The full cycle of a machine learning project is not just modeling. It is finding the right data, deploying it, monitoring it, feeding data back [into the model], showing safety -- doing all the things that need to be done [for a model] to be deployed.”
Healthcare has some special needs with regard to what "real world use" means, especially the "showing safety" part he mentions.
That's way different from some recommendation engine application, where it doesn't really matter whether your ML approach just creates a bunch of bad feedback loops and people get sent down rabbit holes of bad music. No lives are at stake in that sense, but the recommendation engine still "performs poorly on out of sample inputs" and is, so to speak, "a long way from real world use". It's just that either nobody notices, or even if they do, again, no lives are at stake, and so it's OK that we're shipping banana software (i.e. software that ripens in the hands of customers).