Andrew Ng says AI has a proof-of-concept-to-production gap (ieee.org)
109 points by TN567 8 days ago | 39 comments

Two cents as an MD-(CS)PhD student studying what I've heard referred to as "the last mile problem."

My stack trace of investigation:

- The model is good, we just need to get the doctors to trust the model.

- The model is good, we need to figure out how to build an informed trust in the model (so as to avoid automation bias).

- The model is good, we need informed trust, but we can't tackle the trust issue without first figuring out a realistic deployment scenario.

- The model is good, we need informed trust, we need a realistic deployment scenario, but there are some infrastructural issues that make deployment incredibly difficult.

After painstaking work with a real-life EHR system, I sanity-checked model inference against a realistic deployment scenario.

- Holy crap, the model is bad and not at all suitable for deployment. 0.95 AUC, subject of a previous publication, and fails on really obvious cases.

My summary so far of "why?": assumptions going into model training are wildly out of sync with the assumptions of deployment. It's "Hidden Tech Debt in ML" [1] on steroids.

[1] https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f...

You've probably seen it, but a more recent, related paper (that I think has some of the same authors) about inherent features of modern ML that make models so fragile, even if they test OK:

Underspecification Presents Challenges for Credibility in Modern Machine Learning, D'Amour et al., https://arxiv.org/abs/2011.03395

I had not seen it yet, excited to read it! Thanks!

My question: why don't they just specify that this ML model has been trained on this type of medical equipment? Couldn't they make it part of the SLA to use the same type of equipment in the field as that used to obtain the training images?

"In theory, there is no difference between theory and practice. In practice, there is."

- Benjamin Brewster, 1882

Or its variant:

"The difference between practice and theory is greater in practice than in theory."

Clickbait headline. He did not say it's a long way from use, but instead that it's challenging to ensure models translate well to real world conditions.

Yes, it's a challenge, especially with vision models, but it's doable. Health care models I've worked on have been put into production, and they just need to be monitored to remain effective.

Ok, we've replaced the title with something he actually said.

While this is indeed clickbait, as others have mentioned, I am consistently shocked by how rarely cross-validation, the most common technique for checking that a trained model works on unseen data, is used in the real world.

I had it drilled into my brain that I really shouldn't trust anything except the average validation score of a (preferably high-K) K-fold cross-validated model when trying to get an idea of how well my ML algorithm performs on unseen data. Apparently most people in my field (NLP) did not have this drilled into their heads. This is partly why NLP is filled with unreproducible scores (the magical score, if it ever existed, was only found on the train/test split from seed #3690398).

As far as I'm concerned, if you didn't cross-validate, the test set score is basically useless.
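The averaging the parent describes can be sketched in a few lines of plain NumPy. The `fit` and `score` callables here are placeholders for whatever model and metric you actually use, not anyone's real pipeline:

```python
import numpy as np

def kfold_scores(X, y, fit, score, k=5, seed=0):
    """Average validation score over k shuffled folds.
    `fit(X, y)` returns a model; `score(model, X, y)` evaluates it."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[val], y[val]))
    return float(np.mean(scores))

# Toy usage: a "model" that just predicts the training-set mean.
fit = lambda X, y: float(y.mean())
score = lambda m, X, y: -float(np.mean((y - m) ** 2))  # negative MSE
X = np.arange(100.0).reshape(-1, 1)
y = X.ravel() * 0.5 + 1.0
avg = kfold_scores(X, y, fit, score, k=5)
```

The point is that `avg` is what you report, not the score on one lucky split.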

The point of the article is more that even if all of your testing and validation is rigorous and the performance looks great, trivial changes in the production data can break your model anyway.

My view is that all high value production models should include out of distribution detection, uncertainty quantification, and other model specific safeguards (like self-consistency) to confirm that the model is only being used to predict on data it is competent to handle.
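A minimal sketch of such a safeguard, assuming a per-feature z-score gate (a real deployment would use something stronger, e.g. Mahalanobis distance or a dedicated OOD detector):

```python
import numpy as np

class OODGate:
    """Crude out-of-distribution check: refuse to predict on inputs
    whose features are many standard deviations from the training
    distribution. Illustrative only, not a production detector."""
    def __init__(self, X_train, max_z=4.0):
        self.mu = X_train.mean(axis=0)
        self.sigma = X_train.std(axis=0) + 1e-9  # avoid divide-by-zero
        self.max_z = max_z

    def in_distribution(self, x):
        z = np.abs((x - self.mu) / self.sigma)
        return bool(np.all(z < self.max_z))

gate = OODGate(np.array([[0.0], [1.0], [2.0], [3.0]]))
```

Downstream code would only call the model when `gate.in_distribution(x)` is true, and otherwise defer to a human.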

All that is only needed because incremental learning algorithms don't really work all that well. It's a dirty secret in the field that we still don't have good answers for catastrophic forgetting in neural networks (the best candidate incremental learner as of right now), and the other alternatives are far worse.
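The forgetting effect is easy to reproduce with any incremental learner. A toy sketch with scikit-learn's `SGDClassifier` (contrived data: task B simply flips task A's labels, which is the worst case):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
XA = rng.normal(size=(200, 2))
yA = (XA[:, 0] > 0).astype(int)   # task A: sign of feature 0
XB, yB = XA.copy(), 1 - yA        # task B: the same inputs, labels flipped

clf = SGDClassifier(random_state=0)
for _ in range(10):               # learn task A incrementally
    clf.partial_fit(XA, yA, classes=[0, 1])
acc_before = clf.score(XA, yA)

for _ in range(20):               # then see only task B
    clf.partial_fit(XB, yB)
acc_after = clf.score(XA, yA)     # task A is "forgotten"
```

After the task-B updates, task-A accuracy collapses; nothing in the learner preserves the old decision boundary.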

This is good to have but it doesn't really address the problem of predictive accuracy in the presence of nonstationarity. The safeguards just help us switch off the model at the right time. We're still stuck with no capability in the new environment.

I think knowing what you don't know is still a pretty big win. It can help people trust the models in cases they do work, and it can serve as a diagnostic for why it fails in certain circumstances.

This sounds an awful lot like Gaussian processes, which are fairly common in research environments. I don't know how common it is to deploy Gaussian processes in the real world, but I see published papers integrating them into other models all the time. The gist is that instead of input -> prediction you get input -> prediction + sigma (every prediction is given as a Gaussian distribution).
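A small sketch of that prediction-plus-sigma behavior with scikit-learn's GP regressor (toy 1-D data; the key property is that sigma grows for inputs far from the training range):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Fit a GP on a simple curve sampled over [0, 5].
X = np.linspace(0, 5, 20).reshape(-1, 1)
y = np.sin(X).ravel()
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, y)

# Query one in-range point and one far outside the training range:
# the second sigma should be much larger.
mean, std = gp.predict(np.array([[2.5], [50.0]]), return_std=True)
```

That growing sigma is exactly the signal you'd use to decline a prediction instead of silently returning garbage.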

For common neural network models, the output probability has no meaning for out of distribution inputs, so you need to do an ensemble or some other method to get at the actual model confidence. I don't know enough about Gaussian processes to know if they have any limitations like that.

But it's an interesting point, e.g. if a CNN works better on in-distribution data, but a Gaussian process is better at providing a confidence estimate for OOD points (if it does), a hybrid model is possible.
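The ensemble idea mentioned above can be shown without any neural network: fit several models on bootstrap resamples and use their disagreement as a crude confidence signal. A NumPy sketch with small polynomial "models" (toy setup, not a real deep ensemble):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 50)
y = np.sin(3 * X)

# Fit 10 degree-5 polynomials, each on a bootstrap resample.
preds = []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))      # bootstrap resample
    coefs = np.polyfit(X[idx], y[idx], deg=5)  # one "model"
    preds.append(np.polyval(coefs, np.array([0.0, 4.0])))
preds = np.array(preds)

# Disagreement at x=0 (in-distribution) vs x=4 (far off-distribution).
spread = preds.std(axis=0)
```

In-distribution, the ensemble members agree; off-distribution, they diverge wildly, which is the confidence signal a single model's output probability doesn't give you.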

No, this does not solve the problem he describes in the article. You can have a great cross-validation score and still struggle on unseen data if that data is dissimilar to your train set, like X-ray scans produced by a different machine. There are numerous other examples. CNNs on images, for instance, are famously known to disintegrate on images + white noise (which look the same to a human).

For the last example, could they train on "images + white noise" instead?
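That's a standard data-augmentation trick, and it's worth hedging: it helps specifically with the white-noise failure mode, not with distribution shift or adversarial inputs in general. A minimal sketch:

```python
import numpy as np

def add_noise(batch, sigma=0.1, rng=None):
    """Gaussian-noise augmentation for a batch of images with pixel
    values in [0, 1]. A partial mitigation for the noise failure
    mode, not a general robustness fix."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = batch + rng.normal(0.0, sigma, size=batch.shape)
    return np.clip(noisy, 0.0, 1.0)  # stay in the valid pixel range

images = np.full((4, 8, 8), 0.5)  # a fake batch of gray "images"
augmented = add_noise(images, sigma=0.1, rng=np.random.default_rng(0))
```

During training you'd apply this to each batch so the model sees noisy variants of every example.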

I think cross-validation is a powerful tool, but it's not always necessary and is prone to abuse, such as overfitting to the test set.

A skilled modeller can reduce variance using domain-specific tricks that are more powerful at variance reduction than cross-validation. But cross-validation is usually still good to use as well.

I've said for a long time that we are currently in the fitbit stage of AI.

"Ng was responding to a question about why machine learning models trained to make medical decisions that perform at nearly the same level as human experts are not in clinical use"

"“It turns out,” Ng said, “that when we collect data from Stanford Hospital, then we train and test on data from the same hospital, indeed, we can publish papers showing [the algorithms] are comparable to human radiologists in spotting certain conditions.”

But, he said, “It turns out [that when] you take that same model, that same AI system, to an older hospital down the street, with an older machine, and the technician uses a slightly different imaging protocol, that data drifts to cause the performance of AI system to degrade significantly. In contrast, any human radiologist can walk down the street to the older hospital and do just fine."

It may also be a different shared group culture, different lessons learned and previous screwups passed down to the next generations. I also saw this in aerospace.

AI has a much larger "An Executive heard it in a marketing pitch from an external consultant or at an AI conference" to "Engineering implementation" gap.

As somebody who has intentionally not published low-quality work, it amuses me to see people finally recognizing that a lot of the most highly regarded papers just don't have any real impact in the real world of life sciences/health research/medicine. What doesn't amuse me is the career success these folks have had by selling overfit, undergeneralizing models.

Winter is coming

Yeah hate to say it, but you're probably right. AI hype is the tech equivalent of a land war in Asia.

The only slight consolation I find is that this time, when the fog clears, it's not going to be a complete waste: we'll come out of it with much improved data engineering processes, data gathering methodologies, some scale improvements, and better data parallelism, and a further sizeable portion of the research field will be cut off, put into the "that's not real AI" category, and used in production software. There will be doom mongers, but if we come out of this with a much more professionalized interaction between software and the physical world, then it was all still worth it.

Come back again in 15 years and we'll find a new generation taking yet another crack at this building on the missteps of the now.

100%. And I work for a company that (I think? I've never actually run the model myself) deploys an AI model. It's pretty good. It hits 95% accuracy. Solves a real pain point for humans.

It's also totally a domain where it's obvious that this should work. We aren't even using "latest and greatest" ML algos; I'm pretty sure what we're using is really cobbled-together ML from a few years ago, probably "latest and greatest" from half a decade ago when ML was just kicking up.

But holy shit, there are so many interconnecting and annoying bits in the non-ML part of the stack (where I am). Our codebase has gotten rather messy (for understandable reasons) trying to negotiate leaky abstractions between different clients' needs and international standards (and we're only in 3 countries). And we have a very broken data pipeline (it works well enough to get the job done, but I don't sleep well at night) for making sure there are good pulls for the ML engineers to deal with -- and this is code written by folks who should know better about concepts like data gravity; when you're doing it hastily on startup timescales and startup labor, it's (understandably) not going to come out pretty. All of this is why I haven't even had time to poke into the AI bits, not even stand up an instance for local dev.

Supposedly our competitors aren't even using real AI, just mechanically turked stuff. Yeah. Of course. Just the real messy domain of dealing with these human systems is bad enough to sink a ton of money without even getting to the point where you have enough money to buy some expensive data scientists and ML engineers.

"AI hype is the tech equivalent of a land war in Asia" - that's quotable! I did a web search on this phrase with several search engines, and it appeared just here.

I am still not convinced about an ML winter, as ML has found its killer app in advertising (previous AI generations didn't find an equivalent cash cow).

Also: why don't they just specify that this ML model has been trained with this type of medical equipment? Couldn't they make it part of the SLA to use the same type of equipment in the field as that of the training images?

Why winter instead of people filling in the gaps that exist in their processes?

Maybe in the West. I would have a hard time believing Chinese researchers feel this way.

And it will be the second AI winter.

Well no shit, and it is a problem that people don't know this.

In the research world we are still working on how to get broad optimizations, and we treat problems like they are short-tailed. Real-world datasets (especially in medicine, which is what Ng works on a lot) are long-tailed. Look at a common dataset like CelebA and you'll find a ton of light-skinned people and other limited attributes (I'm picking this example because it's obvious, but most datasets have these attribute problems, and bias doesn't necessarily have to do with race or even human attributes; it can be words or even just types of events. Race is simply easy to understand and easier to talk about). For the most part we don't really care about that in the research world because we're still trying to optimize performance on these small-world datasets (there are people working on long-tail problems; this work is getting more popular in generative face models because we've gotten good at the restricted problem: see StyleGAN2-ADA). But there seem to be plenty of people who think they can just take a CelebA-pretrained model and use it for production in the real world. The data is already bad to begin with, and since we haven't tested on better datasets, who knows whether our model itself would exaggerate those biases even if we did have better data. (There are many places bias can appear in the pipeline.)

The problem is that in the ML community we're still at the stage where we're bad at performance, so we limit how complex our environment is (e.g. if you operate on mostly white people, your world is less complex than one with a more diverse spectrum of skin tones). We're getting there, though. That's what the talk about algorithmic bias is often about. But if you just take a model from a small world, you shouldn't expect similar results in anything but the exact same small world the researchers tested on.

TLDR: research is small world, production is large world. Don't expect a small world model to work as effectively in the large world.

I wish the media would prefer using machine learning (ML) over artificial intelligence (AI) in their headlines.

But the headline "Artificial intelligence fails to adapt" is more outrageous.

The point is spot on: many AI systems would be more expensive to maintain in practice than the problems they solve are worth. People are enamored of AI, but in my experience the practicality of applying it to many business problems is questionable. Better off just hiring a human in many cases.

Title seems somewhat misleading. He said ML often performs poorly on out of sample inputs. Seems different from being “a long way from real world use.” I don’t think anyone would argue ML is not being used in the real world!

Yes and no. Let's quote a bit more:

> “All of AI, not just healthcare, has a proof-of-concept-to-production gap,” he says. “The full cycle of a machine learning project is not just modeling. It is finding the right data, deploying it, monitoring it, feeding data back [into the model], showing safety—doing all the things that need to be done [for a model] to be deployed

Healthcare has some special needs in regards to what "real world use" means. Especially the "showing safety" part he mentions.

That's way different from some recommendation engine application, where it doesn't really matter whether your ML approach just creates a bunch of bad feedback loops and sends people into rabbit holes of bad music. No lives are at stake in that sense, but the recommendation engine still "performs poorly on out of sample inputs" and is, so to speak, "a long way from real world use". It's just that either nobody notices, or even if they do, again, no lives are at stake, and so it's OK that we're getting banana software (i.e. software that ripens in the hands of customers).

An intermediate case between life-and-death healthcare and who-cares music recommendations is machine translation. Thanks to advances in AI and to companies like Google and DeepL providing translation services for free, MT is now being used widely in the real world. Sometimes it performs miraculously, enabling effective communication and cooperation among people without a common language, and sometimes it fails horribly.

They are trying to generalize Andrew Ng's comment about applying AI to healthcare to all applications of AI. When will these journalists learn to report properly?

From the last paragraph: "This gap between research and practice is not unique to medicine, Ng pointed out, but exists throughout the machine learning world.

“All of AI, not just healthcare, has a proof-of-concept-to-production gap,” he says."

Only because it's been wildly over-hyped and tech journalists + startups have over-promised on it. It's perfectly effective in its place.

Andrew Ng himself generalizes this as reported at the end of the article: “All of AI, not just healthcare, has a proof-of-concept-to-production gap,” he says.
