I feel like ML is slowly transitioning to a more stable phase, where companies are no longer just green-lighting any ML model to be put in production in the hopes that they will bring unreasonably effective results without proof.
I think this is a good thing, but it is forcing us to ask the question of "how do I know if my model is good enough to replace the current system?". And I don't think that's something we've seriously considered from a business perspective until now.
I discuss a bit about that in the linked blogpost.
Well, not really. For academic purposes, sure. Better standard metric given the dataset = better model performance. But for industry purposes, it can get really tricky.
Remember, at the end of the day the only performance that matters is the dollar value of putting those models in production. How do you measure that? There are infrastructure costs, there's the risk of model performance degrading over time, there are short-term effects on user satisfaction and long-term effects on user retention. Which of those should you consider for _your_ model?
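To make that concrete, here's a back-of-the-envelope sketch of a risk-adjusted dollar value for a deployed model. Every number below is a made-up assumption; the point is only the shape of the calculation, not the figures:

```python
# Toy risk-adjusted value estimate for putting a model in production.
# All numbers are invented assumptions for illustration.
infra_cost_per_year = 50_000       # serving, monitoring, on-call, etc.
expected_revenue_lift = 120_000    # lift if the model keeps performing well
prob_degradation = 0.3             # chance the model goes stale this year
lift_if_degraded = 40_000          # much smaller lift if it degrades unnoticed

# Expected lift, weighted by the degradation risk
expected_lift = (1 - prob_degradation) * expected_revenue_lift \
    + prob_degradation * lift_if_degraded
net_value = expected_lift - infra_cost_per_year
print(net_value)  # → 46000
```

Even a crude model like this forces you to write down the degradation risk and the infrastructure cost, which a raw accuracy number never will.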
Say you add a dynamic pricing algorithm to your site, one that changes item prices based on demand. You try it out with A/B testing to see whether it lifts profits compared to your current MSRP & manual tweaks strategy. That should be enough, right?
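The A/B readout itself might look something like this minimal sketch: synthetic per-user revenue for control and treatment, a difference in means, and a normal-approximation confidence interval. The data, the effect size, and the 1.96 cutoff are all illustrative assumptions:

```python
# Hypothetical A/B readout: per-user revenue under control (MSRP + manual
# tweaks) vs treatment (dynamic pricing). Synthetic data for illustration.
import math
import random

random.seed(0)
control = [random.gauss(mu=10.0, sigma=4.0) for _ in range(5000)]
treatment = [random.gauss(mu=10.4, sigma=4.0) for _ in range(5000)]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

lift = mean(treatment) - mean(control)
# Normal-approximation standard error of the difference in means
se = math.sqrt(var(control) / len(control) + var(treatment) / len(treatment))
ci_low, ci_high = lift - 1.96 * se, lift + 1.96 * se
print(f"lift per user: {lift:.2f} (95% CI {ci_low:.2f}..{ci_high:.2f})")
```

Note what this does and doesn't tell you: the confidence interval bounds the short-term profit lift during the experiment, and says nothing about the reputation effects that show up months later.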
Well, maybe, but what about the mid-term effects on user satisfaction? Most users dislike seeing prices change too often; it makes them feel cheated. Also, your dynamic pricing algorithm may get users to overpay because they've heard that your site has fair prices, so they trust you without checking alternatives. But once you put the algorithm in place, maybe in a couple of months you _won't_ have the reputation of being a cheap site anymore.
Your model was leveraging your site's good reputation by overcharging the people who trusted your prices, slowly damaging that reputation. Long term, you're losing customers.
How do you account for that? There are things you can try, but I wouldn't say this is "pretty easy".
Let's think of a different kind of example. You have experts doing quality control for the cheese you produce. They do a good job, but they're expensive, so you reduce the amount of manual QA you do, by running all cheese through a CV QA model.
If your CV model has a good F1 score at classifying defective cheese, you're good to go, right? Well, maybe.
Now a couple of months go by and the company makes some small changes to the lighting conditions in the factories. The people making those decisions don't know that this can affect your quality control algorithm, so they don't let you know. By the time people realize your algorithm is performing poorly, you've already sent a lot of low-quality cheese to be sold. What's the monetary effect of that? How could it have been estimated from the beginning? Experts are more robust to changes in details like the lighting conditions in the factory, changes in the cheese manufacturing process, etc. So comparing the performance of your CV QA system _at training time_ against the performance of experts as just F1 scores misses all of the potential risks of automating the process, and the fact that one of those systems degrades over time while the other doesn't.
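One thing you _can_ do is monitor the model's inputs, not just its outputs. Here's a toy sketch that flags when a simple statistic of incoming images (mean pixel brightness, say) drifts away from the training-time baseline; the choice of statistic, the z-threshold, and all the numbers are assumptions for illustration:

```python
# Toy input-drift monitor: compare a simple statistic (mean pixel brightness)
# of recent production images against the training-time baseline.
# Thresholds and numbers are illustrative assumptions, not recommendations.

def mean(xs):
    return sum(xs) / len(xs)

def drift_alert(train_brightness, recent_brightness, z_threshold=3.0):
    """Flag when recent inputs drift away from the training distribution."""
    mu = mean(train_brightness)
    sd = (sum((x - mu) ** 2 for x in train_brightness)
          / (len(train_brightness) - 1)) ** 0.5
    # z-score of the recent batch mean under the training distribution
    z = abs(mean(recent_brightness) - mu) / (sd / len(recent_brightness) ** 0.5)
    return z > z_threshold

# Training-time brightness vs a batch after the (hypothetical) lighting change
baseline = [120 + (i % 10) for i in range(200)]    # stable factory lighting
after_change = [95 + (i % 10) for i in range(50)]  # dimmer factory
print(drift_alert(baseline, after_change))  # → True
```

This wouldn't tell you the model is wrong, but it would have flagged the lighting change before months of bad cheese went out the door.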
These are the kinds of things I talk about in the blogpost.
Part of the point is that most ML education stops when you have an evaluation metric. But that evaluation metric is not a business metric, and it's also only valid for your current dataset. Being able to predict how your model will degrade over time and whether just retraining on a schedule can fix that easily enough is something that's usually left out of the curriculum, and therefore most new data scientists don't really think about it.
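A retrain-on-degradation policy can be as simple as this sketch: evaluate the live model on a small freshly-labeled sample every week and retrain the first time the metric drops below a floor. The weekly scores and the 0.85 floor are invented for illustration:

```python
# Sketch of a retrain-on-degradation policy: score the live model weekly on a
# small freshly-labeled sample and retrain when the metric drops too far.
# The scores and threshold below are made up for illustration.

F1_FLOOR = 0.85  # assumed minimum acceptable F1

def weeks_until_retrain(weekly_f1_scores, floor=F1_FLOOR):
    """Return the index of the first week that triggers a retrain, or None."""
    for week, f1 in enumerate(weekly_f1_scores):
        if f1 < floor:
            return week
    return None

# Hypothetical slow degradation after deployment
scores = [0.92, 0.91, 0.90, 0.88, 0.86, 0.84, 0.83]
print(weeks_until_retrain(scores))  # → 5
```

The hard part isn't the loop, it's budgeting for the ongoing labeling it requires and deciding whether retraining actually recovers the lost performance, which is exactly what the curriculum skips.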