
Dear AI startups: Your ML models are dying quietly - jimmyechan
https://sanau.co/ML-models-are-dying-quietly
======
ohazi
I don't see how this has anything to do with AI or ML. It's a great
description of why you might want to prefer strongly typed languages, avoid
"the data _is_ the schema" key-value monstrosities like MongoDB, and maybe
think about writing some sanity-check tests that need to pass before
deployment, though.

No system should ever fail silently if a required field suddenly goes missing
or has the wrong type/unit.
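
For illustration, a minimal sketch of that kind of pre-deployment sanity
check (the field names and types here are made up):

    # Hypothetical required fields and types; raise loudly instead of
    # letting bad data flow downstream.
    EXPECTED_SCHEMA = {
        "user_id": int,
        "weight_kg": float,
        "email": str,
    }

    def validate_record(record: dict) -> None:
        for field, expected_type in EXPECTED_SCHEMA.items():
            if field not in record:
                raise ValueError(f"required field missing: {field}")
            if not isinstance(record[field], expected_type):
                raise TypeError(
                    f"{field}: expected {expected_type.__name__}, "
                    f"got {type(record[field]).__name__}"
                )

    validate_record({"user_id": 1, "weight_kg": 72.5, "email": "a@b"})       # passes
    validate_record({"user_id": 1, "weight_kg": "72.5 kg", "email": "a@b"})  # raises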

~~~
nabla9
The problem with numerical programming is that all data has the same type. If
you access the wrong row, multiply across the wrong axis, or call a function
with the wrong parameters, the result is semantically wrong, but it can still
run and the system can still learn something.
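
A quick numpy illustration of this (shapes chosen deliberately so nothing can
fail): reducing over the wrong axis produces an array of the same shape, so
even downstream shape checks pass.

    import numpy as np

    # 3 samples x 3 features; to numpy it is all just float64.
    X = np.arange(9, dtype=float).reshape(3, 3)

    col_means = X.mean(axis=0)  # intended: per-feature means
    row_means = X.mean(axis=1)  # wrong axis, but also shape (3,)

    # Both "centerings" run without error; only the first is meaningful.
    print(X - col_means)  # correct
    print(X - row_means)  # silently wrong, yet a perfectly valid array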

~~~
amelius
Isn't there a type system that can help with this?

E.g., give rows a type, give columns a type, and then when you multiply two
matrices, check whether the type of the columns of the first matrix matches
the type of the rows of the second matrix.

Also, the type system could use units, just as they are used in physics.
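
Something along those lines can at least be approximated at runtime. A
hypothetical sketch, with each matrix carrying names for its row and column
dimensions (libraries like xarray take a similar named-dimension approach):

    import numpy as np

    class LabeledMatrix:
        """Sketch: a matrix tagged with row/column dimension names."""
        def __init__(self, data, rows: str, cols: str):
            self.data = np.asarray(data)
            self.rows, self.cols = rows, cols

        def __matmul__(self, other: "LabeledMatrix") -> "LabeledMatrix":
            # The left operand's columns must mean the same thing as
            # the right operand's rows.
            if self.cols != other.rows:
                raise TypeError(f"dimension mismatch: {self.cols!r} @ {other.rows!r}")
            return LabeledMatrix(self.data @ other.data, self.rows, other.cols)

    users = LabeledMatrix(np.ones((4, 3)), rows="user", cols="feature")
    weights = LabeledMatrix(np.ones((3, 2)), rows="feature", cols="score")

    ok = users @ weights   # user x score
    bad = weights @ users  # raises TypeError before any math happens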

~~~
nabla9
You would like a 'unit system' that keeps track of units of measure in
general, with the ability to use concepts like dimensional homogeneity to
avoid semantic errors while programming. Alas, they don't exist as far as I
know.

~~~
alkonaut
F# has units of measure. A frontend is (usually) not F#, but even in a more
loosely typed system such as your typical web frontend, it should be possible
to represent scalars as tuples of a number and an identifier, e.g. { qty: 12,
unit: "kgs" }.
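
A small sketch of that idea (in Python for brevity; the shape is the same in
a frontend language), refusing to combine mismatched units:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Quantity:
        """A number plus a unit tag, per the { qty, unit } idea above."""
        qty: float
        unit: str

        def __add__(self, other: "Quantity") -> "Quantity":
            if self.unit != other.unit:
                raise ValueError(f"unit mismatch: {self.unit} + {other.unit}")
            return Quantity(self.qty + other.qty, self.unit)

    print(Quantity(12, "kg") + Quantity(3, "kg"))  # Quantity(qty=15, unit='kg')
    Quantity(12, "kg") + Quantity(3, "lbs")        # raises ValueError

(In Python specifically, libraries such as pint do this for real, including
conversions between compatible units.)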

------
luckyt
There are definitely a lot of opportunities for technical debt in machine
learning projects that don't exist in usual software development, which makes
careful design decisions even more important. Reminds me of this paper, which
talks about these issues and ways to avoid them:
[https://research.google.com/pubs/archive/43146.pdf](https://research.google.com/pubs/archive/43146.pdf)

~~~
jimmyechan
That's true! Thanks for sharing the paper. We'll take a look.

------
jimmyechan
Hey HN, this is an article we wrote about things to watch out for as you
develop your machine learning models and deploy them to production.

We realize that it's really important for data science, product management,
and engineering teams to discuss and ideally monitor any new data sources or
changes in the data capture and processing that feed into a machine learning
model.

~~~
ayazhan
In this article, we showed a scenario where a small change to the user-facing
interface could dramatically reduce the accuracy/performance of the machine
learning model powering the application.

------
distant_hat
This has nothing to do with startups specifically, or with established
companies. Any time the distribution of your data changes, your models need
to be retrained. The model can degrade even if the input data improves; e.g.,
if your geolocation feed had a high error rate before but has suddenly gotten
much better, you need to retrain the models.

------
j0057
The domain is flagged as malicious by ESET:

[https://www.virustotal.com/#/url/e4accf1e046c8266168b9038763...](https://www.virustotal.com/#/url/e4accf1e046c8266168b903876384f193b6457a5ff9bfecc72d47a08d1d397f7/detection)

------
nitrogen
Is it common for an ML model to be designed to make product recommendations
based on name and email, as in the example? That seems... problematic.

~~~
nostrademons
I figured that it was just for illustration, because the author couldn't think
of a better example. Some real-life examples that turn up stupidly often:

1. The model uses click-through data as an input. Your frontend engineer
moves the UI element being clicked upon to a different portion of the page for
a certain category of results. This changes the baseline click-through rate.
The model assumed this feature had a constant baseline across all results, so
the new feature value now needs to be scaled to account for the different user
behavior. Nobody thinks to do this.

2. The frontend engineer removes a seemingly-wasted HTTP fetch to reduce
latency. This fetch was actually being used to calibrate latency across
different datacenters, and was a crucial input to a data pipeline to a system
of servers (feeding the ML model) that the frontend team didn't control and
wasn't aware of.

3. The frontend engineer accidentally triggers a browser bug in IE7 (gimme a
break, it was 9 years ago) that prevents clicks from registering when RTL text
is mixed with LTR. Click-through rates decline precipitously in Arabic-
speaking nations. This is interpreted by an ML model as all results being
poorly performing in Arabic countries, so it promptly starts cycling through
results, killing ones that had shown up before with no clicks.

4. A fiber cable is cut across the Pacific. This results in high latency for
all Chinese users, which makes them abandon their sessions. An ML model
interprets this as Chinese people being less interested in the news headlines
of that day.

5. An ML model for detecting abusive traffic uses spikes in the volume of
searches for any one single query over short periods of time as a signal.
Michael Jackson dies. The model flags everyone searching for him as a bot.

6. An ML model for search suggestions uses follow-up queries as a signal. The
NYTimes crossword puzzle comes out. Everybody goes down the list of clues and
Googles them. Suddenly, [houston baseball player] suggests [bird sound] as
related.

~~~
ayazhan
Thanks nostrademons, these are great examples. You're right, name and email
are just for illustration. Would you mind if we use your feedback and some of
your examples to improve the article? If yes, should we credit your HN
account?

~~~
nostrademons
I'd actually rather that you keep them general (e.g. just talk about
clickthrough data or changing latency conditions) and don't credit my account.
The past employer in question is relatively easy to look up from my past
comment history, and while there's nothing really confidential in the
examples, stories about how they do things or how things go wrong tend to blow
up in the news, and they like the publicity only when it's positive.

~~~
ayazhan
Ok, sounds good. We'll keep it generic and won't mention the source. Thank
you for sharing! We think this is something AI companies can benefit from in
the future.

------
PeterisP
It's not really specific to machine learning - everything in the big
corporate pre-ML reporting, data analytics, business intelligence, and
management information systems domain (a mainstream field of IT systems with
decades of history and lots and lots of accumulated experience) has the same
issues. Any pipeline of business data collection and analysis tends to rely
on lots and lots of factors external to that system, and to depend on the
particular details of every business process involved.

It's just well-known things being rediscovered because people are treating
this as a new field - but everything mentioned in this article would be the
same for a company tracking some other sales-efficiency metric twenty years
ago, except that the "change in front end framework" would be a "change in
preorder sales reporting templates", causing very similar problems in your
data analysis.

------
eoinmurray92
Sanau seems like a new version of SageMaker
[https://aws.amazon.com/sagemaker/](https://aws.amazon.com/sagemaker/), where
you write code in Jupyter and it auto-converts to endpoints.

I've used these solutions, and while Jupyter is amazing (my startup
[https://kyso.io](https://kyso.io) started as a way to share notebooks), I'm
not sure deducing which cells to convert into an endpoint is the way to go -
especially since you will also need to host model files and extra data?

~~~
tixocloud
Your startup looks quite interesting and I can actually see potential for
commercial usage. How's the growth rate been?

~~~
eoinmurray92
It's awesome - we've recently launched a team version
[https://kyso.io/for-teams](https://kyso.io/for-teams) and our beta users
seem to love it.

------
massaman_yams
It's not just data format changes; model accuracy can be impacted by changes
in the distributions of values, even if their types remain the same.

That's why production ML systems should monitor for data drift, model
accuracy, and a host of other factors that may not be obvious at first. That's
part of what TensorFlow Extended does.
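
As a concrete (hypothetical) sketch of one such drift check, a two-sample KS
test comparing a feature's live distribution against its training
distribution:

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    train_vals = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training-time feature
    live_vals = rng.normal(loc=0.4, scale=1.0, size=10_000)   # same type, shifted mean

    stat, p_value = ks_2samp(train_vals, live_vals)
    if p_value < 0.01:
        print(f"feature drift detected (KS={stat:.3f}, p={p_value:.1e})")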

See also: "What’s your ML Test Score? A rubric for ML production systems"
[https://ai.google/research/pubs/pub45742](https://ai.google/research/pubs/pub45742)

------
oli5679
When you deploy a machine learning model, you need to monitor its performance
in production.

If the model's ROC-AUC falls by 0.1, the mean or standard deviation of one of
its inputs changes by more than 50%, the number of NAs for an input suddenly
increases, or the monitoring report itself dies, then the model owner should
get an alert quickly.
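
A rough sketch of those checks (the thresholds and the alert hook are
placeholders):

    import numpy as np

    def check_model_health(baseline, live, auc_baseline, auc_live, alert):
        """baseline/live map feature name -> recent values; alert is a callback."""
        if auc_baseline - auc_live > 0.1:
            alert(f"ROC-AUC dropped: {auc_baseline:.3f} -> {auc_live:.3f}")
        for name in baseline:
            b = np.asarray(baseline[name], dtype=float)
            v = np.asarray(live[name], dtype=float)
            for stat, fn in (("mean", np.nanmean), ("s.d.", np.nanstd)):
                if abs(fn(v) - fn(b)) > 0.5 * abs(fn(b)):
                    alert(f"{name}: {stat} shifted by more than 50%")
            if np.isnan(v).mean() > 2 * max(np.isnan(b).mean(), 1e-6):
                alert(f"{name}: NA rate spiked")
        # The "monitoring report dies" case needs a dead-man's switch on the
        # job that runs this function, which is outside this sketch.

    # e.g. check_model_health({"age": [30, 40, 50]}, {"age": [65, 80, 95]},
    #                         0.85, 0.70, print)  # fires both alerts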

------
tumanian
“Data pipelines die quietly” is a more appropriate name for the article based
on the example. And its true, all data pipelines need counters monitored
continuously. A simple <number of records that didnt parse> metric on grafana
would prevent this error.
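
For instance, with the Python prometheus_client (the metric name and
parse_record() here are placeholders), such a counter is a few lines, and
Grafana can alert on its rate:

    from prometheus_client import Counter, start_http_server

    # Hypothetical pipeline metric; alert when its rate spikes.
    unparsed_records = Counter(
        "records_unparsed_total",
        "Records that failed to parse in the ingestion pipeline",
    )

    def ingest(raw_record: str):
        try:
            return parse_record(raw_record)  # stand-in for the pipeline's parser
        except ValueError:
            unparsed_records.inc()  # count it instead of dying quietly
            return None

    start_http_server(8000)  # exposes /metrics for Prometheus to scrape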

------
anotheryou
Sounds like you should manage your metrics and refine your model using false
detections/classifications (which you should try to catch and measure).

------
debaserab2
Garbage in, garbage out. This has nothing to do with ML models.

------
leowoo91
That article could be a good example of modern spam.

------
BloodyLobster
Boring story. Cannot believe it's one of today's hottest.

