
Effective testing for machine learning systems - brzozowski
https://www.jeremyjordan.me/testing-ml/
======
vii
The practices identified here make sense. Running pipelines on a small amount
of data to make sure there are no simple typos, then applying aggregate
statistical tests and tests of specific example conditions, will catch many
issues.
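For concreteness, a minimal sketch of what those three layers might look like
(the `pipeline`/`model` objects and their `predict`/`predict_one` methods are
hypothetical placeholders, not anything from the article):

```python
import numpy as np

def smoke_test_pipeline(pipeline, tiny_batch):
    # Run the full pipeline on a handful of rows to catch typos and shape bugs.
    preds = pipeline.predict(tiny_batch)       # should not raise
    assert len(preds) == len(tiny_batch)       # one prediction per input
    assert np.isfinite(preds).all()            # no NaNs/infs leaking through

def aggregate_statistics_test(preds, expected_mean, tolerance=0.05):
    # Aggregate statistical test: the prediction distribution stays in range.
    assert abs(float(np.mean(preds)) - expected_mean) < tolerance

def example_condition_test(model):
    # Specific example test: a case whose correct output we know for sure.
    assert model.predict_one({"amount": 0.0, "country": "US"}) == "not_fraud"
```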

The paper "Making Contextual Decisions with Low Technical Debt"
[https://arxiv.org/pdf/1606.03966.pdf](https://arxiv.org/pdf/1606.03966.pdf)
goes deeper. Testing and monitoring deployments are very similar. The idea of
shadow testing new models (seeing how their outputs would differ from the
production models on real data) has been very important for identifying issues
in my experience.
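In practice shadow testing can be as simple as scoring every live request with
both models but only ever returning the production output. A rough sketch (the
model objects and request shape here are made up for illustration):

```python
import logging

log = logging.getLogger("shadow")

def serve(features, prod_model, shadow_model):
    # The production model alone decides the response.
    prod_pred = prod_model.predict(features)
    try:
        # The candidate runs in shadow; disagreements are logged for review.
        shadow_pred = shadow_model.predict(features)
        if shadow_pred != prod_pred:
            log.info("disagreement prod=%s shadow=%s features=%s",
                     prod_pred, shadow_pred, features)
    except Exception:
        # A broken shadow model must never affect the user-facing response.
        log.exception("shadow model failed")
    return prod_pred
```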

This can be generalised to comparing models on historic data, which greatly
speeds up evaluation. It differs from cross-validation in that it is not about
correctness, just about how _different_ the new output is. It is like the
pattern in UX development of a test harness that compares screenshot diffs: if
the differences look good, then ship it!
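Offline it's the same idea, replayed over logged inputs; something like this
(again a hypothetical model API):

```python
def diff_on_history(old_model, new_model, historic_inputs):
    # Replay logged inputs through both models and summarise the disagreement.
    diffs = [x for x in historic_inputs
             if old_model.predict(x) != new_model.predict(x)]
    rate = len(diffs) / len(historic_inputs)
    print(f"{rate:.1%} of {len(historic_inputs)} historic inputs changed output")
    return diffs  # review these by hand, like eyeballing a screenshot diff
```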

------
jsnctl
Nice article. Agree that this is a nascent but really interesting and
important area for machine learning as a discipline. The topics in this
article labelled as "invariance testing" and "expectation tests" hint at the
broader challenge with empirically defined functions: the input domain
for a given model can be significantly larger in scope and complexity than the
datasets the model has been exercised on during training, testing and
validation. Some of the highest-performing teams will address this, but I'd
hazard a guess that there are many models in production nowadays that aren't
instrumented to consider concepts like the extremes and dynamics of their valid
input spaces, developer-introduced scope creep as models get more integrated
with CI/CD practices, and many other tricky and complex behaviours that might
get lost amongst the noise of ML development.
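To make the "invariance testing" idea concrete: you perturb an input along a
dimension the model should ignore and assert the prediction doesn't move. A toy
sketch with a made-up `predict_one` API:

```python
def test_invariance_to_irrelevant_feature(model, base_example):
    # Changing a feature the model should ignore must not flip the prediction.
    baseline = model.predict_one(base_example)
    for name in ["Alice", "Bob", "Chinwe", "Yusuf"]:
        perturbed = {**base_example, "first_name": name}
        assert model.predict_one(perturbed) == baseline
```

A handful of such checks still only probes a tiny slice of the valid input
space, which is exactly the problem.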

"Testing" in ML as it stands just now is essentially still development &
application logic of the learned function, not testing in the same sense as we
consider it in other aspects of software. The "post-train" area will need to
see a lot of advances if we're to remain confident in our ML models in
production (provided they continue to proliferate into more areas of
software).

