
Technical Debt in Machine Learning Systems (2015) [pdf] - earino
http://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
======
alfalfasprout
The combination of glue code and pipeline jungles is, along with feature
engineering, one of the biggest pain points we've observed in users. This
stuff gets copied and pasted everywhere, turns unmaintainable, and then is
next to impossible to optimize.

It's as if a lot of ML framework authors believe that most users are
researchers... in reality, data is rarely clean, rarely in the right format,
and usually needs to be intermingled and transformed with other data before it
can be useful.

~~~
kmax12
Part of the problem is that if you gave 20 developers/data scientists/ML
engineers the same set of data and asked them to do data prep and feature
engineering, you'd probably have them come back with 20 different approaches.

To avoid pipeline jungles, teams need to agree on certain APIs that their
data processing code will follow, e.g., scikit-learn helped many people
standardize around fit/predict/transform for their machine learning
algorithms. In the future, I expect we'll see this expand to other parts of
the process, such as feature engineering.
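
To make the fit/predict/transform idea concrete, here is a rough sketch of
what agreeing on such a convention could look like. It is plain Python and
does not use scikit-learn itself; the class names (MeanImputer,
ThresholdClassifier, Pipeline) and the toy data are hypothetical, chosen only
to show that every step exposing the same methods lets them be chained without
glue code.

```python
# Hypothetical sketch of a shared fit/transform/predict convention.
# Plain Python only; none of these classes come from scikit-learn.


class MeanImputer:
    """Replace missing values (None) with the column mean learned in fit."""

    def fit(self, rows):
        n_cols = len(rows[0])
        self.means_ = []
        for j in range(n_cols):
            seen = [r[j] for r in rows if r[j] is not None]
            self.means_.append(sum(seen) / len(seen))
        return self

    def transform(self, rows):
        return [
            [self.means_[j] if v is None else v for j, v in enumerate(r)]
            for r in rows
        ]


class ThresholdClassifier:
    """Toy model: predict 1 when a row's sum exceeds the mean sum seen in fit."""

    def fit(self, rows, labels=None):
        # labels are accepted for interface compatibility but unused here.
        sums = [sum(r) for r in rows]
        self.threshold_ = sum(sums) / len(sums)
        return self

    def predict(self, rows):
        return [1 if sum(r) > self.threshold_ else 0 for r in rows]


class Pipeline:
    """Chain transformers, then a final estimator, behind one interface."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, rows, labels=None):
        for t in self.transformers:
            rows = t.fit(rows).transform(rows)
        self.estimator.fit(rows, labels)
        return self

    def predict(self, rows):
        # Reuse the statistics learned in fit; never refit on new data.
        for t in self.transformers:
            rows = t.transform(rows)
        return self.estimator.predict(rows)


train = [[1.0, None], [3.0, 4.0], [None, 8.0]]
model = Pipeline([MeanImputer()], ThresholdClassifier()).fit(train)
predictions = model.predict([[1.0, 1.0], [None, 10.0]])
```

Because the prep step and the model share one interface, swapping in a
different imputer or adding another transformer only changes the list passed
to Pipeline, not the code that calls fit and predict.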

Towards that goal, I work on an open source library trying to do this for
feature engineering called Featuretools. You can check it out here:
[https://github.com/FeatureLabs/featuretools/](https://github.com/FeatureLabs/featuretools/)

~~~
ende
AutoNormalize (part of Featuretools, for those unfamiliar) is one of the most
useful libraries I’ve used in a while.

------
joe_the_user
This seems like a very useful framework to consider ML systems in.

The thing about an ML system as such is that it is intended to turn big
mounds of data into predictions/classifications without a human having to
directly consider the multitude of questions otherwise addressed in
large-scale software design. I.e., a multitude of boundaries and criteria are
replaced by one criterion: "it works". The thing is that this set of
boundaries and criteria still exists even if the individual setting up the
system considers the situation solved. This manifests both as the world
changing over time and as other people perhaps not being as satisfied with the
results of the system as those who created it, these being just two potential
gotchas.

------
mistrial9
YNews
[https://news.ycombinator.com/item?id=17341128](https://news.ycombinator.com/item?id=17341128)

