Hacker News
Technical Debt in Machine Learning Systems (2015) [pdf] (nips.cc)
107 points by earino 13 days ago | 5 comments

The combination of glue code and pipeline jungles is, along with feature engineering, one of the biggest pain points we've observed in users. This stuff gets copied and pasted everywhere, turns unmaintainable, and then is next to impossible to optimize.

It's as if a lot of ML framework authors believe that most users are researchers... in reality, data is rarely clean, rarely in the right format, and usually needs to be intermingled and transformed with other data before it can be useful.

Part of the problem is that if you gave 20 developers/data scientists/ML engineers the same set of data and asked them to do data prep and feature engineering, you'd probably have them come back with 20 different approaches.

To avoid pipeline jungles, teams need to agree on certain APIs that their data processing code will follow, e.g. scikit-learn helped many people standardize around fit/predict/transform for their machine learning algorithms. In the future, I expect we'll see this expand to other parts of the process, such as feature engineering.
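To make the fit/transform point concrete, here's a rough sketch in plain Python of what an agreed-on contract for data prep steps could look like. This is not scikit-learn itself: the classes MeanImputer, Scaler, and Pipeline below are made-up stand-ins for illustration, and the only assumption is that every step exposes fit(rows) and transform(rows).

```python
# Hypothetical sketch: these classes are illustrative, not from any library.
from statistics import mean


class MeanImputer:
    """Replace missing values (None) with the column mean learned in fit()."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means_ = [mean(v for v in col if v is not None) for col in columns]
        return self

    def transform(self, rows):
        return [
            [self.means_[i] if v is None else v for i, v in enumerate(row)]
            for row in rows
        ]


class Scaler:
    """Divide each column by the largest absolute value seen in fit()."""

    def fit(self, rows):
        columns = list(zip(*rows))
        # Fall back to 1.0 so an all-zero column does not divide by zero.
        self.scales_ = [max(abs(v) for v in col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [[v / s for v, s in zip(row, self.scales_)] for row in rows]


class Pipeline:
    """Chain steps that all share the same fit/transform contract."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, rows):
        # Each step is fitted on the output of the steps before it.
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows


train = [[1.0, 10.0], [None, 20.0], [3.0, None]]
pipeline = Pipeline([MeanImputer(), Scaler()]).fit(train)
features = pipeline.transform([[None, 5.0]])
```

Because every step follows the same two-method contract, anyone on the team can add, swap, or reorder steps without writing new glue code, and whatever was learned from the training data gets reused unchanged on new rows.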

Towards that goal, I work on Featuretools, an open source library that tries to do this for feature engineering. You can check it out here: https://github.com/FeatureLabs/featuretools/

AutoNormalize (part of FeatureTools, to those unfamiliar) is one of the most useful libraries I've used in a while.

This seems like a very useful framework for thinking about ML systems.

The thing about an ML system is that it is intended to turn big mounds of data into predictions/classifications without a human having to directly consider the multitude of questions otherwise addressed in large-scale software design. I.e., a multitude of boundaries and criteria are replaced by one criterion: "it works". The thing is that this set of boundaries and criteria still exists even if the individual setting up the system considers the situation solved. This manifests both as the world changing over time and as other people perhaps not being as satisfied with the results of the system as those who created it, these being just two potential gotchas.
