Hacker News
Things I wish I knew earlier about Machine Learning (peadarcoyle.wordpress.com)
234 points by ColinWright on Dec 24, 2015 | 11 comments

Also see Sculley et al. (2014), "Machine Learning: The High-Interest Credit Card of Technical Debt".


I could have sworn I saw this linked from the HN front page earlier today, but couldn't find it when I looked a few minutes ago (I dug back through several pages under both "new" and "news").

I would almost state "Clean datasets (like MNIST) considered harmful", because without exception the data scientists I meet who work with them shovel real data blindly into ML code and then kvetch when they don't beat simpler existing approaches right out of the gate.

And without fail so far, just cleaning up the data (feature scaling, removing input errors, etc.) changes this. I vacillate on building an automagic tool to do it for them, but IMO such a tool would send tsunamis of technical debt up their data pipelines as they got even sloppier about data, until the tool was overwhelmed by errors it couldn't detect. I'd rather accept the lesser evil of them getting nowhere until they realize this isn't magic.
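By "cleaning up the data" I mean something like this rough sketch (plain numpy, made-up thresholds, illustrative only — not a general-purpose tool, which is exactly the point):

```python
import numpy as np

def clean_features(X, max_z=5.0):
    """Crude cleanup: drop broken rows, then feature-scale.

    Hypothetical helper sketching the idea above, not a library API.
    """
    X = np.asarray(X, dtype=float)
    # 1. Drop rows with NaN/inf entries (typical input errors).
    X = X[np.isfinite(X).all(axis=1)]
    # 2. Drop gross outliers using a robust (median/MAD) z-score,
    #    which a handful of extreme values can't inflate.
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad = np.where(mad == 0, 1e-12, mad)
    z = 0.6745 * np.abs(X - med) / mad
    X = X[(z <= max_z).all(axis=1)]
    # 3. Feature-scale the survivors to zero mean, unit variance.
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

# Toy data: one row with a missing value, one with an absurd entry.
rows = [[1.0, 200.0], [2.0, 210.0], [np.nan, 205.0],
        [1.5, 1e9], [1.2, 198.0]]
clean = clean_features(rows)
```

Note the MAD-based filter rather than a plain z-score: with a small sample, one huge outlier inflates the standard deviation enough to hide itself, while the median/MAD version still catches it.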

Would you write something for biologists entitled "working on Drosophila considered harmful"?

Drosophila is a real, noisy living system, and as a biologist myself I can say that, unlike the data scientists I work with, none of us would consider the fruit fly an approximation of a human being.

Further, the biologists and chemists I know are aware of their limitations and do their best to surpass them. The data scientists I work with, OTOH, constantly tell me how code should be written despite knowing little more than single-threaded Python and Hadoop. See also their contempt for code maintenance and engineering, or "ops." Of course, this is a strawman that's unfair to the smart ones who aren't this way, but I'm seeing more of it as time goes on, not less.

Perfect example: the myriad supposed improvements to neural network training for which only MNIST results are ever shown. Wouldn't it be great if there were a 6+ TFLOP processor available, along with an open source specification and multiple vendors providing servers with 8 of these wonderful gadgets, that could be used to train, say, AlexNet with this new and improved algorithm? But that would require ops(tm), and ops is for engineers, not data scientists. Or TL;DR: they did try it, and it lost to Nesterov momentum or something similar on larger systems (results not shown).


One thing that I didn’t realise until sometime afterwards was just how challenging it is to handle issues like model decay, evaluation of models in production, dev-ops etc all by yourself.

Honestly, it seems like any standard machine learning application (based on train/test/deploy/etc.) is going to be more subject to decay/bit-rot than an equivalently complex application with a codified specification.

Over time that decay is going to be a problem for any institution that relies on such applications.

Reminds me of John Foreman (data scientist at Mailchimp) and his excellent blog, including this piece, Surviving Data Science at the Speed of Hype: http://www.john-foreman.com/blog/surviving-data-science-at-t...

> Over the course of my career, I've built optimization models for a number of businesses, some large, like Royal Caribbean or Coke, some smaller, like MailChimp circa 2012. And the one thing I've learned about optimization models, for example, is that as soon as you've "finished" coding and deploying your model the business changes right under your nose, rendering your model fundamentally useless. And you have to change the optimization model to address the new process...

> Process stability and "speed of BLAH" are not awesome bedfellows. Supervised AI models hate pivoting. When a business is changing a lot, that means processes get monkeyed with. Maybe customer support starts working in different shifts, maybe a new product gets released or prices are changed and that shifts demand from historical levels, or maybe your customer base changes to a younger demographic than your ML models have training data for targeting...

Until the machines are literally thinking for themselves, it seems that being able to engineer a stable infrastructure for rapid debugging and sampling of feedback is as paramount as being able to do the math for the modeling.

Wait, what does bit rot have to do with machine learning? I don't understand. Where can I learn about it?

Bit rot here means the degradation of an ML model's performance over time. When a model is trained, it learns a set of assumptions about the "universe", and over time the universe changes, which can invalidate some of those assumptions and degrade the model's performance.

For example, you create a model to classify tweets by sentiment, and then people start referring to your company as "sick" and "phat", which your model doesn't understand.

The bigger challenge is that most models attempt to predict human behavior, but humans have the ability to learn, so simply deploying the model changes the universe.

For example, your FICO score is an attempt to predict your credit default risk, but over time people learn how to trivially manipulate their scores, which degrades the ability of FICO scores to predict default risk.
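A rough sketch of how you might watch for this kind of decay in production — keep scoring the live model against fresh labeled examples and alert when rolling accuracy falls well below the offline baseline (names, numbers, and thresholds here are all made up):

```python
from collections import deque

class DecayMonitor:
    """Track a model's rolling accuracy on fresh labeled examples and
    flag decay when it drops below the offline test-set baseline.
    Illustrative sketch only, not any particular library's API.
    """

    def __init__(self, baseline_acc, window=1000, tolerance=0.05):
        self.baseline = baseline_acc
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # rolling hit/miss record

    def record(self, prediction, true_label):
        self.recent.append(prediction == true_label)

    def rolling_accuracy(self):
        return sum(self.recent) / len(self.recent) if self.recent else None

    def decayed(self):
        acc = self.rolling_accuracy()
        return acc is not None and acc < self.baseline - self.tolerance

# Toy run: the model holds up for a while, then the universe shifts.
monitor = DecayMonitor(baseline_acc=0.90, window=100)
for _ in range(60):
    monitor.record(1, 1)   # predictions still match reality...
for _ in range(40):
    monitor.record(1, 0)   # ...then assumptions start failing
```

The hard part in practice isn't this loop, it's getting fresh labeled examples at all — which is why the infrastructure/ops side matters as much as the modeling.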

This is where we break out the old Zawinski quote:

  Some people, when confronted with a problem, think
  "I know, I'll use regular expressions."
  Now they have two problems.

I actually had to tell this to someone who wanted to build a machine classifier. They wanted the classifier as a step in getting data for their research. The problem was that, as far as I knew from the literature, building the classifier would be a research project of its own.


Most of the (relative) successes of machine learning have involved a circuit something like

(input data) --> ML --> (human acceptable output)

We see this in translation, image classification, face recognition.

But if we have a circuit of

(input data) --> ML --> (data for further significant processing by computer)

We have a problem. The ML can't be a real "black box" here, in the sense that it can't learn the correct classifications without regard to how those classifications are going to be used. Since it's learning "what humans do", the code can't be much more reliable than humans, but it also has the problem of not being adaptable in the fashion that humans (sometimes) are.
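To make that concrete, here's a toy sketch (invented data and costs) showing that even the "right" classification threshold depends on how downstream code consumes the output — the same classifier needs a different cutoff when false positives become expensive:

```python
def choose_threshold(scores_labels, fn_cost, fp_cost):
    """Pick the score cutoff minimizing downstream cost.

    Illustrative only: the point is that the 'correct' output of the
    ML stage depends on what the next stage does with it.
    """
    best_t, best_cost = 0.0, float("inf")
    candidates = sorted({s for s, _ in scores_labels})
    for t in candidates + [1.01]:  # 1.01 = "predict nothing positive"
        cost = 0.0
        for score, label in scores_labels:
            pred = score >= t
            if label and not pred:
                cost += fn_cost   # missed a true positive
            elif pred and not label:
                cost += fp_cost   # flagged a true negative
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Toy scores with ground-truth labels (1 = positive class).
data = [(0.9, 1), (0.8, 1), (0.6, 0), (0.4, 1), (0.2, 0)]
cheap_fp = choose_threshold(data, fn_cost=1, fp_cost=1)
costly_fp = choose_threshold(data, fn_cost=1, fp_cost=10)
```

With symmetric costs the best cutoff is low; make false positives ten times as expensive and it jumps — so a classifier trained as a black box, without knowing which regime it feeds, is optimizing the wrong thing.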

