
The Perilous World of Machine Learning for Fun and Profit - ColinWright
http://www.john-foreman.com/blog/the-perilous-world-of-machine-learning-for-fun-and-profit-pipeline-jungles-and-hidden-feedback-loops
======
danso
The OP wrote a great post some time ago (it was posted on HN and was probably
how I discovered him): "The Forgotten Job of a Data Scientist: Editing"
[http://www.john-foreman.com/blog/the-forgotten-job-of-a-
data...](http://www.john-foreman.com/blog/the-forgotten-job-of-a-data-
scientist-editing)...the theme of that post is similar to the one here:

> _In a business, predictive modeling and accuracy is a means, not an end.
> What’s better: A simple model that’s used, updated, and kept running? Or a
> complex model that works when you babysit it but the moment you move on to
> another problem no one knows what the hell it’s doing? Robert Holte argued
> simplicity versus accuracy in his rather interesting and thoroughly-titled
> paper “Very Simple Classification Rules Perform Well on Most Commonly Used
> Datasets.” In that paper, the "air quotes" AI model he used was a single
> decision stump – one rule. In code, it’d basically be an IF statement. He
> simply looked through a training dataset and found the single best feature
> for splitting the data in reference to the response variable and used that
> to form his rule. The scandalous part of Holte’s approach: his results were
> pretty good!_
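
A minimal sketch of what a one-rule decision stump like Holte's might look like. The idea is just to search every feature and threshold for the single split that best predicts the label; the toy data and feature layout here are invented for illustration, not taken from Holte's paper.

```python
# A hypothetical sketch of a "1R" decision stump: find the single best
# feature/threshold split against the response variable. The resulting
# rule really is just one IF statement.
def best_stump(rows, labels):
    """Return (feature_index, threshold, accuracy) of the best single split."""
    best = (None, None, 0.0)
    n_features = len(rows[0])
    for f in range(n_features):
        for t in sorted(set(r[f] for r in rows)):
            # Try the rule "feature f > t predicts 1" and its reverse.
            for flip in (False, True):
                preds = [(r[f] > t) != flip for r in rows]
                acc = sum(p == bool(y) for p, y in zip(preds, labels)) / len(labels)
                if acc > best[2]:
                    best = (f, t, acc)
    return best

# Toy data: feature 0 is informative, feature 1 is noise.
rows = [(1, 9), (2, 1), (3, 7), (8, 2), (9, 8), (10, 3)]
labels = [0, 0, 0, 1, 1, 1]
f, t, acc = best_stump(rows, labels)
# The learned rule here is literally: "if row[0] > 3: predict 1"
```

On this toy data the stump is a perfect classifier, which is the "scandalous" point: one well-chosen IF statement can go a surprisingly long way.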

The OP has a book (the link is on his site) that I've bought...it purports to
teach non-technical people how to do machine learning via _Excel_...I haven't
gone through all of the exercises, mostly because I shudder at having to open
Excel to do any kind of data work, but the mental framework that he builds,
demonstrating via examples in Excel, is pretty remarkable...I think it's
admirable that a data scientist is doing such significant work in demystifying
(and maybe even devaluing) his own profession.

~~~
flashman
I'll just step in and say that, as a non-coder who nevertheless has to
understand how code works, Foreman's book "Data Smart" is an invaluable primer
on the theory and practice of machine learning.

Certainly I'm no machine learning pro, but as a discipline it's a whole lot
less mysterious to me now than it was before I read that book.

------
tel
I feel like these are all reasonably obvious gigantic landmarks taught to
anyone with a stats or pattern recognition background. It's sort of just the
lay of the land with statistics, modeling, optimization.

That doesn't mean at all that more people shouldn't be aware of them and think
critically about them. It also doesn't mean at all that the implications of
these features of ML are understood by the entirety of the organization which
is affected by them.

But I sort of feel like it's the (perhaps unstated) primary job of a data
scientist/modeler/mathematician/statistician/what-have-you employed at any
company investing in this kind of technology. It feels about akin to a company
building their software without using version control. Sure some people do
that, but... c'mon!

If you're failing to do it then you're probably not listening to your experts
enough. If they're not screaming it's because they're wrapping blankets around
their head trying to block out the noise.

~~~
tel
After having skimmed the paper, I think it has a subtly different focus which
is vastly more important to consider. In particular:

> At a system-level, a machine learning model may subtly erode abstraction
> boundaries. It may be tempting to re-use input signals in ways that create
> unintended tight coupling of otherwise disjoint systems. Machine learning
> packages may often be treated as black boxes, resulting in large masses of
> “glue code” or calibration layers that can lock in assumptions. Changes in
> the external world may make models or input signals change behavior in
> unintended ways, ratcheting up maintenance cost and the burden of any debt.
> Even monitoring that the system as a whole is operating as intended may be
> difficult without careful design.

In other words, due to either unavoidable or marketing-based complexities, ML
projects have a tendency to blow up abstraction boundaries and induce loops
leading to unexpected tight coupling. You can interpret this less as a feature
of ML as a task and more as a feature of how ML algorithms tend to operate,
especially when operated in "black box" mode.

ML may, when practiced in some ways, be antithetical to good software design.

Again, quoting the paper:

> Indeed, arguably the most important reason for using a machine learning
> system is precisely that _the desired behavior cannot be effectively
> implemented in software logic without dependency on external data_.
> (Emphasis mine)

This rings _very_ true in my experience building ML systems. It's an
unexpected challenge in software terms. You're used to many kinds of behavior
being undefined until runtime (e.g. values that need user input before they
can be known), but typically the _function_ of a program is something you
fully internalize statically—this is the cornerstone of understandable code.

OTOH, ML says that even the very behavior of the system must be deferred until
"runtime" information can propagate back. This is especially pronounced in
online decision-making tools, but plays out over much longer time periods even
with batch algorithm design. This "phase distinction" problem is something you
will feel immediately in building ML-based code.
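
One way to see the "phase distinction" point in miniature: in an ML system the program's function is determined by learned parameters, not by the source text. A small made-up sketch (the weight values are stand-ins, not from any real system):

```python
# The returned function's behavior is defined by data loaded at "runtime",
# not by anything you can read statically in the code.
def make_scorer(weights):
    """Build a linear scorer whose function is fixed only by its weights."""
    def score(features):
        return sum(w * x for w, x in zip(weights, features))
    return score

# Two textually identical programs, two different behaviors, because the
# learned parameters differ.
score_v1 = make_scorer([1.0, 0.0])
score_v2 = make_scorer([0.0, 1.0])
```

Reading `make_scorer` tells you almost nothing about what the deployed system will actually do; only the data-derived weights do, which is exactly the property that makes such code hard to internalize statically.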

It's probably the case that black box ML algorithms exacerbate this. I don't
have a lot of experience there, though.

------
jlees
Definitely agree with this. I do see more folks out there (like myself) with
one foot in both the data science and production engineering worlds, and my
personal approach to the problem of black boxes would be to have more of us!
Rather than some 'sucker dev' implementing a statistician's hack, a full-stack
mindset means the data science is done with engineering considerations all
along.

But even then, half the joy of data science is not really knowing what you're
doing at first as you research and experiment, just as your first attempt at a
production system for something complex will also not necessarily be the final
thing. If you have both of these at once, you can end up with a 'production'
system that looks pretty well-designed and, indeed, works well enough at first
-- until the product outgrows it, leading to a somewhat uncomfortable
transition later on.

My personal anecdote starts with my PhD research which initially used a
combination of Perl scripts, WEKA and god-knows-what-else, on fixed pre-
formatted corpora, and became a Python engine powering a Django web-app using
the Twitter API - not exactly rocket surgery at Google scale, but I found the
whole shift in mindset from research to execution extremely useful and totally
recommend to any data scientists who have focused solely on research that you
try the other side of the fence, at least a little.

Stripping things down and simplifying, getting that 80/20 balance right,
that's also some of the investigative fun of data science, and we shouldn't
get so carried away with our own cleverness that we ignore the poor sods who
have to use the models we build years from now. I was consulting on a project which had
very complex visions of a layered AI system using machine learning to make key
decisions, but given the stage of the project and the depth of ML knowledge of
the team, it was way easier for the initial stages to use what was effectively
a switch statement on pre-defined values instead. You can always add
complexity!
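
For concreteness, the "effectively a switch statement on pre-defined values" baseline might look something like this; the domain, categories, and thresholds here are invented, not from the project described above:

```python
# A hypothetical rule-table baseline standing in for an ML decision layer.
# Every rule is explicit, auditable, and trivially testable.
def classify_ticket(priority: str, keyword_count: int) -> str:
    """Route a support ticket using hand-maintained rules."""
    if priority == "urgent":
        return "escalate"
    elif priority == "normal" and keyword_count >= 3:
        return "review"
    else:
        return "queue"
```

The design appeal is that the whole decision surface fits in one screen; when the product outgrows it, the team can swap in a model knowing exactly which behavior it must beat.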

~~~
atrilla
I see lots of common points with the "lean" method here (at the expense of
technical debt), but can't forget Knuth's advice that premature optimization
is the root of all evil.

To me, the most important aspect to take into account is the "key" of the
business. I had a similar story with my PhD research (AI+NLP): I wanted to
focus it on adding value to commercial products. I worked on the core of the
implementation, built it in Java with a configurable pipeline, plus an online
app with servlets. I did get noticed by some companies, but when it came to
closing future lines of work with them, my adviser told me this was not what
he expected from me, and I was forced to go back to the non-useful-goal-
centred research of academia. I am still glad I did what I did, regardless of
how useless it eventually was, because my knowledge grew, and I learned how
important it is to know your business model before you do anything at all. I
learned how unbalanced the academic market is, always relying on public funds
to survive instead of worrying about building useful, appealing stuff. I
realized where I wanted to be, so I dropped out and joined private industry,
where I now feel very fulfilled.

My present approach goes from small to big, little by little, getting as much
feedback as I can so I can fix mistakes ASAP and prevent them from getting
bigger and more difficult to manage. I very much agree with your words.

------
mwexler
The google paper the OP is summarizing is much, much better than the summary.
[http://research.google.com/pubs/pub43146.html](http://research.google.com/pubs/pub43146.html)
leading to
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43146.pdf)

------
lovelettr
My attempt at abridgment.

Do not use more data sources than necessary. You can incur technical debt. [1]

Metaphor about dead bodies at the top of a hill.

Be cognizant of the fact that, in some use cases, acting on the output of a
predictive model can reduce the effectiveness of the original model. This is
called a feedback loop.
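
A tiny made-up simulation of such a feedback loop: a recommender shows whichever item looks best in its logs, and the next round of "training data" consists only of clicks on what it chose to show. The item names and appeal values are invented for illustration.

```python
# Hypothetical hidden feedback loop: the model trains on its own output.
# Both items are equally good in reality, but the logs won't say so.
import random

random.seed(0)
true_appeal = {"a": 0.5, "b": 0.5}   # ground truth: identical items
clicks = {"a": 1, "b": 1}            # the model's view, seeded with one click each

for _ in range(1000):
    # Recommend the item with the better observed click count...
    shown = max(clicks, key=clicks.get)
    # ...and only the shown item can ever receive feedback.
    if random.random() < true_appeal[shown]:
        clicks[shown] += 1

# One item dominates the logs even though both are identical in reality.
```

The model's output determined what data it saw next, so its early (arbitrary) preference became self-reinforcing, which is exactly the kind of loop that is invisible if you only look at the logged data.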

The goal of ML is business. Not some new ML algorithm/technique or programming
language paradigm.

[1]
[http://research.google.com/pubs/pub43146.html](http://research.google.com/pubs/pub43146.html)

------
iandanforth
"the pedestrianization of data science techniques"

This is the most condescending phrase I've read in a long time.

Otherwise the article had some good points, but that attitude really ruins it
for me.

~~~
mistermcgruff
Hey, author here. It does sound kinda bigoted, doesn't it? Lemme clarify. I
would really love for more people to learn this stuff. Truly. That's why I
wrote a book to teach it to folks in Excel. What freaks me out a bit is that
in order to enlarge the ML pie, a lot of vendors are trying to democratize
machine learning not by teaching it to more people but by putting the gun in
the hands of folks who don't actually understand how supervised machine learning
works and where it can go wrong.

I don't think people need PhDs to learn or do this stuff. But I think they
need more than a "machine learning made easy!" app and a gung-ho attitude.

I feel that I'm halfway between main street and an ivory tower...not sure
where that puts me. The upstairs bar at a pizza parlor?

~~~
shostack
I just recently posted in an HN machine learning thread asking for beginner
resources. This sounds right up my alley.

I'm teaching myself Ruby (and other stuff), but consider myself pretty
advanced with Excel and web analytics in general. This seems like a great way
for me to get my feet wet in the deeper science of things with tools I'm
already very familiar with (more so than Ruby, at least).

John, can you clarify a bit on how much background is needed in various areas
of math to get the most out of this book? Or do you feel you do a solid job of
teaching that as one progresses through the chapters?

~~~
mistermcgruff
A semester of linear algebra (or just a willingness to Wikipedia a few things)
plus Excel experience is all you need.

That said, the book does require a lot of effort, because the techniques are
worked through step by step.

But once you learn all the guts of the algorithms, you never have to implement
them again! The last chapter moves the reader into R package land with the
confidence that you now know what those packages are basically doing and what
to watch out for.

------
muser
I'm curious about the real-world implementation risk and whether anyone has a
methodology to proactively deal with external factors affecting model
performance. For example, if a feature is highly predictive of a certain
outcome, is there a framework to measure its volatility based on information
outside of the dataset, i.e. product changes, marketing campaigns, etc.?

