

Principles for Applying Machine Learning Techniques - lpolovets
http://blog.factual.com/5-principles-for-applying-machine-learning-techniques

======
bravura
This post is very light on information, and is confounded by the use of
idiomatic terminology.

Here's my summarized rewrite of this blog with my added commentary:

1\. Worry about the outliers. If you are working with a billion-instance
dataset, make sure to worry about the 1000 hard examples (0.0001%).
Personally, I think this is terrible advice. You should only worry about
getting 99.9999% accuracy if the problem really merits it. Otherwise, you're
focusing your energy on diminishing returns.

A more generous rewrite of this point 1 is: Don't use a model that is only
good at modelling the easy data points. If you can't correctly model many of
the data points, use a more sophisticated model that can correctly infer
generalizations across more of the data. This good point is obscured by the
suspicious claim that you should focus on 4.5 sigma events.

2\. Pay particular attention to examples that are near the margin. i.e.
Examples that the model is unsure about. This is good practice when doing
exploratory data analysis. I'm not really going to go near the half of this
point that is non-information.

[You should also do exploratory data analysis on examples that the model gets
wrong. One of my key ML principles: You learn about your model by breaking
it.]
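[To make the "near the margin" idea concrete — this is my illustration, not something from the post — here's a minimal sketch, assuming a scikit-learn-style classifier that exposes `predict_proba`, of pulling out the least-confident examples for inspection:]

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a real labeled dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# P(class=1); examples with probability near 0.5 sit near the margin.
proba = clf.predict_proba(X)[:, 1]
margin_idx = np.argsort(np.abs(proba - 0.5))[:20]  # 20 least-confident examples
print(margin_idx)
```

[Eyeballing those 20 rows is usually the fastest exploratory data analysis you can do.]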

3\. Do exploratory data analysis to see if your model isn't fitting some class
of examples, and figure out why. Essentially a restatement of points 1 and 2.

4\. Listen to the data. Good advice, and a restatement of 1-3.

5\. I'm not going to touch this.

I'm going to do a quick pass at my 5 principles for applying ML techniques:

1\. Figure out what accuracy is acceptable, and be honest here. If you don't
need 4.5 sigma accuracy, then you can solve the problem as needed and move on
to _different_ problems you have. Train a baseline model and see if it's good
enough. If not, then proceed.
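(A sketch of what I mean by a baseline — my own illustration, using scikit-learn's DummyClassifier as the honest floor to beat; any majority-class predictor would do:)

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced toy problem: roughly 90% of examples belong to one class.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Always predicting the majority class is the floor any real model must beat.
baseline = DummyClassifier(strategy="most_frequent")
score = cross_val_score(baseline, X, y, cv=5).mean()
print(f"baseline accuracy: {score:.2f}")
```

If that number already meets the accuracy bar you set, you're done; otherwise you've at least quantified how much a real model has to add.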

2\. Choose a model that scales to a lot of examples and is semi-supervised.

3\. Find unlabelled data sets that make your data set much bigger, and do
semi-supervised training.
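(One concrete way to sketch this — my choice of method, not anything the post names — is scikit-learn's LabelSpreading, which propagates the few known labels to the unlabelled points through a similarity graph:)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Pretend 90% of the labels were never collected: mark them -1 (unlabeled).
rng = np.random.RandomState(0)
unlabeled = rng.rand(len(y)) < 0.9
y_partial = y.copy()
y_partial[unlabeled] = -1

# Propagate the known labels through the data's similarity graph.
model = LabelSpreading().fit(X, y_partial)
acc = (model.transduction_[unlabeled] == y[unlabeled]).mean()
print(f"accuracy on the unlabeled points: {acc:.2f}")
```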

4\. Do exploratory data analysis. Break your model to understand what kind of
mistakes it makes. Figure out if you would need a far more sophisticated model
or feature set to get marginal returns, or if you can make a big impact with
small changes. If the latter, go to step five.
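(A minimal sketch of "breaking your model" — again my illustration, assuming a scikit-learn classifier: isolate the mistakes and compare them to the test set as a whole, since features whose means shift most among the errors hint at a slice the model is systematically getting wrong:)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Collect the misclassified test examples.
pred = clf.predict(X_te)
errors = X_te[pred != y_te]

# Features whose mean shifts most on the errors, relative to the whole
# test set, point at where the model is systematically failing.
delta = errors.mean(axis=0) - X_te.mean(axis=0)
print("most error-skewed features:", np.argsort(-np.abs(delta))[:3])
```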

5\. Use expert knowledge to improve the feature set, and capture information
that a human expert would need to do the task themselves.

~~~
lpolovets
(Disclosure: I'm not the author of the post, but I do work at Factual)

First, thanks for your feedback (you and everyone else who commented). It's
hard to figure out the right tone for a post like this (how technical we
should get, how much we should focus on general observations vs. specific
examples, etc.), and the clear feedback from HN is that deeper and more
technical posts are better.

Re: worrying about outliers being terrible advice

I definitely agree with you that working on the one-in-a-million cases is
rarely a good use of time, and the blog post should probably have talked about
3 sigma events instead of 4.5 sigma events. The spirit of the point is that
when you have a lot of data, things that are "rare" still happen a lot. It's
kind of like how MapReduce assumes that machine failures, while rare
individually, happen pretty frequently when you're working with clusters of
thousands of machines. Discounting machine failures because a single machine
is unlikely to fail leads to problems. For Factual, concretely, if we make a
mistake that affects just 0.1% of our data, then that's going to be 50k+
businesses that are not listed correctly in our dataset. That might be a lot
of failed phone calls, or people driving to places that don't exist, or so on,
and we take that seriously.

------
xiaoma
I'm taking the Stanford Coursera class on Machine Learning with Andrew Ng, and
I wholly recommend it. It just started this week, so it's not too late to join
if this blog post has whetted your appetite and you're looking for more.

<https://class.coursera.org/ml-2012-002/class/index>

~~~
Rickasaurus
I took this course last fall, not too much effort as college courses go and
well worth the time. Anyone who is curious about machine learning should most
definitely take it.

------
Wilduck
This article is actually much narrower in scope than the title would seem to
suggest. If you don't let the first sentence sink in, it barely even makes
sense.

> Here at Factual we apply machine learning techniques to help us build high
> quality data sets out of the gnarly mass of data that we gather from
> everywhere we can find it.

So really, this blog post deals with the topic of "principles for applying
machine learning techniques _to data cleaning_ ".

Clearly this is an appropriate topic for this company, as their product is
essentially API access to pre-cleaned/curated data sets. However, the post
itself is lacking depth. The work I do involves a lot of time cleaning my
company's internal data sets, so I definitely recognize the pain of the
corner/boundary/special case. However, anyone who has worked with data
cleaning (read: everyone who has worked with data) would know this pain;
they wouldn't need a blog post to point it out.

I would, however, be interested in knowing what sorts of machine learning
techniques they're applying to the problem. When I clean data, the process is
largely manual, probably in part because I'm not working with as large of data
sets. Maybe they don't want to reveal their secret sauce, but I think a more
technical blog post could serve to highlight how good their data cleaning is,
and therefore how high quality their product is.

------
code_scrapping
If I understood correctly, the first 3 points are about outliers. Typically,
you do want to ignore those, due to their being measurement mistakes or
special cases. In any case, handling special cases does not do much for the
generalization of the algorithms. So I would be against those points.

The 4th point is true, except that it's more zen advice than actually useful
information.

The 5th is a commercial.

Not a very good read, and I generally find that no real machine learning
knowledge gets explained in 5 short paragraphs with no technical terms
(except "sigma", which I take to mean standard deviations).

------
lightcatcher
Like others, I found this quite light on information. The best writeup I've
seen of how to integrate machine learning into an actual infrastructure of
code, machines, and people is this:
<http://www.quora.com/What-are-the-keys-to-operationalizing-a-machine-learning-ranking-system-from-an-organization-engineering-management-point-of-view/answer/Brandon-Ballinger>

Full disclosure: I was impressed enough upon reading this post that I messaged
Brandon (the author) and am now interning at the startup he cofounded that
solves problems that heavily involve machine learning.

------
crusso
_This post is very light on information_

That's generous. I've played around with a couple of ML techniques and would
love to apply ML to more of my day-to-day work.

The title of the article was total link bait and I'm sorry I read through 90%
of it looking for something semi-useful.

~~~
bravura
You can try the MetaOptimize Q+A forum:

<http://metaoptimize.com/qa/>

I started it so that people could talk about the practice of ML, and the
technical details that too often are not discussed in academic publications.

------
lpolovets
If anyone is interested, the follow-up post is here:
<http://news.ycombinator.com/item?id=4444833>. In response to the feedback,
we tried to make the follow-up deeper and more concrete.

