Here's my summarized rewrite of this blog with my added commentary:
1. Worry about the outliers. If you are working with a billion-instance dataset, make sure to worry about the 1000 hard examples (0.0001%). Personally, I think this is terrible advice. You should only worry about getting 99.9999% accuracy if the problem really merits it. Otherwise, you're focusing your energy on diminishing returns.
A more generous rewrite of this point 1 is: Don't use a model that is only good at modelling the easy data points. If you can't correctly model many of the data points, use a more sophisticated model that can correctly infer generalizations across more of the data. This good point is obscured by the suspicious claim that you should focus on 4.5 sigma events.
2. Pay particular attention to examples that are near the margin, i.e. examples the model is unsure about. This is good practice when doing exploratory data analysis. I'm not going to go near the half of this point that is non-information.
[You should also do exploratory data analysis on examples that the model gets wrong. One of my key ML principles: You learn about your model by breaking it.]
3. Do exploratory data analysis to see if your model isn't fitting some class of examples, and figure out why. Essentially a restatement of points 1 and 2.
4. Listen to the data. Good advice, and a restatement of 1-3.
5. I'm not going to touch this.
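Point 2 above can be made concrete: given a binary classifier's predicted probabilities, pull out the examples closest to the 0.5 boundary and inspect those first. A toy sketch (the model and probabilities here are made up purely for illustration):

```python
# Hypothetical example: surface the examples a binary classifier is least
# sure about, so they can be inspected first during exploratory analysis.
def margin_examples(examples, predict_proba, k=3):
    """Return the k examples whose predicted probability is closest to 0.5."""
    scored = [(abs(predict_proba(x) - 0.5), x) for x in examples]
    scored.sort(key=lambda pair: pair[0])
    return [x for _, x in scored[:k]]

# Toy "model": the probabilities are invented for illustration only.
probs = {"a": 0.97, "b": 0.52, "c": 0.08, "d": 0.41, "e": 0.65}
print(margin_examples(list(probs), probs.get, k=2))  # → ['b', 'd']
```

With a real model you'd plug in its `predict_proba` and eyeball the returned examples by hand.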
I'm going to do a quick pass at my 5 principles for applying ML techniques:
1. Figure out what accuracy is acceptable, and be honest here. If you don't need 4.5 sigma accuracy, then you can solve the problem as needed and move on to different problems you have. Train a baseline model and see if it's good enough. If not, then proceed.
2. Choose a model that scales to a lot of examples and is semi-supervised.
3. Find unlabelled data sets that make your data set much bigger, and do semi-supervised training.
4. Do exploratory data analysis. Break your model to understand what kind of mistakes it makes. Figure out if you would need a far more sophisticated model or feature set to get marginal returns, or if you can make a big impact with small changes. If the latter, go to step five.
5. Use expert knowledge to improve the feature set, and capture information that a human expert would need to do the task themselves.
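To make step 3 concrete, here's a minimal self-training sketch in pure Python: fit on the labelled seed, pseudo-label only the unlabelled points the model is confident about, and refit. The 1-D threshold "model" and all the numbers are stand-ins I invented for illustration; a real pipeline would use an actual classifier.

```python
# Minimal self-training sketch: grow the labelled set by pseudo-labelling
# confident predictions on unlabelled data, then refit.
# The "model" is just a threshold at the midpoint of the two class means.

def fit_threshold(labelled):
    """Threshold halfway between the means of the two classes."""
    xs0 = [x for x, y in labelled if y == 0]
    xs1 = [x for x, y in labelled if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def self_train(labelled, unlabelled, margin=2.0, rounds=5):
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        t = fit_threshold(labelled)
        # Pseudo-label only points far from the threshold (high confidence).
        confident = [x for x in pool if abs(x - t) >= margin]
        if not confident:
            break
        labelled += [(x, int(x > t)) for x in confident]
        pool = [x for x in pool if abs(x - t) < margin]
    return fit_threshold(labelled)

seed = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]   # small labelled set
extra = [0.5, 1.5, 8.5, 9.5, 5.2]                  # unlabelled points
print(self_train(seed, extra))  # → 5.0
```

Note the ambiguous point 5.2 is never pseudo-labelled, which is exactly the behaviour you want: self-training goes wrong when you confidently mislabel the hard cases.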
First, thanks for your feedback (you and everyone else who commented). It's hard to figure out the right tone for a post like this (how technical we should get, how much we should focus on general observations vs specific examples, etc.), and the clear feedback from HN is that deeper and more technical posts are better.
Re: worrying about outliers being terrible advice
I definitely agree with you that working on the one-in-a-million cases is rarely a good use of time, and the blog post should probably have talked about 3 sigma events instead of 4.5 sigma events. The spirit of the point is that when you have a lot of data, things that are "rare" still happen a lot. It's kind of like how MapReduce assumes that machine failures, while rare individually, happen pretty frequently when you're working with clusters of thousands of machines. Discounting machine failures because a single machine is unlikely to fail leads to problems. For Factual, concretely, if we make a mistake that affects just 0.1% of our data, then that's going to be 50k+ businesses that are not listed correctly in our dataset. That might be a lot of failed phone calls, or people driving to places that don't exist, or so on, and we take that seriously.
If I could add one more tip, it would be to make sure that all of your input and output features have zero mean and a variance of 1 (i.e., standardize them — note this rescales the data but doesn't make it normally distributed).
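As a rough sketch of that tip, standardizing a feature column by subtracting the mean and dividing by the standard deviation (the numbers here are arbitrary examples):

```python
# Standardize a feature column to zero mean and unit variance.
def standardize(xs):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / var ** 0.5 for x in xs]

col = [2.0, 4.0, 6.0, 8.0]
z = standardize(col)
print([round(v, 3) for v in z])  # → [-1.342, -0.447, 0.447, 1.342]
```

In practice you'd compute the mean and variance on the training set only and reuse them at prediction time, so that test data is scaled the same way the model saw during training.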
> Here at Factual we apply machine learning techniques to help us build high quality data sets out of the gnarly mass of data that we gather from everywhere we can find it.
So really, this blog post deals with the topic of "principles for applying machine learning techniques to data cleaning".
Clearly this is an appropriate topic for this company, as their product is essentially API access to pre-cleaned/curated data sets. However, the post itself is lacking depth. The work I do involves a lot of time cleaning my company's internal data sets, so I definitely recognize the pain of the corner/boundary/special case. However, anyone who has worked with data cleaning (read: everyone who has worked with data) would know this pain; they wouldn't need a blog post to point it out.
I would, however, be interested in knowing what sorts of machine learning techniques they're applying to the problem. When I clean data, the process is largely manual, probably in part because I'm not working with as large of data sets. Maybe they don't want to reveal their secret sauce, but I think a more technical blog post could serve to highlight how good their data cleaning is, and therefore how high quality their product is.
The 4th point is true, except that it's more zen advice than actually useful information.
The 5th is a commercial.
Not a very good read; no machine learning knowledge actually gets explained in five short paragraphs with no technical terms (except "sigma", which I guess means standard deviation).
Full disclosure: I was impressed enough upon reading this post that I messaged Brandon (the author) and am now interning at the startup he cofounded that solves problems that heavily involve machine learning.
That's generous. I've played around with a couple of ML techniques and would love to apply ML to more of my day-to-day work.
The title of the article was total link bait and I'm sorry I read through 90% of it looking for something semi-useful.
I started it so that people could talk about the practice of ML, and the technical details that too often are not discussed in academic publications.