Here's my summarized rewrite of this blog with my added commentary:
1. Worry about the outliers. If you are working with a billion-instance dataset, make sure to worry about the 1000 hard examples (0.0001%). Personally, I think this is terrible advice. You should only worry about getting 99.9999% accuracy if the problem really merits it. Otherwise, you're focusing your energy on diminishing returns.
A more generous rewrite of point 1 is: don't use a model that is only good at modelling the easy data points. If you can't correctly model many of the data points, use a more sophisticated model that can infer generalizations across more of the data. This good point is obscured by the suspicious claim that you should focus on 4.5-sigma events.
2. Pay particular attention to examples that are near the margin, i.e., examples that the model is unsure about. This is good practice when doing exploratory data analysis. I'm not going to go near the half of this point that is non-information.
[You should also do exploratory data analysis on examples that the model gets wrong. One of my key ML principles: You learn about your model by breaking it.]
3. Do exploratory data analysis to see if your model isn't fitting some class of examples, and figure out why. Essentially a restatement of points 1 and 2.
4. Listen to the data. Good advice, and a restatement of 1-3.
5. I'm not going to touch this.
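For concreteness, here's a minimal sketch of what "look at the examples near the margin" might mean for a binary classifier that outputs P(positive). The function name and the probability list are placeholders I made up, not anything from the original post:

```python
def least_certain(probs, k=10):
    """Indices of the k examples the model is least sure about.

    probs: predicted P(positive) for each example of a binary
    classifier. The margin is the distance from the 0.5 decision
    boundary; a small margin means the example is worth a manual look.
    """
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]

# Toy usage: indices 2 and 4 sit closest to the decision boundary.
probs = [0.99, 0.03, 0.51, 0.90, 0.45]
print(least_certain(probs, k=2))  # -> [2, 4]
```

Pull those rows, eyeball them, and you usually learn something about either your features or your labels.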
I'm going to do a quick pass at my 5 principles for applying ML techniques:
1. Figure out what accuracy is acceptable, and be honest here. If you don't need 4.5 sigma accuracy, then you can solve the problem as needed and move on to different problems you have. Train a baseline model and see if it's good enough. If not, then proceed.
2. Choose a model that scales to a lot of examples and is semi-supervised.
3. Find unlabelled datasets that make your dataset much bigger, and do semi-supervised training.
4. Do exploratory data analysis. Break your model to understand what kind of mistakes it makes. Figure out if you would need a far more sophisticated model or feature set to get marginal returns, or if you can make a big impact with small changes. If the latter, go to step five.
5. Use expert knowledge to improve the feature set, and capture information that a human expert would need to do the task themselves.
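Principles 2–3 can be sketched as a bare-bones self-training loop: fit on the labelled pool, pseudo-label the unlabelled points the model is confident about, fold them in, repeat. Everything below is illustrative and mine, not the commenter's: the nearest-centroid "model" is a stand-in chosen because it fits in a few lines, the 3x-closer confidence rule is arbitrary, and it assumes exactly two classes.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def self_train(labelled, unlabelled, rounds=3):
    """Minimal self-training sketch for two classes (0 and 1).

    labelled: list of (point, class) pairs; unlabelled: list of points.
    Each round, points at least 3x closer to one class centroid than
    the other get pseudo-labelled and added to the labelled pool.
    In practice you'd swap in whatever model actually scales for you.
    """
    labelled, unlabelled = list(labelled), list(unlabelled)
    for _ in range(rounds):
        cents = [centroid([p for p, c in labelled if c == k]) for k in (0, 1)]
        still_unsure, progress = [], False
        for p in unlabelled:
            d0, d1 = dist(p, cents[0]), dist(p, cents[1])
            near, far = sorted((d0, d1))
            if near * 3 < far:  # confident: clearly closer to one centroid
                labelled.append((p, 0 if d0 < d1 else 1))
                progress = True
            else:
                still_unsure.append(p)
        unlabelled = still_unsure
        if not progress or not unlabelled:
            break
    return labelled, unlabelled

# Toy usage: two seed labels, three unlabelled points. The two
# clear-cut points get pseudo-labels; the ambiguous midpoint does not.
labelled, unlabelled = self_train(
    [((0.0, 0.0), 0), ((10.0, 10.0), 1)],
    [(0.5, 0.5), (9.5, 9.5), (5.0, 5.0)],
)
```

The point of the confidence gate is exactly principle 4 in reverse: you only want to grow the training set with pseudo-labels the model is unlikely to get wrong.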
First, thanks for your feedback (you and everyone else who commented). It's hard to figure out the right tone for a post like this (how technical we should get, how much we should focus on general observations vs. specific examples, etc.), and the clear feedback from HN is that deeper, more technical posts are better.
Re: worrying about outliers being terrible advice
I definitely agree with you that working on the one-in-a-million cases is rarely a good use of time, and the blog post should probably have talked about 3-sigma events instead of 4.5-sigma events. The spirit of the point is that when you have a lot of data, things that are "rare" still happen a lot. It's kind of like how MapReduce assumes that machine failures, while rare individually, happen pretty frequently when you're working with clusters of thousands of machines. Discounting machine failures because a single machine is unlikely to fail leads to problems. For Factual, concretely, if we make a mistake that affects just 0.1% of our data, then that's going to be 50k+ businesses that are not listed correctly in our dataset. That might be a lot of failed phone calls, or people driving to places that don't exist, and so on, and we take that seriously.
If I could add one more tip, it would be to make sure that all of your input and output features are standardized to zero mean and a variance of 1 (as in a standard normal distribution).
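That last tip is just arithmetic; a minimal sketch (the function name is mine, and real code should guard against constant, zero-variance columns):

```python
def standardize(column):
    """Rescale one feature column to zero mean and unit variance,
    using the population standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column]

# Toy usage: the rescaled column has mean 0 and variance 1.
scaled = standardize([2.0, 4.0, 6.0])
```

One caveat worth stating explicitly: fit the mean and standard deviation on your training split only, then apply that same transform to validation and test data, or you leak information across the split.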