Principles for Applying Machine Learning Techniques (factual.com)
57 points by lpolovets 1609 days ago | 12 comments

This post is very light on information, and is confounded by the use of idiomatic terminology.

Here's my summarized rewrite of this blog with my added commentary:

1. Worry about the outliers. If you are working with a billion instance dataset, make sure to worry about the 1000 hard examples (0.0001%). Personally, I think this is terrible advice. You should only worry about getting 99.9999% accuracy if the problem really merits it. Otherwise, you're focusing your energy on diminishing returns.

A more generous rewrite of point 1 is: Don't use a model that is only good at modelling the easy data points. If you can't correctly model many of the data points, use a more sophisticated model that can correctly infer generalizations across more of the data. This good point is obscured by the suspicious claim that you should focus on 4.5 sigma events.

2. Pay particular attention to examples that are near the margin, i.e., examples that the model is unsure about. This is good practice when doing exploratory data analysis. I'm not going to go near the half of this point that is non-information.

[You should also do exploratory data analysis on examples that the model gets wrong. One of my key ML principles: You learn about your model by breaking it.]

3. Do exploratory data analysis to see if your model isn't fitting some class of examples, and figure out why. Essentially a restatement of points 1 and 2.

4. Listen to the data. Good advice, and a restatement of 1-3.

5. I'm not going to touch this.
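The "examples near the margin" idea from point 2, and the bracketed note about breaking your model, can be sketched with scikit-learn. The synthetic data and the choice of logistic regression here are illustrative assumptions; any classifier with calibrated probabilities works the same way:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real labelled dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Examples near the margin: predicted probability close to 0.5.
margin = np.abs(proba - 0.5)
uncertain = np.argsort(margin)[:20]

# Examples the model gets wrong: the ones to inspect by hand.
wrong = np.flatnonzero(clf.predict(X) != y)

print(len(uncertain), len(wrong))
```

Both index sets are worth eyeballing manually; the overlap between them is often where the interesting failure modes live.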

I'm going to do a quick pass at my 5 principles for applying ML techniques:

1. Figure out what accuracy is acceptable, and be honest here. If you don't need 4.5 sigma accuracy, then you can solve the problem as needed and move on to different problems you have. Train a baseline model and see if it's good enough. If not, then proceed.

2. Choose a model that scales to a lot of examples and is semi-supervised.

3. Find unlabelled data sets that make your data set much bigger, and do semi-supervised training.

4. Do exploratory data analysis. Break your model to understand what kind of mistakes it makes. Figure out if you would need a far more sophisticated model or feature set to get marginal returns, or if you can make a big impact with small changes. If the latter, go to step five.

5. Use expert knowledge to improve the feature set, and capture information that a human expert would need to do the task themselves.
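Steps 1 through 3 above can be sketched with scikit-learn. This is a minimal illustration on made-up data, not a recipe: the dataset, the labelled/unlabelled split, and the choice of logistic regression are all assumptions. The `-1` label is scikit-learn's convention for unlabelled examples in semi-supervised training:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

# Illustrative synthetic data: a small labelled set plus a large unlabelled pool.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: is a trivial baseline already good enough for the problem?
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Steps 2-3: keep only 100 labels and mark the rest unlabelled (-1),
# then let self-training pseudo-label the pool.
y_partial = y_tr.copy()
y_partial[100:] = -1
semi = SelfTrainingClassifier(LogisticRegression()).fit(X_tr, y_partial)

print(baseline.score(X_te, y_te), semi.score(X_te, y_te))
```

If the baseline score already meets the accuracy bar from step 1, stop there and move on.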

(Disclosure: I'm not the author of the post, but I do work at Factual)

First, thanks for your feedback (you and everyone else who commented). It's hard to figure out the right tone for a post like this (how technical we should get, how much we should focus on general observations vs. specific examples, etc.), and the clear feedback from HN is that deeper and more technical posts are better.

Re: worrying about outliers being terrible advice

I definitely agree with you that working on the one-in-a-million cases is rarely a good use of time, and the blog post should probably have talked about 3 sigma events instead of 4.5 sigma events. The spirit of the point is that when you have a lot of data, things that are "rare" still happen a lot. It's kind of like how MapReduce assumes that machine failures, while rare individually, happen pretty frequently when you're working with clusters of thousands of machines. Discounting machine failures because a single machine is unlikely to fail leads to problems. For Factual, concretely, if we make a mistake that affects just 0.1% of our data, then that's going to be 50k+ businesses that are not listed correctly in our dataset. That might be a lot of failed phone calls, or people driving to places that don't exist, or so on, and we take that seriously.

Couldn't agree more with your assessment. The idea of paying attention to 4.5 sigma events is crazy. The majority of classification tasks require broad accuracy and a measure of confidence. If the edge cases and boundaries are that important for your problem, use a one-class/exemplar SVM to identify them.
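A sketch of that suggestion with scikit-learn's `OneClassSVM`. The data here is made up; `nu` roughly upper-bounds the fraction of training points flagged as anomalous:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Illustrative data: a dense cluster of normal points plus a few scattered outliers.
rng = np.random.default_rng(2)
inliers = rng.normal(size=(300, 2))
outliers = rng.uniform(-8.0, 8.0, size=(10, 2))
X = np.vstack([inliers, outliers])

# Fit on the bulk of the data; predict() returns +1 for inliers, -1 for outliers.
detector = OneClassSVM(nu=0.05, gamma="scale").fit(X)
flags = detector.predict(X)
print(int((flags == -1).sum()))
```

The flagged points are exactly the edge cases worth routing to manual review rather than contorting the main classifier around.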

If I could add one more tip, it would be to make sure that all of your input and output features have zero mean and unit variance (i.e., standardize them, as for a standard normal distribution).
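That standardization is one call with scikit-learn's `StandardScaler` (the column values here are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 900.0]])

# Per column: subtract the mean, divide by the standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Fit the scaler on training data only, then reuse it (`scaler.transform`) on test data, or the test set leaks into the feature statistics.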

What, you don't feel your data whispering to you?

I'm taking the Stanford Coursera class on Machine Learning with Andrew Ng, and I wholly recommend it. It just started this week, so it's not too late to join if this blog post has whetted your appetite and you're looking for more.


I took this course last fall, not too much effort as college courses go and well worth the time. Anyone who is curious about machine learning should most definitely take it.

This article is actually much narrower in scope than the title would seem to suggest. If you don't let the first sentence sink in, it barely even makes sense.

> Here at Factual we apply machine learning techniques to help us build high quality data sets out of the gnarly mass of data that we gather from everywhere we can find it.

So really, this blog post deals with the topic of "principles for applying machine learning techniques to data cleaning".

Clearly this is an appropriate topic for this company, as their product is essentially API access to pre-cleaned/curated data sets. However, the post itself is lacking depth. The work I do involves a lot of time cleaning my company's internal data sets, so I definitely recognize the pain of the corner/boundary/special case. However, anyone who has worked with data cleaning (read: everyone who has worked with data) would know this pain; they wouldn't need a blog post to point it out.

I would, however, be interested in knowing what sorts of machine learning techniques they're applying to the problem. When I clean data, the process is largely manual, probably in part because I'm not working with as large of data sets. Maybe they don't want to reveal their secret sauce, but I think a more technical blog post could serve to highlight how good their data cleaning is, and therefore how high quality their product is.

If I understood correctly, the first 3 points can be considered to be about outliers. Typically, you do want to ignore those, due to them being mistakes in measurements or special cases. In any case, handling the special cases does not do well for generalization of the algorithms. So I would be against those points.

The 4th point is true, except that it's more zen advice than actual useful information.

The 5th is a commercial.

Not a very good read; unsurprisingly, no machine learning knowledge gets explained in 5 short paragraphs with no technical terms (except "sigma", which I guess means standard deviation).

Like others, I found this quite light on information. The best writeup I've seen of how to integrate machine learning into an actual infrastructure of code, machines, and people is this: http://www.quora.com/What-are-the-keys-to-operationalizing-a...

Full disclosure: I was impressed enough upon reading this post that I messaged Brandon (the author) and am now interning at the startup he cofounded that solves problems that heavily involve machine learning.

If anyone is interested, the follow-up post is here: http://news.ycombinator.com/item?id=4444833 . In response to the feedback, we tried to make a follow-up deeper and more concrete.

> This post is very light on information

That's generous. I've played around with a couple of ML techniques and would love to apply ML to more of my day-to-day work.

The title of the article was total link bait and I'm sorry I read through 90% of it looking for something semi-useful.

You can try the MetaOptimize Q+A forum:


I started it so that people could talk about the practice of ML, and the technical details that too often are not discussed in academic publications.
