An Introduction to Data Mining (utoronto.ca)
224 points by ssn on May 22, 2011 | 25 comments

That tree's a good example of how not to do it.

For one thing, it's representing something that's not really a tree. "Support Vector Machine" and "Neural Networks" appear more than once as leaf nodes. Like any Chinese Encyclopedia classifications they often succumb to the temptation to add nodes that say "Other" to keep the length of the branches constant. (They could probably think of some name for what neural nets and the SVM have in common -- they've got all day to think about this stuff because they get paid to teach and to do research, it's not like they are harried practitioners.)

At some point I quit distinguishing regression and classification. There was a time when I knew some tricks for classification and regression seemed mysterious. Once I got over my mental block it seemed pretty obvious that much of my box of tricks worked for regression too.

Another issue is that it's not a good graphic for the web. You could probably print this out and read it, but you can't take it in at a glance on the web, which defeats the purpose of it being an infographic.

You probably have more background in data mining than the rest of us. For those of us with limited or no exposure to data mining techniques, I think the graphic gives some background about each technique that we otherwise wouldn't have had. At a glance, I understood that there are two types of hierarchical clustering method - agglomerative and divisive. That is a small amount of new knowledge, but it's more than I could have gleaned from a glance at a block of text. I don't think the graphic is without merit.
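For readers who want to see that distinction concretely, here is a minimal sketch of the agglomerative (bottom-up) side, assuming one-dimensional points and single-linkage distance. The function name `agglomerative` and the sample data are made up for illustration and are not taken from the linked site. A divisive method would go the other way: start with one cluster holding everything and split it repeatedly.

```python
def agglomerative(points, k):
    """Bottom-up clustering: start with one cluster per point and
    repeatedly merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (an assumption): distance between the
                # closest pair of members of the two clusters.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical sample data: three visually obvious groups.
result = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 3)
```

With this data the two tight pairs are merged first and the isolated point stays on its own.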

The graphic does what it's intended to do: give a high-level overview of machine learning concepts and techniques. Even if not perfect, it's a lot better than a block of text.

it still isn't the way i'd do it. the sheer number of different methods is intimidating -- there are a lot of commonalities of the methods that aren't depicted.

if you wanted to make things clear you're better off picking ONE problem and ONE method applied to the problem. the real challenges in understanding ML or DM are pretty much the same with all methods.

a graphic like that creates the illusion that it's transferred some knowledge when really it hasn't. that's why infographics are dangerous. if HN was my site i'd have it reject any article that has an image that's > 800 pixels high in it.

"it still isn't the way i'd do it."

I look forward to your attempt!

It just helped me a lot. Last night I'd drilled down reading about Support Vector machines and seeing that chart put a bunch of topics in focus.

The only thing this graphic does is that it generates confusion and reinforces misconceptions. A tree diagram is completely inadequate for this task - they say that mathematics is about "analogies between analogies" for a reason.

I also disagree with you. I consider the tree extremely easy to follow even though it is not perfect. I guess some people use visualization more than others, which is completely fine.

I also wouldn't construct the hierarchy in this way.

I would put the statistics under modelling or summatory analysis, not under exploratory analysis. I wouldn't divide methods by sufficient descriptive statistics - I'd divide them by domain and loss function. I would name an artificial neural network.

Or perhaps I would arrange them by computational complexity.

That said, this is a good summary in that it's an overview, and overviews are always helpful.

Can you point us to anything that you would see as a good introduction to the prediction side of machine learning?

The "explaining the past" (and I suppose the regression) section is covered well by most universities' statistics departments, but the "predicting the future" stuff is usually limited to one or two Artificial Intelligence classes in the computer science faculty.

> Like any Chinese Encyclopedia classifications

As in "those that belong to the Emperor", etc., as Borges put it ( http://www.multicians.org/thvv/borges-animals.html )

(I suppose, but I do have a pro-Borges bias ;-)

I think it's a nice overview. I don't mind that certain things appear more than once as "leaf nodes", because it shows that the same methods can be used for different things. Visualizing this with only one "leaf node" each would have been more messy in my opinion. I also think the differentiation of Classification and Regression is justified, because while Regression can be used for Classification, it's not quite the same thing.

I'm not an expert, but in my opinion the difference is that Classification is sorting a basket of apples and bananas into two separate baskets, while Regression is predicting which fruit will come out of the basket after X apples and Y bananas.

A quick rule of thumb would be to never trust anyone who claims "regression" is a separate topic from "classification". Oh yes, and in this case it is beyond awful.

Why? I can think of another quick rule of thumb: never trust anyone who doesn't explain the claims they make.

Looking at the four nodes below "regression" and those below "classification" in the OP, we see they are the same. The reason is that the same methods can often be used for both.

This is what the comment means when it says the division between regression and classification is often artificial (i.e., imposed by people who are concentrating on taxonomy).

It's just a comment.

Can you explain why you feel that way?

He's making a fair point, which is that certain types of classification methods basically involve "fitting" a hyperplane to a dataset in such a way that the hyperplane divides the data into different classes [1]. In trivial cases, we may be able to do this in two dimensions, in which case the "hyperplane" is just a line bisecting two different classes of data. It's easy to see how this can be conceptualized as a type of regression.

[1] http://en.wikipedia.org/wiki/Support_vector_machine
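To make the "fitting a line between two classes" picture concrete, here is a rough sketch in two dimensions. It uses a perceptron rather than an SVM, purely to keep the example short: a perceptron finds some separating line, whereas an SVM looks for the maximum-margin one. The function names and the toy data are assumptions made up for illustration.

```python
def train_perceptron(points, labels, max_epochs=1000):
    """Find weights (w, b) so that sign(w . x + b) matches each label.
    Labels are +1 or -1; the data is assumed to be linearly separable."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            # A point is misclassified when it lies on the wrong side
            # of the line (or exactly on it).
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                # Nudge the line toward classifying this point correctly.
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            break  # every training point is on the correct side
    return w, b


def predict(w, b, point):
    """Classify a point by which side of the line it falls on."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1


# Hypothetical toy data: one class near the origin, one further out.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
          (3.0, 3.0), (4.0, 3.0), (3.0, 4.0), (4.0, 4.0)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]

w, b = train_perceptron(points, labels)
predictions = [predict(w, b, p) for p in points]
```

The learned `w` and `b` describe a line `w[0]*x1 + w[1]*x2 + b = 0`, which is the two-dimensional "hyperplane" the comment refers to.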

Classification is regression with a discrete finite range.

Regression is often used to mean regression with infinite range, but especially since so many classification techniques are just searching for a regular partition of an infinite range, the distinction isn't made much anymore. Whether the rare techniques that are specifically over a discrete finite range still count as regression depends on who you ask - usually yes.
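One way to read this, sketched under simplifying assumptions: fit an ordinary least-squares line to 0/1 labels (a regression with a continuous output), then partition that output at 0.5 to get a class. The names `fit_line` and `classify`, the threshold, and the data are all illustrative choices, not anything specified in the thread.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def classify(slope, intercept, x, threshold=0.5):
    """Turn the continuous regression output into a discrete class by
    splitting its range at the threshold (0.5 is an assumption)."""
    return 1 if slope * x + intercept >= threshold else 0


# Hypothetical data: small x values belong to class 0, large ones to class 1.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]

slope, intercept = fit_line(xs, ys)
predicted = [classify(slope, intercept, x) for x in xs]
```

The fitted line itself has an unbounded range; the classifier is just that line plus a cut point, which is the "regular partition of an infinite range" idea in miniature.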

This is one of the first observations someone learning the subject would make.

It's often practically true, though.

While every method I can think of differentiates itself at a lower level than regression/classification, and thus has interpretations in each, it's not always feasible or easy to make the "flip". Oftentimes the approximation steps that make the algorithm tractable depend on the objective function.

This is really, really cool, thanks Dr. Sayad. I love how this allows people to see the entire process and then drill into each step as deeply as they want to go.

I really think this is the way complex topics need to be taught. It's so easy to get caught in the weeds and lose track of where you are in the overall picture; an approach like this is extremely helpful.

That's a great overview of data mining. I am wondering if you can somehow insert multivariate analysis into this flow chart. Multivariate analysis may be a good substitute for clustering methods. Thank you for providing such a clear picture.

another thought is: if it's possible (I do notice the copyright for this), could we put it on a wiki, so everyone can update it, with changes approved by Dr. Saed Sayad, to generate a comprehensive one?

I enjoyed reading all your comments. I would be glad to moderate the "wiki" version.
