

An Introduction to Data Mining - ssn
http://chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm

======
PaulHoule
That tree's a good example of how not to do it.

For one thing, it's representing something that's not really a tree. "Support
Vector Machine" and "Neural Networks" appear more than once as leaf nodes.
Like the authors of any Chinese Encyclopedia classification, they often succumb
to the temptation to add nodes that say "Other" to keep the length of the branches
constant. (They could probably think of some name for what neural nets and the
SVM have in common -- they've got all day to think about this stuff because
they get paid to teach and to do research, it's not like they are harried
practitioners.)

At some point I quit distinguishing regression and classification. There was a
time when I knew some tricks for classification and regression seemed
mysterious. Once I got over my mental block it seemed pretty obvious that much
of my box of tricks worked for regression too.
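
To make that concrete, here is a minimal sketch (not anything taken from the linked map) of one trick serving both purposes: k-nearest neighbours, where classification takes a majority vote over the neighbours' labels and regression averages their numeric targets. The function names and the toy data are made up for illustration.

```python
from collections import Counter


def nearest_targets(train, query, k):
    """Return the targets of the k training points closest to query.

    train is a list of (features, target) pairs; features are tuples of floats.
    """
    def squared_distance(pair):
        features, _ = pair
        return sum((a - b) ** 2 for a, b in zip(features, query))

    closest = sorted(train, key=squared_distance)[:k]
    return [target for _, target in closest]


def knn_classify(train, query, k=3):
    """Classification: majority vote over the neighbours' labels."""
    votes = Counter(nearest_targets(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: mean of the neighbours' numeric targets."""
    targets = nearest_targets(train, query, k)
    return sum(targets) / len(targets)


# Hypothetical one-dimensional toy data: the same inputs, two kinds of target.
points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
labeled = list(zip(points, ["low", "low", "low", "high", "high", "high"]))
valued = list(zip(points, [1.1, 1.9, 3.2, 9.8, 11.1, 12.3]))

print(knn_classify(labeled, (2.5,)))
print(knn_regress(valued, (2.5,)))
```

Only the last step differs (vote versus average); the neighbour search is shared, which is the sense in which the box of tricks carries over.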

Another issue is that it's not a good graphic for the web. You could probably
print this out and read it, but you can't take it in at a glance on the web,
which defeats the purpose of it being an infographic.

~~~
dpatru
The graphic does what it's intended to do: give a high-level overview of
machine learning concepts and techniques. Even if not perfect, it's a lot
better than a block of text.

~~~
PaulHoule
it still isn't the way i'd do it. the sheer number of different methods is
intimidating -- there are a lot of commonalities of the methods that aren't
depicted.

if you wanted to make things clear you're better off picking ONE problem and
ONE method applied to the problem. the real challenges in understanding ML or
DM are pretty much the same with all methods.

a graphic like that creates the illusion that it's transferred some knowledge
when really it hasn't. that's why infographics are dangerous. if HN was my
site i'd have it reject any article that has an image that's > 800 pixels high
in it.

~~~
jimbokun
"it still isn't the way i'd do it."

I look forward to your attempt!

------
asrk
I think it's a nice overview. I don't mind that certain things appear more
than once as "leaf nodes", because it shows that the same methods can be used
for different things. Visualizing this with only one "leaf node" each would
have been more messy in my opinion. I also think the differentiation of
Classification and Regression is justified, because while Regression can be
used for Classification, it's not quite the same thing.

I'm not an expert, but in my opinion the difference is that Classification is
sorting a basket of apples and bananas into two separate baskets, while
Regression is predicting which fruit will come out of the basket after X
apples and Y bananas.

------
dvse
A quick rule of thumb would be to never trust anyone who claims
"regression" is a separate topic from "classification". Oh yes, and in this
case it is beyond awful.

~~~
lliiffee
Can you explain why you feel that way?

~~~
grammr
He's making a fair point, which is that certain types of classification
methods basically involve "fitting" a hyperplane to a dataset in such a way
that the hyperplane divides the data into different classes [1]. In trivial
cases, we may be able to do this in two dimensions, in which case the
"hyperplane" is just a line separating two different classes of data. It's easy
to see how this can be conceptualized as a type of regression.

[1] <http://en.wikipedia.org/wiki/Support_vector_machine>
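
As a rough illustration of the idea, the sketch below fits a line to two-dimensional points by ordinary least squares, using class labels coded as -1 and +1 as the regression targets, and then classifies by the sign of the fitted value. Note that this is plain least squares, not an SVM (an SVM picks the hyperplane by maximizing the margin instead), and the function names and toy data are invented for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_least_squares(points, targets):
    """Fit targets ~ w0 + w1*x + w2*y by solving the normal equations."""
    rows = [[1.0, x, y] for x, y in points]
    size = 3
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(size)]
    return solve_linear_system(ata, atb)


def classify(weights, point):
    """Classify by which side of the line w0 + w1*x + w2*y = 0 the point is on."""
    score = weights[0] + weights[1] * point[0] + weights[2] * point[1]
    return 1 if score >= 0 else -1


# Hypothetical toy data: two well-separated clusters, labels coded as -1 / +1.
points = [(1, 1), (2, 1), (1, 2), (2, 2), (5, 5), (6, 5), (5, 6), (6, 6)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]

weights = fit_least_squares(points, labels)
print([classify(weights, p) for p in points])
```

The fitted line where the score equals zero is the decision boundary; everything before the final sign check is an ordinary regression.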

------
jasonkolb
This is really, really cool, thanks Dr. Sayad. I love how this allows people
to see the entire process and then drill into each step as deeply as they want
to go.

I really think this is the way complex topics need to be taught. It's so easy
to get caught in the weeds and lose track of where you are in the overall
picture; an approach like this is extremely helpful.

------
lightoverhead
That's a great overview of data mining. I am wondering if you can somehow
insert multivariate analysis into this flow chart. Multivariate analysis may
be a good substitute for clustering methods. Thank you for providing such a
clear picture.

~~~
lightoverhead
Another thought: if it's possible (I do notice the copyright on this), could
we put it on a wiki, so that everyone can update it, subject to approval by
Dr. Saed Sayad, and build a more comprehensive version?

------
saedsayad
I enjoyed reading all your comments. I would be glad to moderate the "wiki"
version.

