
A Programmer's Guide to Data Mining - carlosgg
http://guidetodatamining.com/
======
TrainedMonkey
While scanning the table of contents, I thought this was simple stuff. But
then I dived into a chapter and was converted. This is good because it shows
you how to use all those techniques in the real world, with examples mining
data from Twitter and Facebook streams. Probably the best hands-on guide I've
seen for data mining/sentiment analysis.

------
terramars
before i say anything directly about the book, i'd like to point out that for
simple systems (like these), the most challenging parts are overwhelmingly
data collection, normalization / featurization, and model testing, rather than
actually creating or using models. while there are rare cases where a simple
solution (hey, let's throw naive bayes at it) will give you a good answer,
these are almost always because someone did a very good job collecting and
sanitizing the input. furthermore, stuff like the twitter movie sentiment
analysis - while great in theory - rarely ends up doing what you expect in
practice. product recommendation and collaborative filtering are proven to
work very well in practice, but sentiment systems are a totally different
monster.

onto the book - it looks promising for an intro to recommendation systems. no
opinion about classification yet. doesn't appear to have anything on graphs or
network effects which is somewhat disappointing. that being said i need to
review bayesian stuff / teach myself some of the harder stuff and it will be
nice to have a practical walkthrough.

that being said no one should be implementing these themselves (except the
dumb stuff like distance metrics).. it's useful to learn but scikit-learn is
amazing when it comes to fancy algorithms.
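
For reference, the "dumb stuff" mentioned above really is only a few lines. What follows is a minimal plain-Python sketch of three common distance/similarity measures, not code taken from the book; the function names are made up for illustration and the inputs are assumed to be equal-length lists of numbers.

```python
import math


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Anything beyond this (handling missing ratings, sparse vectors, large matrices) is where a library such as scikit-learn starts to pay off.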

------
ville
This looks nice. I've also heard many recommendations for the book Programming
Collective Intelligence[1], which touches the same subjects and also has
examples in Python. Now I'm tempted to read both :)

[1]: [http://shop.oreilly.com/product/9780596529321.do](http://shop.oreilly.com/product/9780596529321.do)

~~~
samuel
Take a look at «Building Machine Learning Systems with Python». It's a great
read. If you are interested, I reviewed it at O'Reilly's site.

~~~
jlees
Both are good recommendations. Building ML systems isn't really about building
ML systems, it's more about introductory machine learning techniques using
Python libraries, and covers much of the same ground. I found Programming
Collective Intelligence more practical, and the copy of Building Machine
Learning Systems I have does have quite a few errors in it, which I suppose
makes for more interesting experimentation (fixing example code certainly
helps you understand it, no?).

------
natebod
Looking through chapter 6 on Bayesian Classifiers. I do not think it is
correct from page 52. He appears to be using the p.d.f. of the standard normal
distribution for point estimators. I have training in classical/frequentist
stats, so correct me if I'm wrong, but probability estimates from a pdf are
given by the area under the curve; the value at a point is meaningless. In
fact the probability at a given point is always zero.

~~~
yetanotherphd
You are right, the author uses "probability" and "probability density"
interchangeably although this is technically incorrect. In spite of the loose
language, the presentation is still essentially correct.
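
To make the distinction concrete, here is a small sketch, assuming the classifier in question is a Gaussian naive Bayes with one numeric feature. It is not the book's code, and the class names, priors, means and standard deviations are invented. It shows that a density value can exceed 1 (so it is not a probability), that an actual probability comes from the area over an interval, and that comparing prior-weighted densities across classes still gives a valid posterior once normalized.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x. This is not a probability."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))


def normal_cdf(x, mean, std):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def posterior(x, classes):
    """classes maps name -> (prior, mean, std). Returns normalized posteriors."""
    scores = {
        name: prior * normal_pdf(x, mean, std)
        for name, (prior, mean, std) in classes.items()
    }
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Hypothetical two-class setup, chosen only for illustration.
classes = {"a": (0.5, 0.0, 0.1), "b": (0.5, 1.0, 0.1)}

density_at_mean = normal_pdf(0.0, 0.0, 0.1)  # greater than 1
prob_within_one_std = normal_cdf(0.1, 0.0, 0.1) - normal_cdf(-0.1, 0.0, 0.1)
post = posterior(0.2, classes)  # sums to 1 and favours class "a"
```

The densities only ever appear as ratios between classes, so the missing "width of the interval" factor cancels out in the normalization. That is why the loose wording does not break the classifier.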

------
pigscantfly
As an alternative for anyone who wants to delve a little further into data
mining, I'm currently taking the Stanford data mining class, STATS202. The
book we're using has been really great (published this year) and covers a
great deal more than this site seems to. It's called "An Introduction to
Statistical Learning with Applications in R." It's free online through the
Stanford libraries, but I'm not sure about accessing it for free elsewhere.
The lectures are also probably recorded online somewhere, if anyone is really
interested.

~~~
davis
Here's the link to the online Stanford course:
[http://online.stanford.edu/course/statistical-learning-winte...](http://online.stanford.edu/course/statistical-learning-winter-2014).
It is being taught by two of the authors too. I definitely recommend checking
it out.

------
SeppoErviala
Check out gensim if you want to do topic modeling or similarity comparisons in
Python.

[http://radimrehurek.com/gensim/](http://radimrehurek.com/gensim/)

It has good implementations of various algorithms, some of which support
streaming or distribution, and it allows loading and dumping data in various
formats.

I've used it for building a content-based recommender using tf-idf, LSI and a
similarity index. After the index is built, queries to it are really fast. It
can handle quite large corpuses with little memory.
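
For anyone unfamiliar with the idea, a tf-idf similarity index can be sketched in plain Python as below. This is deliberately not gensim's API, and it skips LSI entirely; all function names and the toy documents are made up for illustration, and a real setup would use the library's own streaming and indexing classes.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector (dict) for one tokenized document."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(docs):
    """docs is a list of token lists. Returns (tf-idf vectors, idf table)."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    idf = {term: math.log(len(docs) / df) for term, df in doc_freq.items()}
    return [tfidf_vector(tokens, idf) for tokens in docs], idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_tokens, vectors, idf):
    """Return (score, doc_index) pairs, best match first."""
    query = tfidf_vector(query_tokens, idf)
    scores = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Toy corpus, invented for this example.
docs = [
    ["python", "data", "mining"],
    ["python", "web", "framework"],
    ["cooking", "pasta", "recipe"],
]
vectors, idf = build_index(docs)
ranking = most_similar(["data", "mining"], vectors, idf)
```

The index (`vectors` and `idf`) is built once; each query afterwards only costs one pass over the stored vectors, which is the same reason queries against a prebuilt index feel fast.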

~~~
sbrother
Second this, I'm surprised you don't read more about it here. We use it in
production to recommend image search terms based on unstructured text, and it
performs better with a few lines of Python code than anything our team could
write in a lower-level language in months. It's REALLY fast once you've built
an index.

The reason for that is a pretty epic list of dependencies (have fun explaining
why the prod boxes need a fortran compiler), but in terms of efficiency and
speed of development it's an obvious choice.

~~~
Radim
:-)

Hopefully the SciPy & BLAS dependencies will only get easier to install from
now on... Continuum Analytics received shit loads of money and some of it is
going towards better scientific Python packaging, I believe.

------
crandles
This is from one of my college professors. Nice to see it make it on HN, and
it looks like there's a bit more material since I used it in class. I found it
helpful in explaining basic concepts (more so than the bland textbook that I
had to pay for).

------
garraeth
I just scanned it but it looks awesome! Thanks for putting this together! I
didn't look terribly hard but did you mention the work of Ziegler and Golbeck:
"Investigating interactions of trust and interest similarity"? It's a bit old
(2006) but I think it's a great reference for real-world engines and helped me
a ton back in the day.

------
sown
This is neat! Fantastic, even! The math is less theoretical and more
systems-oriented. The choice of Python, modern pseudocode that runs, is great,
too. The naive Bayes chapter is useful, too. One might want to look at
Udacity's AI course for more info about this topic or as a supplement. Bayes
seems to be one of those things where the math is short and difficult; I've
been reading about it recently, myself. Just practice, I guess. To engineer
stuff with it you may not need to understand it perfectly (until you get
bugs ;). Anyways, it's still good. It's a hard topic and Bayes' law/tricks
appear in AI often so it's worth knowing more about.

Thank you, Ron Zacharski!

(disclaimer: you do not want my opinion regarding any topic).

------
gautamnarula
This looks great! Is there an email list or any other way I can get
notifications as new material is added/revised?

------
sushirain
After reading chapter two, my conclusion is that this book is also suitable
for high-school level. Not many books simplify things as much as this one.
The Python implementation even avoids NumPy, which makes it very easy to
understand (even though using NumPy is more practical).

------
frik
Is the code also available in C-like syntax? (C, C++, PHP, JS, etc)

Porting Python code can be painful. (I checked the chapter 7 py file and it
isn't filled with functional style code, though various kinds of arrays with
index starting with 1 or so may still be an issue)

~~~
brianobush
sounds like a great holiday project! seriously, most of these algorithms are
easy to code up in C. Parsing and encoding handling can be tricky. If you
stick with latin-1 encoded materials, you should be fine.

~~~
minikomi
I'm enjoying porting it to racket to confirm I'm learning it correctly. A
great way to learn.

------
cmao3
My feeling is that it's a very interesting book even for high school kids.

------
karangoeluw
This is awesome. How about a PDF with all chapters combined?

------
lovegratisbooks
As of January 5, 2014, the pdf for this book will be available for free, with
the consent of the publisher, on the book website.

------
nashequilibrium
I actually went through this book almost two years ago, i remember the author
did not finish it, but i enjoyed it! Thanks!

------
ewharton
This is great - I love that it's in Python

------
LambdaAlmighty
Not bad as an introductory text, but the code could use some love.
Disappointing when it says "programmer's" in the title.

Ever heard of PEP8 for Python coding style? List comprehensions?

I'm afraid this falls in no man's land, with code too weak for practitioners
and theory too weak for theoreticians.

~~~
nkarpov
...you're seriously nitpicking on the _least_ important things here and then
using it as evidence to say it's 'too weak for practitioners'... come on.

