
Ask HN: Statistics for hackers? - haliax
Hi again HN,<p>I've been trying to learn more about statistics of late, motivated by some really fantastic applications I've seen, like automated composition of music, medical models, and stock market tools.<p>Atm I've been going through the book Elements of Statistical Learning, which I got from the frontpage a few days ago. But it's kind of slow going, since without really knowing how things relate to each other all I can do is go through it sequentially. What I want is to jump in with both feet, and start writing cool code.<p>Does anyone know of good books or articles for someone in my situation? Or can you give me sort of a minimal spanning roadmap for what I need before I can start having some fun?<p>I know about basic probability theory, bayesian text classification and hidden markov models, but that's about all.
======
drats
I can't speak highly enough of "Programming Collective Intelligence" by Toby
Segaran. It's not everything, you'd need other books, but it covers "fantastic
applications" of techniques. It has really clear explanations from real-world
scenarios, followed by extremely clear python code and with a second
explanation with code and maths of each technique used in the appendix. Check
out the contents on Amazon.

~~~
gtani
some other helpful books:

\- Data Mining, by Witten and Frank; describes the basics with rigor, including
how to use Weka, which they wrote

[http://www.amazon.com/Data-Mining-Practical-Techniques-
Manag...](http://www.amazon.com/Data-Mining-Practical-Techniques-
Management/dp/0120884070/)

a couple java-based books from Manning:

\- Collective Intelligence in Action (by Satnam Alag) and

\- Algorithms of the Intelligent Web (Marmanis, Babenko)


------
waldrews
Sounds like you're more into probability modelling and machine learning than
statistics in the traditional hypothesis testing sense. Besides ESL, a book
I'd recommend is Bishop's Pattern Recognition and Machine Learning. It starts
from the beginning of probability theory applied to computer science problems,
and covers every modern topic.

videolectures.net is filled with lectures on CS-flavored probability modelling
and machine learning topics. The best bet is the multi-hour "tutorial" lecture
series and minicourses; it may take a while to choose the right starting
point.

For serious stats and probability without the CS flavoring (not useful for the
quick-road-to-hacking-power agenda):

For classical deep stats theory, everyone I know begins with Casella and
Berger's Statistical Inference. Don't expect algorithms in this though.

On the probability side: Feller's An Introduction to Probability Theory and
Its Applications. Deep, readable, sometimes funny, full of "whoa" insights.
Would be hard to actually grok every chapter in both volumes, but you read it
for insight into the power of probability and then use it as a reference.

(grad student in stats, among other things)

------
etal
Try this:

<http://www.bmj.com/collections/statsbk/index.dtl>

The examples you gave make me think you've done some applied things with those
specific techniques, but haven't covered the theory and related areas in
depth. That's fine; the Square One series is simpler but comprehensive, so
you'll be in good shape after that to investigate more on your own.

~~~
haliax
Thanks! I'll give it a shot!

------
caffeine
Go download David MacKay's Information Theory, Inference and Learning
Algorithms (free book). Go through the part on Bayes and the part on Neural
Nets (and the info. theory part if you want to, which is fascinating but not
as directly relevant), which is a total of roughly 20-30 chapters, some very
short. Do as many exercises as you can do (i.e. try them all, fail and come
back later if necessary), and try implementing those algorithms. That will get
you boned up on this stuff generally.

From there:

Standard references are Hastie and Tibshirani which you already have, Pattern
Recognition by Duda Hart and Stork, and PRML by Chris Bishop (though I found
it boring - too many unmotivated equations). All of Statistics and especially
All of Nonparametric Statistics by Wasserman are both excellent books which
will fairly rapidly get you introduced to large swaths of statistical models.
Papoulis (1993) is quite a good reference on statistics in general, and Cover &
Thomas is the usual reference of choice for information theory (which is very
relevant to what you're interested in), but neither of those are much fun to
actually read.

You seem less interested in classification/ML problems and more interested in
straight-up stats and/or timeseries stuff. So some slightly deeper references:

\- Given your interests you might absolutely love Kevin Murphy's PhD thesis on
Dynamic Bayes Nets, which are excellent for describing phenomena in all three
fields you mentioned.

\- Check out Geoff Hinton's work, especially on deep belief nets (there's a
Google tech talk and a lot of papers).

\- Hinton and Ghahramani have a tutorial called "Parameter Estimation for
Linear Dynamical Systems", which could be directly applicable to the models
you're talking about

\- If you're interested in these dynamic, causal models you'll want to learn
about EM (which you should know already since you know HMMs), and its
generalization Variational Bayes. MacKay has a terse chapter on variational
inference; <http://www.variational-bayes.org/vbpapers.html> has more. One of
those is an introductory paper by Ghahramani and some others, which is nice.

\- Pretty much everything on <http://videolectures.net> will excite you.
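To make the EM idea above concrete, here is a toy sketch (made-up data, not taken from any of the references): fitting a two-component 1-D Gaussian mixture by alternating soft assignments (E-step) with weighted parameter re-estimates (M-step).

```python
import math
import random

random.seed(0)
# Made-up data: two well-separated clusters
data = ([random.gauss(-2.0, 0.5) for _ in range(200)]
        + [random.gauss(3.0, 1.0) for _ in range(200)])

def pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu, sigma, pi = [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]  # rough initial guesses

for _ in range(50):
    # E-step: responsibility of each component for each point
    resp = []
    for x in data:
        p = [pi[k] * pdf(x, mu[k], sigma[k]) for k in range(2)]
        total = sum(p)
        resp.append([pk / total for pk in p])
    # M-step: re-estimate each component from its weighted points
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigma[k] = math.sqrt(sum(r[k] * (x - mu[k]) ** 2
                                 for r, x in zip(resp, data)) / nk)
        pi[k] = nk / len(data)

print(mu)  # means should land near -2 and 3
```

The same E/M alternation, with temporal structure added, is what Baum-Welch does for HMMs and what the Ghahramani/Hinton tutorial does for linear dynamical systems.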

Some of those references (esp. the VB stuff) can get slightly hairy in terms
of the maths level required (depending on your background). Bayesian
Computation with R (by Jim Albert), or Crawley's R book (for a more frequentist
approach), can get you started using R which can avoid you needing to
implement all this stuff yourself, as much of it is already implemented. This
might be your fastest route to writing code that does cool stuff - understand
what the algo is, use somebody else's implementation, apply it to your own
problem.

------
tokenadult
Here (perhaps for onlookers more than for your exact case) are two more
favorite recommendations for free Web-based resources on what statistics is as
a discipline, both of which recommend good textbooks for follow-up study:

"Advice to Mathematics Teachers on Evaluating Introductory Statistics
Textbooks" by Robert W. Hayden

<http://statland.org/MyPapers/MAAFIXED.PDF>

"The Introductory Statistics Course: A Ptolemaic Curriculum?" by George W.
Cobb

[http://repositories.cdlib.org/cgi/viewcontent.cgi?article=10...](http://repositories.cdlib.org/cgi/viewcontent.cgi?article=1002&context=uclastat/cts/tise)

Both are excellent introductions to what statistics is as a discipline and how
it is related to, but distinct from, mathematics.

A very good list of statistics textbooks appears here:

[http://web.mac.com/mrmathman/MrMathMan/New_Teacher_Resources...](http://web.mac.com/mrmathman/MrMathMan/New_Teacher_Resources.html)

------
silentbicycle
This isn't directed at the author of the top post, but _The Cartoon Guide to
Statistics_ is actually pretty good. It's a quick read (cartoons, hey), but
works well as a quick refresher, or would be enough of an intro to pick up
terminology for more pointed questions. (It's also fairly cheap, libraries
might have it, etc.)

------
haliax
If it helps, the sort of questions that interest me are: what types of sounds
are pleasing to the ear, or, what rules do I have to constrain randomness
within before I can generate music that sounds like _____, or (presumably with
a factor model) what team is most likely to win the world cup =P

~~~
keenerd
You could try looking at existing generative music schemes. This one is
classic[al]:

<http://en.wikipedia.org/wiki/Musikalisches_Würfelspiel>
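The scheme behind it is tiny: precomposed measures are indexed by dice rolls, so a "composition" is just one random choice per slot in the phrase. A sketch, with made-up measure names:

```python
import random

# Hypothetical table: for each of 4 slots in the phrase,
# a few interchangeable precomposed measures to choose among.
measures = {
    0: ["m0a", "m0b", "m0c"],
    1: ["m1a", "m1b", "m1c"],
    2: ["m2a", "m2b", "m2c"],
    3: ["m3a", "m3b", "m3c"],
}

def roll_phrase(rng=random):
    # One dice roll (random choice) per slot, as in Mozart's dice game
    return [rng.choice(measures[i]) for i in range(len(measures))]

print(roll_phrase())  # e.g. one measure name per slot
```

The constraint that makes the output sound coherent is baked into the table: every measure in a slot is composed to be harmonically interchangeable.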

------
sh1mmer
I've been pretty impressed by the O'Reilly book on statistics:
[http://www.amazon.com/Statistics-Nutshell-Desktop-
Reference-...](http://www.amazon.com/Statistics-Nutshell-Desktop-Reference-
OReilly/dp/0596510497/ref=sr_1_1?ie=UTF8&s=books&qid=1256505041&sr=8-1)

~~~
fhars
I am pretty ambivalent about that book. It covers a lot of territory with
pointers to deeper literature, with a clear emphasis on using these techniques
with a statistics program like SPSS (or R), which is nice, but it still wastes
a lot of space printing intermediate tables for simple examples, which do
little except fill pages.

But the real bummer is the editorial quality. There aren't three consecutive
pages without major typographical or editorial errors, like missing
parentheses in complex formulas, or cases where they obviously replaced
examples with simpler ones but forgot to change the illustrations together
with the text.

------
pmichaud
This is a site that's really good for beginners:

<http://www.statisticshowto.com>

And it's really relevant here because the approach to everything is step by
step -- the author of the site probably doesn't realize it, but the tutorial
steps practically read like pseudo code... seems like it could really help
you.

Also, there are some calculators on there, and I've seen the code, which isn't
bad, and it's not obfuscated so if you want to get an idea of how to implement
something, you can just look at the source directly.

------
lliiffee
One of the things EoSL emphasizes is that simple methods can often give very
good results. If you have been through the first few chapters of the book, you
should be able to do some cool stuff with nearest neighbors or linear
classifiers. (The reality is that on most problems, fancy methods give only a
slight improvement over simple ones, especially when the simple methods are
implemented by someone really skilled at getting good performance out of them.)

Nearest neighbors methods can be implemented in something like 3 lines, so you
have no excuse!
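For instance, a bare-bones 1-nearest-neighbor classifier really is only a few lines; here's a sketch on made-up 2-D points:

```python
def nearest_neighbor(train, query):
    """train: list of ((x, y), label); returns the label of the closest point."""
    return min(train,
               key=lambda p: (p[0][0] - query[0]) ** 2
                           + (p[0][1] - query[1]) ** 2)[1]

# Made-up toy data: two clusters labeled 'a' and 'b'
train = [((0, 0), 'a'), ((1, 0), 'a'), ((5, 5), 'b'), ((6, 5), 'b')]
print(nearest_neighbor(train, (0.5, 0.2)))  # 'a'
print(nearest_neighbor(train, (5.5, 4.0)))  # 'b'
```

Generalizing to k neighbors with a majority vote is another couple of lines.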

------
tsally
Hidden Markov models are serious stuff. If you already understand them,
leverage that power and go build an application! No point in learning the
theory if you aren't going to apply it.

~~~
haliax
I've built some text generators, and a POS tagger. But, for much of the more
interesting stuff I want to do, it seems like I need things like time-series,
or regression models -- which elude me entirely at the moment.
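(Regression models are less forbidding than they look; for reference, the core of an ordinary least-squares line fit is a few lines of pure Python, shown here on made-up data:)

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Made-up data lying near y = 1 + 2x
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.0, 8.9]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # approximately 1.08 1.97
```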

------
dotBen
The trick for me is to learn a good foundation in statistics but to know that
you don't need to learn everything.

A friend who founded a startup that makes heavy uses of statistics likes to
trawl academic papers for algorithms that help his business.

Think of research papers a bit like a well documented private object/library
-- you know what data it accepts, you know what it returns, but you don't need
to know how it works.

Just make sure your code reflects exactly the formulae/model documented in the
paper and you're good.

~~~
eru
And the assumptions in the paper.

------
Mongoose
I'm in a similar situation, as I'm trying to decide whether or not to minor in
statistics on top of computer engineering. From a hacker's perspective, I've
found that playing around with the language R is the best way to relate to the
field. Check out the tutorial _R for Programmers_ linked from here:
<http://heather.cs.ucdavis.edu/~matloff/r.html>

------
Anon84
<http://news.ycombinator.com/item?id=902478>

------
whimsy
As far as application of statistics to automated composition goes, definitely
check out the EMI project over at UC Santa Cruz.

------
jakecarpenter
iTunesU has some courses for stats, but they may not be suitable for learning
in a hurry.

-jc

------
zackattack
You should write a book, statistics for hackers. I would buy it. Make sure it
explains things really well. The best person to teach someone is a beginner*,
because they understand the beginner's perspective. So you are in a unique
position to create this.

O'Reilly's Statistics in a Nutshell is a good _reference_ book, but not quite a
textbook. Here you go. Including my refid.
[http://www.amazon.com/gp/product/0596510497?ie=UTF8&tag=...](http://www.amazon.com/gp/product/0596510497?ie=UTF8&tag=httpwwwhiph02-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=0596510497)

*True masters have beginner perspectives, so they are good teachers as well.

~~~
haliax
I'm tempted. Though might a blog be better? Also, it might be nice to have
someone to do it with, as then I could get explanations for things that
befoggle me.

~~~
zackattack
It'd be better to have it with a table of contents. You could probably get
sufficient explanations through a combination of research on wikipedia and
then probing #math on freenode.

~~~
anonymousDan
Just out of interest, how long does it generally take for a question to get
answered on #math? Does it vary quite a lot, or is it fairly stable (assuming
someone can actually answer it)?

~~~
zackattack
I typically get answers immediately, but I've only asked questions up to
college algebra.

------
joeycfan
That's a lot, actually.

