
Think Stats: Probability and Statistics for Programmers - samrat
http://greenteapress.com/thinkstats/html/index.html
======
tokenadult
My favorite references on statistics get to an idea that I think you will be
inclined to emphasize: DATA is what matters in statistics, more than
mathematical manipulation. See what you think about these, and best wishes for
further revisions of your interesting online guide.

<http://statland.org/MAAFIXED.PDF>

The first link is a BEAUTIFUL and thought-provoking discussion of what's
dangerous about having a new mathematics professor choose the statistics
textbooks for an introductory college class in statistics, with advice on what
to look for in statistics textbooks.

<http://escholarship.org/uc/item/6hb3k0nz>

The second link is by a very famous statistician, with discussion of how the
statistics curriculum could be revised to better emphasize the most important
ideas.

Put the key ideas from these two resources in your own words, and you will
have a good guide for programmers on how to think about statistics.

~~~
AllenDowney
Very interesting. I certainly agree with the idea from the first article, on
the importance of data, and from the second article, on the usefulness of
simulation.

~~~
AllenDowney
I read the first article more carefully, and it inspired this new blog post:
[http://allendowney.blogspot.com/2011/08/jimmy-nut-company-
pr...](http://allendowney.blogspot.com/2011/08/jimmy-nut-company-problem.html)

------
JeanPierre
If you've gone through this book and want to learn more ways to use statistics
and probability, you should look at the videos from mathematicalmonk[1] on
YouTube. He's like Khan for machine learning and more rigorous probability.
In my experience, the videos help a lot with learning machine learning and
hidden Markov models.

[1] <http://www.youtube.com/user/mathematicalmonk>

~~~
tathagatadg
Awesome .. just what I was looking for!

------
mvzink
Who's going to try reading some of this before doing the ML or AI Stanford
courses?

~~~
edtechre
That's partly why I bought it. My university didn't have a statistics
department, so I never had the chance to take a statistics course that was
worthwhile.

Statistics and linear algebra really should be required by all CS programs.
It's funny that at many schools those courses are not, yet Calculus is. First,
Calculus should have been handled in HS. Second, I've never had a use for
Calculus professionally or for anything I've worked on in my free time.

~~~
edtechre
And on that note, does anyone have any other good book suggestions regarding
ML?

~~~
maurits
Off the top of my head,

\- Machine Learning by Tom M Mitchell <http://www.cs.cmu.edu/~tom/mlbook.html>

For general reading and introductions I also like:

\- Pattern Classification by Richard Duda

\- Pattern Recognition and Machine Learning by Christopher Bishop

For a bit more emphasis on statistics and math, I usually dive in to

\- Classification, Parameter Estimation and State Estimation by van der Heijden

And last, but certainly not least:

\- Information Theory, Inference, and Learning Algorithms by David MacKay,
available here:

<http://www.inference.phy.cam.ac.uk/mackay/itila/>

~~~
edtechre
Excellent, many thanks.

I've read O'Reilly's Collective Intelligence. It's a great introductory
survey, but it was very light on theory.

I also own Collective Intelligence in Action. It had more explanation of
theory than O'Reilly's offering, but most of the chapters devolved into how to
use Java data mining framework X.

------
DanielRibeiro
No mention of the Z-test at all[1]? I guess that, at least for hypothesis
testing, Wikipedia did a more comprehensive job[2].

[1] <http://en.wikipedia.org/wiki/Z-test>

[2] <http://en.wikipedia.org/wiki/Statistical_hypothesis_testing>

~~~
AllenDowney
Rather than get into a catalog of tests, I take the approach that "There is
only one test." I wrote more about it here:
[http://allendowney.blogspot.com/2011/05/there-is-only-one-
te...](http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html)
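
One common way to realize a single, simulation-based test is a permutation
test. The sketch below is only an illustration under stated assumptions, not
necessarily the procedure in the linked post: the function name
`permutation_test`, its parameters, and the choice of the absolute difference
in means as the test statistic are all hypothetical.

```python
import random


def permutation_test(group_a, group_b, iters=1000, seed=0):
    """Estimate a p-value for the difference in means by simulation.

    Null model (an assumption of this sketch): both groups come from the
    same distribution, so the group labels are interchangeable.
    """
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    # Test statistic on the observed data.
    observed = abs(mean(group_a) - mean(group_b))

    pooled = list(group_a) + list(group_b)
    n = len(group_a)
    count = 0
    for _ in range(iters):
        # Simulate the null model by shuffling the labels.
        rng.shuffle(pooled)
        simulated = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if simulated >= observed:
            count += 1

    # Fraction of simulated statistics at least as extreme as the observed one.
    return count / iters
```

Swapping in a different test statistic or a different null model changes the
two marked steps; the surrounding loop stays the same.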

~~~
Jach
Furthermore, the entire classic test-based approach is bad for another reason:
it encourages binary hypotheses when many real-world problems don't have them;
the hypotheses may be many, composite, or even infinite. (Of course, many
real-world problems can be reduced to binary ones, which is one reason the
approach became popular.)

If you know enough probability theory, statistics is just a special case. The
nice thing about using probability theory is if you do decide to use a 'test',
all of your assumptions are put forth first. As E.T. Jaynes says:

    
    
        In estimating a location parameter, for example, the sample median M is often cited as
        a more robust estimator than the sample mean. But here it is obvious that this
        ‘robustness’ is bought at the price of insensitivity to much of the relevant
        information in the data. Many different data sets all have the same
        median; the values above or below the sample median may be moved about arbitrarily
        without affecting the estimate. Yet those data values surely contain information
        highly relevant to the question being asked, and all this is lost. We would have
        thought that the whole purpose of data analysis is to extract all the information
        we can from the data.
        
        Thus, while we agree that robust/resistant properties may be desirable in some cases,
        we think it important to emphasize their cost in performance. In the literature,
        ad hoc procedures have been advocated on no more grounds than that they are ‘robust’
        or ‘resistant’, with no mention of the quality of the inference they deliver, much
        less any comparison of performance with alternative methods; yet alternative methods
        such as Bayesian ones are criticized on grounds of lack of robustness,
        without any supporting factual evidence.
    

A recent probability book I've started that I think is pretty good is
<http://uncertainty.stat.cmu.edu/>

------
edtechre
Great book. But I made the mistake of buying this book on Kindle just a few
hours before I saw it show up here.

:(

~~~
AllenDowney
Sorry! It's really not meant to be a trick. The book is under a CC license. I
provide the PDF version at Green Tea Press, the LaTeX source for anyone who
wants to make a modified version, and a not very good HTML version (some of
the math is broken). O'Reilly provides the printed and Kindle versions.

Right now all versions have the same content, but I will continue to revise,
so you can think of the version on Green Tea Press as the draft of the second
edition.

I know it can be confusing, but I hope the benefits of the free license make
up for it.

~~~
lylejohnson
Thanks for the clarification, Allen. It's a great tutorial, I was glad to pay
for it, and I will be recommending it to others!

