
Ask HN: I want to learn statistics and data mining - oscardelben
I want to learn statistics and data mining. I can't afford to buy many books right now, so I'm asking for one or two on the topic, and also one or two available online, as ordering through amazon would take a month for shipping.<p>I'd like to get your opinions before spending time on random books. Thanks a lot.<p>P.S. My math background is not very strong but I'm willing to learn.
======
owinebarger
I'll make a non-mathematical observation about data-mining. Real world data
sets are messy. Humans are just not the most reliable data entry machines.
Data sets are not always gathered for the purpose you intend to use them. You
have to validate/scrub/realign your data set before wasting time on an
analysis that could be rendered meaningless by these issues.

It's the old "Garbage In, Garbage Out" principle, but it's easy to forget
until the real world hammers it into you.

------
johnmyleswhite
If your math background is weak enough that you don't feel totally at home
using calculus and linear algebra, I'd recommend "Principles of Data Mining",
which they use for the intro to ML class at Princeton. The ratio of English to
equations is pretty high, which might help you to get into the swing of
things.

If you think that you can learn enough math to follow the material, you should
watch the Stanford lectures given by Andrew Ng. They're available freely
online through iTunes U. Andrew is also one of the most respected experts in
the field. As he points out, you will eventually need to understand the math
behind the methods. Otherwise, a day will come when you start wasting time
coding algorithms that plainly don't apply to the problem at hand.

------
mechanical_fish
This is probably not what you're looking for, but while trying to figure out
if you could still download a free copy of Drake's out-of-print _Fundamentals
of Applied Probability Theory_...

[HINT: The answer is yes: [http://ocw.mit.edu/OcwWeb/Electrical-Engineering-
and-Compute...](http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-
Science/6-041Spring-2005/RelatedResources/index.htm)]

...I found this potentially interesting page full of references:

[http://www.cscs.umich.edu/~crshalizi/notebooks/probability.h...](http://www.cscs.umich.edu/~crshalizi/notebooks/probability.html)

I post it here so that I can find it again later, but I fear that this is not
the answer to your question. These appear to be more probability theory than
nuts-and-bolts statistical analysis [1], and they may require more math
background than I have (and I managed to handwave my way to a solid state
physics Ph.D.)

Try browsing around the MIT OpenCourseWare site. One of the great things about
online courses, even if you don't _take_ the courses, is that they often
publish syllabuses that contain a lot of good textbook references.

[EDIT: Ooh, look, if you follow links from the page I mentioned earlier you
get here!

[http://www.cscs.umich.edu/~crshalizi/notebooks/teaching-
stat...](http://www.cscs.umich.edu/~crshalizi/notebooks/teaching-
statistics.html) ]

\---

[1] Not that _I_ can reliably tell the difference. I know just enough about
probability and statistics to know that I should probably study them some
more. ;)

------
phren0logy
If you are starting from scratch without a very strong math background, I'd
recommend:

1.) Head First Statistics -- Pretty good, but beware the section on Bayes
Theorem which is a bit off. This is a quick and casual intro, but worthwhile.
I used it to refresh me on my college stats course (which was a long time
ago), and I like it. There's also Head First Data Analysis, which I haven't
read but could be a reasonable companion. HF Data Analysis uses Excel and R.

2.) Using R for Introductory Statistics (Verzani) -- Good explanations and
exercises, and you will also learn R. This second point is actually pretty
important, because it's a very valuable tool. Whereas the Head First Stats
book walks through pretty simple problems that you work out in pencil, the
Verzani book has many real-world data sets to explore that would be
impractical to do by hand. That said, I think it's valuable to work things out
in pencil with the first book before you move on to this one.

After these books, Elements of Statistical Learning seems to be the current
favorite.

------
jmount
Unfortunately my favorites are a bit on the mathy side, so you may want to wait
for other commenters with better advice. But, there are two books I really
feel should not be missed. For statistics: "Statistics, Third Edition" David
Freedman, Robert Pisani, Roger Purves. For machine learning: "The Elements of
Statistical Learning" Trevor Hastie, Robert Tibshirani and Jerome Friedman.

~~~
jaf656s
fyi, The Elements of Statistical Learning can be found here as a free pdf

<http://www-stat.stanford.edu/~tibs/ElemStatLearn/>

~~~
joebottherobot
there's also an R package with the book's material:
[http://cran.r-project.org/web/packages/ElemStatLearn/index.h...](http://cran.r-project.org/web/packages/ElemStatLearn/index.html)

------
agbell
Video Lectures has a ton of DM and machine learning lectures up for free.

<http://videolectures.net/Top/Computer_Science/>

------
jimbokun
The Strang book is pretty much indispensable, I think.

[http://www.amazon.com/Linear-Algebra-Applications-Gilbert-
St...](http://www.amazon.com/Linear-Algebra-Applications-Gilbert-
Strang/dp/0030105676/ref=sr_1_1?ie=UTF8&s=books&qid=1258826887&sr=8-1)

The course lectures are online, too.

[http://ocw.mit.edu/OcwWeb/Mathematics/18-06Spring-2005/Video...](http://ocw.mit.edu/OcwWeb/Mathematics/18-06Spring-2005/VideoLectures/index.htm)

I've been taking courses related to Machine Learning, and never having taken a
Linear Algebra course, I find myself going to this text and watching these
lectures a lot to bring myself up to speed.

------
tokenadult
Statistics is a very interesting subject, and it is a distinct subject from
mathematics proper. Here (in what is becoming a FAQ post for HN) are two
favorite recommendations for free Web-based resources on what statistics is as
a discipline, both of which recommend good textbooks for follow-up study:

"Advice to Mathematics Teachers on Evaluating Introductory Statistics
Textbooks" by Robert W. Hayden

<http://statland.org/MyPapers/MAAFIXED.PDF>

"The Introductory Statistics Course: A Ptolemaic Curriculum?" by George W.
Cobb

[http://repositories.cdlib.org/cgi/viewcontent.cgi?article=10...](http://repositories.cdlib.org/cgi/viewcontent.cgi?article=1002&context=uclastat/cts/tise)

Both are excellent introductions to what statistics is as a discipline and how
it is related to, but distinct from, mathematics.

A very good list of statistics textbooks appears here:

[http://web.mac.com/mrmathman/MrMathMan/New_Teacher_Resources...](http://web.mac.com/mrmathman/MrMathMan/New_Teacher_Resources.html)

------
mrlebowski
For DM, try <http://cs345a.stanford.edu>, just browse through the slides they
are pretty good. Also, if you can find them, go for some video lectures on the
topic. [Try the MIT and Berkeley OCW sites.]

------
thumper
It's not directly related, but I just ran into an interesting article on data
analysis. It includes pointers to some good texts on topic, but also has some
tips which aren't covered in the standard texts.

"Three Things Statistics Textbooks Don't Tell You" by Seth Roberts:
[http://sethroberts.net/about/2005_Three_Things_Statistics_Te...](http://sethroberts.net/about/2005_Three_Things_Statistics_Textbooks_Don%27t_Tell_You%20_Dec_2005.pdf)

------
steveeq1
Programming Collective Intelligence also has high marks from a lot of
programmers whom I respect: [http://www.amazon.com/Programming-Collective-
Intelligence-Bu...](http://www.amazon.com/Programming-Collective-Intelligence-
Building-
Applications/dp/0596529325/ref=sr_1_1?ie=UTF8&s=books&qid=1258827924&sr=8-1)

~~~
gtani
the 2 books in Amazon's "Bought This Item Also Bought" blurb are more rigorous
and quite useful, also covering topics like Lucene/SOLR with full Java code
listings:

Algorithms of the Intelligent Web by H. Marmanis

Collective Intelligence in Action by Satnam Alag

------
durana
Stanford has a Machine Learning course available in iTunes U.

Direct link to course in iTunes:
[http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.s...](http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.stanford.edu.1615003397)

Stanford iTunes U site: <http://itunes.stanford.edu/>

------
caffeine
Please search HN for this topic. I've probably answered it 10 times here at
least, and others have answered it even more.

------
eas
Here is another free book, which teaches statistics through R:

Introduction to Statistical Thought
<http://www.math.umass.edu/~lavine/Book/book.html>

I can also recommend the Statistical Aspects of Data Mining lectures by David
Mease (on Google Video & stats202.com).

------
elq
"Machine Learning: An Algorithmic Perspective" by Stephen Marsland is very
good if you want to see the ideas in code and if you're not mathematically
ready for hastie or bishop.

<http://seat.massey.ac.nz/personal/s.r.marsland/MLBook.html>

------
agbell
I recommend starting with weka and this great book:
[http://www.amazon.com/Data-Mining-Practical-Techniques-
Manag...](http://www.amazon.com/Data-Mining-Practical-Techniques-
Management/dp/0120884070/ref=sr_1_3?ie=UTF8&s=books&qid=1258831010&sr=8-3)

------
keeneg
This site has a ton of references to books and papers, with some of them
seeming fairly introductory:

<http://www.helixpartners.com/references/>

------
gtt
If your math background is not strong enough, I'd recommend spending a few
weeks on algebra and calculus. It'll pay you back for sure.

To find research papers use scholar.google.com

------
eob
Thanks for these links, everyone

~~~
oscardelben
Thanks also from me. It'll take time to digest all the information but a
thread like this is very valuable to distill from hundreds of sources.

------
HilbertSpace
Early in my career, I asked that question. Eventually I got some answers.

Broadly usually the 'big bucks' are in things that are new at least in some
important sense. To find such things, one recommended approach is to stand "on
the shoulders of giants". Then you may still do some original work, but you
have one heck of a base. My view is that Moore's law, the Internet, etc. are
not nearly fully exploited, that enormous value remains, and the best way to
get some of this value is to find a big unsolved problem where a good solution
will be very valuable, stand on the shoulders of some giants, do some original
work to get some powerful 'secret sauce' for solving the problem (which
appears not to be easy to solve otherwise), start a business, please
customers, get revenue, grow the business and sell it.

Here, however, just what problem, giants, original work, secret sauce are NOT
easy to see. Broadly we are trying to see, no, MAKE, the future, and usually
that is not easy. While we can see what applications have been made of
statistics in the past, what statistics will be applied where in the future is
not easy to see.

Yes, broadly statistics is highly promising. Or, in 'information technology'
we want valuable, new information, are awash in cycles, bytes, bandwidth,
infrastructure software, and data, and want to use these to create the
information. Here, on appropriate problems, statistics can commonly, easily,
totally 'blow away' anything else. The marriage between (A) 'information
technology' in business and (B) statistics has yet to be consummated, even to
reach the 'going steady' stage, assuming the traditional order of events.

For 100 years or so, people have come to statistics from various areas of
work. Usually they had some data and some questions. Some of the areas have
been educational and psychological testing, experiments and testing in
agriculture and medicine, industrial quality control, model building in
economics, experimental work in physics and chemistry, investing, attempts to
create mathematically based sciences in the social sciences, especially
economics, psychology, sociology, and political science. In sociology, old
examples would be James Coleman, Pete Rossi (professors of my wife when she
got her Ph.D.), Leo Goodman.

Likely the best fast, practical path into statistics is via books, courses,
etc. intended for students in the social sciences. These students commonly do
not have good backgrounds in mathematics. For the mathematical prerequisites,
generally can get by, as a start, with just high school first year algebra.
With this path can learn about probability distributions, the central limit
theorem, the law of large numbers, statistical estimation and confidence
intervals, hypothesis testing, cross tabulation, analysis of variance,
regression analysis, principal components analysis, and more.
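As a taste of one item on that list, here is a minimal Python sketch (with toy numbers of my own, not from the thread) of a normal-approximation confidence interval for a mean, the kind of calculation such a course covers:

```python
import math
import random

# Toy example: a 95% confidence interval for a mean, using the normal
# approximation justified by the central limit theorem.
random.seed(0)
data = [random.gauss(100, 15) for _ in range(200)]  # hypothetical test scores

n = len(data)
mean = sum(data) / n
var = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance
se = math.sqrt(var / n)                             # standard error of the mean
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI for the mean: ({lo:.1f}, {hi:.1f})")
```

The 1.96 is the familiar two-sided 95% normal quantile; a real course would also treat the t-distribution correction for small samples.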

For more, statistics is a rock solid part of mathematics, as solid as any part
of pure mathematics, e.g., topology, geometry, analysis, algebra, etc. That
is, statistics is based solidly on theorems and proofs, sometimes relatively
deep ones.

Statistics as theorems and proofs is called 'mathematical statistics'. Long
standard has been (with TeX markup):

Alexander M.\ Mood, Franklin A.\ Graybill, and Duane C.\ Boes, {\it
Introduction to the Theory of Statistics, Third Edition,\/} McGraw-Hill, New
York.\ \

The main prerequisite for this book is just a not very good course in
calculus, and the book actually makes not much use of calculus. Mostly all a
student will need from calculus is how to find the area under a curve.
Since the book has long been standard, can't really ignore it, but it's ugly.
And often, with just calculus, the book doesn't really give solid proofs of
the results. E.g., their treatment of sufficient statistics has some nice
intuition, but their proof is junk. The subject cries out for a good book, but
I'm not trying to write one or waiting for someone else to.

Can get some of the flavor of mathematical statistics done with high quality,
as mathematics, in, say (with TeX markup):

Jean-Ren\'e Barra, {\it Mathematical Basis of Statistics,\/} ISBN
0-12-079240-0, Academic Press, New York.\ \

Robert J.\ Serfling, {\it Approximation Theorems of Mathematical
Statistics,\/} ISBN 0-471-02403-1, John Wiley and Sons, New York.\ \

P.\ Billingsley, {\it Convergence of Probability Measures,
2\raise0.5ex\hbox{ed},\/} ISBN: 0-471-19745-9, John Wiley, New York.\ \

R.\ S.\ Liptser and A.\ N.\ Shiryayev, {\it Statistics of Random Processes I,
II,} ISBN 0-387-90226-0, Springer-Verlag, New York.\ \

However, pursued mathematically, statistics has some relatively advanced
prerequisites some of which curiously are not popular in US university
mathematics departments.

For the prerequisites,

High School. Should have had high school first and second year algebra
(reasonable facility with algebraic manipulations, the binomial theorem,
complex numbers, both of which will see again in important ways), plane
geometry (where nearly all the work was proofs -- first place to learn about
proofs), trigonometry (usually assumed in calculus and important in, say,
analysis of organ tone harmonics and, thus, the most important example of an
infinite dimensional Hilbert space), analytic geometry (especially the conic
sections, especially ellipses which definitely will see again), and, if can,
solid geometry (for more intuition in three dimensions).

College. Need a standard calculus course, not necessarily a very comprehensive
or difficult one because will do the subject all over again, maybe two or
three times, and more later, WITH the proofs!

Then need linear algebra, that is, how to work with data of several
dimensions, which is just crucial. The big result is the polar decomposition,
and there get to think about ellipses and get to use complex numbers. Also the
course is an introduction to functional analysis and Hilbert space. Use any
popular book to get started but in the end cover the classic, Halmos, 'Finite
Dimensional Vector Spaces'. Halmos wrote this when he was an assistant to von
Neumann and intended it to be a finite dimensional introduction to Hilbert
space (which once von Neumann had to explain to Hilbert) which it is. It also
has some multi-linear algebra, of interest to exterior algebra now popular in
relativity, but likely for nearly any business applications of statistics for
the next several decades can skip that chapter.
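The polar decomposition mentioned above can be sketched numerically from the SVD; a small NumPy illustration (my addition, not part of the original comment):

```python
import numpy as np

# Sketch: the polar decomposition A = U P of a square matrix, with U
# orthogonal and P symmetric positive semidefinite, obtained from the
# SVD A = W diag(s) V^T via U = W V^T and P = V diag(s) V^T.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

W, s, Vt = np.linalg.svd(A)
U = W @ Vt                      # orthogonal factor (the "rotation")
P = Vt.T @ np.diag(s) @ Vt      # symmetric PSD factor (the "stretch")

assert np.allclose(U @ P, A)            # A = U P
assert np.allclose(U.T @ U, np.eye(3))  # U is orthogonal
```

The eigenvalues of P are the singular values of A, which is where the ellipses the comment mentions come from: the unit sphere maps to an ellipsoid whose axes are those singular values.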

Then need some advanced calculus. That is a poorly organized, huge, catch-all
subject beyond any one course. The usual start is 'baby' Rudin, 'Principles of
Mathematical Analysis'. So, that's calculus with the proofs. Warning: The book
is severe, succinct, with zero pictures. Have to draw your own pictures in
your head. The book is packed solidly with powerful material just awash in
important applications from statistics, economics, and engineering to physics,
but there is hardly a hint of the applications in the book. I enjoyed the
book, but few people will enjoy it or even get through it. Hint: Get a really
good teacher! Then for more, popular is Spivak, 'Calculus on Manifolds',
mostly because it is short. Actually, it's too short. I prefer Fleming,
'Functions of Several Variables' until get to the exterior algebra chapter at
which time, if care, can now get the thin

Henri Cartan, {\it Differential Forms,\/} ISBN 0-486-45010-4, Dover, Mineola,
NY, 2006.\ \

in English.

Between Halmos, baby Rudin, and Spivak, you will have covered Harvard's Math
55 with a colorful description at

[http://www.american.com/archive/2008/march-april-magazine-
co...](http://www.american.com/archive/2008/march-april-magazine-contents/why-
can2019t-a-woman-be-more-like-a-man/?searchterm=Sommers)

Harvard tries to cover these three for freshmen, but in most math departments
the material will take you through all or nearly all of a focused
undergraduate pure math major.

If somewhere take a course in abstract algebra, e.g., with a little group
theory, then that might help!

Graduate School. Might learn a little more about topology, say, from Simmons,
'Introduction to Topology and Modern Analysis'. So, get good with metric
spaces and get started on duality.

The next big step is a course in measure theory and functional analysis. The
Simmons work will help. Baby Rudin will be crucial; Halmos is recommended. So,
with measure theory, do calculus over again and in a very different and much
more powerful way and a way just crucial, even central, for mathematical
approaches to statistics. The functional analysis will concentrate on
representation theorems, the Radon-Nikodym theorem, and Hilbert and Banach
spaces. Long popular, from Stanford and sometimes aimed at statistics
students, is Royden, 'Real Analysis'. It's gorgeous. Should also read the real
half of Rudin, 'Real and Complex Analysis'; it's a few steps up in difficulty
from baby Rudin. Again, hint: Get a good course from a good teacher who can
get you over the material without getting stuck. Then go back and study the
material 2-3 more times, apply it, do some original research in it, and
finally begin to understand it.

We're talking high, top, center crown jewels of civilization here; the stuff
is of just awesome power; my view is that it is, for the rest of this century,
one of the main pillars of increases in economic productivity via the
exploitation of Moore's law; on 'what to program', computer science is stuck
and this material is the most promising way forward; as a famous restaurant
owner once said about some Morey St. Denis, "you won't find better".

Now are ready for probability. I recommend:

Leo Breiman, {\it Probability,\/} ISBN 0-89871-296-3, SIAM, Philadelphia.\ \

M.\ Lo\`eve, {\it Probability Theory, I and II, 4th Edition,\/} Springer-
Verlag, New York.\ \

Kai Lai Chung, {\it A Course in Probability Theory, Second Edition,\/} ISBN
0-12-174650-X, Academic Press, New York.\ \

Jacques Neveu, {\it Mathematical Foundations of the Calculus of
Probability,\/} Holden-Day, San Francisco.\ \

Neveu is succinct, gorgeous, but not easy. This material is NOT popular in US
departments of mathematics. At Princeton, see Cinlar.

Then should make some progress with stochastic processes: The big book is
Gihman and Skorohod, right, in three volumes, but mostly people settle for
shorter treatments. Whatever, should learn about Poisson processes, Markov
processes (discrete time, finite state space is enough to get started),
Brownian motion, and martingales. Might also learn about second order
stationary processes. A good course in stochastic processes is NOT easy to
find, especially in mathematics departments.

Now are ready to attack statistics mathematically! I don't know of a good,
single 'mathematical statistics' book at this level. Instead, there are many
books -- I gave some above -- and then the journals. Thankfully, the field is
relatively close to applications; so can take a practical problem and
concentrate on what is relevant to it. One of my papers was some new work, at
this level, in mathematical statistics for a problem in practical computing
and computer science. The computer science community struggled terribly with
the mathematics. So, it was some progress in computer science that community
will have to struggle to understand.

One approach to work in computing is just to try things, that is, just to
throw things against the wall and see if they appear to stick. Or, maybe the
truth of the situation really is a simple statistical model. Likely that model
will fit the data well. So, try many simple statistical models. We use these
models mostly ignoring the mathematical assumptions; mostly we are proceeding
'heuristically', that is, with guessing. If any of the models fit well, then
they can be considered candidates for the truth. So, are throwing things
against the wall to see if they fit. This approach also called 'data mining'.

Problems:

(1) Will be quite limited in what statistical models can use. That is, will be
drawing from a cookbook instead of being a real chef who can create good, new
dishes appropriate for the available ingredients and customers!

(2) Don't have much, e.g., have not proceeded mathematically where from the
deductive logic of assumptions and proofs actually know in advance some good
things about the results. Something like breaking into a pharmacy, mixing up a
lot of pills, taking them, and seeing if feel better! Uh, I'll pass and let
you do that without me!

(3) May have gone through a lot of computer time in an 'exponential,
combinatorial explosion' of efforts throwing against the wall.

(4) Have ignored a LOT in statistics that can add to what we know about the
results.

(5) Will be tempted to conclude have found 'causality' but will likely not
have.

(6) Will be tempted to conclude that have a model that predicts, but that is
on shaky ground and risky and needs more work.

Applied to important problems, this approach can be dangerous.

There are not many healthy statistics departments. Much of the career interest
is in biostatistics, especially related to FDA rules.

It appears that among the top statistics departments are Berkeley and UNC.
Since Breiman and Brillinger are at Berkeley and since Stanford, long good in
statistics, is not far away, if I were looking for a Ph.D. in statistics then
I'd pick Berkeley.

There is a general problem getting a 'job' in a technical field and likely
also with statistics. The assumption in US business is still as in factories
150 years ago: The supervisor knows more than the subordinate; the subordinate
is supposed just to add common labor to what the supervisor says. In
particular job descriptions are written by the supervisors, not the
subordinates!

Well, there are nearly no supervisors in US business who have even as much as
a weak little hollow hint of a tiny clue about the material described here.
So, won't need that material to qualify for the job descriptions. Moreover, if
actually know such material and let that fact leak out, then will likely not
make it past the first level HR English major phone screen person who will
tremble and conclude that you are not like the employees they have! If you do
get hired and someone in your management chain discovers that you have used
some mathematics they don't understand, you might be on the way out the door,
especially if your work was valuable for the company!

Of course, the solution is to find a valuable application and start your own
business. While maybe biomedical venture capital can understand crucial, core
technical content, in information technology venture capital, likely you will
be trying to explain this stuff to, say, history majors who worked in fund
raising, marketing, general management, or financial analysis or have a
background in just relatively elementary parts of computing. Just will NOT
find more than six people, maybe not more than zero people, in US venture
capital who can work the exercises in Royden or explain the strong law of
large numbers. Sorry 'bout that! So, if you explain that the value of your
venture is the powerful material in your 'secret sauce', then you will be
regarded as a kook, far outside the mainstream of venture funded
entrepreneurs, discarded, maybe even laughed at. As it is, some of the venture
people are making money now, and the rest just want to be more like the ones
who are making money. Looking for anything really new, powerful, and valuable
is just NOT in the picture.

So, once you have some results in users, customers, revenue, etc., then maybe
you can get some venture funding; just why at that point, owning 100% of the
business, you would take venture funding, a Board that can fire you, etc. is
less clear! Or venture funding is not for everyone! Or venture firms prefer to
give money to people who don't need it!

For the real power of the 'secret sauce', you have just to keep that a secret!

Once mathematicians have yachts, at the venture firms math will be to info
tech like biochem is to biotech. In the meanwhile note that a valuable
application of statistics can put you on the Forbes 400 where there are not
many people! Generally if you are making a valuable application of advanced or
new statistics, then you will not know many people who understand what you are
doing. Or, if lots of people understood it, then it wouldn't be valuable!

~~~
ced
_statistics is a rock solid part of mathematics, as solid as any part of pure
mathematics_

The impression left by reading Jaynes (The Logic of Science) was that a huge
part of conventional statistics was a hodge-podge of ad hoc methods. The One
True Way out of the mess being, of course, Bayesian statistics. What's your
take? Do most of the books you advocate follow Bayes?

Also, thanks for the great post. I'd love to know where these statistics have
taken _you_.

~~~
HilbertSpace
The materials I listed never or nearly never mention 'Bayesian statistics',
'subjective probabilities', or 'prior probabilities'.

For what has been done in statistics over the past 100 years or so, each
research library has a large section of books and journals. Here my interest
was to respond to the question about how to get started in statistics and to
outline a future for statistics, especially for exploiting Moore's law for
more in economic productivity.

For what I have done in statistics, my interests are in business, and there a
good application mostly means starting a new business. I'm doing that, but I'm
not supposed to describe the 'secret sauce' in public.

I can give an introduction that might be of some interest in computer science
and practical computing.

Given a 9th grade math teacher and one class with 20 boys and another class
with 18 girls, do the boys and girls do the same or is there a difference? Uh,
maybe as in the URL I gave, some 'feminists' will be very picky about any
claims of a difference!

Broadly what we do is make a 'hypothesis' that gives us enough in mathematical
assumptions to do some probability calculations. This hypothesis is called the
'null' hypothesis apparently because we intend to reject it, that is, find it
to be 'null'. Our intention is to conclude that the hypothesis leads to
something of very low probability, so low we reject the hypothesis. Then we
know something that appears to be false. Yes, this is less good and complete
knowledge than we could want, but maybe this result is good considering how
little in data and assumptions we used!

So, put all 20 + 18 scores in a pot, stir the pot, pull out 18 scores and
average, average the remaining 20, take the difference in the averages, do
this maybe 1 million times (thanking Moore's law), get the empirical
distribution of the differences, pick a small number, say, 1%, for the really
angry feminists, 0.1%, pick the region in the 'tails' with this fraction of
the differences, get the difference for the real data before stirring, and see
where it is. If that real difference is in the region, then have some bad news
for the angry ones: Either boys and girls are not the same or they are the
same and we have observed something rare, too rare to be believed. Amazing
that anyone would suspect that by the 9th grade boys and girls were not the
same!
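The pot-stirring recipe above can be sketched in a few lines of Python (the scores are invented for illustration, not real data):

```python
import random
from statistics import mean

# A sketch of the pot-stirring (permutation) test described above,
# with made-up scores.
random.seed(0)
boys  = [72, 85, 68, 90, 77, 81, 66, 74, 88, 79,
         70, 83, 76, 69, 91, 78, 73, 80, 67, 84]   # 20 boys' scores
girls = [75, 82, 71, 89, 80, 77, 85, 73, 90, 78,
         76, 83, 70, 87, 79, 81, 74, 86]           # 18 girls' scores

observed = abs(mean(girls) - mean(boys))

pot = boys + girls                  # put all 20 + 18 scores in the pot
extreme = 0
trials = 10_000                     # stand-in for all C(38,18) splits
for _ in range(trials):
    random.shuffle(pot)             # stir the pot
    diff = abs(mean(pot[:18]) - mean(pot[18:]))
    if diff >= observed:
        extreme += 1

p_value = extreme / trials          # fraction at least as extreme
print(f"observed difference {observed:.2f}, p ≈ {p_value:.2f}")
```

If `p_value` falls below the chosen 1% (or 0.1%), reject the null hypothesis that the labels don't matter.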

Now, make some mathematics out of this!

When the hypothesis is true, we reject it, and make an error, 1% (or 0.1% or
whatever) of the time. So, we have rejected the null hypothesis when it is
true; this is called Type I error. Yup, the other possible error is to accept
the null hypothesis when it is false, and that is Type II error. Semi-,
pseudo-, quasi-amazing.

Hmm .... Here we have an introduction to distribution-free, that is, 'non-
parametric', hypothesis testing based on ranks or permutations.

The stirring of the pot is called 're-sampling'. Actually, when we do the
mathematics, we will likely want all the combinations of 38 things taken 18 at
a time, and that might be 33,578,000,610. So, instead of straining Moore's law
with all 33 billion, we just 'sample', in this case, 're-sample'.

So, we see that we get to select the probability of Type I error, that is, the
1%, in advance and get what we select. Progress.

Now suppose the class of 20 boys also takes English, general science, and
history. Similarly for the class of 18 girls. So, now on each student we have
4 scores instead of just 1. Now how to do the test? Hmm ...!

Or, suppose we are given a server farm and a network. We select a 'system' we
want to monitor in real-time for health and wellness. Suppose that this system
can report data on each of 12 relevant variables 100 times a second.

Our null hypothesis is that the system is healthy. Then, an instance of Type I
error is a 'false alarm'. Suppose we want the false alarm rate to be, say, 1 a
month.

Then an instance of Type II error is a missed detection of a real problem.

Then for tolerating that rate of false alarms, we want the lowest rate of
missed detections we can get.

So, how do we construct our monitoring system?

We would like to use the classic Neyman-Pearson result. Here, however, we are
asked for complete information on when our system is 'sick', and likely we
don't have that.

Still we can select our rate of false alarms and do something smart with the
12 variables on problems never seen before, i.e., 'zero-day' problems.

So, we have obtained some automation of system monitoring with adjustable,
known false alarm rate and, if we look a little, with some nice guarantees on
detection rate. Progress in 'computer science'!
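One crude way to realize the adjustable false alarm rate described above is an empirical quantile threshold. A toy sketch (my own, using a stand-in Gaussian statistic, not the poster's 'secret sauce'):

```python
import random

# Record a detection statistic during known-healthy operation, then set
# the alarm threshold at a high empirical quantile so the false alarm
# rate matches a chosen target.
random.seed(0)
healthy = [random.gauss(0, 1) for _ in range(100_000)]  # stand-in statistic

target_false_alarm_rate = 1e-4      # e.g. rare alarms per observation
healthy.sort()
k = int(len(healthy) * (1 - target_false_alarm_rate))
threshold = healthy[k]

def alarm(x: float) -> bool:
    """During monitoring, values above the threshold raise an alarm."""
    return x > threshold

false_alarms = sum(alarm(x) for x in healthy)
print(f"threshold {threshold:.2f}, false alarms {false_alarms} of {len(healthy)}")
```

By construction the false alarm rate on the healthy data is close to the target, with no assumption about what 'sick' looks like, which is the point of the zero-day setting.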

~~~
ced
First, thanks for the long answer.

 _So, put all 20 + 18 scores in a pot, stir the pot, pull out 18 scores and
average, average the remaining 20, take the difference in the averages, do
this maybe 1 million times (thanking Moore's law)_

I can't believe there isn't a well-defined analytical expression for that...
What you described sounds like the kind of inference I would tackle through
Bayesian hypothesis testing, while you use... Monte Carlo?

I suspect I'm missing something. Anyway, it sounds like a very interesting
problem and if I were around, I'd ask for an interview. Good luck.

~~~
HilbertSpace
"I can't believe there isn't a well-defined analytical expression for that..."

Well, yes, there is some solid mathematics behind the pot stirring!

Writing out the appropriate mathematics, we will likely want all 33 billion
combinations of 38 things taken 18 at a time.

With this math, there is no use of 'prior probabilities', and the Monte Carlo
is just a fast way to replace finding all 33 billion combinations.
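As a quick check on that count, Python's standard library gives the number of combinations directly:

```python
import math

# Number of ways to choose which 18 of the 38 pooled scores count as 'girls'.
print(math.comb(38, 18))  # 33,578,000,610
```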

For more, see (with TeX markup):

E.\ L.\ Lehmann, {\it Nonparametrics: Statistical Methods Based on Ranks,\/}
ISBN 0-8162-4994-6, Holden-Day, San Francisco, 1975.\ \

Jaroslav H\'ajek and Zbyn\v ek \v Sid\'ak, {\it Theory of Rank Tests,\/}
Academia, Prague, 1967.\ \

Sidney Siegel, {\it Nonparametric Statistics for the Behavioral Sciences,\/}
McGraw-Hill, New York, 1956.\ \

So, it's old material. There are many such hypothesis tests.

But the old material essentially always has, for the student case, only one
number on each student. The part about what to do when each student has 4
scores can take us into the journals and maybe start some more research.
Similarly for the 12 numbers from the 'system' to be monitored.

~~~
ced
_With this math, there is no use of 'prior probabilities'_

But the full hypothesis is "Given the data, are girls better than boys at this
exam?" and clearly, the prior probability is relevant. Maybe in this case one
might want to use a 50-50% prior, but in general, if the hypothesis was
instead "Given the [same] data, can we conclude that this paranormal event
really happened?" then a healthy skeptical prior would be in order.

Anyway, regardless of the "prior" issue, I've thought some more about your
original problem, and I'm not so sure about your methodology. From my
perspective, if you want to reach a "girls better than boys on this test in
this class - true or false" conclusion, then individual variance is a crucial
issue. Assuming that all girls and boys would _always_ get the exact same
result were they to take the same test over and over again, then you have a
variance of 0. And thus, one could simply check if (average of boys) <
(average of girls) and conclude accordingly. At the other extreme, if students
show huge individual variance (e.g., their score depends on whether they had
breakfast that morning), then the test results are almost meaningless... So
the outcome is crucially dependent on this variance, which your problem
description makes no mention of, and which Monte Carlo methods really do
nothing to recover. One would have to make assumptions about it.

Maybe a better example (closer to your other "system health" problem) would
be: a boy takes an exam 20 times, a girl takes the exam 20 times, and they get
such and such results (assuming they don't improve in between). Is the girl
better than the boy? Then one could assume a Gaussian distribution of test
results for both, estimate their averages and variances, then check for
overlap between the two Gaussians and conclude accordingly.
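That comparison can be sketched with the standard library; the repeated scores below are invented, and the overlap check is written here as Welch's two-sample t statistic, which is one common frequentist stand-in for "how much do the two estimated Gaussians overlap":

```python
import math
import statistics

# Hypothetical repeated scores for one boy and one girl (made-up numbers).
boy_scores = [70, 73, 68, 75, 71, 69, 74, 72, 70, 76,
              68, 73, 71, 75, 69, 72, 74, 70, 73, 71]
girl_scores = [74, 78, 72, 80, 75, 73, 79, 76, 74, 81,
               72, 77, 75, 79, 73, 76, 78, 74, 77, 75]

mb, mg = statistics.mean(boy_scores), statistics.mean(girl_scores)
vb, vg = statistics.variance(boy_scores), statistics.variance(girl_scores)
n = len(boy_scores)

# Difference in estimated means, scaled by the estimated standard error;
# a large |t| means the two fitted Gaussians barely overlap.
t = (mg - mb) / math.sqrt(vb / n + vg / n)
print(round(t, 2))
```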

Maybe your MC really does boil down to something similar, but I don't see it.
And sadly, I can't quite construct an argument about what I think is
problematic with it. It just doesn't feel right.

Thanks for the references, I might check them eventually, though they seem too
specialized for my needs. Someone recently posted this book on HN, and people
had good things to say about it:

The Elements of Statistical Learning - Data Mining, Inference, and Prediction

<http://www-stat.stanford.edu/~tibs/ElemStatLearn//>

It's next on my reading list, but was absent from yours. It's available as a
PDF --- the determining factor for someone with no library access.

~~~
HilbertSpace
"But the full hypothesis is 'Given the data, are girls better than boys at
this exam?' and clearly, the prior probability is relevant."

No, prior probabilities have nothing to do with it.

We state our 'null' hypothesis that the boys and girls do equally well. This
hypothesis has nothing to do with a belief of prior probabilities or belief of
any probabilities at all. Instead, we state this hypothesis as something that
will give us some mathematical assumptions to do some calculations to reject
it and, then, conclude that it was false.

Generally in hypothesis testing we don't believe the null hypothesis as prior
probabilities; indeed, likely we don't believe it at all and are stating it to
reject it and conclude it is false.

In more detail, we assume that 20 boys and 18 girls are 38 independent samples
from some one distribution. It turns out, we don't need to say anything about
that distribution because we are being 'distribution-free'. In particular, we
get to ignore the Gaussian distribution. GOOD.

Independent? Okay: Suppose we DO give you the true distribution of the data
and the first 37 scores. Now you get to guess score 38. Do the 37 scores help
you beyond just the distribution? No. Same for any subset of the scores. Then,
we have independence.

With this null hypothesis, the average of the scores of the 20 boys and the
average of the scores of the 18 girls should be 'close'. How close? Well,
under the null hypothesis and with the values we observed, we have a way to
proceed: we can find the distribution of the difference in the averages, with
everything we do know given and fixed. For this distribution, basically we
look at all the 33 billion or so differences obtained by taking all
combinations of 38 things taken 18 at a time. Justification? If we work at it
mathematically, then under the null hypothesis we can show that each of those
33 billion cases is equally probable.

Then we pick a small number, say, 1% for the size of our Type I error, that
is, the probability of rejecting the null hypothesis when it is true.

Then we find the differences in the 1% tail of the 33 billion differences.

Then we look at the difference from our actual data. That difference will be
one of the 33 billion. We see if that difference is in the 1% tail.
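On a toy version of the problem the exact enumeration is small enough to write out in full; the scores below are made up, with 3 boys and 2 girls so there are only C(5, 2) = 10 splits:

```python
from itertools import combinations

# Toy data: 3 boys and 2 girls (made-up, distinct scores), so only
# C(5, 2) = 10 equally probable splits under the null hypothesis.
scores = [60, 65, 70, 80, 85]                          # the pooled pot
observed = sum([80, 85]) / 2 - sum([60, 65, 70]) / 3   # the actual split

diffs = []
for girls in combinations(scores, 2):
    boys = [s for s in scores if s not in girls]  # safe here: scores distinct
    diffs.append(sum(girls) / 2 - sum(boys) / 3)

# Fraction of splits with a difference at least as large as observed,
# i.e., how far out in the tail the actual data sit.
p = sum(d >= observed for d in diffs) / len(diffs)
print(sorted(diffs), p)
```

With only 10 splits, the smallest attainable p-value is 1/10, which is one reason the real problem needs all 38 scores before a 1% level is even reachable.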

If the difference is in the 1% tail, then one of two things is true:

(A) The null hypothesis is true, the boys and girls are the same, that is,
independent samples from the same distribution, and with our actual data the
difference is relatively large, out in a tail, and we have observed something
that happens only 1% of the time.

(B) The null hypothesis is false, that is, in some way the boys and girls are
different. That is, we still believe the independence assumption, so we
conclude just that the mean for the boys is different from the mean for the
girls.

If the 1% is so small we don't believe (A), then we conclude (B).

Variance has nothing to do with it.

Welcome to distribution-free 'two sample' hypothesis testing 101.

~~~
ced
I've been reading Jaynes again this week, and he's just very, very convincing.
And so I'm trying to read everything you wrote through these Bayesian glasses,
but sadly, I'm not successful. Jaynes is rather critical of Fisher's
hypothesis testing, on the grounds that you can't accept or reject a
hypothesis on its own; you need an alternative to compare it to, and that
alternative needs to make definite predictions. I don't see what the
alternative to your null hypothesis is (the negation of the null hypothesis
does not make definite predictions).

