Hacker News
Ask HN: I want to learn statistics and data mining
105 points by oscardelben on Nov 21, 2009 | 37 comments
I want to learn statistics and data mining. I can't afford to buy many books right now, so I'm asking for one or two recommendations on the topic, and also one or two available online, since ordering through Amazon would take a month for shipping.

I'd like to get your opinions before spending time on random books. Thanks a lot.

P.S. My math background is not very strong but I'm willing to learn.

I'll make a non-mathematical observation about data-mining. Real world data sets are messy. Humans are just not the most reliable data entry machines. Data sets are not always gathered for the purpose you intend to use them. You have to validate/scrub/realign your data set before wasting time on an analysis that could be rendered meaningless by these issues.

It's the old "Garbage In, Garbage Out" principle, but it's easy to forget until the real world hammers it into you.
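The validate/scrub step can be sketched in a few lines of Python. (The field names and validation rules below are invented for illustration; real data sets need rules specific to how they were gathered.)

```python
# A sketch of the scrub step; the field names and rules are made up.
rows = [
    {"age": 34, "income": 52000},
    {"age": 34, "income": 52000},   # duplicate entry
    {"age": -3, "income": 41000},   # impossible age, likely a typo
    {"age": 51, "income": None},    # missing value
]

def scrub(rows):
    seen, clean, rejected = set(), [], []
    for r in rows:
        key = (r["age"], r["income"])
        if key in seen:
            rejected.append((r, "duplicate"))
        elif r["age"] is None or not 0 <= r["age"] <= 120:
            rejected.append((r, "age out of range"))
        elif r["income"] is None:
            rejected.append((r, "missing income"))
        else:
            seen.add(key)
            clean.append(r)
    return clean, rejected

clean, rejected = scrub(rows)
# here only the first row survives; the other three are rejected
```

The point is not these particular rules but that the rejected pile gets looked at before any analysis starts.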

If your math background is weak enough that you don't feel totally at home using calculus and linear algebra, I'd recommend "Principles of Data Mining", which they use for the intro to ML class at Princeton. The ratio of English to equations is pretty high, which might help you to get into the swing of things.

If you think that you can learn enough math to follow the material, you should watch the Stanford lectures given by Andrew Ng. They're available freely online through iTunes U. Andrew is also one of the most respected experts in the field. As he points out, you will eventually need to understand the math behind the methods. Otherwise, a day will come when you start wasting time coding algorithms that plainly don't apply to the problem at hand.

This is probably not what you're looking for, but while trying to figure out if you could still download a free copy of Drake's out-of-print Fundamentals of Applied Probability Theory...

[HINT: The answer is yes: http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Compute...]

...I found this potentially interesting page full of references:


I post it here so that I can find it again later, but I fear that this is not the answer to your question. These appear to be more probability theory than nuts-and-bolts statistical analysis [1], and they may require more math background than I have (and I managed to handwave my way to a solid state physics Ph.D.).

Try browsing around the MIT OpenCourseWare site. One of the great things about online courses, even if you don't take the courses, is that they often publish syllabuses that contain a lot of good textbook references.

[EDIT: Ooh, look, if you follow links from the page I mentioned earlier you get here!

http://www.cscs.umich.edu/~crshalizi/notebooks/teaching-stat... ]


[1] Not that I can reliably tell the difference. I know just enough about probability and statistics to know that I should probably study them some more. ;)

If you are starting from scratch without a very strong math background, I'd recommend:

1.) Head First Statistics -- Pretty good, but beware the section on Bayes' Theorem, which is a bit off. This is a quick and casual intro, but worthwhile. I used it to refresh my memory of my college stats course (which was a long time ago), and I like it. There's also Head First Data Analysis, which I haven't read but which could be a reasonable companion. HF Data Analysis uses Excel and R.

2.) Using R for Introductory Statistics (Verzani) -- Good explanations and exercises, and you will also learn R. This second point is actually pretty important, because R is a very valuable tool. Whereas the Head First Stats book walks through pretty simple problems that you work out in pencil, the Verzani book has many real-world data sets to explore that would be impractical to work by hand. That said, I think it's valuable to work things out in pencil with the first book before you move on to this one.

After these books, Elements of Statistical Learning seems to be the current favorite.

Unfortunately my favorites are a bit on the mathy side, so you may want to wait for other commenters with better advice. But there are two books I really feel should not be missed. For statistics: "Statistics, Third Edition" by David Freedman, Robert Pisani, and Roger Purves. For machine learning: "The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

FYI, The Elements of Statistical Learning can be found here as a free PDF


there's also an R package with the book's material: http://cran.r-project.org/web/packages/ElemStatLearn/index.h...

The Berkeley webcast*, Introduction to Statistics, uses "Statistics, Fourth Edition" by Freedman. I don't know what the differences are, but I think they're mostly in the exercises.

* http://webcast.berkeley.edu/course_details_new.php?seriesid=...

I messed up. Statistics, 4th Edition is the current edition (the one to get); there is no content-related reason to deliberately seek out the older edition.

Make that 4th edition (sorry about the mistake).

Video Lectures has a ton of DM and machine learning lectures up for free.


The Strang book is pretty much indispensable, I think.


The course lectures are online, too.


I've been taking courses related to Machine Learning, and never having taken a Linear Algebra course, I find myself going to this text and watching these lectures a lot to bring myself up to speed.

Statistics is a very interesting subject, and it is a distinct subject from mathematics proper. Here (in what is becoming a FAQ post for HN) are two favorite recommendations for free Web-based resources on what statistics is as a discipline, both of which recommend good textbooks for follow-up study:

"Advice to Mathematics Teachers on Evaluating Introductory Statistics Textbooks" by Robert W. Hayden


"The Introductory Statistics Course: A Ptolemaic Curriculum?" by George W. Cobb


Both are excellent introductions to what statistics is as a discipline and how it is related to, but distinct from, mathematics.

A very good list of statistics textbooks appears here:


For DM, try http://cs345a.stanford.edu; just browse through the slides, they are pretty good. Also, if you can find them, go for some video lectures on the topic. [Try the MIT and Berkeley OCW sites.]

It's not directly related, but I just ran into an interesting article on data analysis. It includes pointers to some good texts on the topic, but also has some tips which aren't covered in the standard texts.

"Three Things Statistics Textbooks Don't Tell You" by Seth Roberts: http://sethroberts.net/about/2005_Three_Things_Statistics_Te...

Programming Collective Intelligence also has high marks from a lot of programmers whom I respect: http://www.amazon.com/Programming-Collective-Intelligence-Bu...

The 2 books in Amazon's "Bought This Item Also Bought" blurb are more rigorous, and quite useful books, also covering topics like Lucene/Solr with full Java code listings:

Algorithms of the Intelligent Web by H. Marmanis

Collective Intelligence in Action by Satnam Alag

I agree, this was the eye-opener for me. When it comes to data mining, I'll take practice over theory any day.

It dives directly into actual code samples, with a large plus for using Python; this book is one of the few I actually keep on my desk.

Stanford has a Machine Learning course available in iTunes U.

Direct link to course in iTunes: http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.s...

Stanford iTunes U site: http://itunes.stanford.edu/

Please search HN for this topic. I've probably answered it 10 times here at least, and others have answered it even more.

Here is another free book, which teaches statistics through R:

Introduction to Statistical Thought http://www.math.umass.edu/~lavine/Book/book.html

I can also recommend the Statistical Aspects of Data Mining lectures by David Mease (on Google Video & stats202.com).

"Machine Learning: An Algorithmic Perspective" by Stephen Marsland is very good if you want to see the ideas in code and if you're not mathematically ready for Hastie or Bishop.


I recommend starting with weka and this great book: http://www.amazon.com/Data-Mining-Practical-Techniques-Manag...

This site has a ton of references to books and papers, with some of them seeming fairly introductory:


If your math background is not strong enough, I'd recommend spending a few weeks on algebra and calculus. It'll pay you back for sure.

To find research papers use scholar.google.com

Thanks for these links, everyone.

Thanks also from me. It'll take time to digest all the information, but a thread like this is very valuable for distilling from hundreds of sources.

Early in my career, I asked that question. Eventually I got some answers.

Broadly, the 'big bucks' are usually in things that are new, at least in some important sense. To find such things, one recommended approach is to stand "on the shoulders of giants". Then you may still do some original work, but you have one heck of a base. My view is that Moore's law, the Internet, etc. are not nearly fully exploited, that enormous value remains, and the best way to get some of this value is to find a big unsolved problem where a good solution will be very valuable, stand on the shoulders of some giants, do some original work to get some powerful 'secret sauce' for solving the problem (which appears not to be easy to solve otherwise), start a business, please customers, get revenue, grow the business, and sell it.

Here, however, just what problem, giants, original work, secret sauce are NOT easy to see. Broadly we are trying to see, no, MAKE, the future, and usually that is not easy. While we can see what applications have been made of statistics in the past, what statistics will be applied where in the future is not easy to see.

Yes, broadly statistics is highly promising. Or, in 'information technology' we want valuable, new information, are awash in cycles, bytes, bandwidth, infrastructure software, and data, and want to use these to create the information. Here, on appropriate problems, statistics can commonly, easily, totally 'blow away' anything else. The marriage between (A) 'information technology' in business and (B) statistics has yet to be consummated, or even to reach the 'going steady' stage, assuming the traditional order of events.

For 100 years or so, people have come to statistics from various areas of work. Usually they had some data and some questions. Some of the areas have been educational and psychological testing, experiments and testing in agriculture and medicine, industrial quality control, model building in economics, experimental work in physics and chemistry, investing, attempts to create mathematically based sciences in the social sciences, especially economics, psychology, sociology, and political science. In sociology, old examples would be James Coleman, Pete Rossi (professors of my wife when she got her Ph.D.), Leo Goodman.

Likely the best fast, practical path into statistics is via books, courses, etc. intended for students in the social sciences. These students commonly do not have good backgrounds in mathematics. For the mathematical prerequisites, generally can get by, as a start, with just high school first year algebra. With this path can learn about probability distributions, the central limit theorem, the law of large numbers, statistical estimation and confidence intervals, hypothesis testing, cross tabulation, analysis of variance, regression analysis, principal components analysis, and more.

For more, statistics is a rock solid part of mathematics, as solid as any part of pure mathematics, e.g., topology, geometry, analysis, algebra, etc. That is, statistics is based solidly on theorems and proofs, sometimes relatively deep ones.

Statistics as theorems and proofs is called 'mathematical statistics'. Long standard has been (with TeX markup):

Alexander M.\ Mood, Franklin A.\ Graybill, and Duane C.\ Boes, {\it Introduction to the Theory of Statistics, Third Edition,\/} McGraw-Hill, New York.\ \

The main prerequisite for this book is just a not very good course in calculus, and the book actually makes not much use of calculus. Mostly all a student will need from calculus is the fact that it can find the area under a curve. Since the book has long been standard, can't really ignore it, but it's ugly. And often, with just calculus, the book doesn't really give solid proofs of the results. E.g., their treatment of sufficient statistics has some nice intuition, but their proof is junk. The subject cries out for a good book, but I'm not trying to write one or waiting for someone else to.

Can get some of the flavor of mathematical statistics done with high quality, as mathematics, in, say (with TeX markup):

Jean-Ren\'e Barra, {\it Mathematical Basis of Statistics,\/} ISBN 0-12-079240-0, Academic Press, New York.\ \

Robert J.\ Serfling, {\it Approximation Theorems of Mathematical Statistics,\/} ISBN 0-471-02403-1, John Wiley and Sons, New York.\ \

P.\ Billingsley, {\it Convergence of Probability Measures, 2\raise0.5ex\hbox{ed},\/} ISBN: 0-471-19745-9, John Wiley, New York.\ \

R.\ S.\ Liptser and A.\ N.\ Shiryayev, {\it Statistics of Random Processes I, II,} ISBN 0-387-90226-0, Springer-Verlag, New York.\ \

However, pursued mathematically, statistics has some relatively advanced prerequisites some of which curiously are not popular in US university mathematics departments.

For the prerequisites,

High School. Should have had high school first and second year algebra (reasonable facility with algebraic manipulations, the binomial theorem, complex numbers, both of which will see again in important ways), plane geometry (where nearly all the work was proofs -- first place to learn about proofs), trigonometry (usually assumed in calculus and important in, say, analysis of organ tone harmonics and, thus, the most important example of an infinite dimensional Hilbert space), analytic geometry (especially the conic sections, especially ellipses which definitely will see again), and, if can, solid geometry (for more intuition in three dimensions).

College. Need a standard calculus course, not necessarily a very comprehensive or difficult one because will do the subject all over again, maybe two or three times, and more later, WITH the proofs!

Then need linear algebra, that is, how to work with data of several dimensions, which is just crucial. The big result is the polar decomposition, and there get to think about ellipses and get to use complex numbers. Also the course is an introduction to functional analysis and Hilbert space. Use any popular book to get started but in the end cover the classic, Halmos, 'Finite-Dimensional Vector Spaces'. Halmos wrote this when he was an assistant to von Neumann and intended it to be a finite dimensional introduction to Hilbert space (which once von Neumann had to explain to Hilbert), which it is. It also has some multi-linear algebra, relevant to the exterior algebra now popular in relativity, but likely for nearly any business applications of statistics for the next several decades can skip that chapter.

Then need some advanced calculus. That is a poorly organized, huge, catch-all subject beyond any one course. The usual start is 'baby' Rudin, 'Principles of Mathematical Analysis'. So, that's calculus with the proofs. Warning: The book is severe, succinct, with zero pictures. Have to draw your own pictures in your head. The book is packed solidly with powerful material just awash in important applications from statistics, economics, and engineering to physics, but there is hardly a hint of the applications in the book. I enjoyed the book, but few people will enjoy it or even get through it. Hint: Get a really good teacher! Then for more, popular is Spivak, 'Calculus on Manifolds', mostly because it is short. Actually, it's too short. I prefer Fleming, 'Functions of Several Variables' until get to the exterior algebra chapter at which time, if care, can now get the thin

Henri Cartan, {\it Differential Forms,\/} ISBN 0-486-45010-4, Dover, Mineola, NY, 2006.\ \

in English.

Between Halmos, baby Rudin, and Spivak, you will have covered Harvard's Math 55 with a colorful description at


Harvard tries to cover these three for freshmen, but in most math departments the material will take you through all or nearly all of a focused undergraduate pure math major.

If somewhere take a course in abstract algebra, e.g., with a little group theory, then that might help!

Graduate School. Might learn a little more about topology, say, from Simmons, 'Introduction to Topology and Modern Analysis'. So, get good with metric spaces and get started on duality.

The next big step is a course in measure theory and functional analysis. The Simmons work will help. Baby Rudin will be crucial; Halmos is recommended. So, with measure theory, do calculus over again and in a very different and much more powerful way and a way just crucial, even central, for mathematical approaches to statistics. The functional analysis will concentrate on representation theorems, the Radon-Nikodym theorem, and Hilbert and Banach spaces. Long popular, from Stanford and sometimes aimed at statistics students, is Royden, 'Real Analysis'. It's gorgeous. Should also read the real half of Rudin, 'Real and Complex Analysis'; it's a few steps up in difficulty from baby Rudin. Again, hint: Get a good course from a good teacher who can get you over the material without getting stuck. Then go back and study the material 2-3 more times, apply it, do some original research in it, and finally begin to understand it.

We're talking high, top, center crown jewels of civilization here; the stuff is of just awesome power; my view is that it is, for the rest of this century, one of the main pillars of increases in economic productivity via the exploitation of Moore's law; on 'what to program', computer science is stuck and this material is the most promising way forward; as a famous restaurant owner once said about some Morey St. Denis, "you won't find better".

Now are ready for probability. I recommend:

Leo Breiman, {\it Probability,\/} ISBN 0-89871-296-3, SIAM, Philadelphia.\ \

M.\ Lo\`eve, {\it Probability Theory, I and II, 4th Edition,\/} Springer-Verlag, New York.\ \

Kai Lai Chung, {\it A Course in Probability Theory, Second Edition,\/} ISBN 0-12-174650-X, Academic Press, New York.\ \

Jacques Neveu, {\it Mathematical Foundations of the Calculus of Probability,\/} Holden-Day, San Francisco.\ \

Neveu is succinct, gorgeous, but not easy. This material is NOT popular in US departments of mathematics. At Princeton, see Cinlar.

Then should make some progress with stochastic processes: The big book is Gihman and Skorohod, right, in three volumes, but mostly people settle for shorter treatments. Whatever, should learn about Poisson processes, Markov processes (discrete time, finite state space is enough to get started), Brownian motion, and martingales. Might also learn about second order stationary processes. A good course in stochastic processes is NOT easy to find, especially in mathematics departments.

Now are ready to attack statistics mathematically! I don't know of a good, single 'mathematical statistics' book at this level. Instead, there are many books -- I gave some above -- and then the journals. Thankfully, the field is relatively close to applications; so can take a practical problem and concentrate on what is relevant to it. One of my papers was some new work, at this level, in mathematical statistics for a problem in practical computing and computer science. The computer science community struggled terribly with the mathematics. So, it was some progress in computer science that the community will have to struggle to understand.

One approach to work in computing is just to try things, that is, just to throw things against the wall and see if they appear to stick. Or, maybe the truth of the situation really is a simple statistical model. Likely that model will fit the data well. So, try many simple statistical models. We use these models mostly ignoring the mathematical assumptions; mostly we are proceeding 'heuristically', that is, with guessing. If any of the models fit well, then they can be considered candidates for the truth. So, are throwing things against the wall to see if they fit. This approach is also called 'data mining'. But note some problems with it:


(1) Will be quite limited in what statistical models can use. That is, will be drawing from a cookbook instead of being a real chef who can create good, new dishes appropriate for the available ingredients and customers!

(2) Don't have much, e.g., have not proceeded mathematically where from the deductive logic of assumptions and proofs actually know in advance some good things about the results. Something like breaking into a pharmacy, mixing up a lot of pills, taking them, and seeing if feel better! Uh, I'll pass and let you do that without me!

(3) May have gone through a lot of computer time in an 'exponential, combinatorial explosion' of efforts throwing against the wall.

(4) Have ignored a LOT in statistics that can add to what know about the results.

(5) Will be tempted to conclude have found 'causality' but will likely not have.

(6) Will be tempted to conclude that have a model that predicts, but that is on shaky ground and risky and needs more work.

Applied to important problems, this approach can be dangerous.

There are not many healthy statistics departments. Much of the career interest is in biostatistics, especially related to FDA rules.

It appears that among the top statistics departments are Berkeley and UNC. Since Breiman and Brillinger are at Berkeley and since Stanford, long good in statistics, is not far away, if I were looking for a Ph.D. in statistics then I'd pick Berkeley.

There is a general problem getting a 'job' in a technical field and likely also with statistics. The assumption in US business is still as in factories 150 years ago: The supervisor knows more than the subordinate; the subordinate is supposed just to add common labor to what the supervisor says. In particular job descriptions are written by the supervisors, not the subordinates!

Well, there are nearly no supervisors in US business who have even as much as a weak little hollow hint of a tiny clue about the material described here. So, won't need that material to qualify for the job descriptions. Moreover, if actually know such material and let that fact leak out, then will likely not make it past the first level HR English major phone screen person who will tremble and conclude that you are not like the employees they have! If you do get hired and someone in your management chain discovers that you have used some mathematics they don't understand, you might be on the way out the door, especially if your work was valuable for the company!

Of course, the solution is to find a valuable application and start your own business. While maybe biomedical venture capital can understand crucial, core technical content, in information technology venture capital, likely you will be trying to explain this stuff to, say, history majors who worked in fund raising, marketing, general management, or financial analysis or have a background in just relatively elementary parts of computing. Just will NOT find more than six people, maybe not more than zero people, in US venture capital who can work the exercises in Royden or explain the strong law of large numbers. Sorry 'bout that! So, if you explain that the value of your venture is the powerful material in your 'secret sauce', then you will be regarded as a kook, far outside the mainstream of venture funded entrepreneurs, discarded, maybe even laughed at. As it is, some of the venture people are making money now, and the rest just want to be more like the ones who are making money. Looking for anything really new, powerful, and valuable is just NOT in the picture.

So, once you have some results in users, customers, revenue, etc., then maybe you can get some venture funding; just why at that point, owning 100% of the business, you would take venture funding, a Board that can fire you, etc. is less clear! Or venture funding is not for everyone! Or venture firms prefer to give money to people who don't need it!

For the real power of the 'secret sauce', you have just to keep that a secret!

Once mathematicians have yachts, at the venture firms math will be to info tech like biochem is to biotech. In the meanwhile note that a valuable application of statistics can put you on the Forbes 400 where there are not many people! Generally if you are making a valuable application of advanced or new statistics, then you will not know many people who understand what you are doing. Or, if lots of people understood it, then it wouldn't be valuable!

statistics is a rock solid part of mathematics, as solid as any part of pure mathematics

The impression left by reading Jaynes (The Logic of Science) was that a huge part of conventional statistics was a hodge-podge of ad hoc methods. The One True Way out of the mess being, of course, Bayesian statistics. What's your take? Do most of the books you advocate follow Bayes?

Also, thanks for the great post. I'd love to know where these statistics have taken you.

The materials I listed never or nearly never mention 'Bayesian statistics', 'subjective probabilities', or 'prior probabilities'.

For what has been done in statistics over the past 100 years or so, each research library has a large section of books and journals. Here my interest was to respond to the question about how to get started in statistics and to outline a future for statistics, especially for exploiting Moore's law for more in economic productivity.

For what I have done in statistics, my interests are in business, and there a good application mostly means starting a new business. I'm doing that, but I'm not supposed to describe the 'secret sauce' in public.

I can give an introduction that might be of some interest in computer science and practical computing.

Given a 9th grade math teacher and one class with 20 boys and another class with 18 girls, do the boys and girls do the same or is there a difference? Uh, maybe as in the URL I gave, some 'feminists' will be very picky about any claims of a difference!

Broadly what we do is make a 'hypothesis' that gives us enough in mathematical assumptions to do some probability calculations. This hypothesis is called the 'null' hypothesis apparently because we intend to reject it, that is, find it to be 'null'. Our intention is to conclude that the hypothesis leads to something of very low probability, so low we reject the hypothesis. Then we know something that appears to be false. Yes, this is less good and complete knowledge than we could want, but maybe this result is good considering how little in data and assumptions we used!

So: put all 20 + 18 scores in a pot, stir the pot, pull out 18 scores and average them, average the remaining 20, and take the difference of the averages. Do this maybe 1 million times (thanking Moore's law) to get the empirical distribution of the differences. Pick a small number, say 1% (for the really angry feminists, 0.1%), and take the region in the 'tails' holding this fraction of the differences. Now take the difference for the real data before stirring, and see where it falls. If that real difference is in the region, then we have some bad news for the angry ones: either boys and girls are not the same, or they are the same and we have observed something rare, too rare to be believed. Amazing that anyone would suspect that by the 9th grade boys and girls were not the same!
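The pot-stirring recipe fits in a few lines of Python. (The scores below are made up, and 100,000 stirs stand in for the 1 million.)

```python
import random

# A sketch of the pot-stirring test above; the scores are made up.
random.seed(0)
boys  = [72, 85, 68, 90, 77, 81, 65, 88, 74, 79,
         83, 70, 76, 92, 69, 84, 71, 80, 75, 87]   # 20 scores
girls = [78, 86, 73, 91, 82, 67, 89, 76, 84, 70,
         93, 79, 85, 72, 88, 74, 90, 81]           # 18 scores

observed = sum(girls) / 18 - sum(boys) / 20        # real difference

pot = boys + girls
extreme = 0
trials = 100_000
for _ in range(trials):
    random.shuffle(pot)                            # stir the pot
    diff = sum(pot[:18]) / 18 - sum(pot[18:]) / 20
    if abs(diff) >= abs(observed):                 # in the tails?
        extreme += 1

p_value = extreme / trials
# reject the null hypothesis (boys and girls the same) if p_value < 0.01
```

Nothing here assumes a distribution for the scores; the stirring itself supplies the distribution of the differences.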

Now, make some mathematics out of this!

When the hypothesis is true, we reject it, and make an error, 1% (or 0.1% or whatever) of the time. So, we have rejected the null hypothesis when it is true; this is called Type I error. Yup, the other possible error is to accept the null hypothesis when it is false, and that is Type II error. Semi-, pseudo-, quasi-amazing.

Hmm .... Here we have an introduction to distribution-free, that is, 'non-parametric', hypothesis testing based on ranks or permutations.

The stirring of the pot is called 're-sampling'. Actually, when we do the mathematics, we will likely want all the combinations of 38 things taken 18 at a time, and there are 33,578,000,610 of those. So, instead of straining Moore's law with all 33 billion, we just 'sample', in this case, 're-sample'.
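That count checks out with `math.comb` (assuming Python 3.8 or later, where it was added):

```python
import math

# exact number of ways to choose 18 of the 38 pooled scores
n_splits = math.comb(38, 18)
print(n_splits)  # 33578000610
```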

So, we see that we get to select the probability of Type I error, that is, the 1%, in advance and get what we select. Progress.

Now suppose the class of 20 boys also takes English, general science, and history. Similarly for the class of 18 girls. So, now on each student we have 4 scores instead of just 1. Now how do we do the test? Hmm ...!

Or, suppose we are given a server farm and a network. We select a 'system' we want to monitor in real-time for health and wellness. Suppose that this system can report data on each of 12 relevant variables 100 times a second.

Our null hypothesis is that the system is healthy. Then, an instance of Type I error is a 'false alarm'. Suppose we want the false alarm rate to be, say, 1 a month.

Then an instance of Type II error is a missed detection of a real problem.

Then for tolerating that rate of false alarms, we want the lowest rate of missed detections we can get.

So, how do we construct our monitoring system?

We would like to use the classic Neyman-Pearson result. Here, however, we are asked for complete information on when our system is 'sick', and likely we don't have that.

Still we can select our rate of false alarms and do something smart with the 12 variables on problems never seen before, i.e., 'zero-day' problems.

So, we have obtained some automation of system monitoring with adjustable, known false alarm rate and, if we look a little, with some nice guarantees on detection rate. Progress in 'computer science'!
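One piece of this can be sketched concretely: calibrating the alarm threshold on data gathered while the system is known to be healthy, so the false alarm rate is selected in advance. (The sketch below uses one simulated health score in place of the 12 real variables; the numbers are invented.)

```python
import random

# A sketch with one simulated health score instead of the 12 real
# variables: set the alarm threshold from known-healthy data so the
# false-alarm rate is whatever we pick in advance.
random.seed(1)
healthy = [random.gauss(0.0, 1.0) for _ in range(100_000)]

target_rate = 1e-4                        # chosen false-alarm rate
k = int(len(healthy) * target_rate)       # samples allowed to alarm
threshold = sorted(healthy, reverse=True)[k]

alarms = sum(1 for x in healthy if x > threshold)
# on the calibration data, exactly k of the 100,000 samples alarm
```

The hard part the comment points at, doing "something smart" with all 12 variables jointly, is not shown; this only illustrates how the Type I error rate gets fixed by choice rather than by luck.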

First, thanks for the long answer.

So, put all 20 + 18 scores in a pot, stir the pot, pull out 18 scores and average, average the remaining 20, take the difference in the averages, do this maybe 1 million times (thanking Moore's law)

I can't believe there isn't a well-defined analytical expression for that... What you described sounds like the kind of inference I would tackle through Bayesian hypothesis testing, while you use... Monte Carlo?

I suspect I'm missing something. Anyway, it sounds like a very interesting problem and if I were around, I'd ask for an interview. Good luck.

"I can't believe there isn't a well-defined analytical expression for that..."

Well, yes, there is some solid mathematics behind the pot stirring!

Writing out some appropriate mathematics, will likely want all 33 billion combinations of 38 things taken 18 at a time.

With this math, there is no use of 'prior probabilities', and the Monte Carlo is just a fast way to replace finding all 33 billion combinations.

For more, can see (with TeX markup):

E.\ L.\ Lehmann, {\it Nonparametrics: Statistical Methods Based on Ranks,\/} ISBN 0-8162-4994-6, Holden-Day, San Francisco, 1975.\ \

Jaroslav H\'ajek and Zbyn\v ek \v Sid\'ak, {\it Theory of Rank Tests,\/} Academia, Prague, 1967.\ \

Sidney Siegel, {\it Nonparametric Statistics for the Behavioral Sciences,\/} McGraw-Hill, New York, 1956.\ \

So, it's old material. There are many such hypothesis tests.

But the old material essentially always has, for the student case, only one number on each student. The part about what to do when each student has 4 scores can take us into the journals and maybe start some more research. Similarly for the 12 numbers from the 'system' to be monitored.

With this math, there is no use of 'prior probabilities'

But the full hypothesis is "Given the data, are girls better than boys at this exam?" and clearly, the prior probability is relevant. Maybe in this case one might want to use a 50-50% prior, but in general, if the hypothesis was instead "Given the [same] data, can we conclude that this paranormal event really happened?" then a healthy skeptical prior would be in order.

Anyway, regardless of the "prior" issue, I've thought some more about your original problem, and I'm not so sure about your methodology. From my perspective, if you want to reach a "girls better than boys on this test in this class - true or false" conclusion, then individual variance is a crucial issue. Assuming that all girls and boys would always get exactly the same result were they to take the same test over and over again, the variance is 0, and one could simply check whether (average of boys) < (average of girls) and conclude accordingly. At the other extreme, if students show huge individual variance (e.g., their score depends on whether they had breakfast that morning), then the test results are almost meaningless. So the outcome depends crucially on this variance, which your problem description makes no mention of, and which Monte Carlo methods do nothing to recover. One would have to make assumptions about it.

Maybe a better example (closer to your other "system health" problem) would be: a boy takes an exam 20 times, a girl takes the exam 20 times, and they get such and such results (assuming they don't improve in between). Is the girl better than the boy? Then one could assume a Gaussian distribution of test results for both, estimate each average and variance, then check the overlap between the two Gaussians and conclude accordingly.
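That Gaussian version of the comparison can be sketched with the standard library alone. This is my own illustration of the idea, not a method from the thread: fit a normal to each student's repeated scores, then use the fact that the difference of two independent normals is itself normal to get the probability that one fresh draw beats the other.

```python
import statistics
from math import erf, sqrt

def gaussian_compare(scores_a, scores_b):
    """Fit a normal to each set of repeated scores and return the
    probability that a fresh draw for B exceeds a fresh draw for A,
    i.e. P(B - A > 0) where B - A ~ Normal(mu_b - mu_a, var_a + var_b)."""
    mu_a, mu_b = statistics.mean(scores_a), statistics.mean(scores_b)
    var_a, var_b = statistics.variance(scores_a), statistics.variance(scores_b)
    z = (mu_b - mu_a) / sqrt(var_a + var_b)
    # Standard normal CDF via erf
    return 0.5 * (1 + erf(z / sqrt(2)))
```

If the two score distributions barely overlap, this comes out near 1 (or near 0); heavily overlapping Gaussians push it toward 0.5, which is exactly the "huge individual variance makes the results almost meaningless" point above.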

Maybe your MC really does boil down to something similar, but I don't see it. And sadly, I can't quite construct an argument about what I think is problematic with it. It just doesn't feel right.

Thanks for the references, I might check them eventually, though they seem too specialized for my needs. Someone recently posted this book on HN, and people had good things to say about it:

The Elements of Statistical Learning - Data Mining, Inference, and Prediction


It's next on my reading list, but was absent from yours. It's available as a PDF --- the determining factor for someone with no library access.

"But the full hypothesis is 'Given the data, are girls better than boys at this exam?' and clearly, the prior probability is relevant."

No, prior probabilities have nothing to do with it.

We state our 'null' hypothesis that the boys and girls do equally well. This hypothesis has nothing to do with a belief in prior probabilities, or a belief in any probabilities at all. Instead, we state this hypothesis as something that gives us mathematical assumptions for some calculations, so we can reject it and then conclude that it was false.

Generally in hypothesis testing we don't believe the null hypothesis as prior probabilities; indeed, likely we don't believe it at all and are stating it to reject it and conclude it is false.

In more detail, we assume that 20 boys and 18 girls are 38 independent samples from some one distribution. It turns out, we don't need to say anything about that distribution because we are being 'distribution-free'. In particular, we get to ignore the Gaussian distribution. GOOD.

Independent? Okay: Suppose we DO give you the true distribution of the data and the first 37 scores. Now you get to guess score 38. Do the 37 scores help you beyond just the distribution? No. Same for any subset of the scores. Then, we have independence.

With this null hypothesis, the average of the scores of the 20 boys and the average of the scores of the 18 girls should be 'close'. How close? Well, under the null hypothesis and with the values we observed, we have a way to proceed: we can find the distribution of the difference in the averages, with everything we do know given and fixed. For this distribution, basically we look at all the 33 billion or so differences obtained by taking all combinations of 38 things taken 18 at a time. Justification? If we work at it mathematically, then under the null hypothesis we can show that each of those 33 billion cases is equally probable.

Then we pick a small number, say, 1% for the size of our Type I error, that is, the probability of rejecting the null hypothesis when it is true.

Then we find the differences in the 1% tail of the 33 billion differences.

Then we look at the difference from our actual data. That difference will be one of the 33 billion. We see if that difference is in the 1% tail.

If the difference is in the 1% tail, then one of two things is true:

(A) The null hypothesis is true, the boys and girls are the same, that is, independent samples from the same distribution, and with our actual data the difference is relatively large, out in a tail, and we have observed something that happens only 1% of the time.

(B) The null hypothesis is false, that is, in some way the boys and girls are different. That is, we still believe the independence assumption, so what is false is just that the mean for the boys is different from the mean for the girls.

If the 1% is so small we don't believe (A), then we conclude (B).
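For small groups, the procedure above can be carried out exactly by enumerating every relabelling rather than sampling. This is only a sketch (function name mine); for 38 scores taken 18 at a time, the roughly 33 billion cases are why one falls back on Monte Carlo.

```python
from itertools import combinations

def exact_permutation_test(boys, girls):
    """Enumerate every way of choosing which pooled scores are labelled
    'girls' (each equally probable under the null hypothesis) and return
    the exact fraction of relabellings whose difference of averages is
    at least as large as the observed one."""
    pooled = boys + girls
    n, k = len(pooled), len(girls)
    observed = sum(girls) / k - sum(boys) / len(boys)
    count = total = 0
    for idx in combinations(range(n), k):
        chosen = set(idx)
        g = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(n) if i not in chosen]
        diff = sum(g) / len(g) - sum(b) / len(b)
        total += 1
        if diff >= observed:
            count += 1
    return count / total  # exact one-sided p-value
```

If the returned fraction falls below the chosen Type I error level (the 1% above), the observed difference sits in the tail and we reject the null hypothesis.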

Variance has nothing to do with it.

Welcome to distribution-free 'two sample' hypothesis testing 101.

I've been reading Jaynes again this week, and he's just very, very convincing. And so I'm trying to read everything you wrote through these Bayesian glasses, but sadly, I'm not successful. Jaynes is rather critical of Fisher's hypothesis testing, on the grounds that you can't accept or reject a hypothesis on its own; you need an alternative to compare it to, and that alternative needs to make definite predictions. I don't see what the alternative to your null hypothesis is (the negation of the null hypothesis does not make definite predictions).

I love that you went out of your way to create a new account named HilbertSpace to provide such an insightful response. Well done.

lots of good info, but you somehow strike me as the kind of guy who will kill a fly with a nuclear bomb...
