All of Statistics, by Larry Wassserman (2013) [pdf] (cmu.edu)
171 points by rfreytag on May 15, 2016 | 52 comments

Happy to see a book like this trending on hn, especially with a sentence like: "Using fancy tools like neural nets, boosting, and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a band-aid." in its preface. I definitely agree, since I wasted a lot of time doing fruitless surgeries before I went and learned about band aids in depth.

From my look at Part 1, it has some great coverage of the basics, all of which are important. Some of the fundamentals that I see left out are rightly left out since they require experience in real analysis to appreciate, and maybe aren't very actionable. There are few proofs, but, since the goal is a quick understanding, I can also appreciate this.

It looks to me like a great intro of statistics for CS people, as the author says.

> "Using fancy tools like neural nets, boosting, and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a band-aid."

Having studied both statistics and neural networks, I'm not sure if I completely agree with that quote. There are lots of neural network applications that have little to do with statistics (image recognition with convolutional neural networks for example).

I am pretty sure that the author means neural networks for statistical applications though.

Image classification has everything to do with statistics: you're guessing the probability distribution over the classes conditional on the input image vector; the model is trained through a process of statistical inference (using gradient descent).
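To make the "probability distribution over the classes" point concrete, here is a minimal sketch of the last step of a classifier. It assumes the network has already produced one raw score per class; the scores below are invented, and `softmax` and `cross_entropy` are illustrative helper names, not any particular library's API.

```python
import math

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log-likelihood of the true class under the predicted distribution."""
    return -math.log(probs[true_class])

# Hypothetical scores a classifier might emit for three classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)                    # non-negative, sums to 1
loss = cross_entropy(probs, true_class=0)  # what training tries to minimize
```

Minimizing this loss over a training set is the same thing as maximizing the likelihood of the observed labels, which is the sense in which training can be read as estimation.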

You have stated something that someone with a high school-level (i.e. superficial) understanding of probability/statistics would understand. Most research in neural networks requires only a very superficial level of statistics.

There are certainly many areas of neural networks where statistics is important (more theoretical areas), but those don't form the core of the research field.

Also, calling (stochastic) gradient descent a form of statistical inference, while technically correct, is a ridiculous stretching of the term. No researcher considers SGD to be a statistical inference algorithm.

I would agree with you and I also did research on neural networks when I was in college. I'm not sure if those downvoting you have actually done/seen neural network research. Deep learning is pretty engineering driven (as opposed to theory driven), right now.

I have a pretty weak understanding of statistics and from my perspective it was very common for grad students to also have a weak understanding (of course, those working in statistical learning theory had a strong understanding). This is a pretty ordinary occurrence - it shouldn't be surprising - this is sort of like pointing out that many theoretical statisticians have poor coding skills.

>There are lots of neural network applications that have little to do with statistics (image recognition with convolutional neural networks for example).

You're kidding, right? The most fundamental reasons that deep convnets work at all are statistical in nature.

Well, yes and no. It's like saying you need to understand Bernoulli's principle before designing the SR-71. True, but 7 levels short of where you're working.

> "Using fancy tools like neural nets, boosting, and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a band-aid."

Or like programming in C without also knowing assembly and compiler theory? Or flying a plane without having a degree in aerodynamics?

I think we can extract a lot of use from high level frameworks that abstract away much of the gritty statistics and math. For most applications all we need is to have well behaved, well tested libraries and some basic intuition about how they work.

Fortunately, in machine learning almost everything is a function from inputs X to outputs y, and we know how functions work from programming. It's easy to integrate in apps. The devil is in hyperparameter tuning, but we can get away with good initializations and some measure of web research.

In time people will just use precomputed neural nets for standard tasks like Syntaxnet (text parsing), Inception (image classification) or use web APIs to hosted services (less secure for sensitive data). We make those better, maybe fine tune them to our needs and get away with it 100x faster.

There is also work in automated hyperparameter search. Machine learning could become a black box when they get good enough.

Straw man going on here. The problem is, whilst you can get numbers out without understanding the statistics underneath, you can easily misinterpret that output if you don't understand it.

Look at how p values are used in science journals for an example of poor stats knowledge affecting real life outputs.

There are some things in life that do require you to do the requisite reading. Things based on stats fall into that camp.

> since I wasted a lot of time doing fruitless surgeries before I went and learned about band aids in depth

I'm going to just assume you're being literal here. It's really brightened up my day to imagine the moment such a person discovered you could close up wounds after surgery.

Suggestions for a more rigorous treatment?

My favorites are Casella and Berger: https://books.google.com/books/about/Statistical_Inference.h...

and Hogg, McKean and Craig: https://www.pearsonhighered.com/program/Hogg-Introduction-to...

They are both good books. The first one is more rigorous, but the second one covers more breadth and I think has better exercises.

Anything with measure theory, though you will regret it :)

On the probability side, I thought Ash's "Probability & Measure Theory" was good for self-study, although some experience with real analysis is definitely a prerequisite. It can be pretty dense at times, but well worth going through.

stochastic calculus really hurt me. i did well for it, I TA'd for it, i look at it from time to time in my work, but I still find it very difficult. i don't think i'm much good at mathematics, i just try very hard and hope for the best.

Here's a free one I found in the event anyone else is curious:


Sadly, no link to free eBook, which is not surprising because it seems that the book is still in print, having been released as recently as 2004, and updated in 2005 and 2013.

This post links to the website supporting the book and provides links to errata, code and data. The links on the page to Springer and Amazon are broken: Here are valid links:



Here is the Google Books link:


Not sure about HN's policy on posting links to pirated material, but as a Freedom of Information supporter, I will note that the book can be found at http://gen.lib.rus.ec.


>In comments, it's ok to ask how to read an article and to help other users do so.

Depends on just how broad article is.

Can someone say in a few sentences what Statistics is all about? I can't shake off the feeling that it is just glorified curve fitting.

Edit: Please stop the down votes, just an electrical engineer here, with one basic course in Probability and Stat. :)

It is just glorified curve fitting. Glorified curve fitting is a very rich field.

Another way of thinking about it (described in Wasserman's book) is that statistics is the inverse problem of probability.

Probability theory asks: given a process, what does its data look like? Statistics asks: given data, what process might have generated it?
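A small sketch of the two directions, using a biased coin as the assumed "process". The value 0.3, the sample size, and the function names are all made up for illustration.

```python
import random

def simulate_flips(p, n, rng):
    """Probability direction: given the process (heads with probability p), produce data."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def estimate_p(flips):
    """Statistics direction: given data, estimate the process (here, the sample proportion)."""
    return sum(flips) / len(flips)

rng = random.Random(0)  # fixed seed so the run is repeatable
true_p = 0.3            # known to the simulator, hidden from the estimator
flips = simulate_flips(true_p, 10_000, rng)
p_hat = estimate_p(flips)  # should land close to true_p
```

The estimator only ever sees `flips`; how close `p_hat` gets to `true_p`, and how sure we can be about that, is what the rest of the subject is about.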

> Probability theory asks: given a process, what does its data look like? Statistics asks: given data, what process might have generated it?

Excellent summary, thank you.

Pretty much any kind of mathematical modelling that involves uncertainty, really.

Making inferences and predictions from data, in the presence of uncertainty.

Analysis of the properties of procedures for doing the above.

If you want examples that avoid the feel of just "curve fitting" (assume you mean something like "inferring parameters given noisy observations of them") -- maybe look at models involving latent variables. Bayesian statistics has quite a few interesting examples.

Thanks! I had a course at uni named Probability and Statistics, but since it was the first (and only) such course in the EE curriculum it was oriented toward probability, and Statistics was an afterthought (I only remember simple linear and multilinear regression). That is probably the main reason I only see curve fitting everywhere :)

Neural nets are glorified curve fitting. They are curves parameterized by the weight matrix. The weight matrix is relatively massive (e.g. 1M DOF), which makes the family of curves it generates almost fluid, like a piece of yarn. Now given a small amount of data, and a programmable piece of string, how well can you fit the data? Turns out the string is higher dimensional than the data, so you can fit any curve you like. The trick is avoiding overfitting. Overfitting is the yarn warping its shape to fit noise that has no intrinsic meaning. That's what cross validation prevents ... overfitting. Stop moving the yarn to match the training data better when it fails to improve an independent performance test. That's what machine learning is... figuring out algorithms that don't overfit and have some ability to generalize onto data not seen before. It's still basically glorified curve fitting.
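The "stop moving the yarn" rule can be sketched as early stopping against a held-out validation set (strictly speaking a simpler relative of k-fold cross-validation). Everything below is an assumption chosen for illustration: a degree-9 polynomial stands in for the network, the target is a noisy sin(3x), and the learning rate, patience, and function names are invented.

```python
import math
import random

def features(x, degree):
    """Polynomial features 1, x, x^2, ..., x^degree."""
    return [x ** k for k in range(degree + 1)]

def predict(w, x):
    return sum(wk * fk for wk, fk in zip(w, features(x, len(w) - 1)))

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def make_data(n, rng):
    """Noisy samples of sin(3x) on [-1, 1] (an invented target)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, math.sin(3 * x) + rng.gauss(0.0, 0.2)))
    return data

def fit_with_early_stopping(train, valid, degree=9, lr=0.05,
                            max_steps=2000, patience=50):
    """Gradient descent on training MSE; keep the weights with the best validation MSE."""
    w = [0.0] * (degree + 1)
    best_w, best_val, stale = list(w), mse(w, valid), 0
    for _ in range(max_steps):
        grad = [0.0] * len(w)
        for x, y in train:
            err = predict(w, x) - y
            for k, fk in enumerate(features(x, degree)):
                grad[k] += 2.0 * err * fk / len(train)
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
        val = mse(w, valid)
        if val < best_val:
            best_w, best_val, stale = list(w), val, 0
        else:
            stale += 1
            if stale >= patience:  # validation stopped improving: stop fitting
                break
    return best_w, best_val

rng = random.Random(0)
train, valid = make_data(20, rng), make_data(20, rng)
initial_val = mse([0.0] * 10, valid)
best_w, best_val = fit_with_early_stopping(train, valid)
```

The training loss keeps falling as long as the loop runs; the returned weights are the ones that did best on data the fit never touched, which is the whole point of the comment above.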

Statistics is about inferring probabilities from data such that we can make predictions (where data are discernible differences of some quantities). Inference means finding out what the world is about using some sort of representation (a model). The entire project is basically concerned with (mostly lossy) compression: How to represent the complexity of the world such that we can reason about it with limited resources, i.e. estimate things we can't compute using things we can compute. If our statistic summarizes enough to allow us to make useful predictions, it is called a sufficient statistic.

Probability is at the heart of the project: frequencies that summarize reoccurring data. Instead of storing a reoccurring pattern multiple times, we just store it once and record how often it has occurred.
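The "store it once and record how often it has occurred" idea, as a tiny sketch with invented observations:

```python
from collections import Counter

# Hypothetical repeated observations.
observations = ["sun", "rain", "sun", "sun", "cloud", "rain", "sun"]

counts = Counter(observations)  # each distinct value stored once, with its count
total = sum(counts.values())
frequencies = {value: n / total for value, n in counts.items()}  # empirical probabilities
```

Seven raw records collapse into three (value, count) pairs, and the relative frequencies are the simplest possible estimate of a probability distribution.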

Statistics is applied Probability Theory. Statistics tries to find and characterize the probability distribution of 'Random Variables' through series of observations.

Basically, you count things and compare that to how many things you think you should have counted given your assumptions.
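One standard way to "compare counts to what you expected" is Pearson's chi-square statistic. A minimal sketch, assuming a fair six-sided die as the hypothesis and using made-up counts from 60 rolls:

```python
def chi_square_statistic(observed, expected):
    """Sum over categories of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [8, 12, 9, 11, 6, 14]  # invented counts for faces 1..6, 60 rolls in total
expected = [10] * 6               # what a fair die predicts on average

stat = chi_square_statistic(observed, expected)  # 4.2 for these counts
```

With 6 categories there are 5 degrees of freedom, and the usual 5% critical value is about 11.07, so a statistic of 4.2 gives no reason to doubt the fair-die assumption.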

Statistics has, of course, grown from its beginnings as a means to summarize social/population conditions (the median number of serfs per farm, average bushels per acre, etc), yet: "estimating an accurate metric, (for example a 'central tendency'), from incomplete data" remains a central theme.

Inferring meaning from data.

As an EE, how would you explain concepts like a PN junction or field effect transistor without using statistical mechanics? (Ie, expected behaviour for ensembles of huge numbers of particles).

The models EE use are simplified, drift and diffusion current and electron and holes with their different mobilities and energy levels. Math apparatus used here, and strictly related to statistics, is limited to averaging, I would dare to say.

Great. Those formulae are derived from statistical representation of huge number of particles, eg electrons modelled as a nearly-free gas of fermions obeying Fermi-Dirac statistics. And so too would anything making use of PN junctions, band gaps, especially when considering temperature dependence.

I think you can agree now that your original observation of statistics as "glorified curve fitting" was a bit naive.

Quantum Monte Carlo simulation is pretty standard for modern semiconductor devices. Most of the effects of interest in highly scaled transistors, for example, cannot be properly accounted for otherwise.

On the basic materials level, density functional theory is the current gold standard, and it's extremely statistics heavy.

At the systems and architecture levels, you may be right, though.

estimating uncertainty. The curve you fit represents what you have in your data. The question is what is out there, in your target population.

While All of Statistics is wonderful in its genre, it really isn't a good place to start to learn statistics. Firstly because it focuses very heavily on the theory and contains very little on practical modeling. Secondly because the theory isn't even necessarily going to be very enlightening: frequentist statistics is a mathematical tour de force, using every possible hack you can think of to be able to draw statistical conclusions from nothing more than a few pen and paper calculations, but as a result frequentist theory won't actually give you any sort of deeper insight into the core theoretical foundations of probability and statistics.

If you're new to statistics, try Allen Downey's http://greenteapress.com/thinkstats2/index.html or Brian Blais' http://web.bryant.edu/~bblais/statistical-inference-for-ever.... Both are free.

Then, go in depth on regression. Not just feeding in the numbers and getting back a fitted model, but actually knowing how everything works, what the common issues are, how to interpret the estimates and so on. Once you've got that down, read Regression Modeling Strategies by Harrell to go really in depth.

Or if you're really just interested in prediction, Hastie and Tibshirani is wonderful of course.

Brian Blais's free book doesn't contain any reference to the Poisson Distribution.

For ML Hastie and Tibshirani ISLR is very good but is more for applications of machine learning: classification, regression and prediction.

The Poisson distribution is just a limiting case of the binomial distribution and not needed to explain any of the concepts that drive statistical inference, so its absence from an introductory text is hardly something to fuss about.
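The limiting claim is easy to check numerically. The sketch below holds the mean fixed at an arbitrary value of 3 and compares Binomial(n, 3/n) with Poisson(3) as n grows; the function names are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0  # arbitrary mean, held fixed while n grows

def max_gap(n):
    """Largest pointwise difference between the two pmfs over k = 0..10."""
    return max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
               for k in range(11))

gaps = {n: max_gap(n) for n in (10, 100, 1000, 10000)}
```

The gap shrinks as n increases with n * p held at 3, which is the sense in which the Poisson is a limit of the binomial.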

Who is this book supposed to be for? Given the heavy emphasis on formalism (theorem, proof, theorem, proof, theorem, proof), and the lack of a single example that actually computes a number, I hazard a guess that this book is not for people who actually want to apply statistics to real problems.

A while back I had to teach myself Fisher matrices and the Cramér–Rao bound to solve a problem I was working on. I quickly found that 90% of statistics textbooks and lecture notes on this subject are completely useless for people like me who want to arrive at a number, not some abstract expression involving angle brackets or measures or E[...] or whatever.

The Wikipedia article on Fisher information [0] is one such example of a resource that is full of useless formal crap that crowds out an explanation for real people about how to use this statistical tool. This book appears to be of the same ilk. (Also, this book apparently does not discuss the Cramér–Rao bound. Ironic given the book's title.)

If anyone is curious, the single best explanation of the Fisher matrix and the Cramér–Rao bound that I have found is tucked away in an appendix of the Report of the Dark Energy Task Force [1]. In one page they manage to concisely and clearly explain where the Fisher matrix comes from, how to compute it, and how to apply the Cramér–Rao bound.
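For readers who want to see a number come out, here is a minimal sketch of the recipe for one simple case. It is not taken from the report; it assumes a straight-line model y = a + b*x with independent Gaussian noise of known standard deviation, where each Fisher matrix entry is the sum over data points of (d model/d theta_i) * (d model/d theta_j) / sigma^2. The x positions and sigma are invented.

```python
import math

def fisher_matrix_line(xs, sigma):
    """Fisher matrix for y = a + b*x with independent Gaussian noise of std sigma.

    d(model)/da = 1 and d(model)/db = x, so the entries are simple sums.
    """
    f_aa = len(xs) / sigma ** 2
    f_ab = sum(xs) / sigma ** 2
    f_bb = sum(x * x for x in xs) / sigma ** 2
    return [[f_aa, f_ab], [f_ab, f_bb]]

def invert_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical measurement positions
sigma = 0.5                     # assumed known noise level

F = fisher_matrix_line(xs, sigma)        # [[20, 40], [40, 120]]
cov_bound = invert_2x2(F)                # Cramér-Rao bound on the covariance
min_std_a = math.sqrt(cov_bound[0][0])   # sqrt(0.15), about 0.387
min_std_b = math.sqrt(cov_bound[1][1])   # sqrt(0.025), about 0.158
```

The Cramér-Rao bound says no unbiased estimator of the intercept or slope can have a smaller standard error than these values; for this linear Gaussian case ordinary least squares attains them exactly.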

[0] https://en.wikipedia.org/wiki/Fisher_information

[1] http://arxiv.org/abs/astro-ph/0609591

I found this book to be a godsend. I never took statistics and always wanted to better understand the deep conceptual ideas in the field. I had so many frustrating experiences with books that came highly recommended to me, and turned out to be not what I wanted at all. They spend chapters and chapters beating around the bush, conversationally talking about general ideas around data management and measurement bias and research design and different ways of charting data sets.

I cannot tell you how frustrating this was for me. I wanted just the meat: the core mathematical concepts on which statistical models and inferences are built. Don't tell me a folksy story about gathering soil samples, show me the tools and what they can do, both their power and their limitations. I can think for myself about how to apply those concepts.

I loved this book for being exceptionally clear and terse. I was hooked from the first sentence: "Probability is a mathematical language for quantifying uncertainty." That one sentence makes the concept clear in a way that the entire chapter on probability from "Statistics in a Nutshell" (http://www.amazon.com/Statistics-Nutshell-Sarah-Boslaugh/dp/...) did not.

I'm not someone who thrives on theorems and proofs, I thrive on concepts. And I found this book dense with clear explanations of the key concepts.

>Who is this book supposed to be for?

I don't have the book here at work so I can't quote the book's introduction, but in some sense the title is meant to be literal. It's an attempt to cram an entire 4-year undergraduate statistics program into a single book, and in my opinion it's mostly successful. This book is my go-to reference for those "Ahhhh, I remember hearing about [insert statistical test here] back in college, what was it again?" moments.

When I was taking a class in Statistical Inference, we used a combination of Statistical Inference (Casella and Berger), Introduction to Mathematical Statistics (Hogg and Craig) and Probability and Statistics (Degroot and Schervish). If you're still interested in learning about Fisher Information and the Cramer-Rao Lower Bound, you can refer to pages 514 - 521 of Probability and Statistics 8th Ed. It has a number of proofs which you can skip if you're not interested but it also provides a number of examples using different distributions to calculate both the Fisher Information and the Cramer-Rao Lower Bound.

This is one of the few hard cover books I have found worth its price. The statistical principles are succinctly explained such that they can be quickly implemented.

This is a major piece of my machine learning self study curriculum:


Some links to problem set solutions there

One of my favorite books! Shows you how connected statistical inference and machine learning really are.

Can someone compare this with Hastie and Tibshirani (https://lagunita.stanford.edu/courses/HumanitiesSciences/Sta...) ? I wonder which one is more practical

The Hastie/Tibshirani book (http://www-bcf.usc.edu/~gareth/ISL/) is very practical.

How does this compare to, say "Introduction to Statistical Learning" and "Elements of Statistical Learning" by Hastie et al? As I understand, the former is also supposed to be a concise introduction to statistical concepts while the latter offers a more rigorous treatment. Where does this book fall in between?

This book is not between ISL and ESL. All of Statistics is an introductory course (and has a much wider scope, including advanced topics) while even the watered-down ISL assumes that the reader already knows a bit of statistics.

There's a typo in the title, it's Wasserman, 2 's' only. Can someone fix it?

I think it's just part of the neue Rechtschreibung.

Price of this book is steep :( $90+ for new and $70+ for used.
