What are the most important statistical ideas of the past 50 years? [pdf] (columbia.edu)
288 points by luu 36 days ago | 69 comments



For anyone wanting to learn causal inference (the first item in the list), I highly recommend "Causal Inference, the Mixtape" by Scott Cunningham, a professor of economics at Baylor University. Scott has been writing this book incrementally in the open for the last couple of years, and recently completed and published it, and it is a thorough introduction to numerous techniques for inferring causation in different contexts. https://www.scunning.com/mixtape.html


https://mixtape.scunning.com/

Link to the free HTML version


Scott Cunningham is an old internet pal of mine, before he was famous!


I desperately want to get to a point where I'm able to quickly sketch out a statistical model for making a decision, but I'm struggling mightily on where to get started that doesn't include some kind of "hide the uncertainty" shell game, such as how I feel when Bayesianism gets thrown around.

Submissions like this give me hope, that if I just read more, I'll get there, but nothing listed here feels like it's usable in a "napkin math" scenario. I just bought two of Gerd Gigerenzer's latest books, and I've already obtained some of David Sklansky's more introductory work on gambling, but I'm concerned my focus is either too theoretical or too narrowly on gambling (though I think there's a lot to learn in how a gambler assesses a bet, and how it applies to other situations).

I guess if it were easy, we'd all be doing it...


Depending on the context, I'm a fan of the work of Douglas Hubbard, in his book How to Measure Anything[1]. His approach involves working out answers to things that might sometimes be done as a "back of the napkin" kind of thing, but in a slightly more rigorous way. Note that there are criticisms of his approach, and I'll freely admit that it doesn't guarantee arriving at an optimal answer. But arguably the criticisms of his approach ("what if you leave out a variable in your model?", etc.) apply to many (most?) other modeling approaches.

On a related note, one of the last times I mentioned Hubbard here, another book came up in the surrounding discussion, which looks really good as well. Guesstimation: Solving the World's Problems on the Back of a Cocktail Napkin[2] - I bought a copy but haven't had time to read it yet. Maybe somebody who is familiar will chime in with their thoughts?

[1]: https://www.amazon.com/How-Measure-Anything-Intangibles-Busi...

[2]: https://www.amazon.com/gp/product/0691129495/ref=ppx_yo_dt_b...


Let me second "How to measure anything." I think it should be required reading for human beings.


I would add that "How to Measure Anything in Cybersecurity Risk" should be a core part of Infosec literature.


How to Measure Anything is a fantastic book. Here are the most significant insights you learn in the book:

- how to measure anything; Hubbard actually comes through on the promise of the title - after finishing the book you will truly feel that the scope of what you can measure is massive. He does this by a change in the definition of what it means to measure something, but you realize his definition is more correct than the everyday intuitive one.

- value of information; Hubbard gives a good introduction to the VOI concept in economics, which basically lets you put a price on any measurement or information and prioritize what to measure

- motivation for 'back of the napkin' calcs; through his broad experience he has seen how a lot of the most important things that affect a business go unmeasured, and how his approach to 'measuring anything' can empower people to really measure what matters.

Reading this book provided one half of what I have been searching for for a long time - a framework for thinking about data science activities which is not based on hype, fundamentally correct and still intuitive and practical.
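
To make the value-of-information point concrete, here is a toy sketch of an "expected value of perfect information" calculation. The scenario, payoffs and probabilities are made up for illustration and are not taken from the book; the names `expected_value` and `evpi` are hypothetical helpers.

```python
# Toy expected-value-of-perfect-information (EVPI) calculation.
# All payoffs and probabilities below are invented illustration values.

def expected_value(payoffs, probs):
    """Probability-weighted average payoff of one action."""
    return sum(p * v for p, v in zip(probs, payoffs))

def evpi(actions, probs):
    """actions: dict of name -> list of payoffs, one per state of the world."""
    # Best we can do today: commit to one action before the state is known.
    best_without_info = max(expected_value(v, probs) for v in actions.values())
    # With perfect information we would pick the best action in each state.
    best_with_info = sum(
        probs[s] * max(v[s] for v in actions.values())
        for s in range(len(probs))
    )
    return best_with_info - best_without_info

probs = [0.6, 0.4]  # assumed P(demand is high), P(demand is low)
actions = {
    "launch": [500_000, -300_000],
    "dont_launch": [0, 0],
}
print(round(evpi(actions, probs)))
```

Under these assumed numbers, the result is an upper bound on what any measurement of demand could be worth, which is the sense in which the concept lets you prioritize what to measure.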


A common mistake technical people make is to be too theoretical or overcomplicated with their work and decision making, where everything has to be some math model written out in a nicely formatted LaTeX document. Don't fall into that trap, it is very ineffective.

Stats is a tool like any other tool. Boil your question down to the fundamentals (first principles) and maybe stats is one of the tools you decide to use to solve part of it, where appropriate.

Most questions involving data can be answered with something as basic as a plot.


Easier said than done. The complete history of science is the human struggle to boil down the mathematical modelling of nature to its fundamentals.


  "mathematical modelling of nature"
My point is most real world problems people face (even most problems that data scientists face at work) shouldn't be modelled with formal math.

Basic reasoning, tinkering, plotting data and playing around with data in Excel or Python is usually sufficient to answer most questions, yet many will try to overcomplicate the issue with complex math or stats. Perhaps in an attempt to impress their peers, or perhaps because they've just come out of years of university that taught them to think theoretically.

I'm reminded of that meme going around.

70 IQ: plot data

100 IQ: support vector machines! deep Q-nets!

130 IQ: plot data


While I'm not sure how much the modern methods are amenable to napkin math, as stated in the article, a lot more methods use simulation, which, if you can code, is pretty straightforward to get working.

Jake Vanderplas's presentation https://speakerdeck.com/jakevdp/statistics-for-hackers can give you some concrete ideas of how far you can get with just a random number generator.
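
As a rough sketch of the "just a random number generator" idea, here is a shuffle (permutation) test for a difference in means using only the standard library. The data are made up, and this is an illustration of the general technique rather than a reproduction of anything in the slides.

```python
import random

def permutation_p_value(a, b, n_shuffles=10_000, seed=0):
    """One-sided shuffle test: how often does a random relabelling of the
    pooled data give a mean difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        new_a = pooled[:len(a)]
        new_b = pooled[len(a):]
        diff = sum(new_a) / len(new_a) - sum(new_b) / len(new_b)
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

# Invented example data: two small groups of measurements.
group_a = [10, 11, 12, 13, 14]
group_b = [1, 2, 3, 4, 5]
print(permutation_p_value(group_a, group_b))
```

No distributional formula is needed: the shuffling itself stands in for the null hypothesis that the group labels don't matter.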


What do you mean by hide the uncertainty? To me Bayesian modelling is about making explicit the uncertainty.


> I desperately want to get to a point where I'm able to quickly sketch out a statistical model for making a decision, but I'm struggling mightily on where to get started that doesn't include some kind of "hide the uncertainty" shell game, such as how I feel when Bayesianism gets thrown around.

Being comfortable around basic probability distributions is probably the main prerequisite I suggest. If you can understand the uses and analyses of Bernoulli, Binomial, Poisson, and normal RVs, you should be at a point where you can try to model real life phenomena. A surprising amount of life can be modeled as multiple weighted coin flips.
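
As a small illustration of the "weighted coin flips" point, here is a sketch that treats 10 sales calls as independent coin flips with an assumed 20% success chance each, and asks how likely at least 3 successes are. The scenario and numbers are invented; the exact Binomial answer is compared with a brute-force simulation.

```python
import math
import random

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent flips with success chance p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_at_least(k_min, n, p):
    """P(at least k_min successes), summed from the exact pmf."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))

def simulate_at_least(k_min, n, p, trials=100_000, seed=0):
    """Same probability, estimated by literally flipping weighted coins."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        successes = sum(rng.random() < p for _ in range(n))
        if successes >= k_min:
            hits += 1
    return hits / trials

exact = prob_at_least(3, 10, 0.2)
approx = simulate_at_least(3, 10, 0.2)
print(f"exact={exact:.4f} simulated={approx:.4f}")
```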


> I desperately want to get to a point where I'm able to quickly sketch out a statistical model for making a decision

What kind of decisions do you want to make? If you offer that up, others will be able to provide guidance to frameworks used to make those decisions.

The field of operations research uses topics from all sorts of fields to try to make better decisions. I would broadly classify modelling into two large categories: deterministic (e.g., how to schedule a flight crew, or where to locate a warehouse) or stochastic (using probability models, e.g., how many people to staff in a bank). Of course there is no problem that will be strictly one form or the other. There will be points in a modelling problem where it's useful to apply tools from both categories.

A decision tree might be the simplest form. You map out the different decision points and can generate costs or profits and probabilities for the decision points. You take the one that has the highest expected value (sum of probability of outcome times value of outcome) of profit or lowest expected value of cost.
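
A minimal sketch of that expected-value comparison for a one-level tree. The two options, their probabilities and their payoffs are invented for illustration.

```python
def expected_value(outcomes):
    """outcomes: list of (probability, value) pairs for one decision."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities for a decision must sum to 1")
    return sum(p * v for p, v in outcomes)

def best_decision(decisions):
    """Return the decision name with the highest expected profit."""
    return max(decisions, key=lambda name: expected_value(decisions[name]))

# Hypothetical choice between two projects (profit in dollars).
decisions = {
    "build_in_house": [(0.5, 200_000), (0.5, -100_000)],
    "buy_vendor_tool": [(0.9, 60_000), (0.1, -20_000)],
}

for name, outcomes in decisions.items():
    print(name, round(expected_value(outcomes)))
print("choose:", best_decision(decisions))
```

For costs instead of profits, swap `max` for `min`.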

Sensitivity analysis may be another tool to use in decision making. You don't need to get three decimal accuracy. Come up with some upper and lower bounds and you'll have an idea of how good or how bad the outcome will be.
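
One simple way to do this, sketched below under made-up numbers: give each input a lower and upper bound, evaluate the outcome at every combination of bounds, and look at the worst and best case. The profit formula and the bounds are assumptions for illustration; checking only the corners is enough here because this formula is linear in each input separately, which won't hold for every model.

```python
from itertools import product

def profit(units, price, unit_cost, fixed_cost):
    """Toy profit model: margin per unit times volume, minus fixed costs."""
    return units * (price - unit_cost) - fixed_cost

def outcome_range(fn, bounds):
    """Evaluate fn at every corner of the box defined by (low, high) bounds."""
    names = list(bounds)
    results = [
        fn(**dict(zip(names, corner)))
        for corner in product(*(bounds[n] for n in names))
    ]
    return min(results), max(results)

# Rough (low, high) guesses; no three-decimal accuracy required.
bounds = {
    "units": (800, 1500),
    "price": (40.0, 60.0),
    "unit_cost": (25.0, 35.0),
    "fixed_cost": (10_000.0, 20_000.0),
}
worst, best = outcome_range(profit, bounds)
print(f"worst case: {worst:,.0f}  best case: {best:,.0f}")
```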

Determining the probabilities and bounds may be tough, and your decision will only be as good as the data you're putting into the framework.


Decisions are moral/political/biological, not statistical


Typically. But if somebody manages to make statistically sound evidence-based decisions they will steamroller people who are making decisions using other factors.


I start with Wikipedia's list of maximum entropy distributions, pick the one that best represents the category for the kind of thing I'm reasoning about, and then update it with what data I happen to have.

It's usually simpler than it sounds, and helped me put together a lot of quick decision models, sales forecasts, etc.

https://en.m.wikipedia.org/wiki/Maximum_entropy_probability_...
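
A minimal sketch of what such a workflow could look like, not necessarily the exact procedure described above. It assumes the quantity is a positive waiting time where only the mean is known, for which the maximum entropy distribution is the exponential, and it updates a Gamma prior on the exponential rate with observed waits. The prior values and the data are invented.

```python
def update_rate(prior_shape, prior_rate, observations):
    """Gamma(shape, rate) prior on the exponential rate -> Gamma posterior.
    Conjugate update: add the count to the shape and the total wait to the rate."""
    return prior_shape + len(observations), prior_rate + sum(observations)

def posterior_mean_wait(shape, rate):
    """Expected wait E[1/lambda] when lambda ~ Gamma(shape, rate); needs shape > 1."""
    if shape <= 1:
        raise ValueError("mean wait is undefined for shape <= 1")
    return rate / (shape - 1)

def prob_wait_longer(shape, rate, t):
    """Predictive P(next wait > t), with the rate uncertainty integrated out."""
    return (rate / (rate + t)) ** shape

# Assumed prior: roughly "about 10 days between sales", held weakly.
shape, rate = 2, 10.0
print("prior mean wait:", posterior_mean_wait(shape, rate))

observed_waits = [4, 6, 5, 7, 3]  # invented data, in days
shape, rate = update_rate(shape, rate, observed_waits)
print("posterior mean wait:", round(posterior_mean_wait(shape, rate), 2))
print("P(next wait > 14 days):", round(prob_wait_longer(shape, rate, 14), 3))
```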


Is that statistics? I think of statistics as "given historical data, infer future data". But it seems like what you want is to know which decision is best, which involves many more things (like estimating the impact/utility of each outcome), that seems more like economics?


Arguably it's just quibbling over a trivial terminological difference, but I get the feeling that you're thinking more about "Decision Theory"[1] (or "Decision Science") as opposed to "just Statistics". Decision Theory, of course, uses Statistics, and I guess one could argue the question of whether one is just a subfield of the other, or argue exactly where the dividing line is.

[1]: https://en.wikipedia.org/wiki/Decision_theory


When I took statistics in college, we started with a rather basic definition, something to the effect of: "A statistic is a function performed on a set." Statistics studies what you can infer when you know something about a set, but not everything about it, namely its precise contents. Often, what you are told about a set is something about its probability distribution, thus linking probability and statistics together.

A useful parallel can be drawn with situations involving measurements and data, since data often have the same feature of telling us something but not everything. This is what I believe makes statistics so useful for science.


I've grown to the idea that statistics is a form of data compression. It isn't so much, "infer future data" as it is "if the data we have is representative of all data, what is a number/equation that represents this data?". Usually with a certain framing.


Economics is also statistics.


> Economics is also statistics.

Economics is not Statistics (and definitely not statistics).

Most of the discipline focuses on testing models and making inferences on observational data. The techniques for dealing with that sort of data, of course, build on Statistics, but their nature is different enough that there is Econometrics.

A large part of economics is not empirical at all -- despite the fact that people get Nobel prizes pretending this not to be the case.

Even in the context of experimental economics, the behavior of the observed varies depending on the mode of observation, so the most straightforward Statistical methods, designed for engineering/chemistry/biology experiment type situations, are not directly applicable (although it is great when they agree with the fancier methods).


>A large part of economics is not empirical at all -- despite the fact that people get Nobel prizes pretending this not to be the case.

I'm not sure which parts of the field or which prize winners you are talking about. To be clear: you think economics is _not actually empirical_, but people are awarded Nobel Prizes for _pretending that it is_? That's a little odd. Let me know if that's not what you meant.

When you look at this list:

https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_l...

Who satisfies that condition, in your mind? Who is getting the prize on the basis of pretending that economics is empirical?


> To be clear: you think economics is _not actually empirical_

That is a misrepresentation of what I said.

To be clear, I think what I said:

>>A large part of economics is not empirical at all

As an example, Kahneman's Nobel is solely a product of taking an axiomatic theory and designing experiments where regular people who are actually not being paid according to their performance are gently prodded into violating the axioms in weird settings. It is attractive to people who want to claim that clearly the plebes cannot be allowed to choose for themselves as they are not "rational".

The only meaning of "rational" in Economics is that individuals choose the best alternative according to their preferences among a constrained set of alternatives. Here an "alternative" or "bundle" is a point in the entire commodity space.

The only test of this is consistency with GARP: A choice is not rational if a feasible and more preferred alternative exists.


There are actually several economists on this list, like Victor Chernozhukov, Guido Imbens, and Susan Athey...


I think of it like this:

Suppose I want to make a decision about whether to hedge for a market crash right now. Statistics can tell me the likelihood of a crash, and how bad. But if the market crashes, and very badly, how might that affect my life? To make a good decision I would need to think of all the things that come with a market crash (job loss, savings loss). This is not statistics.

I could again use statistics to say what is the chance I lose my job given a market crash (say 70%). But then I would need to estimate the impact on my life should I lose my job (Stress, etc). This is not statistics. But it should very well factor into my ability to do back of the napkin math on whether I should hedge or not.


If your decision substantially involves or derives from making an estimate about a population based on a sample, it is statistics. "Making decisions under uncertainty" is well-studied in statistical literature, just like "quantifying uncertainty" is well-studied. It sounds like you think the latter is "actual statistics", but these things are both statistics.

In particular:

> But if the market crashes, and very badly, how might that affect my life? To make a good decision I would need to think of all the things that come with a market crash (job loss, savings loss). This is not statistics.

This is all statistics, not just the part where you're forecasting likelihood of the market crashing. The reason is because making decisions about the future under the constraints of uncertainty implicitly involves a forecast. When you decide how to diversify your personal investment portfolio, how much to allocate to your Roth versus traditional IRA or 401k, etc, you are making forecasts about which allocation will provide you with a more favorable outcome.

Stated more concisely: there is no rational reason to use statistics for forecasting market events but not for deciding what to do in the event specific market events occur.


This is exactly statistics. This is an expectation of a utility function with respect to some distribution.
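
A toy sketch of that framing for the hedging question. The crash probability, the wealth outcomes and the choice of a log utility function are all assumptions made up for illustration.

```python
import math

def expected_utility(outcomes, utility):
    """outcomes: list of (probability, wealth) pairs; utility: wealth -> float."""
    return sum(p * utility(w) for p, w in outcomes)

p_crash = 0.10  # assumed probability of a crash over the horizon

# Assumed end-of-period wealth under each choice.
no_hedge = [(p_crash, 40_000), (1 - p_crash, 110_000)]
hedged = [(p_crash, 85_000), (1 - p_crash, 104_000)]

def identity(wealth):
    """Risk-neutral utility: only expected wealth matters."""
    return wealth

for name, outcomes in [("no hedge", no_hedge), ("hedged", hedged)]:
    print(
        name,
        "expected wealth:", round(expected_utility(outcomes, identity)),
        "expected log-utility:", round(expected_utility(outcomes, math.log), 4),
    )
```

With these numbers the unhedged position has the higher expected wealth, but the hedged one has the higher expected log-utility, so the "how bad would it be for my life" part enters through the utility function rather than sitting outside the calculation.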


> Statistics can tell me the likelihood of a crash

Statistics cannot tell you any such thing.


Do you mean to say that nothing can tell you such a thing?

What is a likelihood, but a statistic?

If there is any method to determine a statistic, it seems reasonable to me to say that that method involved statistics.

(Now, of course, except for possibly where quantum randomness is relevant, which might be quite often, I'm fairly confident that the only probabilities are subjective or relative to some set of assumptions, or something along those lines, because the future "already exists". But, given some fixed priors and some fixed evidence, there should in principle be a well defined probability of such a crash. So, insofar as people's priors match up, there should, in principle, be a common well defined probability given "the information which is publicly available", or also, given whatever other set of evidence.)

Of course, that doesn't mean it is computationally tractable to compute such a probability.


> But, given some fixed priors and some fixed evidence, there should in principle be a well defined probability of such a crash.

:-)

How do you test this model?

It is easy to find things that fit one of the previous crashes.

Given that there is only one realization of history, the data we have is consistent with any model that puts a non-zero probability on a crash.


Well, what I gave isn't exactly a model of the market, so much as "a description of having a model of the world".

So, I'm not sure what you mean by "test this model".

You can refine your model-of/beliefs-about the world, by continuing to look at the world and make observations.

And obviously your beliefs should include a non-zero probability of a crash. That follows from non-dogmatism/Cromwell's rule.

And yeah, there is only one, (or, either that, or at least we can only observe one, which is practically the same thing) "realization of history". This doesn't produce any difficulty, because probability isn't defined by the proportion of trials in which the event occurred.

Probability is about degree of belief (or, belief and/or caring).

edit: I suppose you can also evaluate how calibrated your beliefs have been, which is kind of like testing a model.
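
A minimal sketch of one way to check calibration: the Brier score plus a simple table comparing stated probabilities with observed frequencies. The forecasts and outcomes are invented, and the equal-width binning is just one possible choice.

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Mean squared gap between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def calibration_table(forecasts, outcomes, n_bins=5):
    """Group forecasts into equal-width bins; compare stated vs observed rates."""
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        idx = min(int(f * n_bins), n_bins - 1)  # a forecast of 1.0 goes in the top bin
        bins[idx].append((f, o))
    table = {}
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_forecast = sum(f for f, _ in pairs) / len(pairs)
        observed_rate = sum(o for _, o in pairs) / len(pairs)
        table[idx] = (mean_forecast, observed_rate, len(pairs))
    return table

# Invented track record: stated probabilities and what actually happened.
forecasts = [0.1, 0.2, 0.1, 0.8, 0.9, 0.7, 0.3, 0.9]
outcomes = [0, 0, 0, 1, 1, 0, 1, 1]

print("Brier score:", round(brier_score(forecasts, outcomes), 4))
for idx, (stated, observed, n) in calibration_table(forecasts, outcomes).items():
    print(f"bin {idx}: stated {stated:.2f}, observed {observed:.2f}, n={n}")
```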


> Probability is about degree of belief (or, belief and/or caring).

Not at all.

Probability is a countably additive, normalized measure over a sigma algebra of sets.
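
Spelled out in standard notation (the symbols $\Omega$ for the sample space and $\mathcal{F}$ for the sigma algebra are labels chosen here for illustration), that definition reads:

```latex
\[
P : \mathcal{F} \to [0,1], \qquad
P(\Omega) = 1, \qquad
P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i)
\quad \text{for pairwise disjoint } A_1, A_2, \ldots \in \mathcal{F}.
\]
```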

> This doesn't produce any difficulty, because probability isn't defined by the proportion of trials in which the event occurred.

You misunderstand the point.

Let's say you provide me a distribution of crash probabilities for every trading day for the next three months.

We all ought to know that P(event) = 0 does not mean the event is impossible. Therefore, P(event) = 1 does not mean "not event" is impossible.

What would allow one to state that your model is consistent/not consistent with the one observed history of events over the three months, regardless of whether there is a crash or not?

You have to come up with this criterion before observing the history.


Ok yes, that’s the definition of a probability measure. But I was talking about the concept of probability, in the world, contrasting with the “objectively defined via frequency in related trials”, which is something people sometimes claim. I misunderstood and thought that was the claim you were making.

Ok.

I would think that, if we have a continuous distribution, then the score should be the probability density of what is observed?

If you say beforehand “I think x will happen”, and I respond “I assign probability 1 that x will not happen”, and then x happens, then I’ve really messed up big time. I’ve messed up to a degree that should never happen.

(And, only countably many events can be described using finite descriptions, and a positive probability could, in principle, be assigned to each, while having the total probability still be 1, so that nothing that can possibly be specified happens while being assigned a probability of 0. Though this isn’t really computable..)

As a more practical thing, if I assign probability 0 to an event which you could describe in a few sentences in under 5 lines (regardless of whether you actually have described it), and it happens, then I’ve really messed up quite terribly, and this should never happen (outside of just, because I made an arithmetic error or something.)


> As a more practical thing, if I assign probability 0 to an event which you could describe in a few sentences in under 5 lines (regardless of whether you actually have described it), and it happens, then I’ve really messed up quite terribly, and this should never happen (outside of just, because I made an arithmetic error or something.)

I think this conversation has reached an impasse.

https://en.wikipedia.org/wiki/Cantor_set


I'm familiar with the Cantor set, and I know it has 0 measure. Just because you can succinctly describe the Cantor set, which has 0 measure, doesn't mean I've messed up. If I assign a uniform distribution over [0,1] to some number outcome in the world, and an element in the Cantor set is the result, then I've messed up. But, when we measure numbers in the world, we don't measure specific real numbers, as all our measurements have some amount of error. So, that can't happen. We can measure that the result is in some interval, and that this interval contains some element of the Cantor set, but the probability of what we observed, is not something that I assigned 0 probability to. Like, heck, every interval will have a rational number in it, and the rational numbers also have measure 0.

"the measured value is in the Cantor set" isn't a thing that we can observe to have happened.

("the value, when rounded to the finite amount of precision that our measurement has, is in the Cantor set" is something that would have positive probability, under the uniform distribution over the interval, and therefore something I shouldn't assign a probability of 0.)


Frame your hypothesis in the simplest way possible and go from there.


It has a lot to do with statistics, but is not usually classified as statistics per se: I'd nominate the entire field that includes Random Projections (e.g. the Johnson-Lindenstrauss Lemma), Compressive Sensing and friends.

Statistically speaking, it says that regardless of the dimension that data is presented in, a random linear projection to (slightly more than) the intrinsic dimension of the data captures the underlying topology.
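
A small standard-library sketch of a Gaussian random projection, to show the flavor of the result. It only checks the pairwise-distance version of the claim on made-up random data; the dimensions (500 down to 200) and the point count are arbitrary choices for illustration, not values dictated by the lemma.

```python
import math
import random

def random_projection(points, k, seed=0):
    """Project each point to k dimensions with a random Gaussian matrix,
    scaled by 1/sqrt(k) so squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_ratios(points, projected):
    """Projected distance divided by original distance, for every pair."""
    ratios = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            ratios.append(
                distance(projected[i], projected[j]) / distance(points[i], points[j])
            )
    return ratios

# Invented data: 8 random points in 500 dimensions.
data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(8)]
projected = random_projection(points, k=200)
ratios = distance_ratios(points, projected)
print(f"distance ratios range from {min(ratios):.3f} to {max(ratios):.3f}")
```

Ratios close to 1 mean the pairwise geometry survived the projection, even though the projection matrix never looked at the data.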


This is a huge result in signal processing but with broad applications


That's a great practically oriented crash course on modern statistics!


The re-discovery of causation analysis, by Pearl, after it was suppressed for many decades by the statistics mandarinate, clearly qualifies.

Max Planck is quoted, "Science progresses one funeral at a time." In this case, the grand old man of statistics finally died still insisting that it could not be proved that smoking caused cancer, but not before blighting careers of those who were showing it could.


> after it was suppressed for many decades by the statistics mandarinate, clearly qualifies.

I have heard Pearl make this claim but have never seen his evidence for it.

As a counterexample: Don Rubin is a statistician who has a well-known framework for causal inference who has been at the top of the field for a very long time. Rubin has published widely and very well on causal inference.

Is there good evidence for the topic actually being suppressed by anyone within the statistics profession? There is work on causal inference going back to RA Fisher. If anyone has tried to suppress it, I'm not sure they have been very effective.


>Is there good evidence for the topic actually being suppressed by anyone within the statistics profession?

Unfortunately there's not enough evidence to show causation...


Causality, i.e. Causal inference & Graphical Models, see the work by Judea Pearl, he pretty much singlehandedly pioneered the field.

https://www.amazon.com/Causality-Reasoning-Inference-Judea-P...

https://en.wikipedia.org/wiki/Causality#Theories

https://en.wikipedia.org/wiki/Graphical_model


Single-handedly pioneered reviving the field, but yes.

Pearl is very careful to give his deceased predecessors their due credit. That their work was suppressed will always be a blot on the leading names in statistics in the past century.


Statistical Consequences of Fat Tails by Nassim Taleb.


Taleb is... not a good source for learning statistics. Start with Wasserman. Taleb says obvious and well known things using his own invented terminology in order to cast himself as some sort of contrarian genius. It's not that he's wrong, it's that the insights he hawks are banal. That's why his readership base are insight porn book junkies not people actually trying to learn statistical methods.


"insight porn books" is going in my "objects you've been searching for titles for" Notion list.


Yeah, I think I first heard it in relation to Malcolm Gladwell and it's just so apt at capturing everything wrong with that category of book. I mean he's a skillful writer, and it's definitely entertaining stuff. But if you flip into critical mode and do comparative research vs authoritative sources, you start seeing how vapid it is really fast.


When I read Fooled by Randomness I found it useful. Not groundbreaking work, but it drew some nice analogies between statistical distributions and human over-certainty.


Would love to see that list or any other on which this choice descriptor finds itself.


> Start with Wasserman

If you're referring to "All of Statistics" by Wasserman, then there are some significantly easier textbooks to learn statistics from. Depending on the program, "All of Statistics" is a book used by senior undergrads or grad students. Are there more mathematically heavy stats books? Yes, but this isn't a casual read for someone who is trying to learn statistics either.

I like "Probability and Statistics for Engineering and the Sciences" by Devore as an intro book. It covers the basics of probability distributions, maximum likelihood and method of moments estimation, ANOVA, and linear regression. Pre-requisite knowledge is probably multivariable calculus, matrix multiplication, determinants, and eigenvalues.


Devore's book is great. It's sad it gets many negative reviews. In my experience, there are two types of people:

1. Those who want a statistics book to be like a math book: Fewer words and more equations.

2. Those who want a wordy book with little math

Devore's book is in between, which is why I think both camps tend to hate it. It has a decent amount of math, and has quite a bit of text. The text is invaluable: You get information about common rules of thumb. You get insights on why the technique works. Etc.

And the examples/problems are great. So many of them are from real papers/books. You're not working on some contrived example, but on real world problems.


If you do have his books then the reference lists in the back provide a good starting point for further reading.


I have read this book and want to leave an anti-recommendation here. It's a poorly edited mess and makes at least one blatant mathematical error.

More broadly, let me leave a Taleb anti-recommendation. His entire shtick is yelling that traditional statisticians have ignored heavy-tailed random variables in their modeling and that he has special insight into the nature of tail risk (perhaps along with a few select other people, like Mandelbrot).

But this is manifestly not the case. In fact, if you go through his Amazon reviews page, you can find him leaving positive reviews several years ago on all the books written by traditional statisticians that he learned about heavy-tailed randomness from!


link to his Amazon reviews page?


Scroll back to the early 2010s: https://www.amazon.com/gp/profile/amzn1.account.AHMHNR4MRTDL...

For a more detailed critique, see Robert Lund, Revenge of the White Swan, The American Statistician Vol. 61, No. 3 (Aug., 2007). Accessible through your favorite Russian website.

If you want a better book on heavy-tailed randomness, I like Didier Sornette's Critical Phenomena in Natural Sciences (subtitled Chaos, Fractals, Selforganization and Disorder: Concepts and Tools).


Revenge of the White Swan also appears available on ResearchGate:

https://www.researchgate.net/publication/4741329_Revenge_of_...


This feels only slightly more legitimate than recommending the 538 blog as a statistical authority.


[flagged]


calling someone's book suggestion an "advertisement" is rude, and inaccurate. taleb wouldn't pay anyone to suggest his book when he could instead just show up here and insult everyone for free.


I didn’t mean it strictly literally. The original comment was a thoughtless namedrop of a brand new book (which means by tautology it’s irrelevant to the topic at hand) and doesn’t have a shred of reasoning behind it. So, functionally, it’s a billboard advertising Taleb’s book.

I am aware that the Cult of Taleb means that people are willing to advertise his work for free.


Not mentioned, not cited in the paper. That's shocking.

Edit: the word "tail" appears nowhere in the paper, in any context. I'm beyond shocked now.


Because this was well known to statisticians long before Taleb talked about it?

That would be my suspicion as to why it isn't there.


Quite plausible. Extreme Value theory [1] appears to have been codified by the 1960s, and one of the main theorems is credited “to Fréchet (1927), Ronald Fisher and Leonard Henry Caleb Tippett (1928), Mises (1936) and Gnedenko (1943)” [2]. ETA: And the second theorem of Extreme Value Analysis is from the mid 1970s. [3]

1. https://en.wikipedia.org/wiki/Extreme_value_theory

2. https://en.wikipedia.org/wiki/Fisher–Tippett–Gnedenko_theore...

3. https://en.wikipedia.org/wiki/Pickands–Balkema–De_Haan_theor...


My stats training was in the 90s and we absolutely covered leptokurtic things.


The book by Leadbetter, Lindgren and Rootzen is good too if a bit dated.


This is subsumed in the robust estimation section.



