Statistical Modeling: The Two Cultures (2001) [pdf] (projecteuclid.org)
200 points by michael_fine on May 6, 2019 | 31 comments

This is a great little paper -- with comments and rejoinder! I enjoyed it so much that I presented it to a reading group back when it appeared. Always worth re-reading because Breiman is such a hero of useful probabilistic modeling and insight.

One should remember that it is a reflection of its time, and the dichotomy it proposed has been softened over the years.

Another paper, more recent, and re-examining some of these same trends in broader context, is by David Donoho:


Highly recommended. Pretty good HN comments at:


I was not too impressed by the paper.

In particular, I think the dichotomy stems not from statisticians neglecting what works, or having a narrow mindset, or whatever.

It seems to me that it stems from different goals:

* business seeks to predict and classify

* science seeks to test hypotheses

And statisticians used to focus on the latter, for which you need classical statistics ("data modeling", or "generative modeling", as Donoho calls it), don't you?

And for prediction and classification, sure, there are the classical techniques (regression, time series (ARCH, GARCH, ...), Fisher's linear discriminant), there are Bayesian methods, newer statistical stuff such as SVM, and ML techniques such as random forests.

However, it's just driven by different objectives. As the commenters state, Efron: 'Prediction is certainly an interesting subject but Leo [Breiman]’s paper overstates both its role and our profession’s lack of interest in it.', or Cox: 'Professor Breiman takes data as his starting point. I would prefer to start with an issue, a question or a scientific hypothesis [...]', or Parzen: 'The two goals in analyzing data which Leo calls prediction and information I prefer to describe as “management” and “science.” Management seeks profit, practical answers (predictions) useful for decision making in the short run. Science seeks truth, fundamental knowledge about nature which provides understanding and control in the long run.'

So, different objectives call for different methods. And, certainly 20 years ago, statisticians were mostly focusing on one rather than the other. Ok. So?

“Ok. So?” Well, Computer Science has been focusing mainly on the predictive side (ML/AI), and you had a lot of intellectual whining that they’re “re-inventing” statistics just with different terminology. I’m not sure if this is just an attempt to downplay their results or if it’s more academic jealousy because the funding goes to the “cool stuff” like AI/ML in the CS dept. and the Stats dept. is seen as old and boring. That’s what it feels like. You’ll even see this type of commentary in the preface of books like All of Statistics.

No matter what comes from empirical research in Computer Science re: prediction/classification methods you’ll hear the Stats camp crying that it’s “just Statistics” at the end of the day. Fair enough, but computational Statistics was then neglected for long enough that computer scientists had to create more powerful techniques independently and can claim priority on that front. Theory lags practice in this area.

>that they’re “re-inventing” statistics just with different terminology.

To be very clear about it, this is referencing a very common problem with subfields of applied statistics in general; it's not limited to AI/ML! Econometrics, epidemiology, business stats (decision support) and so on and so forth, all tend to come up with their bespoke terminologies and reinventions of statistically basic principles. It would seem entirely appropriate to point that out.

> I’m not sure if this is just an attempt to downplay their results or if it’s more academic jealousy because the funding goes to the “cool stuff” like AI/ML in the CS dept. and the Stats dept. is seen as old and boring.

No one is trying to downplay the legitimately impressive results of AI/ML. Deep learning, convolutional neural networks and GANs have had incredible success in fields like computer vision and image/speech recognition. But outside of those areas the "results" of the current fads in AI/ML have been grossly overstated. You have academic computer scientists like Judea Pearl who decry the "backward thinking" of statistics and champion a "causal revolution", despite not actually doing anything revolutionary. You have modern machine learning touted ad nauseam as a panacea for any predictive problem, only for systematic reviews to show that it doesn't actually outperform traditional statistical methods [1]. And you have industry giants like IBM and countless consulting companies promising AI solutions to every business problem that turn out to be more style than substance, and "machine learning" algorithms that are just regression.

There's a reason why AI research has gone through multiple winters, and why another is looming. Those in AI/ML seem to be more prone and/or willing to overpromise and underdeliver.

[1] https://www.sciencedirect.com/science/article/pii/S089543561...

I would also add in that list RNNs for time-series forecasting, especially LSTMs. I would say this is a bit more than “just regression”.

When I first read this paper I thought it was thought-provoking and captured the tension being referenced pretty well.

Over time, I've come to see it as pretty dated and misleading.

The problem is that the methods of both "cultures" are pretty black box, and it's a matter of which black you want to dress your box in. Actually, it's all black boxes anyway, all the way down, epistemological matryoshki.

The real tension is between relatively more parametric approaches, and relatively nonparametric approaches, and how much you want to assume of your data. That in turn, reduces to a bias-variance tradeoff. Some approaches are more parametric and produce less variance but more bias; others are less parametric and produce more variance but less bias. In some problem areas the parameters of the problems might push things in one or another direction; e.g., in some fields you know a lot a priori, so just slapping a huge predictive net on x and y makes no sense, but in other fields you know nothing, so it makes a lot of sense.
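For reference, the tradeoff mentioned above is usually written as the following decomposition. This is a sketch under standard textbook assumptions that are not stated in the thread: squared-error loss, data generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, and a fitted model $\hat f$ that varies with the training sample.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

More parametric methods tend to shrink the variance term at the cost of the bias term, and less parametric methods do the reverse; the noise term is unaffected by the choice of method.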

Another tension being conflated a bit is between prediction and measurement (supervised and unsupervised classification, forward and inverse inference, etc.). Much of what is being hyped now is essentially prediction, but a huge class of problems exist that don't really fall in this category nicely.

I disagree that computational statistics was being neglected in statistics. What I have seen is that a new method (NN classes of approaches) got new life breathed into it, and became extraordinarily successful in a very specific but important class of scenarios. Subsequently, the "AI/ML" label got expanded to include just about any relatively nonparametric, computational statistical method. Maybe computational multivariate predictive discrimination was neglected?

A lot of what AI/ML is starting to bump up against are problems that statistics and other quantitative fields have wrestled with for decades. How generalizable are the conclusions based on these giant datasets to other data? What do you do when you have a massive model fit to an idiosyncratic set of inputs? How do you determine that your model is fitting to meaningful features? What is the meaning of those features? Why this model and not another one? There are really strong answers to many of these types of questions, and they're often in traditional areas of statistics.

Anyway, I see this paper as making a sort of artificial dichotomy with regard to issues that have existed for a long long time, and see that artificial dichotomy as masking more fundamental issues that face anyone fitting any quantitative models to data. It's a misleading and maybe even harmful paper in my opinion.

> and you had a lot of intellectual whining

Downvoted for that.

I'm also not impressed. He asserts blithely that "the goal is not interpretability, but accurate information", that is, only predictive ability matters. Maybe this is true in some domains, but in my experience scientists usually give a shit about what the hidden function is and are trying to learn how it behaves. No one in science wants to be left with a black box that you can't move past, they want a deeper understanding of the underlying phenomenon.

I don’t think that’s what Breiman is saying. He means useful and reliable information. The point is not just prediction but also actionable information about variables. Look at the 3 examples.

This is a great paper. Very long, but worth every bit of it. BTW, here is a recent blog post about the paper: http://duboue.net/blog27.html

One of the key insights I took away was the importance of using out-of-sample predictive accuracy as a metric for regression tasks in statistics—just like in ML. The standard best practices in STATS 101 is to compute R^2 coefficient (based on data of the sample), which is akin to reporting error estimates on your training data (in-sample predictive accuracy).
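A minimal sketch of the in-sample versus out-of-sample distinction described above, using only the Python standard library. Everything here is an assumption made for illustration: the synthetic data (one real predictor plus 15 pure-noise columns), the sample sizes, and the helper names `solve`, `fit_ols`, `r_squared` and `make_data`. It is not taken from Breiman's paper.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def r_squared(X, y, beta):
    """1 - SSE/SST, with SST taken around the mean of the y being scored."""
    pred = [sum(b * v for b, v in zip(beta, row)) for row in X]
    mean_y = sum(y) / len(y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

def make_data(n, n_noise, rng):
    """Hypothetical data: y depends on one predictor; the other columns are noise."""
    X, y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise_cols = [rng.gauss(0, 1) for _ in range(n_noise)]
        X.append([1.0, x] + noise_cols)
        y.append(2.0 * x + rng.gauss(0, 1))
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(30, 15, rng)
X_test, y_test = make_data(1000, 15, rng)

results = {}
for k in (2, 17):  # 2 = intercept + real predictor; 17 also uses the 15 noise columns
    Xtr = [row[:k] for row in X_train]
    Xte = [row[:k] for row in X_test]
    beta = fit_ols(Xtr, y_train)
    # (in-sample R^2, out-of-sample R^2) for this model size
    results[k] = (r_squared(Xtr, y_train, beta), r_squared(Xte, y_test, beta))
    print(k, results[k])
```

The in-sample R^2 cannot go down when columns are added to a least-squares fit, even useless ones, which is why it says little about predictive accuracy; the second number in each pair is the one that speaks to new data.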

IMHO, statistics is one of the most fascinating and useful fields of study with countless applications. If only we could easily tell apart what is "legacy code" vs. what is fundamental... See this recent article https://www.gwern.net/Everything that points out the limitations of Null Hypothesis Significance Testing (NHST), another one of the pillars of STATS 101.

> The standard best practices in STATS 101 is to compute R^2 coefficient (based on data of the sample), which is akin to reporting error estimates on your training data (in-sample predictive accuracy).

That's not the best practice at all. It isn't even standard, because adjusted R^2 exists to penalize padding the model with coefficients, i.e. cheating on your degrees of freedom. On top of that, we have other penalized criteria besides R^2, such as AIC and AAIC.
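For readers who have not seen them, the two penalized measures named above are usually defined as follows. The notation is an assumption made here, not the commenter's: $n$ observations, $p$ predictors excluding the intercept, $k$ estimated parameters, and $\hat L$ the maximized likelihood.

```latex
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},
\qquad
\mathrm{AIC} = 2k - 2\ln \hat L
```

Both charge a price for each extra parameter: adjusted R^2 only rises if a new predictor improves the fit by more than chance alone would, and a lower AIC is better.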

All of them are only meant for comparing models of a similar type; they're not supposed to be used as a test of generalization to unseen data. We're taught CV too, and it is a statistical invention. We're also taught about training sets and test sets along with CV.

If you want to check performance on test data you have to do CV anyway, for example with imbalanced data. So comparing R^2 against out-of-sample accuracy is comparing apples to bananas.
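As a rough illustration of the CV procedure mentioned above, here is a minimal k-fold sketch in plain Python. The data, the choice of simple linear regression, and the helper names `k_fold_indices`, `fit_line` and `cv_mse` are all assumptions for the example, not something from the comment or from Kutner et al.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def cv_mse(xs, ys, k, rng):
    """Average held-out mean squared error over k folds."""
    folds = k_fold_indices(len(xs), k, rng)
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on everything except the held-out fold, then score on that fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errs = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / k

# Hypothetical data: a noisy straight line.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

folds = k_fold_indices(len(xs), 5, random.Random(2))
score = cv_mse(xs, ys, 5, random.Random(2))
print(round(score, 3))
```

Each observation is scored exactly once by a model that never saw it, which is what separates this estimate from in-sample measures such as R^2.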

You can see all of this in the book that applied statistics courses use for linear regression, Applied Linear Statistical Models by Kutner et al.

> The standard best practices in STATS 101 is to compute R^2 coefficient (based on data of the sample)

> Null Hypothesis Significance Testing (NHST), another one of the pillars of STATS 101

Best practices of statistics and what is taught in Stat 101 are not remotely the same thing. The problems with R^2 and NHST have been well documented, and they've been argued against for decades by the statistics community. But actual statisticians make up a small proportion of statistics practitioners, as statistics is the backbone of nearly all modern science. What gets taught in Stats 101 is not the "foundations of statistical best practices" so much as "a system of guidelines and rules of thumb that have been simplified greatly for the sake of the lowest common denominator". To make things worse, non-statisticians seem to over-estimate their statistical knowledge and prowess after having taken a handful of introductory stats courses, more than in any applied field I know of.

Only now after the magnitude and pervasiveness of the replication crisis has begun to be recognized by the broad scientific community are people starting to realize what many statisticians have been pointing out for years.

> The standard best practices in STATS 101 is to compute R^2 coefficient

Any regression class which doesn't teach about overfitting is incomplete. R^2 is still a very useful measure, however, if you've come up with a parsimonious model and want to describe its descriptive performance.

The title is a reference to this famous essay by C.P. Snow about a split between the humanities and science: https://en.wikipedia.org/wiki/The_Two_Cultures

See also "50 years of Data Science" by David Donoho (2015), which discusses the question of whether there's any difference between "statistics" and "data science".


Two? Two?!

There is a classic post here https://news.ycombinator.com/item?id=10954508:

" The Geneticists: Use evolutionary principles to have a model organize itself

The Bayesians: Pick good priors and use Bayesian statistics

The Symbolists: Use top-down approaches to modeling cognition, using symbols and hand-crafted features

The Conspirators: Hinton, LeCun, Bengio et al. End-to-end deep learning without manual feature engineering

The Swiss School: Schmidhuber et al. LSTMs as a path to general AI.

The Russians: Use Support Vector Machines and its strong theoretical foundation

The Competitors: Only care about performance and generalization robustness. Not shy to build extremely slow and complex models.

The Speed Freaks: Care about fast convergence, simplicity, online learning, ease of use, scalability.

The Tree Huggers: Use mostly tree-based models, like Random Forests and Gradient Boosted Decision Trees

The Compressors: View cognition as compression. Compressed sensing, approximate matrix factorization

The Kitchen-sinkers: View learning as brute-force computation. Throw lots of feature transforms and random models and kernels at a problem

The Reinforcement learners: Look for feedback loops to add to the problem definition. The environment of the model is important.

The Complexities: Use methods and approaches from physics, dynamical systems and complexity/information theory.

The Theorists: Will not use a method, if there is no clear theory to explain it

The Pragmatists: Will use an effective method, to show that there needs to be a theory to explain it

The Cognitive Scientists: Build machine learning models to better understand (human) cognition

The Doom-sayers: ML Practitioners who worry about the singularity and care about beating human performance

The Socialists: View machine learning as a possible danger to society. Study algorithmic bias.

The Engineers: Worry about implementation, pipe-line jungles, drift, data quality.

The Combiners: Try to use the strengths of different approaches, while eliminating their weaknesses.

The PAC Learners: Search for the best hypothesis that is both accurate and computationally tractable. "

This is too funny. Engineers focus on “pipeline jungles” lol!

God I hate this paper. Perhaps it was relevant at its time. But that was 18 years ago. The described dichotomy between the "two cultures" isn't nearly as pronounced, if it even exists, today. There are few statisticians today who adhere entirely to the "data modeling culture" as described by Breiman.

I'm surprised how often this paper continues to get trotted out. In my experience it seems to be a favorite of non-statisticians who use it as evidence that statistics is a dying dinosaur of a field to be superseded by X (usually machine learning). Perhaps they think if it's repeated enough it will be spoken into existence?

Here is a previous discussion https://news.ycombinator.com/item?id=10635631

I am not an expert and am still reading thru the article, but why is it such a strong dichotomy? Don't all predictive algorithms also assume a data model? For example, don't hidden Markov models, by assuming constant transition probabilities, make a data assumption?

To my ears (eyes?), this discussion resembles the transition from linear, euclidean geometry into the fractal realm.

> I am not an expert and am still reading thru the article, but why is it such a strong dichotomy? Don't all predictive algorithms also assume a data model? For example, don't hidden Markov models, by assuming constant transition probabilities, make a data assumption?

I'm going to try my best to answer your question from my experience and background. I am from the statistics school of thought, so please keep that in mind for any bias.

I can give you an example of the different mentality in applied math vs statistics with regard to modeling. Then I'll try to expand to machine learning.

Take univariate time series data as an example. Applied math people will use probability to try to model the process that creates the time series. A statistician will not care about modeling the process that creates the data; he or she only cares about using all the information in the data to build a model. A very clear example in time series modeling is residual analysis, where you check whether the statistical model has used all the information in the data.

I know it sounds superficial, but it also drives how each field invents and researches different models.

Now for statistics vs machine learning thinking. If you lurk in /r/statistic you will see that many statisticians separate ML models from statistical models (data models, in Dr. Breiman's terms) by the confidence interval. A statistical model gives a confidence interval for its prediction, on top of inferences about the parameters and such. ML models do not give a CI. The downside is that a prediction without a CI is worthless to a statistician, because it doesn't tell us how good that prediction is.

Let's take linear regression as an example. Many non-statistics books will give you the equations to solve it by minimizing a least-squares cost function. Statistics books give you that and the MLE way of doing it. We also see every prediction as the expected value of a distribution (see here https://stats.stackexchange.com/questions/148803/how-does-li...). Most non-statistics books aren't going to give you that point of view.
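The link between the two views above can be sketched in a few lines. The assumption, which the comment does not spell out, is the usual Gaussian error model $y_i = x_i^\top \beta + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ independent; the symbols $X$, $\beta$ and $\sigma^2$ are notation introduced here.

```latex
\ell(\beta, \sigma^2)
  = -\frac{n}{2}\ln(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - x_i^\top\beta\big)^2
\quad\Longrightarrow\quad
\hat\beta_{\text{MLE}}
  = \arg\min_{\beta}\sum_{i=1}^{n}\big(y_i - x_i^\top\beta\big)^2
  = (X^\top X)^{-1}X^\top y
```

Under that assumption, maximizing the likelihood over $\beta$ is the same problem as minimizing the least-squares cost, and the fitted value $x^\top\hat\beta$ estimates the conditional mean $\mathbb{E}[y \mid x]$ of a normal distribution rather than being a bare point prediction.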

Another example is Deep Learning vs PGMs or Bayes networks (hierarchical modeling). In my experience ML is more empirically driven than statistics.

Thanks. I think you are furthering the point of prediction vs. modeling, right? You can, after all, get a confidence rating for a model

> I think you are furthering the point of prediction vs. modeling, right?

Kinda. I just wanted to point out in general why models are categorized as statistical vs non-statistical. Certain areas of statistics do prediction, aka forecasting, too (time series). It's just that statistical models give a CI for their predictions and ML models usually do not; most of the time they are created empirically and there is no theory to derive a CI from.

As for Confidence Rating, I have no idea what it is. I've tried Google but couldn't find much. What field is this?

> I am not an expert and am still reading thru the article, but why is it such a strong dichotomy?

There isn't. This paper is nearly 2 decades old and isn't nearly as relevant as you would think given how many times it gets trotted out. Any statistician under the age of 45 would not recognize the two cultures as they are described here.

I'm a statistician under 45 who has first-hand experience with the separate cultures laid out here. The gap isn't as big as it used to be, but it still exists.

If people liked this paper, I suggest reading "The Two Cultures" by CP Snow which is not as technical but more expansive, cultural and philosophical.

> Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model. The goal is not interpretability, but accurate information.

As others have said, great paper by a great author. Must read.

