
Statistical Modeling: The Two Cultures (2001) [pdf] - sonabinu
http://projecteuclid.org/download/pdf_1/euclid.ss/1009213726
======
graycat
In short: In some important uses of statistical model building, traditional
regression analysis doesn't work or _fit_ very well.

One of the reasons to use regression was to look at the regression
coefficients as guides to the _importance_ of the corresponding variables.
Well, that was always a bit fishy for anything more than a _differential_ ,
i.e., partial derivative, view of the effects.
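To make that _differential_ view concrete: in a linear fit, a coefficient is just the partial derivative of the prediction with respect to its variable, so its size depends on the variable's units, which is one reason reading coefficients as "importance" is shaky. A minimal sketch in Python (the data here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)  # say, a length measured in meters
x2 = rng.normal(size=n)
y = 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Ordinary least squares: beta[1] is d(prediction)/d(x1), about 2.0.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Re-express x1 in centimeters: same information, same fit, but the
# coefficient shrinks by a factor of 100 -- so the raw coefficient is
# a poor guide to the variable's "importance".
Xcm = np.column_stack([np.ones(n), 100.0 * x1, x2])
beta_cm, *_ = np.linalg.lstsq(Xcm, y, rcond=None)
```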

In some important applications, e.g., with some really complicated biomedical
data, instead of so much emphasis on the coefficients, what is really needed
is a _model_ with some accurate predictions. So to get such predictions, we
can drop the emphasis on just traditional regression analysis and interpreting
its coefficients and, instead, use much more general means of _model
building_.

Relevant old approaches have been Fourier methods for band-limited smoothing
and interpolation, multivariate least-squares spline interpolation, and
categorical data analysis, e.g., _logistic_ regression and _log-linear_
models, going way back in the math for the social sciences.

Then Breiman and others did _Classification and Regression Trees_ -- a.k.a.
CART, for which there is software. In short, build a tree: Start at the root
node with all the input data. Partition the data into two collections, put
each collection in a node that is a child of the root node, and fit each
collection separately. Then split those two nodes similarly, recursively.
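The splitting step above can be sketched in a few dozen lines. This is not Breiman's actual CART implementation, just the core idea for a regression tree: at each node, pick the variable and threshold whose two-way split most reduces the squared error, fit a constant (the mean) in each leaf, and recurse:

```python
import numpy as np

def best_split(X, y):
    """Return the (feature, threshold) pair whose two-way partition
    minimizes total squared error, or None if no split helps."""
    best = None
    base = ((y - y.mean()) ** 2).sum()
    for j in range(X.shape[1]):
        # Naive O(n^2) scan over candidate thresholds; fine for a sketch.
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            yl, yr = y[left], y[~left]
            err = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
            if err < base:
                base, best = err, (j, t)
    return best

def grow(X, y, depth=0, max_depth=3, min_leaf=5):
    """Recursively partition the data into two collections per node."""
    split = None
    if depth < max_depth and len(y) >= 2 * min_leaf:
        split = best_split(X, y)
    if split is None:
        return {"predict": y.mean()}  # leaf: the "fit" is just the mean
    j, t = split
    left = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": grow(X[left], y[left], depth + 1, max_depth, min_leaf),
            "right": grow(X[~left], y[~left], depth + 1, max_depth, min_leaf)}

def predict(tree, x):
    while "feature" in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["predict"]
```

On a noiseless step function the sketch recovers the step in one split; real CART adds cost-complexity pruning and classification criteria on top of this.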

Then we may have a good fit to the data and good predicted values. However,
we have lost the considerable advantages of traditional regression analysis,
though we also didn't have to make the sometimes severe assumptions of that
old analysis.

With traditional regression, we had a lot of machinery, statistical
hypothesis tests, etc., to evaluate the _quality_ of the work, but with the
tree, etc., we don't have such things.

So, for the tree, we evaluate the quality by seeing how accurate the model is
on a second collection of data, _test data_ , _statistically similar_ to the
first collection used to build the tree. If the model is accurate on the test
data, then maybe we have some good work.
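The point of the held-out collection is that goodness of fit on the training data alone is misleading: a flexible enough model can fit the noise. A minimal sketch of the idea with polynomial fits (the data, degrees, and split sizes here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 300)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=300)

# A second, "statistically similar" collection: hold out 100 points.
idx = rng.permutation(300)
train, test = idx[:200], idx[200:]

def fit_poly(deg):
    """Fit on the training collection; report MSE on both collections."""
    c = np.polyfit(x[train], y[train], deg)
    err_train = np.mean((np.polyval(c, x[train]) - y[train]) ** 2)
    err_test = np.mean((np.polyval(c, x[test]) - y[test]) ** 2)
    return err_train, err_test

tr5, te5 = fit_poly(5)     # modest flexibility
tr25, te25 = fit_poly(25)  # much more flexibility
# Training error can only fall as flexibility grows, since the larger
# basis contains the smaller; the honest measure of quality is the
# error on the held-out test data.
```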

Else we try again, and then do some more statistical work to avoid getting a
_spurious_ fit from just trying until we got _lucky_.

Since I have a lot of respect for Breiman, I'm willing to take seriously
whatever he wrote.

------
closed
Great paper! My undergrad advisor wrote an interesting article that explains
to biologists, using similar reasoning, the difference in perspective between
information-criterion measures such as AIC and BIC.

[http://www.ncbi.nlm.nih.gov/pubmed/24804445](http://www.ncbi.nlm.nih.gov/pubmed/24804445)

~~~
tukelully
How might I actually view the article?

------
andreasvc
Also see this 2010 paper which cites this
[http://projecteuclid.org/euclid.ss/1294167961](http://projecteuclid.org/euclid.ss/1294167961)

