
Akaike information criterion - dedalus
https://en.wikipedia.org/wiki/Akaike_information_criterion
======
closed
I've plugged it before, but the article "Model selection for ecologists: the
worldviews of AIC and BIC" [1] is a great place to start.

It's aimed at practitioners, and discusses the relation between many
statistical concepts (like AIC and cross-validation).

[1]: http://izt.ciens.ucv.ve/ecologia/Archivos/ECO_POB%202014/ECOPO2_2014/P%20en%20inferencia/Aho%20et%20al%202014.pdf

------
poster123
I have wondered why books on machine learning mention cross-validation to
avoid overfitting but rarely talk about information criteria such as AIC and
BIC. AIC and BIC have been well studied for multiple linear regression models.
Are they less applicable to multi-layer neural networks and other machine
learning models?
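For concreteness, AIC and BIC for a Gaussian linear model can be computed directly from the residual sum of squares. A minimal numpy sketch on synthetic data (the data and candidate degrees are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, n)  # true model is quadratic

def aic_bic(degree):
    # Least-squares polynomial fit, scored by AIC and BIC.
    X = np.vander(x, degree + 1)
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = degree + 2  # polynomial coefficients plus the noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)  # Gaussian log-likelihood at the MLE
    return 2 * k - 2 * log_lik, k * np.log(n) - 2 * log_lik

for d in range(1, 6):
    aic, bic = aic_bic(d)
    print(f"degree {d}: AIC = {aic:8.1f}  BIC = {bic:8.1f}")
```

Both criteria should bottom out near the true degree here; BIC's log(n) penalty just makes it harsher on extra parameters than AIC's constant 2.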

~~~
bendysnowflakes
I'm more giddy to respond to questions like this than I should be, given that
information criteria are an area of my research.

AIC and BIC are approximations to other quantities under certain common but
idealized scenarios. What's become clear over the last 20 years or so is that
in most other scenarios those approximations are not nearly as good as
alternative approximations, or as simply computing the quantities of interest
directly.

So, for something like DL models, criteria such as AIC and BIC are less
applicable. But what is confusing to me is why other information criteria
aren't pursued more. There's literature on the asymptotic equivalence of
cross-validation and certain information criteria, and CV has certain problems
that those criteria overcome (I've been looking for citations, but Google
thinks I'm spamming it or something).
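To give a flavor of what that equivalence literature is about: for ordinary least squares, leave-one-out CV has an exact closed form via the hat matrix, and its model ranking tends to agree with AIC's as n grows (a classic result due to Stone). A sketch with made-up data, using only numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-1, 1, n)
y = 0.5 + x + rng.normal(0, 0.2, n)  # true model is linear

for d in (1, 2, 3):
    X = np.vander(x, d + 1)
    H = X @ np.linalg.solve(X.T @ X, X.T)           # hat matrix of the OLS smoother
    resid = y - H @ y
    loo = np.mean((resid / (1 - np.diag(H))) ** 2)  # exact leave-one-out CV, no refitting
    rss = float(resid @ resid)
    aic = n * np.log(rss / n) + 2 * (d + 2)         # AIC up to an additive constant
    print(f"degree {d}: LOO-CV MSE = {loo:.4f}  AIC = {aic:.1f}")
```

The shortcut `resid / (1 - h_ii)` is exact for OLS, which is one reason the CV-versus-IC comparison is cleanest in the linear-model setting.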

I think part of the reason for the adoption of CV is a cultural one, to be
honest, and due to historical quirks. Like bootstrapping and other empirical
simulation methods, it has an appeal due to its relatively nonparametric
nature. But there are a lot of mischaracterizations of CV (e.g., relative to
bootstrapping), and its warts have generally been overlooked due to the
position it has attained in applied settings.

There is also a huge gap between the literature on information criteria and
the rest of statistics and applied data analysis. People tend to learn about
AIC and BIC through certain standard sources, sources mostly pertaining to
their initial derivation, or that follow the same line of reasoning. But
there's other literature that derives AIC and BIC through other means, or
approaches them from other perspectives, and that literature tends to get
ignored. As a result, a lot of misleading statements get made about each, even
by talented individuals (e.g., that the rationale for BIC depends on
Bayesianism, or that it assumes a true model is under consideration), and a
lot of advances in the area are totally ignored. It's like there are two
literatures: one attended to by people who are interested in information
criteria, and another by people who "just want to understand AIC and BIC."
It's confusing to me really, because I generally see some kind of trickle-down
in other areas of statistics, but with information criteria, it's like there's
an enormous gap.

~~~
srean
I would chime in that cross-validation and its variants are extremely flexible
in the sense that they make very few assumptions about the model. No
likelihood function? No problem. Nonparametric? No problem. Need
non-asymptotic numbers? No problem. The model changed? No problem again. You
can use the exact same code and procedure.
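That flexibility is easy to demonstrate: a CV harness never needs to look inside the model, so swapping a likelihood-based fit for a nonparametric one changes only the two callbacks. A sketch with made-up data (the `fit`/`predict` callback interface is my own choice for illustration):

```python
import numpy as np

def kfold_cv(fit, predict, x, y, k=5, seed=0):
    # Model-agnostic k-fold CV: the procedure is identical for any model.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        errs.append(np.mean((predict(model, x[test]) - y[test]) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 100)
y = np.sin(x) + rng.normal(0, 0.1, 100)

# Parametric model: degree-1 polynomial (has a likelihood under Gaussian noise).
lin = kfold_cv(lambda xt, yt: np.polyfit(xt, yt, 1),
               lambda m, xs: np.polyval(m, xs), x, y)
# Nonparametric model: 3-nearest-neighbour average (no likelihood in sight).
knn = kfold_cv(lambda xt, yt: (xt, yt),
               lambda m, xs: np.array([m[1][np.argsort(np.abs(m[0] - q))[:3]].mean()
                                       for q in xs]),
               x, y)
print(f"5-fold CV MSE  linear: {lin:.4f}  3-NN: {knn:.4f}")
```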

AIC is controlled by the number of parameters and the quality of fit, but it
is known, in theory as well as in practice, that for many models (for example,
neural networks, or linear regression with additive i.i.d. Gaussian noise) the
norm of the parameters matters more for prediction accuracy than their sheer
number.
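A quick illustration of the norm-versus-count point: ridge regression keeps the parameter count fixed while shrinking the coefficient norm, yet prediction error can drop substantially, so an AIC-style count penalty would not register the difference. A sketch on synthetic data (dimensions and lambdas are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 40                        # many features, modest sample size
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 1.0                  # only three features actually matter
y = X @ beta_true + rng.normal(0, 1.0, n)
X_test = rng.normal(size=(1000, p))
y_test = X_test @ beta_true + rng.normal(0, 1.0, 1000)

def ridge(lam):
    # Same number of parameters for every lambda; only the norm shrinks.
    b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return b, float(np.mean((X_test @ b - y_test) ** 2))

for lam in (0.0, 1.0, 10.0):
    b, mse = ridge(lam)
    print(f"lambda = {lam:5.1f}  ||beta|| = {np.linalg.norm(b):5.2f}  test MSE = {mse:5.2f}")
```

At lambda = 0 this is plain least squares; increasing lambda trades a little bias for a large variance reduction on the 37 irrelevant directions.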

Not claiming that CV is the best model selection criterion, just listing a few
pluses for CV. The last word on model selection is yet to be spoken. It goes
right to one of the core questions about model fitting and prediction.

~~~
bendysnowflakes
Yeah, I don't mean to make it sound like there's no good reason for CV... I
came across as sort of harsh. I just feel like CV got taken up widely in
application because it makes intuitive sense, but with relatively little
theoretical scrutiny, especially compared to the alternatives. When it has
been compared to alternatives, the comparisons are often weak in that they
involve straw-man or unrepresentative assumptions. CV involves certain
assumptions itself, for example in the choices it requires (like how to split
the data), and those can affect results in unpredictable ways. I'm a big fan
of bootstrapping too and have often felt that that avenue hasn't been explored
as much as it could be.

~~~
srean
Oh, your original comment wasn't harsh at all. I was just listing a few of
CV's strengths, which may have played a role in making it a popular choice.

------
nimish
Burnham and Anderson's "Model Selection and Multimodel Inference" is a
fundamental book for understanding why AIC, BIC, and the like are useful.

------
stared
For an intro I recommend section 6.1.3, "Choosing the Optimal Model," from "An
Introduction to Statistical Learning": http://www-bcf.usc.edu/~gareth/ISL/

------
j7ake
Is it possible to derive the AIC from Bayesian principles? If so, what
assumptions are necessary?

------
mushufasa
what about it?

