Akaike information criterion (wikipedia.org)
36 points by dedalus on Jan 8, 2018 | 15 comments

I've plugged it before, but the article "Model selection for ecologists: the worldviews of AIC and BIC" [1] is a great place to start.

It's aimed at practitioners, and discusses the relation between many statistical concepts (like AIC and cross-validation).

[1]: http://izt.ciens.ucv.ve/ecologia/Archivos/ECO_POB%202014/ECO...

I have wondered why books on machine learning mention cross-validation to avoid overfitting but rarely talk about information criteria such as AIC and BIC. AIC and BIC have been well studied for multiple linear regression models. Are they less applicable to multi-layer neural networks and other machine learning models?

I'm more giddy to respond to questions like this than I should be, given that information criteria are an area of my research.

AIC and BIC are approximations to other quantities under certain common but idealized scenarios. What's become clear over the last 20 years or so is that for most other scenarios those approximations are not nearly as good as other approximations, or as simply calculating the target quantities directly.

So, for something like DL models, criteria such as AIC and BIC are less applicable. But what is confusing to me is why other information criteria aren't pursued more. There's literature on the asymptotic equivalence of cross-validation and certain information criteria, and CV has certain problems that those criteria overcome (I've been looking for citations but Google thinks I'm spamming it or something).

I think part of the reason for the adoption of CV is a cultural one, to be honest, and due to historical quirks. Like bootstrapping and other empirical simulation methods, it has an appeal due to its relatively nonparametric nature. But there are a ton of mischaracterizations of CV (e.g., relative to bootstrapping) and its warts have been overlooked in general due to the position it has attained in applied settings.

There is also a huge gap between the literature on information criteria and the rest of statistics and applied data analysis. People tend to learn about AIC and BIC through certain standard sources, sources mostly pertaining to their initial derivation, or that follow the same line of reasoning. But there's other literature that derives AIC and BIC through other means, or approaches them from other perspectives, and that literature tends to get ignored. As a result there are a lot of misleading statements made about each, even by talented individuals (e.g., that the rationale for BIC depends on Bayesianism, or that there is a true model under consideration), and a lot of advances in the area are totally ignored. It's like there are two literatures, one attended to by people who are interested in information criteria, and another by people who "just want to understand AIC and BIC." It's confusing to me really, because I generally see some kind of trickle-down in other areas of statistics, but with information criteria, it's like there's an enormous gap.

I would chime in that cross-validation and its variants are extremely flexible in the sense that they make very few assumptions about the model. No likelihood function? No problem. Non-parametric? No problem. Need non-asymptotic numbers? No problem. The model changed? No problem again. One can use the exact same code and procedure.
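To illustrate the "exact same code and procedure" point, here is a minimal k-fold sketch in plain Python. Everything in it is made up for illustration: the `k_fold_cv` helper, the two toy models, and the synthetic data. The only thing the procedure asks of a model is a `fit` and a `predict` callable plus a loss, so swapping models does not touch the CV code.

```python
import random

def k_fold_cv(data, fit, predict, loss, k=5, seed=0):
    """Mean held-out loss over k folds. Needs no likelihood, only fit/predict/loss."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the i-th, then score on the i-th.
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        model = fit(train)
        errors = [loss(y, predict(model, x)) for x, y in held_out]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k

# Hypothetical model 1: always predict the training mean.
def fit_mean(train):
    return sum(y for _, y in train) / len(train)

def predict_mean(model, x):
    return model

# Hypothetical model 2: one-dimensional least-squares line.
def fit_line(train):
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

# Synthetic data (an assumption for the demo): y = 2x + 1 plus Gaussian noise.
rng = random.Random(1)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.1)) for x in (i / 10 for i in range(100))]

cv_mean = k_fold_cv(data, fit_mean, predict_mean, squared_error)
cv_line = k_fold_cv(data, fit_line, predict_line, squared_error)
print(cv_mean, cv_line)
```

Both models are scored on the same folds because the shuffle seed is fixed, which keeps the comparison paired.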

AIC is controlled by the number of parameters and the quality of fit, but it is known in theory as well as in practice that for many models (for example, neural networks, or linear regression with additive iid Gaussian noise) the norm of the parameters matters more for prediction accuracy than the sheer number of parameters.

Not claiming that CV is the best model selection criterion, just listing out a few pluses for CV. The last word on model selection is yet to be spoken. It goes right to one of the core questions about model fitting and prediction.

Yeah, I don't mean to make it sound like there's no good reason for CV... I came across as sort of harsh. I just feel like CV got taken up in application a lot because it makes intuitive sense but with relatively little in terms of theoretical scrutiny, especially relative to other alternatives. When it has been compared to alternatives, the comparisons are sort of weak in that they often involve strawman or unrepresentative assumptions. CV involves certain assumptions itself, for example, in terms of choices, that affect results in sort of unpredictable ways sometimes. I'm a big fan of bootstrapping too and have often felt that that avenue hasn't been explored as much as it could be.

Oh, your original comment wasn't harsh at all. I was just listing a few of CV's strengths, which may have played a role in making it a popular choice.

Glad to see an expert respond. Are you able to comment on my ramblings here? https://news.ycombinator.com/item?id=16096865

How do you feel about stuff like the DIC and WAIC for machine learning and other "predictive" modeling?

I like the links--thanks.

About DIC and WAIC, etc. ... that's a huge topic for me.

There's a relatively well-cited result showing that you can choose an essentially arbitrary penalty in an information criterion (i.e., a criterion of the form -logL + penalty), as long as it meets certain weak conditions, and it will eventually work with big enough samples. So the issue of most importance is usually how selection accuracy increases with sample size, not asymptotic performance. This leads to a certain hesitancy for me in terms of why I should pay attention to any particular IC.
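To make the -logL + penalty form concrete, here is a small sketch that treats the penalty as a plug-in function. On this scale AIC/2 corresponds to a penalty of k and BIC/2 to (k/2) ln n. The helper names and the two log-likelihood values are hypothetical, chosen only to show that different penalties can disagree at a finite sample size.

```python
import math

def information_criterion(log_likelihood, penalty):
    """Generic form: -log L + penalty (lower is better)."""
    return -log_likelihood + penalty

def aic_penalty(k, n):
    # AIC / 2 = -log L + k; the sample size n is unused.
    return float(k)

def bic_penalty(k, n):
    # BIC / 2 = -log L + (k / 2) * ln(n)
    return 0.5 * k * math.log(n)

# Hypothetical fitted models: (maximized log-likelihood, parameter count).
candidates = {"small": (-520.0, 3), "large": (-512.0, 8)}

def preferred(penalty_fn, n):
    """Name of the candidate with the lowest criterion value at sample size n."""
    scores = {name: information_criterion(ll, penalty_fn(k, n))
              for name, (ll, k) in candidates.items()}
    return min(scores, key=scores.get)

n = 200  # assumed sample size for the demo
print(preferred(aic_penalty, n), preferred(bic_penalty, n))
```

With these made-up numbers the two penalties pick different models, which is the finite-sample behaviour the comment says matters more than the asymptotics.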

Having said that, DIC and WAIC are pretty elegant in my mind, and I've been glad to see research on them.

Predictive scenarios aren't really my focus, although I've become much more interested in them the last couple of years for various reasons. Most of my work is in what might be thought of as unsupervised models.

Isn't AIC just an approximation to cross-entropy, in the simplest possible case of gaussian noise? It's a rule of thumb which is useful in much simpler cases (a few parameters, lots of data) than typical for say neural nets.

Not the optimal reference, but a paper dealing with deriving AIC like this which I happen to have on my desk: Gelman & co 2014 https://link.springer.com/article/10.1007/s11222-013-9416-2

Maybe an expert will chime in here, but I recall learning that several concepts in ML are derived from, or developed in parallel to, the BIC. Many use AIC/BIC measures for feature selection.

On a separate note, AIC/BIC are prevalent in linear regression models because they are extremely well-behaved in comparison to some other models. This has generated enormous amounts of field-specific experimental literature. For instance, economists can form expectations for AIC/BIC measurements dependent on the study-type.

Additionally, we already have some decent tools for regularization (albeit model-dependent ones). Counting the nodes of a NN to establish an AIC may not lead to a proper understanding of the free parameters.

This wasn't meant to be an apologist's take on why I'm not using AIC/BIC for model selection. If we're talking about "deep learning", then it's definitely a worthwhile goal to examine how to optimize network topologies. The link below mentions using reinforcement learning to learn architecture.


It looks like some work has been done on using AIC for neural networks:

- https://www.sciencedirect.com/science/article/pii/B978044489...

- https://www.sciencedirect.com/science/article/pii/S095219760...

- https://waseda.pure.elsevier.com/en/publications/network-inf...

In principle it would be straightforward, right? AIC = 2k - 2ln(L). So set k = # weights + # biases, and use the log-likelihood as the objective function so you can just read off L from there.
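A minimal sketch of that recipe, assuming a fully connected network and a training objective that is the total negative log-likelihood over the training set. The layer sizes and the loss value below are hypothetical. One thing to watch: many training loops report a per-sample mean loss, which would need to be multiplied by the number of samples before being used as -ln(L).

```python
def count_parameters(layer_sizes):
    """k = number of weights + number of biases in a fully connected net."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def aic(log_likelihood, k):
    """AIC = 2k - 2 ln(L), with ln(L) summed over the training set."""
    return 2 * k - 2 * log_likelihood

# Hypothetical example: a 10-32-1 network whose summed negative
# log-likelihood on the training set ended at 412.7.
layer_sizes = [10, 32, 1]
total_nll = 412.7

k = count_parameters(layer_sizes)
print(k, aic(-total_nll, k))
```

Whether a raw weight count is the right k for a regularized network is a separate question that this sketch does not address.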

I wonder if the reason why AIC is unpopular is that it's harder to explain to your boss than accuracy, precision, recall, or even proper scoring. This is perhaps more true now that statistical literacy in management is increasing -- the notion that you can't use training data to estimate performance is becoming popular. Now here comes a magic formula, calculated on the training set, that supposedly tells me how well the model will perform... that's not gonna fly in a lot of settings.

There's also the question of whether it even tells us what we want to know. AIC is an estimate of Kullback-Leibler information of a probability model, under somewhat-restrictive conditions [0]. The question of "why don't we use AIC?" might be the same as the question of "why don't we use proper scoring rules?" -- people want to know accuracy, so they just go ahead and estimate accuracy by brute-force resampling. I'm not saying it's right, but until people see tangible value in thinking about their models from a probabilistic perspective, they won't be motivated to do so.

[0]: http://www4.ncsu.edu/~shu3/Presentation/AIC.pdf

Burnham and Anderson is a fundamental book for understanding why AIC, BIC, and the like are useful.

For intro I recommend "6.1.3 Choosing the Optimal Model" from "An Introduction to Statistical Learning" http://www-bcf.usc.edu/~gareth/ISL/

Is it possible to derive the AIC from Bayesian principles? If so, what assumptions are necessary?

what about it?
