It's aimed at practitioners and discusses the relationships between many statistical concepts (like AIC and cross-validation).
AIC and BIC are approximations to other quantities, derived under certain common but idealized scenarios. What's become clear over the last 20 years or so is that, in most other scenarios, those approximations are not nearly as good as alternative approximations, or as simply computing the target quantities of interest directly.
So, for something like DL models, criteria such as AIC and BIC are less applicable. But what's confusing to me is why other information criteria aren't pursued more. There's literature on the asymptotic equivalence of cross-validation and certain information criteria, and CV has certain problems that those criteria overcome (I've been looking for citations, but Google thinks I'm spamming it or something).
I think part of the reason for the adoption of CV is cultural, to be honest, and due to historical quirks. Like bootstrapping and other empirical simulation methods, it has an appeal due to its relatively nonparametric nature. But there are a ton of mischaracterizations of CV (e.g., relative to bootstrapping), and its warts have generally been overlooked because of the position it has attained in applied settings.
There is also a huge gap between the literature on information criteria and the rest of statistics and applied data analysis. People tend to learn about AIC and BIC through certain standard sources, mostly sources pertaining to their initial derivations, or ones that follow the same line of reasoning. But there's other literature that derives AIC and BIC through other means, or approaches them from other perspectives, and that literature tends to get ignored. As a result, a lot of misleading statements get made about each, even by talented people (e.g., that the rationale for BIC depends on Bayesianism, or that it requires a true model to be under consideration), and a lot of advances in the area are totally ignored.

It's like there are two literatures: one attended to by people who are interested in information criteria, and another by people who "just want to understand AIC and BIC." It's confusing to me, really, because I generally see some kind of trickle-down in other areas of statistics, but with information criteria it's like there's an enormous gap.
AIC is controlled by the number of parameters and the quality of fit, but it is known, both in theory and in practice, that for many models (for example, neural networks, or linear regression with additive i.i.d. Gaussian noise) the norm of the parameters matters more for prediction accuracy than the sheer number of parameters.
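To make that concrete, here's a minimal sketch (the data-generating setup and all numbers are invented for illustration): two ridge fits with exactly the same parameter count k, so an AIC-style penalty can't distinguish them, yet with very different parameter norms.

```python
import numpy as np

# Hypothetical toy setup: 20 features, only 3 carry signal.
rng = np.random.default_rng(0)
n, p = 60, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 1.0
y = X @ beta_true + rng.normal(size=n)
X_test = rng.normal(size=(200, p))
y_test = X_test @ beta_true + rng.normal(size=200)

def ridge_fit(X, y, lam):
    """Ridge solution (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (1e-6, 10.0):
    b = ridge_fit(X, y, lam)
    mse = np.mean((X_test @ b - y_test) ** 2)
    # k = p = 20 for both fits, so AIC's penalty term is identical;
    # only the parameter norm (and, typically, the test error) differs.
    print(f"lambda={lam:g}  ||beta||={np.linalg.norm(b):.2f}  test MSE={mse:.2f}")
```

The shrunken fit has the same "parameter count" as the near-OLS fit, which is exactly why a raw count of parameters can be a poor proxy for model complexity.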
I'm not claiming that CV is the best model selection criterion, just listing a few pluses for CV. The last word on model selection is yet to be spoken; it goes right to one of the core questions about model fitting and prediction.
How do you feel about stuff like the DIC and WAIC for machine learning and other "predictive" modeling?
About DIC and WAIC, etc. ... that's a huge topic for me.
There's a relatively well-cited result showing that you can choose essentially any penalty in an information criterion (i.e., a criterion of the form -logL + penalty), and as long as it meets certain weak conditions it will eventually work with big enough samples. So the issues of most importance are usually how selection accuracy improves with sample size, not asymptotic performance. This leads to a certain hesitancy on my part about why I should pay attention to any particular IC.
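For what it's worth, the generic form is trivial to write down. A sketch in Python (the function name and all numeric values are mine; c_n is the penalty multiplier that, per the result above, is largely arbitrary asymptotically):

```python
import math

def info_criterion(loglik, k, c_n):
    """Generic criterion of the form -2*logL + c_n * k.

    AIC corresponds to c_n = 2; BIC to c_n = log(n).  Many other
    choices of c_n also select the right model asymptotically, so
    the interesting differences are in finite-sample behavior.
    """
    return -2.0 * loglik + c_n * k

# Made-up fit: log-likelihood -250 with 5 parameters, n = 100.
loglik, k, n = -250.0, 5, 100
aic = info_criterion(loglik, k, c_n=2.0)
bic = info_criterion(loglik, k, c_n=math.log(n))
print(aic, bic)  # BIC penalizes harder whenever log(n) > 2, i.e. n > ~7
```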
Having said that, DIC and WAIC are pretty elegant in my mind, and I've been glad to see research on them.
Predictive scenarios aren't really my focus, although I've become much more interested in them the last couple of years for various reasons. Most of my work is in what might be thought of as unsupervised models.
Not the ideal reference, but a paper that derives AIC along these lines, which I happen to have on my desk: Gelman et al. (2014).
On a separate note, AIC/BIC are prevalent in linear regression models because those models are extremely well-behaved compared to some others. This has generated an enormous amount of field-specific experimental literature. For instance, economists can form expectations for AIC/BIC values depending on the type of study.
Additionally, we already have some decent tools for regularization (albeit model-dependent ones). Counting the nodes of a NN to establish an AIC may not accurately capture the effective number of free parameters.
This wasn't meant to be an apologist's take on why I'm not using AIC/BIC for model selection. If we're talking about "deep learning," then it's definitely a worthwhile goal to examine how to optimize network topologies. The link below mentions using reinforcement learning to learn architectures.
In principle it would be straightforward, right? AIC = 2k - 2ln(L). So set k = (# weights + # biases), and use the log-likelihood as the objective function so you can just read off ln(L) from there.
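A quick sketch of that bookkeeping (the layer sizes and log-likelihood value here are made up):

```python
def mlp_param_count(layer_sizes):
    """Total weights + biases for a fully connected net.

    layer_sizes like [4, 16, 16, 1]: 4 inputs, two hidden layers of 16,
    1 output.  Each layer contributes (in * out) weights + out biases.
    """
    return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))

def aic(log_lik, k):
    """AIC = 2k - 2*ln(L), with ln(L) read off the training objective."""
    return 2 * k - 2 * log_lik

k = mlp_param_count([4, 16, 16, 1])   # (4*16+16) + (16*16+16) + (16*1+1) = 369
print(k, aic(-120.5, k))              # 369 979.0
```

Of course, whether the raw weight count is the right k for a neural network is exactly the concern raised above about regularization and effective free parameters.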
I wonder if the reason AIC is unpopular is that it's harder to explain to your boss than accuracy, precision, recall, or even proper scoring rules. This is perhaps more true now that statistical literacy in management is increasing -- the notion that you can't use training data to estimate performance is becoming popular. Now here comes a magic formula, calculated on the training set, that supposedly tells me how well the model will perform... that's not gonna fly in a lot of settings.
There's also the question of whether it even tells us what we want to know. AIC is an estimate of the Kullback-Leibler information of a probability model, under somewhat restrictive conditions. The question of "why don't we use AIC?" might be the same as the question of "why don't we use proper scoring rules?" -- people want to know accuracy, so they just go ahead and estimate accuracy by brute-force resampling. I'm not saying it's right, but until people see tangible value in thinking about their models from a probabilistic perspective, they won't be motivated to do so.
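As a toy illustration of what a proper scoring rule buys you over raw accuracy (the labels and probabilities here are invented): two classifiers with identical accuracy, where log loss, a proper scoring rule, punishes the one that's confidently wrong.

```python
import math

def accuracy(y, p):
    """Fraction of cases where thresholding at 0.5 gets the label right."""
    return sum((pi >= 0.5) == bool(yi) for yi, pi in zip(y, p)) / len(y)

def log_loss(y, p):
    """Negative average log-likelihood of the labels: a proper scoring rule."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

y = [1, 1, 1, 0]
mild      = [0.9, 0.9, 0.40, 0.1]   # wrong on the third case, but hedged
confident = [0.9, 0.9, 0.01, 0.1]   # wrong on the third case, and sure of it

print(accuracy(y, mild), accuracy(y, confident))   # 0.75 0.75
print(log_loss(y, mild) < log_loss(y, confident))  # True
```

Accuracy can't see the difference between the two; the probabilistic score can, which is the kind of tangible value that might motivate people to think about their models probabilistically.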