A Bayesian Perspective on Generalization and Stochastic Gradient Descent (ai.google)
182 points by godelmachine 3 months ago | 36 comments



This result is an amazing twofer. Not only do they give a formal probabilistic perspective on how deep nets fall into local minima, they also propose a way to reduce two "voodoo parameters" - learning rate and batch size - to one, by showing how to pick one given the other in a way that is optimal for expected generalization.
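If I'm reading the paper right, the rule comes from the SGD "noise scale", which they argue should be held roughly constant for good generalization:

    g = \epsilon\left(\frac{N}{B} - 1\right) \approx \frac{\epsilon N}{B}, \qquad B_{\mathrm{opt}} \propto \epsilon N

where epsilon is the learning rate, N the training set size and B the batch size - so once one of the two is fixed, the other follows from the proportionality instead of being tuned by hand.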


I actually suspect there is some connection between the optimal batch size and the proofs in this paper about how deep NNs provably converge to zero training loss with gradient descent (not stochastic gradient descent).

https://arxiv.org/abs/1811.03804


You should check out our blog post on finding the optimal learning rate, batch size, and momentum settings:

https://www.myrtle.ai/2018/09/24/how-to-train-your-resnet-5/


This development has parallels to tuning parameters in the Monte Carlo world. Specifically, with HMC+NUTS one does not have to worry about finding the complete distribution, and you get warned when the distribution is ill-behaved [0].

HMC+NUTS not only does better exploration of the posterior, it can also help you select things like step sizes, acceptance rate and simulation time (see the NUTS sampler [1]) much more easily.

[0] http://mc-stan.org/misc/warnings.html#runtime-warnings [1] http://mc-stan.org/workshops/ASA2016/day-1.pdf
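For anyone who hasn't tried it, here is roughly how little tuning that takes in practice - a minimal sketch in PyMC3 rather than Stan (it also uses NUTS by default; the model and data below are made up):

    import numpy as np
    import pymc3 as pm

    # Toy data, just to watch the sampler tune itself.
    data = np.random.normal(1.0, 2.0, size=200)

    with pm.Model():
        mu = pm.Normal("mu", mu=0.0, sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=5.0)
        pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
        # NUTS is the default sampler; step size and mass matrix are adapted
        # automatically during the tuning phase, and problems such as
        # divergences show up as warnings.
        trace = pm.sample(1000, tune=1000)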


I am increasingly amazed by the use of the term "generalization" - without modifiers or context - in describing what current deep learning does. Certainly, deep learning systems do generalize in certain ways when presented with certain large datasets. But how can this just be "generic"? Is this just "data you find in the world", without other considerations?

Shouldn't they be discussing "generalizing what qualities?" Wouldn't image generalization be different from chat-script generalization, and so forth?


Generalization in this context is usually a very specific technical term that means after training on a training set, the fitted model continues to perform well when used on previously unseen examples (a test set) that came from the same distribution (data generating process) that Nature used to produce the training set.

You can define other notions of generalization that try to be more than this, like generalizing to multiple tasks or an algorithm capable of playing any turn-based perfect information game, etc., but that is not usually what people are talking about when they talk about the formal statistics concept of generalization error.
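As a toy illustration of the narrow, technical sense (scikit-learn, arbitrary dataset and model, just to make it concrete): the held-out score below is the estimate of generalization error on unseen examples from the same distribution.

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Train and test sets are partitions of examples drawn from the same
    # data-generating process; the held-out score estimates generalization.
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    print("train accuracy:", model.score(X_train, y_train))
    print("test accuracy: ", model.score(X_test, y_test))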


>> Generalization in this context is usually a very specific technical term that means after training on a training set, the fitted model continues to perform well when used on previously unseen examples (a test set) that came from the same distribution (data generating process) that Nature used to produce the training set.

To be more precise, that a dataset came from the "real" distribution of data in the real world is an assumption, and one that is impossible to verify, at that. In practice, all we can know for sure is the distribution of data in a dataset. There is such a thing as sampling error, after all, and it's a big pain in every field that uses sampling to collect evidence.

So, in practice again, when we say that a model "generalises well" we mean that it performs well in cross-validation, where it's trained and tested on different partitions of the entire dataset, in successive steps. Unfortunately, that's a very poor measure of generalisation. More to the point, it's an estimate of the model's real-world generalisation. But how good an estimate it is is very hard to know.
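A minimal sketch of the cross-validation estimate I mean (scikit-learn, arbitrary toy dataset):

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # 5-fold CV: the dataset is repeatedly split into train/test partitions
    # and the mean held-out score is reported as "generalisation".
    X, y = load_digits(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
    print(scores.mean(), scores.std())

The mean score is still only an estimate of real-world generalisation, for the reasons above.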


Almost always the training and test set are taken from exactly the same Natural distribution, inclusive of any sampling errors, biases, censoring artifacts, data corruption or loss, etc.

So while in some cases the distinction you point out does matter, it’s pretty rare, certainly rare enough that it doesn’t cause issues with generalization error analysis of the majority of experiments.

For most domain applications like computer vision tasks, natural language tasks, etc., it would be pedantic and disingenuous to call test set generalization error a mere “estimate” of the real generalization error.

Beyond that, in topics like PAC learning you can prove results about optimal algorithms, like SVMs as maximum-margin classifiers, with fully theoretical generalization error bounds.
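For instance, one textbook PAC-style bound (stated here for a finite hypothesis class H and m i.i.d. samples; I'm quoting from memory, so treat the constants as approximate) says that with probability at least 1 - \delta, every h in H satisfies

    R(h) \le \hat{R}(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}

i.e. the true generalization error R(h) stays close to the empirical error \hat{R}(h), uniformly over the hypothesis class.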

Overall, except for certain special case applications, I think the comments you’re indicating about generalization turn out not to matter much in practice.


Well, PAC learning assumes that your training and test datasets are drawn from the same distribution and offers no guarantees if that assumption cannot be made.

Obviously, training and test data in cross-validation partitions have a common distribution- but that is the distribution associated with the sampling process, not that of whatever natural process (we assume) generates the data in the real world. So, my point is, once you move from the lab to the real world, any generalisation error (resp. accuracy) you calculated by cross-validation in the lab becomes a mere estimate- because you don't know the real-world distribution and you are not guaranteed anything by PAC learning anymore.

Look- your dataset is not the real world, your sampling method is not the natural process that generates the events you observe and the distribution of data in the real world is not the distribution of data in your data set. If it was- you wouldn't need machine learning, would you? You'd already have a model of the real world process you're interested in.

I'm afraid the best you can ever hope to do with generalisation to real-world data is estimate it, and nothing more.


Several subfields of PAC learning also deal with quantifying error bounds under other assumptions about the KL divergence between your sampling distribution and the true distribution. Variational Bayesian methods also provide results on generalization in these cases.

You seem to care a lot that these approaches rely on assumptions, but I find that silly.

Further in _most_ specific applications of e.g. CNNs for computer vision tasks or RNNs for time series or sequential prediction tasks, you _do_ control the entire sampling distribution (e.g. user photos with exactly satisfied criteria), and you really can fully and unambiguously guarantee that the training and testing set come from the “true” distribution you care about. And in many other cases the degree to which that assumption fails is so minor that it makes no practical difference.
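To make the "other assumptions" point concrete, here's a toy importance-weighting sketch - not the PAC-Bayes machinery itself, just the simplest case where the density ratio between the sampling distribution p and the deployment distribution q is assumed known in closed form:

    import numpy as np

    # Toy covariate-shift example: training samples come from p = N(0, 1),
    # deployment samples from q = N(0.5, 1). If the density ratio q(x)/p(x)
    # is known (here it is, in closed form), the deployment-time loss can be
    # estimated from source samples alone via importance weighting.
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=100_000)       # samples from p
    loss = (x > 1.0).astype(float)               # 0/1 loss of some fixed classifier

    mu_q = 0.5
    ratio = np.exp(mu_q * x - 0.5 * mu_q ** 2)   # q(x) / p(x) for N(0.5,1) vs N(0,1)

    print("loss under p:", loss.mean())
    print("estimated loss under q:", np.mean(loss * ratio))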


Some domains are chaotic, some are very large, some are dynamic. In some domains errors of some types matter much more than other errors.

Minor assumption failures can suddenly matter a lot.

The distinction between a lab result and a field trial has been somewhat lost in ML; there will be hell to pay in the long term...


> Minor assumption failures can suddenly matter a lot.

Indeed. In fact this is by now painfully obvious with Fisher's maximum-likelihood based estimators -- they do not have nice 'continuity' properties that give graceful degradation. We figured out how to deal with this once the problem was recognized.

However, we see the same problem rearing its head in some of the deep neural net literature. Tiny variations often do not cause tiny distortions in the output -- that's cause for some legitimate concern.


>> You seem to care a lot that these approaches rely on assumptions, but I find that silly.

Thanks, I'll keep that in mind.

>> Further in _most_ specific applications of e.g. CNNs for computer vision tasks or RNNs for time series or sequential prediction tasks, you _do_ control the entire sampling distribution (e.g. user photos with exactly satisfied criteria), and you really can fully and unambiguously guarantee that the training and testing set come from the “true” distribution you care about.

One of my pet peeves is the Deep Gaydar Paper from Stanford (https://osf.io/zn79k/).

The authors trained a CNN to identify sexuality (gay or straight) from images of faces. The images were self-submitted to a US dating site and the users also stated their sexuality. The dataset included different numbers of images for each user, but of the users that had at least one image in the dataset (i.e. all users), 50% had stated their sexuality as "gay" and 50% as "straight" - this, for both men and women. The research concluded that it was possible to train an image classifier, a CNN, to identify sexuality from facial features. The authors justified this with a theory about the relation between the hormonal environment in the womb and both facial features and sexual orientation.

I hold this up as a perfect example of why this assumption - that you can control the distribution in the real world because you can choose the attributes of the dataset you wish to work with - is very risky. And I don't just mean the social implications of that reprehensible publication. The researchers chose a uniform distribution for their gay and straight subjects. The paper itself cites a 7% estimate of the chance that a person is "gay" - so what, exactly, did their classifier learn to represent? That "gay features" and "straight features" are equally distributed?
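To put rough numbers on that (the sensitivity/specificity figures below are hypothetical, purely to illustrate the base-rate effect): a classifier that looks good on the 50/50 dataset can be wrong about most of its positive predictions at a 7% base rate.

    # Hypothetical figures, just to illustrate the base-rate effect.
    sensitivity, specificity = 0.85, 0.85   # assumed performance on the balanced dataset
    for prevalence in (0.50, 0.07):
        tp = sensitivity * prevalence
        fp = (1 - specificity) * (1 - prevalence)
        ppv = tp / (tp + fp)                # P(actually positive | classified positive)
        print(f"prevalence {prevalence:.0%}: precision {ppv:.0%}")

At 50% prevalence the precision is 85%; at 7% it drops to about 30%, with the exact same classifier.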

In any case, "fully and unambiguously guarantee" is not something that happens outside science fiction. Not in machine learning, not in physics or medicine, not in any subject of knowledge. Guarantees always come with assumptions. If assumptions are violated, or if they were bad assumptions to begin with, guarantees fly out the window in haste.


I used to believe that generalization was about some nice concentration bound that one can design carefully in terms of the optimization algorithm. After many years of practice, I began to firmly question the basic assumption about data distributions. For many real applications, assuming the data come from the same unknown true distribution is fundamentally incomplete. In the real world, information is always partial and adversaries always exist. It is never a closed world where adding samples eventually has diminishing value. Instead, it's an open world where long-tail phenomena are often observed. It is never about a single objective that the optimization algorithm cares about. Instead, it's entangled with many coupled but conflicting interests.

Existing statistical learning theory has a hard time addressing questions like combinatorial generalizability. Hopefully, the newer stability analysis and adversarial analysis can lead us somewhere theoretically.


I’ve had the opposite experience. After finishing grad school and working in industry for a long time now, applying ML to computer vision, NLP, causal inference and quant finance for production systems, I find that the vast, vast majority of problems are such that the training and test sets are defined by very precise business operating constraints, and it’s known with a high degree of certainty that the future examples the model is required to generalize to will also come from the same distribution reflected in model training. Any differences are so minor as to be practically ignorable, and it takes really special evidence to show the super rare cases when they cannot be ignored.

(Those cases are also most often just solved by collecting new/more training & testing data and re-training / fine-tuning / calibrating the model to the slight shift in the distribution, and require no formal modeling techniques for handling tricky differences between distributions.)
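(A minimal sketch of that re-train / fine-tune fix - the model path, shapes and data below are all made-up stand-ins, not a real pipeline:)

    import numpy as np
    from tensorflow import keras

    # Hypothetical stand-ins for freshly collected, labelled examples from the
    # (slightly) shifted deployment distribution.
    new_X = np.random.rand(256, 32).astype("float32")
    new_y = np.random.randint(0, 10, size=256)

    # Reload the deployed model (path is made up), fine-tune briefly with a
    # small learning rate, then re-validate before shipping the update.
    model = keras.models.load_model("production_model.h5")
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(new_X, new_y, epochs=3, validation_split=0.1)
    model.save("production_model_recalibrated.h5")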

This has been true across commercial ML products and services at large tech companies, small startups, quant finance firms, defense research labs, and academic labs. Across projects including face recognition, a high-traffic search engine and related services like machine translation, trending models, search ranking, dimensionality reduction models in time series forecasting, reverse image search, audio similarity models, advertising models, and many more.

It just turns out that in industry, the problems that have meaningful utility to the business are almost always defined by intrinsic constraints that nearly guarantee there won’t be a meaningful difference in the distribution of future prediction-time examples compared with training & testing distribution.


I find all this very hard to square with my understanding. I mean, what kind of "business operating constraints" stay the same over months or years? Or do you mean that, when constraints change, you just train a new model?

I think basically what you're saying is "we choose the data we want to represent, so we know when we're representing it well". But that's just what I say above, about measuring progress by hitting the targets you choose, completely arbitrarily.

It's as if I shot a bunch of arrows and then painted targets around them, then claimed I'm a great archer because I always hit bullseye.

I don't think that's the result of some fundamental difference in understanding between academia and industry, either. Perhaps, in industry, assuming that reality will comply with your assumptions helps achieve business aims, but that's not because the systems work. It's because they can be made to look like they do, and there's no good way to show they don't, as there's no good way to show they do.


It sounds like you haven’t worked on many real world applied problems.


I would add that I have had my fair share of industry-scale problems where the labeled data is not drawn from the same distribution the system will be deployed on. But as you pointed out earlier, there are theoretical and practical methods to deal with this. For someone who wants to get familiar with this territory, domain adaptation, transfer learning, transductive learning, semi-supervised learning, etc. would be some of the keywords to seed an initial literature search with.


If you mean that measuring generalisation is a poorly defined problem and we need to develop new theoretical tools- I'm totally with you. Unfortunately, in machine learning, the ratio of theory to practice is 1:100, or so.


Yeah, I agree that the gaydar example illustrates your point, but it’s just such an egregious outlier that overall it just further refutes the idea that this distinction matters much.

About zero serious, mainstream papers or open-source libraries or business case studies each year resemble the poor experiment design from that gaydar paper. It’s a silly one to emphasize in terms of generalization error except I guess for basic 101 pedagogy for people who have never worked with models like this before.


It's an especially egregious example, but everybody makes assumptions that they can't guarantee will hold.


The first part of your comment doesn’t seem related to the second part, and neither part seems related to the question of whether most applied problems do or do not benefit from making a distinction between the sampling distribution used for generating training and test data and the “true” distribution.


You are technically somewhat correct (one of the better forms of correct), but it feels like criticism made more for the sake of criticism.

The assumption that train and test distributions are similar has served us phenomenally well. There are indeed cases where it is violated (I grapple with these violations every day as part of my job), but I would still claim that approximate equality of test and train covers a substantial volume of potential applications.

When test and train do differ, there are formalisms to deal with that. You could look at transductive learning and transfer learning literature.

Want to mention again that you are not wrong, but that for a large section of applications the distinction you draw attention to does not matter. When it does matter there are theoretical and practical tools, but of course these are nowhere near as mature as the conventional ones.


>> The assumption that train and test distributions are similar has served us phenomenally well.

To clarify, when I say "the (entire) dataset" I mean all the data that one has at the start of a research project. What I call the "training set" and the "test set" are partitions of that dataset, used to train and validate a model. The partitions might be chosen once, or the entire dataset may be split randomly in training and test partitions during cross-validation, etc.

Now, what I say above is that the distribution of the dataset is known - you can count the instances in it, etc. What is not known is the distribution of data in the real world, where the dataset was sampled from. Most of the time, there is no way to tell whether a dataset is representative of the real world. Accordingly, there is no guarantee that an accurate model of the dataset is an accurate model of the real world.

What's more, most of the time you can't measure the performance of a model in the real world- because you don't know the ground truth. Again, you can estimate- but you can't calculate.

I think, by "test set" you may mean the real world. In that case, I don't disagree with what you say- assuming that a dataset is representative of the real world has "served us well". But, who are "we" and what is our purpose? If we are talking about machine learning research [1] then yes, that assumption has served us well: we can submit papers showing results on specific datasets and have them accepted, because everyone understands that this is the best anyone can do. And it would be tedious and pointless to accompany every result with a disclaimer about PAC learning assumptions- "this result is only meaningful if we accept that our dataset came from the same distribution as the real world" etc.

However, my concern is that all this only serves to obscure the fact that we can't really tell how much real progress we are making, as a field. I am going to side with the Great Satan, Noam Chomsky, on this one- our progress until now can only be considered progress if we accept the definition of progress that we have chosen ourselves, that of beating benchmarks and solving datasets, both chosen completely arbitrarily and with no guarantee that they have anything to do with, again, the real world.

Now we can have a big, post-modern fight about what is this "real world" I speak of, if you like :)

___________

[1] I'm on the second year of a machine learning PhD, though I should say it's symbolic, logic-based machine learning i.e. Inductive Logic Programming- which perhaps accounts for my preoccupation with subjects the mainstream of machine learning research today probably finds irrelevant.


> And it would be tedious and pointless to accompany every result with a disclaimer about PAC learning assumptions- "this result is only meaningful if we accept that our dataset came from the same distribution as the real world" etc.

That assumption is not necessary; as I mentioned, and as mlthoughts mentions above, we do have tools to deal with discrepancies between the test and train distributions.

BTW I wasn't talking about research but about meat-and-potatoes applications.

I think it is intellectually lazy to raise questions that even the person asking the question, and everyone else, knows cannot be answered, especially when we don't define things (example: 'real world'). It's easy-peasy to ask for things that are epistemologically impossible. On the other hand, working within the limitations of impossibility theorems, under assumptions that seem reasonable, to obtain theoretical results as well as practical algorithms that solve real problems -- now that is a valuable proposition.

> [1] I'm on the second year of a machine learning PhD

Goes without saying. Those who have been through that, can spot our old enlightened newbie self from a mile away :)


>> I think it is intellectually lazy to raise questions that even the person asking the question, and everyone else, knows cannot be answered, especially when we don't define things (example: 'real world'). It's easy-peasy to ask for things that are epistemologically impossible. On the other hand, working within the limitations of impossibility theorems, under assumptions that seem reasonable, to obtain theoretical results as well as practical algorithms that solve real problems -- now that is a valuable proposition.

No, I believe that real progress is definitely possible, "epistemologically". That is how science advances: by solving one "impossible" problem after the other. Of course, that's not how industry rolls. In industry, if you have something that makes money, you sell it, whether it works or not, and you don't bother with silly stuff like theoretical justifications and whatnot. See Waymo's driverless cars (with drivers), Tesla's Autopilot (that can't auto-pilot), etc.

As to what is the real world- "everything that isn't in your data set" will do. My earlier comment was just me being gregarious.

And this:

>> Goes without saying. Those who have been through that, can spot our old enlightened newbie self from a mile away :)

-is distasteful. I was having a conversation and you were having a pissing contest. My bad, I guess.

No, I'm not a newbie, you are not a more experienced version of me, and I'm probably older than you. Wrong assumptions. Again.


You missed the keyword -- "enlightened".

You might not agree, but what you demonstrate is very common among new PhD students (and I would argue that is a desirable trait). I sure had that in my days of yore.

Just clarifying that by 'impossible' I meant a mathematical impossibility theorem, not some notion of "that seems very implausible".

Mathematical science does not progress by making impossible statements possible. Rather, it works by scoping out and relaxing things (conditions, assumptions and axioms).

For example, one does not deal with Russell's paradox by removing the paradox, but by scoping things out (in this case putting restrictions on how you define your sets) so that the paradox does not arise in the restricted, yet practically useful, setting. On the other hand, if someone keeps bringing up 'whoa, Russell's paradox, whoa, Russell's paradox, that's a limitation' (albeit a well known and obvious one), that qualifies as newbie behavior, although an enlightened one.

Inference in an arbitrary graphical model is NP-hard. It would be lazy to come up with a corner case where an inference algorithm with a short running time will fail. We make progress by accepting that for arbitrary graphs inference is a hard problem and then asking what is a practically relevant subset of graphs where one can indeed give an efficient solution. Pointing out obvious failure cases is exactly what you said -- a [lazy and vapid] pissing contest. It does no one any good, unless the failure case brought up is of a new and unknown type that provides new insight. If there was any new insight in the cases you brought up, I definitely missed it.


Thank you Goblyn Queenne for making what I was trying to say actually sound credible. I'd be tempted to even ask whether a "probability distribution" could describe the real world at all. Certainly no time-invariant probability distribution could describe this world.


I think what 'joe_the_user is saying is that one should specify exactly which "distribution" it is that "Nature used to produce the training set". E.g. "images of cat faces taken indoors and cropped to square" is a pretty specific set.


This is almost always specified completely in research papers that address this issue. I work professionally in computer vision and image processing and have never once encountered a mainstream research paper where the definition of the data generating process under study was not fully and unambiguously clear.

Can you point to examples where this was problematically under-specified?


Of course you are right, but these are the parts of papers not read or understood by journalists, CEOs or the public. This is leading to inappropriate and unsafe applications, and will lead to a backlash and damage. We should be at least as clear as the medics: promising lab results may deliver in the future, but until field tests are done we can't be sure.

Unfortunately, the lab-to-field practice in ML is underspecified.


Yes, but this specific technical research paper doesn’t suffer these issues. So the original comment seemed wrongly placed here.


> Can you point to examples where this was problematically under-specified?

Not really. My comment was about how I interpreted joe_the_user's statement; I can't think of an ML or CV research paper I've read to which this criticism would apply, but I definitely can think of such news articles and startup landing pages/marketing material.


I agree about random news articles / startup landing pages where the term is thrown around imprecisely. Just seemed odd to see the original comment saying it about this linked Bayesian stats article, which doesn’t suffer this problem.


I would tend to agree it generalizes for the given dataset; but to stretch that and say it generalizes to an orders-of-magnitude larger dataset of "similar" data that encompasses the smaller portion you trained on - then, as he pointed out, you have only two options: either you need so much hard work collecting properly sampled and mapped data that you could model the data by hand (and then there's no need for NNs), or you just pray that the much bigger, unknown portion of the data really is "similar" to your training portion. And praying, in this sentence, is ultimately just making an estimate.




