HMC+NUTS not only explores the posterior better, but also makes it much easier to select things like step sizes, acceptance rates, and simulation time (see the NUTS sampler).
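As a hedged illustration of what NUTS automates: plain HMC needs a hand-tuned step size and number of leapfrog steps, which NUTS adapts for you. Here is a minimal, purely illustrative 1-D HMC sketch (all names and tuning numbers are assumptions, not any library's API):

```python
import numpy as np

def hmc_sample(logp_grad, logp, x0, n_samples=2000, step_size=0.1,
               n_leapfrog=20, seed=0):
    """Plain HMC for a 1-D target; NUTS automates the step_size /
    trajectory-length tuning done by hand here."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        p = rng.normal()                      # resample momentum
        x_new, p_new = x, p
        # Leapfrog integration of the Hamiltonian dynamics.
        p_new += 0.5 * step_size * logp_grad(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += step_size * p_new
            p_new += step_size * logp_grad(x_new)
        x_new += step_size * p_new
        p_new += 0.5 * step_size * logp_grad(x_new)
        # Metropolis accept/reject on the Hamiltonian.
        log_accept = (logp(x_new) - 0.5 * p_new**2) - (logp(x) - 0.5 * p**2)
        if np.log(rng.uniform()) < log_accept:
            x = x_new
        samples.append(x)
    return np.array(samples)

# Target: standard normal, logp = -x^2/2 (up to a constant).
samples = hmc_sample(lambda x: -x, lambda x: -0.5 * x**2, x0=0.0)
print(samples.mean(), samples.std())
```

NUTS replaces the fixed `n_leapfrog` with an adaptively built trajectory (stopping when it starts to "U-turn") and tunes `step_size` during warmup toward a target acceptance rate.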
Shouldn't they be discussing "generalizing what qualities"? Wouldn't image generalization be different from chat-script generalization, and so forth?
You can define other notions of generalization that try to be more than this, like generalizing across multiple tasks, or an algorithm capable of playing any turn-based perfect-information game, etc., but that is not usually what people mean when they talk about the formal statistical concept of generalization error.
To be more precise, that a dataset came from the "real" distribution of data in the real world is an assumption, and one that is impossible to verify, at that. In practice, all we can know for sure is the distribution of data in a dataset. There is such a thing as sampling error, after all, and it's a big pain in every field that uses sampling to collect evidence.
So, in practice again, when we say that a model "generalises well" we mean that it performs well in cross-validation, where it's trained and tested on different partitions of the entire dataset in successive steps. Unfortunately, that's a very poor measure of generalisation. More to the point, it's an estimate of the model's real-world generalisation, and how good an estimate it is, is very hard to know.
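As a toy sketch of the point that CV scores are estimates: on a synthetic "world" (where, unlike reality, we can actually compute a population accuracy), a k-fold CV score on a small sample only approximates the population figure. Every name and number below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "real world": two 1-D Gaussian classes.
def draw(n):
    y = rng.integers(0, 2, n)
    x = rng.normal(loc=2.0 * y, scale=1.0)   # class 0 ~ N(0,1), class 1 ~ N(2,1)
    return x, y

def fit_threshold(x, y):
    # Midpoint-of-class-means classifier: predict 1 if x > threshold.
    return 0.5 * (x[y == 0].mean() + x[y == 1].mean())

def accuracy(thr, x, y):
    return ((x > thr).astype(int) == y).mean()

# The small dataset we actually get to see.
x, y = draw(200)

# 5-fold cross-validation estimate of generalisation accuracy.
k = 5
idx = rng.permutation(len(x))
folds = np.array_split(idx, k)
cv_scores = []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    thr = fit_threshold(x[train], y[train])
    cv_scores.append(accuracy(thr, x[test], y[test]))
cv_estimate = np.mean(cv_scores)

# "Real-world" accuracy, computable only because the world here is synthetic.
xw, yw = draw(200_000)
true_acc = accuracy(fit_threshold(x, y), xw, yw)
print(f"CV estimate: {cv_estimate:.3f}, population accuracy: {true_acc:.3f}")
```

When the training sample really is drawn i.i.d. from the population, as here, the CV estimate lands close to the population figure; the whole argument in this thread is about what happens when that sampling assumption fails.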
So while in some cases the distinction you point out does matter, it’s pretty rare, certainly rare enough that it doesn’t cause issues with generalization error analysis of the majority of experiments.
For most domain applications like computer vision tasks, natural language tasks, etc., it would be pedantic and disingenuous to call test set generalization error a mere “estimate” of the real generalization error.
Beyond that, in areas like PAC learning you can prove results about optimal algorithms, such as SVMs being maximum-margin classifiers, with fully theoretical generalization-error bounds.
Overall, except for certain special case applications, I think the comments you’re indicating about generalization turn out not to matter much in practice.
Obviously, training and test data in cross-validation partitions have a common distribution, but that is the distribution associated with the sampling process, not that of whatever natural process (we assume) generates the data in the real world. So my point is: once you move from the lab to the real world, any generalisation error (resp. accuracy) you calculated by cross-validation in the lab becomes a mere estimate, because you don't know the real-world distribution and PAC learning no longer guarantees you anything.
Look- your dataset is not the real world, your sampling method is not the natural process that generates the events you observe and the distribution of data in the real world is not the distribution of data in your data set. If it was- you wouldn't need machine learning, would you? You'd already have a model of the real world process you're interested in.
I'm afraid the best you can ever hope to do with generalisation to real-world data is estimate it, and nothing more.
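To make the lab-vs-world gap concrete: a toy classifier tuned on a "lab" sampling distribution can degrade badly when the deployment distribution shifts. The shift magnitude, priors, and model below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Lab" dataset: the sampling process gives a balanced 50/50 class prior,
# with class 0 ~ N(0,1) and class 1 ~ N(2,1).
y_lab = (rng.uniform(size=5000) < 0.5).astype(int)
x_lab = rng.normal(loc=2.0 * y_lab)

# Midpoint-of-class-means threshold, fit on lab data.
thr = 0.5 * (x_lab[y_lab == 0].mean() + x_lab[y_lab == 1].mean())
acc_lab = ((x_lab > thr).astype(int) == y_lab).mean()

# "Field": the natural process is heavily imbalanced (7% positives)
# AND covariate-shifted: every feature value is moved by +0.8.
y_field = (rng.uniform(size=5000) < 0.07).astype(int)
x_field = rng.normal(loc=2.0 * y_field + 0.8)
acc_field = ((x_field > thr).astype(int) == y_field).mean()

print(f"lab accuracy: {acc_lab:.3f}, field accuracy: {acc_field:.3f}")
```

The lab cross-validation number says nothing about the field number unless the sampling assumption holds; here a modest shift costs the classifier a large chunk of its accuracy.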
You seem to care a lot that these approaches rely on assumptions, but I find that silly.
Further in _most_ specific applications of e.g. CNNs for computer vision tasks or RNNs for time series or sequential prediction tasks, you _do_ control the entire sampling distribution (e.g. user photos with exactly satisfied criteria), and you really can fully and unambiguously guarantee that the training and testing set come from the “true” distribution you care about. And in many other cases the degree to which that assumption fails is so minor that it makes no practical difference.
Minor assumption failures can suddenly matter a lot.
The distinction between a lab result and a field trial has been somewhat lost in ML; there will be hell to pay in the long term...
Indeed. In fact this is nowadays painfully obvious in Fisher's maximum-likelihood-based estimators: they do not have nice 'continuity' properties that would give graceful degradation. We figured out how to deal with this once the problem was recognized.
However, we see the same problem raising its head in some of the deep neural net literature. Tiny input variations often do not cause tiny distortions in the output; that's cause for some legitimate concern.
Thanks, I'll keep that in mind.
>> Further in _most_ specific applications of e.g. CNNs for computer vision tasks or RNNs for time series or sequential prediction tasks, you _do_ control the entire sampling distribution (e.g. user photos with exactly satisfied criteria), and you really can fully and unambiguously guarantee that the training and testing set come from the "true" distribution you care about.
One of my pet peeves is the Deep Gaydar Paper from Stanford.
The authors trained a CNN to identify sexuality (gay or straight) from images
of faces. The images were self-submitted to a US dating site and the users
also stated their sexuality. The dataset included different numbers of images
for each user, but of the users that had at least one image in the dataset
(i.e. all users) 50% had stated their sexuality as "gay" and 50% as
"straight"- this, for both men and women. The research concluded that it was
possible to train an image classifier, a CNN, to identify sexuality from
facial features. The authors justified this with a theory relating facial
features to the hormonal environment in the womb.
I hold this up as a perfect example of why this assumption, that you can
control the distribution in the real world because you can choose the
attributes of the dataset you wish to work with, is very risky. And I don't
mean the social implications of that reprehensible publication. The
researchers chose a uniform distribution for their gay and straight subjects.
The paper itself cites a 7% estimate of the chance that a person is "gay". So
what, exactly, did their classifier learn to represent? That "gay features"
and "straight features" are equally distributed in the population?
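The base-rate mismatch can be made concrete with Bayes' rule: a score produced under an artificial 50/50 training prior has to be reweighted by the population prior (the paper's cited 7%) before it means anything in the field. A small sketch; the function name and the 0.9 example score are illustrative assumptions:

```python
def adjust_prior(p_balanced, train_prior=0.5, field_prior=0.07):
    """Reweight a posterior from a balanced training prior to a field prior.
    Odds form of Bayes: field odds = balanced odds * (field prior odds /
    training prior odds)."""
    odds = (p_balanced / (1 - p_balanced)) * \
           (field_prior / (1 - field_prior)) / (train_prior / (1 - train_prior))
    return odds / (1 + odds)

# A seemingly confident 0.9 score on balanced data drops to roughly 0.40
# once the 7% base rate is accounted for.
print(adjust_prior(0.9))
```

This correction only fixes the prior; it cannot fix features that were confounded with how the balanced dataset was collected in the first place.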
In any case, "fully and unambiguously guarantee" is not something that happens
outside science fiction. Not in machine learning, not in physics or medicine,
not in any subject of knowledge. Guarantees always come with assumptions. If assumptions are violated, or if they were bad assumptions to begin with, guarantees fly out the window in haste.
Existing statistical learning theory is having a hard time addressing questions like combinatorial generalizability. Hopefully, the newer stability analysis and adversarial analysis can lead us somewhere theoretically.
(Those cases are also most often just solved by collecting new or more training and testing data and re-training / fine-tuning / calibrating the model to some slight shift in the distribution, and require no formal modeling of tricky generalization-distribution differences.)
This has been true across commercial ML products or services at large tech companies, small startups, quant finance firms, defense research labs, and academic labs. Across projects ranging from face recognition, high-traffic search engine and related services like machine translation, trending models, search ranking, dimensionality reduction models in time series forecasting, reverse image search, audio similarity models, advertising models, and many more.
It just turns out that in industry, the problems that have meaningful utility to the business are almost always defined by intrinsic constraints that nearly guarantee there won’t be a meaningful difference in the distribution of future prediction-time examples compared with training & testing distribution.
I think basically what you're saying is "we choose the data we want to represent, so we know when we're representing it well". But that's just what I say above, about measuring progress by hitting targets you chose completely arbitrarily.
It's as if I shot a bunch of arrows and then painted targets around them, then claimed I'm a great archer because I always hit bullseye.
I don't think that's the result of some fundamental difference in understanding between academia and industry, either. Perhaps, in industry, assuming that reality will comply with your assumptions helps achieve business aims, but that's not because the systems work. It's because they can be made to look like they do and there's no good way to show they don't, as there's no good way to show they do.
About zero serious, mainstream papers or open-source libraries or business case studies each year resemble the poor experiment design from that gaydar paper. It’s a silly one to emphasize in terms of generalization error except I guess for basic 101 pedagogy for people who have never worked with models like this before.
The assumption that train and test distributions are similar has served us phenomenally well. There are indeed cases where they are violated, I grapple with these violations everyday as a part of my job, but I would still claim that approximate equality of test and train covers a substantial volume of potential applications.
When test and train do differ, there are formalisms to deal with that. You could look at transductive learning and transfer learning literature.
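One such formalism is importance weighting for covariate shift: if you know (or can estimate) the density ratio between the test and training distributions, you can reweight training samples to estimate test-distribution quantities. A toy sketch with fully known densities; in practice the ratio must itself be estimated, which is where the hard part lives:

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Training distribution p = N(0,1); test distribution q = N(1,1).
x_train = rng.normal(0.0, 1.0, 100_000)

# Density-ratio weights w(x) = q(x) / p(x).
w = normal_pdf(x_train, 1.0, 1.0) / normal_pdf(x_train, 0.0, 1.0)

f = x_train ** 2                    # any statistic of interest
naive = f.mean()                    # estimates E_p[x^2] = 1
shifted = np.average(f, weights=w)  # estimates E_q[x^2] = 1 + 1^2 = 2
print(naive, shifted)
```

The weighted average recovers the test-distribution expectation from training samples alone, at the cost of increased variance when the two distributions overlap poorly.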
Want to mention again that you are not wrong, but that for a large section of applications the distinction that you draw attention to does not matter. When it does matter there are theoretical and practical tools, but of course these are nowhere as mature as the conventional ones.
To clarify, when I say "the (entire) dataset" I mean all the data that one has
at the start of a research project. What I call the "training set" and the
"test set" are partitions of that dataset, used to train and validate a model.
The partitions might be chosen once, or the entire dataset may be split
randomly in training and test partitions during cross-validation, etc.
Now, what I say above is that the distribution of the dataset is known - you
can count the instances in it, etc. What is not known is the distribution of
data in the real world, where the dataset was sampled from. Most of the time,
there is no way to tell whether a dataset is representative of the real world.
Accordingly, there is no guarantee that an accurate model of the dataset is an
accurate model of the real world.
What's more, most of the time you can't measure the performance of a model in
the real world- because you don't know the ground truth. Again, you can
estimate- but you can't calculate.
I think, by "test set" you may mean the real world. In that case, I don't
disagree with what you say- assuming that a dataset is representative of the
real world has "served us well". But, who are "we" and what is our purpose? If
we are talking about machine learning research, then yes, that assumption
has served us well: we can submit papers showing results on specific datasets
and have them accepted, because everyone understands that this is the best
anyone can do. And it would be tedious and pointless to accompany every result
with a disclaimer about PAC learning assumptions- "this result is only
meaningful if we accept that our dataset came from the same distribution as
the real world" etc.
However, my concern is that all this only serves to obscure the fact that we
can't really tell how much real progress we are making, as a field. I am going
to side with the Great Satan, Noam Chomsky, on this one- our progress until
now can only be considered progress if we accept the definition of progress
that we have chosen ourselves, that of beating benchmarks and solving
datasets, both chosen completely arbitrarily and with no guarantee that they
have anything to do with, again, the real world.
Now we can have a big, post-modern fight about what is this "real world" I
speak of, if you like :)
I'm in the second year of a machine learning PhD, though I should say it's
symbolic, logic-based machine learning, i.e. Inductive Logic Programming,
which perhaps accounts for my preoccupation with subjects that the mainstream
of machine learning research today probably finds irrelevant.
That assumption is not necessary, as I mentioned and mlthoughts mentions above, we do have tools to deal with discrepancies between test and train distribution.
BTW, I wasn't talking about research but about meat-and-potatoes applications.
I think it is intellectually lazy to raise questions that even the asker, and everyone else, knows cannot have an answer, especially when we don't define our terms (example: 'real world'). It's easy-peasy to ask for things that are epistemologically impossible. On the other hand, working within the limitations of impossibility theorems, under assumptions that seem reasonable, to obtain theoretical results as well as practical algorithms that solve real problems -- now that is a valuable proposition.
>  I'm on the second year of a machine learning PhD
Goes without saying. Those who have been through that, can spot our old enlightened newbie self from a mile away :)
No, I believe that real progress is definitely possible, "epistemologically".
That is how science advances- by solving one "impossible" problem after the
other. Of course, that's not how the industry rolls. In industry, if you have
something that makes money, you sell it, whether it works or not, and you don't
bother with silly stuff like theoretical justifications and whatnot. See Waymo's
driverless cars (with drivers) and Tesla's Autopilot (that can't auto-pilot).
As to what the real world is: "everything that isn't in your data set" will
do. My earlier comment was just me being gregarious.
>> Goes without saying. Those who have been through that, can spot our old
enlightened newbie self from a mile away :)
-is distasteful. I was having a conversation and you were having a pissing
contest. My bad, I guess.
No, I'm not a newbie, you are not me but more experienced and I'm probably
older than you. Wrong assumptions. Again.
You might not agree, but what you demonstrate is very common among new PhD students (and I would argue that it is a desirable trait). I sure had it in my days of yore.
Just clarifying that by 'impossible' I meant a mathematical impossibility theorem, not some notion of "that seems very implausible".
Mathematical science does not progress by making impossible statements possible. Rather, it works by scoping out and relaxing things (conditions, assumptions and axioms).
For example, one does not deal with Russell's paradox by removing the paradox, but by scoping things out (in this case, putting restrictions on how you define your sets) so that the paradox does not arise in the restricted, yet practically useful, setting. On the other hand, if someone keeps bringing up 'whoa, Russell's paradox, that's a limitation' (albeit a well-known and obvious one), that qualifies as newbie behavior, although an enlightened one.
Inference in an arbitrary graphical model is NP-hard. It would be lazy to come up with a corner case where an inference algorithm with a short running time will fail. We make progress by accepting that inference on arbitrary graphs is a hard problem and then asking what is a practically relevant subset of graphs for which one can indeed give an efficient solution. Pointing out obvious failure cases is exactly what you said -- a [lazy and vapid] pissing contest. It does no one any good, unless the failure case brought up is of a new and unknown type that provides new insight. If there was any new insight in the cases you brought up, I definitely missed it.
Can you point to examples where this was problematically under-specified?
Unfortunately, the lab-to-field practice in ML is underspecified.
Not really. My comment was about how I interpreted joe_the_user's statement; I can't think of a ML or CV research paper I've read to which this criticism would apply, but I definitely can think of such news articles and startup landing pages/marketing material.