
I'm a self-described Bayesian* at my day job, but the author needs to do better to convince me that the Bayesian approach is worth it in the deep learning space. As far as I can tell, deep learning folks don't give two shits about uncertainty intervals, much less marginalization. All that matters is minimizing that test error as fast as possible. So what if you get a posterior for each parameter... Who cares about the parameters in a neural network as long as the predictions seem well calibrated?

The most convincing rationale for adopting a Bayesian perspective is contained in the collected works of Jim Berger, which I see are cited by the author... but not actually used in the manuscript.

* Of course, a Bayesian is just a statistician that uses Bayesian techniques even when it's not appropriate -- Andrew Gelman




> As far as I can tell, deep learning folks don't give two shits about uncertainty intervals, much less marginalization. All that matters is minimizing that test error as fast as possible. So what if you get a posterior for each parameter... Who cares about the parameters in a neural network as long as the predictions seem well calibrated?

Well, the one thing that they ought to actually give a shit about is generalization. All else (Bayesian or not) is only in service of that.

Deep learning systems require too much data (and often not raw data), crazy amounts of compute, and still fail under comically irrelevant tweaks of the problem (eg: adversarial noise).

Yes, the PAC learning model starts off by assuming that the test set has the same “data distribution” as the train set. In practice it's obviously not exactly the same (otherwise learning would be kind of pointless), but deep learning seems incredibly fragile in its ability to generalize. It seems to require crazy amounts of domain randomization during training to avoid overfitting.

E.g.: the recent OpenAI Rubik’s cube demonstration (while an impressive achievement) was an indication of the fundamental silliness of deep RL (at least as of now). To solve the problem, they had to use classical Rubik’s cube solving algorithms, completely model the physics of the situation (including friction on the surfaces!), and use that model to generate training data while varying every parameter for domain randomization. Encoding all that structured prior knowledge is simply infeasible for even a slightly more complicated situation. And after all that, the robot had something like a 20% success rate IIRC. Impressive work on the team’s part, no doubt... but the conceptual framework does not inspire confidence.


I doubt that the limits to how well NNs generalize will be solved by mere Bayesianification. NNs generalize poorly, as does all of ML, because the strategy is to model the data, not the world.

Since data stands in as a proxy for the environment, a NN will only generalize insofar as it has the data to do so. There is no 3rd dimension in an infinite number of 2D photographs -- and no "Bayesianification" will put it there.


> because the strategy is the modelling of the data, not of the world.

But that is how all learning works. Even the most advanced physics models are verified against experimental data. If data is found that contradicts the model, the model is deemed faulty and is adjusted to match the insights gained from the new data. DL just automates that approach, aiming for the minimum gap between 'theory' and 'data', where the theory is expressed as a set of weights, thus encoding 'insight' in a - for now - non-reversible way. Just as wetware typically has no clue how it arrived at a certain conclusion, we use the word intuition and its synonyms when we actually mean 'guess based on past experience'.


That is not how human learning works. Manuals don't contain hundreds of thousands of examples to allow students to draw their own conclusions.

And for inventing new physics, there are two very different approaches. One is perhaps similar to machine learning, where people look at an existing model, find experimental data that doesn't match the model, and adjust the model slightly until it matches the new data. But another mode is where you have a model with certain mathematical properties, then you discover new mathematics that gives you more properties, and only then ask for more data to check those extensions.


How does human learning work, then? Do you have pointers to literature on alternative formal learning theories? I am genuinely curious :)


> That is not how human learning works.

As someone who has a baby in the house, I assure you that it is indeed how a human learns and refines motor skills, at the very least - by repeated experiment and play, with thousands of permutations of position, speed, weight, and situation. They also try to infer causal models [1].

Just what do you think 'play' is for a child, if not (in part) an opportunity to safely explore its surroundings and refine mental models?

[1] https://www.nsf.gov/news/news_videos.jsp?cntn_id=125575&medi...


The GP's post was not about early development, but about all human learning, including higher learning and advanced physics. Sure, there are modes of learning that are pure experimentation, but there are many other modes of human learning.

Even children, after some point, start developing a model and then designing experiments to test hypotheses to refine that model, which is not at all how machine learning works.


Humans were learning long before manuals.


Sure, manuals were just a very obvious example. Teacher-pupil and master-apprentice relationships are similar to manuals though, and these were likely present even in primitive human societies. Even in primates, copying another individual's actions is a major mode of learning, instead of random exploration.


Part of the problem is that we don't have any way of translating prior "theory" information into a prior over the weights of a neural network, so we are stuck assuming some sort of normal prior distribution over the weights. It's unclear whether the posterior over the weights has any particularly useful meaning.
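For what it's worth, here's a minimal sketch (my own illustration, assuming PyTorch; not anything from the paper) of what that default choice amounts to: an isotropic Gaussian prior over every weight, whose log-density is just an L2 penalty, so it carries essentially no "theory" at all.

    import torch
    import torch.nn as nn

    def log_prior(model: nn.Module, sigma: float = 1.0):
        # log N(w | 0, sigma^2) summed over all weights, up to an additive constant.
        return sum(-0.5 * (p ** 2).sum() / sigma ** 2 for p in model.parameters())

    # MAP training would maximize log_likelihood(data) + log_prior(model),
    # i.e. ordinary training with weight decay proportional to 1 / sigma^2.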


> Deep learning systems require too much data (and often not raw data), crazy amounts of compute, and still fail under comically irrelevant tweaks of the problem (eg: adversarial noise).

Many of these criticisms generalize to 'first make it work, then make it better, then make it fast'.

Being able to solve some of these problems at all is the kicker, whether or not it is efficient is at the moment not nearly as important as being able to solve them in principle.

I suspect that in the very near future we will see something that extracts the actual insights that make a DL model work, which can then be used to power an analytical solution that performs a lot faster than the model itself, and hopefully with more resistance to noise.

I've already seen several glimpses of this and I'm hoping for some kind of breakthrough where DL is used to continue where feature engineering left off.


> Many of these criticisms generalize to 'first make it work, then make it better, then make it fast'. Being able to solve some of these problems at all is the kicker, whether or not it is efficient is at the moment not nearly as important as being able to solve them in principle.

There are deep (not universally applicable) assumptions in that framing... It works only if you didn't "make it work" in a way that fundamentally limits/handicaps the future steps, or gives no insights in that direction.

Based on my understanding of the subject, I think that much recent progress in deep learning is less of a breakthrough than it's commonly made out to be, and that there are fundamental conceptual reasons why the approach is limited, almost like achieving the first step of drawing an owl: https://i.kym-cdn.com/photos/images/newsfeed/000/572/078/d6d...

But the jury is out on that one, and I think there is enough room for reasonable people to differ in their opinions. ¯\_(ツ)_/¯


In image classification problems, for instance (which cover a lot of ground), the progress is undeniable. But we are still a ways away from extracting the essence and getting to the point where what biology does effortlessly can be done on similar power budgets by a combination of software and hardware.


Or language models. If you think deep learning, or any of the AI (no, not AGI, whatever that even means) of the last decade, is crap/useless, then you probably never tried Google Translate before the advent of these breakthroughs. It’s not perfect, but it’s undeniably useful as it is.


> As far as I can tell, deep learning folks don't give two shits about uncertainty intervals, much less marginalization.

Maybe folks making Snapchat filters don't care, but this is absolutely vital if you're doing something with a low margin for error (self-driving cars, financial work, etc.). If your neural net can tell you when it thinks it might be making an error, that information is invaluable for keeping your system from messing up big time.

The problem is that none of the Bayesian methods for machine learning work well and quickly (please do correct me if I'm wrong--I will become my boss's favorite person). If they did, many many practitioners would be very excited to use them.


After a bit of searching I found something that appears to work both well and quickly: http://www.cs.ox.ac.uk/people/yarin.gal/website/blog_3d801aa...

The basic idea is brilliant: take a deep neural network with dropout, keep the dropout active at evaluation time, run it several times, and deduce the uncertainty from the observed variance (with a bit of math).

This lets you add an uncertainty estimation to the output of any deep neural network that uses dropout.
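To make that concrete, here's a minimal sketch of the idea in PyTorch (my own illustration, not the paper's code; the model and sizes are made up):

    import torch
    import torch.nn as nn

    # Toy regression net whose only stochastic layers are Dropout.
    model = nn.Sequential(
        nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(64, 1),
    )

    def mc_dropout_predict(model, x, n_samples=50):
        # train() keeps Dropout stochastic at evaluation time; safe here because
        # this toy model has no BatchNorm layers that train() would also affect.
        model.train()
        with torch.no_grad():
            preds = torch.stack([model(x) for _ in range(n_samples)])
        # Predictive mean plus the spread across passes as a (crude) uncertainty estimate.
        return preds.mean(dim=0), preds.std(dim=0)

    mean, std = mc_dropout_predict(model, torch.randn(8, 16))

IIRC the paper's actual estimator also adds a term for inherent observation noise, but the spread of the repeated passes is the core of it.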


Yes, this is the best I've seen too. Unfortunately, "quickly" is a matter of perspective.

Usually you train a neural net that's as large as possible with a runtime that's within your latency and GPU budget. Running the net even twice will blow up your budget.

Also, I haven't seen a great evaluation of how well this method works in practice. For example, how well does the predicted uncertainty capture mistakes on a real-world dataset?


>The problem is that none of the Bayesian methods for machine learning work well and quickly (please do correct me if I'm wrong--I will become my boss's favorite person). If they did, many many practitioners would be very excited to use them.

There are loads of methods based on variational autoencoders that seem to work fairly well for various problems.


Bayesian methods work reasonably well as long as the parameter space is relatively limited and you have enough data to extract some advantage, but not enough to power a deep learning model. You also usually do not rely on Bayesian methods exclusively to solve a problem, but use them in conjunction with other methods (sanity checks) to ensure you aren't outputting nonsense.


Ah yes, the well-known limited parameter space of deep neural networks


Uncertainty is absolutely key everywhere decision making has an impact on human lives.

Detecting whether we're looking at an espresso or a cappuccino, sure, fine. But uncertainty on a CT scan would be extremely valuable, just to keep people (even trained professionals) from falling into the common bias of blindly trusting the output.

Say the model outputs a 90% chance of nothing and a 10% chance of a deadly condition that needs to be investigated ASAP. Doctors are often overworked and tired and can miss it. Knowing where the model's certainties and uncertainties lie would allow the doctor to compare that to their own training ("Ah, but it didn't judge based on X and Y at all") and limit decision-making based on blindly "trusting the algorithms".

This is similar to the problem that rating systems based on just a number of stars face: you need to read the comments and the "why" just to filter out astroturfing.

Coming back to ML, this is why I think more emphasis should be given to LIME [1], SHAP [2], and also Microsoft's Explainable Boosting Machine [3], which ironically is not a gradient boosting algorithm but a Generalized Additive Model (i.e. like a GLM, but Additive instead of Linear). A small brute-force sketch of the Shapley idea behind SHAP follows the links below.

[1] https://github.com/marcotcr/lime

[2] https://github.com/slundberg/shap

[3] https://github.com/interpretml/interpret
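To make the SHAP idea above concrete, here is a tiny brute-force Shapley attribution (my own illustration, not the shap library's implementation; "removing" a feature is approximated by substituting a baseline value):

    from itertools import combinations
    from math import factorial

    def shapley_attributions(f, x, baseline):
        # Exact Shapley values for a model f at point x, relative to a baseline.
        n = len(x)

        def value(subset):
            # Model output with only the features in `subset` taken from x.
            z = [x[i] if i in subset else baseline[i] for i in range(n)]
            return f(z)

        attributions = []
        for i in range(n):
            others = [j for j in range(n) if j != i]
            phi = 0.0
            for k in range(n):
                for S in combinations(others, k):
                    weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                    phi += weight * (value(set(S) | {i}) - value(set(S)))
            attributions.append(phi)
        return attributions  # sums to f(x) - f(baseline)

    # Toy model: the attributions explain its prediction at x relative to the baseline.
    f = lambda z: 3 * z[0] + 2 * z[1] * z[2]
    print(shapley_attributions(f, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))  # ~[3.0, 6.0, 6.0]

The real libraries approximate this (it's exponential in the number of features), but the per-feature attribution is exactly the "why" behind a single prediction.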


I have a decent grasp of stats (having read all of Casella & Berger) and I don't understand how you can truly quantify uncertainty, since confidence intervals and p-values assume a population distribution. I feel like it's a shell game. Given a uniform prior, what can you tell me about the uncertainty? And if you've chosen some other prior, how can you justify that decision? This is not even considering that, if you take it all the way down to first principles, there isn't even such a thing as a continuous prior or one with uncountable support.


This is an area where Bayesian methods really come to the fore - dealing with non-normal, even non-parametric, distributions. That’s not to say they can’t be treated in frequentist stats. I’ve never read that book, but perhaps it stops before dealing with how to handle all that stuff.

Regarding your specific point about how to quantify the error arising from your chosen prior: the answer in a Bayesian framework lies in things like credible intervals, posterior distributions, and posterior predictive distributions, depending on what you’re actually trying to quantify.

Very long story short, the latter allows you to sample from a distribution of predictions - for which you can then form a variety of error estimates (and the distribution doesn’t have to be parametric). That will quantify uncertainty for the model you’ve made using the prior you chose (and the data). But usually we wouldn’t recommend changing your prior just to get lower uncertainty unless you have some rationale for changing it. Priors should really be chosen based on domain knowledge etc etc etc.
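A minimal sketch of that last point, using a conjugate Beta-Binomial model so nothing is hidden (my own illustration; the prior and data here are made up):

    import numpy as np

    rng = np.random.default_rng(0)

    # Beta(a, b) prior on a success rate; we observe 7 successes in 10 trials.
    a, b = 2.0, 2.0
    successes, trials = 7, 10
    post_a, post_b = a + successes, b + trials - successes  # conjugate update

    theta = rng.beta(post_a, post_b, size=10_000)   # posterior samples of the rate
    y_new = rng.binomial(n=10, p=theta)             # posterior predictive for 10 new trials

    lo, hi = np.percentile(y_new, [2.5, 97.5])      # a 95% predictive interval
    print(theta.mean(), (lo, hi))

Swap the conjugate update for MCMC draws and the same last two steps give you predictive uncertainty for arbitrarily complicated models.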


Maybe you should read something other than Casella & Berger for Bayesian statistics. The book is great, but its coverage of this topic is not. Something like Bayesian Methods for Hackers [1] or the puppy book [2] are great tutorial-style, easy-to-read books. Big-boy Gelman [3] is far more mathematical and advanced.

I highly recommend [1] because it's free and hands-on, and if you have solid stats it will go fast and easy.

Anyway, the thing with Bayesian statistics is that you need to look at problems through a different lens and learn new names for the same things you know from classical statistics (maximum likelihood becomes maximum a posteriori with a flat prior, and so on).

If I had to summarize the whole uncertainty thing: both in classical and Bayesian statistics it holds that a function of random variables yields another random variable. Models are functions of random variables. As such, every point estimate you give has a full distribution behind it. In classical statistics this is accounted for with asymptotics and confidence intervals. Classical statistics has priors all the same, but uses defaults and doesn't talk about them. In Bayesian statistics you talk about your priors, and if you have extra information about your inference problem you include it there. Or you abuse the priors to make the model yield results that will make your boss/client happy (I worked as an economics researcher at a big bank; the whole department was Bayesian). Or you choose priors that make your custom model actually compute. Or you use priors that do regularization (like a classic lasso - see the small sketch after the links below)... Different priors are different models, and you need to do your model checking / selection. I am not so sure what you mean about continuous priors; any gamma distribution is continuous, but I don't think that is what you mean.

[1] https://github.com/CamDavidsonPilon/Probabilistic-Programmin...

[2] https://www.elsevier.com/books/doing-bayesian-data-analysis/...

[3] http://www.stat.columbia.edu/~gelman/book/
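On the "priors that do regularization" point: a minimal sketch (my own illustration) of MAP estimation for linear regression, where a Gaussian prior on the coefficients gives the ridge penalty and a Laplace prior gives the lasso penalty.

    import numpy as np

    def neg_log_posterior(beta, X, y, sigma=1.0, prior="laplace", scale=1.0):
        resid = y - X @ beta
        nll = 0.5 * resid @ resid / sigma ** 2            # Gaussian likelihood, up to a constant
        if prior == "laplace":
            penalty = np.abs(beta).sum() / scale          # Laplace prior -> L1 (lasso)
        else:
            penalty = 0.5 * (beta @ beta) / scale ** 2    # Gaussian prior -> L2 (ridge)
        return nll + penalty                              # minimize this to get the MAP estimate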


> Say the model outputs a 90% chance of nothing and a 10% chance of a deadly condition that needs to be investigated ASAP. Doctors are often overworked and tired and can miss it. Knowing where the model's certainties and uncertainties lie would allow the doctor to compare that to their own training ("Ah, but it didn't judge based on X and Y at all") and limit decision-making based on blindly "trusting the algorithms".

Spot on. Personally, I always enjoy Alex Kendall and Yarin Gal's writing on the topic, for example:

https://alexgkendall.com/computer_vision/bayesian_deep_learn...


Read the note carefully: "While improvements in calibration are an empirically recognized benefit of a Bayesian approach, the enormous potential for gains in accuracy through Bayesian marginalization with neural networks is a largely overlooked advantage." Because the different parameter settings correspond to different high-performing models, averaging their predictions will lead to more accurate predictions.
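A minimal sketch of what that marginalization looks like in practice (my own illustration, assuming PyTorch), with a handful of high-performing parameter settings - posterior samples, or in practice often just ensemble members or checkpoints - standing in for the integral:

    import torch

    def marginal_predict(models, x):
        # p(y|x) ~ (1/S) * sum_s p(y | x, w_s): average the predictive
        # probabilities of each parameter setting, not their logits.
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
        return probs.mean(dim=0)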


Averaging predictions can lead to improvements, but this isn't necessarily so. It may also simply drag the accuracy down toward the weaker method if that method is consistently worse than the others in the batch. Averaging works well when the different methods tend towards different outliers; that way you end up cancelling some of that out.


I have yet to see someone successfully produce a high-dimensional posterior over all of the parameters in a neural network. Usually, people make the simplifying assumption of independent posteriors for each parameter (a mean-field approximation), which means that you can't just extract a bunch of different models.


I agree - far too often I just see "safety critical applications" thrown around and that is definitely not enough of a justification in my mind.

There are alternative ways of getting uncertainty estimates that are far less work.


"...deep learning folks don't give two shits about uncertainty intervals..." This is a failing of those folks for many applications.


Where's the best place to read through the collected works of Jim Berger? (Thanks!)



