With other ways of applying dropout, LSTMs typically fail to converge, and with no dropout they often over-fit. Gal's variational dropout therefore brings a significant improvement to many leading models.
There are several other nice contributions in the thesis as well, including a recommendation for applying dropout to word embedding matrices that I don't think has been well explored yet.
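If I read that recommendation right, the idea is to drop entire word types (whole rows of the embedding matrix) with one shared mask, rather than masking individual embedding dimensions. A rough PyTorch sketch of that reading, with the function name and defaults my own:

```python
import torch

def word_embedding_dropout(embed_weight, p=0.1, training=True):
    # Drop entire rows (word types) of the embedding matrix with a
    # single Bernoulli mask per word, instead of masking individual
    # activations. Scaling by 1/(1-p) keeps the expected value unchanged.
    if not training or p == 0.0:
        return embed_weight
    keep = (torch.rand(embed_weight.size(0), 1,
                       device=embed_weight.device) > p).float()
    return embed_weight * keep / (1.0 - p)
```

The point of the shared per-word mask is that every occurrence of a dropped word is zeroed consistently within a pass, which is what makes it interpretable as a distribution over embedding matrices.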
If these ideas look interesting, you might also want to check out Thomas Wiecki's blog post on a practical application of ADVI (a form of the variational inference Yarin discusses) to get uncertainty estimates out of a network.
One useful tidbit is that you can get prediction intervals out of a deep learning model by running it forward N times with dropout left on and taking the mean and variance of the resulting distribution (plus an additional precision term).
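A minimal PyTorch sketch of that recipe (the function name is mine, and `tau_inv`, the inverse model precision, is a placeholder: in Gal's derivation it depends on the weight decay, dropout rate and dataset size):

```python
import torch

def mc_dropout_predict(model, x, n_samples=100, tau_inv=0.01):
    # Keep dropout layers active at prediction time by leaving the
    # model in train mode, then sample N stochastic forward passes.
    model.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()
    mean = preds.mean(dim=0)
    # Predictive variance = sample variance across the passes plus
    # the inverse model precision tau^-1 (the "precision term" above).
    var = preds.var(dim=0) + tau_inv
    return mean, var
```

The mean is your point forecast, and the per-output variance can be turned directly into an interval around it.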
Concretely, if you are trying to train a neural net to forecast stock prices or drive a car safely, you want not only predictions but also some measure of how confident your model is in each prediction. This is eminently useful for models that lean towards the "black-box" end of the spectrum, such as deep neural nets.
Note that parameter uncertainty and risk estimation are quite different things; the distinction is addressed in this preliminary work: http://bayesiandeeplearning.org/papers/BDL_4.pdf
But does this basically mean that I can have a model trained only on cat pictures, and it can still tell me, with some measure of certainty, that a picture of a horse is not a cat, all without training the model to answer specifically "is this a cat?"
> I think that's why I was so surprised that dropout – a ubiquitous technique that's been in use in deep learning for several years now – can give us principled uncertainty estimates. Principled in the sense that the uncertainty estimates basically approximate those of our Gaussian process. Take your deep learning model in which you used dropout to avoid over-fitting – and you can extract model uncertainty without changing a single thing. Intuitively, you can think about your finite model as an approximation to a Gaussian process. When you optimise your objective, you minimise some "distance" (KL divergence to be more exact) between your model and the Gaussian process. I'll explain this in more detail below. But before this, let's recall what dropout is and introduce the Gaussian process quickly, and look at some examples of what this uncertainty obtained from dropout networks looks like.
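For anyone who wants the "distance" made explicit: as I understand the thesis, with $q_\theta(\omega)$ the distribution over weights induced by the dropout masks, training minimises the variational objective

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \int q_\theta(\omega)\, \log p\big(y_i \mid f^{\omega}(x_i)\big)\, d\omega \;+\; \mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big),$$

which is equivalent, up to a constant, to minimising $\mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega \mid X, Y)\big)$, the divergence between the dropout network and the posterior of the approximating Gaussian process.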
Gal's variational dropout is one of the paths forward toward Bayesian deep learning.
Why would resolving uncertainty lead to a false sense of confidence?
Without some explanation it's impossible to understand what you mean.