
Why Bayesian Stats Needs Monte-Carlo Methods - laplacesdemon48
https://www.countbayesie.com/blog/2020/8/16/why-bayesian-stats-need-monte-carlo-methods
======
srean
The title is worse than the actual post. You don't always need Monte Carlo. You
need integration or summation (a distinction that can be abstracted away), and
for some forms, Monte Carlo is a way to evaluate those integrals
approximately. As the post indicates, there often are cheaper alternatives,
more so when one is willing to tolerate some inaccuracy.
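
For concreteness, the post's headline number drops out of plain Monte Carlo in a few lines (a sketch, using the Beta(2,13)/Beta(3,11) posteriors quoted later in the thread):

    import numpy as np

    rng = np.random.default_rng(0)
    # Posteriors quoted downthread: A ~ Beta(2, 13), B ~ Beta(3, 11)
    a = rng.beta(2, 13, 1_000_000)
    b = rng.beta(3, 11, 1_000_000)
    print((b > a).mean())  # ~0.71, the post's headline number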

~~~
physicsguy
\+ Monte Carlo integration goes to shit in higher dimensions...

~~~
aaronchall
That's why we do Hamiltonian Monte Carlo in higher dimensions.

------
Maro
I haven't read the OP, but if you're interested in this stuff, my blog is full
of MC simulations, both for frequentist and Bayesian experimentation:

[http://bytepawn.com/tag/ab-testing.html](http://bytepawn.com/tag/ab-testing.html)

~~~
mnky9800n
you really like a/b testing.

------
ummonk
In this particular case you don't have to choose between Monte Carlo and using
a normal sum approximation. You can just do a non-stochastic numerical
integration (e.g. simple Riemann integration or whatever other method you want
to use), since it's only a double integral, and thus the dimensionality isn't
so high that you need to resort to stochastic Monte Carlo.
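
A sketch of that, assuming the A ~ Beta(2, 13) and B ~ Beta(3, 11) posteriors quoted elsewhere in the thread:

    import numpy as np
    from scipy.stats import beta

    # Riemann sum of the joint density of A and B over the region b > a;
    # fully deterministic, no sampling involved.
    xs = np.linspace(0, 1, 2001)[1:-1]   # interior grid points
    dx = xs[1] - xs[0]
    pa = beta.pdf(xs, 2, 13)             # density of A on the grid
    pb = beta.pdf(xs, 3, 11)             # density of B on the grid
    mask = xs[None, :] > xs[:, None]     # True where b > a
    print((pa[:, None] * pb[None, :] * mask).sum() * dx * dx)  # ~0.71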

------
geoalchimista
The title seems to assume that Bayesian statisticians do not know Monte Carlo
methods. But in fact, Monte Carlo methods _are_ part of Bayesian statistics.
They are in any decent Bayesian stats textbook.

~~~
conjectures
Yes. To give people from other disciplines a flavour:

\- Why Engineering needs PDE solvers.

\- Why Computer Science needs compilers.

\- Why Geography needs digital maps.

~~~
bonoboTP
"Needs" may not imply "currently does not use enough". "Why the human body
needs water" could be a popsci article.

~~~
ummonk
It kind of does though? I.e. I’d expect that article to be about why you need
to keep adding water to the human body (by drinking water), whereas an article
about “why the human body is mostly water” would explain why it is composed of
water rather than why it needs more water.

------
credit_guy
Let me take a stab at answering. And I'll answer 2 questions, not 1.

1\. Why do we need Bayesian estimation? Can't we all just use MLE and live in
harmony? If MLE was good enough for Gauss and Laplace and Fisher, why isn't it
good enough for me? Answer: for many problems the likelihood function is
unbounded. In fact the cases where the likelihood function is bounded, so that
MLE makes sense, are few and far between. But if you think of the likelihood
as a function with a hump somewhere, you have several choices for a single
number that best describes the function: the mode, the mean, or the median,
for example. MLE corresponds to the mode; can we use the mean instead?
Doesn't it make even better sense to use the mean rather than the mode? Sure
it does, and if you use the mean you become a Bayesian. But can the likelihood
be unbounded and still have a finite mean, you ask? Yes, it happens quite
often, sufficiently often that in practice people don't worry about it.

Once you start using the mean instead of the mode (and can therefore proudly
call yourself a Bayesian), you get a number of side benefits: you get
parameter uncertainty estimation for free, and you can start multiplying your
likelihood by various functions that make your life easier. You can interpret
those functions as regularizers, or priors; you can even start talking about
expert knowledge and delve into philosophical meanderings. No need to do that,
there are enough people on the internet who do.

2\. Why Monte Carlo? Because calculating means of high-dimensional
distributions is not easy. In most cases you are hit with the curse of
dimensionality. Separately, the distributions you work with are known only up
to a multiplicative constant. So the problem is: how do you estimate the mean
of a vector (v1, v2, ..., vn) when you know its distribution only up to a
constant? The answer is Markov chain Monte Carlo (MCMC), and if you learn
that, you get special powers.

So why do Bayesians need Monte Carlo? Because they like to be like super-
heroes and have special powers.
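
(For the curious, a minimal Metropolis sketch of point 2. The unnormalized target here is made up for illustration, not anything from the post:)

    import numpy as np

    # A density known only up to a constant: an unnormalized Gaussian
    # bump centered at (1, -2).
    def unnorm_log_p(v):
        return -0.5 * np.sum((v - np.array([1.0, -2.0])) ** 2)

    rng = np.random.default_rng(0)
    v = np.zeros(2)
    samples = []
    for _ in range(50_000):
        prop = v + 0.5 * rng.standard_normal(2)   # random-walk proposal
        # The accept/reject step uses only the *ratio* of densities,
        # so the unknown normalizing constant cancels out.
        if np.log(rng.random()) < unnorm_log_p(prop) - unnorm_log_p(v):
            v = prop
        samples.append(v)
    print(np.mean(samples, axis=0))   # close to the true mean (1, -2)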

------
baxtr
Any article with “Bayesian” in the title gets upvoted these days :)

~~~
curiousgal
_" Deep Bayesian Learning for Blockchain proofs built using Rust and Vue.js"_

~~~
baxtr
Almost perfect. I would add a bit of Kubernetes to the mix.

PS: I think it would be great to have a clickbait title builder based on top
HN articles

~~~
privong
There was a HN title generator done a few years ago:
[https://veekaybee.github.io/2015/08/24/markov-in-python/](https://veekaybee.github.io/2015/08/24/markov-in-python/)

And one for generating (unrelated) comments to a title:
[https://github.com/leod/hncynic](https://github.com/leod/hncynic)

------
abeppu
I think maybe a more interesting question is how to explain the extreme
disconnect between data volume and confidence in the two examples being
compared. The election has a huge number of polls, analysts, and
subject-matter experts focused on it, and yet our ability to make predictions
about it is comparable to an A/B test with only 30 observations total?

The only sense I can make of this is that:

\- Predictions about stochastic dynamic systems are hard, especially when
exogenous variables can intervene out of nowhere.

\- Competitive situations are hard to predict, especially if they change based
on the measurements produced about them (i.e. the campaigns strategize based
on polling info). This effectively makes those measurements less predictive.

To me, this suggests that all the polling and analysis are a waste of
resources and attention. The expected value of information is extremely low.
We should kick off campaigns on Halloween, talk about them for a couple days,
and vote immediately. All the pollsters and analysts can be more usefully
deployed towards studying other systems.

~~~
sokoloff
This seems to be assuming that the value of all this coverage and analysis is
in the correctness of the prediction.

It seems more likely that the media companies profit enormously from all of
this coverage in spite of (and perhaps even because of) the low certainty
provided by this polling and analysis.

------
asdf_snar
Somewhat of a side question, but how would one go about finding jobs where
knowing Monte Carlo methods is a primary requirement? It seems most companies
want a software developer who knows a bit of this, rather than a
statistician/mathematician who knows a bit of software development.

~~~
mlthoughts2018
In general what you’re looking for just doesn’t exist. In niche cases it
might, and your best bet would be government / military / academic research
labs. Those jobs pay comparatively worse but usually have much better
work/life balance & job security and enable you to focus on specializations.

In industry, you just won’t find this. Statistics is a small domain specialist
component of a much larger software ecosystem. You don’t _require_ advanced
statistics, you just might want to try it or derive a special benefit from it,
and none of that will be worth it unless it efficiently connects to the
regular software development life cycle, architecture patterns, testing, and
maintenance processes that the larger system requires.

Someone who can do “a little software development” would be a liability in
that scenario, no matter what their other domain specializations are.

That’s why you’ll find most data scientists and ML engineers professionally
are highly skilled at software engineering. They approach statistical modeling
tasks from the point of view of standard software development life cycle tasks
where the components just happen to involve statistical models or techniques.

~~~
MacsHeadroom
Quants are statisticians who know a bit of software development. They're
always in high demand and get paid more than SWEs on average.

[https://www.linkedin.com/jobs/quantitative-analyst-jobs](https://www.linkedin.com/jobs/quantitative-analyst-jobs)

~~~
mlthoughts2018
I worked as a quant for several years at the start of my career. Being a quant
is 95% software architecture and engineering and 5% stats / SQL / spreadsheets
and analytics.

------
olliej
I enjoyed reading most of this. It only really got dull in the
analytical-solutions bit toward the end.

------
acidbaseextract
Did the post ever actually answer the question:

> this is an awesome example. is there an easy way (as in non-brute force) to
> finding the beta parameters that'll match a probability?

~~~
klipt
Something like Newton's method would probably do the trick.
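
A sketch of that idea, with scipy's brentq (a bracketing root-finder, standing in for Newton) and the A ~ Beta(2, 13), B ~ Beta(3, 11) setup quoted downthread. Four free parameters against one constraint is underdetermined, so three of them are held fixed here:

    from scipy.integrate import quad
    from scipy.optimize import brentq
    from scipy.stats import beta

    # P(B > A) reduces to a 1-d integral: pdf_B(x) * cdf_A(x) over [0, 1].
    def p_b_beats_a(aB, bB, aA=2.0, bA=13.0):
        return quad(lambda x: beta.pdf(x, aB, bB) * beta.cdf(x, aA, bA),
                    0, 1)[0]

    # Solve for the one remaining parameter so that P(B > A) = 0.71.
    aB = brentq(lambda a: p_b_beats_a(a, 11.0) - 0.71, 0.5, 20.0)
    print(aB, p_b_beats_a(aB, 11.0))   # aB close to 3, P(B > A) ~ 0.71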

~~~
acidbaseextract
That makes sense in the analytic case, but you mean in the monte-carlo case? I
presume just 4D Newton's method on the final integral over the pdf_X to get
the value we're looking for?

It makes sense, I'm just surprised the blog never comes out and says "you have
to do a brute force search to get the parameters for the likelihoods A ~
Beta(2,13) and B ~ Beta(3,11) for P(B > A) to have about the right value".

~~~
mlthoughts2018
Your comment sounds very confused, can you elaborate? What do you mean by "in
the analytic case, but you mean in the monte-carlo case"? I can't parse that
sentence as a response to the parent comment.

Further what do you mean by

> “ the blog never comes out and says "you have to do a brute force search to
> get the parameters for the likelihoods A ~ Beta(2,13) and B ~ Beta(3,11) for
> P(B > A) to have about the right value".”

What do you mean by “brute force search” here?

------
bartleby_
This is not a good example of where Monte Carlo methods are needed, because
with Beta posteriors the problem can be solved by summing the joint posterior
over the region where one conversion rate exceeds the other.

------
clircle
I think there is something left out of the post, but it's hard to put my
finger on it. In a vacuum, 71% does sound like a relatively low amount of
evidence in favor of Biden, but when you put it next to most elections, which
are only won by a thin margin, it sounds like a very impressive amount of
evidence. Further, considering the amount of polarization in the USA, it seems
like if 71% of respondents favor Biden, there is little probability of them
switching to Trump.

I agree that 71% is a low amount of evidence, but election forecasts really
make my brain work to rationalize them. I think it's helpful to remember that
forecasts are produced _after_ weighting them by states' electoral college
shares.

~~~
ummonk
Are you conflating share of vote with probability of victory? Your comment
makes no sense.

~~~
proto-n
They are, and I guess most people do too, unconsciously. That's why losing
despite a predicted 71% chance becomes such a big thing. You hear "election"
and "71%" in the same sentence and immediately think that's a huge margin.

------
dumb1224
I like the blog site name Count Bayesie! Jazzy +1

------
GuB-42
> Our forecast is up!!! It gives Joe Biden a 71% chance of winning and Donald
> Trump a 29% chance

I think the confusing thing here is that at first glance it looks like "Joe
Biden will get 71% of the votes", which would be a landslide victory. If it
said "there is as much chance of Joe Biden winning as rolling 1-7 on a
10-sided die", which means almost exactly the same thing, no one would be
surprised if Trump won, because we are all familiar with dice rolls with
probabilities in the 0.1-0.9 range.

~~~
sokoloff
Slightly less accurate but probably significantly more informative would be
"rolling 1-4 on a regular die".

------
adamnemecek
I can't wait for someone to figure out automatic integration (similar to
automatic differentiation). Current ways of integrating are such a
clusterfuck.

~~~
laplacesdemon48
Definitely not an expert, but there's a fundamental hurdle to this.

One way to think of this is to realize how easy it is to multiply numbers but
how much more work it takes to divide numbers.

For something like automatic differentiation, you're essentially applying the
chain rule for partial derivatives repeatedly. This is analytically pretty
straightforward for most applications. All you need is an analytical
derivative for each of the simple functions your more complex function is
composed of (e.g. a neural network).

For integration, the analogue of the chain rule is integration by substitution
[1], and the toolbox for solving integration problems is more limited than the
one for differentiation. You run into issues where the answer cannot even be
expressed in terms of elementary functions [2]. Sometimes you get lucky and
the answer can be written as an alternating Taylor series, so you can estimate
it within some margin of error [3].
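
For instance, the Gaussian integral in [3] can be estimated that way; a quick sketch:

    import math

    # Standard normal CDF at z via the alternating series
    # integral_0^z exp(-t^2/2) dt = sum_n (-1)^n z^(2n+1) / (2^n n! (2n+1))
    def norm_cdf(z, terms=20):
        s = sum((-1) ** n * z ** (2 * n + 1)
                / (2 ** n * math.factorial(n) * (2 * n + 1))
                for n in range(terms))
        return 0.5 + s / math.sqrt(2 * math.pi)

    print(norm_cdf(1.0))   # ~0.8413, matching standard normal tables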

Stan is a piece of software that runs state-of-the-art MCMC methods to
basically just compute fancy integrals. A Stan model will take an order of
magnitude more time to run than a simple neural network via something like
PyTorch on the same dataset. But they answer different questions.

[1] [https://math.stackexchange.com/questions/1635949/is-there-a-...](https://math.stackexchange.com/questions/1635949/is-there-a-chain-rule-for-integration/)

[2] [https://math.stackexchange.com/questions/1397132/why-cant-so...](https://math.stackexchange.com/questions/1397132/why-cant-some-integral-befound-though-they-are-anti-derivative-exist)

[3] [https://math.stackexchange.com/questions/145087/how-to-calcu...](https://math.stackexchange.com/questions/145087/how-to-calculate-the-integral-in-normal-distribution)

~~~
adamnemecek
You are right. However, just as automatic differentiation never gives you the
symbolic expression for the derivative, automatic integration wouldn't give
you one either.
~~~
laplacesdemon48
Can you please elaborate on "you never get the expression"?

When you plug variable values into a symbolic derivative you just get a value
at the end: d/dx x^2 = 2x, so if x = 0.5 the derivative is 1. The same is true
for symbolic integrals. For most practical applications, we don't really care
about the full symbolic expression; we just want the answer, or at least a
good approximation. This post uses the specific example of the difference
between two Beta distributions. We want to get that 0.71. It is very hard to
make that happen "automatically".

~~~
adamnemecek
Are you familiar with automatic differentiation?
[https://blog.demofox.org/2014/12/30/dual-numbers-automatic-d...](https://blog.demofox.org/2014/12/30/dual-numbers-automatic-differentiation/)
Note that you get f(x) and f'(x) for some x without ever obtaining the
derivative as a symbolic function.

~~~
justinpombrio
Ooh, that's a neat technique. I think you _can_ get an expression for the
derivative from it.

Say we want the derivative of f(x) = x^3 - 2x^2 + 5. Substituting x+e and
dropping every term containing e^2 (since e^2 = 0 for dual numbers), that
becomes:

(x+e)^3 - 2(x+e)^2 + 5

= x^3 + 3x^2 e - 2x^2 - 4x e + 5

= (x^3 - 2x^2 + 5) + (3x^2 - 4x)e

The term in front of 'e' is "3x^2 - 4x", which is f'(x).
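
You can run the same calculation mechanically with a tiny dual-number class (a sketch, integer powers only):

    class Dual:
        """Numbers a + b*e with e^2 = 0; the e-part carries the derivative."""
        def __init__(self, a, b=0.0):
            self.a, self.b = a, b
        def __add__(self, o):
            o = o if isinstance(o, Dual) else Dual(o)
            return Dual(self.a + o.a, self.b + o.b)
        __radd__ = __add__
        def __sub__(self, o):
            o = o if isinstance(o, Dual) else Dual(o)
            return Dual(self.a - o.a, self.b - o.b)
        def __mul__(self, o):
            o = o if isinstance(o, Dual) else Dual(o)
            return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
        __rmul__ = __mul__
        def __pow__(self, n):            # integer powers via repeated product
            r = Dual(1.0)
            for _ in range(n):
                r = r * self
            return r

    def f(x):
        return x ** 3 - 2 * x ** 2 + 5

    d = f(Dual(2.0, 1.0))    # evaluate at x = 2, seeding the e-part with 1
    print(d.a, d.b)          # f(2) = 5.0 and f'(2) = 3*4 - 4*2 = 4.0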

~~~
judofyr
Just to clarify: this result follows directly from the definition of the
derivative:

f'(x) = lim_{e->0} (f(x+e) - f(x))/e

If you're able to express f(x+e) in the form f(x) + y*e, then it follows that
y is the derivative.

It should also be noted that auto-diff doesn't let you skip the rules of
differentiation: you're doing the same calculation you would use to show that
e.g. (x^n)' = n*x^{n-1}.

