
A sane introduction to maximum likelihood estimation and maximum a posteriori - perone
http://blog.christianperone.com/2019/01/a-sane-introduction-to-maximum-likelihood-estimation-mle-and-maximum-a-posteriori-map/
======
lenticular
This is a really cool and clear introduction to MAP/MLE, especially since you
take great pains to explain what all of the notation means. I'll definitely be
pointing some people I know to this blog.

OT on technical blogs: Experts often are unable to put themselves in the shoes
of someone with no experience, which really harms the pedagogy. When one
practices a technical topic for a long time, concepts that were once foreign
and difficult become instinctual. This makes it very hard to understand in
what ways a beginner could be tripped up. It takes a large amount of thought
to avoid this problem, which I think is why so much introductory material
(blog posts, books, etc.) is really sub-par.

~~~
nordsieck
> Experts often are unable to put themselves in the shoes of someone with no
> experience, which really harms the pedagogy.

If anyone is interested in learning more, this phenomenon is typically called
"the curse of knowledge".

~~~
theoh
See also "unconscious competence":
[https://en.wikipedia.org/wiki/Four_stages_of_competence](https://en.wikipedia.org/wiki/Four_stages_of_competence)

------
subjectHarold
Could someone explain in a bit more detail the move from 26 to 27? I don't get
the significance of being "worried about optimization" or why/how we cancel
p(x). I do get the later point about integration and the convenience of the
reformulation. I just don't get why or how it is "allowed".

Sorry if this is obvious but I have been doing a lot of reading on this and
have come across this step a few times before...but am just missing some part
of every explanation.

~~~
throwaway287391
Because we're optimizing (taking an argmax) with respect to theta for some
fixed dataset x, the 1/p(x) term is just a constant factor -- p(x) is just
some number (and a positive one, since it's the probability of the data you
actually observed). It's like saying
argmax_{theta} 0.87*f(theta) = argmax_{theta} f(theta).
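
For concreteness, a tiny numeric sketch of that invariance (the grid, the toy
function, and the 0.87 constant are all made up for illustration):

    import numpy as np

    # Toy "likelihood" evaluated on a grid of candidate thetas.
    thetas = np.linspace(-3, 3, 601)
    f = np.exp(-(thetas - 1.3) ** 2)  # peaks at theta = 1.3

    p_x = 0.87  # stands in for the fixed evidence term p(x)

    # Dividing by the positive constant p(x) does not move the argmax.
    assert np.argmax(f) == np.argmax(f / p_x)
    print(thetas[np.argmax(f / p_x)])  # still 1.3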

------
kahoon
Nice, clear explanation. Looking forward to the Bayesian inference one!

One note, though: I think in equation 25 you are missing a log on the
left-hand side.

~~~
perone
Will fix it, thanks a lot for the feedback!

------
stilley2
Nice write-up! Minor nitpick: ML/MAP estimators don't _require_ observations
to be independent. At least, in my field we're looking at a single observation
of a multivariate distribution, and we don't need to assume the elements are
independent (i.e., we permit a non-diagonal covariance matrix). My intuition
says this is equivalent to assuming multiple correlated scalar observations,
but I'd have to sit down with some paper. Also, you use "trough" where I think
you mean "through."

~~~
nerdponx
Dependence between elements of the same observation is irrelevant. The point
is that _different_ observations must be independent and identically
distributed for the standard formulation of the likelihood to be valid.

Typically we write the likelihood function as

    L = Π P(y | θ)

If you didn't have identically-distributed observations, the functional form
of P would be different for each observation.

And if you didn't have independent observations, then you're basically screwed
in the general case. That expression for L is basically the definition of
probabilistic independence: a finite set of random variables is mutually
independent if and only if their joint probability function is equal to the
product of the individual variables' probability functions.
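
As a minimal sketch of what that factorization buys you (a made-up normal
model with known variance and simulated data, purely for illustration):
because the observations are iid, the joint density is the product of the
per-observation densities, so the log-likelihood is a simple sum.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    y = rng.normal(loc=2.0, scale=1.0, size=200)  # iid observations

    # log L(theta) = sum_i log P(y_i | theta), thanks to independence.
    def neg_log_likelihood(theta):
        return -np.sum(norm.logpdf(y, loc=theta, scale=1.0))

    theta_mle = minimize_scalar(neg_log_likelihood).x
    print(theta_mle, y.mean())  # for this model the MLE is the sample mean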

If you have dependence between observations, you lose the ability to write L
in that nice form. This is a non-negotiable consequence of basic probability
theory.

The only way to do MAP estimation without iid observations is to know the
joint distribution of your entire dataset, and be able to maximize that
distribution with respect to θ given an arbitrary data set. This is possible
but it's not quite the same thing as dumping your data into a GLM.

~~~
conjectures
The post this is a reply to was correct, and this is not. E.g. a simple
counterexample is estimating the autocorrelation parameter in an AR(1) model
for an economic time series. Under your suggested definition of MLE this
can't be done, which is simply not the case.

In fact, not approaching the more general case is liable to confuse learners,
as they may think the independence assumption is somehow baked into MAP/MLE,
which it is not.
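
As a minimal sketch of the AR(1) case (simulated data and a known noise
variance, purely for illustration): the observations are dependent, but the
joint density still factorises by the chain rule into
p(y_1) * Π_t p(y_t | y_{t-1}), so the (conditional) log-likelihood is a sum
of one-step-ahead terms that can be maximised as usual.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    rng = np.random.default_rng(1)

    # Simulate an AR(1) series: y_t = phi * y_{t-1} + eps_t, eps_t ~ N(0, 1).
    phi_true, n = 0.7, 500
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi_true * y[t - 1] + rng.normal()

    # Conditional log-likelihood: sum of one-step-ahead Gaussian terms.
    def neg_log_likelihood(phi):
        return -np.sum(norm.logpdf(y[1:], loc=phi * y[:-1], scale=1.0))

    res = minimize_scalar(neg_log_likelihood, bounds=(-0.99, 0.99),
                          method="bounded")
    print(res.x)  # recovers something close to phi_true = 0.7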

~~~
nerdponx
I never suggested a definition of MLE. You need independence to use the "L = Π
P(y | θ)" formulation, full stop.

~~~
conjectures
Yes, you do need independence to assume the likelihood factorises. You do
not need independence to find an MLE/MAP.

------
master_yoda_1
So you think all the others are insane ;)

~~~
perone
For that, I would need the likelihood of all others xD

