
Modes, Medians and Means: A Unifying Perspective (2013) - niklasbuschmann
http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/
======
bo1024
One can generalize this approach further to take any given loss function, not
just a p-norm, and ask what statistic is achieved by minimizing loss.

A field that studies this is called property elicitation, from the idea
that the loss function elicits a certain kind of prediction, be it the
mean, mode, etc.

This field actually originated with human experts like weather
forecasters - what is a loss function (or “scoring rule”) that
incentivizes them to predict accurately? Squared loss is good if by
“accurately” we mean we want the mean.

One fact people find surprising is that squared loss is not the only one that
elicits the mean. There is an entire family of losses that do, including KL
divergence.
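
A quick numerical sketch of that last claim (my own, with made-up
positive data, since KL requires it): both squared loss and the
generalized KL divergence bottom out at the mean.

    import numpy as np
    from scipy.optimize import minimize_scalar

    x = np.array([0.5, 1.0, 2.0, 4.0, 4.5])  # positive data, as KL requires

    # squared loss: sum (x_i - s)^2
    sq = minimize_scalar(lambda s: np.sum((x - s) ** 2))

    # generalized KL divergence: sum x_i * log(x_i / s) - x_i + s
    kl = minimize_scalar(lambda s: np.sum(x * np.log(x / s) - x + s),
                         bounds=(1e-6, 10.0), method="bounded")

    print(x.mean(), sq.x, kl.x)  # all three agree, up to solver tolerance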

~~~
CrazyStat
This approach is also used to extend concepts like the median into higher
dimensions and non-Euclidean spaces.

What is the "median" or "mean" of a set of points located on the surface of a
sphere, or a torus, or some more complicated manifold (which may only be
defined implicitly by a distance function)? If you try to start from the one-
dimensional definition it's not really clear. You can define such concepts,
however, by minimizing the corresponding discrepancies. If you minimize
the sum of distances you get something like a median. If you minimize the
sum of squared distances you get something like a mean (the Fréchet
mean). Once you have something like a mean you can also define something
like a variance.

Wikipedia has a short article about the concept [1]. It really deserves
more treatment, as it's a central (no pun intended) idea in topological
data analysis [2].

[1]
[https://en.wikipedia.org/wiki/Fr%C3%A9chet_mean](https://en.wikipedia.org/wiki/Fr%C3%A9chet_mean)

[2]
[https://en.wikipedia.org/wiki/Topological_data_analysis](https://en.wikipedia.org/wiki/Topological_data_analysis)
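
Here's a minimal sketch of the idea on the unit sphere (my own toy
example, not from [1] or [2]): parametrize candidate centers by latitude
and longitude, use the great-circle distance as the metric, and minimize
the summed squared distances numerically.

    import numpy as np
    from scipy.optimize import minimize

    def to_xyz(lat, lon):
        return np.array([np.cos(lat) * np.cos(lon),
                         np.cos(lat) * np.sin(lon),
                         np.sin(lat)])

    # three points on the sphere, given as (lat, lon) in radians
    pts = np.array([to_xyz(*p) for p in [(0.1, 0.2), (0.4, -0.3), (0.2, 0.5)]])

    def frechet_cost(params):
        c = to_xyz(*params)
        # geodesic distance on the unit sphere = arccos of the dot product
        d = np.arccos(np.clip(pts @ c, -1.0, 1.0))
        return np.sum(d ** 2)  # swap in np.sum(d) for a geodesic "median"

    result = minimize(frechet_cost, x0=[0.0, 0.0])
    print(result.x)  # (lat, lon) of the Frechet mean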

------
andreareina
I've come back to this very page a number of times over the years and I've
always been dissatisfied with how it glosses over _why_ E_1 implies we should
take the median, and E_2 implies we should take the mean.

Assume we have some value of s_1, and partition the elements into X_lt =
{x_1, x_2, ... x_k}, where every x_lt < s_1, and X_gt = {x_k+1, x_k+2,
... x_n}, where every x_gt > s_1 (ignoring the case where any element is
equal to s_1, to make the reasoning easier). Moving s_1 to s_1 + epsilon
increases E_1 by epsilon * |X_lt| and decreases it by epsilon * |X_gt|,
so E_1 is minimized when the number of elements of X on either side of
s_1 is equal (now that I've written this out I think that's what 'ogogmad
is getting at).
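
A quick numerical check of this (a sketch with made-up data): sweep s_1
over a grid and E_1 bottoms out exactly at the median.

    import numpy as np

    X = np.array([1, 2, 3, 7, 50])
    grid = np.linspace(0, 60, 601)  # step 0.1
    E1 = [np.sum(np.abs(X - s)) for s in grid]
    print(grid[np.argmin(E1)], np.median(X))  # both land on the median, 3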

I'm still trying to develop the intuition for why minimizing E_2 implies
taking the mean.

~~~
leereeves
The intuition isn't clear to me either, but the calculus is fairly simple. At
the minimum, the derivative of the aggregate discrepancy is zero:

0 = d/ds sum (x_i - s)^2 = -2 sum(x_i - s) = -2 [(sum x_i) - n * s]

Thus

n*s = sum x_i

so

s = sum x_i / n

~~~
ImaCake
I find turning the math into working code sometimes helps me. The
for-loop in my code only approximates the mean, since it searches a grid
in steps of 0.1. This is because it relies on the definition for finding
the mean provided in OP's article, rather than your derivation of the
arithmetic mean.

    import numpy as np

    X = [0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6]

    # grid search: score each candidate i by its total squared discrepancy
    mean_dict = {}
    for i in [x / 10 for x in range(0, 100)]:
        mean_dict[sum(abs(x - i) ** 2 for x in X)] = i

    print(mean_dict[min(mean_dict)], "score:", min(mean_dict))

    # add the actual mean to the candidates
    mean_dict[sum(abs(x - np.mean(X)) ** 2 for x in X)] = np.mean(X)

    print("real mean:", mean_dict[min(mean_dict)], "score:", min(mean_dict))

~~~
mkl
You can get proper code formatting by indenting two spaces.

------
ogogmad
Interestingly, since the L^1 discrepancy is convex, it's possible to
analytically minimize it by using the _subdifferential_, and thereby
prove that its minimum is attained at the median. The subdifferential is
a set-valued generalisation of the derivative to all convex functions,
including non-differentiable ones. See
[https://en.wikipedia.org/wiki/Subderivative](https://en.wikipedia.org/wiki/Subderivative)
and
[https://towardsdatascience.com/beyond-the-derivative-subderivatives-1c4e5bf20679](https://towardsdatascience.com/beyond-the-derivative-subderivatives-1c4e5bf20679)

Whenever the subdifferential of a convex function at a point includes 0, that
point is a global minimum of the function.
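
To make that concrete, here's a small sketch (mine, not from the links
above): at a point s, the subdifferential of sum |x_i - s| is an
interval, and s is a global minimizer exactly when that interval
contains 0, which singles out the median.

    import numpy as np

    X = np.array([1, 2, 3, 7, 50])

    def subdiff(s):
        # each |x_i - s| contributes +1 (x_i < s), -1 (x_i > s),
        # or the whole interval [-1, 1] (x_i == s)
        below, above = np.sum(X < s), np.sum(X > s)
        ties = np.sum(X == s)
        return below - above - ties, below - above + ties

    for s in X:
        lo, hi = subdiff(s)
        print(s, (lo, hi), "<- minimizer" if lo <= 0 <= hi else "")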

~~~
amitport
"Whenever the subderivative of a convex function at a point includes 0, that
point is a global minimum of the function"

* you mean a _strictly_ convex function

~~~
ogogmad
In the non-strictly convex case, the minimum is still a global minimum; it's
just not unique.

~~~
amitport
[edit] right, you're correct, sorry

------
amitport
The Wikipedia version, for anyone interested:
[https://en.wikipedia.org/wiki/Central_tendency#Solutions_to_variational_problems](https://en.wikipedia.org/wiki/Central_tendency#Solutions_to_variational_problems)

Missing from the article is that using L_∞ will give you the midrange.
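
A quick sketch of that (my own toy data): the minimizer of the L_p
discrepancy slides from the median through the mean toward the midrange
as p grows.

    import numpy as np
    from scipy.optimize import minimize_scalar

    X = np.array([0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 10])
    for p in [1, 2, 10, 100]:
        res = minimize_scalar(lambda s: np.sum(np.abs(X - s) ** p),
                              bounds=(X.min(), X.max()), method="bounded")
        print(p, round(res.x, 2))  # p=1: median 3, p=2: mean ~3.91
    print("midrange:", (X.min() + X.max()) / 2)  # 5.0, the large-p limit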

~~~
ImaCake
Thanks, this is a great way to extend and generalise OP's post. I don't
think I would have been able to understand the Wikipedia argument without
the more intuitive explanation in the original post.

As an aside, I wonder if math is sometimes difficult to
understand/explain because the language is so constrained. Obviously,
careful use of words is important for translating mathematical
definitions verbatim, but I think a more relaxed approach to word
translations of math equations would benefit everyone below second-year
math/statistics at university, even if they are left with a technically
incomplete/incorrect definition or intuition for a given concept.

------
tomrod
I like Mr. White's writing. I interacted with him a bit several years ago
around Julia's dataframe packages. He is a brilliant guy.

I think this insight (measures of centrality), together with an article
posted to HN 4-5 years back framing quantum theory as a separate branch
of statistics (statistics under a different L^p norm), really helped tie
together a lot of quantitative threads for me.

~~~
tobbe2064
Do you have a link to the article? It sounds very interesting

~~~
sls
Not the person to whom you were speaking but their description reminds me of
Scott Aaronson's work e.g.
[https://www.scottaaronson.com/democritus/lec9.html](https://www.scottaaronson.com/democritus/lec9.html)

which has been discussed on HN
[https://news.ycombinator.com/item?id=19161028](https://news.ycombinator.com/item?id=19161028)

~~~
tobbe2064
Thanks! Looks really interesting

------
cvigoe
For anyone interested in seeing this in the context of statistical decision
theory, see the final page of this lecture notes pdf:

[http://www.stat.cmu.edu/~siva/705/lec16.pdf](http://www.stat.cmu.edu/~siva/705/lec16.pdf)

In particular, for the parametric estimation setting, it can be shown that the
Bayes Estimator under L_0 loss corresponds to simply finding the posterior
distribution of a parameter given data, and then finding the mode of this
distribution. Similarly, for L_1 loss, all we need do is find the median of
the posterior distribution. And under L_2 loss, it’s just the expectation of
the posterior. CMU’s 705 course is a great intro to statistical decision
theory and stats more broadly for anyone interested!
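
A Monte Carlo sketch of that correspondence (my own toy example, not from
the notes): draw from a skewed "posterior", Gamma(3, 1), whose mode is 2,
median is about 2.67, and mean is 3, and minimize each empirical risk
over a grid.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.gamma(shape=3.0, scale=1.0, size=100_000)  # posterior draws

    grid = np.linspace(0.0, 8.0, 401)
    risk_l2 = [np.mean((theta - a) ** 2) for a in grid]
    risk_l1 = [np.mean(np.abs(theta - a)) for a in grid]
    risk_l0 = [np.mean(np.abs(theta - a) > 0.05) for a in grid]  # 0-1 proxy

    print("L2:", grid[np.argmin(risk_l2)], "vs mean", theta.mean())
    print("L1:", grid[np.argmin(risk_l1)], "vs median", np.median(theta))
    print("L0:", grid[np.argmin(risk_l0)], "vs mode", 2.0)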

(Disclaimer: I am a CMU PhD student in the machine learning department so I am
somewhat biased to thinking these notes are good having taken this course
myself)

------
kevinventullo
And minimizing E_∞ means taking the midpoint between the max and min values!

------
pps43
I like the elevator bank analogy. You are waiting for an elevator. There are
four equidistant elevator doors, all arranged in one line. Elevator #3 is out
of order. Where do you stand?

Turns out the answer depends on what you want to optimize: maximum
walking distance, average walking distance, or average squared walking
distance.
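
Working it out (my numbers, assuming doors at positions 1, 2, 3, 4 and
that you walk to whichever working door opens): minimizing the worst-case
walk puts you at the midrange of the working doors, the average walk at
their median, and the average squared walk at their mean.

    import numpy as np

    doors = np.array([1.0, 2.0, 4.0])  # door 3 is out of order
    grid = np.linspace(1.0, 4.0, 301)
    walk = np.abs(grid[:, None] - doors[None, :])  # distance to each door

    print("min worst-case walk:", grid[np.argmin(walk.max(axis=1))])   # 2.5
    print("min average walk:", grid[np.argmin(walk.mean(axis=1))])     # 2.0
    print("min avg sq walk:", grid[np.argmin((walk ** 2).mean(axis=1))])  # ~2.33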

------
k__
Does this mean we can calculate when to use the mean, mode, or median,
and don't need to have an intuition for it?

Calculate every value, check which has the lowest discrepancy, and be
done.

~~~
JulianWasTaken
Part of the point is that any of those metrics is the correct one if it
models the discrepancy you want.

You pick the one that treats discrepancy the way that's appropriate.

E.g., using the mode is like saying "if you're not the right answer, you're
all equally bad".

Using the median is like saying "if you're not the right answer, you're as
wrong as however far off you are".

Using the mean is like saying "if you're not the right answer, you're
more and more wrong the further away you are, and the penalty itself
grows faster and faster, so you're really incentivized to be closer".

Which of these is appropriate depends on the real thing you care about in
whatever your numbers actually signify.

~~~
k__
Could you elaborate on the difference between mean and median a bit more?

------
buddhiajuke
This is true, but isn’t “it”.

If ten boxes fall at random from an Amazon Prime truck and you estimate
the mean value of the parcels per m^3, you can extrapolate the total
value of the loot in the truck.
------
gotoeleven
There's further discussion of these ideas here:
[https://ram.rachum.com/median/](https://ram.rachum.com/median/)

------
dang
If curious see also

2017
[https://news.ycombinator.com/item?id=15946239](https://news.ycombinator.com/item?id=15946239)

