
Modes, Medians and Means: A Unifying Perspective - meribold
http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/
======
bowaggoner
There is a much wider generalization here which is studied under the name of
"property elicitation" in computer science, machine learning, and statistics.

The generic questions are: given a loss function, what "property" of the
distribution minimizes average loss; and given a "property", what are all the
loss functions that elicit it? For example, Bregman divergences are
(essentially) all the losses that "elicit" the mean of a distribution. And for
any monotone continuous function g(), the loss |g(x) - g(s)| also elicits the
median; such losses are essentially the only ones that do.
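
To make the median claim concrete, here's a quick numerical sketch (my own
illustration, not from the paper; it assumes numpy/scipy, and g(t) = log(1 + t)
is an arbitrary monotone choice):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    x = rng.exponential(size=10001)  # odd count -> unique sample median

    # Loss |g(x) - g(s)| for the monotone continuous g(t) = log(1 + t)
    g = np.log1p
    res = minimize_scalar(lambda s: np.mean(np.abs(g(x) - g(s))),
                          bounds=(x.min(), x.max()), method="bounded")

    print(res.x, np.median(x))  # both are approximately the sample median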

Apologies for self-promotion, but you can read more at references on this page
(disclaimer: I'm one of the researchers who posted it):
[https://sites.google.com/site/informationelicitation/](https://sites.google.com/site/informationelicitation/)

or tutorials on this subject at my blog:
[http://bowaggoner.com/blog/series.html#convexity-elicitation](http://bowaggoner.com/blog/series.html#convexity-elicitation)

~~~
onurcel
This is awesome. Is there any video of the presentation at EC'16?

~~~
bowaggoner
Unfortunately no video, sorry. Glad you find it interesting!

------
FabHK
One small extension:

* L_0 -> mode

* L_1 -> median

* L_2 -> mean

* L_infinity -> midrange, I think

that is, (smallest observation + largest observation)/2
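
Here's a small numerical sketch of the whole table (my own illustration, with
made-up data, assuming numpy; a small positive p stands in for the p -> 0
limit in the mode case):

    import numpy as np

    x = np.array([2.0, 3.0, 3.0, 5.0, 8.0])      # mode 3, median 3, mean 4.2
    s = np.union1d(np.linspace(2, 8, 200001), x)  # candidate summaries
    d = np.abs(s[:, None] - x[None, :])           # |s - x_i| for every pair

    print(s[np.argmin((d ** 0.01).sum(axis=1))])  # ~L_0  -> mode     (3.0)
    print(s[np.argmin((d ** 1).sum(axis=1))])     # L_1   -> median   (3.0)
    print(s[np.argmin((d ** 2).sum(axis=1))])     # L_2   -> mean     (~4.2)
    print(s[np.argmin(d.max(axis=1))])            # L_inf -> midrange (5.0)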

(BTW, the author is also a big contributor to the wonderful Julia language, I
believe)

~~~
bowaggoner
Hmm, that example is really interesting. I think you're right once we
formalize L_infinity, because the goal becomes minimizing the maximum
possible distance. But the other scores can be phrased as per-point penalties,
with the goal of minimizing the expected penalty (or the total penalty summed
over the data points), and it's not clear how to phrase L_infinity that way,
because the "penalty" would be infinite for any outcome...

~~~
mattb314
I might be missing some context here, but usually the L_infinity norm means
the max norm, or the maximum absolute difference over all components (data
points). This gives (min + max)/2 as GP suggested.

Wikipedia entry:
[https://en.wikipedia.org/wiki/Norm_(mathematics)#Maximum_nor...](https://en.wikipedia.org/wiki/Norm_\(mathematics\)#Maximum_norm_\(special_case_of:_infinity_norm,_uniform_norm,_or_supremum_norm\))

For what it's worth, I think the confusion in your comment and in jules's below
comes from the idea that the lp-norm is sum(|x_i|^p), when it's actually
[sum(|x_i|^p)]^(1/p). Since raising to the 1/p power is monotonic, it doesn't
make a difference when minimizing or maximizing the norm, so people often use
the two forms interchangeably.
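
A one-line check of that monotonicity point (my own sketch, assuming numpy):

    import numpy as np

    x = np.array([1.0, 2.0, 6.0])
    s = np.linspace(0, 7, 70001)
    cost = np.sum(np.abs(s[:, None] - x[None, :]) ** 3, axis=1)

    # The argmin is identical with or without the final 1/p power.
    assert np.argmin(cost) == np.argmin(cost ** (1 / 3))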

~~~
bowaggoner
You're right, and we're on the same page about the solution. The reasoning
behind my "interesting" comment is that the formula you gave is not well-
defined for p = infinity, although we can define it as the limit of that
expression as p --> infinity. Furthermore, for p < infinity the lp-norm
assigns a well-defined penalty to each x, and the goal is to minimize the sum
of the penalties. That's not really true for p = infinity.

~~~
FabHK
In the limit it's indeed the max of the absolute differences, so that's how
the infinity norm is conventionally defined (even though the base formula
itself is not well defined, just as with p = 0). And when you try to minimise
the max of the distances, you get the midrange.

------
Scene_Cast2
I should note that squared-error loss is not the only one that gives the
average. Log-loss is another, for example (you can prove it by taking the
derivative of the minimization objective, setting it to zero, and solving).
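
For the binary-outcome case, that claim is easy to check numerically (my own
sketch, assuming numpy/scipy; the data is made up):

    import numpy as np
    from scipy.optimize import minimize_scalar

    y = np.array([0, 0, 1, 1, 1, 0, 1])  # binary outcomes, mean 4/7

    def log_loss(s):
        return np.mean(-y * np.log(s) - (1 - y) * np.log(1 - s))

    res = minimize_scalar(log_loss, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x, y.mean())  # both approximately 0.5714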

------
jdonaldson
These relationships are pretty clear once you see the other distributional
metrics. Going further, there's skewness and kurtosis for the third and fourth
statistical moments, respectively.

~~~
FabHK
Not sure what you're saying? The article is about "central tendency" or
"location" statistics (i.e. the "first moment"), and how 3 common ones pop out
of minimising different distances (L0, L1, L2).

It doesn't even mention variance (second central moment), let alone skew or
kurtosis?

~~~
jdonaldson
I'm saying distance techniques are related to covariance, and there's a lot of
useful information from statistics when you go to higher moments. I never see
this in ML, and I'm wondering why, so I'm pointing it out.

------
enriquto
There is so much unexplained beauty in this text!

Notice that the unifying perspective depends on a continuous parameter p,
which is 2 for the mean, 1 for the median, and 0 (in the limit) for the mode.
Thus there is a continuous family of statistics that interpolates between
these three things!

He doesn't mention it either, but this means that you can define modes of a
continuous variable (without resorting to histogram bins).
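
A sketch of that idea (my own, assuming numpy; a small positive p approximates
the mode-producing limit, and this estimator is fairly noisy in practice):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=3.0, scale=1.0, size=1000)  # true mode is 3

    s = np.linspace(x.min(), x.max(), 2001)        # candidate summaries
    cost = np.sum(np.abs(s[:, None] - x[None, :]) ** 0.1, axis=1)
    print(s[np.argmin(cost)])  # roughly 3, with no histogram bins in sight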

------
saurik
For mode to fit into this unifying framework, this article assumes that 0^0 is
not indeterminate and is instead simply 0 (instead of the more usual
assumption of 1).

[https://en.m.wikipedia.org/wiki/Zero_to_the_power_of_zero](https://en.m.wikipedia.org/wiki/Zero_to_the_power_of_zero)

[http://mathforum.org/dr.math/faq/faq.0.to.0.power.html](http://mathforum.org/dr.math/faq/faq.0.to.0.power.html)

~~~
ssivark
Mathematically, it's a question about the order of limits. Evaluating 0^0 is
akin to considering a^b as a --> 0 and b --> 0, and the order in which the two
limits are taken makes a difference. If 'a' approaches zero while 'b' is still
positive, the answer is 0. If 'b' approaches zero while 'a' is still positive,
the answer is 1.

Pragmatically, this means that 0^0 is not sufficiently specified (just
meaningless symbols) unless you prescribe the context/meaning with which
you're using it. And with regard to defining summary statistics, the article
talks about a context where the exponent scans over different (continuous)
values, passing by {2, 1, 0} along the way.

It should be an interesting exercise to consider what happens when the
exponent approaches positive or negative infinity, i.e. large-magnitude
positive or negative numbers.

~~~
thaumasiotes
> Mathematically, it's a question about the order of limits. Evaluating 0^0
> is akin to considering a^b as a --> 0 and b --> 0, and the order in which
> the two limits are taken makes a difference.

Mathematically, you don't take limits as a sequence of single-variable limits.
You take them by constricting an n-dimensional sphere (two points / circle /
surface of a sphere / etc.) around the point whose limit you're interested in.

This immediately implies that there are infinitely many possible approaches to
the point (0, 0), rather than only two as there would be if you were taking it
as two single-variable limits in sequence.

~~~
gugagore
What are the different limiting values of a^b (a -> 0, b -> 0) considering all
paths in the a-b plane? If they are the same points as those generated by
considering only "Manhattan" paths, then I think it's appealing to think of it
as a sequence of single-variable limits.

~~~
thaumasiotes
The figure in
[https://en.wikipedia.org/wiki/Zero_to_the_power_of_zero#Cont...](https://en.wikipedia.org/wiki/Zero_to_the_power_of_zero#Continuous_exponents)
strongly suggests that all nonnegative real numbers are limit points of the
function f(x,y) = x^y as (x,y) approaches (0,0). It explicitly states that 0,
0.5, 1, and 1.5 are all limit points.

------
FabHK
Mods, could you maybe put (2013) in the title? Content is timeless, but just
as a heads up that one might have come across it before.

------
ganonm
I remember introducing someone to the concept of variance in a set of data,
and I used a very similar approach. Variance seems like an arbitrary (but
obvious) definition, but in fact it can be derived from first principles by
looking for the simplest possible function that, firstly, has some dependency
on the difference between values and the arithmetic mean; secondly, is
independent of whether the differences are positive or negative (to the right
or left of the mean); and thirdly, does not depend on the size of the data set
(i.e. duplicating each member of a data set would leave the variance
unaffected). When you consider each of these, the equation for variance arises
very naturally.

Taking the arithmetic difference satisfies the first property

Squaring each difference satisfies the second property

Taking the arithmetic mean satisfies the third property

Var(X) = E[(X - mean)^2]
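
The second and third properties are easy to check numerically (my own sketch,
with made-up data, assuming numpy):

    import numpy as np

    x = np.array([1.0, 4.0, 4.0, 9.0])
    var = np.mean((x - x.mean()) ** 2)  # Var(X) = E[(X - mean)^2]

    # Property 2 holds because (-d)^2 = d^2 for each deviation d.
    # Property 3: duplicating every data point leaves the variance unchanged.
    x2 = np.repeat(x, 2)
    assert np.isclose(var, np.mean((x2 - x2.mean()) ** 2))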

~~~
sesqu
I've seen this argument before, and it annoys me. In particular, to satisfy
the second property there is no need for a multiplication - simply taking the
absolute value works. This yields the mean absolute deviation error function,
or E_1 in the article.

To get variance, you need to be chiefly concerned with distributions that
_have_ a variance in the first place - and then the additional information
contained in that statistic has descriptive power beyond that of the median.

------
joshgel
While I loved this post, it could also have been helpful to include more info
about the Pythagorean means (geometric mean and harmonic mean). They're worth
looking into depending on the type of variability seen in your data.

EDIT: just saw this was mentioned in one of the comments...

~~~
no_identd
I agree. I'd also add that generalizations of the median to higher dimensions
should get explicit coverage, as well as a few related (and often neglected)
concepts. Since the Wikipedia articles on this constitute a hot mess of not
linking to each other in a sane way, I'll provide a list of articles on this
topic here:

[https://en.wikipedia.org/wiki/Fr%C3%A9chet_mean](https://en.wikipedia.org/wiki/Fr%C3%A9chet_mean)

[https://en.wikipedia.org/wiki/Mode_(statistics)#Use](https://en.wikipedia.org/wiki/Mode_\(statistics\)#Use)

[https://en.wikipedia.org/wiki/Central_tendency#Measures](https://en.wikipedia.org/wiki/Central_tendency#Measures)

[https://en.wikipedia.org/wiki/Nonparametric_skew#Relationshi...](https://en.wikipedia.org/wiki/Nonparametric_skew#Relationships_between_the_mean,_median_and_mode)

[https://en.wikipedia.org/wiki/Geometric_median](https://en.wikipedia.org/wiki/Geometric_median)

[https://en.wikipedia.org/wiki/Weber_problem#Definition_and_h...](https://en.wikipedia.org/wiki/Weber_problem#Definition_and_history_of_the_Fermat,_Weber,_and_attraction-repulsion_problems)

[https://en.wikipedia.org/wiki/Centerpoint_(geometry)](https://en.wikipedia.org/wiki/Centerpoint_\(geometry\))

[https://en.wikipedia.org/wiki/Trimean](https://en.wikipedia.org/wiki/Trimean)

[https://en.wikipedia.org/wiki/K-medians_clustering](https://en.wikipedia.org/wiki/K-medians_clustering)

[https://en.wikipedia.org/wiki/Medoid](https://en.wikipedia.org/wiki/Medoid)

[https://en.wikipedia.org/wiki/Generalized_mean](https://en.wikipedia.org/wiki/Generalized_mean)

[https://en.wikipedia.org/wiki/Quasi-arithmetic_mean](https://en.wikipedia.org/wiki/Quasi-arithmetic_mean)

[https://en.wikipedia.org/wiki/Lehmer_mean#Special_cases](https://en.wikipedia.org/wiki/Lehmer_mean#Special_cases)

[https://en.wikipedia.org/wiki/Logarithmic_mean#Generalizatio...](https://en.wikipedia.org/wiki/Logarithmic_mean#Generalization)

[https://en.wikipedia.org/wiki/Stolarsky_mean#Special_cases](https://en.wikipedia.org/wiki/Stolarsky_mean#Special_cases)

[https://en.wikipedia.org/wiki/Facility_location](https://en.wikipedia.org/wiki/Facility_location)

See Also:

[https://en.wikipedia.org/wiki/Problem_of_points](https://en.wikipedia.org/wiki/Problem_of_points)

[https://en.wikipedia.org/wiki/Chebyshev%27s_inequality](https://en.wikipedia.org/wiki/Chebyshev%27s_inequality)

[https://en.wikipedia.org/wiki/L-estimator](https://en.wikipedia.org/wiki/L-estimator)

[https://en.wikipedia.org/wiki/M-estimator](https://en.wikipedia.org/wiki/M-estimator)

~~~
twic
I just came across the Stolarsky mean in a completely separate context. It
seems like a really useful framework for unifying different kinds of central
measures, similar to the idea in the original article here.

------
arnioxux
The Generalized mean[1] linked to in the blog comments was similarly
insightful.

It unifies the inequalities:

max > root mean square > arithmetic mean > geometric mean > harmonic mean >
min

that I remember from high school math competitions.[2]

[1]
[http://en.wikipedia.org/wiki/Power_mean#Special_cases](http://en.wikipedia.org/wiki/Power_mean#Special_cases)

[2] [https://artofproblemsolving.com/wiki/index.php?title=Root-Me...](https://artofproblemsolving.com/wiki/index.php?title=Root-Mean_Square-Arithmetic_Mean-Geometric_Mean-Harmonic_mean_Inequality)
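
A small sketch of the generalized mean and that chain of inequalities (my own
illustration, assuming numpy; p = 0 is defined by its limit, the geometric
mean):

    import numpy as np

    def power_mean(x, p):
        x = np.asarray(x, dtype=float)
        if p == 0:
            return np.exp(np.mean(np.log(x)))  # geometric mean (the p -> 0 limit)
        return np.mean(x ** p) ** (1 / p)

    x = [1.0, 2.0, 4.0]
    # min <= HM (p=-1) <= GM (p=0) <= AM (p=1) <= RMS (p=2) <= max
    vals = [min(x)] + [power_mean(x, p) for p in (-1, 0, 1, 2)] + [max(x)]
    assert all(a <= b for a, b in zip(vals, vals[1:]))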

~~~
no_identd
You might appreciate the other higher generalizations and related measures I
pointed to in my comment here:

[https://news.ycombinator.com/item?id=15947157](https://news.ycombinator.com/item?id=15947157)

------
RoboTeddy
Is there any fundamental reason to measure discrepancy by abs(s - x_i)^2
rather than, say, abs(s - x_i)^1.5? Is there something special about 2 in this
context, or is it just a social convention that seems to work pretty well?

~~~
moultano
Yes! Lots of them, actually.

1. The Gaussian distribution is important (sums of independent variables
converge to it), and the squared difference from the mean characterizes it.

2. Euclidean space is important, and squared errors stay the same if you
rotate everything. (Other errors don't.)

3. Linear regression with squared error has a closed-form solution. Other
types of models converge very fast when using it, because the further you are
from optimal, the bigger your gradient is.
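
Point 3 in code (my own sketch, assuming numpy; the data is synthetic): the
squared-error fit drops out of the normal equations in closed form.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + feature
    y = X @ np.array([2.0, -1.5]) + 0.1 * rng.normal(size=100)

    # Closed form: beta = (X^T X)^{-1} X^T y -- no iterative optimization needed.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta)  # close to [2.0, -1.5]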

------
mr_toad
If your loss function is an actual financial $ loss (or revenue), then the
arithmetic mean times n gives the best estimate of the total/long-run expected
loss.

If the distribution of losses is skewed or has outliers then estimates other
than the mean (median, trimmed means etc) often under-estimate total losses.

Under-estimating total losses in the long run could be very bad for business.
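
A small illustration with a skewed loss distribution (my own sketch, assuming
numpy; the lognormal is just a stand-in for skewed losses):

    import numpy as np

    rng = np.random.default_rng(0)
    losses = rng.lognormal(mean=0.0, sigma=2.0, size=10_000)  # heavily skewed

    n = losses.size
    print(losses.sum())           # the actual total
    print(losses.mean() * n)      # equals the total by construction
    print(np.median(losses) * n)  # badly under-estimates the total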

------
jaddood
In case anyone wants to dig into the follow up on Lp norms, here it is:
[http://www.johnmyleswhite.com/notebook/2013/03/22/using-
norm...](http://www.johnmyleswhite.com/notebook/2013/03/22/using-norms-to-
understand-linear-regression/)

------
known
Painfully, American families are learning the difference between median and
mean

[https://qz.com/260269/painfully-american-families-are-learni...](https://qz.com/260269/painfully-american-families-are-learning-the-difference-between-median-and-mean/)

------
alvis
The beauty of math is often missed. That's why we say math is an art!

------
sytelus
Is there similar generalization for geometric mean and geometric median?

------
kensai
"To sum up, we’ve just seen that the three most famous single number summaries
of a data set are very closely related: they all minimize the average
discrepancy between s

and the numbers being summarized. They only differ in the type of discrepancy
being considered:

    
    
        The mode minimizes the number of times that one of the numbers in our summarized list is not equal to the summary that we use.
        The median minimizes the average distance between each number and our summary.
        The mean minimizes the average squared distance between each number and our summary."

~~~
fjsolwmv
Making the pattern even stronger: "number of times" is the average 0th power
of distance, "distance" is the 1st power of distance, and "squared" is the 2nd
power.

------
ycmbntrthrwaway
Math does not render unless I allow cloudflare.com to execute scripts. Why
can't we just self-host scripts, is it that hard?

~~~
bowaggoner
Unfortunately MathJax is a bit large and nontrivial to host yourself. And it
recently switched to being hosted on Cloudflare instead of mathjax.org.

I struggled with this choice myself and, so far, have decided to do the same
as the author: use a noscript tag to explain to the reader why I'm loading
Cloudflare code.

------
moomin
On my phone, this entire article reads “blah blah blah [Math processing error]
blah blah blah [Math processing error] blah blah blah [Math processing error]
blah blah blah [Math processing error]”

~~~
Scene_Cast2
On my Android, the math is small but readable.

~~~
k__
Same here

------
vorg
To get the median of an even number of values, you must calculate the mean of
the middle two values. Therefore the definition of the median relies on the
mean already being defined when working with an even number of values, which
isn't really explained in the post.

In fact, there's a whole spectrum of averages defined with mean and median on
each end, depending on how many outliers you eliminate. For example, if you
have eight numbers, you can define a spectrum of four averages:

      2,3,5,7,11,13,17,19 // mean, here 9.6250
      3,5,7,11,13,17      // mean with outlier on each side stripped, here 9.3333
      5,7,11,13           // mean of central two quartiles, here 9.0000
      7,11                // median (i.e. mean of center two numbers), here 9.0000

You could then repeat the process on that spectrum of averages to get a
shorter spectrum, here [9.2396 (mean), 9.1667 (median)], recursively, until
you have one "mean-median" left, here 9.2031.
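
A sketch of that recursion in code (my own; plain Python, using the numbers
from the comment):

    def mean_median(xs):
        # Average the sorted list, strip one value off each end, and repeat;
        # then recurse on the resulting spectrum until one number remains.
        xs = sorted(xs)
        spectrum = []
        while len(xs) >= 2:
            spectrum.append(sum(xs) / len(xs))
            xs = xs[1:-1]
        if len(xs) == 1:  # odd count: the leftover middle value is the median
            spectrum.append(xs[0])
        return spectrum[0] if len(spectrum) == 1 else mean_median(spectrum)

    print(mean_median([2, 3, 5, 7, 11, 13, 17, 19]))  # 9.203125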

I wonder how this fits in with the explanation in the post.

~~~
j2kun
It relies on a quantity being defined which happens to be equal to the mean,
but that value can be arrived at without having defined the mean a priori. The
minimizers of (7, 11) with respect to the 1-norm defined in the post include
all values between 7 and 11; you need not have defined the mean to state this
optimization problem. I suppose which median you pick can be considered a
heuristic.
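
A quick check of that claim (my own sketch, plain Python):

    # Any s between 7 and 11 gives the same total L1 distance to {7, 11}:
    for s in (7.0, 8.5, 9.0, 11.0):
        print(abs(7 - s) + abs(11 - s))  # 4.0 every time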

I think "removing outliers" is also snugly in the camp of practical
heuristics. A mathematical definition might not want to automatically
eliminate outliers when "outlier" is also subject to a choice of definition.

