There is a much wider generalization here which is studied under the name of "property elicitation" in computer science, machine learning, and statistics.
The generic question is: Given a loss function, what "property" of the distribution minimizes average loss; and given a "property", characterize all such loss functions. For example, Bregman divergences are (essentially) all losses that "elicit" the mean of a distribution. If you have any monotone continuous function g(), then |g(x) - g(s)| actually also elicits the median, and these are essentially all that do.
Hmm, that example is really interesting. I think you're right once we formalize L_infinity, because the goal will be to minimize the maximum possible distance. But the other scores can be phrased as penalties and minimizing expected penalty or total penalty summed over the data points, and it's not clear how to phrase L_infinity this way because the "penalty" would be infinity for any outcome...
I might be missing some context here, but usually the L_infinity norm means the max norm, or the maximum absolute difference over all components (data points). This gives (min + max)/2 as GP suggested.
For what it's worth, I think the confusion of your comment and jules's below comes from the idea that the lp-norm is sum(x_i^p), when it's actually [sum(|x_i|^p)]^(1/p) (please forgive my notation). Since raising to the 1/p power is monotonic, it doesn't actually make a difference in minimizing or maximizing the norm, so people often use them interchangeably.
You're right and we're on the same page about the solution. The reasoning behind my comment of "interesting" is that the formula you gave is not well-defined for p=infinity, although we can define it as the limit of that expression as p-->infinity. Furthermore, for p < infinity the lp-norm assigns a well-defined penalty to each x, and the goal is to minimize the sum of the penalties. But that's not really true for p=infinity.
In the limit it's indeed the max of the absolute differences, so that's how you conventionally define the infinity norm (even though the base formula itself is not well defined, just as with p=0). And when you try to minimise the max of the distances, you get the midrange.
I just want to give a bit more intuition on the L_p norms, basically a measure of length of a vector. A norm then induces a distance: the distance between a and b is the length of (a-b).
We assume that for one dimension, we can tell the length: length of d is just |d|, the absolute value of d. But what if we have many dimensions i=1..N?
Then we can use this L_p norm, and the idea is basically just that we set the norm to
[ sum_i=1..N |d_i|^p ] ^(1/p)
Now, what does that give us? Note a few things:
* when all of the d_i are zero except for one, we get back just the value that's not zero (in absolute value), because we go to the p-th power and then back. So, we recover the intuition (and number) from the one-dimensional case.
* for p=1, we just add all the absolute values of the d_i. That just gives us the city-block/Manhattan norm.
L1 norm of [3, 4] = 7.
L1 norm of [100, 1, 1] = 102.
* for p=2, we square all values, add, and take the square root. This, by repeated application of Pythagoras, gives us the usual "Euclidean" distance in space: if you're 3 m west and 4 m north, then you're 5 = sqrt(9+16) m away.
L2 norm of [3, 4] = 5.
L2 norm of [100, 1, 1] = 100.00999
* now, think about what happens if d_1, say, is very big, and all the others are tiny. If we add them (L_1), we just get the sum, a bit bigger than d_1. If we add the squares, and then take the square root, we'll be even closer to d_1, because the square of d_1 is going to be very big, and the other squares really small, and then we add and take the square root to basically get back d_1. See example.
* Now imagine p=4, or 100. We take a huge power of all d_i, then add, then "invert" the power. All the terms in the sum are going to be dwarfed, except the biggest one!
L4 norm of [3, 4] = 4.284572295
L4 norm of [100, 1, 1] = 100.0000005
* Thus, the infinity norm goes towards the maximum of the d_i. And that's what it's defined as: max_i=1..N |d_i|
* So, a big p tends to exaggerate the differences between the d_i, and only the biggest one counts. Conversely, a small p tends to make the d_i more similar, until it only really matters whether a d_i is 0 or not. In other words, you just count how many are not zero. And that motivates the L0 norm.
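Here's a quick numerical check of those examples (a minimal Python sketch of my own; the vectors are the same ones from the bullet points above):

  # Minimal sketch: compute the L_p norm for a few p and compare with the max norm.
  def lp_norm(v, p):
      # [ sum_i |v_i|^p ] ^ (1/p)
      return sum(abs(x) ** p for x in v) ** (1.0 / p)

  for v in ([3, 4], [100, 1, 1]):
      for p in (1, 2, 4, 100):
          print(f"L{p} norm of {v} = {lp_norm(v, p):.9f}")
      # As p grows, the L_p norm approaches the largest |v_i|:
      print(f"L_inf norm of {v} = {max(abs(x) for x in v)}")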
Now, think about the unit circle in 2D.
Here is some ASCII art:
L0:
|
-+-
|
L1:
/\
/ \
\ /
\/
L2:
-
/ \
( )
\ _ /
Ok, that's my best ASCII rendering of a circle...
L_infty:
___
| |
|___|
You see that the 4 points up/down and left/right are fixed at (-1, 0), (0, 1), (1, 0), (0, -1), because of the property that if all components are zero except one, we fall back to the ordinary one-dimensional distance. But then, for other directions, the points we consider close enough to be within the unit circle "bulge out" as we increase p.
I should note that squared-error loss is not the only one that gives the average. Log-loss is another one, for example (you can prove it by taking the derivative of your minimization, setting it to zero and solving).
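A quick way to check that numerically (my own sketch; it assumes the data lie in the open interval (0,1) so that log-loss is defined): minimize both losses over a grid of candidate summaries s and compare the minimizers with the mean.

  import math

  # Sketch: both squared-error loss and log-loss are minimized (over s) at the mean.
  data = [0.1, 0.2, 0.2, 0.7, 0.9]   # hypothetical values in (0, 1)

  def squared_loss(s):
      return sum((s - x) ** 2 for x in data)

  def log_loss(s):
      return sum(-(x * math.log(s) + (1 - x) * math.log(1 - s)) for x in data)

  grid = [i / 10000 for i in range(1, 10000)]   # candidate s in (0, 1)
  print("mean           :", sum(data) / len(data))
  print("argmin sq loss :", min(grid, key=squared_loss))
  print("argmin log loss:", min(grid, key=log_loss))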
These relationships are pretty clear once you see the other distributional metrics. Going further, there's skewness and kurtosis for the third and fourth statistical moments, respectively.
Not sure what you're saying? The article is about "central tendency" or "location" statistics (ie "first moment"), and how 3 common ones pop out of minimising different distances (L0, L1, L2).
It doesn't even mention variance (second central moment), let alone skew or kurtosis?
I'm saying distance techniques are related to covariance, and there's a lot of useful info from stats when you go to higher moments. I never see this in ML and I'm wondering why, so I'm pointing it out.
Careful tho:
'Kurtosis' doesn't equal 'Kurtosis', to the point where different R packages have functions named "Kurtosis" that implement different things. Here's an example I stumbled upon today:
"pkg:moments uses the ratio of 4th sample moment to square of second sample moment, while pkg:fBasics uses the variance instead of the second moment and subtracts 3 (for reasons to do with the Normal distribution)."
"The "correct" number for kurtosis depends on your purpose. The number for "kurtosis" that subtracts 3 estimates a "cumulant", which is the standard fourth moment correction weight in an Edgeworth expansion approximation to a distribution. Neither of the numbers described ... compute the "4th sample k statistic", which is "the unique unbiased estimator" for that number (http://mathworld.wolfram.com/k-Statistic.html)."
Notice that the unifying perspective depends on a continuous parameter p, which is 2 for the mean, 1 for the median and 0 for the mode. Thus, there is a continuous family of statistics that interpolate between these three things!
He does not mention it either, but this means that you can define modes of a continuous variable (without resorting to histogram bins).
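To make that concrete, here's a rough grid-search sketch (my own, with a small made-up dataset): minimize sum_i |x_i - s|^p over s for a few values of p. p=2 lands near the mean, p=1 at the median, and a small p pulls the minimizer toward the most frequent value.

  # Sketch: grid-search the s minimizing sum_i |x_i - s|^p for several p.
  data = [2, 2, 2, 5, 6, 7, 9]                  # mean ~ 4.714, median = 5, mode = 2
  grid = [i / 1000 for i in range(0, 10001)]    # candidate summaries s in [0, 10]

  def penalty(s, p):
      return sum(abs(x - s) ** p for x in data)

  for p in (2, 1, 0.1):
      best = min(grid, key=lambda s: penalty(s, p))
      print(f"p = {p:>3}: argmin = {best}")
  # p = 2 gives (roughly) the mean, p = 1 the median,
  # and a small p moves the minimizer toward the most frequent value.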
For mode to fit into this unifying framework, this article assumes that 0^0 is not indeterminate and is instead simply 0 (instead of the more usual assumption of 1).
Mathematically, it's a question regarding the order of limits. Evaluating 0^0 is akin to considering a^b as a-->0 and b-->0, and the order in which the two limits are taken will make a difference. If 'a' approaches zero while 'b' is still positive, the answer will be 0. If 'b' approaches zero while 'a' is still positive, the answer will be 1.
Pragmatically, what this means is that 0^0 is not sufficiently specified (just meaningless symbols) unless you prescribe the context/meaning with which you're using it. And with regards to defining summary statistics, the article talks about a context where the exponent scans over different (continuous) values, passing by {2,1,0} along the way.
It should be an interesting exercise to consider what happens when the exponent approaches positive or negative infinity i.e. large magnitude positive or negative numbers.
> Mathematically, it's a question regarding the order of limits. Evaluating 0^0, is akin to considering a^b as a-->0 and b-->0, and the order in which the two limits are taken will make a difference.
Mathematically, you don't take limits as a sequence of single-variable limits. You take them by constricting an n-dimensional circle (two points / circle / surface of a sphere / etc.) around the point whose limit you're interested in.
This immediately implies that there are infinite possible approaches to the (0,0) point, rather than only two as there would be if you were taking it as two single-variable limits in sequence.
What are the different limiting values of a^b (a->0, b->0) considering all paths in the a-b plane? If they are the same ones you get by considering only "Manhattan" paths, then I think it's appealing to think of it as a sequence of single-variable limits.
The figure in https://en.wikipedia.org/wiki/Zero_to_the_power_of_zero#Cont... strongly suggests that all nonnegative real numbers are limit points of the function f(x,y) = x^y as (x,y) approaches (0,0). It explicitly states that 0, 0.5, 1, and 1.5 are all limit points.
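One way to see why (a little sketch of my own): along the path x = t, y = ln(L)/ln(t), both x and y go to 0 as t approaches 0 from above, yet x^y stays exactly at L, so any positive L is a limit point.

  import math

  # Sketch: x**y = L all along the path x = t, y = log(L)/log(t), while (x, y) -> (0, 0).
  for L in (0.5, 1.0, 1.5):
      for t in (1e-3, 1e-6, 1e-9):
          y = math.log(L) / math.log(t)
          print(f"L = {L}: x = {t:g}, y = {y:+.6f}, x**y = {t ** y:.6f}")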
Right, those are two different names for the L_1 distance, but why does that show that they're not the same? For binary vectors v,w we have abs(v[i] - w[i]) = v[i] xor w[i], and the L_1 distance is the sum of the former whereas the Hamming distance is the sum of the latter.
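For what it's worth, a quick check (my own sketch) that the two coincide on binary vectors:

  # Sketch: for 0/1 vectors, |v[i] - w[i]| = v[i] xor w[i], so L1 distance = Hamming distance.
  v = [0, 1, 1, 0, 1]
  w = [1, 1, 0, 0, 0]
  print(sum(abs(a - b) for a, b in zip(v, w)))   # L1 distance: 3
  print(sum(a ^ b for a, b in zip(v, w)))        # Hamming distance: 3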
I remember introducing someone to the concept of variance in a set of data, and I used a very similar approach. Variance seems like an arbitrary (but obvious) definition, but in fact it can be derived from first principles by just looking for the simplest possible function that: firstly, has some dependency on the difference between values and the arithmetic mean; secondly, is independent of whether the differences are positive or negative (to the right or left of the mean); and thirdly, does not depend on the size of the data set (i.e. duplicating each member of a data set would leave the variance unaffected). When you consider each of these, the equation for variance arises very naturally.
Arithmetic difference satisfies the first property
Squaring each difference satisfies the second property
Taking the arithmetic mean satisfies the third property
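For concreteness, a small sketch (my own) of the formula that falls out, plus a check of the third property (duplicating every member leaves the variance unchanged):

  # Sketch: (population) variance as the mean of squared differences from the mean.
  def variance(xs):
      m = sum(xs) / len(xs)                            # arithmetic mean
      return sum((x - m) ** 2 for x in xs) / len(xs)   # mean of squared differences

  data = [2.0, 3.0, 5.0, 7.0]    # hypothetical values
  print(variance(data))          # 3.6875
  print(variance(data * 2))      # same value: duplicating the data doesn't change it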
I've seen this argument before, and it annoys me. In particular, to satisfy the second property there is no need for a multiplication - taking the absolute value of each difference already does it. This yields the Mean Absolute Deviation error function, or E_1 in the article.
To get variance, you need to be chiefly concerned with distributions that have a variance in the first place - and then the additional information contained in that statistic has descriptive power beyond that of the median.
While I loved this post, it could also have been helpful to include more info about the Pythagorean means (geometric mean and harmonic mean). Worth looking into depending on the type of variability seen in your data.
EDIT: just saw this was mentioned in one of the comments...
I agree. I'd also like to add that I think that generalizations of the median for higher dimensions should get explicit coverage, as well as a few related (and often neglected) concepts. Since the Wikipedia articles on this constitute a hot mess of not linking to each other in a sane way, I'll provide a list of articles on this topic here:
I just came across the Stolarsky mean in a completely separate context. It seems like a really useful framework for unifying different kinds of central measures, similar to the idea in the original article here.
Is there any fundamental reason to measure discrepancy by abs(s - x_i)^2 rather than, say, by abs(s - x_i)^1.5? Is there something special about 2 in this context, or is it just a social convention that seems to work pretty well?
1. The Gaussian distribution is important (sums of many independent variables converge to it) and the squared difference from the mean characterizes it.
2. Euclidean space is important, and squared errors stay the same if you rotate everything. (Other errors don't).
3. Linear regression with squared error has a closed form solution. Other types of models converge very fast when using it because the further you are from optimal, the bigger your gradient is.
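On point 3, a minimal numpy sketch (my own, with made-up data) of that closed-form solution via the normal equations:

  import numpy as np

  # Sketch: linear regression with squared error has the closed form
  # theta = (X^T X)^{-1} X^T y (the normal equations).
  rng = np.random.default_rng(0)
  X = np.column_stack([np.ones(50), rng.normal(size=50)])        # intercept + one feature
  y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)       # true coefficients [2, 3]

  theta = np.linalg.solve(X.T @ X, X.T @ y)
  print(theta)   # close to [2.0, 3.0]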
The ^2 gives the normal distribution, which is special for many reasons.
With ^1 you get the Laplace distribution. This is widely used when the data is sparse (has many elements exactly 0). It is less sensitive to outliers, and in image reconstruction it gives sharper images than ^2.
To expand on the other answers, if you want a notion of angle between things then you need the distance to be defined by an inner product (dot product). This leads to the square. So, in some sense, squared error generalizes our geometric intuitions the best.
If your loss function is an actual financial $ loss (or revenue), then arithmetic-mean times n gives the best estimate of total/long-run expected loss.
If the distribution of losses is skewed or has outliers then estimates other than the mean (median, trimmed means etc) often under-estimate total losses.
Under-estimating total losses in the long run could be very bad for business.
"To sum up, we’ve just seen that the three most famous single number summaries of a data set are very closely related: they all minimize the average discrepancy between s
and the numbers being summarized. They only differ in the type of discrepancy being considered:
The mode minimizes the number of times that one of the numbers in our summarized list is not equal to the summary that we use.
The median minimizes the average distance between each number and our summary.
The mean minimizes the average squared distance between each number and our summary."
Unfortunately MathJax is a bit large and nontrivial to host yourself. And it recently switched to being hosted on Cloudflare instead of mathjax.org.
I struggled with this choice myself and, so far, decided to do the same as the author with a noscript tag to explain to the reader why I'm loading Cloudflare code.
To get the median of an even number of values, you must calculate the mean of the middle two values. Therefore the definition of the median relies on the mean already being defined when working with an even number of values, which isn't really explained in the post.
In fact, there's a whole spectrum of averages defined with mean and median on each end, depending on how many outliers you eliminate. For example, if you have eight numbers, you can define a spectrum of four averages:
2,3,5,7,11,13,17,19 // mean, here 9.6250
3,5,7,11,13,17 // mean with outlier on each side stripped, here 9.3333
5,7,11,13 // mean of central two quartiles, here 9.0000
7,11 // median (i.e. mean of center two numbers), here 9.0000
You could then repeat the process on that spectrum of averages to get a shorter spectrum, here [9.2396 (mean), 9.1667 (median)], recursively until you have one "mean-median" left, here 9.2031.
I wonder how this fits in with the explanation in the post.
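Out of curiosity, here's a little sketch of that construction (my own reading of it): sort, repeatedly strip one value from each end while taking the mean, then recurse on the resulting spectrum until one number is left.

  # Sketch of the "spectrum of averages" construction described above.
  def spectrum(xs):
      xs = sorted(xs)
      out = []
      while len(xs) >= 2:
          out.append(sum(xs) / len(xs))
          xs = xs[1:-1]              # strip one value from each end
      return out

  def mean_median(xs):
      while len(xs) > 1:             # recurse on the spectrum until one value is left
          xs = spectrum(xs)
      return xs[0]

  data = [2, 3, 5, 7, 11, 13, 17, 19]
  print(spectrum(data))       # [9.625, 9.333..., 9.0, 9.0]
  print(mean_median(data))    # ~9.2031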
It relies on a quantity being defined which happens to be equal to the mean, but that value can be arrived at without having defined the mean a priori. The minimizers of (7,11) with respect to the 1-norm defined in the post include all values between 7 and 11. You need not have defined the mean to state this optimization problem. I suppose which median you pick can be considered a heuristic.
I think "removing outliers" is also snugly in the camp of practical heuristics. A mathematical definition might not want to automatically eliminate outliers when "outlier" is also subject to a choice of definition.
If I'm not mistaken, if there is an even number of values, any value between (and including) the middle two values minimises the L1 norm. To choose the "central" (mean) of those two is just a convention.
> To get the median of an even number of values, you must calculate the mean of the middle two values
This is actually not true!
The correct way to do this would be to take the (right-hand) limit of argminₛ ∑ᵢ |xᵢ - s|ᵈ as d → 1⁺. You get back a unique number and it is not the mean of the middle two values.
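One way to probe that numerically (my own sketch, with a hypothetical asymmetric dataset): grid-search the minimizer for exponents just above 1 and compare it with the mean of the middle two values.

  # Sketch: where does argmin_s sum_i |x_i - s|^d end up as d approaches 1 from above?
  data = [1, 2, 3, 100]                          # deliberately asymmetric example
  grid = [i / 1000 for i in range(0, 10001)]     # candidate s in [0, 10]

  def penalty(s, d):
      return sum(abs(x - s) ** d for x in data)

  for d in (1.1, 1.01, 1.001):
      best = min(grid, key=lambda s: penalty(s, d))
      print(f"d = {d}: argmin ~ {best}")
  print("mean of the middle two values:", (2 + 3) / 2)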
Well, very sloppily speaking - the median is the first derivative of the mean, and the mode is the second derivative. That would be another way of thinking about this in terms of rates of change in discrepancy.
Apologies for self-promotion, but you can read more about property elicitation in the references on this page (disclaimer: I'm one of the researchers who posted it): https://sites.google.com/site/informationelicitation/
or tutorials on this subject at my blog: http://bowaggoner.com/blog/series.html#convexity-elicitation