Why isn't everything normally distributed? (johndcook.com)
250 points by tambourine_man on July 20, 2017 | 135 comments



As a (former) astronomer, I've never understood why people assume normal distributions for everything. I understand that there are good theoretical motivations for this --- namely, the normal distribution is the distribution that maximizes entropy for a given mean and variance. But in astronomy, nothing is normally distributed. (At least, nothing comes to mind.) Instead, everything is a power law. The reason for this is that most astrophysical processes are scale free over many orders of magnitude, and if you want a scale free process, it must be distributed as a power law.

There's actually a joke in the field that when you get a new dataset, the first thing you do is fit it to a power law. If that doesn't work, you fit it to a broken power law.


I work in CG animation, and one thing artists do when procedurally generating geometry is tweak the size distribution of things meant to mimic natural phenomena, such as leaves and rocks. They are typically assembling these as expressions, so it's common to first reach for a simple uniform distribution in the form of scale*rand(), but more experienced folks know to go straight to a power law instead. I think it's fascinating that such an extremely high-level way to characterize natural processes could make it into an artist's toolkit like that.
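
A minimal sketch of that contrast (Python with only the standard library; the exponent and bounds are just illustrative, not anything a particular studio uses):

    import random

    def uniform_size(scale):
        # the naive first reach: sizes spread evenly between 0 and scale
        return scale * random.random()

    def power_law_size(x_min, x_max, alpha):
        # inverse-transform sample from a bounded power law p(x) ~ x^(-alpha):
        # lots of small leaves/rocks, a few big ones
        u = random.random()
        a = x_min ** (1.0 - alpha)
        b = x_max ** (1.0 - alpha)
        return (a + u * (b - a)) ** (1.0 / (1.0 - alpha))

    sizes = [power_law_size(0.05, 2.0, alpha=2.5) for _ in range(1000)]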


To build on what 'blevin says, check out for instance https://bost.ocks.org/mike/algorithms/


I don't quite get the relevance to his comment, but wow! What a great link.


A smooth star cluster can be plotted by giving stars a normally distributed variate for each dimension. Normal distribution is the only kind which will result in a radially even cluster - any other distribution results in visible deformation over the dimensions axis. This geometric property of normal distribution is true for any number of dimensions.
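
A quick numerical sketch of that property (numpy assumed; three dimensions for concreteness):

    import numpy as np

    rng = np.random.default_rng(0)

    # Normal per axis: the joint density depends only on x^2 + y^2 + z^2,
    # so the cloud is radially symmetric -- no preferred direction.
    gaussian_cluster = rng.normal(size=(10_000, 3))

    # Same construction with a uniform per axis instead: you get a cube,
    # with visibly more points out along the diagonals than along the axes.
    uniform_cluster = rng.uniform(-1.0, 1.0, size=(10_000, 3))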


I dabble in my own crap gamey stuff, and I use random [mostly 0-1] to power of something all the time; not because I know what I'm doing, but because it's simple and often good enough. I don't know math, but multiplying all sorts of things with each other to see what happens I can do; that and getting random numbers from perlin noise can cover a lot of sins, or at least make for a lot of fun :)


There's this funny "heuristic" converse of the central limit theorem (by Mandelbrot, I think):

The ONLY variables that are normally distributed are those that are averages of many independent, identically distributed variables of finite variance.

Thus, if you cannot find the finite-variance variables that average up to form a variable X, then X is not normally distributed.

(The parts about independent and identically distributed are technical red herrings. The only essential condition is finite variance.)


Averages or sums. But yes, the point still holds.


Well, a sum is just an average multiplied by a constant.


Not to mention that in the wider world, the most common types of processes follow a Zipfian distribution[1]. I think this misunderstanding comes from undergraduate statistics courses teaching you all about normal and uniform distributions, while most students are unlikely to learn the limits and general inapplicability of that model unless they go into research. Most people's first encounter with real statistics is bell curves. Not to mention that the central limit theorem lulls you into a false sense of security that "at the end of the day" everything will look normal.

[1]: https://en.m.wikipedia.org/wiki/Zipf%27s_law


I think it has a lot to do with people in the social sciences wishing that the CL Theorem could reduce everything they sample to a single mathematical model, one their tools can apply magically when fed the data.


There's also a lot of cultural memory from the times when computation was really hard. Much can be improved in the most data-impoverished social sciences with things like MCMC or even bootstrapping of frequentist statistics, buuuut... statistics are tools, not the focus of social scientists (who design and validate their hypotheses in a primarily qualitative fashion), so the old ways hang on for a long time.


If I hadn't recently read Benoit Mandelbrot's book "The Misbehavior of Markets," I might dismiss this as empty cynicism. But I think there is some truth to it.

Macro and financial economists have traditionally been the worst offenders.


I don't mean it to be cynical, it's just my experience that a lot of data gathering in the social sciences (my experience was in education) makes terrible assumptions, often implicitly, the worst being that the variables are IID. And it's very obvious from the publication that the number crunching is some cargo-cult approach they probably learned from their mandated semester or two in research methods. (Okay, that last bit was cynical!)


I could probably be characterized as a social scientist, at least a behavioral scientist.

What you're saying is probably part of it, although in my experience that criticism can be leveled as much, if not more, at wet-lab-type biologists who eschew all but the most minimal stats.

With the social sciences, though, there's another phenomenon at play, which is that the phenomena are often so abstract that there's not really a good theoretical reason to assume anything in particular. And if that's the case, because the normal is the entropy-maximizing distribution, you're actually better off assuming that rather than some other distribution. You could also use nonparametric stats, but that has its own advantages and disadvantages.

Bias-variance dilemma and all that.

The truth is, it's hard to beat the normal even when it's wrong. And if you subscribe to the inferential philosophy that every model is wrong, you're better off being conservatively wrong, which implies a normal.

I'm not saying everything should be assumed to be normal. But unless things are (1) obviously super non-normal, or (2) you have some very strongly justified model that produces a non-normal distribution, you're probably best off using a normal if you're going to go parametric. And I think those two conditions are met much more often than we like to admit.

The normal distribution is kind of over-maligned, I think. I started my stats career being enamoured of rigorously nonparametric stats, and still am (esp. exact tests, bootstrapping/permutation-based inference, and empirical likelihood), but have grown to strongly appreciate normal distributions (or whatever maxent distribution is appropriate).


From my perspective, either you have data and you have an idea what sort of distributions you're working with, or you don't, and you should fix that problem first rather than going down the theory rabbit hole.

With data in hand, a skew/kurtosis scatter plot is a good way to gauge the higher dimensional distribution of your data. Another option is to cluster the variables of the data set using something like HDBSCAN and color the plot points based on cluster membership.
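
A rough sketch of that skew/kurtosis diagnostic (pandas, scipy and matplotlib assumed; the function name is made up):

    import matplotlib.pyplot as plt
    import pandas as pd
    from scipy.stats import kurtosis, skew

    def skew_kurtosis_plot(df: pd.DataFrame) -> None:
        # One point per column: sample skewness vs excess kurtosis.
        # Columns near (0, 0) are at least roughly normal-shaped;
        # far-flung points flag asymmetry or heavy tails worth a closer look.
        xs = [skew(df[c].dropna()) for c in df.columns]
        ys = [kurtosis(df[c].dropna()) for c in df.columns]  # excess: normal = 0
        plt.scatter(xs, ys)
        for c, x, y in zip(df.columns, xs, ys):
            plt.annotate(c, (x, y))
        plt.xlabel("skewness")
        plt.ylabel("excess kurtosis")
        plt.show()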

If you have to go guessing distributions without evidence, you're better off choosing a low k student's T distribution (for robustness to outliers) or a gamma distribution (if you think your data might be skewed).


I'm ignorant of "heavy-tailed statistics" as a field, but the normal distribution in quantitative finance is justified by the Levy characterization of Brownian motion (basically every continuous stochastic process is a drift-diffusion driven by a Brownian) and the Levy representation of a discontinuous process as a sum of a diffusion and a pure jump process.

Mind, you might find different distributions when solving diffusion equations (the Cox-Ingersoll-Ross process involves a Bessel distribution if I'm not mistaken). But the diffusion-jump paradigm is much better justified there than distributional (or even finite moment) assumptions in discrete land.

Put it this way: if finance is abusing distributional assumptions, there's money being left on the table.


The main issue with this being that markets don't exhibit behaviour analogous to Brownian motion. It's a nice little assumption that makes the math work, but frequency analysis of real world markets shows that they behave like pink noise instead.

And unsurprisingly there are quant funds and prop trading firms that use this very fact to make lots of money. Academic economists' love for the normal distribution is derided by pretty much every fellow real-world trader I know.


You are throwing around a lot of technical words, and the specific facts you state aren't wrong, but that doesn't make a strong case for the choice of the Normal distribution. All you need is for the variance of bounded (in time) increments to be unbounded for the Gaussian behavior to fall apart (for example, non-rare jumps of sufficient magnitude).


I did say "diffusion-jump process".

Look, even if we disregard jumps as a possible term in equations, merely considering volatility to be stochastic and driven by a Brownian already gets you variances that may grow arbitrarily fast. This is without considering stranger nonlinearities on the Brownian term itself.

People are overly impressed by forceful arguments of the Taleb variety and log-log plots of empirical distributions and start parroting talking points about heavy/fat tails.

If I wasn't computerless and on my phone I would link to a paper that does the analog of the Anscombe quartet for log-log distribution plots "proving that data is Pareto/power law/etc." It's a good vaccine for fat tail hipsterism, look it up.


I don't think we are disagreeing (as long as you do not claim that stock prices are merely a Gaussian process).


I thought of a simple example on my commute. (It's a short walk.)

The Tanaka equation is an example (admittedly: not a diffusion, not a case of plug-and-chug with the Ito lemma) of a process driven by a Brownian with a discontinuous probability distribution. How the hell? From memory,

dTNK = dB if dB>0 else -dB

Now imagine a model with three equations.

X1 is a bog-standard geometric diffusion (we could have picked something that's chi-squared distributed driven by a Brownian from standard interest rate models) as in the 1970s Black-Scholes models. But instead of having an exogenous volatility, it has its Brownian term dB1 multiplied by a second equation, X2.

dX2 could be a standard mean-reversion equation, but again its dB2 is multiplied by X3.

dX3 is something like abs(dB3 - dB4* X3).

Voilà, an equation (X1) with a sudden break driven by the level of a mean-reverting equation (X2, which tells us volatility should come down in finite time even if it grows by a lot at times) that's set to blow up at an X2-dependent but stochastic level.

Don't get me wrong, Poisson-like jumps are very common (they're precisely the limiting process for sudden jumps) but people overstate (perhaps because they didn't really read the conditions for Ito isometry) how much a Brownian motion forces a system into normality or smoothness.

But hey, people get away with being hipsters about programming languages, why shouldn't they do that for stochastic calculus too, you know?


Hey. Why do people upvote a post that says "Taleb is a jerk" and makes a reputation argument in a forceful manner (reminiscent of Taleb's own style) -- and then downvote a post with technical detail?


Probably the fact that you flatly ignored the remark about stuffing your post with opaque technical terms--I know a lot of maths, but I can't make heads or tails of what you're saying. Who are you even talking to? I assume it means something to you, but it doesn't even look like you're trying to communicate your point in a clear manner. Just that you enjoy using words. When trying to read around those technical terms, what is left has a rather nasty and arrogant tone. Therefore, doesn't add much to the discussion = downvote.


Can't recommend the Mandelbrot book highly enough - he was a student of Paul Levy, and he wrote extensively about why Gaussian is not a good choice to model financial time series.


Disagree; economists have failed more visibly, but they've got nothing on social scientists.


There's also a joke that goes:

Power law walks into a bar. Bartender says, "I've seen a hundred power laws. Nobody orders anything." Power law says, "1000 beers, please".


> Power law walks into a bar. Bartender says, "I've seen a hundred power laws. Nobody orders anything." Power law says, "1000 beers, please".

Agh, I'm a mathematician and should get this, but I don't. Is the joke just that power laws give behaviour that is very small for quite a while, and then becomes suddenly large?


I'm not a mathematician, but I think the joke is that you can't infer a meaningful average from a power law.


> if you want a scale free process, it must be distributed as a power law.

Could you elaborate a little, or maybe give an example?


The links that others have provided are good, but the way I think about this is that the scale of a process basically sets the order of magnitude at which it occurs. We tend to think of processes that have a particular scale, which makes power laws somewhat unintuitive (even though there are also lots of scale-free processes in ordinary life, too). So, cars have a natural scale, since they have to fit people inside them. So the distribution of car sizes is pretty limited --- you aren't going to find cars that are microns long, and you won't find cars that are kilometers long.

But if something can happen at any given scale, the distribution must operate very differently. The distribution can't pick out any favorite size to work at --- some fraction of events must happen at every possible scale. So, if we're looking at the distribution of stellar luminosities, then there will be some stars which are as bright as the Sun, some which are 1% as bright as the Sun, and some which are 100 times as bright as the Sun. And since the star formation process doesn't have any preferred scale, the ratio of the number of Solar-luminosity stars to the number of stars 1% as luminous as the Sun must be about the same as the ratio of the number of stars 100 times as bright as the Sun to the number of Solar-luminosity stars. It's perfectly fine for the distribution to prefer dimmer or brighter stars because this doesn't end up introducing a preferred scale --- it just tilts the slope of the distribution (which is a line on a log-log graph).

I'm not a mathematician, so I'm not sure if these statements are true in any rigorous sense, but in practice scale-free processes tend to produce power laws in the universe.
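
The "line on a log-log graph" part is easy to check numerically (numpy assumed; the slope is arbitrary):

    import numpy as np

    # A power law N(L) ~ L^(-gamma): equal ratios in L give equal ratios in N,
    # so log N = const - gamma * log L is a straight line on a log-log plot.
    gamma = 2.0
    L = np.logspace(-2, 2, 5)      # 0.01, 0.1, 1, 10, 100 "solar luminosities"
    N = L ** (-gamma)

    print(N[0] / N[2])   # dim stars (1% solar) relative to solar-luminosity stars
    print(N[2] / N[4])   # solar-luminosity stars relative to 100x-solar stars
    # The two ratios are identical: no scale is singled out.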



This isn't a proof but if you want to understand it, these slides are alright: http://cs.brynmawr.edu/Courses/cs380/spring2013/section02/sl...

I recommend reading them through, but slide 39 is the tl;dr.


Richard McElreath has some nice points about variable independence assumptions. Concretely, for a model this assumption represents the state of "ignorance" and is the most conservative choice (i.e. doesn't assume any correlation). That said, he also mentions the "mind projection fallacy" of confusing epistemological claims with ontological claims, which may in turn be the reason why many assume normal distributions for everything.


http://bactra.org/weblog/491.html

"So You Think You Have a Power Law — Well Isn't That Special?"


reminds me of this joke from a lecture Lawrence Krauss gave in 2009 of his work in 'A Universe From Nothing':

> it was made after the discovery that on a log log plot everything is a straight line(o)

i'd recommend the entire lecture, but definitely at least check out the anecdote this joke bookends

it is about hubble's original 1929 embarrassing failed attempt to calculate the rate of the expansion of the universe(i)

(o) https://www.youtube.com/watch?v=7ImvlS8PLIo&t=18m25s

(i) https://www.youtube.com/watch?v=7ImvlS8PLIo&t=12m55s


The version I've heard is "everything is a straight line on a log-log plot with a sufficiently thick marker".


> I understand that there are good theoretical motivations for this --- namely, the normal distribution is the distribution that maximizes entropy for a given mean and variance.

That is indeed true, but why should such a property imply that the use of the Normal distribution is appropriate? Just a rhetorical question, of course, because your comment does indicate that Normal is not a good choice unless one has compelling reasons to use it.

Another argument that is used to justify its use is the central limit theorem. That says a sum of (nearly) independent variables with (nearly) identical distributions and finite variance converges to the Normal distribution. If the process under observation is indeed a superposition of such random processes, then yes, the choice of the Gaussian can be justified. But it is surprisingly common that one of the 3 requirements is violated. A common violation is that the variance is infinite, or it is so high that the process is better modeled as one with infinite variance. In such situations the family of stable distributions is the more appropriate choice.
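
A small numerical illustration of that breakdown (numpy assumed): averages of a finite-variance variable tighten up and look normal, while averages of Cauchy draws (a stable distribution with infinite variance) are just as wild as a single draw.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000   # how many draws go into each average

    # Finite variance: means of n uniforms concentrate near 0.5,
    # approximately normal with std ~ 1/sqrt(12 n).
    uniform_means = rng.uniform(size=(1_000, n)).mean(axis=1)

    # Infinite variance: the mean of n standard Cauchy draws is itself
    # standard Cauchy -- averaging buys you nothing, and the CLT never kicks in.
    cauchy_means = rng.standard_cauchy(size=(1_000, n)).mean(axis=1)

    print(uniform_means.std())            # tiny, around 0.009
    print(np.abs(cauchy_means).max())     # routinely enormous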

Gauss' own use of the Gaussian was motivated by convenience rather than any deep theory. Even at that time it was well known that other distributions, for example the Laplace distribution, work better.


As you say, mathematical convenience is another argument.

The negative log of the (unnormalized) Gaussian density is basically x^2 and this fact comes very handy.

It's used extensively in Bayesian regression: modeling noise as additive Gaussian leads to the squared loss function, and if you also model the prior as Gaussian, it leads to an L2 regularizer.

It also simplifies many other calculations. Fitting a mixture of Gaussians by expectation-maximization or the Kalman filter come to mind.
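
In symbols (a standard derivation, not tied to any particular library): with Gaussian noise on the observations and a Gaussian prior on the weights, the negative log posterior is exactly squared loss plus an L2 penalty,

    -log p(w | y, X) = (1 / (2 sigma^2)) * sum_i (y_i - w^T x_i)^2
                       + (1 / (2 tau^2)) * ||w||^2 + const,

assuming y_i ~ N(w^T x_i, sigma^2) and w ~ N(0, tau^2 I).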


Absolutely!

For squared loss, almost every theorem you want to be true actually is true. In a way it's dangerous because it sets up bad expectations (* no nerd pun intended).

A result that holds for squared loss but not for others is the SVD as a low rank approximation of a matrix.

*Expectation minimizes the squared loss


I think that "good theoretical motivations" is the key point.

I am a (also former) physicist and it drives me crazy when people fit whatever happens to be in the chart to a straight line, "to get the trend". Whenever I ask them for the theory which predicts that x and y will be linked by y = ax + b, they do not have any.

This also goes on with extrapolations or interpolations, usually without the slightest theoretical reason to do so.

The problem is that when working with marketing, HR or even finance, they are so used to "getting the trend" with a linear fit that it's hopeless.

I once drew a parabola and asked for the trend (around the minimum). They were surprised by this stupid question as "it is obvious that there is no trend". Taking some random points linking share value with the number of women in a company "obviously fits a trend line".


My boss and I talk about this phenomenon a lot. Normal distributions have lots of nice properties, and lots of tools that can do lots of fancy things, and infer lots of nice "facts" from them.

So there is a huge bias among researchers to assume them to make their treatment of the data easier.


Physicists in particular have a reputation for overusing the assumption of a power law! http://bactra.org/weblog/491.html


> There's actually a joke in the field that when you get a new dataset, the first thing you do is fit it to a power law. If that doesn't work, you fit it to a broken power law.

That's funny -- my field has the same thing with a different distribution. Analyzing stochastic processes is much easier if you assume an exponential distribution. It's one of my criteria for whether somebody's giving a bad job talk. If an unfounded assumption that waiting times are exponentially distributed shows up in the first three slides, the rest of the presentation is probably B.S. Even if the presenter used Beamer and filled the slides with beautiful equations. (The exponential distribution of waiting times is basically an assumption that events are independent.)

Edit: Especially if the presenter's slides are full of beautiful equations.
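
A quick check of that independence/memorylessness link (numpy assumed): exponential waiting times mean that having already waited s tells you nothing about the remaining wait, which is the independence assumption in disguise.

    import numpy as np

    rng = np.random.default_rng(2)

    waits = rng.exponential(scale=1.0, size=1_000_000)
    s, t = 0.7, 1.3

    p_uncond = (waits > t).mean()                   # P(T > t)
    p_cond = (waits[waits > s] > s + t).mean()      # P(T > s + t | T > s)
    print(p_uncond, p_cond)   # statistically indistinguishable (both ~0.27)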


..is it not? Waiting times would be a poisson distribution. In my field where most folks are much less math-savvy, it's WAY more common for someone to be "surprised" or find it "insightful" that the 80/20 rule exists on a dataset (that I had concluded would already be power-law distributed before I looked at it); or who struggle with defining how another data set like email opens over time is distributed (because it's Poisson and hey we're never really introduced to that idea)


> Waiting times would be a poisson distribution

This is only true if events are independent. Suppose we're modeling rider arrivals at a bus stop. The bus comes once an hour. Does the exponential distribution adequately model how long you need to wait for another person to show up? If a Poisson process is an appropriate model, the expected value of the number of people arriving in the 20 minutes after the bus departs is the same as the expected value of the number of people arriving in the 20 minutes before. Clearly not. Rider arrival rates depend on how long it is until the bus is supposed to depart. Indeed, most arrival processes are not ideal Poisson processes, but it may be a better approximation in some cases than in others.

We use Poisson processes for two reasons:

1. They're often a good enough approximation of reality to make useful predictions.

2. They're mathematically tractable.

Unfortunately, our reasons for using Poisson processes are often more (2) than (1).


I don't really understand your example. The number of people arriving at the bus stop would match a poisson distribution where the x-axis is time-to-next-bus.

None of this before/after bus mess.


> I don't really understand your example.

Evidently. I'm afraid we have no common ground on which to discuss stochastic processes. Have a pleasant weekend.


Ok dude you too. Nice chatting


That's almost not a joke - I did particle astrophysics, not astronomy, but for the exact same reason you state, our physics results (we used particle physics as a window to astronomy, or at least we tried to) would end up with broken power law fits.

There's something deeply satisfying using a model so simple to explain a piece of nature. :)

Example: cosmic ray energy spectrum. http://iopscience.iop.org/1367-2630/12/7/075009/downloadFigu...


When you want to fit a distribution to a star's image on the plate, normal does indeed seem a weird choice. But the point about multidimensional gaussians is that they are the only distribution that is maximally "ignorant" (maximum entropy) -- so if there's a small (unresolvable) companion star we don't know about, we can do no better than a gaussian if we want to ignore it.


Well, that's not quite true. The normal distribution maximizes entropy for a finite mean and variance. But it's not guaranteed that certain physical processes have well defined means and variances. A power law is the maximum entropy distribution with a well defined mean, but undefined variance, on the range (0, infinity). [1]

Of course in reality, at some point, some other physical process starts to dominate and cuts off the power law. This then turns it into a power law with a different slope or an exponential cutoff.

Edit: I forgot to add that a star's light profile on an image is actually approximately Gaussian. But it isn't exactly Gaussian --- I seem to remember that the core more closely resembles a Lorentzian. I am obviously not an observer, otherwise I would have thought of that immediately!

[1]: https://en.wikipedia.org/wiki/Maximum_entropy_probability_di...


> A power law is the maximum entropy distribution with a well defined mean, but undefined variance, on the range (0, infinity).

The maximum entropy distribution with a given mean is the exponential. I think to define power law distributions you will need additional constraints more concrete than "undefined variance".


That's if you require the range to be [0, infinity). If you don't require 0 to be a valid point in the distribution then a power law is the maximum entropy distribution. A power law just diverges if you try to integrate to zero, so you have to find a lower entropy distribution.


I think that if the power law cannot be integrated on [0,infinity), it cannot be integrated on (0,infinity) either.


> But in astronomy, nothing is normally distributed. (At least, nothing comes to mind.) Instead, everything is a power law.

Power law fits to empirical data are heavily misused.

http://cs.unm.edu/~aaron/blog/archives/2007/06/power_laws_an...


that's really weird. I'm not an astronomer so I thought "hmmm, what's the first thing I can think of in astronomy."

Well, how about the magnitude of stars. So, is the magnitude of stars normally distributed? I googled "distribution magnitude of stars" and got images (by clicking on images tab) that seem pretty normally distributed to me. Aren't they?


I think you're being downvoted a bit unfairly. What most non-statisticians don't realize is that some of the most "unnormal" distributions actually look pretty normally distributed, and some of the most ugly looking distributions actually behave like a normal distribution. When you look at a distribution's pdf, you're only looking at the "center" of the distribution. But what statisticians (usually) care about are the tails of a distribution.

Consider the following example: The [Cauchy distribution](https://en.wikipedia.org/wiki/Cauchy_distribution) is a pathological example of a distribution with very heavy tails. Heavier than the power law distributions mentioned in the GP. So heavy, the distribution doesn't even have a mean value. But if you just look at the pdf, it looks almost exactly like a gaussian.
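
A small sketch of that contrast (scipy assumed): match the two distributions on their central spread and the densities look alike, but the tail probabilities are wildly different.

    from scipy.stats import cauchy, norm

    # Scale the Cauchy so its interquartile range matches the standard normal's.
    c = cauchy(scale=0.6745)

    for x in (1, 3, 6):
        print(x, norm.sf(x), c.sf(x))
    # At x = 6 the normal tail is ~1e-9 while the Cauchy still has ~3.6%
    # of its mass out there -- similar-looking pdfs, utterly different tails.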


It is pretty easy to tell a Cauchy or a low k Student's T apart from a Gaussian. Just look at your outliers. One thing a lot of people miss is that a normal distribution with outliers _IS_ a long tail distribution.

Too bad inference with long tailed distributions is so hard.


thanks! Very interesting. Is the standard deviation still useful for some of the other distributions you mentioned, that look vaguely Gaussian?


You're referring to the square root of the second moment about the mean. The short answer is yes, it's always useful, provided you have an estimator.

Check out Chebyshev's inequality.
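
For reference, the inequality in question: for any distribution with finite mean \mu and standard deviation \sigma,

    P(|X - \mu| >= k\sigma) <= 1/k^{2} for every k > 0,

so the standard deviation bounds tail probabilities no matter what the distribution looks like.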

Just don't start talking about it to your pointy-haired boss. You'll get fired.


thanks. but what does your last line, about the boss mean?


That this sort of knowledge is often uninteresting to non-technical types.

I'm pretty jaded at this point.


For the specific example of the Cauchy distribution, the standard deviation isn't defined.


The standard deviation is defined as

sqrt(E[(X-EX)^2])

If that expectation is infinite (or, as for the Cauchy, EX itself doesn't exist), the standard deviation isn't finite either.


You mean like this one?

http://www.astro.yale.edu/astrom/spmcat/histov.html

Not really. Plus keep in mind magnitude is a logarithmic measure. Things get skewed when you take logarithms. Also if you look for example at star sizes:

http://spiff.rit.edu/classes/phys440/lectures/size/diam_hist...

(from http://spiff.rit.edu/classes/phys440/lectures/size/size.html).

The distribution is certainly not gaussian, and very likely heavy-tailed.


Not to mention that the distributions from three different sky surveys (in your first link) look dramatically different. Which means the histograms are saying at least as much about the measurement process as they are about what's actually out there in the universe.


To add to that, the sharp cutoff at faint (large number) magnitudes is almost certainly a reflection of the fact that astronomical surveys and images have a flux limit, below which objects cannot be individually detected.


First sentence on Wikipedia: "In astronomy, magnitude is a logarithmic measure of the brightness of an object..."


> First sentence on Wikipedia: "In astronomy, magnitude is a logarithmic measure of the brightness of an object..."

That doesn't tell you whether the magnitudes are normally distributed though. That just tells you how the magnitude is computed from an object's observed/measured flux.


I'm explaining why magnitude looking normally distributed doesn't contradict the parent.


I think it's about size and independence. Gaussian distributions connote the small scale you normally deal with in your field (the background noise in the electrical output of a sensor), while power laws connote the large scale (the number of electrical sensors per company).

The size of a ruler that a machine cuts (quasi normal/Gaussian)

Wealth (Power/Pareto)


It's not exactly scale. Things tend to fit a Gaussian when individuals tend towards the mean, and power laws when there are "the rich get richer" processes pushing individuals towards extrema.


Power law indicates an exponential distribution, which is often taught right after the normal distribution in most probability classes. The normal distribution is usually treated as white noise, hence a fixed value + white noise is often Normal, which is quite common as well.


> Power law indicates an exponential distribution

This sounds wrong. One is invariant under scaling, the other under translation.


> This sounds wrong.

It is.

However there is a way to make it correct. Your second sentence is spot on the intuition. If you log-transform a power-law distributed random variable, the RV that you get is exponentially distributed.


> As a (former) astronomer, I've never understood why people assume normal distributions for everything.

It shouldn't be a mystery, because there is a big reason to lean towards the normal distribution, and that is the central limit theorem. If you sample a variety of different distributions, the sum of all these samples will tend towards a normal distribution. That is why the normal distribution pops up in so many places.


I understand that. But in astronomy, there are so many vastly different scales to work with, that in practice, at any particular scale, there is usually only one physical process that dominates. It is extremely unlikely for two independent physical processes to operate at nearly the same scale! So it's hard to get processes that are the sum of enough random variables to make a distribution approximately Gaussian. The situation is probably different in a field like biology because there are many factors that, say, affect a person's height. The closest that I've ever encountered in astronomy was the distribution of the powers of a class of radio galaxies I studied that exhibited an approximately log-normal distribution. (See figure 7 of [1].)

[1]: http://adsabs.harvard.edu/abs/2012ApJ...756..116A


That sounds nice, but justifying why you expect your distribution to be the sum of other distributions, and not another operation like multiplication, takes a lot more work; in some fields (e.g. finance), it's completely wrong.


If you multiply random numbers together you get a log normal distribution. Which means the logarithm of the number is normally distributed (multiplication is just adding logarithms after all.) However multiplicative noise is a lot less common than additive noise.
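
A tiny demonstration (numpy assumed): the log of a product is a sum of logs, so the CLT applies on the log scale.

    import numpy as np

    rng = np.random.default_rng(3)

    # 50 positive multiplicative factors per trial, 100k trials.
    factors = rng.uniform(0.5, 1.5, size=(100_000, 50))
    products = factors.prod(axis=1)

    log_products = np.log(products)
    print(log_products.mean(), log_products.std())
    # A histogram of log_products is close to a bell curve, i.e. the
    # product itself is approximately log-normal.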


The central limit theorem only yields normality when the source distributions have defined variance; otherwise they go to a Lévy alpha-stable distribution.


They want everything to hit a Bell Curve because that is the only thing they remember from statistics class.

I've had debates that not everything comes out as a Bell Curve only to be called an idiot for claiming that.


Perhaps a better question is "Why is anything normally distributed?" It appears originally to have been a simplification to make the math more convenient:

As Rand Wilcox reports, "Why did Gauss assume that a plot of many observations would be symmetric around some point? Again, the answer does not stem from any empirical argument, but rather a convenient assumption that was in vogue at the time. This assumption can be traced back to the first half of the 18th century and is due to Thomas Simpson. Circa 1755, Thomas Bayes argued that there is no particular reason for assuming symmetry, Simpson recognized and acknowledged the merit of Bayes's argument, but it was unclear how to make any mathematical progress if asymmetry is allowed." (Wilcox, p. 4)

Wilcox, R. (2010). Fundamentals of modern statistical methods: Substantially improving power and accuracy (2nd ed.). New York, New York: Springer.[1]

[1] http://amzn.to/2tkMRoI


As others have pointed out, power laws are more "normal" than the normal distribution.

The reason for this is that if you have a sum of independent identically distributed (I.I.D.) random variables (R.V.s) and it converges to a distribution, that distribution is Levy Stable [1], which is power law in its tails. The Gaussian is a special case in the family of Levy Stable distributions.

The article states that "the sum of many independent, additive effects is approximately normally distributed" which is patently false. The sum of many independent random variables with finite variance is normally distributed. Once you relax the finite variance (and in more extreme cases, finite mean) power laws result.

There are other ways to generate power laws, including having killed exponential processes [2]. There are many other references that talk about the rediscovery of power laws [3] and give many ways to "naturally" create power laws [3] [4] [5].

The article claims that multiplicative processes lead to log normal distributions. I've heard that this is actually false but unfortunately I don't have enough familiarity to see how this is not true. If anyone has more insight into this I would appreciate a link to an article or other explanation.

[1] https://en.wikipedia.org/wiki/Stable_distribution

[2] http://www.angelfire.com/nv/telka/transfer/powerlaw_expl.pdf

[3] https://arxiv.org/pdf/physics/0601192v3.pdf

[4] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122...

[5] http://www.angelfire.com/nv/telka/transfer/powerlaw_expl.pdf


An example from a book I used to own, called, I think, _Treatment of Experimental Data_: imagine a factory manufacturing ball bearings. They want them all to have the same radius, but because of random errors, the radii will be normally distributed about some mean. If this is true, other random variables, such as the mass, can not be normally distributed.


Technically true, but the smaller the standard deviation of the radius (relative to the mean), the more closely the distribution of masses will approximate a normal distribution.


This. Technically the radii can't even have a normal distribution, because the support of the normal is (-Inf, Inf) but the radii are bounded below by 0.

So you might say that the radii = r + e, where r is constant and e is normal around 0 with variance way, way smaller than r. But then the volume of the ball bearings = K * (r+e)^3, and because r >> e, the dominant random term is going to be 3Kr^2e, which makes the whole thing look pretty normal.


Sorry, could you explain why mass can't be normally distributed? If, say, the mass of the bearing is related to its radius, then shouldn't it follow similar distribution?


Mass is proportional to radius cubed. Cubing a normal distribution makes it not normal.


Though it should be noted that "cubing" doesn't refer to just taking the cube, i.e. if f(x) = e^{-x^2/2} is proportional to the standard normal distribution then the cube of f(x), namely f * f * f, is not the distribution you're considering here. Rather it's the "pushforward measure" [1], which in this case is (1/3) y^{-2/3} e^{-y^{2/3}/2} where y=x^3 is the mass and x is the radius. (Up to constants.)

[1] https://en.wikipedia.org/wiki/Pushforward_measure


Cubing does refer to just taking the cube, but the cube of random variable, and not its probability density function -- the density after cubing is indeed the pushforward.


This is indeed one confusion that I ran into very often when first learning stats.

Transforming the random variable is a very different thing than transforming the probability density function directly.

The distinction between a random variable and a distribution (or density) is not made clear enough in classes I think.


    R = r + e

    R^{3} = (r+e)^{3} = r^{3} + 3r^{2}e + O(e^{2})
Unless you expect the error to be significant with respect to r, you can ignore the higher order terms. And voila, you get an approximately normal distribution.
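
A numeric version of the same argument (numpy and scipy assumed; the numbers are arbitrary):

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(4)

    r, sigma = 5.0, 0.01     # nominal radius and a small manufacturing error
    radii = r + rng.normal(0.0, sigma, size=100_000)
    masses = radii ** 3      # up to the constant factor (4/3) * pi * density

    # With sigma << r the cubic map is locally linear, so the masses are
    # nearly normal: tiny skew, std close to the linearized value 3 r^2 sigma.
    print(skew(masses), masses.std(), 3 * r**2 * sigma)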


Yes, that's why.


mass = 4/3 pi radius cubed

The cube of a normally distributed variable is no longer normally distributed.

Had the relationship between the mass and the radius been linear (M = a * R + b) then yes the mass would have followed a normal distribution as well (with different parameters of course).


That factor should be 4/3 rather than 3/4. When you differentiate it, you need to get 4πr² for the surface area.


You are correct, my memory failed me and I didn't check. Thank you.


That distribution still has the same median, with half the samples on one side and the other half on the other side.

That looks a lot like a normal distribution even if it's not one.


Suppose the radius is normally distributed, and is a random variable X. The mass will be proportional to the volume, which increases with the cube of the radius, so the mass will be a random variable kX^3 (k = 4/3 x pi x density here). X and kX^3 won't have the same distribution.


Depends on whether the radius errors are generated from a separate process than the mass errors. (1) Get a glob of molten metal, mass determined by how much metal is there, some error there; (2) shape it into a ball, and suppose the radius is mainly determined by air that gets trapped, then some nearly independent error there.


That's true. But the idea that distributions related non-linearly cannot both be normal holds.


That's kind of an odd metric to begin with. Radius would probably be derived from the imperfections in the consistency of matter. Whether that distribution is gaussian probably depends on the fabrication process and the materials...


This kinda blows my mind.

What happens if they do quality control on both mass and radius?

I imagine the reason why, in your example it is the radius rather than the mass that is normal, is due to what QA is focused on.


The reason that QA is done on radius is because the purpose of a ball bearing is to sit inside its housing to within a fairly tight tolerance. If the balls are out of spec, then potentially the thing the bearing is coupling will be out of alignment/won't rotate freely. The mass is less important.

If you did QA on mass, you could easily calculate what mass would be required for a particular radius, but that would let misshapen bearings through.

You could do QA on both mass and radius, but unless you have potential contaminants, the radius gives you the mass for free so there isn't much point.


But it's still log-normal


Hmm, why is that?


with constant density, p, and spherical particles:

$ m = pi * r^3 * p $


Have some Unicode symbols!

m = πr³ρ

(with no disrespect to your use of TeX)

But also it should be m = (4/3)πr³ρ because the volume of the sphere is (4/3)πr³, not πr³.


Alternative: have a mathjax bookmarklet :) https://www.checkmyworking.com/misc/mathjax-bookmarklet/


Wow, nice!


Not enough Unicode. Let's make it:

m = (¾)⁻¹πr³ρ


From the comments:

Personally, I don’t find it surprising that not everything is normally distributed. Why should any real phenomenon follow a theoretical limiting distribution anyway, never mind a symmetric, infinite-tailed distribution that is exact only in an unachievable limit? The surprise is that so many things _are_ sufficiently near normality for it to be useful!


Interestingly enough - you don't need random variables to be independent or identically distributed for a CLT to apply.

See

https://en.wikipedia.org/wiki/Central_limit_theorem#CLT_unde...

and

https://en.wikipedia.org/wiki/Central_limit_theorem#Lyapunov...


But the variance needs to be bounded. With power laws it is often unbounded. The trick that many get trapped in is that short samples of power laws typically look regular. Estimates of variance will always be finite on a finite sample. Nassim Taleb goes into more detail on this.


Then, Taleb is a jerk when someone tries to pin him on one of his wild exaggerations.

Even as a quant Taleb was prone to smushing over details to push a narrative. In the 90s already Derman was enabling Taleb into claiming the Black-Scholes formula was an interpolation algorithm already known to option traders and served basically to justify using the risk-free rate as a drift (price trend) parameter in accordance to the "economics establishment". But (as noted by more than one response in the literature) making a different assumption on the drift (or even the distribution) -- i.e. leaving the "Black-Scholes world" and merely interpolating two world-states -- you're left with calibrating a stochastic discount rate that gives put-call parity. But hey, not that technicalities should get in the way of a good story!


Because for a random variable X, we sometimes need to care about X^2, and when X is normally distributed X^2 is chi-square distributed and not normally distributed.

Because under mild assumptions, for arrivals, say, of visitors to a Web site, gas station, hospital, the arrivals form a Poisson process, the times between arrivals are independent and identically exponentially distributed, and that's not normally distributed.

Now at nearly any server farm, it is easy to get wide, deep, rapidly flowing oceans of data on the performance of the server farm, and, as U. Grenander once explained to me in his office at Brown, the data is wildly different from what statistics was used to, e.g., in medical data. In that ocean of data, finding anything normally distributed will be very rare.

The claims in the OP about many effects are nothing like good evidence for the central limit theorem or for normal distributions. E.g., from the renewal theorem, many examples of Poisson processes result from many independent effects.

E.g., the usual computer based random number generators return what look like independent, identically distributed random variables uniform on [0,1], and that is not normally distributed.

The question in the OP about why not normally distributed is, in one word, just absurd.


What's more surprising is how often the central limit theorem fails to hold or be useful, yet it is still used all the time to justify poor analysis, usually A/B testing. When the underlying distribution has high variance, as with many metrics I've come across with extreme long-tailed behavior, the aggregates need large N before they adhere to a Gaussian.


We see Pareto distribution in nature and society a lot.

https://en.wikipedia.org/wiki/Pareto_distribution#Applicatio...


Someone poke Taleb.


> Height is influenced by environmental effects as well as genetic effects, such as nutrition, and these environmental effects may be more additive or independent than genetic effects.

This makes me wonder if in countries where these environmental effects are mostly optimised (everyone having access to good nutrition), or at least nearly identical for everyone, the normal distribution of height breaks down.

Is height normally distributed in the tallest countries in the world? What about the shortest?

edit: I'll just copy this question to the comments under his blog, maybe the author has some idea about that.

edit2: just noticed the blog post is from 2015... oh well, it was worth a shot.


I came across this CMU presentation on why the purported ubiquity of power laws must be taken with a grain of salt:

https://goo.gl/23PP7v

Check from slide 32.

As someone who doesn't have significant experience in statistics, I'd be grateful for an expert's opinion on the arguments presented in this presentation.


I'd take the CLT as everything complicated being normally distributed by default, but with fairly common exceptions:

1 - If the problem isn't actually that complicated, the CLT doesn't do much.

2 - If the problem is dominated by one component, it will still mostly look like that component.

3 - Most ways of slicing a normal distribution lead to other distributions. For example the Rician distribution.


Because not everything has finite variance.


Because many phenomena aren't in fact representations of arbitrary random variables.

Take word distribution in any human language for instance. Word frequencies follow a Zipf distribution because it decreases entropy and hence is more efficient.


It's not arbitrary random variables; it's sums of independent additive variables that frequently lead you to Gaussian distributions.


The surprising thing about the CLT is that it applies whenever the means and variances of the summed random variables exist. This is really a very mild condition, but the surprise to me is that the result is independent of the actual distributions of the variables being summed!


The convergence rates are different though, so at finite sample sizes and as a practical matter, it does matter.


Now, having read some comments, do not get lost in your assumptions (which implies you should know your assumptions). It's really that simple.


There are a lot of quantities which are positive definite. If it's bounded from below, it's not Gaussian.


Strictly speaking nothing physical is Gaussian, because everything is bounded. There are a finite number of particles in the observable universe.


Nassim Nicholas Taleb doesn't have many nice things to say about the bell curve in his Black Swan book.


From the question without even reading a thing, why would any set of events follow a gaussian?


Can anyone quickly explain to me why some things _are_ normally distributed?


Is the distribution of distributions itself normal?


All models are wrong, some are useful.



