
Zipf’s Law Arises Naturally When There Are Underlying, Unobserved Variables - jmount
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005110
======
argonaut
Time for a funny quote: "Power laws... have been termed 'the signature of
human activity'... They are certainly the product of one particular kind of
human activity: looking for power laws."
[http://people.seas.harvard.edu/~babis/T797003/Readings/fabri...](http://people.seas.harvard.edu/~babis/T797003/Readings/fabrikant02powerlaw.pdf)

------
wodenokoto
I've recently come across a word-frequency data set of Chinese subtitles [0], and
the word distribution _doesn't_ seem to follow Zipf's Law / a power-law
distribution, which runs counter to all my textbooks on statistical
linguistics.

This is not discussed in the associated paper either, though the paper
concludes that the distribution is a good representation of common Chinese
words (i.e., it isn't skewed). Any clues to why that is?

[0] [http://crr.ugent.be/programs-data/subtitle-frequencies/subtl...](http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-ch)

~~~
rahimnathwani
"though the paper concludes that the distribution is a good representation of
common Chinese words"

I don't know why the word distribution doesn't follow a power law, but I can
tell you that it isn't a good representation of common Chinese words. I
downloaded the data set a while back, intending to use it for improving my
vocabulary. I found some words (like the Chinese word for 'vampire') had too
high a frequency.

The lists are based on American TV shows' Chinese fansubs. I guess vampires
and zombies are common themes.

~~~
wodenokoto
I completely agree that the lists look odd in many regards, but they tested
them on humans and found correlations between frequency in the list and how
quickly native speakers can recognise a word.

I don't consider myself an expert in word frequencies, but the article makes a
convincing argument that these frequencies are representative of natural
Chinese and at least as good a measure as their baseline corpus.

------
mhneu
Similar idea, applied to different data.
[http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.113...](http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.113.068102)
(The preprint is also available on arXiv.)

The joint probability distribution of states of many degrees of freedom in
biological systems, such as firing patterns in neural networks or antibody
sequence compositions, often follows Zipf’s law, where a power law is observed
on a rank-frequency plot. This behavior has been shown to imply that these
systems reside near a unique critical point where the extensive parts of the
entropy and energy are exactly equal. Here, we show analytically, and via
numerical simulations, that Zipf-like probability distributions arise
naturally if there is a fluctuating unobserved variable (or variables) that
affects the system, such as a common input stimulus that causes individual
neurons to fire at time-varying rates. In statistics and machine learning,
these are called latent-variable or mixture models. We show that Zipf’s law
arises generically for large systems, without fine-tuning parameters to a
point. Our work gives insight into the ubiquity of Zipf’s law in a wide range
of systems.
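
To make the mechanism concrete, here's a minimal Python sketch (not from the
paper; the lognormal latent rate, the Poisson observation model, and all the
sizes are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy latent-variable (mixture) model: each observation is a Poisson
    # count whose rate is an unobserved, broadly distributed latent variable.
    latent = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)  # unobserved
    counts = rng.poisson(lam=latent)                           # observed

    # Rank-frequency data: how often each distinct value occurs, by rank
    _, freqs = np.unique(counts, return_counts=True)
    freqs = np.sort(freqs)[::-1]
    ranks = np.arange(1, len(freqs) + 1)

    # A roughly straight line on log-log axes is the Zipf-like signature
    slope = np.polyfit(np.log(ranks), np.log(freqs), 1)[0]
    print(f"log-log rank-frequency slope ~ {slope:.2f}")

Shrink sigma toward zero and the mixture collapses to a plain Poisson, and the
straight-line behavior disappears, consistent with the fluctuating hidden
variable doing the work.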

------
generj
Warning: I'm not a statistician, just an economist with some training in it.

My quick understanding of this paper is that lurking variables can cause
Zipf's Law. If enough variables are controlled for, does Zipf's Law go away in
some domains? For example, they included parts of speech for natural language,
and for some parts of speech Zipf's Law didn't hold while for others it did.

This makes intuitive sense to me.

They also find a way to estimate how much of a Zipf's Law effect is caused by
a particular variable. Seems like a nice test to run in Stata or R.

~~~
kem
Your questions were similar to mine. I perused the article and hope to read it
very carefully later today, but I wondered how often you _wouldn't_ expect
unobserved variables to be playing some role, especially in observational
data. It's an interesting paper, but when I think of rigorous evaluation of
its ideas I get a little confused. I'm probably missing something though.

~~~
alok-g
I would appreciate a short and intuitive explanation of it. :-) Thanks.

------
chrismealy
Do all probability distributions have known processes that generate them? Like
how you can make a normal distribution with a Galton box? Is there something
like that with power law distributions? It seems like the deal with power laws
is that the events aren't independent (an event becomes more likely the more
it occurs, or the rich get richer).
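For the "rich get richer" intuition, one classic generator is a
preferential-attachment process. A minimal sketch of what I mean (a simple
Chinese-restaurant-style process; alpha and the event count are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)

    # "Rich get richer": each event either starts a new category (with
    # probability alpha / (alpha + n)) or joins an existing category
    # with probability proportional to that category's current size.
    alpha, n_events = 1.0, 5_000
    counts = []
    for n in range(n_events):
        if not counts or rng.random() < alpha / (alpha + n):
            counts.append(1)  # new category
        else:
            j = rng.choice(len(counts), p=np.array(counts) / n)
            counts[j] += 1    # bigger categories grow faster

    freqs = np.sort(counts)[::-1]
    print(freqs[:10])  # a few huge categories, then a long tail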

~~~
klodolph
A Galton box generates a binomial distribution.

Most probability distributions are designed to model some kind of process. The
time between clicks of a Geiger counter is exponential; the number of clicks
in a second is Poisson. The amount of time your Geiger counter lasts before it
breaks and you have to buy a new one is Weibull. That's the whole reason we
invented probability distributions in the first place: to model processes.
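
As a quick sanity check of the Geiger-counter example (the rate of 3 clicks
per second is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(2)
    rate = 3.0  # clicks per second, arbitrary

    # Exponential gaps between clicks imply Poisson counts per second
    gaps = rng.exponential(scale=1 / rate, size=300_000)
    times = np.cumsum(gaps)
    per_second = np.histogram(times, bins=np.arange(0, int(times[-1])))[0]

    print(gaps.mean())        # ~ 1/rate: the mean waiting time
    print(per_second.mean())  # ~ rate: the Poisson mean
    print(per_second.var())   # ~ rate too: mean == variance for Poisson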

~~~
dekhn
The Galton box generates a binomial distribution _which, when taken to the
limit of n -> infinity, produces a normal distribution_.

~~~
klodolph
This is incorrect; you need to take the limit as n -> infinity after
subtracting the mean and dividing by a factor of √n, otherwise the limit does
not exist. But Galton boxes in practice are not very large, so they aren't
very good approximations of normal distributions.
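
A quick numerical check of that scaling (n, p, and the sample size are
arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 10_000, 0.5

    # Standardize: subtract the mean n*p, divide by the standard deviation
    # sqrt(n*p*(1-p)). Without this centering and scaling the distribution
    # drifts and spreads as n grows, and there is no limit.
    x = rng.binomial(n, p, size=200_000)
    z = (x - n * p) / np.sqrt(n * p * (1 - p))

    print(z.mean(), z.std())          # ~ 0 and ~ 1
    print(np.mean(np.abs(z) < 1.96))  # ~ 0.95, matching a standard normal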

------
laurencea
Author here, happy to answer questions!

~~~
canjobear
There are many ways to derive Zipf's Law. Does yours make any novel
predictions about systems that exhibit this behavior?

~~~
CephalopodMD
If you read the introduction, it does. In particular, it explains how, when
you normalize for parts of speech (all nouns, all verbs, etc.), Zipf's law
does not emerge. This model predicts that phenomenon, with part of speech
acting as a latent variable, which the authors do not believe other models can
do.

------
dmux
Vsauce has a YouTube video that discusses Zipf's Law:
[https://www.youtube.com/watch?v=fCn8zs912OE](https://www.youtube.com/watch?v=fCn8zs912OE)

------
johnlbevan2
"when observations are ranked from most to least frequent, the frequency of an
observation is inversely proportional to its rank"

Is this just a complex way of saying "when you put items in ascending order,
that order's the inverse of the same items listed in descending order", or
have I missed something?

~~~
secretasiandan
From Wikipedia:
[https://en.wikipedia.org/wiki/Zipf's_law](https://en.wikipedia.org/wiki/Zipf's_law)

Thus the most frequent word will occur approximately twice as often as the
second most frequent word, three times as often as the third most frequent
word, etc.: the rank-frequency distribution is an inverse relation.
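
As a worked example (the 60,000 figure is made up): if the top-ranked word
occurs 60,000 times, Zipf's law predicts roughly 30,000 occurrences for rank
2, 20,000 for rank 3, and 6,000 for rank 10, i.e. frequency(rank) ~
frequency(1) / rank.

    # Hypothetical counts: if the top word appears 60,000 times,
    # Zipf's law predicts frequency(rank) ~ 60,000 / rank
    top = 60_000
    for rank in (1, 2, 3, 10, 100):
        print(rank, top // rank)  # 60000, 30000, 20000, 6000, 600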

~~~
johnlbevan2
Ah; I'd interpreted "inversely proportional" to mean "y = -mx" rather than "y
= m / x", forgetting my GCSE maths terms :S. Thanks for the pointer.

------
imagist
What does "rank" mean in this context?

~~~
webmaven
The ordinal position in the list of observed values, sorted by frequency.

~~~
imagist
So basically if you order by frequency, order correlates with frequency? Isn't
that just two ways of saying the same thing? Isn't it obvious? I don't get why
we would need a law for that so I feel like I'm missing some detail.

~~~
jjaredsimpson
Not just correlates.

rank * frequency = constant

~~~
imagist
Ah, that is the missing piece!

