
Benford's Law - kjhughes
https://en.wikipedia.org/wiki/Benford%27s_law
======
BenoitEssiambre
Benford's law, Zipf's law and other power laws tend to fascinate but they are
a manifestation of something really simple. It's how things behave when they
are evenly distributed over multiplicative operations instead of additive
operations. By this I mean things are as likely to be x% bigger as they are
x% smaller, rather than the more common definition of evenly distributed
(things x units larger are as likely as things x units smaller).

Being evenly distributed over multiplication is natural for many measures. I
think it is considered the most even, "maximum entropy" distribution for
things that can't be negative.

The most even, maximum entropy distribution for an unbounded (-infinity to
+infinity) prior is an unbounded uniform distribution but you can't apply it,
for example, in the case of the size of objects since it would imply that
negative sizes are as likely as positive sizes (and objects don't have
negative sizes).

In general position parameters tend to be uniformly distributed over
addition/subtraction whereas scale parameters tend to be evenly distributed
over multiplication (uniform over log(x)). It's the reason some graphs use log
axes. For Bayesians, it's a fairly straightforward concept. See also improper
priors
([https://en.wikipedia.org/wiki/Prior_probability#Improper_pri...](https://en.wikipedia.org/wiki/Prior_probability#Improper_priors))
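
A minimal sketch of the idea (mine, not from the comment): sample x so that
log10(x) is uniform, i.e., "evenly distributed over multiplication", and the
leading digits come out Benford-distributed, matching log10(1 + 1/d):

    import math
    import random

    # x is log-uniform over six orders of magnitude: a "scale-like" variable.
    samples = 10**6
    counts = [0] * 10
    for _ in range(samples):
        x = 10 ** random.uniform(0, 6)
        counts[int(str(x)[0])] += 1

    # Compare observed frequency of each leading digit with log10(1 + 1/d).
    for d in range(1, 10):
        print(d, round(counts[d] / samples, 3), round(math.log10(1 + 1/d), 3))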

~~~
hyperbovine
The max entropy distribution over (-oo,+oo) is usually taken to be either
Gaussian or Laplace (two-sided exponential), depending on moment conditions.
There is no such thing as a uniform distribution over the real line.

~~~
BenoitEssiambre
Of course there is. It's an improper prior (it doesn't integrate to one); see the link
above. Theoretically the Gaussian is only maximum entropy given a mean and
stddev. An infinite uniform is maxent without any knowledge of mean and
stddev.

In practice though you can use a wide Gaussian if it is more mathematically
convenient and get about the same results.

~~~
hyperbovine
The entropy of any improper prior diverges. There is no point in arguing over
which one maximizes entropy; they all do, vacuously. Max entropy is not useful
for discriminating amongst probability distributions unless you require them
to actually be probability distributions.

------
ronenlh
Quoted from The New York Times, Tuesday, August 4, 1998
[http://www.rexswain.com/benford.html](http://www.rexswain.com/benford.html)
(linked in Wikipedia):

To illustrate Benford's Law, Dr. Mark J. Nigrini offered this example: "If we
think of the Dow Jones stock average as 1,000, our first digit would be 1.

"To get to a Dow Jones average with a first digit of 2, the average must
increase to 2,000, and getting from 1,000 to 2,000 is a 100 percent increase.

"Let's say that the Dow goes up at a rate of about 20 percent a year. That
means that it would take five years to get from 1 to 2 as a first digit.

"But suppose we start with a first digit 5. It only requires a 20 percent
increase to get from 5,000 to 6,000, and that is achieved in one year.

"When the Dow reaches 9,000, it takes only an 11 percent increase and just
seven months to reach the 10,000 mark, which starts with the number 1. At that
point you start over with the first digit a 1, once again. Once again, you
must double the number -- 10,000 -- to 20,000 before reaching 2 as the first
digit.

"As you can see, the number 1 predominates at every step of the progression,
as it does in logarithmic sequences."
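
The arithmetic in the quote can be checked with a short simulation (a sketch
assuming steady 20 percent annual growth, compounded daily):

    import math

    level = 1000.0                        # starting index level
    step = 1 / 365.0                      # advance one day at a time
    time_with_digit = [0.0] * 10
    total = 0.0
    while level < 1_000_000:              # run across three orders of magnitude
        time_with_digit[int(str(int(level))[0])] += step
        level *= 1.2 ** step              # 20% growth per year
        total += step

    # Fraction of time spent on each leading digit vs. Benford's prediction.
    for d in range(1, 10):
        print(d, round(time_with_digit[d] / total, 3),
              round(math.log10(1 + 1/d), 3))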

------
Symmetry
I used this, or rather the idea that numbers often tend to be distributed
evenly in exponential space, in my MEng thesis. The idea was that because
additions and other arithmetic operations in computers so often involve at
least one small number, most of the switching activity in an adder should be
concentrated in the low-order bits.
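
A rough sketch of that intuition (my illustration, not from the thesis):
accumulate log-uniformly distributed operands in a 32-bit adder and count
which result bits actually toggle.

    import random

    # Operands are log-uniform in magnitude, so most of them are small.
    random.seed(1)
    flips = [0] * 32
    acc = 0
    for _ in range(100_000):
        operand = int(2 ** random.uniform(0, 16))
        new = (acc + operand) & 0xFFFFFFFF    # 32-bit wraparound add
        changed = acc ^ new                   # bits that toggled this cycle
        for bit in range(32):
            if changed >> bit & 1:
                flips[bit] += 1
        acc = new

    print(flips)   # low-order bits toggle far more often than high-order ones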

So the plan was to make the low-order bits of the adder slower and more power
efficient, and the high-order bits faster and less power efficient: meet the
same overall clock-rate target given by the worst case, but use less power in
the average case.

Sadly memory operations, which look a lot like pure entropy, were sufficiently
common that the gains just weren't worth the design effort.

------
tbrock
This law actually led to Bernie Madoff being caught. I worked at a hedge fund
at the time and we applied this to his returns and immediately knew it was a
scam.

~~~
skunkworker
I’ve heard of this before, but part of me also wonders: how many people are
getting away with fraud because they are utilizing Benford's law in order to
make their data appear more normal? Or is there another attribute that can't
be accounted for when aiming for that?

~~~
Archit3ch
I've read that they have multiple "tells" for catching fraudsters. There are
almost 100 of them, and not all of them are as well known as Benford's Law. At
that point, it's just easier to run a legit business.

~~~
dredmorbius
Any pointers to details or research?

~~~
m3kw9
A lot of them probably deal with timing and some kind of correlation analysis.

------
dang
Sundry previous threads:
[https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...](https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=%22Benford%27s%20Law%22%20comments%3E0&sort=byDate&type=story)

~~~
cjohnson318
If anyone's interested, this is the breakdown of the leading digits of the
number of comments for these threads:

      {'1': 0.6896551724137931, '2': 0.13793103448275862,
       '3': 0.06896551724137931, '4': 0.06896551724137931,
       '5': 0.0, '6': 0.0, '7': 0.0,
       '8': 0.034482758620689655, '9': 0.0}

It's not _exactly_ Benford's distribution. It doesn't span more than two
orders of magnitude, but it's somewhat Benfordian, and it is what you might
expect from a distribution of comments.
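
For what it's worth, a hypothetical reconstruction of the computation (the
sample values below are made up; the real counts came from the Algolia search
above):

    from collections import Counter

    comment_counts = [12, 5, 180, 23, 41, 110, 9]   # made-up sample data
    tally = Counter(str(n)[0] for n in comment_counts)
    total = len(comment_counts)
    print({d: tally.get(d, 0) / total for d in '123456789'})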

~~~
lonelappde
Now do lengths of comments.

Also, please don't report 16 digits when your data set only has 1 sigfig.

~~~
tantalor
ASCII bar chart or get out.

~~~
dredmorbius

        i: 10  min: 0.000000  max: 0.689655  range: 0.689655  width: 70  scale: 101.500000
        1   0.69 **********************************************************************
        2   0.14 **************
        3   0.07 *******
        4   0.07 *******
        5   0.00
        6   0.00
        7   0.00
        8   0.03 ***
        9   0.00
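
A chart like the above can be reproduced in a few lines (a sketch of mine,
hard-coding the frequencies reported upthread):

    freqs = {1: 0.69, 2: 0.14, 3: 0.07, 4: 0.07, 5: 0.0,
             6: 0.0, 7: 0.0, 8: 0.034, 9: 0.0}
    # Scale so the largest bar spans the full 70-character width.
    scale = 70 / max(freqs.values())
    for d, f in freqs.items():
        print('{}   {:.2f} {}'.format(d, f, '*' * round(f * scale)))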

------
knzhou
Interestingly enough, it's also a decent fit for fundamental physical
constants, as one can read off from a table of them. [0]

0:
[https://physics.nist.gov/cuu/pdf/all.pdf](https://physics.nist.gov/cuu/pdf/all.pdf)

~~~
polyphonicist
That's interesting indeed. Any idea why physics constants obey Benford's Law?
Did the universe pick the physical constants from an exponential distribution?

~~~
knzhou
In this particular case it's mostly reflecting our choices of units, and the
fact that changing the units can change the constants by many orders of
magnitude. Benford's law comes from a distribution which is invariant under
that kind of change.

However, it's an interesting question whether the _dimensionless_ fundamental
constants have a deeper explanation, since they don't depend on unit choices.
The fermion masses are roughly evenly distributed logarithmically, for example
(which leads to Benford's law). There isn't a clear reason why.
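
For a rough sense of this, here are approximate charged-fermion masses (PDG
central values in MeV, rounded; my illustration, not data from the comment):

    from collections import Counter

    # e, mu, tau, u, d, s, c, b, t (approximate masses in MeV)
    masses = [0.511, 105.7, 1777, 2.2, 4.7, 93, 1270, 4180, 172700]
    leading = Counter(next(c for c in str(m) if c in '123456789')
                      for m in masses)
    print(leading)   # even in this tiny sample, 1 is the most common digit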

------
eqmvii
Spent some time in college studying the use of this to detect election fraud.
A really intriguing concept, but then the question becomes: what next? Is the
statistical data enough to trigger a deeper investigation?

~~~
patrickthebold
You are phrasing your question like it might be controversial. Why wouldn't
'statistical data' be enough to trigger a deeper investigation?

~~~
ceejayoz
Everything in politics is controversial. "Let's feed kids more veggies in
school lunches" somehow became socialism.

------
jedimastert
VSauce did a pretty fun video on the same topic (sort of)

[https://www.youtube.com/watch?v=fCn8zs912OE](https://www.youtube.com/watch?v=fCn8zs912OE)

------
logicallee
I thought I could try this really easily in Python.

Here's what I did:

Open IDLE and type:

    
    
       from random import seed
       from random import random
       seed(1)
    

You can then try typing random() and get values:

    
    
       >>> random()
       0.13436424411240122

       >>> random()
       0.8474337369372327

       >>> random()
       0.763774618976614
    

I actually forgot that random() doesn't return an integer the way rand() in C
does. But these numbers looked great to me: I could easily imagine the first
one, 0.13436424411240122, just being 13436424411240122, and so on. It seemed
like a great way to get the first digit of a random number of unlimited size,
so I went ahead on that basis.

So, on that basis, I want to look at the distribution of the first nonzero
digit after the decimal point.

So next, put in the following function. (Basically it just gets a random
number as above, but keeps multiplying by 10 as long as the first character of
the string conversion is a '0'. This returns the leftmost non-zero digit.)

    
    
        def getone():
           floaty = random()                 # uniform float in [0, 1)
           while(str(floaty)[0] == '0'):     # string still starts '0.xyz...'
              floaty *= 10                   # shift one decimal place left
           return int(str(floaty)[0])        # first non-zero digit
    

This now gets us the leftmost digit, check it out:

    
    
       >>> getone()
       3
       >>> getone()
       2
       >>> getone()
       4

       >>> getone()
       2
    

All right. Now let's just do this a few million times and keep track of
buckets. Make some buckets:

    
    
       buckets = [0,0,0,0,0,0,0,0,0,0] # buckets for: 0,1,2,3,4,5,6,7,8,9
    

This next loop will run for up to a few minutes:

    
    
       for tries in range(10000000):
          buckets[getone()] += 1
    

Now print the results:

    
    
       for result in range(10):
          print (result, ":", buckets[result])
    

This gets me....

    
    
       0 : 0
       1 : 1112927
       2 : 1110821
       3 : 1109741
       4 : 1110793
       5 : 1111604
       6 : 1112178
       7 : 1111272
       8 : 1110402
       9 : 1110262
    

Clearly Benford's law does NOT apply here.

Why not?

~~~
susam
Your experiments show that when we pick numbers from a uniform distribution,
each digit from 1 to 9 is equally likely to be the leading significant digit.
This is perfectly fine. Numbers picked from a uniform distribution indeed do
not obey Benford's law.

~~~
logicallee
thanks for the analysis! I thought I must have made a mistake.

------
ggm
If we counted in hex, how would it vary? It appears to me to be "not much",
because it says most things scale to express as instances of the first non-
zero value in the number system plus an arbitrary number of powers of the
base. So... it's Fermi arithmetic: 0, 1, 10, 100 ...

~~~
chess93
Benford's law works in any base.
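
Indeed: in base b the predicted frequency of leading digit d is
log_b(1 + 1/d). A quick sketch (mine) tallying leading hex digits of 3^n; note
3 rather than 2, since log16(2) = 1/4 is rational, so 2^n only ever leads with
1, 2, 4, or 8 in hex:

    import math

    N = 10000
    counts = [0] * 16
    value = 1
    for _ in range(N):
        value *= 3
        counts[int(format(value, 'x')[0], 16)] += 1   # leading hex digit

    for d in range(1, 16):
        print('{:x} => {:.3f} (predicted: {:.3f})'.format(
            d, counts[d] / N, math.log(1 + 1 / d, 16)))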

~~~
angry_octet
Except base 1.

------
m3kw9
The law definitely has a human psychology component to it. When stock prices
are close to, say, 1000, they will likely get pushed over to that
psychological level.

------
pugworthy
There are a number of fun random generators online - worth googling.

------
susam
Experimental evidence for uniform distribution:

    
    
      # uniform.py
    
      import random
    
      n = 10**4
      samples = 10**6
      counts = [0] * 10
    
      for i in range(samples):
          random_number = random.randint(1, n)
          first_digit = int(str(random_number)[0])
          counts[first_digit] += 1
    
      for i, count in enumerate(counts):
          print('{} => {:.3f}'.format(i, count / samples))
    

Output:

    
    
      $ python3 uniform.py
      0 => 0.000
      1 => 0.111
      2 => 0.111
      3 => 0.111
      4 => 0.111
      5 => 0.111
      6 => 0.111
      7 => 0.111
      8 => 0.111
      9 => 0.111
    

Experimental evidence for exponential distribution:

    
    
      # exponential.py
    
      import random
    
      n = 10**4
      samples = 10**6
      counts = [0] * 10
    
      for i in range(samples):
          random_number = 2 ** random.randint(1, n)
          first_digit = int(str(random_number)[0])
          counts[first_digit] += 1
    
      for i, count in enumerate(counts):
          print('{} => {:.3f}'.format(i, count / samples))
    

Output:

    
    
      $ python3 exponential.py
      0 => 0.000
      1 => 0.301
      2 => 0.176
      3 => 0.125
      4 => 0.097
      5 => 0.079
      6 => 0.067
      7 => 0.058
      8 => 0.051
      9 => 0.046

~~~
dmurray
You chose N in such a way to make your hypothesis true.

~~~
susam
The variable 'n' has to be sufficiently large. The larger 'n' is, the more it
agrees with Benford's law. For example, with n = 10 (as opposed to n = 10000
in my original code), we get results that disagree with Benford's law quite a
bit:

    
    
      0 => 0.000
      1 => 0.301
      2 => 0.200
      3 => 0.100
      4 => 0.100
      5 => 0.100
      6 => 0.100
      7 => 0.000
      8 => 0.100
      9 => 0.000
    

In fact, the results for digits 7 and 9 are in complete disagreement with
Benford's law. That's not surprising because there is no integer n between 0
and 10 such that 2^n has 7 or 9 as its first digit. Therefore, no matter how
many trials we perform, the frequency of the first digit as 7 or 9 would
always be 0. The first power of 2 for which we get 7 as the leading digit is
2^46. Similarly, the first power of 2 for which we get 9 as the leading digit
is 2^53. So unless the powers are much larger than these numbers, we won't get
frequencies for 7 and 9 that agree with Benford's law.
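
Those two thresholds are easy to verify (a one-off check in the same spirit as
the code above):

    print(next(n for n in range(1, 100) if str(2**n)[0] == '7'))   # prints 46
    print(next(n for n in range(1, 100) if str(2**n)[0] == '9'))   # prints 53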

With n = 100, the experimental results become closer to Benford's law:

    
    
      0 => 0.000
      1 => 0.300
      2 => 0.170
      3 => 0.130
      4 => 0.100
      5 => 0.070
      6 => 0.070
      7 => 0.060
      8 => 0.050
      9 => 0.050
    

But with n = 1000, the results become even closer to Benford's law:

    
    
      0 => 0.000
      1 => 0.301
      2 => 0.176
      3 => 0.125
      4 => 0.097
      5 => 0.079
      6 => 0.069
      7 => 0.056
      8 => 0.052
      9 => 0.045

~~~
susam
Instead of performing trials in an experiment, another way to look at this
problem is to simply analyze the frequency distribution of the leading
significant digit for a function that grows exponentially. Like before, let us
choose 2^n as that function.

Here is the code:

    
    
      N = 10
      counts = [0] * 10
    
      for n in range(N):
          first_digit = int(str(2**n)[0])
          counts[first_digit] += 1
    
      for i, count in enumerate(counts):
          print('{} => {:.3f}'.format(i, count / N))
    

For N = 10, i.e., 0 ≤ n < 10, we get:

    
    
      0 => 0.000
      1 => 0.300
      2 => 0.200
      3 => 0.100
      4 => 0.100
      5 => 0.100
      6 => 0.100
      7 => 0.000
      8 => 0.100
      9 => 0.000
    

For N = 100, i.e., 0 ≤ n < 100, we get:

    
    
      0 => 0.000
      1 => 0.300
      2 => 0.170
      3 => 0.130
      4 => 0.100
      5 => 0.070
      6 => 0.070
      7 => 0.060
      8 => 0.050
      9 => 0.050
    

For N = 1000, i.e., 0 ≤ n < 1000, we get:

    
    
      0 => 0.000
      1 => 0.301
      2 => 0.176
      3 => 0.125
      4 => 0.097
      5 => 0.079
      6 => 0.069
      7 => 0.056
      8 => 0.052
      9 => 0.045
    

For N = 10000, i.e., 0 ≤ n < 10000, we get:

    
    
      0 => 0.000
      1 => 0.301
      2 => 0.176
      3 => 0.125
      4 => 0.097
      5 => 0.079
      6 => 0.067
      7 => 0.058
      8 => 0.051
      9 => 0.046
    

These results are consistent with the experimental results in my previous
comment. As we can see, the larger N is, the more closely the frequencies of
the leading digit obey Benford's law. Similar results can be reproduced with
many
other exponentially increasing functions too, e.g., 3^n, 5^n, Fibonacci
numbers, Lucas numbers, and even factorials that grow faster than exponential
functions.
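
For instance, a minimal sketch (mine) applying the same tally to the first N
Fibonacci numbers:

    N = 10000
    counts = [0] * 10
    a, b = 1, 1
    for _ in range(N):
        counts[int(str(a)[0])] += 1   # leading digit of the current Fibonacci
        a, b = b, a + b

    for i, count in enumerate(counts):
        print('{} => {:.3f}'.format(i, count / N))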

~~~
dmurray
You keep choosing N to be an integer power of 10. Make it a power of 2 or 7 or
pretty much any number, and you'll see different results.

~~~
susam
I do not see different results:

    
    
      N = 8192
      counts = [0] * 10
    
      for n in range(N):
        first_digit = int(str(2**n)[0])
        counts[first_digit] += 1
    
      for i, count in enumerate(counts):
        print('{} => {:.3f}'.format(i, count / N))
    

Output:

    
    
      0 => 0.000
      1 => 0.301
      2 => 0.176
      3 => 0.125
      4 => 0.097
      5 => 0.079
      6 => 0.067
      7 => 0.058
      8 => 0.051
      9 => 0.046
    

I still see results that agree with Benford's law.

~~~
logicallee
I did something similar (except you're a better coder, haha), but when I
looked up how to get a random number in Python, the first thing I saw was how
to get one between 0 and 1, so I went with that. Here is my comment (very
similar in spirit to yours):

[https://news.ycombinator.com/item?id=22341416](https://news.ycombinator.com/item?id=22341416)

Any comments?

~~~
susam
Replied on your comment thread.

