
Testing Benford's Law - brycethornton
http://testingbenfordslaw.com
======
jgrahamc
Other fun I've had with Benford's Law.

1\. Spotting odd things in MPs' expenses:
<http://blog.jgc.org/2009/06/its-probably-worth-testing-mps.html>

2\. Spotting odd things in BBC executives' expenses:
<http://blog.jgc.org/2009/06/running-numbers-on-bbc-executives.html>

3\. The Iranian election:
<http://blog.jgc.org/2009/06/benfords-law-and-iranian-election.html>

4\. New Age mumbo jumbo:
<http://www.jgc.org/blog/2008/02/any-sufficiently-simple-explanation-is.html>

~~~
wisty
It's also interesting how you showed how Benford's Law breaks down, especially
when prices are involved. There's lots of $10, $100, and $1000 limits, so you
will get a lot of prices being pushed _back_ to something starting with an 8
or 9.

~~~
jgrahamc
Yes, that's why it's an interesting tool for spotting 'anomalies'. That
doesn't mean it's spotting things that are incorrect, or fraudulent, or
illegal etc., just that it spots things that are out of the ordinary.
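
A minimal sketch of this kind of first-digit check, assuming the data is just a list of nonzero numbers. The function names and the sample claims are hypothetical and not taken from the linked posts. A large deviation only marks a dataset as out of the ordinary; it says nothing about why.

```python
import math
from collections import Counter

def leading_digit(x):
    """Return the first significant digit (1-9) of a nonzero number."""
    # Scientific notation puts the first significant digit first.
    return int(f"{abs(x):.15e}"[0])

def benford_expected(d):
    """Benford's predicted share of values whose first digit is d."""
    return math.log10(1 + 1 / d)

def benford_deviation(values):
    """Per-digit (observed share - expected share), ignoring zeros."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return {d: counts[d] / n - benford_expected(d) for d in range(1, 10)}

# Hypothetical expense claims clustered just under a limit of 100.
claims = [99.50, 98.00, 97.25, 12.40, 95.10, 99.99, 18.75, 96.00]
deviation = benford_deviation(claims)
# Digit with the largest excess over Benford's prediction.
print(max(deviation, key=deviation.get))
```

With so few values this is only an illustration; a real check would need far more data and a proper significance test.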

~~~
lurker19
Well, a price that is specifically modified to dodge regulatory scrutiny is
suggestive of some sort of corruption.

Cf. _Say Anything_.

------
bluesmoon
I like the history section of the wikipedia article:

<blockquote>The discovery of this fact goes back to 1881, when the American
astronomer Simon Newcomb noticed that in logarithm books, the earlier pages
(which contained numbers that started with 1) were much more worn than the
other pages.</blockquote>

Can you imagine the sense of observation and curiosity that would make someone
look at a book of numbers and say, "I wonder why these pages are more worn
than those ones."

~~~
gnosis
Before ever hearing of Benford's law I've noticed that many books are more
worn at the beginning than further on.

I simply chalked it up to most people not being very serious about reading
books in general and any given book in particular. It's a rare person who
makes it all the way through.

I don't think my own observation was a particularly interesting or original
one.

What made Newcomb's observation interesting was that it was about books of
logarithm tables in particular, where (unlike a typical book) you'd think the
lookups would be uniformly distributed.

The other interesting thing that did require an unusual amount of curiosity
and dedication is the systematic testing of such a casual observation to try
to figure out what the underlying reasons for it were and how they might apply
to things other than books of logarithms. This desire and dedication to
observe, test, and figure out the underlying workings of things is the
hallmark of many a great scientist.

------
gjm11
There's a nice discussion of this from Terry Tao (outrageously smart
mathematician; has a Fields medal) at
<http://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/>
which contains, e.g., the following nice
observation: if X follows Benford's law and Y is any positive random variable
independent of X, then XY also follows Benford's law. (Tao goes a bit further
than this and thereby sheds some light on why many things approximately obey
Benford's law.)

[EDITED to add: Discussed before on HN:
<http://news.ycombinator.com/item?id=687241>. There have been quite a number
of other discussions of Benford's law on HN, too.]
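
The XY observation can be checked numerically. This is a rough simulation sketch, not Tao's argument: X is built as 10**U with U uniform on [0, 1), which is one standard way to get an exactly Benford-distributed variable, and the choice of Y (uniform between 1 and 7) is an arbitrary assumption.

```python
import math
import random

def leading_digit(x):
    """First significant digit of a positive float."""
    return int(f"{x:.15e}"[0])

def first_digit_shares(values):
    """Share of values starting with each digit; index 0 is unused."""
    counts = [0] * 10
    for v in values:
        counts[leading_digit(v)] += 1
    return [c / len(values) for c in counts]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000
# X = 10**U with U uniform on [0, 1) follows Benford's law.
xs = [10 ** rng.random() for _ in range(n)]
# Y: an arbitrary positive variable drawn independently of X.
ys = [rng.uniform(1, 7) for _ in range(n)]
shares = first_digit_shares([x * y for x, y in zip(xs, ys)])

# Columns: digit, simulated share for XY, Benford's prediction.
for d in range(1, 10):
    print(d, round(shares[d], 3), round(math.log10(1 + 1 / d), 3))
```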

------
kia
This is interesting (from Wikipedia article on Benford's Law):

In the United States, evidence based on Benford's law is legally admissible in
criminal cases at the federal, state, and local levels.

~~~
nl
What does that even mean? Is there some special law that says "Benford's Law
is admissible"? I'm guessing not, and that all it means is that it's a piece
of evidence that can be used, similarly to any other statistical evidence.

Wouldn't it be stranger (and _actually_ interesting) if that evidence _wasn't_
admissible?

~~~
loumf
It probably means that the precedent has been established already -- so future
courts are likely to just accept it rather than having to go through a re-
hearing of Benford from the very beginning.

------
polynomial
Benford's law only seems strange until you realise that natural phenomena tend
to follow logarithmic scales while our commonly used system of counting and
measuring is linear.

It's still a bit of a brain f--- when you first encounter it. I found it
easier to _get_ using plotting tools, as opposed to aggregating lists of
numbers and measurements.

~~~
r00fus
Exactly. I immediately thought it's an artifact of our base-10 system.

~~~
polynomial
Except that it applies to any base-n system. It's an artifact of the
underlying system, not the base value.

~~~
r00fus
What about base-e?

~~~
orangecat
That's just silly. On the other hand, base 2i would be very efficient:
<http://en.wikipedia.org/wiki/Quater-imaginary_base>

------
seasoup
Seems to me that when you have a group of somethings that are constantly
increasing in size it would be natural for the number 1 to come up in the
first digit more often because in order to get to 2, you need to pass through
1 first and in order to get to 9 you need to pass 1,2,3,4,5,6,7,8 first.
Therefore, you should get the distribution predicted by Benford's law. The way
to test this theory would be to run the numbers on values that are constantly
decreasing. I'd expect the distribution would reverse itself.

If it proves itself true, then you could use it to test if a group of things
is increasing or decreasing.

~~~
verycleanteeth
It depends on where you start decreasing from. If you start at 0 and subtract
1 at a time you'll still follow Benford's Law. If you start at 999 and go
down, you'll see a reverse curve that begins the process of righting itself
again once it goes into the negatives.

~~~
seasoup
True, I was considering things like decreasing populations. Example, if you
take all cities/countries on the planet with declining population and plot
their current population, I would expect the reverse of Benford's law. Wish I
had time to test this hypothesis :)
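
One quick way to try this without real data is a toy model, assuming a quantity that shrinks by a constant percentage each year. The starting value, the 3% rate and the function name are all hypothetical, and the value is allowed to fall far below one person purely so that it crosses many orders of magnitude. In this idealized model the share of leading 1's stays the largest rather than reversing; real declining populations need not behave like this.

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a positive float."""
    return int(f"{x:.15e}"[0])

def digit_shares_under_constant_rate(start, rate, years):
    """Share of years spent at each leading digit when a value
    changes by the fraction `rate` every year."""
    counts = Counter()
    value = start
    for _ in range(years):
        counts[leading_digit(value)] += 1
        value *= 1 + rate
    return {d: counts[d] / years for d in range(1, 10)}

# Hypothetical: start at 9 million and shrink 3% a year.
shares = digit_shares_under_constant_rate(9_000_000, -0.03, 2000)

# Columns: digit, simulated share, Benford's prediction.
for d in range(1, 10):
    print(d, round(shares[d], 3), round(math.log10(1 + 1 / d), 3))
```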

------
rflrob
My favorite explanation of it is that if there is a distribution to the
numbers, then that distribution should hold no matter what base you're working
in (for natural things, after all, there's nothing special about base 10), and
Benford's law can be shown to be a) a law that satisfies this base-independent
property, and b) the only law that does so.

~~~
jules
s/base/units/

~~~
jules
Seeing that people are downvoting this, perhaps it wasn't obvious enough.

Proof that the starting digit in numbers is not base invariant:

In base 10 not all numbers start with 1. In base 2 all numbers start with 1.
Hence, the distribution is not base invariant. QED.

For an explanation why you get the right thing if you substitute "units", I
refer you to the Wikipedia page on Benford's Law.
<http://en.wikipedia.org/wiki/Benfords_law>
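
A small sketch of the units-versus-base point. The powers of 2 stand in for a dataset that tracks Benford's law closely, and treating them as lengths in feet is purely an assumption for illustration. Converting to inches (times 12) leaves the first-digit shares roughly unchanged, while rewriting the same numbers in binary makes every leading digit 1.

```python
import math
from collections import Counter

def first_digit_shares(values):
    """Share of positive integers starting with each decimal digit."""
    counts = Counter(int(str(v)[0]) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

# Hypothetical lengths in feet: the powers 2**1 .. 2**1000.
feet = [2 ** n for n in range(1, 1001)]
inches = [12 * f for f in feet]  # same quantities, different unit

in_feet = first_digit_shares(feet)
in_inches = first_digit_shares(inches)

# Columns: digit, share in feet, share in inches, Benford's prediction.
for d in range(1, 10):
    print(d, round(in_feet[d], 3), round(in_inches[d], 3),
          round(math.log10(1 + 1 / d), 3))

# Changing the base is different: in binary every positive integer
# starts with the digit 1.
print(all(bin(v)[2:].startswith("1") for v in feet))
```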

------
imurray
Searching reveals _lots_ of previous discussion on Benford's law on here, so I
won't give all the links. Of course, it's an interesting observation, so it's
worth advertising every so often.

Here are some hacker-newsers testing files in their home directories:
<http://news.ycombinator.com/item?id=1076534>

------
dfan
As far as I can tell, "Most common iPhone passcodes" doesn't belong on this
list, and I'm perplexed why it seems to follow the law. An iPhone numeric
password (which I'm assuming it's referring to) is simply a 4-digit string, so
all first digits should be equally probable, unless there's some psychological
issue at work. Or are they discarding leading zeros for the purpose of this
chart? I guess they must be (0 doesn't appear on the chart), but that's a
weird thing to do to a password.

~~~
regularfry
My guess is that people tend to pick "meaningful" numbers.

------
ColinWright
Other discussions on Benford's law and related material:

<http://news.ycombinator.com/item?id=100540>

<http://news.ycombinator.com/item?id=499405>

<http://news.ycombinator.com/item?id=699202>

<http://news.ycombinator.com/item?id=731176>

<http://news.ycombinator.com/item?id=1076405>

<http://news.ycombinator.com/item?id=1429336>

<http://news.ycombinator.com/item?id=1569669>

<http://news.ycombinator.com/item?id=1653808>

<http://news.ycombinator.com/item?id=1917514>

<http://news.ycombinator.com/item?id=2089809>

<http://news.ycombinator.com/item?id=2375453>

<http://news.ycombinator.com/item?id=2400049>

~~~
JasonPunyon
Check out those post ids :)

------
breck
Imagine you threw a single stone into the desert and asked your friend to go
find it. It would be hard. Now imagine you threw 2 stones into the desert and
asked your friend to go find them. It is twice as hard to find both stones as
it is to find 1 stone. Imagine you threw 3 stones. It is 3 times as hard to
find all 3 stones as it is to find 1 stone.

Now imagine that numbers are built out of stones. To "build" a 1, you only
need 1 stone. But to "build" a 2, you need 2 stones. Thus, if you wanted to
write a 3, you would have to go in the desert and find 3 stones. It's 3x as
hard, and so you'd expect people to "build" 1/3 as many 3's as 1's, 1/5 as
many 5's as 1's, and so on. Just as you'd expect there to be a lot more single
story buildings than skyscrapers. It's easier to build a single story
building.

Thus, the distribution is exactly what you'd expect. While it doesn't actually
take stones to build numbers, we don't write the number 3 unless we have 3 of
something. Unless you are lying. Which is why this is a great method of
detecting fraud.

UPDATE: What do I mean when I say "3 times as hard"?

Imagine the desert is a rectangle of 10 squares. Kind of like a mancala board
or a ladder on the ground. You start by stepping in square 1, and to get to
square 10 you have to step through each square.

If there is only 1 rock, what are the odds that you'll have to walk all 10
steps to find it? This is the same thing as asking what are the odds that this
rock is in square 10. The answer is 1/10 or 10%.

Now, if there are 3 rocks, what are the odds that you'll have to step into all
10 squares? Well, what are the odds that there's a rock in the last square?
27.1%, or approximately 3x as hard. It's interesting that it's not exactly 3x
as hard, it's 2.71x as hard. Which makes the data in the OP seem even more
logical since you'd expect 32.0% 1's given 11.8% 3's--the 32.62% actual number
is not that far off.
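
The ten-square calculation can be written out directly. This sketch assumes each rock lands in one of the squares independently and uniformly at random, and that several rocks may share a square; the function name is made up for illustration. Under those assumptions the three-rock case works out to 1 - 0.9^3, about 27.1%.

```python
def chance_last_square_occupied(rocks, squares=10):
    """Probability that at least one rock lies in the final square."""
    # Each rock independently misses the last square with
    # probability (squares - 1) / squares.
    return 1 - ((squares - 1) / squares) ** rocks

# Columns: number of rocks, chance of having to walk all the squares.
for rocks in (1, 2, 3):
    print(rocks, round(chance_last_square_occupied(rocks), 3))
```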

~~~
etruong42
It is less than twice as hard to find both stones when you threw 2 stones than
it is to find the only stone you threw when you threw only 1.

Suppose you are the guy looking for the stones. There are two stones in the
desert. Everything being random but equal, you are twice as likely to run into
a stone when there are two than when there is only one stone in the desert.
Once you find the first stone, it is equally difficult to find the second
stone as it is to find only one stone at the beginning (if you treat "finding
a stone" as independent events where you don't learn about the location of
subsequent stones).

So while the idea is interesting, the analogy is poor. I much prefer the
wikipedia explanation which is similar to yours but much more logically
rigorous:
[http://en.wikipedia.org/wiki/Benfords_law#Outcomes_of_expone...](http://en.wikipedia.org/wiki/Benfords_law#Outcomes_of_exponential_growth_processes)

Response to update: Now I feel that you are convoluting your analogy. Can
multiple stones occupy the same square? How is it appropriate to
equate/compare "the number of squares you walk through in order to pick up all
the stones" to "the number of times a digit should show up"? I apologize, but
your illustration has become completely lost to me.

~~~
breck
> Can multiple stones occupy the same square?

You're right I should have clarified. If multiple stones could not occupy the
same square, the odds would remain as I first explained them (3x, etc.). I
think in my stones analogy and real life, stones should be able to occupy the
same square. In fact, there should be a positive correlation (ie, given that
there's a rock in this square, odds of a second rock being there go up).

> How is it appropriate to equate/compare "the number of squares you walk
> through in order to pick up all the stones" to "the number of times a digit
> should show up"?

Coming across 3 units of a quantity is 3x as hard as coming across 1 unit.
When we write numbers, we are either:

1) writing a truthful description of how many units we see/own/ate/taste/touch
etc. (I ate 2 bagels, I earned $5, I ran 10 miles.)

2) lying.

By "lying", I'm including things like writing a novel. Maybe a better word is
"imagining". With numbers, we are either writing down true observations or we
are imagining them. It's just as easy to "imagine" $9 million in your bank
account as it is to "imagine" $1 million, while truthfully finding $9 million
in your bank account is a lot more difficult :). This is why Benford's law
doesn't apply for "imagined" numbers. By using Benford's law, you can quickly
classify a number set into either "real" or "imagined".

~~~
etruong42
Ah. But there's the rub. What I find unintuitive about Benford's law is the
non-random distribution of the most significant digit regardless of base or
unit of measure. You propose that it's the "largeness" of a number that
enforces Benford's law. While that may be in some ways true, it does not
explain the transparency to base or unit of measure. You ate 2 bagels? I ate 4
half-bagels. You earned $5? I earned ¥600 motherfucker! You ran 10 miles? I
ran 3bf3e6800 micrometers, in base-16!

Again, your train of thought is not necessarily wrong, but I still find the
wikipedia explanation much more robust and illustrative. I hope this is where
we can agree to disagree.

~~~
breck
The reason why it only applies to the most significant digit is that I can say
for certain that quantities of 1_ will appear ~2x as much as quantities in the
2_x family. However, I can't say whether numbers ending in 1 are more common
than numbers ending in 6, because although 11 occurs more than any number
higher than it, it makes up a minuscule proportion of the numbers ending in 1,
and 16 occurs more than 21, 31, etc., so there's no clear way to predict what
number will occur most in any digit but the most significant.

Thanks for offering your views. My analogy may be wrong or weak and maybe
there is a better one to be found.

------
synnik
Why is this not common sense?

For the numbers 1-19, more than half of them start with 1. For the numbers
1-199, more than half of them start with one.

Change the examples to 1-299, 1-399, etc, and you'll get percentages of all
digits matching Benford's law.

~~~
jcarreiro
Benford's Law predicts the first digit will be 1 about 30% of the time, not
50% of the time.

Your method also seems to depend heavily on the choice of starting and ending
points. If I chose 1-99, then only 1/10th of the numbers in the interval will
start with 1. So why choose 199 and not 99?
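
The dependence on the end point is easy to see by counting. A small sketch (the function name is hypothetical) that gives the share of the integers 1..N starting with a given digit for a few choices of N:

```python
def share_starting_with(digit, upper):
    """Share of the integers 1..upper whose first digit is `digit`."""
    hits = sum(1 for n in range(1, upper + 1) if str(n)[0] == str(digit))
    return hits / upper

# Columns: end point N, share of 1..N that starts with 1.
for upper in (19, 99, 199, 999, 1999):
    print(upper, round(share_starting_with(1, upper), 3))
```

The share swings between roughly 11% and 56% depending on where the range stops, so no single end point gives the 30% figure by itself.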

~~~
synnik
It is not "my method". Just an explanation of why the law seems intuitive to
me. I expected everyone to extrapolate out from my examples to other ending
points, at which point, yes, the % for the digit "1" would drop and
approach that 30% rate.

I just selected end points to illustrate the concept. I think this place is
getting a little too literal. :)

------
skrebbel
Cool stuff. However, something almost entirely off-topic that I genuinely
wonder about: it seems everybody registers a .com just to make a HN post.
What's the point of this? Why not post the same data on your blog?

~~~
blackant
I can't speak for everyone else, but in this case we were hoping to create
something that could stand on its own and (hopefully) grow over time as more
datasets are added. The domain wasn't registered just to get on HN - the idea
lent itself to a simple site of its own, so why not use a fitting domain name?

~~~
hn_decay
Benford's Law applies to anything where values appreciate or depreciate
exponentially. Kind of weird to see it relate to passcodes (where it isn't
really Benford's Law at all, but instead is simply numeric proximity), but for
monetary values, forces of nature, population counts, etc, Benford's Law will
always apply to a large enough set, for reasons that become very evident once
you understand the reasoning.

<http://blog.yafla.com/Demystifying_Benfords_Law>

Best page I've seen on it.

------
EGreg
Benford's law makes a lot of sense if you consider that many of the numbers
are derived from counting up from 0. The scale of these things is
exponentially distributed, and therefore the leading digits are more likely to
be 1 than 9. This is related to social media -- once your userbase gets big
enough it starts growing or shrinking proportionally to its size, i.e.
exponentially. This is also somewhat related to the value of a social network
... Metcalfe's law seems to be too optimistic. The value is probably more like
n log n.

------
GregBuchholz
I always liked: "Explaining Benfords Law"
(<http://www.dspguide.com/CH34.PDF>).

------
scarmig
Whoah, check out the distribution of the leading digit in binary!

~~~
ianterrell
Generalized to base b:
[http://upload.wikimedia.org/math/a/7/e/a7e38730abb9b29099d10...](http://upload.wikimedia.org/math/a/7/e/a7e38730abb9b29099d109d4e39f0055.png)
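
For reference, the usual statement of the base-b version is P(d) = log_b(1 + 1/d) for d from 1 to b - 1, which is presumably what the linked image shows. A short sketch of it (the function name is hypothetical):

```python
import math

def benford_share(digit, base=10):
    """Predicted share of leading digit `digit` (1..base-1) in `base`."""
    if not 1 <= digit < base:
        raise ValueError("digit must be between 1 and base - 1")
    return math.log(1 + 1 / digit, base)

# Base 10: the familiar shares for digits 1 through 9.
print([round(benford_share(d, 10), 3) for d in range(1, 10)])
# Base 2: the only possible leading digit is 1, so its share is everything.
print(benford_share(1, 2))
```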

------
kmod
"If a set of values were truly random, each leading digit would appear about
11% of the time"

This kind of mathematically unsophisticated reasoning is exactly why Benford's
law is so surprising to people. If you think of what it means for a value to
be "truly random", the result is not surprising at all.

~~~
sesqu
"random" is commonly taken to be uniformly distributed. There is no particular
reason to expect unbounded support, even if you feel it shouldn't be a certain
interval.

------
callenish
Perhaps you should let some of the open data citizen groups know about this so
they can add more data. Also, if you haven't already then take a look at
CKAN[1] for datasets to add.

[1] <http://ckan.net/>

------
cycojesus
'Presenting Benford's law' would be a more fitting title. Nicely presented,
and an intriguing law for sure, but I can't help but think "and?" At this point
it lacks a more user-friendly way to submit data-sets.

------
pkamb
This one is interesting:
<http://testingbenfordslaw.com/most-common-iphone-passcodes>

I wonder what influence the 'spatial' properties of a number pad password have
on this data. For example "5" gets a nice little spike... and "5" is the
center key on the 10-key iPhone number pad. The "1" is still the winner by
far, but I wonder how many of those are the easy-to-remember "1234".

------
iambot
Great site, awesome design. I love Benford's law; I first heard about it on
WNYC's RadioLab (best podcast in the world ever, I'm not even kidding).

~~~
Dylanfm
Here's the Benford's Law snippet from the Radiolab episode:
<http://www.radiolab.org/2009/nov/30/from-benford-to-erdos/>

------
jbreinlinger
It seems to me there are a lot of interesting psychological elements to this,
but it's also a simple reflection of relatively constant growth rates. If the
populations of cities grow 3% every year, they will spend a lot more years in
the 1 millions than in the 9 millions, etc.

Chart looks like this. <https://url.odesk.com/a7och>
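
The growth argument can be put in numbers. A sketch assuming a steady 3% annual growth rate, taken from the example above; the function name is made up. The time spent with leading digit d is the time needed to grow from d to d + 1 (in millions, say), and as a share of the time for a full tenfold increase it equals log10(1 + 1/d).

```python
import math

def years_with_leading_digit(digit, annual_growth=0.03):
    """Years needed to grow from digit * 10**k to (digit + 1) * 10**k."""
    return math.log((digit + 1) / digit) / math.log(1 + annual_growth)

years = {d: years_with_leading_digit(d) for d in range(1, 10)}
total = sum(years.values())  # years needed to grow tenfold

# Columns: digit, years spent there, share of the tenfold period.
for d in range(1, 10):
    print(d, round(years[d], 1), round(years[d] / total, 3))
```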

------
pragmatic
FYI, The text of the article is scrambled (Chrome 12, Windows 7 64 bit)

------
Havoc
>Imagine a large dataset, say something like a list of every country and its
population.

How is that a large dataset? There aren't that many countries.

------
moheeb
Benford's Law seems like common sense to me.

Any time you are counting something it seems obvious to me that you'd have 1
more often than 2.

------
paraschopra
I kind of feel that the initial data sets are selected just to reinforce
Benford's law. It seems too good to be true!

------
cyberony
My first time hearing about this law (sadly) and I'm stumped! This is
amazing!!

------
dfc
Does anyone else have trouble with the font on that page?

------
blakerobinson
Benford's Law has always been kind of fascinating to me.

------
quasar
No black swan :P

------
mmff
nice!

