1. Spotting odd things in MPs' expenses: http://blog.jgc.org/2009/06/its-probably-worth-testing-mps.h...
2. Spotting odd things in BBC executives' expenses: http://blog.jgc.org/2009/06/running-numbers-on-bbc-executive...
3. The Iranian election: http://blog.jgc.org/2009/06/benfords-law-and-iranian-electio...
4. New Age mumbo jumbo: http://www.jgc.org/blog/2008/02/any-sufficiently-simple-expl...
Cf. Say Anything.
Consider a sequence increasing by 1 each period. Now pick a random number between 1 and 10000. Generate the sequence between 1 and your random number. It will approximately conform to Benford's law, modulo your ending number.
Benford's law is a property of growing numbers, not of any particular kind of growth.
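A rough way to try the experiment described above, as a minimal sketch: pick a random endpoint, count the leading digits of 1..endpoint, and average the fractions over many trials. The function names, the trial count, and the fixed seed are my own choices for illustration, not anything from the thread.

```python
import random
from collections import Counter


def leading_digit(n):
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])


def average_leading_fractions(trials=200, upper=10_000, seed=0):
    """Average the leading-digit fractions of 1..end over random endpoints.

    The endpoint is drawn uniformly from 1..upper on every trial
    (an assumption about what "pick a random number" means).
    """
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(trials):
        end = rng.randint(1, upper)
        counts = Counter(leading_digit(k) for k in range(1, end + 1))
        for d in range(1, 10):
            totals[d] += counts[d] / end
    return {d: totals[d] / trials for d in range(1, 10)}


if __name__ == "__main__":
    for digit, fraction in average_leading_fractions().items():
        print(digit, round(fraction, 3))
```

The averaged fractions fall off from 1 to 9, which you can compare by eye with Benford's log10(1 + 1/d) to judge how approximate "approximately" is.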
You can test it here: http://www.mpi-inf.mpg.de/~fietzke/benford.html
Benford's law applies only to exponential growth over a long timescale.
I was thinking of making it into a little web tool.
<blockquote>The discovery of this fact goes back to 1881, when the American astronomer Simon Newcomb noticed that in logarithm books, the earlier pages (which contained numbers that started with 1) were much more worn than the other pages.</blockquote>
Can you imagine the sense of observation and curiosity that would make someone look at a book of numbers and say, "I wonder why these pages are more worn than those ones"?
I simply chalked it up to most people not being very serious about reading books in general and any given book in particular. It's a rare person who makes it all the way through.
I don't think my own observation was a particularly interesting or original one.
What made Newcomb's observation interesting was that it was about books of logarithm tables in particular, where (unlike a typical book) you'd think the lookups would be uniformly distributed.
The other interesting thing that did require an unusual amount of curiosity and dedication is the systematic testing of such a casual observation to try to figure out what the underlying reasons for it were and how they might apply to things other than books of logarithms. This desire and dedication to observe, test, and figure out the underlying workings of things is the hallmark of many a great scientist.
Imagine that your calculator breaks down every three to four months. After a couple of years, any hacker is bound to think "I should be able to take a couple of broken ones, pick working parts, and build a working one". Then, you discover that all of them have perfectly working '9' keys, but broken '1' keys.
[EDITED to add: Discussed before on HN: http://news.ycombinator.com/item?id=687241. There have been quite a number of other discussions of Benford's law on HN, too.]
In the United States, evidence based on Benford's law is legally admissible in criminal cases at the federal, state, and local levels.
Wouldn't it be stranger (and actually interesting) if that evidence wasn't admissible?
It's still a bit of a brain f--- when you first encounter it. I found it easier to get using plotting tools, as opposed to aggregating lists of numbers and measurements.
If it proves itself true, then you could use it to test if a group of things is increasing or decreasing.
Proof that the starting digit in numbers is not base invariant:
In base 10 not all numbers start with 1. In base 2 all numbers start with 1. Hence, the distribution is not base invariant. QED.
For an explanation of why you get the right thing if you substitute "units", I refer you to the Wikipedia page on Benford's Law. http://en.wikipedia.org/wiki/Benfords_law
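To make the base dependence concrete, here is a small sketch using the usual generalization of Benford's law to base b, where the probability of leading digit d is log_b(1 + 1/d). The function name is made up for this example.

```python
import math


def benford_probability(d, base=10):
    """Probability of leading digit d in the given base under Benford's law."""
    if not 1 <= d < base:
        raise ValueError("d must satisfy 1 <= d < base")
    return math.log(1 + 1 / d, base)


if __name__ == "__main__":
    for base in (2, 10, 16):
        probs = [benford_probability(d, base) for d in range(1, base)]
        print(base, [round(p, 3) for p in probs])
```

In base 2 the only possible leading digit is 1, so its probability comes out as 1, while in base 10 or 16 the mass is spread over the digits. The shape of the law is the same in every base, but the digit frequencies themselves are not.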
Here are some hacker-newsers testing files in their home directories:
Now imagine that numbers are built out of stones. To "build" a 1, you only need 1 stone. But to "build" a 2, you need 2 stones. Thus, if you wanted to write a 3, you would have to go in the desert and find 3 stones. It's 3x as hard, and so you'd expect people to "build" 1/3 as many 3's as 1's, 1/5 as many 5's as 1's, and so on. Just as you'd expect there to be a lot more single story buildings than skyscrapers. It's easier to build a single story building.
Thus, the distribution is exactly what you'd expect. While it doesn't actually take stones to build numbers, we don't write the number 3 unless we have 3 of something. Unless you are lying. Which is why this is a great method of detecting fraud.
UPDATE: What do I mean when I say "3 times as hard"?
Imagine the desert is a rectangle of 10 squares. Kind of like a mancala board or a ladder on the ground. You start by stepping in square 1, and to get to square 10 you have to step through each square.
If there is only 1 rock, what are the odds that you'll have to walk all 10 steps to find it? This is the same thing as asking what are the odds that this rock is in square 10. The answer is 1/10 or 10%.
Now, if there are 3 rocks, what are the odds that you'll have to step into all 10 squares? Well, what are the odds that there's a rock in the last square? 26.1%, or approximately 3x as hard. It's interesting that it's not exactly 3x as hard, it's 2.61x as hard. Which makes the data in the OP seem even more logical since you'd expect 30.8% 1's given 11.8% 3's--the 32.62% actual number is not that far off.
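For anyone who wants to check the arithmetic, here is a minimal sketch of the rock model. It assumes each rock lands independently and uniformly in one of the 10 squares and that rocks may share a square; both are assumptions of this sketch, and the function name is made up.

```python
def prob_rock_in_last_square(rocks, squares=10):
    """Chance that at least one of `rocks` independent rocks is in the last square."""
    return 1 - (1 - 1 / squares) ** rocks


if __name__ == "__main__":
    for rocks in (1, 2, 3):
        print(rocks, round(prob_rock_in_last_square(rocks), 3))
```

Under these assumptions one rock gives 0.1 and three rocks give 1 - 0.9^3 = 0.271, which is close to, but not the same as, the 26.1% figure quoted above.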
Suppose you are the guy looking for the stones. There are two stones in the desert. Everything being random but equal, you are twice as likely to run into a stone when there are two than when there is only one stone in the desert. Once you find the first stone, it is equally difficult to find the second stone as it is to find only one stone at the beginning (if you treat "finding a stone" as independent events where you don't learn about the location of subsequent stones).
So while the idea is interesting, the analogy is poor. I much prefer the wikipedia explanation which is similar to yours but much more logically rigorous: http://en.wikipedia.org/wiki/Benfords_law#Outcomes_of_expone...
Response to update: Now I feel that you are convoluting your analogy. Can multiple stones occupy the same square? How is it appropriate to equate/compare "the number of squares you walk through in order to pick up all the stones" to "the number of times a digit should show up"? I apologize, but your illustration has become completely lost to me.
You're right, I should have clarified. If multiple stones could not occupy the same square, the odds would remain as I first explained them (3x, etc.). I think in my stones analogy and in real life, stones should be able to occupy the same square. In fact, there should be a positive correlation (i.e., given that there's a rock in this square, the odds of a second rock being there go up).
> How is it appropriate to equate/compare "the number of squares you walk through in order to pick up all the stones" to "the number of times a digit should show up"?
Coming across 3 units of a quantity is 3x as hard as coming across 1 unit. When we write numbers, we are either:
1) writing a truthful description of how many units we see/own/ate/taste/touch etc. (I ate 2 bagels, I earned $5, I ran 10 miles.)
By "lying", I'm including things like writing a novel. Maybe a better word is "imagining". With numbers, we are either writing down true observations or we are imagining them. It's just as easy to "imagine" $9 million in your bank account as it is to "imagine" $1 million, while truthfully finding $9 million in your bank account is a lot more difficult :). This is why Benford's law doesn't apply for "imagined" numbers. By using Benford's law, you can quickly classify a number set into either "real" or "imagined".
Again, your train of thought is not necessarily wrong, but I still find the wikipedia explanation much more robust and illustrative. I hope this is where we can agree to disagree.
Thanks for offering your views. My analogy may be wrong or weak and maybe there is a better one to be found.
According to Benford's law the odds of a leading 1 are 1.709511291351... times the odds of a leading 2. This isn't the factor of 2 you thought it should be. The odds of a leading 1 are 2.409420839653... times the odds of a leading 3. This isn't the factor of 3 you thought it should be.
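The ratios quoted above can be reproduced directly from Benford's formula, P(d) = log10(1 + 1/d). A short sketch (the function name is just for illustration):

```python
import math


def benford(d):
    """Benford's law probability that the leading decimal digit is d."""
    return math.log10(1 + 1 / d)


if __name__ == "__main__":
    for d in range(1, 10):
        print(d, round(benford(d), 4))
    print("P(1)/P(2) =", benford(1) / benford(2))
    print("P(1)/P(3) =", benford(1) / benford(3))
```

The two ratios come out at roughly 1.7095 and 2.4094, matching the figures above rather than 2 and 3.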
Yes, I know that it is fun to try to figure things out for yourself. But it is essential to learn when you're headed down the wrong path. That lets you correct your misconceptions before they cement and lead to severely wrong impressions of how to do things. Your whole desert/rock analogy? That's a wrong path.
It looks to me, though, that my line of reasoning (note I said more precisely 2.6x; I used 3x initially to simplify it) more closely matches the data than the numbers you provided.
I think my explanation is about what it is that causes the law to occur. Not what Benford's law is, but rather, why it occurs. I know about the log scale and the picture on Wikipedia. It's neat, but I don't think it reveals the underlying cause. I think the cause could be simply that it is about 2x easier to find 1 unit than 2 units, 10 units than 20 units, 1000 units than 2000 units.
Benford's law will hold approximately for any set of numbers with the property that they are distributed over many orders of magnitude, from a distribution which doesn't change much if you multiply by a random number in some range.
An example of such a set of numbers is the set of numbers that come up in intermediate calculations involving a lot of different numbers. (This explains the logarithm books where the phenomenon was first noticed.)
Another example is the numbers you see coming out of any sort of self-similar phenomenon. As fractals show, self-similar behavior is ubiquitous. As a result, numbers like the lengths of rivers, the heights of hills, and the sizes of cities all tend to follow Benford's law.
For any particular source of numbers, the explanation for why they fall into a category that matches Benford's law will differ. Benford's law is a property that mathematical models tend to have, rather than being a rigorous mathematical theorem.
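One way to see the "many orders of magnitude, unchanged by rescaling" property in action is a quick simulation. This is only a sketch under an assumed model: samples are drawn log-uniformly over six decades, then multiplied by an arbitrary constant (3.7, picked for no particular reason) to mimic a change of units. All names are invented for the example.

```python
import math
import random
from collections import Counter


def leading_digit(x):
    """First significant decimal digit of a positive float."""
    return int(f"{x:.15e}"[0])


def leading_fractions(values):
    """Fraction of values starting with each digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


def log_uniform_samples(n=100_000, decades=6, seed=0):
    """Draw n samples whose logarithm is uniform over `decades` orders of magnitude."""
    rng = random.Random(seed)
    return [10 ** rng.uniform(0, decades) for _ in range(n)]


if __name__ == "__main__":
    samples = log_uniform_samples()
    rescaled = [3.7 * s for s in samples]  # arbitrary change of units
    original = leading_fractions(samples)
    scaled = leading_fractions(rescaled)
    for d in range(1, 10):
        print(d, round(original[d], 3), round(scaled[d], 3),
              round(math.log10(1 + 1 / d), 3))
```

Both columns should sit close to the Benford values in the last column, which is the scale invariance in miniature. Real data sets are messier than this toy distribution, so treat it as an illustration rather than an explanation of any particular source.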
(FYI Benford's law is something that I've known about, and thought about off and on, for close to 20 years now.)
For the numbers 1-19, more than half of them start with 1.
For the numbers 1-199, more than half of them start with one.
Change the examples to 1-299, 1-399, etc, and you'll get percentages of all digits matching Benford's law.
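The counts for ranges like these are easy to check exactly. Here is a small sketch that counts how many of the numbers 1..n start with a given digit; the function name is hypothetical, and the endpoints in the demo are just the ones mentioned above plus a few others for comparison.

```python
def count_leading(n, d):
    """Count the integers in 1..n whose leading decimal digit is d."""
    total = 0
    power = 1
    # Numbers with leading digit d lie in [d*10^k, (d+1)*10^k - 1] for k = 0, 1, ...
    while d * power <= n:
        total += min(n, (d + 1) * power - 1) - d * power + 1
        power *= 10
    return total


if __name__ == "__main__":
    for n in (19, 99, 199, 299, 399, 999):
        fractions = [round(count_leading(n, d) / n, 3) for d in range(1, 10)]
        print(n, fractions)
```

For 1-19 this gives 11 of 19 starting with 1, and for 1-199 it gives 111 of 199, so "more than half" holds for both. Printing several endpoints side by side also shows how much the fractions move around as the endpoint changes.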
Your method also seems to depend heavily on the choice of starting and ending points. If I chose 1-99, then only about 1/9th of the numbers in the interval would start with 1. So why choose 199 and not 99?
I just selected end points to illustrate the concept. I think this place is getting a little too literal. :)
I think it seems "counter-intuitive" to some because they are not used to thinking of numbers and counting as being related to exponents and bases.
This may seem more intuitive to those of us that work with computers all day since we are intimately familiar with how to count in a handful of different bases (base-2, base-10, base-16 etc).
1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 100
I think he most likely meant to exclude 1 and 100, but I don't know.
Best page I've seen on it.
This kind of mathematically unsophisticated reasoning is exactly why Benford's law is so surprising to people. If you think of what it means for a value to be "truly random", the result is not surprising at all.
I wonder what influence the 'spatial' properties of a number pad password have on this data. For example, "5" gets a nice little spike... and "5" is the center key on the 10-key iPhone number pad. The "1" is still the winner by far, but I wonder how many of those are the easy-to-remember "1234".
Chart looks like this. https://url.odesk.com/a7och
How is that a large dataset? There aren't that many countries.
Any time you are counting something it seems obvious to me that you'd have 1 more often than 2.