Testing Benford's Law (testingbenfordslaw.com)
546 points by brycethornton on June 27, 2011 | 88 comments



Other fun I've had with Benford's Law.

1. Spotting odd things in MPs' expenses: http://blog.jgc.org/2009/06/its-probably-worth-testing-mps.h...

2. Spotting odd things in BBC executives' expenses: http://blog.jgc.org/2009/06/running-numbers-on-bbc-executive...

3. The Iranian election: http://blog.jgc.org/2009/06/benfords-law-and-iranian-electio...

4. New Age mumbo jumbo: http://www.jgc.org/blog/2008/02/any-sufficiently-simple-expl...


It's also interesting how you showed how Benford's Law breaks down, especially when prices are involved. There's lots of $10, $100, and $1000 limits, so you will get a lot of prices being pushed back to something starting with an 8 or 9.


Yes, that's why it's an interesting tool for spotting 'anomalies'. That doesn't mean it's spotting things that are incorrect, or fraudulent, or illegal etc., just it spots things that are out of the ordinary.


Well, a price that is specifically modified to dodge regulatory scrutiny is suggestive of some sort of corruption.

Cf. Say Anything.


I have a feeling that datasets that are largely defined by human psychology - the list of iPhone passwords is a good example - are less likely to adhere to Benford's law than "naturally" generated datasets.


Benford's Law really only applies to numbers where the growth rate is a function of the current value (ie. exponential growth). It's not some magic property that can be applied to any dataset.


This isn't true. Benford's law applies just as well to ordinary arithmetic growth. The reason Benford's law works is that a growing number spends as much time with "1" as its initial digit as it did traversing the entire previous order of magnitude. This is true whether the growth is exponential, arithmetic, or multiplicative.


Arithmetic growth refers to the situation where a value increases by a constant number per period. Benford's Law applies in this case?


It sure does, that's why I said it did.

Consider a sequence increasing by 1 each period. Now pick a random number between 1 and 10000. Generate the sequence between 1 and your random number. It will approximately conform to Benford's law, modulo your ending number.

Benford's law is a property of growing numbers, not of any particular kind of growth.
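A quick sketch of that experiment (illustrative only; the 10000 cap and the trial count are arbitrary choices). Leading 1s clearly dominate leading 9s, though how close the skew gets to Benford's exact 30.1% depends heavily on how the endpoints are distributed:

```python
import math
import random
from collections import Counter

def leading_digit(n):
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

random.seed(0)
counts = Counter()
for _ in range(200):                     # average over many random endpoints
    end = random.randint(1, 10000)
    for n in range(1, end + 1):          # the arithmetic sequence 1, 2, ..., end
        counts[leading_digit(n)] += 1

total = sum(counts.values())
for d in range(1, 10):
    print(d, round(counts[d] / total, 3), round(math.log10(1 + 1 / d), 3))
```

In runs like this the distribution is monotone decreasing in the digit, as the comment says, but the exact proportions drift with the endpoint scheme.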


Linearly growing sequences don't follow Benford's law, but lower first digits (1, 2, 3) are still more probable than higher first digits (7, 8, 9) "most of the time", as you describe it.

You can test it here: http://www.mpi-inf.mpg.de/~fietzke/benford.html


That is not true; you just happened to choose lucky starting numbers. For example, consider this one: choose a number between 1 and 10000. Generate the sequence between 9000000 and 9000000+(your number). Everything starts with 9.

Benford's law applies only to exponential growth over a long timescale.
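The counterexample is easy to check; a one-off sketch:

```python
import random

# A short arithmetic run that sits entirely inside [9000000, 9010000)
# never leaves leading digit 9.
random.seed(1)
k = random.randint(1, 10000)
leading = {str(n)[0] for n in range(9_000_000, 9_000_000 + k)}
print(leading)  # → {'9'}
```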


It actually applies to more than just exponential growth. Fibonacci and factorial growth rates also follow Benford's Law.
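This is easy to verify empirically. A sketch tallying the leading digits of the first 1000 Fibonacci numbers against Benford's log10(1 + 1/d) prediction (the agreement is known to be exact in the limit):

```python
import math
from collections import Counter

def fib_leading_digits(count):
    """Leading decimal digits of the first `count` Fibonacci numbers."""
    a, b = 1, 1
    digits = []
    for _ in range(count):
        digits.append(int(str(a)[0]))
        a, b = b, a + b
    return Counter(digits)

N = 1000
counts = fib_leading_digits(N)
for d in range(1, 10):
    print(d, counts[d] / N, round(math.log10(1 + 1 / d), 3))
```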


Have you tried running the Law not only on the digits in base ten, but also on the digits in base 100? (I.e. the two highest digits?)


Can you post the code you used for the mp's?

I was thinking to make it into a little web tool.


I like the history section of the wikipedia article:

<blockquote>The discovery of this fact goes back to 1881, when the American astronomer Simon Newcomb noticed that in logarithm books, the earlier pages (which contained numbers that started with 1) were much more worn than the other pages.</blockquote>

Can you imagine the sense of observation and curiosity that would make someone look at a book of numbers and say, "I wonder why these pages are more worn than those ones."


Before ever hearing of Benford's law I've noticed that many books are more worn at the beginning than further on.

I simply chalked it up to most people not being very serious about reading books in general and any given book in particular. It's a rare person who makes it all the way through.

I don't think my own observation was a particularly interesting or original one.

What made Newcomb's observation interesting was that it was about books of logarithm tables in particular, where (unlike a typical book) you'd think the lookups would be uniformly distributed.

The other interesting thing that did require an unusual amount of curiosity and dedication is the systematic testing of such a casual observation to try to figure out what the underlying reasons for it were and how they might apply to things other than books of logarithms. This desire and dedication to observe, test, and figure out the underlying workings of things is the hallmark of many a great scientist.


In those times, scientists were rather fond of their logarithm tables, in the same way they would later be of their slide rules, HP calculators and netbooks.

Imagine that your calculator breaks down every three-to-four months. After a couple of years, any hacker is bound to think "I should be able to take a couple of broken ones, pick the working parts, and build a working one". Then you discover that all of them have perfectly working '9' keys, but broken '1' keys.


See also: reading android PINs by the smudges on the screen.


There's a nice discussion of this from Terry Tao (outrageously smart mathematician; has a Fields medal) at http://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-... which contains, e.g., the following nice observation: if X follows Benford's law and Y is any positive random variable independent of X, then XY also follows Benford's law. (Tao goes a bit further than this and thereby sheds some light on why many things approximately obey Benford's law.)

[EDITED to add: Discussed before on HN: http://news.ycombinator.com/item?id=687241. There have been quite a number of other discussions of Benford's law on HN, too.]


This is interesting (from Wikipedia article on Benford's Law):

In the United States, evidence based on Benford's law is legally admissible in criminal cases at the federal, state, and local levels.


What does that even mean? Is there some special law that says "Benford's Law is admissible"? I'm guessing not, and that all it means is that it's a piece of evidence that can be used, similarly to any other statistical evidence.

Wouldn't it be stranger (and actually interesting) if that evidence wasn't admissible?


It probably means that the precedent has been established already -- so future courts are likely to just accept it rather than having to go through a re-hearing of Benford from the very beginning.


Benford's law only seems strange until you realise natural phenomena tend to express logarithmic functions while our commonly used system of counting and measuring does not.

It's still a bit of a brain f--- when you first encounter it. I found it easier to get using plotting tools, as opposed to aggregating lists of numbers and measurements.


Exactly. I immediately thought it's an artifact of our base-10 system.


Except that it applies to any base-n system. It's an artifact of the underlying system, not the base value.
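One way to see the base-independence claim: powers of 3, written in base 8, obey the base-8 version of the law, P(d) = log_8(1 + 1/d) (they equidistribute because log_8 3 is irrational). A sketch that reads off the leading base-8 digit via logarithms:

```python
import math
from collections import Counter

def leading_digit_in_base(k, base):
    """Leading digit of 3**k written in `base`, computed via logarithms."""
    frac = (k * math.log(3, base)) % 1.0   # fractional part of log_base(3**k)
    return int(base ** frac)               # a digit in 1..base-1

BASE = 8
N = 5000
counts = Counter(leading_digit_in_base(k, BASE) for k in range(1, N + 1))
for d in range(1, BASE):
    print(d, counts[d] / N, round(math.log(1 + 1 / d, BASE), 3))
```

The same sequence checked in base 10, 16, etc. gives the matching log-law for that base.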


What about base-e?


That's just silly. On the other hand, base 2i would be very efficient: http://en.wikipedia.org/wiki/Quater-imaginary_base


Really, why don't we have a transcendental numbering system? Also, I can't imagine it could make our financial markets any more irrational.


Seems to me that when you have a group of somethings that are constantly increasing in size, it would be natural for 1 to come up as the first digit more often, because in order to get to 2 you need to pass through 1 first, and in order to get to 9 you need to pass through 1, 2, 3, 4, 5, 6, 7, 8 first. Therefore, you should get the distribution predicted by Benford's law. The way to test this theory would be to run the numbers on values that are constantly decreasing. I'd expect the distribution to reverse itself.

If it proves itself true, then you could use it to test if a group of things is increasing or decreasing.


It depends on where you start decreasing from. If you start at 0 and subtract 1 at a time you'll still follow Benford's Law. If you start at 999 and go down, you'll see a reverse curve that begins the process of righting itself again once it goes into the negatives.


True, I was considering things like decreasing populations. For example, if you take all cities/countries on the planet with declining population and plot their current population, I would expect the reverse of Benford's law. Wish I had time to test this hypothesis :)


My favorite explanation of it is that if there is a distribution to the numbers, then that distribution should hold no matter what base you're working in (for natural things, after all, there's nothing special about base 10), and Benfords law can be shown to be a) a law that satisfies this base-independent property, and b) the only law that does so.


s/base/units/


Seeing that people are downvoting this, perhaps it wasn't obvious enough.

Proof that the starting digit in numbers is not base invariant:

In base 10 not all numbers start with 1. In base 2 all numbers start with 1. Hence, the distribution is not base invariant. QED.

For an explanation why you get the right thing if you substitute "units", I refer you to the Wikipedia page on Benford's Law. http://en.wikipedia.org/wiki/Benfords_law


Searching reveals lots of previous discussion on Benford's law on here, so I won't give all the links. Of course, it's an interesting observation, so it's worth advertising every so often.

Here are some hacker-newsers testing files in their home directories: http://news.ycombinator.com/item?id=1076534


As far as I can tell, "Most common iPhone passcodes" doesn't belong on this list, and I'm perplexed why it seems to follow the law. An iPhone numeric password (which I'm assuming it's referring to) is simply a 4-digit string, so all first digits should be equally probable, unless there's some psychological issue at work. Or are they discarding leading zeros for the purpose of this chart? I guess they must be (0 doesn't appear on the chart), but that's a weird thing to do to a password.


My guess is that people will be tending to pick "meaningful" numbers.


Passwords follow the law, too! Skewed of course by the preponderance of '123456'.


4 digit years make 'great' 4 digit passwords.



Check out those post ids :)


Imagine you threw a single stone into the desert and asked your friend to go find it. It would be hard. Now imagine you threw 2 stones into the desert and asked your friend to go find them. It is twice as hard to find both stones as it is to find 1 stone. Imagine you threw 3 stones. It is 3 times as hard to find all 3 stones as it is to find 1 stone.

Now imagine that numbers are built out of stones. To "build" a 1, you only need 1 stone. But to "build" a 2, you need 2 stones. Thus, if you wanted to write a 3, you would have to go in the desert and find 3 stones. It's 3x as hard, and so you'd expect people to "build" 1/3 as many 3's as 1's, 1/5 as many 5's as 1's, and so on. Just as you'd expect there to be a lot more single story buildings than skyscrapers. It's easier to build a single story building.

Thus, the distribution is exactly what you'd expect. While it doesn't actually take stones to build numbers, we don't write the number 3 unless we have 3 of something. Unless you are lying. Which is why this is a great method of detecting fraud.

UPDATE: What do I mean when I say "3 times as hard"?

Imagine the desert is a rectangle of 10 squares. Kind of like a mancala board or a ladder on the ground. You start by stepping in square 1, and to get to square 10 you have to step through each square.

If there is only 1 rock, what are the odds that you'll have to walk all 10 steps to find it? This is the same thing as asking what are the odds that this rock is in square 10. The answer is 1/10 or 10%.

Now, if there are 3 rocks, what are the odds that you'll have to step into all 10 squares? Well, what are the odds that there's a rock in the last square? 26.1%, or approximately 3x as hard. It's interesting that it's not exactly 3x as hard, it's 2.61x as hard. Which makes the data in the OP seem even more logical since you'd expect 30.8% 1's given 11.8% 3's--the 32.62% actual number is not that far off.


It is less than twice as hard to find both stones when you threw 2 stones than it is to find the only stone you threw when you threw only 1.

Suppose you are the guy looking for the stones. There are two stones in the desert. Everything being random but equal, you are twice as likely to run into a stone when there are two than when there is only one stone in the desert. Once you find the first stone, it is equally difficult to find the second stone as it is to find only one stone at the beginning (if you treat "finding a stone" as independent events where you don't learn about the location of subsequent stones).

So while the idea is interesting, the analogy is poor. I much prefer the wikipedia explanation which is similar to yours but much more logically rigorous: http://en.wikipedia.org/wiki/Benfords_law#Outcomes_of_expone...

Response to update: Now I feel that you are overcomplicating your analogy. Can multiple stones occupy the same square? How is it appropriate to equate/compare "the number of squares you walk through in order to pick up all the stones" to "the number of times a digit should show up"? I apologize, but your illustration has become completely lost to me.


> Can multiple stones occupy the same square?

You're right I should have clarified. If multiple stones could not occupy the same square, the odds would remain as I first explained them (3x, etc.). I think in my stones analogy and real life, stones should be able to occupy the same square. In fact, there should be a positive correlation (ie, given that there's a rock in this square, odds of a second rock being there go up).

> How is it appropriate to equate/compare "the number of squares you walk through in order to pick up all the stones" to "the number of times a digit should show up"?

Coming across 3 units of a quantity is 3x as hard as coming across 1 unit. When we write numbers, we are either:

1) writing a truthful description of how many units we see/own/ate/taste/touch etc. (I ate 2 bagels, I earned $5, I ran 10 miles.)

2) lying.

By "lying", I'm including things like writing a novel. Maybe a better word is "imagining". With numbers, we are either writing down true observations or we are imagining them. It's just as easy to "imagine" $9 million in your bank account as it is to "imagine" $1 million, while truthfully finding $9 million in your bank account is a lot more difficult :). This is why Benford's law doesn't apply for "imagined" numbers. By using Benford's law, you can quickly classify a number set into either "real" or "imagined".


Ah. But there's the rub. What I find unintuitive about Benford's law is the non-random distribution of the most significant digit regardless of base or unit of measure. You propose that it's the "largeness" of a number that enforces Benford's law. While that may be in some ways true, it does not explain the transparency to base or unit of measure. You ate 2 bagels? I ate 4 half-bagels. You earned $5? I earned ¥600 motherfucker! You ran 10 miles? I ran 3bf3e6800 micrometers, in base-16!

Again, your train of thought is not necessarily wrong, but I still find the wikipedia explanation much more robust and illustrative. I hope this is where we can agree to disagree.


The reason why it only applies to the most significant digit is that I can say for certain that quantities with leading digit 1 will appear ~2x as often as quantities with leading digit 2. However, I can't say whether numbers ending in 1 are more common than numbers ending in 6, because although 11 occurs more than any number higher than it, it makes up a minuscule proportion of the numbers ending in 1, and 16 occurs more than 21, 31, etc., so there's no clear way to predict what number will occur most in any digit but the most significant.

Thanks for offering your views. My analogy may be wrong or weak and maybe there is a better one to be found.


Every base is base 10. There's your answer.


I'm sorry, but your mathematical reasoning is very muddled, and the pattern you predict is wrong.

According to Benford's law the odds of a leading 1 are 1.709511291351... times the odds of a leading 2. This isn't the factor of 2 you thought it should be. The odds of a leading 1 are 2.409420839653... times the odds of a leading 3. This isn't the factor of 3 you thought it should be.

Yes, I know that it is fun to try to figure things out for yourself. But it is essential to learn when you're headed down the wrong path. That lets you correct your misconceptions before they cement and lead to severely wrong impressions of how to do things. Your whole desert/rock analogy? That's a wrong path.
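Those ratios fall straight out of Benford's formula P(d) = log10(1 + 1/d):

```python
import math

def benford_p(d):
    """Benford's predicted probability of leading digit d."""
    return math.log10(1 + 1 / d)

print(benford_p(1) / benford_p(2))  # ≈ 1.7095, not 2
print(benford_p(1) / benford_p(3))  # ≈ 2.4094, not 3
```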


Thank you, very interesting contributions to the conversation.

It looks to me, though, that my line of reasoning (note I said more precisely 2.6x; I used 3x initially to simplify it) more closely matches the data than the numbers you provided.


I gave you exact numbers, not approximations from a small data set. Unless you match the numbers that I gave exactly, the numbers you give won't match what will be found in large datasets.


I understand that the law is a formula that can generate exact numbers. However, as experience shows, almost nothing generates these exact numbers. Almost nothing follows the formula exactly.

I think my explanation is about what it is that causes the law to occur. Not what Benford's law is, but rather why it occurs. I know about the log scale and the picture on Wikipedia. It's neat, but I don't think it reveals the underlying cause. I think the cause could be simply that it is about 2x easier to find 1 unit than 2 units, 10 units than 20 units, 1000 units than 2000 units.


What can I say? Your explanation is wrong. It generates wrong numbers. And gives little to no insight as to why Benford's law works.

Benford's law will hold approximately for any set of numbers with the property that they are distributed over many orders of magnitude, from a distribution which doesn't change much if you multiply by a random number in some range.

An example of such a set of numbers is the set of numbers that come up in intermediate calculations involving a lot of different numbers. (This explains the logarithm books where the phenomenon was first noticed.)

Other examples are the numbers you see coming out of any sort of self-similar phenomenon. As fractals show, self-similar behavior is ubiquitous. As a result, numbers like the lengths of rivers, the heights of hills, and the sizes of cities all tend to follow Benford's law.

For any particular source of numbers, the explanation for why they fall into a category that matches Benford's law will differ. Benford's law is a property that mathematical models tend to have, rather than being a rigorous mathematical theorem.

(FYI Benford's law is something that I've known about, and thought about off and on, for close to 20 years now.)


Really good comments here. I'll try and refine my position which I did a poor job of explaining and post something in the next few weeks or months.


Why is this not common sense?

For the numbers 1-19, more than half of them start with 1. For the numbers 1-199, more than half of them start with 1.

Change the examples to 1-299, 1-399, etc, and you'll get percentages of all digits matching Benford's law.
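A quick count shows how sensitive this is to the choice of endpoint, though: the share of leading 1s swings between roughly 56% (just after 19, 199, 1999, ...) and 11% (at 99, 999, 9999, ...):

```python
def leading_one_share(n_max):
    """Fraction of the integers 1..n_max whose first decimal digit is 1."""
    hits = sum(1 for n in range(1, n_max + 1) if str(n)[0] == "1")
    return hits / n_max

for n_max in (19, 99, 199, 999, 1999, 9999):
    print(n_max, round(leading_one_share(n_max), 3))
```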


Benford's Law predicts the first digit will be 1 about 30% of the time, not 50% of the time.

Your method also seems to depend heavily on the choice of starting and ending points. If I chose 1-99, then only 1/10th of the numbers in the interval will start with 1. So why choose 199 and not 99?


It is not "my method", just an explanation of why the law seems intuitive to me. I expected everyone to extrapolate out from my examples to other ending points, at which point, yes, the % for the digit "1" would drop and approach that 30% rate.

I just selected end points to illustrate the concept. I think this place is getting a little too literal. :)


I am not a mathematician, but the "law" seems to be an inherent property of any number system based on exponential increases, i.e. the hundreds digit, tens digit etc. (is there any other kind of number system?)

I think it seems "counter-intuitive" to some because they are not used to thinking of numbers and counting as being related to exponents and bases.

This may seem more intuitive to those of us that work with computers all day since we are intimately familiar with how to count in a handful of different bases (base-2, base-10, base-16 etc).


I am curious as to why you think that Benford's Law is intuitive. I certainly would not have expected it. Could you explain your thinking in more detail?


I wouldn't say intuitive as much as "not totally surprising". I meant "more intuitive" in relation to those who think it is non-intuitive.


For the numbers 0-100, only 10 start with 1, exactly as many as start with 2, or 3, etc.


No, 11 numbers start with 1 in that range.


No, 12 numbers begin with 1 in the range [0,100] if that's indeed what the OP meant:

1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 100

I think he most likely meant to exclude 1 and 100, but I don't know.


Kudos for spotting that. I feel stupid now.


Cool stuff. However, something mostly offtopic that I genuinely wonder about: it seems everybody registers a .com just to make a HN post. What's the point of this? Why not post the same data on your blog?


I can't speak for everyone else, but in this case we were hoping to create something that could stand on its own and (hopefully) grow over time as more datasets are added. The domain wasn't registered just to get on HN - the idea lent itself to a simple site of its own, so why not use a fitting domain name?


Benford's Law applies to anything where values appreciate or depreciate exponentially. It's kind of weird to see it related to passcodes (where it isn't really Benford's Law at all, but simply numeric proximity), but for monetary values, forces of nature, population counts, etc., Benford's Law will always apply to a large enough set, for reasons that become very evident once you understand the reasoning.

http://blog.yafla.com/Demystifying_Benfords_Law

Best page I've seen on it.


What better way to get your new website to show up in relevant search results?


Benford's law makes a lot of sense if you consider that many of the numbers are derived from counting up from 0. The scale of these things is exponentially distributed, and therefore the leading digits are more likely to be 1 than 9. This is related to social media -- once your userbase gets big enough it starts growing or shrinking proportionally to its size, i.e. exponentially. This is also somewhat related to the value of a social network ... Metcalfe's law seems to be too optimistic. The value is probably more like n log n.


I always liked "Explaining Benford's Law" (http://www.dspguide.com/CH34.PDF).


Whoah, check out the distribution of the leading digit in binary!



"If a set of values were truly random, each leading digit would appear about 11% of the time"

This kind of mathematically unsophisticated reasoning is exactly why Benford's law is so surprising to people. If you think of what it means for a value to be "truly random", the result is not surprising at all.


"random" is commonly taken to be uniformly distributed. There is no particular reason to expect unbounded support, even if you feel it shouldn't be a certain interval.


Perhaps you should let some of the open data citizen groups know about this so they can add more data. Also, if you haven't already then take a look at CKAN[1] for datasets to add.

[1] http://ckan.net/


'Presenting Benford's law' would be a more fitting title. Nicely presented, and an intriguing law for sure, but I can't help but think "and?" At this point it lacks a more user-friendly way to submit data sets.


This one is interesting: http://testingbenfordslaw.com/most-common-iphone-passcodes

I wonder what influence the 'spatial' properties of a number pad password has on this data. For example "5" gets a nice little spike... and "5" is the center key on the 10-key iPhone number pad. The "1" is still the winner by far, but I wonder how many of those are the easy-to-remember "1234".


Great site, awesome design. I love Benford's law; I first heard about it on WNYC's RadioLab (best podcast in the world ever, I'm not even kidding).


Here's the Benford's Law snippet from the Radiolab episode: http://www.radiolab.org/2009/nov/30/from-benford-to-erdos/


It seems to me there are a lot of interesting psychology elements to this, but it's also a simple reflection of relatively constant growth rates. If the population of a city grows 3% every year, it will spend a lot more years in the 1 millions than the 9 millions, etc.

Chart looks like this. https://url.odesk.com/a7och
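That's easy to simulate: track a population growing 3% a year from 1 million to 1 billion and count the years spent at each leading digit (the 3% rate and the million-to-billion window are just illustrative choices):

```python
import math
from collections import Counter

pop = 1_000_000.0
years = Counter()
while pop < 1_000_000_000:
    # leading decimal digit of the current population
    years[int(pop / 10 ** math.floor(math.log10(pop)))] += 1
    pop *= 1.03  # 3% growth per year

total = sum(years.values())
for d in range(1, 10):
    # observed years at digit d vs. Benford's prediction for this window
    print(d, years[d], round(math.log10(1 + 1 / d) * total, 1))
```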


FYI, The text of the article is scrambled (Chrome 12, Windows 7 64 bit)


>Imagine a large dataset, say something like a list of every country and its population.

How is that a large dataset? There aren't that many countries.


Benford's Law seems like common sense to me.

Any time you are counting something it seems obvious to me that you'd have 1 more often than 2.


I kind of feel that the initial data sets were selected just to reinforce Benford's law. It seems too good to be true!


My first time hearing about this law (sadly) and I'm stumped! This is amazing!!


Does anyone else have trouble with the font on that page?


Benford's Law has always been kind of fascinating to me.


No black swan :P


nice!



