
Suggested exercise:

Predict and then create a histogram of the leading digits of the file sizes of the non-zero-length files on your computer.

[SPOILER: when I did this once I found sharp peaks around digits that weren't 1. You are likely to see this if you have a large number of files around a particular size, for example: a) whatever your digital camera typically produces; b) whatever size your software encodes a typical song into. These files violate the assumption that the sizes are spread over a wide range. After excluding these files I observed Benford's law quite closely on the remainder.]




Curious. I did so:

    1       ****************************
    2       ***************
    3       *********
    4       *****************
    5       *******
    6       ******
    7       *****
    8       ****
    9       ***
That spike for 4 is due to the default directory size of 4096 (my experiment included directories as well as files). The information was pulled from 503,444 files and directories.


Very cool idea. For those who want to try it at home, here's one way (this uses BSD stat syntax, so Mac and BSD users; Linux needs a slightly different command, see below)...

cd /

find . -exec stat -f "%z" {} \; | cut -c -1 > /tmp/tally.txt

sort /tmp/tally.txt | uniq -c

Mine came out with...

  506 0
  80370 1
  30396 2
  25215 3
  21787 4
  22174 5
  26251 6
  12810 7
  10455 8
  5556 9
Very interesting...


I had to use a different stat command on my Linux system. This worked for me:

find . -type f -exec stat -c %s {} \; | cut -c 1 | sort | uniq -c

Note that I exclude directories to avoid the size 4096 bias.

I ran it in my "project" directory and found that 38% of my file sizes begin with "1". That directory includes Perl source code files, input data files, and automatically generated output files.

After the digit "1" the distributions ranged from 3% to 9% with no obvious bias I could see.


Limiting to just files is a good idea, and letting find's -ls print the sizes (instead of spawning a stat process for every file) cuts the time down significantly; awk then pulls out the size column. This one should work for both OSX and Linux.

find . -type f -ls | awk '{print $7}' | cut -c -1 | sort | uniq -c


MUCH faster, thanks.

Now I'm piping that into Perl to convert the counts to percentages. If I figure out a one-liner for that I'll let you know.
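One rough way to do that conversion in Perl (a sketch only; it assumes the "count digit" lines that uniq -c prints above and doesn't guard against empty input):

    # sketch: expects the "count digit" lines that uniq -c prints
    find . -type f -ls | awk '{print $7}' | cut -c -1 | sort | uniq -c |
        perl -ane '$c{$F[1]} = $F[0]; $t += $F[0];
                   END { printf "%s %6.2f %%\n", $_, 100 * $c{$_} / $t for sort keys %c }'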

Next I'll be tempted to write a module for generating "realistic" (Benford-compliant) random numbers using this concise specification from HN contributor "shrughes":

"Data whose logarithm is uniformly distributed does [follow Benford's Law]."

I could use that to produce demo or test data.
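A minimal sketch of that specification (the 10^2 to 10^6 range is an arbitrary choice; any whole number of decades makes the leading digits come out exactly Benford-distributed):

    # sketch: exponent uniform over four decades, so log10 of the result is uniform
    perl -le 'print int(10 ** (2 + rand 4)) for 1 .. 20'

Running a few thousand of those through the cut -c -1 | sort | uniq -c tally above should land close to the expected percentages.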


Interesting, but can't this be explained by the distribution of file sizes? There will generally be many small files and fewer and fewer larger ones: more 1k files than 2k files, more 2k than 3k, more files from 10k to 19k than from 20k to 29k, more from 100k to 199k than from 200k to 299k, and so on.
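The exact version of that argument: if the sizes are spread evenly on a log scale, the leading digit d turns up with probability log10(1 + 1/d), which is where the oft-quoted 30.1% for "1" comes from. A quick way to print the reference table:

    # Benford's law: P(leading digit = d) = log10(1 + 1/d)
    perl -le 'printf "%d  %5.2f %%\n", $_, 100 * log(1 + 1/$_) / log(10) for 1 .. 9'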


That's actually the reason behind this law in general :)


Too interesting to pass up:

    1   29.12 %
    2   20.60 %
    3   13.99 %
    4   12.54 %
    5    5.74 %
    6    4.78 %
    7    4.84 %
    8    4.93 %
    9    3.47 %


Nice. The "law" got the rough shape of the distribution, but (unless you have a very small filesystem) these numbers are statistically significantly different.

Your files must have been made up! ...or you have a nice demonstration of how people shouldn't do too careful calculations with Benford's law. "30.1%" — 3 significant figures, really?
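For anyone who wants to check their own tally rather than eyeball it, here's a rough chi-square against the Benford expectations (a sketch; it expects the "count digit" lines from the uniq -c pipelines above, and with 8 degrees of freedom anything much above about 15.5 is a significant deviation at the 5% level):

    # sketch: chi-square of observed leading-digit counts vs. Benford expectations
    sort /tmp/tally.txt | uniq -c | perl -ane '
        next unless $F[1] >= 1;                       # skip the zero-length files
        $obs{$F[1]} = $F[0]; $n += $F[0];
        END {
            for $d (1 .. 9) {
                $e = $n * log(1 + 1/$d) / log(10);    # expected count for digit $d
                $chi2 += (($obs{$d} || 0) - $e) ** 2 / $e;
            }
            printf "chi-square = %.1f (8 degrees of freedom)\n", $chi2;
        }'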


I did this and graphed the results:

http://imgur.com/y0aH3.png

      2192 0 -  0%
    389003 1 - 38%
    151943 2 - 15%
    116663 3 - 11%
     96393 4 -  9%
     76590 5 -  8%
     53572 6 -  5%
     45381 7 -  4%
     47138 8 -  5%
     36983 9 -  4%



