
Very cool idea. For those who want to try it at home (Mac and Unix users only), try this...

cd /

find . -exec stat -f "%z" {} \; | cut -c -1 > /tmp/tally.txt

sort /tmp/tally.txt | uniq -c

Mine came out with...

    506 0
  80370 1
  30396 2
  25215 3
  21787 4
  22174 5
  26251 6
  12810 7
  10455 8
   5556 9
Very interesting...



I had to use a different stat command on my Linux system. This worked for me:

find . -type f -exec stat -c %s {} \; | cut -c 1 | sort | uniq -c

Note that I exclude directories to avoid the size 4096 bias.

I ran it in my "project" directory and found that 38% of my file sizes begin with "1". That directory includes Perl source code files, input data files, and automatically generated output files.

After the digit "1", the frequencies of the remaining digits ranged from 3% to 9%, with no obvious bias that I could see.
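For reference, Benford's Law predicts that leading digit d appears with probability log10(1 + 1/d), which works out to about 30.1% for "1" down to 4.6% for "9". A quick sketch to print that expected table for comparison (any Perl 5 should do):

perl -le 'printf "%d  %.1f%%\n", $_, 100 * log(1 + 1/$_) / log(10) for 1 .. 9'

My 38% for "1" is in the same ballpark, if a bit high.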

-----


Limiting it to just files is a good idea, and using find's -ls output with awk instead of spawning a stat process per file cuts the time down significantly. This one should work on both OS X and Linux.

find . -type f -ls | awk '{print $7}' | cut -c -1 | sort | uniq -c

-----


MUCH faster, thanks.

Now I'm piping that into Perl to convert the counts to percentages. If I figure out a one-liner for that I'll let you know.
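In the meantime, here is one possible shape for it (just a rough sketch, assuming the digit/count pairs come straight out of uniq -c):

find . -type f -ls | awk '{print $7}' | cut -c -1 | sort | uniq -c \
  | perl -lane '$c{$F[1]} = $F[0]; $t += $F[0]; END { printf "%s  %.1f%%\n", $_, 100 * $c{$_} / $t for sort keys %c }'

(uniq -c prints the count first and the digit second, so $F[0] is the count and $F[1] is the digit.)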

Next I'll be tempted to write a module for generating "realistic" (Benford-compliant) random numbers using this concise specification from HN contributor "shrughes":

"Data whose logarithm is uniformly distributed does [follow Benford's Law]."

I could use that to produce demo or test data.
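A first rough sketch of that idea (not a real module, just a proof of concept): draw an exponent uniformly at random over a whole number of decades and raise 10 to it, so the logarithm of the result is uniform.

perl -le 'print int(10 ** (6 * rand())) for 1 .. 100000' | cut -c 1 | sort | uniq -c

The leading-digit counts from that should land close to the Benford proportions, since the exponent spans exactly six decades.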

-----



