
I had to use a different stat command on my Linux system. This worked for me:

find . -type f -exec stat -c %s {} \; | cut -c 1 | sort | uniq -c

Note that I exclude directories to avoid the size 4096 bias.

I ran it in my "project" directory and found that 38% of my file sizes begin with "1". That directory includes Perl source code files, input data files, and automatically generated output files.

After the digit "1" the distributions ranged from 3% to 9% with no obvious bias I could see.




Limiting to just files is a good idea, and if we employ our good friend awk it cuts the time down significantly. This one should work for both OSX and Linux.

find . -type f -ls | awk '{print $7}' | cut -c -1 | sort | uniq -c


MUCH faster, thanks.

Now I'm piping that into Perl to convert the counts to percentages. If I figure out a one-liner for that I'll let you know.
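
Something along these lines might do it (a rough sketch only; it assumes the two-column "count digit" output of uniq -c, and @rows and $total are just names I picked):

find . -type f -ls | awk '{print $7}' | cut -c -1 | sort | uniq -c | perl -lane 'push @rows, [@F]; $total += $F[0]; END { printf "%s  %5.1f%%\n", $_->[1], 100 * $_->[0] / $total for @rows }'

It buffers the uniq -c rows, totals the counts as it goes, and then prints each leading digit with its percentage in the END block.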

Next I'll be tempted to write a module for generating "realistic" (Benford-compliant) random numbers using this concise specification from HN contributor "shrughes":

"Data whose logarithm is uniformly distributed does [follow Benford's Law]."

I could use that to produce demo or test data.
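
As a starting point, here is a rough sketch of that idea as a Perl one-liner rather than a module (the six decades of range and the 10,000 samples are arbitrary choices of mine): pick log10(x) uniformly, exponentiate, and tally the leading digits.

perl -le '
  for (1..10000) {
    my $x = 10 ** (6 * rand);          # log10($x) uniform over six decades, so $x spans 1 to 1,000,000
    $lead{ substr(int($x), 0, 1) }++;  # tally the leading digit
  }
  printf "%s %5.1f%%\n", $_, 100 * $lead{$_} / 10000 for sort keys %lead;
'

The frequencies should come out near the log10(1 + 1/d) form of Benford's Law: roughly 30% for a leading 1 down to about 4.6% for a leading 9.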



