Hacker News
The Difficulty of Faking Data (1999) [pdf] (kkuniyuk.com)
29 points by tontonius 11 months ago | 3 comments

This is awesome, and could be a fun project for introductory students. It gives practice with:

- Extracting digits from numbers (easily done with a few function calls in Python, or some math)

- Calculating frequency of a known set of items in a sequence

- Some application of simple mathematical formulas[1]

I wouldn't mind trying this at some point...

[1] https://en.wikipedia.org/wiki/Benford%27s_law#Statistical_te...
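
For anyone who wants to try it, here is a rough sketch of the three steps listed above: pull out the leading digit, count how often each digit appears, and compare the counts with what Benford's law predicts using a plain chi-squared statistic. The function names are made up for illustration, and the chi-squared statistic is just one of several possible tests, so treat this as a starting point rather than a finished detector.

```python
import math
from collections import Counter


def leading_digit(x):
    """Return the first nonzero digit of a nonzero int or float."""
    # Strip the sign, then any leading zeros and the decimal point,
    # e.g. -0.00123 -> '0.00123' -> '123'.
    text = str(abs(x)).lstrip("0.")
    if not text:
        raise ValueError("zero has no leading digit")
    return int(text[0])


def benford_expected(d):
    """Probability that the leading digit is d under Benford's law."""
    return math.log10(1 + 1 / d)


def chi_squared_vs_benford(values):
    """Return (digit counts, chi-squared statistic) for the nonzero values."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * benford_expected(d)
        stat += (counts[d] - expected) ** 2 / expected
    return counts, stat


if __name__ == "__main__":
    # Powers of two are a standard example of Benford-like data.
    counts, stat = chi_squared_vs_benford([2**k for k in range(1, 101)])
    print(sorted(counts.items()), round(stat, 3))
```

A large statistic (compared with a chi-squared distribution with 8 degrees of freedom) suggests the leading digits do not look Benford-like. That alone is not proof of fabrication, since plenty of honest data sets do not follow Benford's law either.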

Benford's law offers an approach to detecting fabricated data: look at the distribution of the most significant digits of the reported data and check whether it matches the distribution the law predicts.

There's a paper that was published in the last 5 years, which I unfortunately can't seem to find. It detects fabricated data by examining less significant digits and taking advantage of the fact that our data is often discrete.

The main idea is that if an experiment surveyed 20 people about some statistic that takes an integer value (age rounded down, the number of times participants blinked in a second, etc.), you should never see a mean value of, say, 1234.56. Why? The sum of 20 integers is an integer, so their average must be a multiple of 1/20 = 0.05, i.e. its decimal part must be one of {0.00, 0.05, 0.10, 0.15, ..., 0.95}. If the author reports a mean with any other decimal part, you immediately know that there's something fishy going on.
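
A minimal sketch of that check, assuming the mean was reported rounded to a fixed number of decimals and every underlying value is an integer. The function name and signature are my own, not taken from the paper. The reported mean is passed as a string so that exact fractions can be used instead of floats.

```python
from fractions import Fraction


def mean_is_possible(reported_mean, n, decimals=2):
    """Could a mean reported to `decimals` places come from n integers?

    `reported_mean` is a string such as "1234.56" so it parses exactly.
    """
    reported = Fraction(reported_mean)
    # Rounding to `decimals` places moves the true mean by at most this much.
    half_step = Fraction(1, 2 * 10**decimals)
    # Every achievable mean is total / n for some integer total;
    # pick the total whose mean lies closest to the reported value.
    nearest_total = round(reported * n)
    return abs(Fraction(nearest_total, n) - reported) <= half_step


if __name__ == "__main__":
    print(mean_is_possible("1234.56", 20))  # the example from the comment
    print(mean_is_possible("1234.55", 20))
```

When n does not divide a power of ten (say n = 28), the achievable means are not tidy decimals, which is why the check allows for rounding instead of testing for an exact match. The check also loses its power once n is large compared with the reported precision, because almost every rounded value becomes reachable.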

This brings to mind a classic pre-Disney, pre-NYT FiveThirtyEight article applying some of these principles to take some pollsters to task.

- From 2009, Strategic Vision Polls Exhibit Unusual Patterns, Possibly Indicating Fraud: https://fivethirtyeight.com/features/strategic-vision-polls-...
