How to make fake data look meaningful (danbirken.com)
138 points by birken on Nov 20, 2013 | 41 comments

A great method for spotting (or at least suspecting) fake data is to see if it follows Benford's law. (So remember to fit your fake data to conform!)


> Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time. Benford's Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution.

As one might imagine:

> A test of regression coefficients in published papers showed agreement with Benford's law. As a comparison group subjects were asked to fabricate statistical estimates. The fabricated results failed to obey Benford's law.

So keep that in mind, data fabricators!
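For anyone who wants the numbers behind the quote: the first-digit form of Benford's Law is usually written as P(d) = log10(1 + 1/d). A minimal sketch that tabulates it (the function name `benford_first_digit` is just for illustration):

```python
import math

def benford_first_digit(d: int) -> float:
    """Expected share of values whose leading decimal digit is d (1..9)."""
    return math.log10(1 + 1 / d)

# Expected share for every possible leading digit in base 10.
expected = {d: benford_first_digit(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```

The shares sum to 1, with digit 1 at roughly 30% and digit 9 a little under 5%, which matches the figures in the quote.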

Just to explain this a bit, it has to do with relative growth/shrinkage and the base of the positional-numbering system you're using. If you have a random starting value (X) multiplied by a second random factor (Y), the result will start with a one more often than with any other digit.

You're basically throwing darts at logarithmic graph paper! The area covered by squares which "start with 1" is larger than the area covered by squares which "start with 9".
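A rough way to try the dart-throwing picture is to multiply a handful of random factors together and count leading digits. This is only a sketch under arbitrary assumptions: ten factors per value, each drawn uniformly from 0.01 to 10, and a fixed seed so the run is repeatable.

```python
import random
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant decimal digit of a positive number."""
    # Scientific notation puts the first significant digit up front.
    return int(f"{x:e}"[0])

rng = random.Random(0)  # fixed seed, only for repeatability
trials = 50_000
counts = Counter()

for _ in range(trials):
    value = 1.0
    # Arbitrary choice: a product of 10 independent random factors.
    for _ in range(10):
        value *= rng.uniform(0.01, 10.0)
    counts[leading_digit(value)] += 1

shares = {d: counts[d] / trials for d in range(1, 10)}
for d, share in shares.items():
    print(f"{d}: {share:.1%}")
```

Under these assumptions the leading digit 1 should come out most common, at somewhere near 30% of values, with the share falling off toward 9.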

> You're basically throwing darts at logarithmic graph paper! The area covered by squares which "start with 1" is larger than the area covered by squares which "start with 9".

Best explanation for Benford's Law I've ever seen. I've always had a rough understanding of the "why" of it, but this really drives the point home.

You have to be careful with this though. Benford's law only holds for certain types of data, such as growing populations. It has to do with the nature of the decimal system.

To be more specific, it has to do with the nature of any positional-notation system, whether it is decimal, octal, hex, or even binary.

Wow, thinking about Benford's Law in binary is really interesting, thanks for this comment.

It's the most extreme expression of the concept. Every number in binary starts with 1. The only other possible leading digit would be a meaningless leading zero, so every number starts with 1. Benford's Law holds in the most extreme manner possible in binary notation, that numbers starting with 1 dominate the population over all others.
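The same formula generalizes to other bases as P(d) = log_b(1 + 1/d) for d from 1 to b-1. A small sketch (the helper name `benford_share` is made up for this example) shows the binary case collapsing to 100%:

```python
import math

def benford_share(d: int, base: int) -> float:
    """Expected share of leading digit d (1..base-1) in the given base."""
    return math.log(1 + 1 / d, base)

for base in (2, 8, 10, 16):
    share_of_one = benford_share(1, base)
    print(f"base {base}: leading digit 1 about {share_of_one:.1%}")
```

In base 2 the only available leading digit is 1, so its share is log_2(2) = 1.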

It has more to do with the fact that a good chunk of natural phenomena are distributed exponentially, not linearly. Benford's Law actually applies to other notational systems.

That only helps you spot outright fabrication though, doesn't it? I would assume that this only happens in the minority of cases.

The much bigger issue is people trying many different statistical tests until they find the one that "works", removing outliers to make the data look better or just neglecting to mention things like p-values or confidence intervals in the hope that nobody would notice.

Much harder to spot, but potentially as misleading as just making numbers up...
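To put a number on the "try tests until one works" problem, here is a minimal sketch. It assumes the tests are independent and that there is no real effect, in which case each p-value is uniformly distributed between 0 and 1; the function names and the 0.05 threshold are illustrative choices.

```python
import random

def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** num_tests

def simulate(num_tests: int, alpha: float = 0.05, runs: int = 20_000,
             seed: int = 1) -> float:
    """Monte Carlo check: draw uniform p-values and count runs with any 'hit'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / runs

for k in (1, 5, 20):
    print(k, round(chance_of_false_positive(k), 3), round(simulate(k), 3))
```

With 20 independent looks at pure noise, the chance of at least one "significant" result is about 64%, which is why unreported extra tests are so misleading.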

Most real datasets apparently have outliers... you should not remove them if you want the data to look real (I heard this from this course: https://class.coursera.org/dataanalysis-002/)

(but probably don't fit it exactly)

I recently made a blog post on which universities have the most successful startup founders: http://minimaxir.com/2013/07/alma-mater-data/

In the post, I made it clear where the data came from, how I processed it, and additionally provided a raw copy of the data so that others could check against it.

It turns out the latter decision was especially a good idea because I received numerous comments that my data analysis was flawed, and I agree with them. Since then, I've taken additional steps to improve data accuracy, and I always provide the raw data, license-permitting.

You didn't actually clean up the data that you have tho.

Another flaw of your analysis is that you didn't look at where the founders went to school for their undergraduate/graduate degrees (this probably doesn't matter).

I opened up the CSV and searched for "Urbana" (UIUC, where I went to school) and counted 28 founders. In fact, some "University of Illinois" entries should match UIUC too, but I ignored those (for example, the ZocDoc CTO).

I did attempt to clean up the data, just not enough. (I substituted for the abbreviations for 20-25 colleges. In retrospect I should have looked for a mapping online first.)
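A minimal sketch of the kind of alias mapping being discussed. Everything here is hypothetical: the alias table, the helper `normalize_school`, and the sample rows are made up for illustration and are not taken from the actual CSV.

```python
from collections import Counter

CANONICAL = "University of Illinois at Urbana-Champaign"

# Hypothetical alias table; a real one would come from a curated mapping.
ALIASES = {
    "uiuc": CANONICAL,
    "university of illinois at urbana-champaign": CANONICAL,
    "university of illinois urbana-champaign": CANONICAL,
}

def normalize_school(raw: str) -> str:
    """Map known spellings to one canonical name; leave unknown names alone."""
    # Lowercase, drop commas, and collapse runs of whitespace before lookup.
    key = " ".join(raw.lower().replace(",", " ").split())
    return ALIASES.get(key, raw.strip())

# Made-up rows for illustration only.
sample = [
    "UIUC",
    "University of Illinois, Urbana-Champaign",
    " university of illinois at urbana-champaign ",
    "University of Illinois",  # ambiguous campus, deliberately not mapped
]

counts = Counter(normalize_school(s) for s in sample)
print(counts)
```

A bare "University of Illinois" is left unmapped on purpose, since it could refer to more than one campus and would need manual checking.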

I worked in retail when I was younger. This would be excellent advice for a sales person working with uneducated customers.

I was the go-to guy, resulting in a lot of freedom when dealing with clients. It allowed me to do a lot of social experimentation, especially when selling custom services (like fixing "My PC is slow").

Explaining the ins and outs of different options (goal, short/long term consequences, risks, up/downsides...), then saying it would cost $50 would usually result in the guy becoming all suspicious, saying he wanted me to guarantee him stuff that I couldn't, and that it was expensive.

Say "Oh you'll need our performance rejuvenation(tm) package, that'll be $400" and the guy happily pulls out his credit card!

> If you are trying to trick somebody with your data, never share the raw data; only share the conclusions. Bonus points if you can't share the data because it is somehow privileged or a trade secret. The last thing you want is to allow other people to evaluate your analysis and possibly challenge your conclusions.

Of course, I'm not against sharing data. However, the satire here is slightly too optimistic that people, when given raw data, will attempt to verify it for themselves. When people are given plain narrative text, they can still be profoundly influenced by a skewed headline -- something which everyone here may not ever be familiar with :)

I guess I'm being curmudgeonly about this... We should all share the data behind our conclusions, but don't assume that sharing it, and then hearing nothing back from outside observers, means you were correct. Most people just don't have time to read beyond the summary, never mind work with the data.

Well it is similar to open source. Just making a project open source doesn't automatically reduce the number of bugs. People need to be interested and capable of auditing it. However, even if only a small percentage of people do audit it, the gains from that apply to everybody.

I think it generally works out. More popular open source projects naturally have more eye-balls, which means more people auditing it. Similarly with sharing data, more controversial or interesting claims will naturally attract more attention and therefore more people who are capable and interested in verifying the claims.

There is of course a famous book on this topic:


Which is also linked in the blog :)

oh, whoops! I had checked and didn't see him mention it - it's behind the link for 'other'. Good spot.

I was hoping (by the title) that this was something about generating real-looking test data for your app, something very useful for UI devs before the API or backend is in place.

Raw data, and the source code which you used to arrive at your conclusion, or it didn't happen.

In today's world of GitHub gists, Python's PyPI and R's CRAN, there's no excuse not to document the entire process, in addition to the raw data.

With some caveats. For example, I'm sure all the patients whose data I work with would rather not have it spilled out over GitHub.

Can you reasonably anonymize it?

For some things, yes. For others - not really. And there have been groups, historically, that have not been served well by "Everyone has access to this data". Given the need for their consent and cooperation, I generally side with what my subjects want their data to go toward, rather than some nebulous ideal.

Some more sophisticated (but very readable and sometimes laugh-out-loud funny) articles about detecting fake data can be found at the faculty website[1] of Uri Simonsohn, who is known as the "data detective"[2] for his ability to detect research fraud just from reading published, peer-reviewed papers and thinking about the reported statistics in the papers. Some of the techniques for cheating on statistics that he has discovered and named include "p-hacking,"[3] and he has published a checklist of procedures to follow to ensure more honest and replicable results.[4]

From Jelte Wicherts writing in Frontiers in Computational Neuroscience (an open-access journal) comes a set of general suggestions[5] on how to make the peer-review process in scientific publishing more reliable. Wicherts does a lot of research on this issue to try to reduce the number of dubious publications in his main discipline, the psychology of human intelligence.

"With the emergence of online publishing, opportunities to maximize transparency of scientific research have grown considerably. However, these possibilities are still only marginally used. We argue for the implementation of (1) peer-reviewed peer review, (2) transparent editorial hierarchies, and (3) online data publication. First, peer-reviewed peer review entails a community-wide review system in which reviews are published online and rated by peers. This ensures accountability of reviewers, thereby increasing academic quality of reviews. Second, reviewers who write many highly regarded reviews may move to higher editorial positions. Third, online publication of data ensures the possibility of independent verification of inferential claims in published papers. This counters statistical errors and overly positive reporting of statistical results. We illustrate the benefits of these strategies by discussing an example in which the classical publication system has gone awry, namely controversial IQ research. We argue that this case would have likely been avoided using more transparent publication practices. We argue that the proposed system leads to better reviews, meritocratic editorial hierarchies, and a higher degree of replicability of statistical analyses."

[1] http://opim.wharton.upenn.edu/~uws/

[2] http://www.nature.com/news/the-data-detective-1.10937

[3] http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2160588

(this abstract link leads to a free download of the article)

[4] http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704

(this abstract link also leads to a free download of the full paper)

[5] http://www.frontiersin.org/Computational_Neuroscience/10.338...

Jelte M. Wicherts, Rogier A. Kievit, Marjan Bakker and Denny Borsboom. Letting the daylight in: reviewing the reviewers and other ways to maximize transparency in science. Front. Comput. Neurosci., 03 April 2012 doi: 10.3389/fncom.2012.00020

Start-up idea: Review site that generates scores for published scientific papers using an open methodology. Maybe even allow community to generate their own metrics!

Edit: looks like the domains "sciencetomatoes.com" and "rottenpvalues.com" are available ;) http://www.internic.net/whois.html

This is perhaps a ridiculous idea, but I'll post it anyway. There are a lot of features of a paper that might correlate with its truthfulness: for example, surface-level things like which journal published it, who the authors are, what subject it's on, and how controversial/interesting it is. Then there's the data itself.

Then get a large sample of papers which are known to be true or false (or at least were able to be replicated or not) and use some machine learning or statistical method to make decent predictions of how likely new papers are to be true or false.

How are you going to prove that your system isn't pulling random ratings out of thin air? ;)

Tangentially, the article gets confidence intervals wrong:

Simple change increased conversion between -12% and 96%

That's not what a confidence interval is. A confidence interval is merely the set of null hypotheses you can't reject.


A credible interval (which you only get from Bayesian statistics) is the interval that represents how much you increased the conversion by.

The classical frequentist interpretation of a confidence interval doesn't require invoking Fisherian hypothesis-testing. It's simply an interval estimate for a population parameter rather than a point estimate, which is semantically close to what he means here. A 95% confidence interval is an interval estimate with 95% coverage probability for the true population parameter, which in the frequentist sense means that it includes the true population parameter in 95% of long-run experiment repetitions. (One can get different kinds of frequentist coverage with a prediction or a tolerance interval, which vary what's being covered and what's being repeated.)
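The long-run coverage idea can be checked with a small simulation. This is a sketch under simplifying assumptions: normally distributed data with a known standard deviation (so a plain z-interval applies), plus made-up values for the true mean, sigma, and sample size.

```python
import random
from statistics import NormalDist

rng = random.Random(42)  # fixed seed, only for repeatability

# Made-up population parameters and sample size, for illustration only.
TRUE_MEAN, SIGMA, N = 5.0, 2.0, 30

# Half-width of a 95% z-interval when sigma is assumed known.
z = NormalDist().inv_cdf(0.975)
half_width = z * SIGMA / N ** 0.5

def interval_covers_truth() -> bool:
    """Run one 'experiment' and report whether its interval contains TRUE_MEAN."""
    sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    centre = sum(sample) / N
    return centre - half_width <= TRUE_MEAN <= centre + half_width

repetitions = 10_000
coverage = sum(interval_covers_truth() for _ in range(repetitions)) / repetitions
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

Each individual interval either contains the true mean or it doesn't; the 95% describes how often the procedure succeeds across repetitions.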

I see this all the time especially in social media blogs. They come to strange, convoluted conclusions about things like "the best time to tweet", without taking into account that 40% of tweets are from bots, and auto-tweets. Or that most of your followers are not human.

Standard methods of calculating confidence intervals are only applicable to parametric data. Most data I deal with on the Web is non-parametric, so I wouldn't take a lack of confidence intervals to necessarily imply non-meaningful data.
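For what it's worth, one widely used way to get an interval without assuming a particular distribution is the percentile bootstrap. A minimal sketch follows; the helper `bootstrap_ci`, its defaults, and the skewed sample are all made up for illustration.

```python
import random
from statistics import median

def bootstrap_ci(data, stat=median, resamples=5_000, level=0.95, seed=0):
    """Percentile bootstrap interval for stat, resampling data with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(resamples))
    # Take the empirical quantiles of the resampled statistics as the bounds.
    lo_idx = round((1 - level) / 2 * resamples)
    hi_idx = round((1 + level) / 2 * resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Made-up, heavily skewed sample (think page load times in seconds).
sample = [0.8, 0.9, 1.0, 1.1, 1.1, 1.2, 1.3, 1.5, 1.7, 2.0,
          2.4, 3.1, 4.8, 9.5, 22.0]

low, high = bootstrap_ci(sample)
print(f"median = {median(sample)}, 95% bootstrap interval = ({low}, {high})")
```

The percentile bootstrap has its own caveats (small samples, dependent observations), so treat this as a starting point rather than a cure-all.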

Honestly, anyone who isn't showing a 95% confidence interval is just being lazy. If you're not willing to run an experiment twenty or thirty times, you might as well not even look at the results and just publish it regardless of what it is. Why even research?

On the other hand, if you'd like to showcase actual insight and meaningful, surprising conclusions, it only makes sense to take the time to find the best dataset that supports it. Real researchers leave nothing to chance.

Reminds me of every single presentation I've ever seen at a games developer conference. "Look at this awesome thing we did! The graph goes up and to the right! No, I can't tell you scale, methodology, other contributing factors or talk to statistical significance. But it's AMAZING, right?!"

Another great checklist to consider when reading articles: Warning Signs in Experimental Design and Interpretation : http://norvig.com/experiment-design.html (found the link on HN a while back)

Are there any tips that would help when seeding two-sided markets with fake data, a la reddit getting started?


This isn't really on how to make fake data look meaningful, but rather how to make useless data look meaningful. If you can fake the data, then there's no need for these misleading analyses.

Indeed. Properly faked data, rather than just misleadingly spun analysis, can be analyzed using the most rigorous, statistically sound methods available.

Indeed, often very legitimate science does exactly this.

Deceptive analysis is pretty easy to spot. A genuine work of data fabrication artistry is much, much harder.

Meh. I mean... funny, but it just reads like yet another person who saw bogus numbers, got fed up, ranted.

I did notice a lot of misspellings on the graph legends :)

Nice one! Haha... But you need to be extra careful though, because some people are really keen on details.
