
How to make fake data look meaningful - birken
http://danbirken.com/statistics/2013/11/19/ways-to-make-fake-data-look-meaningful.html
======
simonsarris
A great method for spotting (or at least suspecting) fake data is to see if it
follows Benford's law. (So remember to fit your fake data to conform!)

[http://en.wikipedia.org/wiki/Benford's_law](http://en.wikipedia.org/wiki/Benford's_law)

> Benford's Law, also called the First-Digit Law, refers to the frequency
> distribution of digits in many (but not all) real-life sources of data. In
> this distribution, the number 1 occurs as the leading digit about 30% of the
> time, while larger numbers occur in that position less frequently: 9 as the
> first digit less than 5% of the time. Benford's Law also concerns the
> expected distribution for digits beyond the first, which approach a uniform
> distribution.

As one might imagine:

> A test of regression coefficients in published papers showed agreement with
> Benford's law. As a comparison group subjects were asked to fabricate
> statistical estimates. The fabricated results failed to obey Benford's law.

So keep that in mind, data fabricators!
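
Here's a minimal sketch of such a check in Python -- compare observed
leading-digit frequencies against Benford's expected log10(1 + 1/d). The
`values` list is just placeholder data; point it at whatever dataset you want
to test:

    import math
    from collections import Counter

    def leading_digit(value):
        # Rough parse: take the first nonzero digit of the decimal representation
        for ch in str(abs(value)):
            if ch.isdigit() and ch != "0":
                return int(ch)
        return None

    def benford_expected(d, base=10):
        # Benford's expected probability for leading digit d
        return math.log(1 + 1 / d, base)

    values = [1.2, 150, 32, 9000, 17, 210, 1.9]  # placeholder data
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())

    for d in range(1, 10):
        observed = counts.get(d, 0) / total
        print(d, round(observed, 3), round(benford_expected(d), 3))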

~~~
astine
You have to be careful with this though. Benford's law only holds for certain
types of data, such as growing populations. It has to do with the nature of
the decimal system.

~~~
Terr_
To be more specific, it has to do with the nature of _any_ positional-notation
system, whether it is decimal, octal, hex, or even binary.

~~~
T-hawk
Wow, thinking about Benford's Law in binary is really interesting, thanks for
this comment.

It's the most extreme expression of the concept: _every number_ in binary
starts with 1, since the only other possible leading digit would be a
meaningless leading zero. Benford's Law holds in the most extreme form
possible in binary notation: numbers starting with 1 make up the entire
population.
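
For reference, the general form of the law in base b gives P(d) = log_b(1 +
1/d) for leading digit d; a quick Python sketch confirming the binary case:

    import math

    def benford(d, base):
        # Probability that the leading digit is d, in the given base
        return math.log(1 + 1 / d, base)

    print(benford(1, 2))                   # base 2: 1.0 -- every number leads with 1
    print(benford(1, 10), benford(9, 10))  # base 10: ~0.301 and ~0.046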

------
minimaxir
I recently made a blog post on which universities have the most successful
startup founders: [http://minimaxir.com/2013/07/alma-mater-
data/](http://minimaxir.com/2013/07/alma-mater-data/)

In the post, I made it clear where the data came from, how I processed it, and
additionally provided a raw copy of the data so that others could check
against it.

It turns out the latter decision was an especially good idea, because I
received numerous comments that my data analysis was flawed, and I agree with
them. Since then, I've taken additional steps to improve data accuracy, and I
_always_ provide the raw data, license permitting.

~~~
siong1987
You didn't actually clean up the data that you have, though.

Another flaw of your analysis is that you didn't look at where the founders
went to school for their undergraduate vs. graduate degrees (though this
probably doesn't matter).

I opened up the CSV and searched for "Urbana" (UIUC, where I went to school)
and counted 28 founders. In fact, some "University of Illinois" entries should
match UIUC too, but I ignored those (for example, the ZocDoc CTO).

~~~
minimaxir
I did attempt to clean up the data, just not enough. (I substituted full names
for the abbreviations of 20-25 colleges. In retrospect, I should have looked
for an existing mapping online first.)
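
The kind of cleanup described above boils down to an alias map from
abbreviations and variant spellings to canonical school names. A sketch -- the
entries here are illustrative examples, not the mapping actually used in the
post:

    # Illustrative alias map; real cleanup needs far more entries and care
    # (e.g. "University of Illinois" alone may mean a different campus).
    ALIASES = {
        "UIUC": "University of Illinois at Urbana-Champaign",
        "Urbana-Champaign": "University of Illinois at Urbana-Champaign",
        "MIT": "Massachusetts Institute of Technology",
        "Stanford": "Stanford University",
    }

    def normalize_school(name):
        # Map an abbreviation or variant spelling to a canonical school name
        cleaned = name.strip()
        return ALIASES.get(cleaned, cleaned)

    print(normalize_school("UIUC"))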

------
datphp
I worked in retail when I was younger. This would be excellent advice for a
sales person working with uneducated customers.

I was the go-to guy, which gave me a lot of freedom when dealing with clients.
It allowed me to do a lot of social experimentation, especially when selling
custom services (like fixing "my PC is slow").

Explaining the ins and outs of the different options (goals, short- and
long-term consequences, risks, upsides and downsides...), then saying it would
cost $50, would usually result in the guy becoming all suspicious, saying he
wanted me to guarantee things I couldn't, and that it was expensive.

Say "Oh you'll need our performance rejuvenation(tm) package, that'll be 400$"
and the guy happily pulls out his credit card!

------
danso
> _If you are trying to trick somebody with your data, never share the raw
> data; only share the conclusions. Bonus points if you can't share the data
> because it is somehow privileged or a trade secret. The last thing you want
> is to allow other people to evaluate your analysis and possibly challenge
> your conclusions._

Of course, I'm not against sharing data. However, the satire here is slightly
too optimistic that people, when given raw data, will attempt to verify it for
themselves. When people are given plain narrative text, they can still be
profoundly influenced by a skewed headline -- something no one here could
possibly be familiar with :)

I guess I'm being curmudgeonly about this... We should all share the data
behind our conclusions, but we shouldn't assume that doing so -- in the absence
of outside scrutiny -- means we were correct. Most people just don't have time
to read beyond the summary, never mind work with the data.

~~~
birken
Well, it is similar to open source. Just making a project open source doesn't
automatically reduce the number of bugs. People need to be interested in and
capable of auditing it. However, even if only a small percentage of people do
audit it, the gains from that apply to everybody.

I think it generally works out. More popular open source projects naturally
have more eyeballs, which means more people auditing them. Similarly with
sharing data: more controversial or interesting claims will naturally attract
more attention, and therefore more people who are capable of and interested in
verifying them.

------
bazzargh
There is of course a famous book on this topic:

[http://en.wikipedia.org/wiki/How_to_Lie_with_Statistics](http://en.wikipedia.org/wiki/How_to_Lie_with_Statistics)

~~~
namenotrequired
Which is also linked in the blog :)

~~~
bazzargh
oh, whoops! I had checked and didn't see him mention it - it's behind the link
for 'other'. Good spot.

------
wingspan
I was hoping (from the title) that this would be something about generating
real-looking test data for your app -- something very useful for UI devs
before the API or backend is in place.
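
If that's what you're after, the third-party Python `faker` package handles a
lot of it; a minimal sketch (assuming `pip install faker`):

    from faker import Faker

    fake = Faker()

    # Generate a handful of plausible-looking user records for UI development
    users = [
        {
            "name": fake.name(),
            "email": fake.email(),
            "company": fake.company(),
            "address": fake.address(),
        }
        for _ in range(5)
    ]

    for user in users:
        print(user["name"], "-", user["email"])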

------
cscheid
Raw data, _and the source code which you used to arrive at your conclusion_,
or it didn't happen.

In today's world of GitHub gists, Python's PyPI, and R's CRAN, there's no
excuse not to document the entire process, in addition to the raw data.

~~~
Fomite
With some caveats. For example, I'm sure all the patients whose data I work
with would rather not have it spilled out over GitHub.

~~~
couchand
Can you reasonably anonymize it?

~~~
Fomite
For some things, yes. For others - not really. And there have been groups,
historically, that have not been served well by "Everyone has access to this
data". Given the need for their consent and cooperation, I generally side with
what my subjects want their data to go toward, rather than some nebulous
ideal.

------
tokenadult
Some more sophisticated (but very readable and sometimes laugh-out-loud funny)
articles about detecting fake data can be found at the faculty website[1] of
Uri Simonsohn, who is known as the "data detective"[2] for his ability to
detect research fraud just from reading published, peer-reviewed papers and
thinking about the reported statistics in the papers. Some of the techniques
for cheating on statistics that he has discovered and named include
"p-hacking,"[3] and he has published a checklist of procedures to follow to
ensure more honest and replicable results.[4]
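
One of the simplest forms of p-hacking is optional stopping: peeking at the
p-value as data comes in and stopping as soon as it dips below 0.05. A toy
simulation (my own illustration, not taken from Simonsohn's papers) of how
that inflates the false-positive rate even when there is no real effect:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    false_positives = 0
    n_experiments = 2000

    for _ in range(n_experiments):
        a, b = [], []
        for _ in range(20):                  # up to 20 peeks per experiment
            a.extend(rng.normal(0, 1, 10))   # both groups drawn from the same
            b.extend(rng.normal(0, 1, 10))   # distribution, i.e. no true effect
            _, p = stats.ttest_ind(a, b)
            if p < 0.05:                     # stop and "publish" on the first hit
                false_positives += 1
                break

    print("False-positive rate:", false_positives / n_experiments)  # well above 0.05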

From Jelte Wicherts, writing in Frontiers in Computational Neuroscience (an
open-access journal) comes a set of general suggestions[5] on how to make the
peer-review process in scientific publishing more reliable. Wicherts does a
lot of research on this issue to try to reduce the number of dubious
publications in his main discipline, the psychology of human intelligence.

"With the emergence of online publishing, opportunities to maximize
transparency of scientific research have grown considerably. However, these
possibilities are still only marginally used. We argue for the implementation
of (1) peer-reviewed peer review, (2) transparent editorial hierarchies, and
(3) online data publication. First, peer-reviewed peer review entails a
community-wide review system in which reviews are published online and rated
by peers. This ensures accountability of reviewers, thereby increasing
academic quality of reviews. Second, reviewers who write many highly regarded
reviews may move to higher editorial positions. Third, online publication of
data ensures the possibility of independent verification of inferential claims
in published papers. This counters statistical errors and overly positive
reporting of statistical results. We illustrate the benefits of these
strategies by discussing an example in which the classical publication system
has gone awry, namely controversial IQ research. We argue that this case would
have likely been avoided using more transparent publication practices. We
argue that the proposed system leads to better reviews, meritocratic editorial
hierarchies, and a higher degree of replicability of statistical analyses."

[1] [http://opim.wharton.upenn.edu/~uws/](http://opim.wharton.upenn.edu/~uws/)

[2] [http://www.nature.com/news/the-data-
detective-1.10937](http://www.nature.com/news/the-data-detective-1.10937)

[3]
[http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2160588](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2160588)

(this abstract link leads to a free download of the article)

[4]
[http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704)

(this abstract link also leads to a free download of the full paper)

[5]
[http://www.frontiersin.org/Computational_Neuroscience/10.338...](http://www.frontiersin.org/Computational_Neuroscience/10.3389/fncom.2012.00020/full)

Jelte M. Wicherts, Rogier A. Kievit, Marjan Bakker and Denny Borsboom. Letting
the daylight in: reviewing the reviewers and other ways to maximize
transparency in science. Front. Comput. Neurosci., 03 April 2012 doi:
10.3389/fncom.2012.00020

~~~
dwaltrip
Start-up idea: a review site that generates scores for published scientific
papers using an open methodology. Maybe even allow the community to generate
their own metrics!

Edit: looks like the domains "sciencetomatoes.com" and "rottenpvalues.com" are
available ;)
[http://www.internic.net/whois.html](http://www.internic.net/whois.html)

~~~
Houshalter
This is perhaps a ridiculous idea, but I'll post it anyway. There are a lot of
features of a paper that might correlate with its truthfulness. For example,
surface-level stuff like which journal published it or who the authors are,
what subject it's on, how controversial/interesting it is. Then the data
itself.

Then get a large sample of papers which are known to be true or false (or at
least did or did not replicate) and use some machine learning or statistical
method to make decent predictions of how likely new papers are to be true or
false.
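
A toy sketch of that idea with scikit-learn -- the feature names and the tiny
labeled training set below are hypothetical, and getting trustworthy
"replicated / failed to replicate" labels is the genuinely hard part:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: [journal_impact_factor, n_authors, sample_size, reported_p_value]
    X_train = np.array([
        [2.1, 3,  40, 0.049],
        [9.8, 7, 500, 0.001],
        [1.5, 2,  25, 0.048],
        [6.2, 5, 300, 0.004],
    ])
    y_train = np.array([0, 1, 0, 1])  # 0 = failed to replicate, 1 = replicated

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    new_paper = np.array([[3.0, 4, 120, 0.03]])
    print(model.predict_proba(new_paper)[0, 1])  # estimated P(replicates)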

~~~
yen223
How are you going to prove that your system isn't pulling random ratings out
of thin air? ;)

------
yummyfajitas
Tangentially, the article gets confidence intervals wrong:

 _Simple change increased conversion between -12% and 96%_

That's not what a confidence interval is. A confidence interval is merely the
set of null hypotheses you can't reject.

[http://www.bayesianwitch.com/blog/2013/confidence_intervals....](http://www.bayesianwitch.com/blog/2013/confidence_intervals.html)

A _credible interval_ (which you only get from Bayesian statistics) is the
interval that represents how much you increased the conversion by.
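
For a conversion test, a credible interval can be read straight off Beta
posteriors over the two conversion rates. A minimal sketch with made-up counts
and uniform Beta(1, 1) priors:

    import numpy as np

    rng = np.random.default_rng(0)

    conv_a, n_a = 120, 2400   # control: conversions, visitors (made-up numbers)
    conv_b, n_b = 145, 2380   # variant

    # Posterior for each rate is Beta(conversions + 1, non-conversions + 1)
    rate_a = rng.beta(conv_a + 1, n_a - conv_a + 1, size=100_000)
    rate_b = rng.beta(conv_b + 1, n_b - conv_b + 1, size=100_000)

    lift = (rate_b - rate_a) / rate_a
    low, high = np.percentile(lift, [2.5, 97.5])
    print(f"95% credible interval for relative lift: {low:.1%} to {high:.1%}")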

~~~
mjn
The classical frequentist interpretation of a confidence interval doesn't
require invoking Fisherian hypothesis-testing. It's simply an interval
estimate for a population parameter rather than a point estimate, which is
semantically close to what he means here. A 95% confidence interval is an
interval estimate with 95% coverage probability for the true population
parameter, which in the frequentist sense means that it includes the true
population parameter in 95% of long-run experiment repetitions. (One can get
different kinds of frequentist coverage with a prediction or a tolerance
interval, which vary what's being covered and what's being repeated.)
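
A toy illustration of that long-run coverage interpretation: simulate many
repeated samples from a known population and count how often the usual
mean-plus-or-minus-1.96-standard-errors interval contains the true mean (the
numbers below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    true_mean, n, reps = 5.0, 50, 10_000
    covered = 0

    for _ in range(reps):
        sample = rng.normal(true_mean, 2.0, n)
        se = sample.std(ddof=1) / np.sqrt(n)
        low, high = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
        if low <= true_mean <= high:
            covered += 1

    print("Empirical coverage:", covered / reps)  # close to 0.95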

------
AznHisoka
I see this all the time, especially on social media blogs. They come to
strange, convoluted conclusions about things like "the best time to tweet"
without taking into account that 40% of tweets are from bots and auto-tweets,
or that most of your followers are not human.

------
taylorbuley
Standard methods of calculating confidence intervals assume a parametric
model. Most data I deal with on the web doesn't fit one, so I wouldn't take a
lack of confidence intervals to necessarily imply non-meaningful data.
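
One common way to get an interval without distributional assumptions is a
percentile bootstrap; a minimal sketch (the `data` values are placeholders):

    import numpy as np

    rng = np.random.default_rng(2)
    data = np.array([1.2, 0.4, 3.7, 2.2, 0.9, 5.1, 1.8, 0.3, 2.9, 4.4])

    # Resample with replacement many times and take percentiles of the
    # resampled means
    boot_means = [
        rng.choice(data, size=len(data), replace=True).mean()
        for _ in range(10_000)
    ]
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print(f"95% bootstrap CI for the mean: [{low:.2f}, {high:.2f}]")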

------
logicallee
Honestly, anyone who isn't showing a 95% confidence interval is just being
lazy. If you're not willing to run an experiment twenty or thirty times, you
might as well not even look at the results and just publish it regardless of
what it is. Why even research?

On the other hand, if you'd like to showcase actual insight and meaningful,
surprising conclusions, it only makes sense to take the time to find the best
dataset that supports it. Real researchers leave nothing to chance.

------
kosei
Reminds me of every single presentation I've ever seen at a games developer
conference. "Look at this awesome thing we did! The graph goes up and to the
right! No, I can't tell you the scale, methodology, or other contributing
factors, or speak to statistical significance. But it's AMAZING, right?!"

------
thesehands
Another great checklist to consider when reading articles: Warning Signs in
Experimental Design and Interpretation: [http://norvig.com/experiment-
design.html](http://norvig.com/experiment-design.html) (found the link on HN a
while back)

------
j_s
Are there any tips that would help when seeding two-sided markets with fake
data, a la reddit getting started?

[http://venturebeat.com/2012/06/22/reddit-fake-
users/](http://venturebeat.com/2012/06/22/reddit-fake-users/)

------
thanatosmin
This isn't really about how to make fake data look meaningful, but rather
about how to make useless data look meaningful. If you can fake the data, then
there's no need for these misleading analyses.

~~~
Fomite
Indeed. Properly _faked_ data, rather than just a misleadingly spun analysis,
can be analyzed using the most rigorous, statistically sound methods
available.

In fact, very legitimate science often does exactly this with simulated data.

A deceptive analysis is pretty easy to spot. A genuine work of data-fabrication
artistry is much, much harder.

------
GrinningFool
Meh. I mean... it's funny, but it just reads like yet another person who saw
bogus numbers, got fed up, and ranted.

------
kgu87
I did notice a lot of misspellings in the graph legends :)

------
kimonos
Nice one! Haha... But you need to be extra careful, because some people really
are keen on details.

