> Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time. Benford's Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution.
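For concreteness, the distribution described above is P(d) = log10(1 + 1/d) for leading digits d = 1..9. A minimal sketch in Python (my own illustration, not part of the quoted text):

```python
# Expected leading-digit probabilities under Benford's Law: P(d) = log10(1 + 1/d).
import math

benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
for d, p in benford.items():
    print(f"{d}: {p:.1%}")   # 1: 30.1%, 2: 17.6%, ..., 9: 4.6%
```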
As one might imagine:
> A test of regression coefficients in published papers showed agreement with Benford's law. As a comparison group subjects were asked to fabricate statistical estimates. The fabricated results failed to obey Benford's law.
So keep that in mind, data fabricators!
You're basically throwing darts at logarithmic graph paper! The area covered by squares which "start with 1" is larger than the area covered by squares which "start with 9".
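A quick simulation of that intuition (my own sketch, not from the comment): throw "darts" uniformly on a log10 scale across several orders of magnitude and tally the leading digits; they land close to Benford's predictions.

```python
# Sample points uniformly in log10-space ("darts at logarithmic graph paper")
# and count leading digits; the observed frequencies track log10(1 + 1/d).
import math
import random
from collections import Counter

def leading_digit(x):
    """First significant digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

random.seed(0)
samples = [10 ** random.uniform(0, 6) for _ in range(100_000)]  # log-uniform over [1, 10^6)
counts = Counter(leading_digit(x) for x in samples)
for d in range(1, 10):
    print(f"digit {d}: observed {counts[d] / len(samples):.3f}, "
          f"Benford {math.log10(1 + 1 / d):.3f}")
```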
Best explanation for Benford's Law I've ever seen. I've always had a rough understanding of the "why" of it, but this really drives the point home.
It's the most extreme expression of the concept. In binary, every number starts with 1; the only other possible leading digit would be a meaningless leading zero. Benford's Law thus holds in the most extreme manner possible in binary notation: numbers starting with 1 don't just dominate the population, they are the entire population.
The much bigger issue is people trying many different statistical tests until they find one that "works", removing outliers to make the data look better, or simply neglecting to mention things like p-values or confidence intervals in the hope that nobody notices.
Much harder to spot, but potentially as misleading as just making numbers up...
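To make the first failure mode concrete, here's a hedged little simulation (my own, not the commenter's): run the same null A/B experiment on pure noise across twenty unrelated metrics and see how often at least one comes out "significant" at the 5% level.

```python
# Run an A/B "experiment" on pure noise across 20 metrics and count how often
# at least one looks significant. Expect roughly 1 - 0.95**20 ≈ 64% spurious hits.
import random
import statistics

def z_stat(a, b):
    """Two-sample z statistic (normal approximation, fine for n=200)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

random.seed(1)
experiments, hits = 1_000, 0
for _ in range(experiments):
    significant = False
    for _metric in range(20):  # 20 unrelated metrics, all pure noise
        control = [random.gauss(0, 1) for _ in range(200)]
        treatment = [random.gauss(0, 1) for _ in range(200)]
        if abs(z_stat(control, treatment)) > 1.96:  # "p < 0.05"
            significant = True
    if significant:
        hits += 1
print(f"At least one 'significant' metric in {hits / experiments:.0%} of null experiments")
```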
In the post, I made it clear where the data came from, how I processed it, and additionally provided a raw copy of the data so that others could check against it.
It turns out the latter decision was an especially good idea because I received numerous comments that my data analysis was flawed, and I agree with them. Since then, I've taken additional steps to improve data accuracy, and I always provide the raw data, license-permitting.
Another flaw in your analysis is that you didn't look at whether the founders' schools were for their undergraduate or graduate degrees (though this probably doesn't matter).
I opened up the CSV and searched for "Urbana" (UIUC, where I went to school) and counted 28 founders. In fact, some "University of Illinois" entries should match UIUC too, but I ignored those (for example, the ZocDoc CTO).
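For what it's worth, a minimal sketch of that kind of check (the file name and the "school" column are my guesses, not the actual CSV's):

```python
# Count founders whose school field mentions Urbana or University of Illinois.
import csv
from collections import Counter

matches = Counter()
with open("founders.csv", newline="", encoding="utf-8") as f:   # hypothetical file name
    for row in csv.DictReader(f):
        school = row.get("school", "")                          # hypothetical column name
        if "Urbana" in school or "University of Illinois" in school:
            matches[school] += 1

print(sum(matches.values()), "founders matched")
for school, n in matches.most_common():
    print(f"{n:3d}  {school}")
```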
I was the go-to guy, resulting in a lot of freedom when dealing with clients. It allowed me to do a lot of social experimentation, especially when selling custom services (like fixing "My PC is slow").
Explaining the ins and outs of different options (goals, short/long-term consequences, risks, upsides/downsides...), then saying it would cost $50, would usually result in the guy becoming all suspicious, saying he wanted me to guarantee things I couldn't, and that it was expensive.
Say "Oh, you'll need our performance rejuvenation(tm) package, that'll be $400" and the guy happily pulls out his credit card!
Of course, I'm not against sharing data. However, the satire here is slightly too optimistic in assuming that people, when given raw data, will attempt to verify it for themselves. When people are given plain narrative text, they can still be profoundly influenced by a skewed headline -- something no one here could possibly be familiar with :)
I guess I'm being curmudgeonly about this... We should all share the data behind our conclusions, but don't think that doing so -- and an absence of outside observations -- means you were correct. Most people just don't have time to read beyond the summary, never mind work with the data.
I think it generally works out. More popular open source projects naturally have more eyeballs, which means more people auditing them. Similarly with sharing data: more controversial or interesting claims will naturally attract more attention, and therefore more people who are capable of and interested in verifying the claims.
In today's world of GitHub gists, Python's PyPI, and R's CRAN, there's no excuse not to document the entire process in addition to the raw data.
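Even a single short script checked in next to the CSV goes a long way; a hedged sketch of what that might look like (file and column names are placeholders, not anyone's actual pipeline):

```python
# One script from the published raw CSV to the reported numbers, so anyone can rerun it.
import csv
import statistics

RAW_DATA = "raw_data.csv"      # the exact file shared alongside the post

def load(path):
    with open(path, newline="", encoding="utf-8") as f:
        return [float(row["value"]) for row in csv.DictReader(f)]  # placeholder column

def summarize(values):
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

if __name__ == "__main__":
    print(summarize(load(RAW_DATA)))  # rerunning this should reproduce the post's figures
```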
From Jelte Wicherts, writing in Frontiers in Computational Neuroscience (an open-access journal), comes a set of general suggestions on how to make the peer-review process in scientific publishing more reliable. Wicherts does a lot of research on this issue to try to reduce the number of dubious publications in his main discipline, the psychology of human intelligence.
"With the emergence of online publishing, opportunities to maximize transparency of scientific research have grown considerably. However, these possibilities are still only marginally used. We argue for the implementation of (1) peer-reviewed peer review, (2) transparent editorial hierarchies, and (3) online data publication. First, peer-reviewed peer review entails a community-wide review system in which reviews are published online and rated by peers. This ensures accountability of reviewers, thereby increasing academic quality of reviews. Second, reviewers who write many highly regarded reviews may move to higher editorial positions. Third, online publication of data ensures the possibility of independent verification of inferential claims in published papers. This counters statistical errors and overly positive reporting of statistical results. We illustrate the benefits of these strategies by discussing an example in which the classical publication system has gone awry, namely controversial IQ research. We argue that this case would have likely been avoided using more transparent publication practices. We argue that the proposed system leads to better reviews, meritocratic editorial hierarchies, and a higher degree of replicability of statistical analyses."
(this abstract link leads to a free download of the article)
Jelte M. Wicherts, Rogier A. Kievit, Marjan Bakker and Denny Borsboom. Letting the daylight in: reviewing the reviewers and other ways to maximize transparency in science. Front. Comput. Neurosci., 03 April 2012
Edit: looks like the domains "sciencetomatoes.com" and "rottenpvalues.com" are available ;) http://www.internic.net/whois.html
Then get a large sample of papers which are known to be true or false (or at least known to have replicated or not) and use some machine learning or statistical method to make decent predictions of how likely new papers are to be true or false.
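A hedged sketch of that idea (the features, labels, and numbers below are invented placeholders, purely to show the shape of such a model, not real data):

```python
# Fit a simple classifier on easily extracted paper features to estimate
# how likely a new paper is to replicate. Toy data for illustration only.
from sklearn.linear_model import LogisticRegression

# Each row: [reported p-value, sample size, number of reported tests] -- placeholder features
X_train = [
    [0.049, 24, 12],
    [0.001, 500, 2],
    [0.032, 40, 8],
    [0.0004, 1200, 3],
]
y_train = [0, 1, 0, 1]  # 1 = replicated, 0 = failed to replicate (made-up labels)

model = LogisticRegression().fit(X_train, y_train)
new_paper = [[0.04, 30, 10]]
print("estimated replication probability:", model.predict_proba(new_paper)[0][1])
```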
> Simple change increased conversion between -12% and 96%
That's not what a confidence interval is. A confidence interval is merely the set of null hypotheses you can't reject.
A credible interval (which you only get from Bayesian statistics) is the interval that represents how much you increased the conversion by.
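For contrast, a minimal sketch of a credible interval for a conversion lift, assuming Beta(1, 1) priors and made-up counts:

```python
# Draw from each arm's Beta posterior and report the central 95% interval
# of the relative lift (variant over control, minus one).
import random

random.seed(0)
control_conv, control_n = 120, 1000      # hypothetical counts
variant_conv, variant_n = 150, 1000

def posterior_sample(conversions, visitors):
    # Beta(1 + conversions, 1 + non-conversions) posterior for the conversion rate
    return random.betavariate(1 + conversions, 1 + visitors - conversions)

lifts = sorted(
    posterior_sample(variant_conv, variant_n) / posterior_sample(control_conv, control_n) - 1
    for _ in range(100_000)
)
lo, hi = lifts[int(0.025 * len(lifts))], lifts[int(0.975 * len(lifts))]
print(f"95% credible interval for relative lift: {lo:+.1%} to {hi:+.1%}")
```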
On the other hand, if you'd like to showcase actual insight and meaningful, surprising conclusions, it only makes sense to take the time to find the best dataset that supports them. Real researchers leave nothing to chance.
Indeed, often very legitimate science does exactly this.
Deceptive analysis is pretty easy to spot. A genuine work of data fabrication artistry is much, much harder.