

Quantitative Data Cleaning for Large Databases [pdf] - austengary
http://db.cs.berkeley.edu/jmh/papers/cleaning-unece.pdf

======
perturbation
Have there been any black hat/white hat discussions of these sorts of
techniques? I.e., how to detect intentional biases introduced into a dataset,
or faked lab data?

Most of what I remember about this is from the book "Bad Science" by Ben
Goldacre. Comparing different studies against one another is the gold standard
for determining a study's reliability, but of course there isn't always the
luxury of having an independent dataset for comparison.

Assuming the data-faking party hasn't been completely stupid and has produced
fake data following the expected distribution around each data point, the only
thing that comes to mind offhand is Benford's Law:
[http://en.wikipedia.org/wiki/Benford's_law](http://en.wikipedia.org/wiki/Benford's_law)
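The idea is that in many naturally occurring datasets the leading digit d appears with probability log10(1 + 1/d), so fabricated numbers with roughly uniform first digits stand out. A rough sketch of such a check (the total-absolute-deviation metric here is just an illustration; a real test would use something like a chi-squared statistic):

```python
import math
from collections import Counter

def benford_deviation(values):
    """Compare the first-digit distribution of `values` against
    Benford's Law, P(d) = log10(1 + 1/d), and return the total
    absolute deviation. Larger values = more suspicious."""
    # Extract the first significant digit of each nonzero value.
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    deviation = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = counts.get(d, 0) / n
        deviation += abs(observed - expected)
    return deviation

# Powers of 2 famously follow Benford's Law closely...
natural = benford_deviation([2 ** k for k in range(1, 200)])
# ...while uniformly drawn 3-digit numbers have flat first digits.
uniform = benford_deviation(list(range(100, 1000)))
print(natural < uniform)
```

Of course this only catches naive fakery; a forger who knows about Benford's Law can generate data that passes it.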

------
bernatfp
Why does HN create a Scribd link for every PDF posted?

~~~
_delirium
HN's owner (YCombinator) also owns a stake in Scribd.

