
Artificial data give the same results as real data without compromising privacy - sibmike
http://news.mit.edu/2017/artificial-data-give-same-results-as-real-data-0303
======
Cynddl
I'm highly dubious of the ability of synthetic data to accurately model
datasets without introducing unexpected bias, especially when it comes to accounting for causality.

If you dig through the original paper, the conclusion is along these lines:

 _“For 7 out of 15 comparisons, we found no significant difference between the
accuracy of features developed on the control dataset vs. those developed on
some version of the synthesized data; that is, the result of the test was
False.”_

So, on the tests they developed, the proposed method doesn't work 8 times out
of 15…

~~~
hyperbovine
Agreed, seems suspect. If they are really able to learn the population-level
distribution, then why even bother generating fake data? Just release that
instead.

~~~
bunderbunder
Well, just knowing a few distributions wouldn't be great for building machine
learning models.

------
sriku
I haven't read the original paper (yet), but something doesn't sit right with
the work, if the way it is portrayed is indeed faithful to it and I'm not
missing something important.

\- It looks like the work of the data scientists will be limited to the extent
of the modeling already done by recursive conditional parameter aggregation.
(edit: So why not just ship that model and adapt it instead of using it to
generate data?)

\- Its "validation" appears to be doubly proxied - i.e. the normal performance
measures we use are themselves a proxy, and now we're comparing those against
these performance measures derived from models built out of the data generated
by these models. I'm not inclined to trust a validation that is so removed.

Can anyone explain this well?

~~~
jj12345
Just finished the paper, so let me take a stab:

Peeling back the mystery a bit, what is happening is:

1\. From each child table upwards, model each column with a simple distribution
(e.g. Gaussian) and compute a covariance matrix for the table.

2\. Given those child table distribution parameters, pass them back as row
values to their respective parent tables.

What you end up with is a "flattened" version of each parent table that has
the information (in an "information theoretic" sense) of all child relations.
Sampling from distributions is straightforward. The stats methods are
outlined in section 3 of the paper.

Things of note:

\- The paper makes heavy use of Copula transformations to normalize data
whenever it passes around the distribution parameters.

\- It deals with missing values by adding something like a dummy column.

\- The key insight is that columns must be represented by parameterized
distributions, but they don't have to be Gaussian. The Kolmogorov-Smirnov test
is used to choose the "best fit" CDF to model (rough sketch below).
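
To make that concrete, here's a rough sketch of the single-table part as I read it. This is my own scipy paraphrase, not the authors' code: the candidate distribution list, the numeric-columns-only assumption, and the function names are all made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Small family of candidate marginals; the "best fit" is picked via the KS statistic.
CANDIDATES = [stats.norm, stats.expon, stats.uniform]

def fit_columns(df):
    fits = {}
    for col in df.columns:
        x = df[col].to_numpy(dtype=float)
        best = min(CANDIDATES,
                   key=lambda d: stats.kstest(x, d.name, d.fit(x)).statistic)
        fits[col] = (best, best.fit(x))
    return fits

def to_copula_space(df, fits):
    # Gaussian copula step: push each column through its fitted CDF, then the normal PPF.
    z = {c: stats.norm.ppf(np.clip(d.cdf(df[c].to_numpy(dtype=float), *p), 1e-6, 1 - 1e-6))
         for c, (d, p) in fits.items()}
    return pd.DataFrame(z)

def sample_table(df, n, seed=0):
    fits = fit_columns(df)
    cov = np.cov(to_copula_space(df, fits).to_numpy().T)   # covariance in copula space
    z = np.random.default_rng(seed).multivariate_normal(np.zeros(len(fits)), cov, size=n)
    u = stats.norm.cdf(z)                                   # back to uniforms...
    return pd.DataFrame({c: d.ppf(u[:, i], *p)              # ...then invert each marginal
                         for i, (c, (d, p)) in enumerate(fits.items())})
```

The parent/child flattening would then append each child table's fitted parameters (and covariance entries) as extra numeric columns on the parent rows before doing the same thing one level up.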

To your question about the role of the data scientists: they are using the
resulting simulations to solve more complex tasks. The goal of the experiment
was to see how well the sample data would perform against Kaggle competitions.
So I guess the idea was that if winners were indistinguishable, the
simple/hierarchical distributions would be considered robust enough for
complex tasks. In the end, I'm sure shipping the underlying model is preferable
for consumers.

~~~
sriku
(Going through the paper .. a few questions/notes)

Table modeling: While column distributions are picked using the KS-test, the
covariance matrix calculation first normalizes the column distributions.
Assuming that is reasonable, there is a claim of "this model contains all the
information about the original table in a compact way..", but it doesn't
account for possible multi-dimensional relationships in the data. It only
looks at a series of projections to 2D. Can a d-dimensional dataset (in
practice) be effectively summarized by the set of projections onto the
d(d-1)/2 two-dimensional subspaces? That's one kind of summary, but I'm
unsure whether it is adequate for practical modeling work, especially if
folks try to apply high-dimensional techniques (DL?) to this. (edit: I feel
reasonably sure it _isn't_ adequate. If a column ends up being bimodal, for
example, even that gets lost in translation in this approach?)
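
(A toy illustration of that worry, with numbers I made up rather than anything from the paper: fit the best single Gaussian to a bimodal column and the resampled column looks nothing like the original.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 5000), rng.normal(3, 1, 5000)])  # bimodal column

mu, sigma = stats.norm.fit(x)          # best single-Gaussian summary of the column
fake = rng.normal(mu, sigma, 10000)    # what the synthesized column would look like

# Large two-sample KS distance: the synthetic column is unimodal with most of
# its mass near 0, exactly where the real column has almost none.
print(stats.ks_2samp(x, fake).statistic)
```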

Crowdsourced validations: The synthetic sets were generated for already
available public datasets. It isn't clear from the paper how any bias
resulting from prior familiarity with the public datasets would be accounted
for in the study concluding equivalence.

Privacy claims: This is a bit unclear. The "apply random noise" technique
seems to suggest something similar to differential privacy, but makes no
mention of it. If not DP, what definition of "privacy" is being used here?
(I'm ok that proving their algorithm to be privacy safe according to a chosen
definition of privacy may be out of scope of the paper.)

(Edit2: I can't shake the feeling that this paper is an elaborate April
Fools' joke released early ;)

~~~
jj12345
To the first point, the paper mentions that "the covariance is calculated
after applying the Gaussian Copula to that table". The experiments seem to
show that, for their datasets, the 2D projections work alright. I
think that the surprising conclusion is that this works so well for any
dataset at all.

Just thinking out loud here:

The typical case where a low dimensional representation would fail you is if
you had dependencies (e.g. bimodal relations) that weren't represented by a
datatype or foreign key. Recall that the simulation of data still occurs
within each table, so the higher the non-represented inter-table
dimensionality is, the less the supplied distributions can capture it. It
might be that, for the most part, the raw columns (not from child tables) have
much more bearing on the merit of the table covariance. This seems natural, due to
the semantic nature of RDBMS structures.

It's probably an important caveat that typical RDBMS structures are created to
optimize the user's understanding of the data through semantic structure.
Since the claim of the paper was only that they could provide a useful
abstraction for simulation, I think it's OK to proceed with the assumption
that Gaussians can never be fully sufficient for modeling high-dimensional
data without help.

There are existing non-parametric models that attempt to do a similar thing
for relational data that I think are more promising. One drawback of current
solutions like BayesDB is that you're still dealing with the original table
structure, which this paper tries to get around. It would be nice to bridge
the gap for something like PyMC3 where we find a cute way to flatten the data,
like this paper.

[1] Probabilistic Search for Structured Data via Probabilistic Programming and
Nonparametric Bayes.
[https://arxiv.org/pdf/1704.01087.pdf](https://arxiv.org/pdf/1704.01087.pdf)

[2]
[http://probcomp.csail.mit.edu/bayesdb/](http://probcomp.csail.mit.edu/bayesdb/)

[3] [http://docs.pymc.io/index.html](http://docs.pymc.io/index.html)

------
mehrdadn
On a parallel note, search for "thresholdout". It's another (genius, I think)
way to "stretch" how far your data goes in training a model. I won't do a
better job of explaining it than those who already have, so I won't try;
here's a nice link explaining it instead:
[http://andyljones.tumblr.com/post/127547085623/holdout-reuse](http://andyljones.tumblr.com/post/127547085623/holdout-reuse)

~~~
claytonjy
I got really excited about thresholdout a couple weeks ago, but I've since
cooled; setting the threshold seems like too much black magic.

I thought the Zillow blogpost [1] was a nice intro (and I'm a sucker for
Seinfeld references), and it demonstrates the sensitivity to the threshold
value in a way the original academic authors never did.

[1]: [https://www.zillow.com/data-science/double-dip-holdout-set/](https://www.zillow.com/data-science/double-dip-holdout-set/)
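
For anyone who hasn't seen it, the mechanism is roughly this (my paraphrase of the reusable-holdout idea from Dwork et al.; the threshold and noise scales below are made-up values, and tuning them is exactly the black-magic part):

```python
import numpy as np

rng = np.random.default_rng(0)

def thresholdout(train_stat, holdout_stat, threshold=0.04, sigma=0.01):
    """Answer a query about the holdout set without leaking too much of it.

    train_stat / holdout_stat are the same statistic (say, a candidate feature's
    correlation with the label) computed on the train and holdout sets.
    """
    if abs(train_stat - holdout_stat) < threshold + rng.laplace(0, 2 * sigma):
        return train_stat               # the sets agree: answer from train, leak nothing new
    return holdout_stat + rng.laplace(0, sigma)  # they disagree: answer with a noised holdout value
```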

------
lokopodium
They use real data to create artificial data. So, real data is still more
useful.

~~~
function_seven
The idea is to sidestep the need to access private information in order for
researchers to do their work. So in this case, the artificial data is more
useful, since the real data is inaccessible.

~~~
kalyanv
Hi, I'm one of the authors of this work. We're very proud that this has
attracted so much attention on Hacker News. I'm happy to answer a few
questions.

We had two requirements for the synthetic data. From the paper: “This
synthetic data must meet two requirements:

1\. it must somewhat resemble the original data statistically, to ensure
realism and keep problems engaging for data scientists.

2\. it must also formally and structurally resemble the original data, so that
any software written on top of it can be reused.”

Our goal was as follows:

* Provide synthetic data to users - data scientists similar to the ones who engage on Kaggle.

* Have them do feature engineering and provide us the software that created those features. Feature engineering is a process of ideation and requires human intuition. So being able to have many people work on it simultaneously was important to us. But it is impossible to give real data to everyone.

* They submit this software and we execute it on the real data, train a model and produce predictions for test data.

* In essence, their work is being evaluated on the real data - by the data holder - us.

The tests we performed:

* We gave 3 groups different versions of synthetic data (and in some cases added noise to it).

* We gave a 4th group the real data.

* We did not tell the users that they were not working on real data.

* All groups wrote feature engineering software looking at the data they got.

* We took their software, executed it on the real data, and evaluated their accuracy in terms of the predictive goal.

* We did this for 5 datasets

* Our goal was to see whether the group that had access to the real data came up with better features. With 5 datasets and 3 comparisons per dataset, we had 15 tests.

Results:

* In 7 of those we found no significant difference.

* In 4, we found that the features written by users looking at the synthetic dataset were, in fact, better performing than the features generated by users looking at the real dataset.

What can we conclude:

* Our goal was to enable crowdsourcing of feature engineering by giving the crowd synthetic data, gather the software they write on top of the synthetic data (not their conclusions) and assemble a machine learning model.

* We found that this is feasible.

* While the synthetic data captures as many correlations as possible, in general the requirement is only that it be good enough that the user working on it does not get confused: they can roughly understand the relationships in the data, intuit features, write software, and debug. That is, they may conclude that a particular feature is better for predictions than another, inaccurately, based on the dataset they are looking at, and that is ok. Since we are able to get many contributions simultaneously, the features one user misses can be generated by others.

* We think this methodology will work only for crowdsourcing feature engineering - a key bottleneck in the development of predictive models.

~~~
srean
It would be great to have a link to the paper. Is it on arXiv or anywhere else
we can download it from?

I was speculating wildly here:
[https://news.ycombinator.com/item?id=16621633](https://news.ycombinator.com/item?id=16621633)
Is any of that remotely close?

------
pavon
If I was responsible for protecting privacy of data, I don't know that I would
be comfortable with this method. Anonymization of data is hard, and frequently
turns out to be not as anonymous as originally thought. At a high level, this
sounds like they are training a ML system on your data, and then using it to
generate similar data. What sort of guarantees can be given that the ML system
won't simulate your data with too high of fidelity? I've seen too many image
generators that output images very close to the data they were trained on. You
could compare the two datasets and look for similarities, but you'd have to
have good metrics of what sort of similarity was bad and what sort was good,
and I could see that being tricky, in both directions.

Although, I suppose that if the data was already anonymized to the best of
your ability, and then this was run on top of that as an additional layer of
protection, that might be okay.

------
lopmotr
I wonder how secure it is against identifying individuals. With over-fitting,
you can end up reproducing the training data as output. Hopefully they have a
robust way to prevent that, or any other kind of reverse engineering of the
output that could somehow work out the original data.

------
srean
Could not get hold of the paper. Are they doing Gibbs sampling or a
semiparametric variant of that?

[https://en.wikipedia.org/wiki/Gibbs_sampling](https://en.wikipedia.org/wiki/Gibbs_sampling)

Generating tuples (rows) by Gibbs sampling will allow generation of samples from
the joint distribution. This in turn would preserve all correlations,
conditional probabilities, etc. It can be done by starting at an original
tuple chosen at random and then repeatedly mutating the tuple by overwriting one
of its fields (columns). To overwrite, one selects another random tuple that
'matches' the current one at all positions other than the column selected for
overwriting. The match might need to be relaxed from an exact match to a
'close' match.
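
For concreteness, a rough sketch of that tuple-mutation scheme (exact matching stands in for the 'close' match, and the function name is made up):

```python
import numpy as np
import pandas as pd

def gibbs_resample_row(df: pd.DataFrame, n_sweeps: int = 20, seed: int = 0) -> pd.Series:
    """Generate one synthetic row by repeatedly overwriting single fields."""
    rng = np.random.default_rng(seed)
    row = df.iloc[rng.integers(len(df))].copy()        # start at a random original tuple
    cols = list(df.columns)
    for _ in range(n_sweeps):
        col = cols[rng.integers(len(cols))]            # pick a field (column) to mutate
        others = [c for c in cols if c != col]
        # tuples that 'match' the current one on every other column
        candidates = df.loc[(df[others] == row[others]).all(axis=1), col]
        if len(candidates):                            # draw from the conditional for this column
            row[col] = candidates.iloc[rng.integers(len(candidates))]
    return row
```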

If the conditional distribution for some conditioning event has very low
entropy (or the overall conditional entropy is low), one would need to fuzz the
original data to preserve privacy, but this will come at the expense of
distorting the correlations and conditionals.

~~~
malshe
I could download it from here: [https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf](https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf)

Are you having any trouble accessing this link?

~~~
srean
Ah! Thanks, it works.

------
_0ffh
Seems like it's only helpful for testing methods that can't capture any
correlations the original method didn't.

------
odomojuli
Is this akin at all to random sampling with replacement, i.e. bootstrapping?

~~~
myopicgoat
No, because that would take full rows of the feature matrix (thereby
corresponding to the full information of one individual). The idea here is to
“generate” rows corresponding to plausible artificial individuals. That way
you can give a third party artificial data to build an ML model without
compromising (too much) the privacy of the real individuals in the initial
data.
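
Roughly the difference, in a toy example with made-up columns (the second half is just a Gaussian stand-in for "fit the statistics, then sample", not the paper's exact method):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, 45, 31, 52], "income": [40_000, 90_000, 55_000, 120_000]})

# Bootstrapping: resample whole rows, so every output row IS some real individual.
bootstrap = df.sample(n=4, replace=True, random_state=0)

# Synthesis: fit marginals + correlations, then draw brand-new rows that only
# share the joint statistics with the real individuals.
mean, cov = df.mean().to_numpy(), df.cov().to_numpy()
synthetic = pd.DataFrame(
    np.random.default_rng(0).multivariate_normal(mean, cov, size=4),
    columns=df.columns,
)
```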

------
EGreg
How is this related to and different from _differential privacy_?

~~~
majos
Differential privacy is a formal guarantee of an algorithm. Roughly, given
algorithm A that takes input database X, we say A is differentially private
if, for any X' differing in at most one row from X, the output distributions of
A(X) and A(X') are similar. So to say an algorithm is differentially private
you need to prove a claim like this.
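
Formally (this is the standard ε-DP definition, not something specific to the paper): for every pair of databases X, X' differing in at most one row and every set S of possible outputs, Pr[A(X) ∈ S] ≤ e^ε · Pr[A(X') ∈ S].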

It's hard to compare to this paper, because this paper's privacy claims appear
to be _heuristic_, not formal. This isn't necessarily bad, since existing
approaches for constructing synthetic data in a differentially private way are
still not very practical. But heuristics necessarily lack provable privacy
guarantees, so there's no proof that something very bad privacy-wise can't
happen with sufficiently clever processing of the synthetic data.

~~~
jj12345
To add to this answer: the methods outlined in the paper allow for perfect
reconstruction of the underlying data in many cases, as the simulation of data
is simply sampling from fitted distributions.

------
fardin1368
I am looking into their experiments. Seems most of them are pretty simple
predictions/classifications. No wonder they get good results.

The claim is too bold and I would reject this paper. They should clarify that
the data is good enough for linear regression, not say that there is no
difference between real and synthetic data.

------
dwheeler
The abstract claims there was no difference only 70% of the time. So 30% of
the time there was a difference. Unsurprisingly it greatly limits the kind of
data analysis that was allowed, which greatly reduces the applicability even
if you believe it. I'm pretty dubious of this work anyway.

------
anon1253
Heh. I wrote a paper about this a while ago
[https://www.liebertpub.com/doi/full/10.1089/bio.2014.0069](https://www.liebertpub.com/doi/full/10.1089/bio.2014.0069)

------
aspaceman
Does someone have a link to the preprint / arxiv? The link in the story is a
404 (I presume that the paper just hasn't been posted yet or something?)

~~~
krab
I've found these documents:

\- [https://dspace.mit.edu/handle/1721.1/109616#files-area](https://dspace.mit.edu/handle/1721.1/109616#files-area)

\- [https://pdfs.semanticscholar.org/64ad/643e8084486ca7d3312ed4...](https://pdfs.semanticscholar.org/64ad/643e8084486ca7d3312ed491a814d3fe440c.pdf)

------
sandGorgon
Sounds very similar to homomorphic encryption, except with no compromise in
performance.

I wonder if this is the technique behind Numerai.

------
bschreck
The link to the actual paper is now working

