
Some interesting concepts to consider: k-anonymity, ℓ-diversity, t-closeness.

From the paper (https://www.liebertpub.com/doi/full/10.1089/bio.2014.0069) I wrote a while ago:

While the risk of re-identification (of a record or individual participant) might be virtually non-existent with synthetic data, one could predict unknown attributes of a known individual, given an ideal model of synthesis. In other words, an attacker could find unknown attributes of some individual with a certain probability by looking for the closest match in the synthetic data. This is known as attribute disclosure.
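To make that concrete, here is a minimal sketch of the attack in Python: the attacker knows a target's quasi-identifiers (age and ZIP code here) and reads off the sensitive attribute of the nearest synthetic record. All column names and values are hypothetical, and the distance measure is deliberately naive.

    # Hypothetical attribute-disclosure attack against a synthetic table:
    # find the synthetic record closest to the target's known quasi-identifiers
    # and infer the sensitive attribute from it.

    synthetic = [
        {"age": 34, "zip": 90210, "diagnosis": "asthma"},
        {"age": 51, "zip": 10001, "diagnosis": "diabetes"},
        {"age": 45, "zip": 10003, "diagnosis": "hypertension"},
    ]

    target = {"age": 50, "zip": 10002}  # known quasi-identifiers of a real person

    def distance(record, target):
        # Crude normalized distance over the shared quasi-identifiers.
        return (abs(record["age"] - target["age"]) / 100
                + abs(record["zip"] - target["zip"]) / 100000)

    closest = min(synthetic, key=lambda r: distance(r, target))
    print("Inferred sensitive attribute:", closest["diagnosis"])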

There are several methods for quantifying attribute disclosure, most notably t-closeness, which is defined as: an equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

In short: the distribution of a sensitive attribute within any equivalence class should be no further than a distance t from its distribution in the table as a whole.
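A rough sketch of checking this on a toy table, using total variation distance as the distance measure for simplicity (the original t-closeness paper proposes Earth Mover's Distance); the column names and records are invented:

    from collections import Counter, defaultdict

    def distribution(values):
        counts = Counter(values)
        return {v: c / len(values) for v, c in counts.items()}

    def variation_distance(p, q):
        # Total variation distance between two discrete distributions.
        keys = set(p) | set(q)
        return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

    def satisfies_t_closeness(table, quasi_keys, sensitive_key, t):
        overall = distribution([row[sensitive_key] for row in table])
        classes = defaultdict(list)
        for row in table:
            classes[tuple(row[k] for k in quasi_keys)].append(row[sensitive_key])
        return all(variation_distance(distribution(vals), overall) <= t
                   for vals in classes.values())

    table = [
        {"zip": "476**", "age": "2*", "disease": "flu"},
        {"zip": "476**", "age": "2*", "disease": "cancer"},
        {"zip": "479**", "age": "4*", "disease": "flu"},
        {"zip": "479**", "age": "4*", "disease": "flu"},
    ]
    print(satisfies_t_closeness(table, ["zip", "age"], "disease", t=0.3))  # True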

Using the t-closeness metric circumvents issues associated with k-anonymity and ℓ-diversity. Briefly, k-anonymity requires that every combination of quasi-identifying attributes appears in at least k records, which introduces ambiguity into the data set. However, if all k records in an equivalence class share the same sensitive value, that value can still be resolved simply by elimination. The ℓ-diversity metric circumvents this problem by adding a further requirement: in addition to the class being present in at least k records, those records must contain at least ℓ 'well represented' values for the sensitive attribute. But if an attacker knows the real-world distribution of those values, attributes can still be disclosed with a certain probability, simply by combining different data sources.
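A quick illustration of both definitions on a toy table (column names and records invented): k-anonymity only counts records per quasi-identifier class, so it happily accepts a class whose sensitive values are all identical, which is exactly what ℓ-diversity (here the simplest "distinct values" variant) catches.

    from collections import defaultdict

    def equivalence_classes(table, quasi_keys):
        classes = defaultdict(list)
        for row in table:
            classes[tuple(row[k] for k in quasi_keys)].append(row)
        return classes

    def is_k_anonymous(table, quasi_keys, k):
        # Every combination of quasi-identifier values occurs in at least k records.
        return all(len(rows) >= k
                   for rows in equivalence_classes(table, quasi_keys).values())

    def is_l_diverse(table, quasi_keys, sensitive_key, l):
        # "Distinct" l-diversity: each equivalence class contains at least l
        # different sensitive values.
        return all(len({row[sensitive_key] for row in rows}) >= l
                   for rows in equivalence_classes(table, quasi_keys).values())

    table = [
        {"zip": "130**", "age": "<30", "disease": "heart disease"},
        {"zip": "130**", "age": "<30", "disease": "heart disease"},
        {"zip": "148**", "age": ">=40", "disease": "cancer"},
        {"zip": "148**", "age": ">=40", "disease": "flu"},
    ]
    print(is_k_anonymous(table, ["zip", "age"], k=2))           # True
    print(is_l_diverse(table, ["zip", "age"], "disease", l=2))  # False: first class is homogeneous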



