From the paper (https://www.liebertpub.com/doi/full/10.1089/bio.2014.0069) I wrote a while ago:
While the risk of re-identification (of a record or individual participant) might be virtually non-existent with synthetic data, one could predict unknown attributes of a known individual, given an ideal model of synthesis. In other words, an attacker could find unknown attributes of some individual with a certain probability by looking for the closest match in the synthetic data. This is known as attribute disclosure.
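To make the closest-match idea concrete, here is a minimal sketch of such an attack. Everything in it is an assumption for illustration, not from the paper: records are plain Python dicts, the attacker knows the quasi-identifiers `age` and `zip`, and the sensitive column is `diagnosis`; the distance function is a naive mix of absolute differences and mismatch counts.

```python
# Minimal sketch of a closest-match attribute-disclosure attack.
# Hypothetical data layout and column names; not taken from the paper.

def closest_match_attack(synthetic_rows, known, quasi_identifiers, sensitive):
    """Return the sensitive value of the synthetic record closest to `known`."""
    def distance(row):
        # Absolute difference for numeric quasi-identifiers, 0/1 mismatch otherwise.
        d = 0.0
        for qi in quasi_identifiers:
            a, b = known[qi], row[qi]
            if isinstance(a, (int, float)) and isinstance(b, (int, float)):
                d += abs(a - b)
            else:
                d += 0.0 if a == b else 1.0
        return d

    best = min(synthetic_rows, key=distance)
    return best[sensitive]

synthetic = [
    {"age": 34, "zip": "1071", "diagnosis": "asthma"},
    {"age": 35, "zip": "1071", "diagnosis": "diabetes"},
    {"age": 61, "zip": "2513", "diagnosis": "hypertension"},
]

# Attacker knows the target is 35 years old and lives in zip 1071:
print(closest_match_attack(synthetic, {"age": 35, "zip": "1071"},
                           ["age", "zip"], "diagnosis"))
# -> 'diabetes' (a guess that is correct only with some probability)
```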
There are several methods for quantifying attribute disclosure, most notably t-closeness, which is defined as: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.
In short: within any equivalence class, the distribution of the sensitive attribute should be no further than a distance t from its overall distribution.
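As an illustration, below is a minimal sketch of a t-closeness check for a categorical sensitive attribute. It assumes the same dict-of-records layout as the hypothetical sketch above, and it uses total variation distance between distributions; the original t-closeness proposal uses the Earth Mover's Distance, which for categorical values under an equal-distance ground metric reduces to this.

```python
# Minimal sketch of a t-closeness check (categorical sensitive attribute).
# Distance measure: total variation distance, as a stand-in for EMD.

from collections import Counter, defaultdict

def distribution(rows, sensitive):
    counts = Counter(r[sensitive] for r in rows)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def tv_distance(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def satisfies_t_closeness(rows, quasi_identifiers, sensitive, t):
    # Distribution of the sensitive attribute over the whole table.
    overall = distribution(rows, sensitive)
    # Group records into equivalence classes by quasi-identifier values.
    classes = defaultdict(list)
    for r in rows:
        classes[tuple(r[qi] for qi in quasi_identifiers)].append(r)
    # Every class distribution must be within distance t of the overall one.
    return all(
        tv_distance(distribution(cls, sensitive), overall) <= t
        for cls in classes.values()
    )
```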
Using the t-closeness metric circumvents issues associated with k-anonymity and ℓ-diversity. Briefly, k-anonymity requires that each combination of quasi-identifying attributes is present in at least k records, which introduces ambiguity into the data set. However, if all k records in an equivalence class share the same sensitive value, that value can still be resolved simply by elimination. The ℓ-diversity metric circumvents this problem by adding a further requirement: in addition to the class being seen in k records, those records must contain at least ℓ ‘well represented’ sensitive values. But if an attacker knows the real-world distribution of values, attributes could still be disclosed with a certain probability, simply by combining different data sources.
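For completeness, here is a similar sketch of k-anonymity and ℓ-diversity checks over the same hypothetical record layout. Note that ‘well represented’ has several formal variants (distinct, entropy, and recursive (c, ℓ)-diversity); only the simplest, distinct ℓ-diversity, is shown here.

```python
# Minimal sketch of k-anonymity and (distinct) l-diversity checks.

from collections import defaultdict

def group_by_quasi_identifiers(rows, quasi_identifiers):
    classes = defaultdict(list)
    for r in rows:
        classes[tuple(r[qi] for qi in quasi_identifiers)].append(r)
    return classes.values()

def satisfies_k_anonymity(rows, quasi_identifiers, k):
    # Every equivalence class must contain at least k records.
    return all(len(cls) >= k
               for cls in group_by_quasi_identifiers(rows, quasi_identifiers))

def satisfies_l_diversity(rows, quasi_identifiers, sensitive, l):
    # Every equivalence class must contain at least l distinct sensitive values.
    return all(len({r[sensitive] for r in cls}) >= l
               for cls in group_by_quasi_identifiers(rows, quasi_identifiers))
```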