Amnesia – High-Accuracy Data Anonymization (openaire.eu)
150 points by T-A on Dec 2, 2020 | 96 comments

Fair warning: anonymization is a hard problem. It is never easy, and you'd be surprised how many bits can leak out of what you thought was properly anonymized data.

If you are using data for test purposes, please use generated data, not anonymized data. This has the additional advantage that there is no potential path for live data to end up on a developer's machine.

Added in edit: also realize that using a service such as this actually increases the chances that you leak sensitive data; uploading it somewhere is the very best way to ensure that at some point there is a breach. Don't take the 'made easy' line for granted. If possible, ask the company for their audit reports and for the measures they have in place to ensure that your data doesn't end up elsewhere; a company providing such a service, or a chunk of software that does this, is of course a massive target. The only way to stay away from that extra risk is to run this on your own premises, on a machine that is not connected, directly or indirectly, to the outside world.

Earlier this year we saw articles published about the finding of an Earth-sized rogue planet in our galaxy. The way it was discovered is remarkable. A telescope was looking at a distant star, then the star seemingly brightened over a period of 42 minutes. And that was basically it.

When a massive object passes between a distant star and an Earth-based observer, the light coming from the star gets deflected and focused by the gravity of the massive object, so the star seemingly brightens. This is called gravitational microlensing. The duration and shape of the brightening allowed scientists to determine that the lens was likely a planet of somewhere between Mars and Earth mass, and that it likely had no host star within 8 astronomical units. It's likely a rogue planet of roughly Earth size.
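A back-of-the-envelope sketch of why a ~40-minute event points at a planetary-mass lens: the Einstein crossing time scales with the square root of the lens mass. The distances and proper motion below are assumed typical Galactic-bulge values, not the actual fitted parameters of this detection:

```python
import math

G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
C = 2.998e8          # speed of light, m/s
KPC = 3.086e19       # one kiloparsec in metres
M_EARTH = 5.972e24   # Earth mass, kg

def einstein_crossing_time(m_lens_kg, d_lens_m, d_source_m, mu_rel_rad_s):
    """Angular Einstein radius theta_E, and the time the source takes to
    cross it -- the characteristic duration of a microlensing event."""
    theta_e = math.sqrt(4 * G * m_lens_kg / C**2
                        * (d_source_m - d_lens_m) / (d_source_m * d_lens_m))
    return theta_e / mu_rel_rad_s

# Assumed geometry: lens at 4 kpc, source at 8 kpc, relative proper
# motion ~10 mas/yr (1 arcsec = 4.848e-6 rad).
mu_rel = 10e-3 * 4.848e-6 / 3.156e7   # 10 mas/yr in rad/s
t_e = einstein_crossing_time(M_EARTH, 4 * KPC, 8 * KPC, mu_rel)
print(f"Einstein crossing time: {t_e / 60:.0f} minutes")
```

An Earth-mass lens gives a timescale on the order of an hour; a stellar-mass lens would give weeks, which is why such a short event screams "tiny lens".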

Think about how little actual information these astronomers had. Yet they were able to make a very credible prediction on what happened. You see this all over in science, particularly in physics, where the truth is coaxed out of very little direct data. This makes me think that similar things can probably be done with data about people. This would mean that effectively anonymizing people's data is very hard or maybe even impossible.

Indeed, the ability to very accurately infer a lot from a very small amount of data is something I've long thought of as the "Sherlock Holmes" problem. If there are just a few people with the ability to deduce lots of things, they can be amusing and offer some limited utility, like Sherlock Holmes.

Computers are powerful enough, though, and people clever enough, that everyone can now have a Sherlock Holmes in their pocket. And large corporations can have smarts far beyond Sherlock Holmes. So what once was an "eh, there's no rule against it; if you have the information, the smart guy should be able to use it" is now a case of "oh dang, hmmm, maybe civilization is built on the idea that not everyone is Sherlock Holmes."

Maybe this should have been my own blog post somewhere, but anyway, it's one of the many new conundrums of our time.

I think it's the same problem as surveillance. When tailing someone required a substantial investment of an actual person's time, and a wiretap literally required physical handling of the lines to that person's phone, it made sense to give the police wide latitude in whom they put under the lens: the cost in police time was intrinsically limiting and could not scale to the population level.

Enter today, when nearly everyone can be, and is, surveilled in detail that cops of previous generations could only dream of. We're still operating under those same laws, but we've removed the unwritten premise they were based on: that surveillance couldn't scale.

I, for one, would be very interested in reading a more in-depth exploration of this idea, and would encourage you to write and publish that blog post.


33 bits... that's all you need.
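For anyone who hasn't seen it: the "33 bits" figure is just log2 of the world population, and quasi-identifiers burn through it fast. A quick sketch (the population figure and US ZIP count are rough, and the independence assumption overstates the leak a bit):

```python
import math

world_population = 7.8e9
print(math.log2(world_population))   # ~33 bits pins down one person

# Each quasi-identifier leaks a few bits; they add up fast:
bits = {
    "birth date (1 of ~36,500 days)": math.log2(36_500),
    "5-digit ZIP code (~42,000 in the US)": math.log2(42_000),
    "sex": math.log2(2),
}
print(sum(bits.values()))   # ~31.5 bits: nearly enough on its own
```

This is essentially the classic ZIP + birth date + sex result: three "harmless" fields uniquely identify most of the population.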


Alternatively, look for an open-licensed dataset if one exists in your domain (e.g. using https://fairsharing.org or, shameless plug, https://biokeanos.com). With generated data you always add some assumptions; with more 'wild' data you have a chance to discover edge cases earlier.

Self plug, I wrote about some aspects of why this is a hard problem a couple years ago, https://goteleport.com/blog/hashing-for-anonymization/

First sentence in [1]:

Amnesia is an application written in java and JavaScript and should be used locally for anonymizing a dataset.

[1] https://amnesia.openaire.eu/about-documentation.html

Agree with your point about anonymization, but here is a caveat about generated data too: if the objective is to build a model, you might end up losing information, or worse, your model might end up modelling the assumptions baked into the generation process.

Yes, that's absolutely valid. This too is hard. But then again, if it were easy everybody would be doing it, so consider it a problem that, when solved properly, becomes part of your moat, and a thing you can mention in a sales process to ensure a level playing field with other parties pitching.

What about building machine learning models that make predictions on said data? Can't just test on fake data.

I have made another comment on this page, and this is indeed a problem, barring specific cases where you have a "simulator" and you want to learn something applicable to some "aggregate of simulations". For example, learning a Reinforcement Learning policy for tic-tac-toe is fine, because you can build a tic-tac-toe simulator and you want a universal policy.

The problem goes beyond testing. Your training might (1) be deprived of information the original data had: among 1000 features, the key to the classification of interest might be a single feature; how does your generator know not to distort it, potentially at the cost of distorting the other 999? Or it might (2) latch onto the assumptions made by the data generator: for a continuous-valued feature, your model might bias itself towards the distribution moments the generator assumed.

I am not sure generating good fake data is a different problem from good density estimation. And in some cases you might need to specify which parts of the original distribution you don't want the generator to mess with: consider an NLP dataset where your model must rely on sentence structure. Generating the right bag-of-words features won't help there; sequence matters. The same goes if you wanted to use contextual embeddings: sequence matters then too.
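A toy sketch of the failure mode being described, assuming a naive generator that fits each feature's marginal separately: the marginals look right, but the joint structure a model would need to learn from is gone (all data invented here):

```python
import random
import statistics

random.seed(0)

# "Real" data: two strongly correlated features (stand-in for the
# structure the 1-in-1000 key feature carries).
real = [(x, x + random.gauss(0, 0.3))
        for x in (random.gauss(0, 1) for _ in range(5000))]

def corr(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = statistics.fmean((x - mx) * (y - my) for x, y in pairs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

# Naive generator: fit each marginal on its own, sample independently.
xs, ys = zip(*real)
mx, sx = statistics.fmean(xs), statistics.pstdev(xs)
my, sy = statistics.fmean(ys), statistics.pstdev(ys)
fake = [(random.gauss(mx, sx), random.gauss(my, sy)) for _ in range(5000)]

print(f"real corr: {corr(real):.2f}")   # close to 1
print(f"fake corr: {corr(fake):.2f}")   # close to 0
```

Every per-feature summary statistic of the fake data matches the real data, yet a model trained on it can no longer learn the relationship between the two features.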

Even if you did manage to generate a "distributionally compatible" version of the data, you could run into problems wherever you perform some kind of data enrichment at a later stage. For example, if the original data has zip codes that you wanted to mask, and your data generator substitutes them with arbitrary strings, then at a later point you cannot introduce a feature that measures the proximity of two locations.

This isn't my area of expertise, but I've spoken to computer vision researchers who apparently use generated data for training models for self-driving vehicle autonomy. Maybe they only use generated data for the train set and then do cross-validation on real data? I'd like to hear them chime in on this thread if any are reading here.

Theoretically speaking if the generated data has the same distribution and parameters as the real data [1], and encodes similar nonparametric features like seasonality and user activity, I think generated data might be fine. [2]


1. Admittedly tricky if you have limited data and no insight into the underlying population distribution/features, just those of the sample. But then you have a worse problem for modeling diagnostics anyway.

2. In the sense that anything is "fine", which is a spectrum that requires some critical skepticism in statistics. There are always caveats but it may still be robust and useful.

You could kickstart training on simulators and then do a transfer, i.e. make adjustments to your final model on real-world data. But to learn only on generated data, the problem boils down to the nonparametric features you use to state that the generated data is similar to the real data. What feature is complex enough to say that two images are equivalent? They might be statistically equivalent according to your features, but are they really? I think this is a very hard problem; if we had a good answer to this question, Tesla & co. would already be training their models on perfect simulators and we wouldn't see the glitches currently found in autonomous driving applications.

That's what I figured, re: transfer modeling. Thanks for chiming in.

ARX (see other comment in this thread) also supports data anonymization for privacy-preserving machine learning.

There is a German company that specializes in this: https://www.statice.ai/ , and that too is a path that you should only walk if you fully understand the subject matter.

So you could instead test on fake data that has the same (or as much as possible the same) statistical properties as the data that you would like to use.

Hah! Came here to echo what you wrote. I've been in (too) many data anonymizing/sanitizing efforts. It's anything but easy. I would strongly consider investing in test data generation instead.

How are you going to do COVID-19 research with generated test data?

jacquesm was talking about test purposes, which is a different problem.

Sure, but I don't think that is the use case for this tool.

If you're interested in tools such as Amnesia, you might also want to take a look at ARX, which supports many more anonymization methods, including Differential Privacy:



Disclosure: I'm the main author of ARX.

And it has a huge advantage: it runs locally.

Amnesia doesn't...? Then what's the download link and github page for?

The README is not user friendly and doesn't explain how to use or install the software. In the GitHub issue, the developer points to installation documentation that throws a 404 error.

In my opinion, this source code was put out to appear "open" to the H2020 Programme, but they have no intention of actually helping users run the code locally.

Yeah, wow. This is one of the more aggressive examples of "open"-but-unrunnable I've ever seen. I think your opinion is dead on.

And if this isn't intentional, then the level of effort is so low that you couldn't trust the software with such a complex and subtle problem anyway. It is clearly intentional, though; they didn't even give themselves a plausibly deniable story.

True, Amnesia can also be run locally!

That's not at all obvious from the marketing pages, possibly a point of improvement? It would be my #1 concern.

Download link at the top https://amnesia.openaire.eu/download.html

Less prominent than the "online version" button though.


> Amnesia Desktop Version 1.2.2

> Available for Windows and Linux

> [ mockup that is supposed to resemble a macOS app ]


Obviously their designer is using macOS. :)

This is not easy. It is not an easy thing to do while protecting the data. Please don't call this easy.

Schools think it's enough anonymization to refer to pupils by their initials while emailing grades, sensitive data, and characterizations about them back and forth. It's a rampant problem.

For data to be anonymous under GDPR, it is not enough that individuals cannot be identified from the anonymized data set. If individuals can be identified when the anonymous data set is compared with the source data set, the anonymized data is not "anonymous".

For data to be truly anonymous under GDPR, there must be no other additional data that would allow for reidentification. If there is any other data that, when combined with the anonymous data, allows for reidentification, the data set is only pseudonymous and must be treated as personal data under GDPR.

Can someone explain the point of this requirement? If a malicious actor has access to the source data there's no need to compare it to anonymized data. What am I missing?

It's an easy-to-state largely foolproof test to see if data really is anonymized.

The thing that you're worried about with poorly-anonymized datasets is that if you have another non-anonymized dataset you can combine them to deduce the original information. "Your data set must not be able to be combined with any others that would allow them to infer the original data" is hard. How could you possibly test them all?

Well, it turns out that there is one such non-anonymized dataset with the property that if you can't connect your anonymized data with it at all, then you can be pretty sure that you couldn't connect it with any others: the original data!
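A toy sketch of that test, with invented records: strip the names, keep ZIP + birth year + sex, and every "anonymized" row still links back to exactly one source row, so the data fails the test:

```python
# Names were stripped, but ZIP + birth year + sex survive as
# quasi-identifiers. (All records invented for this sketch.)
source = [
    {"name": "Alice", "zip": "02139", "born": 1985, "sex": "F", "dx": "flu"},
    {"name": "Bob",   "zip": "02139", "born": 1972, "sex": "M", "dx": "asthma"},
    {"name": "Carol", "zip": "94110", "born": 1985, "sex": "F", "dx": "gout"},
]
anonymized = [{k: v for k, v in row.items() if k != "name"} for row in source]

QUASI = ("zip", "born", "sex")

def relink(anon_rows, source_rows, quasi=QUASI):
    """For each anonymized row, find every source row matching on the
    quasi-identifiers. A unique match means re-identification."""
    index = {}
    for row in source_rows:
        index.setdefault(tuple(row[q] for q in quasi), []).append(row["name"])
    return [index.get(tuple(row[q] for q in quasi), []) for row in anon_rows]

matches = relink(anonymized, source)
print(matches)   # every row links back to exactly one name
```

If the release were properly generalized (ZIP truncated, birth year bucketed), each candidate list would contain several names instead of one.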

Let's say you're doing a study of fingerprint patterns. You anonymize a collection of fingerprints from a non-anonymized source by stripping everything but the fingerprint images. Because fingerprints are unique it seems like it'd be impossible to meet the GDPR criteria; even if the only thing that was left was the fingerprints, when compared against the source dataset they will be identified. a) is this interpretation accurate? b) if so, it seems that there's large swaths of data that can never be in compliance. What are the implications for medical research, for instance?

I think you nailed it that some data can’t really be anonymized. How could you anonymize emails, names, social security numbers, DNA samples?

You don't have to use anonymized data all the time; it's just that the requirements for handling and passing around such data are lower.

I don't understand the point though; if someone has the source data, what good is the anonymized data to them? What value is added by requiring more stringent safeguards on data that can't be anonymized this way?

If someone makes inferences on the de-identified data, or joins it against another dataset, the source dataset lets those inferences or joins be tied back to the original identifying data.

The main point is that de-identified data can still be "personal", so it's regulated. If you share or make public pseudonymous data, that data is still covered by GDPR, so you have to inform the individuals, have a legal basis (such as consent), let them opt out (if applicable), etc. Even if it's been pseudonymized, I would want to know if/when my data is sold to a marketing firm or whatever.

> The source dataset lets those inferences or joins be tied back to the original identifying data.

But if the attacker lacks the source dataset, they can't do this, and if they possess the source dataset, they'd use it for their analysis rather than using the anonymised dataset.

The point is that if the attacker can connect your user record in the source data with user # 188da24a7789d in the "anonymized" data, they can use that to de-identify all information derived from or built on the "anonymized" data.

Oh, there is a Netflix account for user # 188da24a7789d and the IRS released tax summaries for user # 188da24a7789d? That's interesting, since I know that user # 188da24a7789d is really MaxBarraclough.

Suppose a dataset removes all information except a user's fingerprints, so the only thing stored in the anonymous dataset is an image of a fingerprint. The nature of fingerprints prevents the dataset from meeting this requirement as stated, which effectively eliminates any research that could be done with the data. Given that the only way the dataset could be linked back to the original user is if an attacker already had access to the source data, how does this regulation benefit anyone?

Privacy and security are not the same. Security is about protecting against malicious actors. Privacy is about protecting data from everyone who isn't the person the PII is about.

Google's Pair group has a great explainer here: https://pair.withgoogle.com/explorables/anonymization/

That's the most concise and clear formulation of that I've seen so far. Thanks.

k-anonymity is often only applied to "pseudoidentifiers"; if you have the original dataset, it'd be trivial to reverse k-anonymity applied that way. For example, someone's blood pressure isn't considered an identifying variable and would not need to be anonymized (and should not be, to keep data utility high), but this makes linking against the original dataset trivial.
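A minimal sketch of exactly that failure (invented numbers): ages are generalized into a k-anonymous bracket, but the untouched blood-pressure reading joins straight back to the source:

```python
# k-anonymity is applied only to the designated quasi-identifier (age,
# generalized to a bracket); blood pressure is treated as a
# "non-identifying" measurement and kept exact.
source = [
    {"name": "Dan", "age": 34, "bp": 127},
    {"name": "Eve", "age": 36, "bp": 141},
    {"name": "Fay", "age": 38, "bp": 119},
]
released = [{"age_bracket": "30-39", "bp": r["bp"]} for r in source]

# The age bracket is 3-anonymous, but an attacker holding the source
# (or any dataset with the same readings) joins on the exact bp value:
bp_index = {r["bp"]: r["name"] for r in source}
reidentified = [bp_index[row["bp"]] for row in released]
print(reidentified)   # all three names recovered
```

The general lesson: any retained high-precision value, not just the fields you labelled "identifying", can serve as the join key.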

You are right: time-series data, like heart rate over time, does not lend itself nicely to anonymization. The provider will most likely have to ask the user organizations what kind of measures (features) they need and return an average (if that's what the receiving organisation is after), which can itself be k-anonymized.

Averaged time series are very different from individual ones.

This is a deep problem; it's basically unavoidable in e.g. medical research - the very factors you want to study may well be potentially identifying. The only way to address this is to balance the potential utility of the research against the potential impact of the information.

In my experience, this is a question of interpretation (see e.g. Recital 26 and the question of what is "reasonably likely"). You can ask ten different experts, and you will get ten different opinions.

Unfortunately, many aspects of the GDPR are interpreted very heterogeneously, both in individual countries and by different supervisory authorities within the countries themselves.

For this reason, it is essential that more specific guidelines and certifications are developed for the use of different technologies, including anonymization.

> In my experience, this is a question of interpretation (see e.g. Recital 26 and the question of what is "reasonably likely").

This is absolutely true. The hard part is that what is "reasonably likely" changes as technology changes. It's entirely possible that a data set that qualifies as anonymous today will not be anonymous in five years. Organizations are responsible for the data they publish. If data loses its anonymity in the future due to the release of other data sets and/or improved technology, the organization releasing the data will be responsible for the release of personal data, even if it wasn't personal data at the time of release.

True. For this reason, even anonymous data usually cannot be shared as open data. You have to control the environment in which the data is used to control what is "reasonably likely" (see also the comment by La1n above).

Also this interpretation would completely block any sharing within the pharmaceutical field, where the original data is required by law to be kept for a minimum of 25 years. I personally like the definitions from UKAN, which talk about anonymous data as relating to data environments.

edit: https://msrbcel.files.wordpress.com/2020/11/adf-2nd-edition-...

Absolutely. They're doing a great job at UKAN!

Although what you say makes sense, which is the relevant GDPR rule? I don't recall seeing something like this.

Presentation: Ten Things I Have Learned about De-identification https://www.youtube.com/watch?v=56whNWFIicY&t=3055s

Upload my data to be anonymized? Nooooooooope

How much can you trust a third party to take your complete data set, full of sensitive information, and treat it correctly? How long will they store it for? Who can access it while it's being processed? What gets output in their internal logs? They're asking for a lot of blind faith.

First sentence in [1]:

Amnesia is an application written in java and JavaScript and should be used locally for anonymizing a dataset.

[1] https://amnesia.openaire.eu/about-documentation.html

Do these methods provide any guarantees or is it assumed that a human reviews the output to verify that it was properly anonymized?

> Data anonymized with Amnesia are statistically guaranteed that they cannot be linked to the original data.

It looks like (from other text on their site) they use variants of k-anonymity. This can prevent re-linking attacks back to the original data, but we've also known for a decade that this isn't especially strong. For example, two independent k-anonymous releases can uniquely identify everyone in the dataset[0].

[0]: https://dl.acm.org/doi/10.1145/1401890.1401926
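A toy sketch of the composition attack the linked paper describes (invented records; the real attack is more general): each release is 2-anonymous on its own, but intersecting the candidate sets from both singles one person out:

```python
# Four people; two releases generalized along different dimensions.
people = {
    "Alice": {"zip": "02139", "age": 34},
    "Bob":   {"zip": "02139", "age": 37},
    "Carol": {"zip": "02144", "age": 33},
    "Dave":  {"zip": "02144", "age": 39},
}

def decade(age): return f"{age // 10 * 10}-{age // 10 * 10 + 9}"
def band5(age):  return f"{age // 5 * 5}-{age // 5 * 5 + 4}"

# Release 1 keeps ZIP exact, coarsens age to a decade.
release1 = {n: (p["zip"], decade(p["age"])) for n, p in people.items()}
# Release 2 masks the ZIP, coarsens age into 5-year bands.
release2 = {n: ("021**", band5(p["age"])) for n, p in people.items()}

def candidates(release, record):
    """Everyone in a release whose generalized record matches."""
    return {n for n, rec in release.items() if rec == record}

# Attacker knows the target lives in 02139 and is 34 years old:
c1 = candidates(release1, ("02139", decade(34)))   # 2 candidates
c2 = candidates(release2, ("021**", band5(34)))    # 2 candidates
print(c1 & c2)   # exactly one person survives both
```

Each release meets k=2 in isolation; the guarantee simply does not compose across releases.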

However, that statistical guarantee also requires your pseudoidentifiers to be picked correctly, i.e. it only holds if you select every variable the attacker could possibly know about a subject. I think that is the hard part; it's not something I would recommend doing without a lot of research and experience, especially for high-dimensional data.

Right. Even if you assume the worst-case scenario, there isn't a standard risk metric or threshold to meet.

I feel like differential privacy is the strongest definition we have, but it is also lacking from a practical standpoint. What does it mean to have N nats/bits of information gain from seeing the result of a query? How does this translate into my risk of a PII leak?
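For concreteness, here's a minimal sketch of an epsilon-DP count query using the Laplace mechanism (the standard textbook construction, not any particular library's API):

```python
import math
import random

random.seed(1)

def laplace_count(true_count, epsilon, sensitivity=1):
    """epsilon-DP noisy count: Laplace noise with scale sensitivity/epsilon.
    (The difference of two unit exponentials is a standard Laplace draw.)
    Adding or removing one person changes the count by at most 1, so the
    output distribution shifts by a factor of at most e^epsilon."""
    scale = sensitivity / epsilon
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_count + noise

eps = 0.5   # per-query privacy-loss bound, in nats
noisy = laplace_count(1000, eps)
print(round(noisy))   # roughly 1000, off by a few

# The "N bits of information gain" question: each eps-DP query leaks at
# most eps / ln 2 bits about any one individual, and the losses compose
# additively across queries -- hence the budgeting headache.
print(f"{eps / math.log(2):.2f} bits per query")
```

The math is clean; the practical gap the comment points at is real: epsilon bounds the information gain per query, but it doesn't by itself tell you the concrete probability that a particular person's PII leaks.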

The very best case for Amnesia: veneer of GDPR compliance, maybe survive an audit.

Clients are buying plausible deniability, hedging their liability.

"Yes, we regrettably leaked sensitive data. But we followed all the rules. What more could we possibly do?"

"Not save my data in illegal ways"

Edit: "Not save my data"

This product is for orgs sharing data.

Even more illegal.

There are plenty of valid reasons to share data, e.g. for medical research among others.

Then get informed consent

That's a must. In some countries this is now automated at the government level, Belgium for instance has an excellent consent mechanism for medical data.

Right. But that still involves anonymizing data.

I'd also point out that there is a huge amount of COVID data being shared at the moment and, outside of vaccine trials, I'd be pretty certain a lot of it is not under any sort of informed consent. (As is true of a lot of population statistics generally.)

Not necessarily, but very hard to do properly.

I read this post by Adam Pearce and Ellen Jiang the other day and it was a great read: https://pair.withgoogle.com/explorables/anonymization/

There are tools to generate realistic patient data, for example Synthea: https://synthetichealth.github.io/synthea/

Use synthetic data instead? https://tonic.ai/

Also with synthetic data, there is an inherent trade-off between privacy risks and the usefulness of the data produced.

However, this trade-off can be of a different nature, resulting in advantages for synthetic data, for example when protecting high-dimensional data.

Are there good ways to measure the amount of (original) subject-level data that can be extracted from a synthetic dataset, or to calculate the risk of reidentification (which is nice and easy for k-anonymity, if your assumptions are valid)?

Risk of re-identification is hard to estimate, mostly because you have to assume some state of background knowledge, i.e. which fields the adversary even knows something about.

If I'm looking for a white male in New York City, it's going to be harder to find my target than it would be if I also knew their birth date and zip code.
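A toy sketch of that, with an invented uniformly random population: each attribute the attacker learns slices the candidate set down by roughly its cardinality:

```python
import random

random.seed(42)

boroughs = ["Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"]
zips = [f"1{n:04d}" for n in range(100)]   # 100 fictional NYC-style ZIPs

# Invented population; attributes are independent and uniform, which is
# a simplification (real attributes are correlated and skewed).
population = [
    {
        "id": i,
        "sex": random.choice(["M", "F"]),
        "borough": random.choice(boroughs),
        "birth_year": random.randint(1940, 2005),
        "zip": random.choice(zips),
    }
    for i in range(100_000)
]

target = population[0]
candidates = population
for attr in ["sex", "borough", "birth_year", "zip"]:
    candidates = [p for p in candidates if p[attr] == target[attr]]
    print(f"after {attr!r}: {len(candidates):>6} candidates")
```

Starting from 100,000 people, sex leaves ~50,000, borough ~10,000, birth year ~150, and ZIP a handful: the anonymity set collapses geometrically with each attribute the adversary knows.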

Why isn't this open sourced?

Yep, just found that! Thanks!

Is that after clicking the download button on the linked page in the OP?

When this service gets hacked and leaks data who pays the GDPR fine?

I will go for the superior product "dementia"

that will be all

mic drop

The name is insensitive to people actually suffering from amnesia (me).

I'm a grown man with a shoe size of 40 (8). Is the 2018 movie Smallfoot insensitive towards me?

I mean, I don't enjoy hearing about your suffering and I certainly don't wish it upon you, but why would you care about the name? Do you think that people somehow treat amnesia sufferers with less respect when they're exposed to the name in a context that's not explicitly serious and negative?

The same logic can be applied to black people and the n-word. Shall I name a product after the n-word with no harm intended or being explicitly serious or negative? Does that make it right? No.

You may not be offended by the word "smallfoot", but that's just you. You can't make that judgment for black people and the n-word any more than you can make it for the word "amnesia" and the entire community of amnesiacs.

Give it a few minutes.

Give what a few minutes?

I am sorry, but I disagree.

The website renders with no scrollbar in my browser (Google Chrome 97, Linux). Very annoying.
