If you are using data for test purposes, please use generated data, not anonymized data. This has the additional advantage that there is no potential path for live data to end up on a developer's machine.
added in edit: Also realize that using a service such as this (or a similar one) actually increases the chances that you are leaking sensitive data; uploading it somewhere is just about the surest way to guarantee that at some point there is a breach. Don't take the 'made easy' line for granted: if possible, ask the company for their audit reports and for the measures they have in place to ensure that your data doesn't end up elsewhere. A company providing such a service, or a chunk of software that does this, is of course a massive target. The only way to stay away from that extra risk is to run this on your own premises, on a machine that is not connected directly or indirectly to the outside world.
When a massive object passes between a distant star and an Earth-based observer, the light coming from the star gets deflected and focused by the gravity of the massive object, and the star seemingly brightens as a result. This is called gravitational microlensing. The duration and shape of the brightening allowed scientists to determine that it was likely a Mars-sized planet and that it likely had no star within 8 astronomical units. It's likely a rogue planet of roughly Earth size.
Think about how little actual information these astronomers had. Yet they were able to make a very credible prediction about what happened. You see this all over in science, particularly in physics, where the truth is coaxed out of very little direct data. This makes me think that similar things can probably be done with data about people. This would mean that effectively anonymizing people's data is very hard or maybe even impossible.
Computers are powerful enough though, and people clever enough, that everyone can now have a Sherlock Holmes in their pocket. And large corporations can have smarts far beyond Sherlock Holmes. So what once was a "eh, there's no rule against it, if you have the information the smart guy should be able to use it", is now a case of "oh dang, hmmm maybe civilization is built on the idea that not everyone is Sherlock Holmes."
Maybe this should have been my own blog post somewhere, but anyway, it's one of the many new conundrums of our time.
Enter today, when nearly everyone can be (and is) surveilled in a level of detail that cops of previous generations could only dream of. We're still dealing with those same laws, but we've removed the unwritten premise they were based on: surveillance couldn't scale.
Alternatively, look for an open-licensed dataset if one exists in your domain (e.g. using https://fairsharing.org or, shameless plug, https://biokeanos.com). With generated data you always bake in some assumptions; with more 'wild' data you have a chance to discover edge cases earlier.
The problem goes beyond testing: your training might (1) be deprived of information your original data had, e.g., among 1,000 features the key to the classification of interest might be a single feature; how does your generator know to preserve it, potentially at the cost of distorting the other 999 features? (2) latch onto the assumptions made by the data generator, e.g., for a continuous-valued feature, your model might bias itself towards the distribution moments that the generator assumed.
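A minimal sketch of point (1), with made-up toy data (none of this is from the thread): a generator that only reproduces per-column (marginal) distributions keeps every individual feature looking right while destroying the one relationship that carries the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Toy "real" data: one informative feature among many noise features.
key = rng.normal(size=n)
noise = rng.normal(size=(n, 9))      # stand-in for the other 999 features
label = (key > 0).astype(int)        # the signal lives entirely in `key`

def sample_marginals(columns, rng):
    """Naive generator: shuffle each column independently.
    Every marginal distribution matches the original by construction."""
    return np.column_stack([rng.permutation(col) for col in columns])

fake = sample_marginals([key, *noise.T, label], rng)
fake_key, fake_label = fake[:, 0], fake[:, -1]

print("real corr(key, label):", round(np.corrcoef(key, label)[0, 1], 2))            # ~0.80
print("fake corr(key, label):", round(np.corrcoef(fake_key, fake_label)[0, 1], 2))  # ~0.00
```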
I am not sure generating good fake data is a different problem from good density estimation. And in some cases you might need to specify which parts of the original distribution you don't want the generator to mess with: consider an NLP dataset where your model must rely on sentence structure. Generating the right bag-of-words features might not help here: sequence matters. The same goes if you wanted to use contextual embeddings; sequence matters there too.
Even if you did manage to generate a "distributionally compatible" version of the data, you could run into problems in cases where you perform some kind of data enrichment at a later stage. For example: if the original data has zip codes that you wanted to mask, and your data generator substitutes them with arbitrary strings, then at a later point you cannot introduce a feature that measures the proximity of two locations.
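A small illustration of that enrichment problem, with hypothetical zip codes and invented centroid coordinates: once the generator has replaced the values with arbitrary strings, a later proximity feature has nothing to look up.

```python
import hashlib
import math

zip_to_latlon = {            # hypothetical centroid lookup, not real geodata
    "10001": (40.75, -73.99),
    "10002": (40.72, -73.99),
    "94105": (37.79, -122.39),
}

def distance_km(z1, z2, lookup=zip_to_latlon):
    """Rough great-circle distance between two zip-code centroids."""
    try:
        (la1, lo1), (la2, lo2) = lookup[z1], lookup[z2]
    except KeyError:
        raise ValueError(f"no geography available for {z1!r} / {z2!r}")
    la1, lo1, la2, lo2 = map(math.radians, (la1, lo1, la2, lo2))
    h = (math.sin((la2 - la1) / 2) ** 2
         + math.cos(la1) * math.cos(la2) * math.sin((lo2 - lo1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(h))

def mask(z):
    """What a naive generator might do: replace the value with an arbitrary string."""
    return hashlib.sha256(z.encode()).hexdigest()[:8]

print(round(distance_km("10001", "10002"), 1), "km")   # works on the original values
try:
    distance_km(mask("10001"), mask("10002"))
except ValueError as err:
    print("enrichment fails on masked data:", err)
```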
Theoretically speaking, if the generated data has the same distribution and parameters as the real data, and encodes similar nonparametric features like seasonality and user activity, I think generated data might be fine.
1. Admittedly tricky if you have limited data and no insight into the underlying population distribution/features, just those of the sample. But then you have a worse problem for modeling diagnostics anyway.
2. In the sense that anything is "fine", which is a spectrum that requires some critical skepticism in statistics. There are always caveats but it may still be robust and useful.
So you could instead test on fake data that has the same (or as close as possible to the same) statistical properties as the data that you would like to use.
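For numeric data, a minimal sketch of that idea, assuming the first two moments are the statistical properties you care about: fit the mean and covariance of the real data and sample synthetic rows from them. Linear correlations survive; seasonality, heavy tails, and any nonlinear structure do not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the real numeric data: 2,000 rows of 3 correlated features.
real = rng.multivariate_normal(
    mean=[0.0, 5.0, 10.0],
    cov=[[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 2.0]],
    size=2_000,
)

mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
fake = rng.multivariate_normal(mu, cov, size=len(real))   # synthetic rows

print("real corr:\n", np.round(np.corrcoef(real, rowvar=False), 2))
print("fake corr:\n", np.round(np.corrcoef(fake, rowvar=False), 2))
```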
Disclosure: I'm the main author of ARX.
In my opinion, this source code was put out to appear "open" to the H2020 Programme, but they have no intention of actually helping users run the code locally.
And if this isn't intentional, then the level of effort is so low that you have to ask how you could trust the software with such a complex and subtle problem. It is clearly intentional, but they didn't even give themselves a plausibly deniable story.
Less prominent than the "online version" button though.
> Amnesia Desktop Version 1.2.2
> Available for Windows and Linux
> [ mockup that is supposed to resemble a macOS app ]
For data to be truly anonymous under GDPR, there must be no other additional data that would allow for reidentification. If there is any other data that, when combined with the anonymous data, allows for reidentification, the data set is only pseudonymous and must be treated as personal data under GDPR.
The thing you're worried about with poorly anonymized datasets is that if you have another, non-anonymized dataset, you can combine the two to deduce the original information. "Your data set must not be combinable with any other dataset in a way that would allow someone to infer the original data" is a hard requirement. How could you possibly test them all?
Well, it turns out that there is one such non-anonymized dataset with the property that if you can't connect your anonymized data with it at all, then you can be pretty sure that you couldn't connect it with any others: the original data!
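A rough sketch of that test, using pandas and invented column names: join the release back onto the original data on the quasi-identifiers and count how many rows map back to exactly one person.

```python
import pandas as pd

original = pd.DataFrame({
    "name":      ["alice", "bob", "carol"],
    "age":       [34, 34, 71],
    "zip":       ["10001", "94105", "10001"],
    "diagnosis": ["flu", "flu", "asthma"],
})

# Naive "anonymized" release: drop the name, keep everything else.
release = original.drop(columns=["name"])

quasi_identifiers = ["age", "zip"]
linked = release.merge(original[quasi_identifiers + ["name"]], on=quasi_identifiers)
names_per_row = linked.groupby(quasi_identifiers)["name"].nunique()

print(names_per_row)
print("uniquely re-linkable rows:", int((names_per_row == 1).sum()))
```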
You don't have to use anonymized data all the time, it's just that the requirements for handling and passing around such data are lower.
The main point is that de-identified data can still be "personal", so it's regulated. If you share or make public pseudonymous data, that data is still covered by GDPR, so you have to inform the individuals, have a legal basis (such as consent), let them opt out (if applicable), etc. Even if it's been pseudonymized, I would want to know if/when my data is sold to a marketing firm or whatever.
But if the attacker lacks the source dataset, they can't do this, and if they possess the source dataset, they'd use it for their analysis rather than using the anonymised dataset.
Oh, there is Netflix account for user # 188da24a7789d and the IRS released tax summaries for user # 188da24a7789d? That's interesting, since I know that user # 188da24a7789d is really MaxBarraclough.
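To make that concrete with a hypothetical example: a deterministic hash of an account name is a pseudonym, not anonymization. The same value turns up in every release, so the releases join on it, and a small candidate space can simply be brute-forced.

```python
import hashlib

def pseudonym(username: str) -> str:
    """Deterministic pseudonym, e.g. a truncated hash of the account name."""
    return hashlib.sha256(username.encode()).hexdigest()[:13]

# Two independent releases keyed by the same pseudonym (contents made up).
netflix_release = {pseudonym("MaxBarraclough"): "viewing history ..."}
irs_release     = {pseudonym("MaxBarraclough"): "tax summary ..."}

# Anyone with a list of candidate usernames can reverse the pseudonym.
for name in ["alice", "bob", "MaxBarraclough"]:
    p = pseudonym(name)
    if p in netflix_release and p in irs_release:
        print(name, "->", p, "appears in both releases")
```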
This is a deep problem; it's basically unavoidable in e.g. medical research - the very factors you want to study may well be potentially identifying. The only way to address this is to balance the potential utility of the research against the potential impact of the information.
Unfortunately, many aspects of the GDPR are interpreted very heterogeneously, both in individual countries and by different supervisory authorities within the countries themselves.
For this reason, it is essential that more specific guidelines and certifications are developed for the use of different technologies, including anonymization.
This is absolutely true. The hard part is that what counts as "reasonably likely" changes as technology changes. It's entirely possible that a data set that qualifies as anonymous today will not be anonymous in 5 years. Organizations are responsible for the data they publish. If data loses its anonymity in the future due to release of other data sets and/or improved technology, the organization releasing the data will be responsible for the release of personal data, even if it wasn't personal data at the time of release.
It looks like (from other text on their site) they use variants on k-anonymity. This can prevent re-linking attacks back to the original data, but we've also known for a decade that this isn't especially strong. For example, two independent k-anonymous releases can uniquely identify everyone in the dataset.
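A toy version of that two-release problem (invented records, not a claim about what this tool actually emits): each release is 2-anonymous on its own quasi-identifiers, yet intersecting the two candidate sets pins the target down to a single record.

```python
import pandas as pd

people = pd.DataFrame({
    "age":       [34, 34, 35, 35],
    "zip":       ["10001", "10002", "10001", "10002"],
    "diagnosis": ["asthma", "flu", "cold", "covid"],
})

# Release A generalizes the zip code; release B generalizes the age.
rel_a = people.assign(zip=people["zip"].str[:4] + "*")
rel_b = people.assign(age="30-39")

# Each release on its own is k-anonymous with k = 2:
print(rel_a.groupby(["age", "zip"]).size().min())   # 2
print(rel_b.groupby(["age", "zip"]).size().min())   # 2

# Attacker knows the target is 34 and lives in 10001.
candidates_a = set(rel_a.loc[rel_a["age"] == 34, "diagnosis"])        # {'asthma', 'flu'}
candidates_b = set(rel_b.loc[rel_b["zip"] == "10001", "diagnosis"])   # {'asthma', 'cold'}
print(candidates_a & candidates_b)                                    # {'asthma'}
```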
I feel like differential privacy is the strongest definition we have, but it is also lacking from a practical standpoint. What does it mean to have N nats/bits of information gain from seeing the result of a query? How does this translate to my risk of a PII leak?
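For what it's worth, a minimal sketch of what the epsilon guarantee means in the simplest case, the Laplace mechanism on a counting query: any single output is at most e^epsilon times more likely with you in the data than without you, i.e. an observer gains at most epsilon nats per query. How that composes into a concrete "risk of a PII leak" is exactly the hard part.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism for a counting query (sensitivity 1, noise scale 1/epsilon)."""
    return true_count + rng.laplace(scale=1.0 / epsilon)

def laplace_pdf(x, mu, scale):
    return np.exp(-abs(x - mu) / scale) / (2 * scale)

epsilon = 0.5
with_me, without_me = 101, 100        # neighbouring databases differing in one person

output = noisy_count(with_me, epsilon)
ratio = laplace_pdf(output, with_me, 1 / epsilon) / laplace_pdf(output, without_me, 1 / epsilon)

print("noisy count:", round(output, 2))
print("likelihood ratio between the two worlds:", round(float(ratio), 3),
      "<= e^epsilon =", round(float(np.exp(epsilon)), 3))
```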
Clients are buying plausible deniability, hedging their liability.
"Yes, we regrettably leaked sensitive data. But we followed all the rules. What more could we possibly do?"
Edit: "Not save my data"
I'd also point out that there is a huge amount of COVID data being shared at the moment and, outside of vaccine trials, I'd be pretty certain a lot of it is not under any sort of informed consent. (As is true of a lot of population statistics generally.)
However, this trade-off can be of a different nature, resulting in advantages for synthetization, for example when protecting high-dimensional data.
If I'm looking for a white male in New York City, it's going to be harder to find my target than it would be if I also knew their birth date and zip code.
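Back-of-the-envelope version, with invented numbers (the real figures depend on actual population data, and the attributes aren't truly independent):

```python
# All numbers below are made-up ballpark figures, and the attributes are
# treated as independent, which they are not in reality.
nyc_population   = 8_400_000
share_white_male = 0.15          # sex plus broad ethnicity
n_birthdates     = 365 * 80      # plausible dates of birth
n_zip_codes      = 180           # rough count of NYC zip codes

candidates = nyc_population * share_white_male
print(f"white male in NYC:      ~{candidates:,.0f} candidates")

candidates /= n_birthdates
print(f"+ exact date of birth:  ~{candidates:,.1f} candidates")

candidates /= n_zip_codes
print(f"+ zip code:             ~{candidates:,.3f} candidates (i.e. usually unique)")
```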
that will be all
I mean, I don't enjoy hearing about your suffering and I certainly don't wish it upon you, but why would you care about the name? Do you think that people somehow treat amnesia sufferers with less respect when they're exposed to the name in a context that's not explicitly serious and negative?
You may not be offended by the word "smallfoot", but that's just you. You can't make that judgment for black people and the n-word any more than you can make that judgment for the word "amnesia" and the entire community of amnesiacs.