
A group of ex-NSA and Amazon engineers are building a ‘GitHub for data’ - malshe
https://techcrunch.com/2020/02/20/gretel-nsa-amazon-github-data/
======
dekhn
I once took a call with an ex-NSA guy who was a CEO selling email security
software- your MX would point to them, they'd scan your incoming email for
exploits, then deliver it on to you. When I spoke to the guy, I expressed my
concern, having worked for a large multinational corporation whose fiber optic
lines were tapped by UK intelligence to get around laws against the NSA spying
on Americans, that if I couldn't inspect his software, I couldn't feel
confident in the security and integrity of the scanning system.

I said, in good faith, that I would consider his product if I could inspect
the running system and the code.

He said several things: 1) the NSA never did anything illegal 2) the software
was too large to audit 3) it was an insult to his employment in the NSA that I
was even asking these questions.

Then he hung up.

~~~
sdinsn
It is just plain rude to pin his previous employer's actions, an organization
that employs over 30k people, solely on him.

~~~
CWuestefeld
If the NSA guy is selling himself based on his past experience, then he
himself is the one dragging those issues into the conversation.

~~~
dekhn
Yes, he sells himself on that experience (I had already done due diligence on
a previous company he founded, sqrrl, which was organized around open source
software, but touted the NSA creds):

Oren Falkowitz CEO and Co-founder Oren Falkowitz co-founded Area 1 Security to
discover and eliminate targeted phishing attacks before they cause damage to
organizations. Previously, he held senior positions at the National Security
Agency (NSA) and United States Cyber Command (USCYBERCOM), where he focused on
Computer Network Operations & Big Data. That’s where he realized the immense
need for preemptive cybersecurity.

------
lalaland1125
When you hear a company talk about the promises of synthetic data, you should
run far far away. The fundamental problem is that in order for synthetic data
to be useful for model training, the generative model must have already solved
the problem at hand. Training on the synthetic data is just a charade: you
would be better off simply extracting the model you need from the source
generative model. For generative models with densities this is as simple as
P(Y | X) = P(X, Y) / P(X).
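
To make that concrete, here's a toy sketch (my own illustration, not anything
from the article): with a discrete joint density, the conditional you actually
want is just a normalization away, and fitting a downstream model on samples
from that joint can only ever recover the same conditional, noisily.

    import numpy as np

    # Hypothetical joint density P(X, Y) over a binary feature X and binary label Y.
    # Rows index X, columns index Y; entries sum to 1.
    joint = np.array([[0.30, 0.10],   # X = 0
                      [0.15, 0.45]])  # X = 1

    # Exact conditional: P(Y | X) = P(X, Y) / P(X)
    p_x = joint.sum(axis=1, keepdims=True)
    print("exact P(Y|X):\n", joint / p_x)

    # "Synthetic data" route: sample from the joint, then re-estimate P(Y | X)
    # from the samples: a noisier copy of the conditional we already had.
    rng = np.random.default_rng(0)
    samples = rng.choice(4, size=100_000, p=joint.ravel())
    xs, ys = np.divmod(samples, 2)
    for x in (0, 1):
        print(f"estimated P(Y=1 | X={x}):", round(ys[xs == x].mean(), 3))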

So, what do you get out of it compared to simply training models as a service
for people? Almost nothing useful. All you get are:

\- Much worse privacy guarantees. (People selling synthetic data love to talk
about improved privacy, but it's actually the reverse. The privacy guarantees
for synthetic data are much, much worse than selling direct models to people.)

\- Much worse model performance. See the previous notes about how a synthetic
data generating procedure must have already solved the problem at hand.

\- A much more complicated setup with much more expensive model training.
Training generative models is hard and requires a lot of data and compute due
to the difficulty in learning such a complex outcome space. This can easily
cost 100-1000x as much as simply training a straightforward xgboost model.

~~~
lettergram
I’d argue you’re incorrect on most counts. I work in this space, and have
implemented what’s being described here for synthetic data.

You can see some of my work here:

[https://medium.com/capital-one-tech/why-you-dont-necessarily...](https://medium.com/capital-one-tech/why-you-dont-necessarily-need-data-for-data-science-48d7bf503074)

The reality is, you can guarantee certain values are not passed. That’s fairly
easy; what you can’t do is easily block general trends in synthetic data, or
else the model can’t learn. So you have to be willing to accept that “leakage”
when using synthetic data.

If you accept that leakage, then you can actually improve model performance in
some domains. See Uber’s architecture search blog; there’s a lot of material
(also from Uber) showing this.

Regarding the cost of training: yes, model training costs could increase.
Although I’d suggest it’s much, much less than 10x - more like 50% or so. Is
that worth it for privacy?
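
To make the "guarantee certain values are not passed" point concrete, here's a
minimal sketch of my own (not the actual implementation): a post-filter that
rejects any synthetic record containing a literal value from a protected set.
It blocks specific values, but it cannot block the general trends the model is
supposed to learn.

    # Hypothetical blocklist of real, protected literals (e.g., real emails/SSNs).
    protected = {"alice@example.com", "123-45-6789"}

    def passes_blocklist(record: dict) -> bool:
        # Reject any synthetic record that reproduces a protected literal verbatim.
        return not any(str(v) in protected for v in record.values())

    synthetic = [
        {"email": "user_ab12@example.com", "ssn": "900-11-2222"},
        {"email": "alice@example.com", "ssn": "900-33-4444"},  # leaked literal
    ]
    print([r for r in synthetic if passes_blocklist(r)])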

~~~
missosoup
> The reality is, you can guarantee certain values are not passed. That’s
> fairly easy

Nonsense. There's a fundamental requirement that for a model to have utility,
it must have access to relevant data. If you truly find a leak-proof way to
block certain values (which is borderline impossible in itself), then you make
the model significantly weaker or outright useless if those values aren't
uniformly distributed across the input domain.

And if you do have leakage (and you almost certainly do), then direct re-
identification becomes a trivial problem and we're back to square one.

Synthetic data generated from statistical analysis of real data is either
worthless or leaky. One of those two conditions is formally guaranteed to be
true.

GP is correct that a viable synthetic data generator basically has to have
already solved your problem for you, and in that sense it just becomes your
model which is trained on real data. Training an additional model on top of
that model doesn't add any privacy or mitigate reidentification.

------
peteforde
I started a company with the same premise 9 years ago, during the prime "big
data" hype cycle. We burned through a similar amount of investor money only to
realize that there was not a market opportunity to capture. That is, many
people thought it was cool - we even did co-sponsored data contests with The
Economist - but at the end of the day, we couldn't find anyone with an urgent
problem that they were willing to pay to solve.

I wish these folks luck! Perhaps things have changed; we were part of a flock
of 5 or 10 similar projects and I'm pretty sure the only one still around
today is Kaggle.

[https://www.youtube.com/watch?v=EWMjQhhxhQ4](https://www.youtube.com/watch?v=EWMjQhhxhQ4)

~~~
reggieband
Sometimes an idea has a time. It may seem obvious now but it isn't like
Instagram and Snapchat were the first image sharing applications developed.
Slack was nowhere near the first chat app.

I happened to have a long discussion on the topic of data businesses last
night with a friend. We brainstormed datasets that would be both hard/expensive
to obtain and resellable to thousands of customers willing to pay a high price
for them. I don't want to get involved in datasets that are easy to obtain (too
many competitors, no barrier to entry) or datasets specific to a particular
company (too much dependence on a small set of customers, the cost of acquiring
new customers also includes the cost of acquiring data, no economy of scale).

It's easy to start with the tech problem: how to collect, clean and analyze
data. But reasoning backwards from the business side is much more difficult.
Expensive data I can sell once feels easy. Cheap data I can sell frequently
feels like a race to the bottom. Expensive data I can re-sell 1000s of times
to a niche audience feels like a perfect middle ground ... I just can't think
of any examples.

~~~
AznHisoka
What makes a dataset hard/expensive to obtain?

~~~
mfrye0
I had a friend tell me recently about a client using commercial real estate
data for lead gen. He mentioned [https://compstak.com/](https://compstak.com/)

Basically, identifying companies that are doing well / expanding by the size of
the space they leased. This sort of data is apparently very hard to get, but
gives users a competitive advantage.

~~~
reggieband
Real estate data, and companies like compstak are exactly the kind of niche
markets I'm talking about. Agents are willing to spend large sums to get
access to this data and it can be resold multiple times. Unfortunately it is
also a market full of existing competition with some established players.

What other markets for data are similar? In general, data that leads to
prospect generation is desirable because sales agents are willing to spend
money to make money. Are there any other markets like that?

~~~
jb775
So it sounds like the salient aspect here isn't necessarily the type of data,
but the manner in which that data is collected. Looks like compstak's success
is a result of creating a platform that facilitates crowdsourced data points
that are difficult to acquire using traditional data collection
approaches...that scarcity is what makes the data valuable, especially since
that data can be used for leverage in a negotiation. Also, they appear to prop
up the overall scarcity by only granting access to existing data to users who
provide new data.[1]

I'm curious how they figure out how much to charge companies for this data, and
also how they stop real estate insiders from gaining access without sharing new
data.

[1]
[https://techcrunch.com/2012/10/18/compstak/](https://techcrunch.com/2012/10/18/compstak/)

------
headcanon
I'm a bit unclear on the goals for this startup. When you say "Github for
Data" I'm thinking of a repository of datasets used for ML training or for
more traditional research. But this:

> This so-called “synthetic data” is essentially artificial data that looks
> and works just like regular sensitive user data.

So it's like a Lorem Ipsum generator for data? What's the use case here besides
building apps with sample data? Notwithstanding potential privacy concerns, how
am I confident that this is realistic if you literally say it's generated?

A general data repository for research, with some mechanism to ensure
cleanliness or integrity, sounds much more useful to me.

~~~
zachmu
As you have noticed, "Github for Data" is not what they're building. At all.

Our company is actually building Git for Data and Github for Data. We have an
open source database called Dolt which combines the commit graph of git with
the relational tables and SQL of MySQL:

[https://github.com/liquidata-inc/dolt/](https://github.com/liquidata-inc/dolt/)

Then we have a DoltHub, which is Github for data:

[https://www.dolthub.com/](https://www.dolthub.com/)

Dolt lets you version, branch, and merge your dataset so that you can
collaborate on it with others. Dolthub lets you share your dataset with the
world, submit PRs, fork other people's repos, and use lots of other features
analogous to Github's.

~~~
dflock
You might want to re-think that name:
[https://www.urbandictionary.com/define.php?term=dolt](https://www.urbandictionary.com/define.php?term=dolt)

> A mental retard who is clueless not only about current events, but also has
> the IQ level of a rock. "Dolt" may be the most sophisticated insult in the
> English language. Dolts commonly populate such stereotypes as jocks, nerds,
> fruits, bookworms, and dorks.

~~~
vharuck
It's probably intentional, because "git" is also an insult implying somebody is
dumb.

~~~
dflock
True, but the name `git` came from the open source SCM tool, written by Linus
for the Linux source code - and now everyone is sort of stuck with that name.
This is a commercial product, deciding on a name for themselves.

------
gumby
I don't want a git _hub_ for data, I want a _git_ for data. The one reason
that POS Perforce is used in game development is that it can handle binary
files, especially large ones. LFS is... OK.

And this company isn't really even aping GitHub given that they are generating
the content too.

~~~
arxpoetica
What would you use a "git for data" for?

~~~
mhh__
Data?

Although, having said that, when you say data to me, I do imagine some level of
immutability. Logging scientific results using git would be pretty good where
required, in the sense that there is a habit of just using folders and text
files, which is fine at the time but is really, really hard to take over
sometimes (like code written by scientists, with their single-letter variable
names and hatred of functions - in my experience of Fortran, yuck).

~~~
DocSavage
We’ve released our branched versioning system for data at
[http://dvid.io](http://dvid.io). It’s entirely open source and uses a science
HTTP API with pluggable data types and a versioned key-value backend. I’m
currently developing a new backend for it: DAGStore, a lower-level ordered
key-value store that will have an explicit immutable store and is tailored for
distributed versioning of big data.

------
HashThis
Ex-NSA engineers. That has a negative brand to it, especially when it comes to
respect for data privacy.

~~~
save_ferris
I’ve met several ex-NSAers that used their time there to build their brand
successfully.

At the end of the day, the dragnet surveillance decisions came from the
highest levels of the Bush and Obama administrations, not the boots on the
ground.

~~~
whatshisface
"I was just following orders," is not an excuse with a precedent of being
accepted.

~~~
sdinsn
No, the point is that the average NSA employee was not involved in dragnet in
any capacity.

~~~
fiblye
Most people who were members of socially bad organizations throughout history
weren't bad people or doing bad things, but willingly joining and serving a
support role for such an organization is itself morally dubious. It's actively
helping enable their activities.

For example, if someone got a job on a shark finning boat maintaining the
knives, they're still complicit in shark finning even if they've never touched
or seen a shark in their life.

And even if their actual role was spying on and helping detain Americans for
arbitrary reasons, nobody's going to slap that on their resume. They'd say
they were a data aggregating administrative officer.

------
zitterbewegung
I think a Gitlab for data is a much better proposition. See
[https://quiltdata.com/](https://quiltdata.com/)

It is also open source and written in python.
[https://github.com/quiltdata/quilt](https://github.com/quiltdata/quilt)

------
lettergram
I work in this space, at the intersection of synthetic data and security:

[https://medium.com/capital-one-tech/why-you-dont-necessarily...](https://medium.com/capital-one-tech/why-you-dont-necessarily-need-data-for-data-science-48d7bf503074)

The market opportunity is very large, but also insanely difficult to tap.

For one, you have an uphill battle on trust. Customers have to trust your data
is secure, and btw it’ll never be 100% secure by many standards.

On the other hand, people have to trust the synthetic data is good enough to
use in practice.

So you have to both convince management and convince engineers. Arguably
management is easier to convince, but... best of luck on the endeavor!

That’s not even discussing the technical challenges - I’ve implemented this
all technically and have had it deployed to production systems. Building a
robust system that is secure and produces valid synthetic data is a challenge.

------
iblaine
I see 'Github for data' but I'm reading services to hide PII. I think a
distinction needs to be made here. Is the primary goal to enable Change Data
Capture on data (Github does CDC on code), or is the primary goal to manage
PII?

~~~
jtm_tech
Hey, thanks for the great question. So the "Github for data" is referring to
the ability to collaborate on data. By streaming in data, you can view
discoveries we made on the data (entity recognition, etc.), then essentially
make a new version of that data with automatic transforms, anonymizations,
etc. So you're absolutely right, managing PII is part of it, but really it's
about enabling entity A to share data with entity B with a high level of
confidence that the sensitive data is stripped out.

We'll be releasing some of the packages to do the analysis and transformations
as time goes on, so stay tuned for those so you can take them for a test drive
yourself.

Thanks!

------
opendomain
I pitched exactly this years ago when I was the founder of NoSQL.com
[http://bit.ly/NoSQLPitchDeck](http://bit.ly/NoSQLPitchDeck)

The problems with a 'Github for data' are the 7 'V's:

Volume: too much data to have usable diffs and merges

Veracity: How do you know which branch of data to commit?

Velocity: The data coming in is a stream - batch processing does not cut it

Most companies end up creating a federation, not a true Data Mart.

------
wiremine
A lot of the discussion here seems to be about the methodology and value of
the "synthetic data" concept, and squaring that approach with the analogy to
github.

It feels like we can tease those two things apart:

1\. Is a github-style website/service of forkable data sets useful?

2\. Are anonymized, synthetic versions of those data sets, created via ML,
useful?

Feels like the answer to both is "yes"?

(Also makes me wonder if there's a "rebase" equivalent for data in this sort
of world...)

~~~
justinclift
> 1\. Is a github-style website/service of forkable data sets useful?

Depends on the specifics, but it can be.

I've been working on and off on getting a GitHub style data thing going
([https://dbhub.io](https://dbhub.io)).

It's still a work-in-progress, and I really need to get the data visualisation
piece working, which is a pretty key feature and lousy to not have. ;)

------
jayfk
GitHub for sensitive data? Why would I trust an external company to obfuscate
my sensitive data rather than my own developers doing it for me?

~~~
mirajshah
the same reason you trust external companies to do all sorts of other stuff
for you: specialization. in the average case, they can probably do it better
than your developers because it's their primary business.

~~~
jayfk
Sure, but we are talking about a twofold trust problem here. Trust in an
entity (external company vs internal developer) and trust in the work itself.

While I do see your point about trusting an external company which is
specialized in the problem I’m trying to solve more than my own developers, I
still have to transfer my highly sensitive data to them for which I have to
trust them even more.

------
JohnFen
Engineers from the NSA, Google _and_ Amazon worked on this? That doesn't
exactly inspire any kind of confidence...

------
notaharvardmba
I haven't seen it mentioned yet so don't forget about the dat project:
[https://github.com/datproject/dat](https://github.com/datproject/dat)

------
quasiben
Sounds very similar to the OSS DVC project:
[https://github.com/iterative/dvc](https://github.com/iterative/dvc)

------
sailfast
"How do the CIA, NSA, DHS, and FBI all share critical data without breaking
the law with regard to specific details about certain types of collection
targets (like citizens?)"...

FEDRAMP it. Now run that as a service inside a secure cloud environment. You
need enough runway for the FEDRAMP / engineering / sales process plus a
contract win and then I'd imagine the income gets pretty steady.

Commercially? Unsure of the use case, as those sharing data are typically in
competition. Not so in the government / intel community / finance space, and
I'd imagine you have to write a metric ton of policy and sign a bunch of MOUs
to do this kind of thing properly. People in government do care a great deal
about these policies, believe it or not.

This is also a huge problem to solve in the "Know Your Customer" / Anti-Money
Laundering space for financial institutions, where sharing data between
companies or government and companies is often prohibited and/or really
difficult. See the recent FCA TechSprint for more on this:
[https://www.fca.org.uk/events/techsprints/aml-financial-crim...](https://www.fca.org.uk/events/techsprints/aml-financial-crime-international-techsprint)

Lots of talk of "Homomorphic Encryption" and "Encrypted Cloud Runtimes" as
options, but if you don't really need to share all of the data to get to an
outcome (but rather synthetic data - though synthetic identity seems to be the
hard bit here...) that could be interesting!

------
geraldbauer
FYI: You might enjoy my talk notes titled "Using Git (and GitHub) for
(Publishing) Data" [1], which include many real-world examples too.

[1]:
[https://github.com/geraldb/talks/blob/master/git_for_data.md](https://github.com/geraldb/talks/blob/master/git_for_data.md)

------
alexwatson405
Hey all- we just released code on GitHub and a research post on Medium
demonstrating how to generate synthetic datasets from models trained using
differential privacy guarantees, based on rideshare datasets. Please take a
look and let us know what you think!

[https://medium.com/gretel-ai/using-generative-differentially...](https://medium.com/gretel-ai/using-generative-differentially-private-models-to-build-privacy-enhancing-synthetic-datasets-c0633856184)
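
For anyone unfamiliar with the differential privacy piece, here's a minimal
sketch of the idea (my own illustration of the classic Laplace mechanism, not
Gretel's code): a counting query has sensitivity 1, so adding Laplace noise
with scale 1/epsilon to the true count gives epsilon-differential privacy for
that single query.

    import numpy as np

    def dp_count(values, predicate, epsilon, rng=np.random.default_rng(0)):
        # Adding or removing one person changes a count by at most 1 (sensitivity 1),
        # so Laplace noise with scale 1/epsilon yields epsilon-DP for this query.
        true_count = sum(1 for v in values if predicate(v))
        return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

    ages = [23, 35, 41, 29, 62, 55, 38]  # stand-in "sensitive" column
    print(dp_count(ages, lambda a: a > 40, epsilon=0.5))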

------
wildermuthn
A little off-topic, since we’re not talking about true synthetic data here
(vs. wiping PII), but the future of synthetic data is no-data due to
differentiable programming. Instead of a program outputting vast amounts of
synthetic data, it is written with a library or language that is
differentiable and whose gradients can be integrated straight into the
training of a model. A few PyTorch libraries dealing with 3D modeling have
been released lately that accomplish that, and a good deal of work in Julia is
making promising advances. I’m curious to see how overfitting will be
addressed, but there may come a time when large datasets become a thing of the
past as low-level generation of data becomes just another component of a
model’s architecture.
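
To show the plumbing (purely a sketch of my own, not any particular library):
the "data" becomes a set of learnable parameters inside the training graph, so
gradients reach it just like they reach the model weights. A real system would
anchor the loss to an actual objective (a simulator, a renderer, a reference
model) rather than this degenerate toy.

    import torch
    from torch import nn

    class SyntheticSource(nn.Module):
        """Learnable 'dataset': a small bank of synthetic examples and soft labels."""
        def __init__(self, n=64, dim=16, classes=3):
            super().__init__()
            self.x = nn.Parameter(torch.randn(n, dim))
            self.y_logits = nn.Parameter(torch.zeros(n, classes))

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
    source = SyntheticSource()
    opt = torch.optim.Adam(list(source.parameters()) + list(model.parameters()), lr=1e-2)

    for step in range(100):
        pred = model(source.x)                    # gradients flow into source.x too
        target = source.y_logits.softmax(dim=-1)  # soft labels are also learnable
        loss = nn.functional.cross_entropy(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()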

------
v4dok
Synthetic data is a tool in a system of privacy-preserving analytics. I don't
think a lot of companies would be comfortable with shipping their sensitive
data into a model not sitting on their premises. The model needs to generate
good enough results to make sense in a training/analysis scenario (doubtful)
and preserve the privacy of that data (difficult).

People in the Privacy-preserving industry try to find the silver bullet in one
technology, but the real solution is a combination of different technologies.
Trusted delegation of computation + privacy-preserving techniques together
solve this issue, but separately provide marginal value.

------
dannykwells
Honest question: does the NSA have good engineers? Like, FAANG good? It seems
like a funny "hype line" for a startup, because I think of government
engineers as ok, but not like, rockstars.

But maybe I'm wrong?

~~~
dx87
Most of the top engineers would probably be contractors. The government pay
scale pretty much forces you to become a manager if you want to move past
GS-12. Including locality pay, GS-12 salary tops out at a little over 100k a
year[1].

I just looked at their website, and they don't mention anything about NSA,
Amazon, or Google on the landing page, so I think it's just techcrunch adding
it in as clickbait, even though the founders aren't going out of their way to
advertise it.

[1]
[https://www.federalpay.org/gs/2020](https://www.federalpay.org/gs/2020)

------
dpflan
The article does not have a link to the company's website; this will save you
a few actions: [https://gretel.ai/](https://gretel.ai/)

------
mhh__
Not entirely sure what this is yet but I was looking for a resource of
scientific data in nice formats and I was very surprised I couldn't really
find any. If this could do that in a way that doesn't upset any licensing
agreements (Don't really know what I'm talking about in that regard) that
would be pretty cool.

I needed a resistivity-temperature dataset for a tungsten alloy and I ended up
having to manually type up a series out of the book. Not fun!

------
xrd
Is this the same Laszlo Bock, founder of Humu, formerly head of HR at Google?
If so, very diverse set of founding experiences.

If not, is this a common name in Hungary? Small world.

~~~
Maxious
Same one, confirmed on his twitter
[https://twitter.com/LaszloBock/status/1230520252384993285](https://twitter.com/LaszloBock/status/1230520252384993285)

------
travjones
I think this is a great idea. Perhaps the "Github for data" copy could use
some work, but the concept of obfuscating real data for use in building
systems without the overhead or concern of operating on real customer data is
valuable. Of course, the degree to which the obfuscated data represents real
data is important to ensuring the systems built like this are robust, but this
seems possible.

------
tompccs
Since they're not launched yet, I think we can safely assume that this press
release is not targeted at potential users.

------
that_girl
I've been using Snowflake for a while now and the Data Sharing feature they
have is weirdly close to what I'm reading here. Anonymization, masking a
column etc. Anyone with similar experience as mine? Or anyone who's
knowledgeable enough to compare this product with Snowflake for me?

------
alpb
I see "Laszlo Bock" is one of the founding members. He was industry veteran an
exec at GM and Google (as a head of HR, as far as I know). He then started an
HR company Humu ([https://humu.com](https://humu.com)).

------
ivarv
Wouldn't a data repository be considered more of a wiki than a VCS? I don't
understand the desire to associate the concept with Github instead of
Wikipedia, which I would propose is more appropriate.

~~~
rednerrus
What if you wanted to work on a data set across groups and you wanted version
control?

------
m0zg
There is already "GitHub for data":
[http://academictorrents.com/](http://academictorrents.com/)

~~~
oscarbatori
How does that even come close to solving the problems that a "GitHub for data"
product would solve?

~~~
m0zg
Arguably, Academic Torrents is closer to what GitHub does than the product in
question. I don't know why they compare it to GitHub TBH.

------
rsashwin
When I read the headlines, I thought it was these guys:
[https://www.dolthub.com](https://www.dolthub.com)

~~~
tlbsofware
There’s a 404 at that link

~~~
oscarbatori
[https://www.dolthub.com/docs/getting-started/introduction/](https://www.dolthub.com/docs/getting-started/introduction/)

This is the correct link

------
scarejunba
Dolthub and Data.world sound more like that. This looks like one of numerous
synthetic data startups.

------
michaelbuckbee
I like this idea, but the article doesn't do a great job of spelling out why
it's valuable.

Consider a SAAS with 25,000 active customers and tons of structured (database)
data.

You have a bunch of people that need to work on the dev system and the closer
dev looks to prod the better you are.

\- Contractors in another country

\- A team working on basic compliance with GDPR and CCPA

\- Sysadmin team trying to manage backups/restores

When the contractors pull a version of the DB it needs to not have any
customer data (emails, addresses, etc.) so there's a process that wipes all
those out and fills them in with fake data.

When the GDPR team gets a data deletion request ("Please delete all my data, my
email is X@y.com") on Monday, and the sysadmin team restores from a backup from
Sunday, what happens?

Right now both of these actions are one-off things done with a mish-mash of
scripts, organizational knowledge, and half-remembered processes.

So wouldn't this be better with a service that could talk to your DB, letting
you fork out versions that "knew" the current DB structure, that you could mark
as purged of sensitive data, and that you could apply (and re-apply) transforms
to for structural changes and data addition/removal?
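
As a minimal sketch of the "wipe customer data before contractors pull the DB"
step (my own illustration; the table and column names are made up), the idea is
to overwrite identifying columns with deterministic fake values so joins still
line up but nothing real leaks:

    import hashlib
    import sqlite3

    def fake_email(real: str) -> str:
        # Deterministic placeholder: the same real address always maps to the same
        # fake one, so joins on email still work across tables.
        digest = hashlib.sha256(real.encode()).hexdigest()[:10]
        return f"user_{digest}@example.com"

    # Hypothetical dev snapshot of the production database.
    db = sqlite3.connect("dev_copy.db")
    rows = db.execute("SELECT id, email FROM customers").fetchall()
    db.executemany(
        "UPDATE customers SET email = ?, full_name = 'REDACTED', address = NULL "
        "WHERE id = ?",
        [(fake_email(email), row_id) for row_id, email in rows],
    )
    db.commit()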

------
TACIXAT
I would like this as a marketplace where you could license your data to other
people.

------
amirouche
It just happens that I have been working on something like that, posted on
Show HN.

------
goofballlogic
Great idea. I was just thinking about how this would be a great idea the other
day.

------
visarga
It's not Github for data, but if it were, then one of the main issues would be
handling GDPR requests. Git is by construction averse to deletions in its
history, so it's incompatible with hosting sensitive data. On the other hand,
it could be great to have deduplication in storage and versioning for datasets.

------
stazz1
How about a version controlled law lookup, can I get an Amen?!

~~~
Havoc
I'm sure I've seen this somewhere already. Don't recall what country though

------
arxpoetica
Assertion: GitHub for data can't exist without a Git for data.

Naive take?

------
luord
That sounds like github in pretty much no way.

------
m463
We already have:

    git lfs

------
freeone3000
Why not use GitHub?

~~~
sonicxxg
Well, Git itself is not a good tool for handling large dataset files. In most
cases, you're not interested in deltas between commits. The size of your repo
can also grow out of control pretty quickly. As a dirty workaround, you have
Git-LFS to do that for you. People tend to use it in repos with a lot of
multimedia assets. This works well in many cases, but it has its own pitfalls
as well.

------
symplee
How about a GitHub for science?

~~~
btrettel
GitHub itself works fine for science. I see no particularly compelling reason
to use a science-oriented service given that people are more likely to be
familiar with GitHub.

Here's an example (or shameless plug): I use GitHub to share research data:
[https://github.com/btrettel/pipe-jet-breakup-data](https://github.com/btrettel/pipe-jet-breakup-data)

There's code in that repository too. The code merges a variety of different
data sources and performs some analyses. Nothing particularly fancy, and the
code is probably not much better than average as far as academic code goes
(which is not good), but I'm slowly adding tests and improving the code
otherwise.

------
kick
_“We’re building right now software that enables developers to automatically
check out an anonymized version of the data set,” said Watson. This so-called
“synthetic data” is essentially artificial data that looks and works just like
regular sensitive user data. Gretel uses machine learning to categorize the
data — like names, addresses and other customer identifiers — and classify as
many labels to the data as possible. Once that data is labeled, it can be
applied access policies. Then, the platform applies differential privacy — a
technique used to anonymize vast amounts of data — so that it’s no longer tied
to customer information._

Anyone taking bets on how long it's going to be before these idiots end up
leaking the SSN of every US citizen because their categorizer failed?

~~~
derision
How do you know they're idiots? Maybe you're the idiot

~~~
staticautomatic
You don't have to be a machine learning expert to understand that no
classifier is going to be correct 100% of the time. The laws against divulging
PII don't contain exceptions for classifiers goofing.

~~~
thruflo
That’s not how it works. Synthetic data is entirely artificial rather than
transformed, so you’re not worried about “missing some PII”.

See for example the videos on
[https://hazy.com/product](https://hazy.com/product)

Disclosure: Hazy cofounder.

~~~
staticautomatic
Why would you need "to anonymize vast amounts of data — so that it’s no longer
tied to customer information" or "appl[y] access policies" if the data contain
no PII? Presumably the ML is anonymizing the data and the access policies are
necessary because the data contain PII.

~~~
thruflo
Yup, people tend to confuse concepts and refer to synthetic data as anonymised
data. They are very different things.

Anonymised data or redacted data are transformations of a data set that _hope_
not to leak too much PII / sensitive data. People don’t use ML to anonymise,
but they do use ML to classify as a first step before splatting or
generalising.

In that case, it’s absolutely right that the ML classifier not being 100%
accurate results in PII leaking.

This is a key reason why anonymisation and redaction are widely seen as
problematic and are being replaced by synthetic data and, maybe in future,
homomorphic encryption.

~~~
v4dok
Homomorphic encryption, and any encryption-in-use technology, is no guarantee
of privacy on its own. Synthetic data has the same dilemma of utility vs.
anonymity as any other anonymization tech.

~~~
thruflo
Yes, but starting from the position of “entirely artificial data”.

