
Show HN: Anon – A Unix Command to Anonymise Data - xomateix
https://github.com/intenthq/anon
======
motohagiography
De-identification of data sets (like cryptography) is a very difficult
problem.

It is great that people are building tools for this. Even if I were skeptical
of one or another in particular, the availability of tools popularizes the
discussion of what is necessary and sufficient for de-identifying data.

The main use case I worked on was how to test an event-driven (SOA at the
time) pipeline without production data. Health information handling is very
tightly regulated, so generating a test data set large enough to reflect the
needs of the system was a significant challenge. Engineers couldn't just copy
some production data and use it for testing. The regime I worked in that
defined these rules (early PHIPA, PIPEDA in Ontario) is not unlike what people
may encounter with GDPR.

When I was doing this sort of work, I found that it made more sense to find
the structure of the data, then synthesize it from scratch. For a data format
like HL7, this is non-trivial.

Synthesizing a few gigabytes of json/xml/text from a small training corpus
provides incomplete test data. There are a few companies in the de-
identification business, and I remember a few consulting services for it.

I can think of a few ways to do this, and they aren't simple.
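The find-the-structure-then-synthesize approach can be sketched in a few lines. This is a toy illustration only; the field names and value ranges below are hypothetical, and a real format like HL7 would need far richer generators:

```python
import random

# Hypothetical schema inferred from the real data's structure:
# each field gets a generator matching its observed type and range.
SCHEMA = {
    "patient_id": lambda: f"P{random.randint(100000, 999999)}",
    "birth_year": lambda: random.randint(1930, 2010),
    "ward": lambda: random.choice(["A", "B", "C"]),
}

def synthesize(n: int) -> list[dict]:
    """Generate n fake records that mimic the schema but describe no real person."""
    return [{name: gen() for name, gen in SCHEMA.items()} for _ in range(n)]

records = synthesize(1000)
```

Because every value is drawn from a generator rather than copied from production, there is nothing to de-identify in the first place; the hard part is making the generators faithful enough that tests remain meaningful.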

------
Cynddl
How does this tool compare to other (libre) anonymization software programs,
such as ARX [1]? From what I understand, only basic routines to sample records
and coarsen a few attributes (e.g. ZIP code, dates) are implemented so far.

This might also not be sufficient to truly anonymize data, as a large body of
research has shown [2,3,4].

[1] [https://arx.deidentifier.org](https://arx.deidentifier.org)

[2]
[https://www.uclalawreview.org/pdf/57-6-3.pdf](https://www.uclalawreview.org/pdf/57-6-3.pdf)

[3] [http://randomwalker.info/publications/no-silver-bullet-de-identification.pdf](http://randomwalker.info/publications/no-silver-bullet-de-identification.pdf)

[4] [http://arxiv.org/abs/1712.05627](http://arxiv.org/abs/1712.05627)

~~~
nathankleyn
Hey! I'm one of the co-maintainers of the project here.

What you see today in this project is really a means to scratch an itch we had
- mainly to quickly and easily sample/obfuscate some delimited data in a way
that is "good enough" for demonstrating a visualisation tool without using the
original dataset. It's important to note that we intend to use this data still
within a secure environment.

This tool is absolutely not up to the task of anonymising a dataset in such a
way as to make it fit to be made public. For us, it's about risk management vs
effort: from a security perspective there are scenarios where we can use
samples of data that have gone through this process and substantially decrease
the risk of holding data in multiple places without significant effort. If we
were to go on to make any of these datasets public, we'd be looking for a
better-suited tool.

As a result, tools like ARX are not something we really want to compete with -
they're aiming for a complete solution whereby the results are good enough to
potentially make public. Whether that goal is achievable at all is debatable
given the research you linked, but some people might be comfortable with those
risks.

One thing we've done to try and bridge the gap a bit is to make it really easy
to add new functions as we need them, and I think we can get to a point
whereby for a good portion of use-cases this tool is good enough (for example,
making datasets you can use in a development environment that are
representative, but a manageable size and anonymised to a reasonable degree).

We'll also try to add something to the README addressing this exact question,
as it's one I anticipate we're going to get asked a lot - so thanks for the
constructive line of questioning; it really will help us, and the people who
choose to use this tool, make a decision that's right for them and their
use-cases.

~~~
kevin_nisbet
I would recommend you make this clearer in the README, as reading the
documentation didn't leave me with the impression that the tool is meant for
such limited scenarios and scope.

------
JackCh
The intent behind this tool seems good, but I don't think it's a good idea. To
actually anonymize data requires semantic understanding of that data and an
understanding of what sort of data, harmless by itself, is transmuted into
identifying data when provided in the context of other otherwise harmless
data.

This tool doesn't help you with any of that. It seems to be a glorified awk
script. My concern is that helping the user with the _easiest_ part of
anonymizing data stands to encourage the user to go full steam ahead without
slowing down to stop and think very carefully about what they're doing.

~~~
nathankleyn
Hey! I'm one of the co-maintainers of the project here. I've posted a very
similar reply to a very similar comment below at [1], but to restate the main
points:

We absolutely agree this tool only solves the easiest part of anonymising
data; internally we rely on our team of data scientists to do the difficult
parts. This tool is absolutely not up to the task of anonymising a dataset in
such a way as to make it fit to be made public. For us, it's about risk
management vs effort: from a security perspective there are scenarios where we
can use samples of data that have gone through this process and substantially
decrease the risk of holding data internally in multiple places without
significant effort. If we were to go on to make any of these datasets public,
we'd be looking for a better-suited tool (eg. ARX [2]).

Regarding one part of your comment:

> My concern is that helping the user with the easiest part of anonymizing
> data stands to encourage the user to go full steam ahead without slowing
> down to stop and think very carefully about what they're doing.

We're going to try to add something to the README addressing this exact
question from both of you, as it's one I anticipate we're going to get asked a
lot - or one that carries risk if it's not made obvious from the outset - so
thanks for the constructive line of questioning; it really will help us, and
the people who choose to use this tool, make a decision that's right for them
and their use-cases.

[1]:
[https://news.ycombinator.com/item?id=17144702](https://news.ycombinator.com/item?id=17144702)

[2]: [https://arx.deidentifier.org](https://arx.deidentifier.org)

------
pdkl95
> anonymising ... columns until the output is useful for applications where
> sensitive information cannot be exposed

This tool will not provide any significant amount of anonymity.

> rows to randomly sample ... hash (using ... 32 bits) the column ... mod the
> result by the [constant] value

This is not random. It deterministically selects the same very predictable
fraction of rows.
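The determinism is easy to demonstrate. The sketch below is the general hash-then-mod technique the quote describes, not the tool's exact code:

```python
import hashlib

def keep_row(row: str, mod: int) -> bool:
    # Take 32 bits of a hash of the row; keep the row when hash % mod == 0.
    h = int.from_bytes(hashlib.sha1(row.encode()).digest()[:4], "big")
    return h % mod == 0

# Deterministic: the same row gets the same decision on every run,
# and roughly a 1/mod fraction of distinct rows is selected.
assert keep_row("alice,W1W 8BE,1984", 3) == keep_row("alice,W1W 8BE,1984", 3)
```

That property is useful for reproducible samples, but it is sampling, not anonymisation: anyone with the same data and parameters recovers exactly the same subset.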

> UK format postcode (eg. W1W 8BE) and just keeps the outcode (eg. W1W)

> Given a date, just keep the year

Partial postal codes and dates quantized to the year are still _very_
revealing. Combined with other data (such as a hashed name), the partial
postal code may allow a lot of people to be uniquely identified.
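A toy illustration of how coarsened quasi-identifiers still combine to single people out (the records below are made up):

```python
from collections import Counter

# Made-up records after "anonymisation": only outcode and birth year survive.
records = [
    ("W1W", 1984), ("W1W", 1990), ("SW1A", 1984), ("W1W", 1984),
]
counts = Counter(records)

# Any combination appearing exactly once uniquely identifies a person.
unique = [key for key, n in counts.items() if n == 1]
```

Here two of the four people are already unique on just two coarse fields; real datasets have many more columns, and each extra quasi-identifier shrinks the crowd further.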

> Hash (SHA1) the input

_Hashing does not provide anonymity._ Substituting a candidate key with the
hash of the key is usually a 1-to-1 map that is often trivial to reverse. It
isn't hard to iterate through e.g. all possible names, postal codes, license
plates, or other short-ish strings to find a matching SHA1.

[https://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab-drivers-detailed-whereabouts/](https://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab-drivers-detailed-whereabouts/)

The salt _might_ provide some resistance to pre-computed tables, but a GeForce
GTX 1080 Ti running hashcat can search for matching SHA1 at over 11 GH/s
(giga-hashes per second). That means a single 1080 Ti running for ~3-4 hours
would not only discover that SHA1("hasselhof") ==
ffe3294fad149c2dd3579cb864a1aebb2201f38d; it would exhaustively search all
10-character or smaller lowercase strings.

> range

This is the only feature that could provide anonymity, if it is used correctly
to group large numbers of individuals into the same bucket. This is probably
more difficult than it first appears.
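Used correctly, bucketing also has to clamp outliers into wide edge buckets so rare values don't stand out. A sketch with hypothetical cut-offs:

```python
def bucket_age(age: int) -> str:
    # Wide top/bottom buckets absorb outliers: an age of 104 is nearly
    # unique on its own, while "80+" covers a crowd.
    if age < 18:
        return "<18"
    if age >= 80:
        return "80+"
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

assert bucket_age(104) == "80+"
assert bucket_age(42) == "40-49"
```

The hard part is choosing the cut-offs: they depend on the actual distribution of the data, and a bucket that holds only one person provides no anonymity at all.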

~~~
xomateix
Hey, one of the co-maintainers here. Thanks for your comments.

>> rows to randomly sample ... hash (using ... 32 bits) the column ... mod the
result by the [constant] value

> This is not random. It deterministically selects the same very predictable
> fraction of rows.

Yep, you are right. We didn't intend the sampling function to be part of the
anonymisation; it's just something we tend to use, so we thought it would be
useful to include.

Its objective is to pick a portion of the input data. No more.

>> UK format postcode (eg. W1W 8BE) and just keeps the outcode (eg. W1W)

>> Given a date, just keep the year

> Partial postal codes and dates quantized to the year are still very
> revealing. Combined with other data (such as a hashed name), the partial
> postal code may allow a lot of people to be uniquely identified.

You are absolutely right. Depending on the use case and your data, having the
outcode, the city or the year might be very revealing. In some other cases
even having decades or centuries might be revealing.

We don't pretend that each function provided applies to all use cases. But in
certain use cases partial postcodes or years can be good enough.

>> Hash (SHA1) the input

> Hashing does not provide anonymity.

We are very aware of that. That's why we offer the option to add a salt (that
the user of the tool can make as long as possible and throw away after the
anonymisation process).
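The throw-away-salt idea described here can be sketched as follows (SHA-256 for illustration; per the thread, the tool itself uses SHA1):

```python
import hashlib
import secrets

# Generate a long random salt once, use it for the whole run, discard it after.
salt = secrets.token_hex(32)

def pseudonymise(value: str) -> str:
    # Same salt + same value -> same output, so joins across columns still work
    # within a run; once the salt is destroyed, dictionary attacks against the
    # raw values no longer apply.
    return hashlib.sha256((salt + value).encode()).hexdigest()

assert pseudonymise("alice") == pseudonymise("alice")
assert pseudonymise("alice") != pseudonymise("bob")
```

Note this protects against pre-computed tables, not against an attacker who obtains the salt before it is destroyed, and it does nothing about the quasi-identifier re-identification discussed above.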

>> range

> This is the only feature that could provide anonymity, if it is used
> correctly to group large numbers of individuals into the same bucket. This
> is probably more difficult than it first appears.

We usually work with datasets of tens of millions of users. Choosing the
right ranges and, especially, analysing the data and making sure you
anonymise the outliers (by choosing your bottom and top ranges carefully) is
crucial.

Again, this tool is a hammer. We expect a person who understands wood and
nails to analyse their problem before using it.

------
magissima
JSON for a config that's intended to be used by humans is an abomination.

~~~
xomateix
Hey, one of the co-maintainers of the project here. And the one that decided
to use json.

I agree with you: there are other options for configuration that are much
better than JSON (YAML, TOML).

The main reason for choosing JSON was simplicity. This was my first project in
Go and I didn't want to spend much time on it either. I found an example that
was using JSON and saw that I didn't need any external library to decode it. I
thought that was good enough, at least for now.

We'll probably look into using a library that supports YAML/TOML for
configuration in the future.

------
simlevesque
Good idea. You should add a preview of before and after the anonymisation.

~~~
xomateix
Thanks for the tip. We plan to add an examples folder. We'll add a preview
too.

------
stepik777
Why is it a UNIX tool? What makes it UNIX? Would it not work on e.g. Windows?

~~~
littlesheephtpt
It ostensibly is a tool that follows the Unix model/philosophy: i.e., a
command-line utility that does one thing well, with text inputs and outputs so
commands can be piped together, etc.

------
unhammer
Slightly related: Metadata Anonymisation Toolkit
[https://mat.boum.org/](https://mat.boum.org/) (which seems to be in need of
contributors)

------
Tepix
I'm surprised it doesn't support anonymisation of IP addresses. That would be
pretty much the first feature I'd implement.

~~~
xomateix
Thanks for the idea. We don't support anonymisation of IP addresses because
it's not in any of our use cases yet. But I've already added an issue to
address it.

------
qop
Now that homomorphic encryption exists, why is data anonymization still a
desired thing?

~~~
pintxo
Do you have a pointer for a real life example?

