
The False Allure of Hashing for Anonymization - twakefield
https://gravitational.com/blog/hashing-for-anonymization/
======
aidos
I saw a case a few years ago where the management of a company I knew were
worried that the sales team were covering their mistakes and lying about it to
blame the (external) dev team's code. They asked me to take a look into it one
morning.

At first glance there didn't seem to be a lot to go on. There was no auditing
in the application itself, so I focused on the nginx logs. It's amazing how
clear a picture you can build from IP addresses, user agent strings, and
accessed URLs.

Within an hour I could say with a high degree of certainty that the story was
something like:

    
    
        Sales rep makes mistake with record on Friday afternoon
        Monday morning - at home, late for work
        Receives call from another rep re mistake
        Logs in via mobile device to see the issue
        Logs in via desktop to fix broken record
        Arrives at work 1.5 hours later
        Claims dev team had broken the record for the weekend
    

There's a lot of information lurking in log files (let alone insecure DBs),
and that's just the tip of the iceberg of what's stored these days. I dread to
think how much personal information sits in some of the bigger CRM apps.

Quite frankly, I'm glad there's now a push to start thinking about this stuff
from the outset.

~~~
pcarolan
I think the bigger problem was the culture that company had in place that
would lead people to do that.

~~~
beagle3
Sometimes it's just what the people bring with them, even if the company has
"good" culture (whatever that means).

And even in companies with the best culture, I would expect such things to
happen if the cost of a mistake is comparable to a person's yearly salary or
above that.

------
michaelbuckbee
In digital security there is the concept of "defense in depth", that no one
product, feature, approach or safeguard is going to magically make you
protected from attacks. What's required are multiple overlapping layers of
protection that collectively work together to create a more protected whole.

We're seeing more of this with privacy and user data. The author rightly
points out some issues with hashing and "pure" anonymization. It's more
correctly considered "pseudonymization" (which is a recommended GDPR
technique [1]).

All of which is to say _it's still an improvement over nothing_ and when
layered with other techniques can help protect user privacy.

1 - [https://blog.varonis.com/gdpr-requirements-list-in-plain-english/#article25](https://blog.varonis.com/gdpr-requirements-list-in-plain-english/#article25)

~~~
WalterBright
Defense in depth is a lesson other industries have learned - that's why
airliners are incredibly safe these days. They're not safe because parts don't
fail - they do fail, as the recent engine compressor failure showed. But the
airliner is designed to withstand those failures, the pilot is trained to deal
with them, and the process is designed to prevent them from happening again.

Notably the Fukushima Nuke plant and Deepwater Horizon disasters did not have
defense in depth. One failure each had a zipper effect.

(Of course, defense in depth is a concept from the military; look at how
medieval castles are constructed for a very visible implementation of it.)

------
kevin_nisbet
Author Here.

Using crypto hashes to anonymize data is one of those mistakes I've seen
several times, and wanted to draw some attention to the issue so that
hopefully we can all learn from it.

Let me know if you have any questions.

~~~
sdenton4
So one of the issues here is using an externally visible ID (or a
transformation of one) as an internal ID. Why not create a random int64 at
account creation time which is invisibly linked to the public username (e.g.
email address)? Now you've got a proper join key, you can restrict access
to the map, and it's easy to delete the map entry when the user unsubscribes.

(There can still be good reasons to apply one-way hashing to the random
internal UUID, as well: for example, to provide different levels of logs
access to different internal users. People who make dashboards get hashed ids,
and people who debug logging get raw ids.)
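
A minimal sketch of that mapping, with a plain dict standing in for the
access-restricted table (all names here are invented for illustration):

    import secrets

    # Access-restricted mapping from public username to internal join key.
    # In production this would be a locked-down table, not an in-memory dict.
    _id_map = {}

    def internal_id(email):
        # Mint a uniform random int64 on first sight of the user.
        if email not in _id_map:
            _id_map[email] = secrets.randbits(63)
        return _id_map[email]

    def forget_user(email):
        # Deleting the map entry severs the logs from the public identity.
        _id_map.pop(email, None)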

The problem of entropy allowing individual user identification even with all
IDs scrubbed is still very real, though, and non-trivial to address. One can
start by wrapping the query engine with a service which checks that a certain
minimum number of people are covered by a given query before returning the
results. Or apply differential privacy-type transformations to the output...
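
A rough sketch of that minimum-coverage check (the threshold and row shape
are assumptions, not recommendations):

    MIN_GROUP_SIZE = 20  # illustrative threshold only

    def release_results(rows):
        # Only release query results that cover enough distinct users.
        distinct_users = {row["user_id"] for row in rows}
        if len(distinct_users) < MIN_GROUP_SIZE:
            raise PermissionError("query covers too few users to release")
        return rows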

~~~
merinowool
Then a user emails you to ask what personal data of theirs you have on the
server. Now you don't have the connection, so you can't find it, but you still
have it: GDPR non-compliance.

~~~
sdenton4
It's your mapping, so you can easily gather up everything with the given
marker and hand it back to them. You only throw away the key (and delete
attached data) if the user deletes their account (and maybe after some
additional time elapses, in case they change their mind or were hacked); it's
the same process as GDPR per-user encryption key deletion.

~~~
merinowool
If you throw away the key, you still have the data, just encrypted. There is
no guarantee that in 5 years that user data couldn't be decrypted.

~~~
sdenton4
'Key' here refers to the key in the mapping from external to internal userID.
The whole point is that (as mentioned in a sibling comment) choosing an
internal user ID uniformly at random is equivalent to a one-time pad; it's
guaranteed non-decryptable, unless you invent a time machine...

~~~
robbiemitchell
Isn't there a distinction here, though? While they might result in a similar
outcome, deletion is different from de-identification.

------
russnewcomer
This is a question I've thought about recently, as I'm going to be working
with data that may have damaging personal repercussions if identified with
you, but that is good for society as a whole to track, where the tracking
doesn't have to be personally identifiable.
Something like, it could be bad for me if it was revealed to my insurance
company that I drove more than 5000 miles a year on a motorcycle, but
beneficial for society as a whole to understand accident rates for high
mileage motorcycle drivers. Do you have any thoughts/resources on how one
could go about creating a privacy environment where users could input how many
miles they drove, and where we have reporting that analyzes that information
they put in? My first thought had been hashing primary keys, but as you point
out in your article, that obviously isn't the best answer.

~~~
entee
Differential privacy and other formalized systems are a good choice, but if
you never need to give the data back or present it as such to the
customer/inputter, you can get heuristic Pretty Good Anonymization if you
understand the structure of your problem and how you're going to use the data.

For example, taking your motor vehicle trips, off the top of my head the
things that can ID you are (in rough order):

    
    
      Driver's License
      Name
      Vehicle License Plate
      Time, Location of trip
      Trip Distance
      Location of driver residence
      Location of driver workplace
    

If you had a database of these things, you could apply some of the strategies
in the article, and a few others to ensure no collisions.

    
    
      Driver's License: Ditch it, hash it with a private key, or have a
      lookup table somewhere. I'd favor ditching it.
      Name: Same as DL number
      Vehicle License Plate: Same as DL number
    

For the above 3, you may really only need a few variables that are less
constrained - gender, approximate age, type of vehicle - so you could just
compute out to those and store only that result.

    
    
      Time, Location of trip: Fudge these +/- a random time, or +/- a random
      distance from start/finish. Careful not to have it be a dumb random
      circle; Strava does this, and given enough public rides I'm sure people
      could figure out where I live. (Maybe do this as a function of
      population density?)
      Trip Distance: Fudge +/- a random distance
      Location of driver residence: Fudge to begin with; probably ditch if possible
      Location of driver workplace: Ditto
    

The point is: think about what you need from the dataset and deliberately
mess it up so that you'd have to have the original to piece it together.
Often you don't need the exact input data, just something within a random
delta of it, so keep only the stuff within a random delta.
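
A rough sketch of that kind of fudging in Python; every field name and delta
here is invented for illustration:

    import random

    def fudge_trip(trip):
        # Keep each field within a random delta of the truth, drop the rest.
        return {
            # shift the start time by up to +/- 30 minutes
            "start_ts": trip["start_ts"] + random.uniform(-1800, 1800),
            # naive uniform jitter of roughly a kilometre; as noted above,
            # this should really vary with population density to avoid the
            # Strava problem
            "start_lat": trip["start_lat"] + random.uniform(-0.01, 0.01),
            "start_lon": trip["start_lon"] + random.uniform(-0.01, 0.01),
            # fudge the distance by up to +/- 10%
            "distance_km": trip["distance_km"] * random.uniform(0.9, 1.1),
            # coarse, less-constrained variables instead of name/DL/plate
            "vehicle_type": trip["vehicle_type"],
            "age_band": trip["age_band"],
        }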

~~~
russnewcomer
But if I do need the original data back - say, the driver needs to produce an
expense report with the hours - what would you do in that case? I have
thoughts, but I'm trying to bounce them off someone else.

~~~
entee
If you need to provide the data back to the customer, then maybe the right
answer is to follow the same standards as financial institutions and health
companies do. In practice, that comes down to ensuring that no individual has
access to the underlying data without extreme monitoring of how that data
moves around and is used. This is a rather large burden though, so I can
understand if that's too much for your use case.

Things we do:

    
    
      - Rotate passwords used to access networks/servers regularly
      - 2FA all the things
      - Only provide permissions to what a user needs
      - Limit it to just the time a user needs it
      - Logging+security scanning across the backend infrastructure
      - Tight monitoring of devices used to access network for patch level
      - Keep front-end networking infrastructure redundant and patched
      - Multiple levels of auth (vpn pw, vpn 2FA, then public/private key for each server, then 2FA for each server, etc.)
    

You can only do so much but you can make it so that it's harder to compromise
the crown jewels.

~~~
russnewcomer
That makes sense. The data set is going to be in the health area, and I'm less
concerned about processes for the individuals in the organization having
access (like what you've suggested) and more thinking about how to structure
the data so we as an organization can't access it. We're dealing with
infectious disease, where there is personal benefit in not letting anyone
outside the care side know that you have a disease, but societal benefit in
tracking trends, outbreaks, or hygiene around the disease. We're also figuring
out how to structure the system so that, say, if we were to sell, there
wouldn't be this trove of information on who has what diseases, just on who
was a customer.

Thanks for your thoughts!

------
PeterisP
SHA256 pretty much ensures that you have a unique hash for every value - and
that's a feature you don't want for anonymization. So why not simply take the
first few bytes of a SHA256, a small enough set to ensure that collisions not
only _might_ happen but _will_ happen? I mean, that's a required feature to
ensure anonymization, not just pseudonymization - if you can select a whole
trail of events for ID #123 and be sure that these represent all the events
for some (unknown) real user, then that by itself means that those events
aren't anonymous, they're pseudonymous.

You can tweak the hash length so that whatever statistics you run on the
hashed data are meaningful (though not exact) despite the collisions, while
running a dictionary attack of plausible usernames returns an overwhelming
number of false positives.
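
A minimal sketch; the number of bytes kept is the tunable knob. Two bytes
gives only 65,536 possible values, so collisions are guaranteed well before
you have that many users:

    import hashlib

    def anonymized_bucket(user_id, nbytes=2):
        # Truncated SHA-256: many real users share each output value,
        # so a positive dictionary-attack match proves very little.
        return hashlib.sha256(user_id.encode()).digest()[:nbytes]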

~~~
nebulous1
I'm not sure you could make the data statistically meaningful and still have
too many false positives to deanonymize an ID. I think you're basically
suggesting randomly grouping the IDs so they average X real IDs per grouped
ID. And if you did the grouping randomly instead of by hashing, there would be
no danger of a dictionary attack at all.

~~~
PeterisP
The expectation is that a brute force attack would try orders of magnitude
more IDs than you actually have. It means that if a random ID is 90% likely
to have a unique hash and 10% likely to map to one of your real IDs, then your
real data won't have that many collisions; however, if someone does a brute
force check of (for example) a million email addresses, they'll get 100,000
positive responses, the vast majority of which will be false positives.

~~~
nebulous1
That's a reasonable point but doesn't explain why you're using hashes instead
of random groupings in the first place.

------
jacquesm
The idea that data is a corporate asset has to die. Data is a corporate
liability.

~~~
benmowa
I agree that data is a liability; however, even my plumber's truck can kill
someone and can be considered a liability. Regardless, the truck is not
something he can do business without. I agree that companies should fear the
data they retain much more than they do today.

~~~
e_proxus
To be fair, companies gather a lot more data than they need to do business
these days.

------
procrastinatus
Differential privacy seems like a pretty good approach to this problem.
[https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html](https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html)

~~~
awesomepantsm
Differential privacy is basically a buzzword. Don't believe the hype.

~~~
kahnjw
It seems to me that diffpriv is a nascent area of research that has not yet
been bastardized by the business community. The complete opposite of a
buzzword.

------
Area12
I am not a crypto expert, but I thought that the idea was to produce a new
more or less random salt for EACH password, store the salt with the hashed
password, hashing using an expensive algorithm. Yes the hacker steals the salt
with the hash, but now has to go to the trouble of brute forcing that ONE
password with its UNIQUE (or almost unique) salt. In other words, the hacker
can crack it, but the process is so expensive for ONE password that cracking
an entire database of passwords is a nightmare. Of course, the hacker can
just focus on the most privileged accounts, I guess, but the idea is to make
the hacker's life as unpleasant as possible, and to catch them while they are
coming back in. Am I missing the point? I do see that if the hacker wants one
specific password, they can get it with effort even with unique salts.
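
That is the standard construction for passwords. A minimal sketch with the
Python standard library (the cost parameters are illustrative only):

    import hashlib, hmac, os

    def hash_password(password):
        salt = os.urandom(16)  # fresh random salt per password
        digest = hashlib.scrypt(password.encode(), salt=salt,
                                n=2**14, r=8, p=1)  # deliberately expensive
        return salt, digest    # store both alongside the account

    def check_password(password, salt, digest):
        candidate = hashlib.scrypt(password.encode(), salt=salt,
                                   n=2**14, r=8, p=1)
        return hmac.compare_digest(candidate, digest)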

~~~
bglusman
For the specific use case in question, what I've been doing for years is not
just hashing the data, but hashing an internal secret AND the data. The secret
isn't stored in the database anywhere (usually an env var, but it could be a
secret in Vault or other outside config), so our hashes are deterministic (and
don't need a separate salt for each one), but they will never coincide with
another system's hashes. I didn't see this mentioned in the article, though I
didn't read it thoroughly. I thought this was a pretty good compromise, but
I'm curious for other perspectives - I forget if I read this technique
somewhere or just made it up as a reasonably good safeguard.
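
A minimal sketch of that keyed-hash scheme (the env var name is invented):

    import hashlib, hmac, os

    # Internal secret from the environment, never stored in the database.
    SECRET = os.environ["PII_HASH_SECRET"].encode()  # hypothetical name

    def pseudonymize(value):
        # Deterministic within this system, but the output is useless to
        # anyone (including other systems) who lacks the secret.
        return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()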

~~~
aflagatopamoon
This is called a pepper:

[https://en.wikipedia.org/wiki/Pepper_(cryptography)](https://en.wikipedia.org/wiki/Pepper_\(cryptography\))

If you read the article above, you'll see that you still need a salt, since
users with very simple passwords will have the same hash: crack one, and you
can crack the others for free.

~~~
bglusman
Yeah, I should have been clear: this wasn't for passwords at all, ever. This
was/is only for other kinds of PII.

------
kurthr
I'm surprised that there was no mention of a salt used in a secure server to
generate the hashes and act as an oracle. Adding pepper at the customer site
already seemed like a good idea. Of course this is still hard and requires
diligence for those who care about their customers and data security.

~~~
int_19h
How would you prove that said server is secure?

A good rule of thumb with these things is to assume that if there's any sort
of indirect link between some person and that server (even if it involves
multiple hops across security boundaries - e.g. a web request invoking a
backend service querying a database that accesses the hash from a stored
procedure), it can potentially be compromised. You never know when another
Meltdown happens, and what it'll look like.

------
lolc
The trouble is when we're holding on to the original data because we want the
option to process it in new ways later on. The fundamental problem is that
data correlates facts. Thus - as the article rightly points out - if you know
some of the facts you can reconstruct identities.

I find the distinction between information and exformation revealing:
Information is the bits we gleaned from the data, exformation is the bits we
discarded while reducing the data. The efficacy of an information processing
system is in how much it discards while extracting the information we need.
The expensive operation is not the recording but the forgetting.

If you want to protect data from being stolen, distill it as soon as possible
into the information you need. And destroy the rest. It comes down to the
value of being able to re-run the analysis versus the effort to guard the
data.

~~~
madrox
As someone who's had to work through the implications of GDPR lately, I think
the future of user data is that you can't keep the option to "process it in
new ways" later. Permissions are becoming opt-in instead of opt-out.

~~~
mjevans
You probably can, but you need to be upfront about what you're collecting and
the context that is being stored with it.

It MAY be more ethically permissible to degrade the context and preserve only
the most valuable and least personally identifying data. (Such as saving only
the actual search query and a local timestamp, but filtering out anything
related to a recognized name that isn't famous)

------
voidmain
"Anonymization" in the sense of transforming a dataset so that it's still
useful but doesn't significantly reduce the privacy of the people it
describes, is usually impossible, or at least beyond the state of the art.
People start out with just a few tens of bits of anonymity and bits are
everywhere.

You probably have a better chance of creating your own secure block cipher
than of achieving this goal. In a similar way, your inability to see what's
wrong with your scheme is not evidence that it works.

I don't like to be negative, and I'm all for continued research, but at this
point the conservative thing to do with data that you need to "anonymize" is
delete it.

~~~
Sophistifunk
Agreed. The more alarming angle to consider is that the more a particular
dataset describes somebody: 1) the more valuable it is in the context of
surveillance and advertising, 2) the more work good-faith actors should put
into anonymising it, and most importantly, 3) the easier it is to de-anonymise
through correlation with other sets.

~~People just aren't the unique snowflakes our mothers told us we are.~~ Most
people, for example, can be uniquely (and easily) identified with just a DOB,
first name, and suburb.

Edit: maybe the problem is actually that we are too unique :)

------
kahnjw
The author makes a good point: anonymizing data is hard. Unfortunately, they
don't mention differential privacy, a promising area of research that can help
us solve these problems.

[https://en.wikipedia.org/wiki/Differential_privacy](https://en.wikipedia.org/wiki/Differential_privacy)

------
djhworld
Where I work we've been debating about this a lot. I work with log data from
CDNs, so user IP addresses get ingested. We use that information and correlate
it with geoip services to determine stuff like the ISP being used.

This is so we can evaluate CDN performance and also see how well ISPs are
doing in serving content to the user. So it's essentially asking questions
about network performance rather than at a macro level of individual users.

As far as IPs are concerned we don't care much after that, other than maybe
the odd "how many unique IP addresses were served today" type queries.

We've talked about using a secret/salt that is rotated periodically, but to
be safe you would definitely need to ensure previous salts are destroyed, and
not even let people view or access them while they are live.
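
A rough sketch of such rotation, assuming something else (e.g. a daily cron)
calls rotate() on schedule and the old key really is destroyed:

    import hashlib, hmac, secrets

    class RotatingPseudonymizer:
        def __init__(self):
            self.rotate()

        def rotate(self):
            # Overwriting the reference drops the old key; once it is gone,
            # old pseudonyms can no longer be linked back to IP addresses.
            self._key = secrets.token_bytes(32)

        def pseudonym(self, ip):
            return hmac.new(self._key, ip.encode(),
                            hashlib.sha256).hexdigest()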

~~~
unilynx
Wouldn't storing the first three octets of an IP address be enough for this
kind of analysis? Or use the whois database and reduce the data to the first
IP address of the network ?

~~~
ryanworl
I personally think just storing the autonomous system the IP originates from
and never writing the IPs to disk at all would be advisable if the goal is
purely which ISPs are delivering how many bytes to end users. Another benefit
is the AS to IP mapping database is small enough to fit in memory without
issue.

~~~
oasisbob
That's probably insufficient for the use case. A single AS can advertise many
different routes for different IP blocks that have dramatic geographic
differences.

------
javajosh
When addressing the solution of adding data (salt), I find the author's
counter-argument unconvincing:

    
    
       Don’t get me wrong, this does make it significantly harder 
       to attack a leaked database to unmask every user, but the 
       resources required to do so or target specific users are 
       within the reach of many adversaries.
    

I don't see how it's more feasible to reverse hash(known_user+salt) than it
is to dereference hash(salt), and even state-level actors can't do anything
but attempt to brute-force hash(salt). IOW, without more behind the author's
assertion, I don't buy that adding more data to the data you want to protect
is insufficient protection, even against known targets.

------
lbriner
The link to Cryptographic Right Answers is really helpful - the kind of
article it would be nice to make the general "go-to" for those of us who know
enough to follow along but not enough to do it ourselves!

What I didn't like was the continual reference to AWS as if it is the only
provider available, without qualifying whether it is specifically an AWS
product that solves the problem or whether it is _an_ example of using a cloud
service to transfer the risk. There are many alternatives to AWS load
balancers and Key Management systems, so the advice is tainted _sigh_

------
Terr_
Why not a two-step process, where you (A) generate a hash from fixed user
details and (B) use that hash to access a lookup-table for the final UUID?
This combines some strengths of both systems:

1. Outsiders can't determine an arbitrary UUID, even if they know the
original user-details.

2. You can easily destroy a relationship (to limit correlation or to comply
with laws like GDPR) by erasing the corresponding row in the lookup table.

3. Insiders can't directly go backwards from UUID to real name, due to the
hashing step. They would need to generate hashes for all the users, and hope
that matches still exist in the lookup table.
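
A minimal sketch of the two steps, with a dict standing in for the lookup
table:

    import hashlib, uuid

    lookup = {}  # hash-of-details -> final UUID, erasable per user

    def _details_hash(user_details):
        return hashlib.sha256(user_details.encode()).hexdigest()

    def get_uuid(user_details):
        h = _details_hash(user_details)
        if h not in lookup:
            lookup[h] = str(uuid.uuid4())  # random, unrelated to the details
        return lookup[h]

    def erase(user_details):
        # Destroying the row severs the relationship (point 2 above).
        lookup.pop(_details_hash(user_details), None)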

------
JTbane
Surprising that no mention is made of rainbow tables or lookup tables. If you
hash something that can easily be looked up in a table, it's obviously not
anonymous.

Passwords are stored as salted hashes for these obvious reasons...

~~~
lolc
The article explains very well how salted hashes don't help against username
lookups.

~~~
jwilliams
In the case of salts, the article admits "Don’t get me wrong, this does make
it significantly harder to attack a leaked database to unmask every user..."

So salts definitely do help. And if you choose your salt well (e.g. global
fixed/rotating plus local/temporal), you significantly increase your
protection compared to not using a salt at all.

------
jiveturkey
Really good article. One of those things that are beyond obvious to those of
us close to this field, but not at all obvious to the general software dev
(who also mightn't know the difference between a good cryptographic hash and a
good password hash).

What would make this article great is general ideas on what _is_ a good way to
anonymize data. I'm surprised that info is missing, actually.

What would make it world class great is discussion about GDPR ramifications,
keeping in mind that one need not necessarily be perfect for GDPR, even if
you're FB/Google.

~~~
kevin_nisbet
Thanks for the feedback.

I was trying to avoid the general ideas on what is a good way to anonymize
data, because I don't think there are general rules that apply, and I'm not in
a position to give authoritative advice on this. The more I dug in, the more I
realized this is probably one of the hardest technical problems that exists
right now, and there isn't yet a right answer that works (like use scrypt for
passwords).

As for GDPR, I think digging into this in more detail would be a great follow
up.

~~~
jiveturkey
Everyone is going to have different requirements, so yeah, it's hard to claim
there is a general solution. But an idea or two can be thrown out there, like
an anonymizer microservice that only remembers the mapping for a limited time
period. Even stating explicitly that it's a hard problem, and a very, very
hard problem if you want perfection, would be a worthwhile addition. As it
stands, the article doesn't convey the difficulty of addressing the problem.

------
AndrewSChapman
If our goal is true anonymisation, that is, even the host cannot know who the
data belongs to, why are we hashing data at all, and not completely removing
it? Replace the PII (name, email address, phone, etc.) with a fixed number of
*'s. There's no reversing or guessing that.

If we are wanting information to be readable by some people in some
circumstances, that's not anonymisation: that's data protection and an
entirely different problem.

------
zAy0LfpBZLC8mAC
I think another problem is that we even call any of that "anonymization". If
you replace "foobar" with "1", you haven't anonymized anything. At best, you
have pseudonymized your data. Whether you use hashing or a secret mapping
function, as long as identity within your dataset is preserved, what you are
generating are pseudonyms.

------
zzzcpan
> The way we’ve chosen to anonymize the data is by generating HMAC

You can also truncate the hash after the HMAC to mix the data of different
users. It would still be useful for aggregate analytics, abuse protection,
rate limiting, etc., but if each user shares an identifier with many others,
it becomes harder to unmask them and make correlations.
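
A minimal sketch of truncating after the HMAC (the key is a placeholder for
the article's HMAC secret):

    import hashlib, hmac

    SECRET = b"..."  # placeholder for the HMAC key

    def shared_pseudonym(user_id, nbytes=3):
        # Truncating the HMAC output forces many users onto each
        # identifier, trading exact linkability for collision cover.
        return hmac.new(SECRET, user_id.encode(),
                        hashlib.sha256).digest()[:nbytes]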

------
mlinksva
Another recent writeup: [https://freedom-to-tinker.com/2018/04/09/four-cents-to-deanonymize-companies-reverse-hashed-email-addresses/](https://freedom-to-tinker.com/2018/04/09/four-cents-to-deanonymize-companies-reverse-hashed-email-addresses/)

------
angry_octet
See also, why hashing is not a good way to discover shared contacts, and a
better way: [https://signal.org/blog/contact-discovery/](https://signal.org/blog/contact-discovery/)

------
tempodox
Any thoughts on using a UUID instead of a username hash?

~~~
dredmorbius
A hash is (theoretically) an anonymous function.

How do you anonymously map your UUID to the anonymised signifier? Such that
you cannot back it out yourself?

The properties that ensure this make the UUID useless.

------
blattimwind
tl;dr preimage resistance is only as strong as sizeof(input domain), which is
probably small if you're trying to anonymize something.
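
The whole problem in one loop: a sketch of a brute force over ten-digit phone
numbers, one example of a small input domain:

    import hashlib

    def brute_force(target_digest):
        # The entire input domain is only 10^10 values; hashing them all
        # is hours of commodity compute, not centuries.
        for n in range(10_000_000_000):
            candidate = "{:010d}".format(n).encode()
            if hashlib.sha256(candidate).digest() == target_digest:
                return candidate.decode()
        return None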

------
slooonz
If you don't require deterministic hashes (and deterministic hashes are bad
for anonymization anyway), just hash data+randomBytes(16) (and obviously,
don't save the randomBytes(16) anywhere). There you go: nobody can brute-force
your hashes.

Even better, just replace your data with H(randomBytes(16)). Or a random UUID.
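
Taken literally, that looks like the sketch below. Since the random bytes are
discarded, the output can never be recomputed or verified, which is why it's
equivalent to just using a random ID:

    import hashlib, os

    def unlinkable_id(data):
        # data is bytes; the random suffix is never stored anywhere
        rnd = os.urandom(16)
        return hashlib.sha256(data + rnd).hexdigest()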

~~~
EGreg
Umm what good is the string of random bytes if you don’t store it anywhere?
The point of a one-way function is that its output is verifiable given the
input.

