
Too Unique to Hide - zoobab
https://cpg.doc.ic.ac.uk/individual-risk
======
rstuart4133
Oh, it's a story about UK postcodes. To an Australian, street addresses in the
UK border on the insane.

When I was in the UK I had to travel to a B&B at night. I had a GPS but they
hadn't given us a street address - just a postcode and house name. You
couldn't enter that into our ancient GPS, so I rang them. Initially they didn't
give me an address, telling me it was useless. When I persisted they gave me a
street name and suburb. I forget the suburb, but I will never forget the
street name. It was "The Street".

I entered what I had in the GPS, and two hours later I arrived where it said
to go. But there was no house with the name given. So we rang them again.
Turns out there were three streets named "The Street" all within 1km of each
other and we were in the wrong one.

We arrive at the correct "The Street", but still can't find the house. We ring
them for help, but it's freezing and raining, or at least doing what passes
for rain outside of London and they are reluctant. Admittedly it's a short
street. We feel like fools, but as we still can't find it we keep calling, and
eventually our hosts stand on the road in the rain and flag us down. Turns out
the B&B is on a battle axe block and the car has to push past bushes in the
entrance to get onto the driveway.

At breakfast I make the mistake of expressing my opinion of the UK's street
address system. They look insulted and defend it vigorously. Then as we load
the car I hear them having a loud "discussion". It turns out a new postman has
been appointed who hasn't learnt the house names yet, so the mail hasn't
arrived for a few days. They were discussing who should go to the post office
to collect it.

------
nemetroid
As an aside, other countries can have considerably smaller post code areas
than US or UK. For example, here's the size of a typical Swedish post code,
probably less than a thousand people:

[https://www.hitta.se/kartan!~57.68616,11.91682,16z/tr!i=5N3q...](https://www.hitta.se/kartan!~57.68616,11.91682,16z/tr!i=5N3q01sC/search!i=2001232518!q=41452%2C%20G%C3%B6teborg!t=single!st=plc!ai=2001232518!aic=57.68787:11.91388)

~~~
DanBC
English postcodes have two parts.

Here's an example: GL52 2SE

The GL52 part covers large parts of Cheltenham.

[https://www.postcodes-uk.com/GL52-postcode-district](https://www.postcodes-uk.com/GL52-postcode-district)

But the full post code GL52 2SE covers only 8 addresses in Albion Street. The
full post code covers from KwikFit up to 132.

[https://www.google.co.uk/maps/place/Albion+St,+Cheltenham+GL...](https://www.google.co.uk/maps/place/Albion+St,+Cheltenham+GL52+2SE/@51.8989323,-2.0688246,3a,75y,96.29h,90.06t/data=!3m6!1e1!3m4!1sARTo2RT5FY4I4hNYEhOItQ!2e0!7i16384!8i8192!4m5!3m4!1s0x48711b9595eb534d:0x3f8d4e355f3228e6!8m2!3d51.8989702!4d-2.0685344)

Plus this bike shop:
[https://www.google.co.uk/maps/place/Albion+St,+Cheltenham+GL...](https://www.google.co.uk/maps/place/Albion+St,+Cheltenham+GL52+2SE/@51.8998816,-2.0700704,3a,75y,45.99h,89.4t/data=!3m6!1e1!3m4!1sg4I7W_GHNsZ0S43BVUa9IA!2e0!7i16384!8i8192!4m5!3m4!1s0x48711b9595eb534d:0x3f8d4e355f3228e6!8m2!3d51.8989702!4d-2.0685344)
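The two-part structure described above can be sketched in a few lines. This is only an illustration: the function name `split_postcode` is made up, and it relies on the inward part of a UK postcode always being the last three characters. It does not check that the postcode actually exists.

```python
from typing import Tuple


def split_postcode(postcode: str) -> Tuple[str, str]:
    """Split a UK postcode into (outward, inward) parts.

    The inward part is always the last three characters; whatever
    precedes it is the outward part (the postcode district).
    """
    # Drop all whitespace and normalise case, so "gl52 2se" and "GL522SE" both work.
    compact = "".join(postcode.split()).upper()
    # Without the space, UK postcodes are 5 to 7 characters long.
    if not 5 <= len(compact) <= 7:
        raise ValueError(f"unexpected postcode length: {postcode!r}")
    return compact[:-3], compact[-3:]


outward, inward = split_postcode("GL52 2SE")
print(outward, inward)  # GL52 2SE
```

The outward part alone ("GL52") is the coarse district; adding the inward part ("2SE") narrows it to a handful of addresses, which is why the full postcode is so identifying.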

~~~
pbhjpbhj
Someone I know has 2 postcodes for their house. You can probably trace that
person on that fact.

------
vermilingua
I found it a little on the nose to complete the first interactive (about how
very few points of data can track you between datasets), only to receive a
_Share on Twitter / Facebook_ prompt.

If only people would practice what they preach.

------
anon1253
There are several metrics in addition to the k-anonymity
([https://en.wikipedia.org/wiki/K-anonymity](https://en.wikipedia.org/wiki/K-anonymity))
that has been going around the past few days. For example l-diversity
([https://en.wikipedia.org/wiki/L-diversity](https://en.wikipedia.org/wiki/L-diversity))
and t-closeness
([https://en.wikipedia.org/wiki/T-closeness](https://en.wikipedia.org/wiki/T-closeness))

t-closeness is defined as: An equivalence class is said to have t-closeness
if the distance between the distribution of a sensitive attribute in this
class and the distribution of the attribute in the whole table is no more than
a threshold. A table is said to have t-closeness if all equivalence classes
have t-closeness.

In short: the distribution of a particular sensitive value should not be
further away than a distance t from the overall distribution.
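As a rough sketch of that definition, the snippet below measures how far each equivalence class strays from the overall distribution. The definition leaves the distance measure open, so total variation distance is used here purely as an assumption; the toy table, column names and function names are all hypothetical.

```python
from collections import Counter


def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def max_class_distance(rows, quasi_ids, sensitive):
    """Largest distance between any equivalence class and the whole table.

    The table has t-closeness (under this distance) for any t at or above
    the returned value.
    """
    overall = distribution([row[sensitive] for row in rows])
    classes = {}
    for row in rows:
        key = tuple(row[q] for q in quasi_ids)
        classes.setdefault(key, []).append(row[sensitive])
    return max(
        total_variation(distribution(values), overall)
        for values in classes.values()
    )


# Hypothetical toy table: one quasi-identifier, one sensitive attribute.
rows = [
    {"district": "GL52", "diagnosis": "flu"},
    {"district": "GL52", "diagnosis": "flu"},
    {"district": "GL50", "diagnosis": "flu"},
    {"district": "GL50", "diagnosis": "asthma"},
]
print(max_class_distance(rows, ["district"], "diagnosis"))  # 0.25
```

Here the whole table is 75% flu, while one class is 100% flu and the other 50%, so both classes sit 0.25 away from the overall distribution.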

Using the t-closeness metric circumvents issues associated with k-anonymity
and ℓ-diversity. Briefly, k-anonymity states that each combination of
identifying attributes (an equivalence class) should be present in at least k
records, which introduces ambiguity in the data set. However, if the k records
in an equivalence class all share the same sensitive value, that value can
still be resolved simply by elimination. The ℓ-diversity metric circumvents
this problem by adding a further requirement: in addition to the class being
seen in k records, these records must have at least ℓ ‘well represented’
sensitive values. But if an attacker knows the real-world distribution of
values, then attributes could still be disclosed with a certain probability,
simply by combining different data sources.
------
shadowprofile77
The implications of uniqueness through multiple seemingly innocuous data
points are quite worrisome in a wider context. For example, in the digital
sphere, if X person has a certain pattern of sites visited, browsing habits
and so forth, and then tries to anonymize themselves as person Y for browsing
in a way that has no direct connection to any of their regular online
accounts, a third party correlating even a small amount of browsing and even
vague location data by a bunch of unknown activity profiles could presumably
then quickly identify that one of them, Y has a high chance of being X. Or if
a bunch of online comment text, buying habits, random basic information or
whatever from known person profile X were used to search for similar data
points among random people, then correlated together, finding out that Y (who
is X but decided to change their physical address, country of residence and so
forth for personal reasons) matches very closely with X and is thus very
likely to be X seems very easy.

~~~
shadowprofile77
This could apply to simple text analysis too. Some agent looking for a now
disappeared person X collects a bunch of their known comments, emails, chat
text and search queries, and looks for distinctive words, word patterns and so
forth among them. The agent then searches for similar patterns across a
broad, maybe global, swath of general chat, social media, email, comments and
search postings, and also correlates some of the sites involved by keyword
similarity, to easily identify X as Y even though Y has completely
disconnected from anything having to do with having been X.
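A very crude version of that idea fits in a few lines: compare word-frequency profiles with cosine similarity and pick the closest candidate. This is a toy sketch under heavy assumptions (bag-of-words only, made-up texts, hypothetical function names); real stylometry uses far richer features.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-cased word frequencies of a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(known_text, candidates):
    """Name of the candidate whose text is most similar to known_text."""
    known = word_counts(known_text)
    return max(
        candidates,
        key=lambda name: cosine_similarity(known, word_counts(candidates[name])),
    )


# Made-up example texts.
known_x = "honestly the postcode thing is bonkers, honestly"
candidates = {
    "Y": "honestly this postcode lookup is bonkers",
    "Z": "market prices rose sharply today",
}
print(best_match(known_x, candidates))  # Y
```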

------
forgotmypwd123
Not a valid Postcode District...

Yes it bloody is...

Continuing anyway, I arrive at "NaN% of the time!"

~~~
Cynddl
Wups, what is your postcode? Feel free to let me know at luc (at) rocher (dot)
lc if you don't want to post it here.

We used the complete list of US/UK postcodes (edit: England and Wales only for
UK) to implement the demo, so all of them should be there.

~~~
Dayshine
Why not just use the full ONS UK postcode directory?
[http://geoportal.statistics.gov.uk/datasets/ons-postcode-dir...](http://geoportal.statistics.gov.uk/datasets/ons-postcode-directory-may-2017)

~~~
Cynddl
This is just a dataset of postcodes, no demographic information about the
population alas.

------
saghm
I don't think I understand the number I got from this; after putting in my
info, it says that if a company put in my demographic info and got a result,
it "would be me 39% of the time". Shouldn't that fraction be 1/n for some
positive integer n? Are certain people with the same demographic info more
likely to be returned than others? Or does the data service give close but not
exact matches some percentage of the time? What am I missing?

~~~
itcrowd
Intuitively, it would make sense to have a 1/n fraction: you must be one out
of n people in the database, right? However, I can think of two counter
arguments (correct me if I am wrong).

1) Consider that your date of birth + postcode appears twice in a dataset.
Naively, the operator can think to correctly identify you 50% of the time (you
are either entry #1 or entry #2). But, it is also possible that you are not
actually in the list. Maybe there are two other people in the area with your
DoB who are not in the database, thereby reducing the odds of guessing
correctly. You have to correct for this sampling bias. Also, the result is now
not necessarily in the form of 1/n anymore.

2) Consider that you now have two datasets that _together_ contain the whole
population (but there is overlap between the datasets that is difficult /
impossible to remove). Your DoB + postcode appears 3 times in the first set
and 5 times in the second dataset. Then ask again: what are the odds of
identifying you from the 8 candidates? Again, this does not give a solution in
the form of 1/n.
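A toy model of the first argument shows how a non-1/n number can arise. This is not claimed to be the site's actual method. Assume each of the other N - 1 people in the area independently shares your attributes with probability p. If X of them do, a match is you with probability 1/(1 + X). Since X is unknown, the best available figure is the average of 1/(1 + X) over the binomial distribution of X. The names `population` and `p_same` are hypothetical.

```python
import math


def expected_match_probability(population: int, p_same: float) -> float:
    """E[1 / (1 + X)] with X ~ Binomial(population - 1, p_same)."""
    n = population - 1
    total = 0.0
    for k in range(n + 1):
        # Probability that exactly k other people share the attributes.
        pmf = math.comb(n, k) * p_same**k * (1 - p_same) ** (n - k)
        total += pmf / (k + 1)
    return total


def closed_form(population: int, p_same: float) -> float:
    """Closed form of the same expectation: (1 - (1 - p)^N) / (N * p)."""
    return (1 - (1 - p_same) ** population) / (population * p_same)


# Made-up numbers: 200 people in the area, 1% chance each shares your attributes.
print(expected_match_probability(200, 0.01))  # about 0.43
print(closed_form(200, 0.01))                 # same value
```

Each individual scenario gives a 1/n probability, but the average over scenarios (here roughly 0.43) is generally not the reciprocal of an integer.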

------
dijksterhuis
The company I worked for previously had access to a big dataset that was
“anonymised”.

There was one poor individual who was the only person who lives in a certain
post code area.

Ended up having to get the dataset owners to remove postcodes altogether (GDPR
had just come in and everyone was on high alert).

------
40four
I guess it makes sense after you think about it for a minute. This is an
illuminating exercise to be sure!

It's funny how in my head, it seems like it would be way less likely that you
could be identified with such a high degree of confidence just from those 3
pieces of info.

------
maxheadroom
None of the Belfast[0] codes are returning as valid; same for the rest of NI.

[0] -
[https://en.wikipedia.org/wiki/BT_postcode_area#Coverage](https://en.wikipedia.org/wiki/BT_postcode_area#Coverage)

~~~
Cynddl
Yep, we used a dataset by the ONS, which includes only England and Wales. This
is indeed not clear on the website, we'll fix that.

~~~
jmkni
You might find similar statistics for Northern Ireland from NISRA -
[https://www.nisra.gov.uk/support/official-statistics](https://www.nisra.gov.uk/support/official-statistics)

And for Scotland from NR Scotland -
[https://www.nrscotland.gov.uk/statistics-and-data](https://www.nrscotland.gov.uk/statistics-and-data)

------
SteveSmith16384
The scary thing is that this is what they call "anonymized".

Edit: changed.

------
hdfbdtbcdg
If the day is removed from the date would this be sufficient to make the data
anonymous or should the month also be removed?

~~~
dijksterhuis
Depends how many people in the same post/zip code have the same birth month,
are male/female, etc.

E.g. If there’s 1 female in post/zip code XXXX born in 09/79, doesn’t matter
if you’re only including the month. There’s a good chance you can work out who
she is.

It’s a sliding scale dependent on the other attributes in the dataset, and the
number of distinct values across all the attributes.

The only way to make data completely anonymous is to remove things like birth
date tbh.
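The sliding scale can be illustrated by counting how many records stay unique as the birth date is coarsened. The records below are entirely made up, and `coarsen` and `unique_fraction` are hypothetical helper names; it is only a sketch of the counting, not a real risk assessment.

```python
from collections import Counter

# Made-up records; dates are ISO "YYYY-MM-DD" strings.
people = [
    {"postcode": "XXXX", "sex": "F", "dob": "1979-09-03"},
    {"postcode": "XXXX", "sex": "M", "dob": "1979-09-17"},
    {"postcode": "XXXX", "sex": "M", "dob": "1979-09-21"},
    {"postcode": "XXXX", "sex": "M", "dob": "1979-02-10"},
    {"postcode": "XXXX", "sex": "F", "dob": "1980-09-03"},
]


def coarsen(dob, level):
    """Truncate an ISO date to day, month or year precision."""
    return {"day": dob, "month": dob[:7], "year": dob[:4]}[level]


def unique_fraction(records, level):
    """Fraction of records that are the only one with their attribute combination."""
    keys = [(r["postcode"], r["sex"], coarsen(r["dob"], level)) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


for level in ("day", "month", "year"):
    print(level, unique_fraction(people, level))
# day 1.0
# month 0.6
# year 0.4
```

Dropping the day helps the two men born in 09/79, but the one woman born in 09/79 is still unique, and stays unique even at year precision.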

~~~
NateEag
I'm no expert, but I think removing the postal code is the obvious way to
anonymize.

If you can't restrict to a specific geographic area, then who knows who you
are.

Pragmatically speaking, I noticed that in their "test toggling variables"
widget at the end, removing postal code instantly dropped my probability of
detection to 0, no matter what other fields were active.

So perhaps your geographic location is the information you should be most
paranoid about sharing.

------
m4r35n357
Couldn't enter a month of birth . . .

~~~
Freak_NL
At least your country of residence is an option.

> Where do you live?

> UK US

