
Lessons from NYC’s improperly anonymized taxi logs - vijayp
https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1
======
abalone
Whoa, whoa.... You're telling me there's a public data set of _all taxi trip
geolocation data with GPS precision_? That's f'ing insane!

I think there's a MUCH bigger privacy issue here than what the author focuses
on.

Couldn't you deduce many _passenger_ identities based on addresses? There are
a lot of scenarios where passenger identities could be effectively de-
anonymized just from the GPS data. You could then use this data set to
analyze their comings and goings.

1. For people who live alone in a single-family home, you can pretty much
completely track when and where they went by taxi. From this you can deduce a
lot about their interests, lifestyle, workplace and schedule, private life,
etc. It's profoundly invasive.

2. Even if there are a few people sharing an address, the other dropoff/pickup
point can be used to narrow down who it likely is, especially when combined
with other easily obtainable data.

For example, if you knew an employee (e.g. that cute barista) lived in a
certain neighborhood, you could track their trips to/from work and deduce
their home address.

Or if you knew there was only one senior citizen (or Muslim, etc.) living in a
building, a regular trip to a senior center (or mosque) would reveal when
their apartment is vacant.

Or if there's only one young man in a building, a single trip home from a gay
bar could out them.

Holy shit.. can you imagine someone just plotting all the trips from a single
gay bar? Listing off all the connected residential addresses? And not only
that, _any subsequent trips home from those addresses the next morning_?
Taking the walk of shame to a whole new level!

Likewise trips could be used to deduce affairs and other deceptions by fellow
residents. "You said you were working late, but the only taxi trip to our
building that night was from a bar."

This is just off the top of my head... I feel I could go on for hours listing
all the possible ways this data set could be exploited.

How is this not front page New York Times???

~~~
Hominem
I think the way cabs actually operate in NYC makes this practically impossible
unless you already have some details, such as the lat/lon of the pickup and
dropoff and the time of the stops.

I'm assuming the data is for yellow cabs and the new lime green "boro cabs"
you hail on the street, not "car service" cars where you schedule a pickup and
dropoff to specific addresses.

Most bars in Manhattan are storefronts in 3-4 story residential buildings.
There are apartments above and they are surrounded by other buildings with
apartments and businesses. I don't think you could identify a bar. Now strip
clubs, on the other hand, are required by law to be tucked away in isolated
locations, so it might be possible to identify one.

When you hail a cab, and many times when you get dropped off, it happens on a
corner, perhaps over a block away, where it is easier to find a free cab.

Most cabs are in Manhattan, where there aren't a lot of single-family homes.
Single-family homes in the outer boroughs will have almost no yellow cab
coverage for pickups, and finding a cab that will take you out of Manhattan
can be dicey, although I guess those lime green cabs are meant to address
that. Pickups in SI, the Bronx, and huge swaths of Brooklyn and Queens will be
almost nonexistent; people going from the outer boroughs will most likely use
a car service.

I will certainly be checking to see if I can identify any of my rides.

~~~
drglitch
This. If you've lived in Manhattan for more than a month, you'd know that
pickup and dropoff locations are not precise, specifically:

1) you never get a cab on quiet single-family condo streets - gotta get to the
corner of an avenue

2) cabbies often click the meter off half a block before you actually say
"stop right here please, between the drunken couple and the pile of garbage on
the left side". They do this so you pay and get out quicker, clearing the way
for another passenger.

3) There are a LOT of "skyscrapers" in manhattan, with 300+ apts in each

What WOULD be interesting is taking credit card logs of someone's cab payments
and cross-matching dropoff based on charge timestamp :)

~~~
abalone
Most of the comments here about pickup/dropoff accuracy and large buildings
suffer from the same logical flaw: "often" is not the same as "never".

With a comprehensive data set of literally _173 million_ trips, even if we
limit ourselves to precise locations in front of small buildings and
residences -- let's say a paltry 5% -- that's still over _8 million trips_.

That's more than enough to invade the privacy of a very large number of
people.

And that's just the low hanging fruit. With geolocation data you don't always
need precise location accuracy or small buildings to see identifying patterns.
Don't forget that time is also a very useful factor, and often precise to the
minute. E.g. trips departing after 1am within a half-block radius of the only
bar in that radius are more likely than not to be patrons. And trips arriving
at an apartment building at a particular time may be relatively rare, making
it easy to look up the single trip that matches it.

Thus, a neighbor or roommate who saw someone arrive and noted the time (or had
a security camera) might be able to deduce the bar that they visited, address
or block of the person they're dating, whether they were actually where they
said they were... That's one of a zillion scenarios. Precise address-to-
address trips are just the low hanging fruit.

------
Scaevolus
Anonymization projects should really invest in an hour of consulting time with
a cryptographer -- they would be able to spot these flaws instantly.

Nit: this is a lookup table, not a rainbow table. Rainbow tables involve a
clever optimization that compresses multiple passwords (in a chain) into a
single entry in the table, saving a great amount of disk space.
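
To make the nit concrete, here's a toy sketch of the chain trick (the hash,
the reduce function, the chain length, and the 4-digit-PIN password space are
all illustrative, not from any real table): only each chain's (start, end)
pair is stored, yet every password along the chain can still be cracked.

```python
import hashlib

CHAIN_LEN = 100

def H(p):
    return hashlib.md5(p.encode()).hexdigest()

def R(h, i):
    # Toy "reduce" function mapping a hash back into the password
    # space (4-digit PINs); real tables vary R by chain position.
    return str((int(h, 16) + i) % 10000).zfill(4)

def make_chain(start):
    # Only (start, end) is stored; the ~100 intermediate passwords
    # are recomputable, which is where the space saving comes from.
    p = start
    for i in range(CHAIN_LEN):
        p = R(H(p), i)
    return start, p

def crack(target, table):
    # `table` maps chain end -> chain start.
    for pos in range(CHAIN_LEN - 1, -1, -1):
        # Guess that `target` sits at step `pos` and roll to the end.
        p = R(target, pos)
        for i in range(pos + 1, CHAIN_LEN):
            p = R(H(p), i)
        start = table.get(p)
        if start is None:
            continue
        # Replay the matching chain to recover the preimage.
        q = start
        for i in range(CHAIN_LEN):
            if H(q) == target:
                return q
            q = R(H(q), i)
    return None
```

A plain lookup table would store all ~100 passwords per chain; here only two
short strings are kept per chain.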

~~~
harywilke
Thanks for your nit. I always wrongly associated rainbow tables with the
Rainbow Codes the British used to name their military projects.
[http://en.m.wikipedia.org/wiki/List_of_Rainbow_Codes](http://en.m.wikipedia.org/wiki/List_of_Rainbow_Codes)
[http://en.m.wikipedia.org/wiki/Rainbow_table](http://en.m.wikipedia.org/wiki/Rainbow_table)

------
salmonellaeater
These are old lessons: in 2006 AOL [1][2] and Netflix [1][3] both released
datasets that were supposed to be anonymized but were easily de-anonymized.
There are older examples based on Census data[4]. It's difficult if not
impossible to release a dataset that is both useful and truly anonymized; in
Schneier's words:

 _The obvious countermeasures for this are, sadly, inadequate. Netflix could
have randomized its dataset by removing a subset of the data, changing the
timestamps or adding deliberate errors into the unique ID numbers it used to
replace the names. It turns out, though, that this only makes the problem
slightly harder. Narayanan's and Shmatikov's de-anonymization algorithm is
surprisingly robust, and works with partial data, data that has been
perturbed, even data with errors in it._

[1]
[https://www.schneier.com/blog/archives/2007/12/anonymity_and...](https://www.schneier.com/blog/archives/2007/12/anonymity_and_t_2.html)

[2]
[http://www.securityfocus.com/brief/286](http://www.securityfocus.com/brief/286)

[3]
[http://www.securityfocus.com/news/11497](http://www.securityfocus.com/news/11497)

[4]
[http://crypto.stanford.edu/~pgolle/papers/census.pdf](http://crypto.stanford.edu/~pgolle/papers/census.pdf)

------
lumpypua
Stop rainbow attacks peeps, salt your hashes.

[http://crypto.stackexchange.com/questions/1776/can-you-help-...](http://crypto.stackexchange.com/questions/1776/can-you-help-me-understand-what-a-cryptographic-salt-is)

~~~
bazzargh
That applies to passwords (where you _hope_ the data is fairly random and
unknown), but there are only 13,237 taxis in New York, and you can download
the list! You'd simply try each one. The author only took hours to crack the
list because he generated hashes for all possible medallion numbers instead
of using the list.

Also, even these numbers only apply to queries where you want to discover all
of the drivers for all of the data. It seems more likely to me that someone
would want to know who was driving a particular taxi at a particular time, or
what a particular driver was doing on a range of days. In both of these cases,
the number of records you need to deal with is massively reduced, and the
second attack implies you know the plaintext.

So no, salting doesn't help against abuses of this data when hashing is so
fast, and even using a slow KDF won't help much against the second attack.
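
For illustration, a minimal sketch of the attack (the medallion strings below
are made up; the real ~13,237-entry list is what you'd download): with such a
tiny plaintext space, building the full MD5 lookup table is essentially
instant.

```python
import hashlib

# Made-up medallion numbers standing in for the downloadable list
# of ~13,237 real ones.
medallions = ["1A23", "5D46", "9Y99"]

# hash -> medallion lookup table; at 13,237 entries this takes
# milliseconds to build, so even a public salt would only change
# the inputs, not the cost.
lookup = {hashlib.md5(m.encode()).hexdigest(): m for m in medallions}

def deanonymize(hashed_id):
    # Reverse an "anonymized" ID with a single dictionary lookup.
    return lookup.get(hashed_id)
```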

~~~
delroth
You're assuming that the salt would be public. In practice there is no reason
for it ever to be. As long as it stays private, it would be impossible to
reverse HASH(salt || taxi_id) back to the taxi ID.
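
A sketch of that idea using HMAC with a secret key, which is the standard way
to get a "private salt" (the function names here are mine): without the key,
an attacker can't rebuild the lookup table even knowing all 13,237 plaintexts.

```python
import hashlib
import hmac
import secrets

# Secret key held only by the data publisher and never released.
SECRET_KEY = secrets.token_bytes(32)

def pseudonymize(taxi_id):
    # Keyed hash: deterministic (same ID -> same pseudonym across
    # the whole dataset) but not reversible without SECRET_KEY.
    return hmac.new(SECRET_KEY, taxi_id.encode(), hashlib.sha256).hexdigest()
```

Note this only blocks the brute-force table; the linkage attacks discussed
elsewhere in the thread still apply, since pseudonyms stay consistent.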

------
ddlatham
Suppose they had generated a random unique ID for each driver and used that
instead of a hash throughout. If you had a record of a single ride you made
with a taxi driver, you could still find that ride in the database (start
location, time, stop location, time). Then you can take your driver's ID and
track all other trips that driver has made. Is that truly anonymous?
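
A toy sketch of that linkage (the fields and values are invented, not the
real schema): a single known ride is enough to tie a rider's knowledge to the
driver's random ID, and from there to every other trip by that driver.

```python
# Invented trip records keyed by a random per-driver ID.
trips = [
    {"driver": "a1b2", "pickup": (40.75, -73.99), "time": "2013-06-01T09:00"},
    {"driver": "a1b2", "pickup": (40.70, -74.01), "time": "2013-06-01T23:30"},
    {"driver": "c3d4", "pickup": (40.80, -73.95), "time": "2013-06-01T09:05"},
]

# I remember one ride I took; that single record reveals the
# driver's otherwise-random ID...
mine = next(t for t in trips
            if t["pickup"] == (40.75, -73.99) and t["time"] == "2013-06-01T09:00")

# ...which then links every other trip that driver made.
history = [t for t in trips if t["driver"] == mine["driver"]]
```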

~~~
vijayp
That's a very interesting question. In your example, finding the identity of
one or two drivers might easily be possible, but finding the identities of
many drivers would still be very difficult. I guess whether it's anonymous or
not depends on how strictly you define anonymity.

~~~
karussell
Really interesting. It also depends on what other datasets are available.

------
TyrantDevice
I'm sure Uber is busily crunching all of this data and will use it to figure
out how to efficiently destroy the remaining taxi cab drivers.

------
walterbell
Would you consider anonymizing the data properly and re-publishing as
canonical torrent for future analysis?

~~~
laurencei
Is there any real point now, though? The raw data is publicly available.

So anyone who _wants_ to reverse the anonymized fields and get the underlying
driver is free to do so.

Meanwhile, anyone who is not interested in those fields can just leave them
alone?

~~~
walterbell
Information diffusion is a function of time. Some people have the data now.
Many more will use the data over time. Most of those new data users will
simply click on a link in a blog or google or HN. The data may also be stored
in a canonical open-data location. Each of those instances can have anonymized
data.

What's the difference between distributing open-source with a known
vulnerability and distributing open-data that knowingly violates the privacy
of many people? If this was source code, there would be "responsible
disclosure" that allowed the software author time to issue a new release of
software. One could similarly work with NYC citygov digital team to anonymize
the data properly and have them reissue an official dump, possibly with
additional data from 2014. That would provide some incentive for developers to
use the newer data.

Yes, malicious analysts can find the old data. But that is no reason for non-
malicious analysts to keep replicating data that violates privacy. If this
were data where the loss of privacy had significant financial or legal
consequences, then naive data distributors and analysts would be inadvertently
contributing to those consequences.

One should try to do the right thing, even if it seems technically pointless.
In this case, working with the people who shared the data to fix the mistake.
Otherwise, one could imagine future citygov publication requiring much slower
and more expensive review of data to be released, e.g. by lawyers who still
won't find the next technical mistake. It's in the interest of all parties to
make this particular instance right, to ensure future openness of privacy-
protecting data.

~~~
vijayp
Yeah, this is a really good point. I'm going to try to reach out to someone in
the government on Monday. I don't really have many contacts over there, so if
anyone has suggestions on how to navigate the bureaucracy, I'm all ears.

~~~
walterbell
Might be worth trying the email address on the page of NYC Digital:

digital@cityhall.nyc.gov
[http://www.nyc.gov/html/digital/html/about/contact.shtml](http://www.nyc.gov/html/digital/html/about/contact.shtml)

------
evan_
At least they tried to anonymize the data. Someone in my hometown recently
filed a FOIA request for information about schoolteachers' pension plans and
the district gave him a straight dump of the database which included the
Social Security Number of every teacher in the district.

------
ghuntley
There's some interesting analysis going on over at:

[http://www.reddit.com/r/bigquery/comments/28ialf/173_million...](http://www.reddit.com/r/bigquery/comments/28ialf/173_million_2013_nyc_taxi_rides_shared_on_bigquery/)

------
rickyc091
Can someone explain to me what the real privacy concern is? The way I see it,
the drivers are on the job. To me it seems the same as mapping out the route a
bus driver took. It's not like the passengers' information is being made
public.

~~~
dewey
Cabs don't follow a predefined route the way a bus does. They pick people up
at home or work, which makes it relatively easy to extract information about
their personal lives.

------
ecesena
> creating a secret AES key, and encrypting each value individually

This doesn't sound like a good choice. It's security through obscurity.

~~~
gengkev
(In my unqualified opinion,) I don't think this qualifies as security through
obscurity. It seems to be just as secure as using AES for encryption: a secret
key produces a ciphertext that an attacker can read, without being able to
decrypt it.

------
khaki54
Hash without salt, you're at fault; use a nonce, you're not a dunce.

------
mihai_maruseac
Differential Privacy might help.
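
The core trick, sketched minimally (the parameter choices are illustrative):
publish noisy aggregate counts rather than raw trips, with Laplace noise
scaled to the query's sensitivity.

```python
import math
import random

def laplace(scale):
    # Inverse-CDF sample from the Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon):
    # A count query has sensitivity 1 (adding or removing one trip
    # changes it by at most 1), so Laplace noise with scale
    # 1/epsilon gives epsilon-differential privacy for this query.
    return true_count + laplace(1.0 / epsilon)
```

Each published statistic spends some of the privacy budget epsilon, so the
number of queries has to be bounded too.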

