Lessons from NYC’s improperly anonymized taxi logs (medium.com/vijayp)
160 points by vijayp on June 21, 2014 | 56 comments

Whoa, whoa.... You're telling me there's a public data set of all taxi trip geolocation data with GPS precision? That's f'ing insane!

I think there's a MUCH bigger privacy issue here than what the author focuses on.

Couldn't you deduce many passenger identities based on addresses? There's a lot of scenarios where passenger identities could be effectively de-anonymized, just based on GPS data. You could then use this data set to analyze their comings and goings.

1. For people who live alone in a single family home, you can pretty much completely track when and where they went by taxi. From this you can deduce a lot about their interests, lifestyle, workplace and schedule, private life, etc. It's profoundly invasive.

2. Even if there's a few people sharing an address, the other dropoff/pickup point can be used to narrow down the likelihood of who it is, especially when combined with other easily obtainable data.

For example if you knew an employee (e.g. that cute barista) lived in a certain neighborhood you could track their trips to/from work and deduce their home address.

Or if you knew there was only one senior citizen (or Muslim, etc.) living in a building, a regular trip to a senior center (or mosque) would reveal when their apartment is vacant.

Or if there's only one young man in a building, a single trip home from a gay bar could out them.

Holy shit.. can you imagine someone just plotting all the trips from a single gay bar? Listing off all the connected residential addresses? And not only that, any subsequent trips home from those addresses the next morning? Taking the walk of shame to a whole new level!

Likewise trips could be used to deduce affairs and other deceptions by fellow residents. "You said you were working late, but the only taxi trip to our building that night was from a bar."

This is just off the top of my head.. I feel I could go on for hours listing all the possible ways this data set could be exploited.

How is this not front page New York Times???

> Holy shit.. can you imagine someone just plotting all the trips from a single gay bar? Listing off all the connected residential addresses? And not only that, any subsequent trips home from those addresses the next morning? Taking the walk of shame to a whole new level!

There are some weird assumptions going on here, in addition to the fact that you grossly overestimate the precision to which GPS data can de-anonymize individuals who are using a shared, public transport mechanism in a city as densely crowded as NYC. The density of people and businesses alone makes individual identification difficult, not to mention weak GPS signals and low accuracy with skyscrapers every hundred yards. Is there any evidence that the logs have enough accuracy to do what you're claiming, or are you just wildly speculating?

What is the weird assumption? That GPS is precise enough to identify addresses? Not all of NYC has skyscrapers every hundred yards.. that's only portions of Manhattan. It's 5 boroughs.

Oh my god. You really think there are poor people in Brooklyn or the Bronx who, nevermind being able to afford a place in which they are the only people within 50m of 10 other families, can regularly go into the city at a given time in a taxi? The commute traffic into Manhattan is hellish -- who in their right mind would use it to get to work? A taxi is not a replacement for a car, it is a luxury, and it is expensive.

Another poor assumption: that people get dropped off right in front of their homes. Uh, no. I take taxis several times a month to get home. I'm never, never able to get dropped off right in front of my apartment, because traffic is dense enough and the one-way streets add another $0.50 to $1, when I can just hop off and take a pleasant two-minute walk from the corner.

edit: And another poor assumption: that there is one easily locatable gay bar within a vicinity of where people hail a taxi (again, it won't always, or even mostly, be in front of the exact place they just walked out of) or that, after leaving a bar, people go straight to their own home.

Plenty of people take taxis between Manhattan and Brooklyn. A significant percentage of rides home to Brooklyn from Manhattan bars at night are in cabs, for example. The fact that you think everyone in Brooklyn is "poor" or that only rich people take cabs shows how inaccurate your knowledge of NYC is.

I've tried to address your other concerns in other comments. Just because you don't get dropped off at your place doesn't mean no-one else in NYC does ever.

I think the way cabs actually operate in NYC makes this practically impossible unless you already have some details such as the lat/Lon of dropoff and pickup and time of the stops.

I'm assuming the data is for yellow cabs and the new lime green "boro cabs" you hail on the street, not "car service" cars where you schedule a pickup and dropoff to specific addresses.

Most bars in Manhattan are storefronts in 3-4 story residential buildings. There are apartments above and they are surrounded by other buildings with apartments and businesses. I don't think you could identify a bar. Now strip clubs on the other hand are required by law to be tucked away in isolated locations. Might be possible to identify a strip club.

When you hail a cab, and many times when you get dropped off, it happens on a corner, perhaps over a block away, where it is easier to find a free cab.

Most cabs are in Manhattan, where there are not a lot of single family homes. Single family homes in the outer boroughs will have almost no yellow cab coverage for pickup, and finding a cab that will take you out of Manhattan can be dicey, although I guess those lime green cabs are meant to address that. In SI, the Bronx and huge swaths of Brooklyn and Queens, pickups will be almost nonexistent; people going from the outer boroughs will most likely use a car service.

I will certainly be checking to see if I can identify any of my rides.

This. If you've lived in Manhattan for more than a month, you'd know that pickup and dropoff locations are not precise, specifically:

1) you never get a cab on quiet single-family condo streets - gotta get to corner of an avenue

2) cabbies often click meter to off half a block before you actually say "stop right here please, between the drunken couple and the pile of garbage on the left side". They do this so you pay and get out quicker, clearing way for another passenger.

3) There are a LOT of "skyscrapers" in Manhattan, with 300+ apts in each

What WOULD be interesting is taking credit card logs of someone's cab payments and cross-matching dropoff based on charge timestamp :)

Most of the comments here about pickup/dropoff accuracy and large buildings suffer from the same logical flaw: "often" is not the same as "never".

With a comprehensive data set of literally 173 million trips, even if we limit ourselves to precise locations in front of small buildings and residences -- let's say it's a paltry 5% -- that's still more than 8 million trips.

That's more than enough to invade the privacy of a very large number of people.

And that's just the low hanging fruit. With geolocation data you don't always need precise location accuracy or small buildings to see identifying patterns. Don't forget that time is also a very useful factor, and often precise to the minute. E.g. trips departing after 1am within a half-block radius of the only bar in that radius are more likely than not to be patrons. And trips arriving at an apartment building at a particular time may be relatively rare, making it easy to look up the single trip that matches it.

Thus, a neighbor or roommate who saw someone arrive and noted the time (or had a security camera) might be able to deduce the bar that they visited, address or block of the person they're dating, whether they were actually where they said they were... That's one of a zillion scenarios. Precise address-to-address trips are just the low hanging fruit.
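
The time-plus-location matching described above can be sketched roughly as follows. This is a minimal illustration, not something anyone in the thread actually ran: the `Trip` record, its field names, and the default radius and time window are all assumptions made up for the example, and the real files would need parsing first.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Trip:
    # Hypothetical record; field names are assumptions, not the data set's headers.
    pickup_time: datetime
    pickup_lat: float
    pickup_lon: float
    dropoff_time: datetime
    dropoff_lat: float
    dropoff_lon: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def dropoffs_near(trips, lat, lon, when, radius_m=50.0, window=timedelta(minutes=5)):
    """Trips whose drop-off is within radius_m of (lat, lon) and within window of when."""
    return [
        t for t in trips
        if abs(t.dropoff_time - when) <= window
        and haversine_m(t.dropoff_lat, t.dropoff_lon, lat, lon) <= radius_m
    ]
```

Someone who saw a cab pull up at a known time would call `dropoffs_near` with the building's coordinates and that time; if only one trip comes back, its pickup point is exposed.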

> strip clubs on the other hand are required by law to be tucked away in isolated locations

Really? I used to live across the street from one in Manhattan. It wasn't an illicit club, and I didn't live in some sort of squat house. Dancers and patrons going there would be indistinguishable from people going to my building, apart from the address being off by one.

Behold, a "Gentlemen's Cabaret" club and a porn shop between a falafel place, camera store, burger joint, and residential flats: https://maps.google.com/maps?ll=40.758115,-73.989143&spn=0.0...

> unless you already have some details such as the lat/Lon of dropoff and pickup and time of the stops.

These are literally columns in the data set. To quote the original post:

Each file has about 14 million rows, and each row contains medallion, hack license, vendor id, rate code, store and forward flag, pickup date/time dropoff date/time, passenger count, trip time in seconds, trip distance, and latitude/longitude coordinates for the pickup and dropoff locations. [1]

You're right that a lot of cab pickups/dropoffs happen a few doors down from the actual location, and that there aren't a lot of single family homes in Manhattan. But that doesn't negate what I'm saying. Even if only 20% of rides involve the actual location, that's still an awful lot of potential privacy violations. And even if there are zero single family homes involved, that was only the first scenario of numerous ones I mentioned.

[1] http://chriswhong.com/open-data/foil_nyc_taxi/

What I meant was if you already know the time and location, say you know what time the barista left work and the lat/Lon of the coffee shop. It would just be a matter of matching it up with the data in the table to find the drop off.

What I don't think is practical is identifying everyone who may or may not have left a particular bar.

> What I don't think is practical is identifying everyone who may or may not have left a particular bar.

This is weaker than your original claim which was that deducing passenger identities is "practically impossible". You've now conceded that the barista scenario is plausible and left open several others I mentioned.

But let's examine this one, just bars. The basis of your criticism is that some bars are located in residential buildings. First off this still leaves quite a number of bars that aren't. But even for those that are, the time of day and direction of travel is a pretty fair indicator of people who are bar patrons vs. residents. I.e. trips departing the building after 1am and arriving at a residential location are probably a lot more likely to be bar patrons than residents.

And don't forget that this public data set is also potentially privacy-violating when combined with other data about the destination, such as information that other residents of that location may know. So even if the general public couldn't determine much from a trip from a gay bar to a home residence one night, a live-in parent could.

Do you live in New York?

1. New York taxis are hailed from the street. AFAIK, you can't call a medallion taxi to your home for pickup. So that makes impractical the idea of a single-family home (which is pretty damn rare in New York) being narrowed as the sole user of a particular taxi.

2. Um, no, again. First of all, are you fucking kidding me? A barista, taking a taxi on a daily, or even weekly basis, in the city with the U.S.'s most comprehensive subway and bus system? A barista.

But I'm not being nit-picky here, taking taxis on a regular basis is within the means of the rich, only. And how many rich people do you think live in non-dense areas (i.e. areas in which 100-500 people could be within a certain lat/long)? And how many people in that income bracket would not take a private car? Do you really imagine there to be a significant number of New Yorkers who take taxis at a regular time, to a regular place, from a residence in which there aren't dozens, if not hundreds of people, within a 50m radius?

Again, you realize gay bars are in densely populated areas, and a taxi right in front of a gay bar could be determined to come from a large number of bars, nevermind that it is not always the case that you call for a taxi in front of the place you just stepped out. Sometimes you call it from a cross-street, or up-street to better your chances.

And of course there's the detail of the delayed release time of these records.

It's as if you took all the stock privacy-violation concerns with surveillance and applied them to a situation in which the real-world details don't make any sense.

It would be interesting to see how many GPS pairs map to a single address not at an intersection. I would guess most people don't go directly to their address but to the nearest cross streets since it's cheaper and faster.

> How is this not front page New York Times???

Because it's unproven.

If no one beats me to it, I'll grab a dump of the data and look myself.

But in the abstract, I agree; there's a non-zero risk of being able to identify passengers from the logs. This makes for an interesting ethical problem: should this data have been made available? It becomes much, much less useful without those coordinates.

Which is more important? Government transparency or citizen privacy?

I don't even get how this is government transparency. That's like saying everyone's cell phone records ("just the metadata") should be published, minus names, because telecom is regulated.

Government transparency is largely about opening the data they have available to them up to the public in a consumable way. FOIL requests like the one Whong filed are essentially the second-oldest way they have done this. (The oldest being, well, just asking directly.)

So, yes. It can definitely be argued that cell phone records, insofar as they are shared with the government in the first place, should be published. A simple counterargument would be that cellphones do not consume a public resource, unlike the way taxis consume road space, so there is no reason to share said data with the government and consequently the public.

But as a rule of thumb, government transparency is that any data the government has ultimately belongs to the public. It's not a matter of regulation.

>A simple counterargument would be that cellphones do not consume a public resource, unlike the way taxis consume road space, so there is no reason to share said data with the government and consequently the public.

That argument would be incorrect, since all cell phones run over a public resource: the airwaves. These airwaves are licensed by the FCC to private companies, but (in theory at least) the public retains ownership of those airwaves.

Can Uber be required to make a comparable disclosure?

This is transport network packet "metadata".

No. Corporations are sacrosanct because private property is holy ground upon which the public may not tread unless invited.

That said, Uber certainly has the data. I'm sure we'll have another weev to expose it for us.

Pretty sure cab companies in NYC are also private.

Yes, but this information didn't come from cab companies. It came from the NYC Taxi & Limo Commission: http://www.nyc.gov/html/tlc/html/home/home.shtml

And they got it without treading on the cab companies' property, how?

I don't know. Ask them.

Oh, but I can tell you: the Commission pushed it, despite protests and even lawsuits from the private cab owners (see Taxi & Limousine Commission v. Hassan El-Nahal).

I was just pointing out that this very story disproves your claim that "private property is holy ground upon which the public may not tread unless invited", since that's exactly what happened here, and so I don't see why Uber couldn't also be "convinced" to install the trackers.

Ah, in that case, you're right. We should absolutely require Uber to divulge similar information.

Anonymization projects should really invest in an hour of consulting time with a cryptographer-- they would be able to see these flaws instantly.

Nit: this is a lookup table, not a rainbow table. Rainbow tables involve a clever optimization that compresses multiple passwords (in a chain) into a single entry in the table, saving a great amount of disk space.

Thanks for your nit. I always wrongly associated rainbow tables with the Rainbow Codes the British used to name their military projects. http://en.m.wikipedia.org/wiki/List_of_Rainbow_Codes http://en.m.wikipedia.org/wiki/Rainbow_table

These are old lessons: in 2006 AOL [1][2] and Netflix [1][3] both released datasets that were supposed to be anonymized but were easily de-anonymized. There are older examples based on Census data[4]. It's difficult if not impossible to release a dataset that is both useful and truly anonymized; in Schneier's words:

The obvious countermeasures for this are, sadly, inadequate. Netflix could have randomized its dataset by removing a subset of the data, changing the timestamps or adding deliberate errors into the unique ID numbers it used to replace the names. It turns out, though, that this only makes the problem slightly harder. Narayanan's and Shmatikov's de-anonymization algorithm is surprisingly robust, and works with partial data, data that has been perturbed, even data with errors in it.

[1] https://www.schneier.com/blog/archives/2007/12/anonymity_and...

[2] http://www.securityfocus.com/brief/286

[3] http://www.securityfocus.com/news/11497

[4] http://crypto.stanford.edu/~pgolle/papers/census.pdf

Stop rainbow attacks peeps, salt your hashes.


That applies to passwords (where you hope the data is fairly random, and unknown) but there are only 13,237 taxis in New York, and you can download the list! You'd simply try each one. The author only took hours to crack the list because he generated hashes for all possible medallion numbers, without using the list.
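
A rough sketch of that point: when the set of possible plaintexts is small and known, one hash per candidate is enough to reverse every record. The thread doesn't name the hash function, so MD5 is an assumption here, and the medallion strings are invented placeholders rather than entries from the real list.

```python
import hashlib


def build_lookup(candidates):
    """Map unsalted hash -> plaintext for every known candidate ID."""
    # MD5 is an assumption for illustration; any fast unsalted hash behaves the same way.
    return {hashlib.md5(c.encode("ascii")).hexdigest(): c for c in candidates}


def deanonymize(hashed_ids, lookup):
    """Replace each hashed ID with its plaintext, or None where the table has no match."""
    return [lookup.get(h) for h in hashed_ids]


# Invented placeholder medallions, standing in for the downloadable list.
known_medallions = ["1A23", "9Y99", "5X55"]
lookup = build_lookup(known_medallions)

# Two "published" hashed IDs; the second plaintext is not in the candidate list.
published = [hashlib.md5(b"9Y99").hexdigest(), hashlib.md5(b"0Z00").hexdigest()]
recovered = deanonymize(published, lookup)
```

With a list of roughly thirteen thousand candidates the table is built almost instantly, which is why a public salt or a slower hash changes little here.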

Also, even these numbers only apply to queries where you want to discover all of the drivers for all of the data. It seems more likely to me that someone would want to know who was driving a particular taxi at a particular time, or what a particular driver was doing on a range of days. In both of these cases, the number of records you need to deal with is massively reduced, and the second attack implies you know the plaintext.

So no, salting doesn't help against abuses of this data when hashing is so fast, and even using a slow KDF won't help much against the second attack.

You're assuming that the salt would be public. In practice there is no reason for it to ever be. As long as it stays private, it would be impossible to reverse HASH(salt || taxi_id) back to the taxi id.
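
One way to read HASH(salt || taxi_id) with a private salt is as a keyed hash. The sketch below uses HMAC-SHA256 from Python's standard library in place of plain concatenation, which is a swapped-in technique rather than what the comment specifies; the key and the ID are placeholders. Note that it still maps the same taxi to the same token every time, so records for one taxi remain linkable to each other.

```python
import hashlib
import hmac
import secrets


def pseudonymize(taxi_id: str, secret_key: bytes) -> str:
    """Keyed hash of an ID; without secret_key an attacker cannot recompute it for candidates."""
    return hmac.new(secret_key, taxi_id.encode("utf-8"), hashlib.sha256).hexdigest()


# The key is generated once and never published (stand-in for the private salt).
secret_key = secrets.token_bytes(32)

token_a = pseudonymize("9Y99", secret_key)
token_b = pseudonymize("9Y99", secret_key)  # same key, same ID: same token
token_other_key = pseudonymize("9Y99", secrets.token_bytes(32))  # different key
```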

The author's approach doesn't actually make use of a rainbow table; he just generates a flat hash -> plaintext map. A rainbow table is a specific way of representing such a map that is more compact but takes more computation time to "access" (and also may be incomplete).

Right, same attack with a less sophisticated data structure.

Suppose they had generated a random unique ID for each driver and used that instead of a hash throughout. If you had a record of a single ride you made with a taxi driver, you could still find that ride in the database (start location, time, stop location, time). Then you can take your driver's ID and track all other trips that driver has made. Is that truly anonymous?
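
That linkage attack can be sketched as below. Everything here is hypothetical: the row layout, the random driver IDs, and the sample rides are invented to show the mechanism, not taken from the data set.

```python
from collections import namedtuple

# Hypothetical row layout; driver_id is a random pseudonym, not a hash.
Ride = namedtuple("Ride", "driver_id pickup_time pickup_loc dropoff_time dropoff_loc")

rides = [
    Ride("d-7f3a", "2013-01-05 01:10", (40.7581, -73.9891), "2013-01-05 01:32", (40.6782, -73.9442)),
    Ride("d-7f3a", "2013-01-05 02:05", (40.6800, -73.9500), "2013-01-05 02:20", (40.7000, -73.9900)),
    Ride("d-c219", "2013-01-05 01:15", (40.7300, -73.9900), "2013-01-05 01:25", (40.7400, -73.9800)),
]


def trips_by_same_driver(rides, known_ride):
    """Find the pseudonym on one known ride, then return every ride carrying it."""
    matches = [
        r for r in rides
        if (r.pickup_time, r.pickup_loc, r.dropoff_time, r.dropoff_loc) == known_ride
    ]
    if len(matches) != 1:
        return []  # ambiguous or not found: no pseudonym to pivot on
    driver = matches[0].driver_id
    return [r for r in rides if r.driver_id == driver]


# The passenger knows the details of their own ride (the first row above).
my_ride = ("2013-01-05 01:10", (40.7581, -73.9891), "2013-01-05 01:32", (40.6782, -73.9442))
linked = trips_by_same_driver(rides, my_ride)
```

The random ID hides who the driver is, but one known ride is enough to pull out that driver's whole history.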

That's a very interesting question. In your example, finding the identity of one or two drivers might be possible easily, but finding the identity of many drivers would still be very difficult. I guess whether it's anonymous or not depends upon how strictly you think anonymity is defined.

Really interesting. It also depends what other datasets are available

I'm sure Uber is busily crunching all of this data and will use it to figure out how to efficiently destroy the remaining taxi cab drivers.

Would you consider anonymizing the data properly and re-publishing as canonical torrent for future analysis?

Is there any real point now though? The raw data is publicly available.

So anyone who wants to remove the anonymous fields and get the underlying driver is free to do so.

Meanwhile anyone who is not interested in the anonymous fields can just leave it alone?

Information diffusion is a function of time. Some people have the data now. Many more will use the data over time. Most of those new data users will simply click on a link in a blog or google or HN. The data may also be stored in a canonical open-data location. Each of those instances can have anonymized data.

What's the difference between distributing open-source with a known vulnerability and distributing open-data that knowingly violates the privacy of many people? If this was source code, there would be "responsible disclosure" that allowed the software author time to issue a new release of software. One could similarly work with NYC citygov digital team to anonymize the data properly and have them reissue an official dump, possibly with additional data from 2014. That would provide some incentive for developers to use the newer data.

Yes, malicious analysts can find the old data. But that is no reason for non-malicious analysts to keep replicating data that violates privacy. If this were data where the loss of privacy had significant financial or legal consequences, then naive data distributors and analysts would be inadvertently contributing to those consequences.

One should try to do the right thing, even if it seems technically pointless. In this case, that means working with the people who shared the data to fix the mistake. Otherwise, one could imagine future citygov publication requiring much slower and more expensive review of data to be released, e.g. by lawyers who still won't find the next technical mistake. It's in the interest of all parties to make this particular instance right, to ensure future openness of privacy-protecting data.

> What's the difference between distributing open-source with a known vulnerability and distributing open-data that knowingly violates the privacy of many people?

The difference is that software is something people choose to use and update while data is something other people have. The only use of a properly anonymised version of this dataset would be for whitehats who would not do malicious things with the current version either.

Yeah, this is a really good point. I'm going to try to reach out to someone in the government on Monday. I don't really have many contacts over there, so if anyone has suggestions on how to navigate the bureaucracy, I'm all ears.

Might be worth trying the email address on the page of NYC Digital:

digital@cityhall.nyc.gov http://www.nyc.gov/html/digital/html/about/contact.shtml

I'd recommend talking to Chris Whong and seeing if he has any advice, actually.

I'm not sure exactly what good it would do since the data are already out there. I guess I could re-anonymize the ids and see if I could get the links replaced.

At least they tried to anonymize the data. Someone in my hometown recently filed a FOIA request for information about schoolteachers' pension plans and the district gave him a straight dump of the database which included the Social Security Number of every teacher in the district.

There's some interesting analysis going on over at:


Can someone explain to me what the real privacy concern is? The way I see it, the drivers are on a job. To me it seems the same as mapping out the route which a bus driver took. It's not like the passenger's information is being made public.

Cabs are not always taking a predefined route like a bus driver. They are picking people up at home/work which makes it relatively easy to extract information about their personal life.

> creating a secret AES key, and encrypting each value individually

This doesn't sound like a good choice. It's security through obscurity.

It should be just as secure: not revealing the AES key you use to encrypt data is about the same as not revealing the seed of the random number generator you use to randomly generate ids.

(In my unqualified opinion,) I don't think this qualifies as security through obscurity. It seems to be just as secure as using AES for encryption: a secret key produces a ciphertext that an attacker can read, without being able to decrypt it.

Hash without salt, you're at fault; use a nonce, you're not a dunce.

Differential Privacy might help.
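
For readers unfamiliar with the term, one common differential-privacy building block is the Laplace mechanism, which adds calibrated noise to aggregate answers instead of releasing raw rows. The sketch below only illustrates that idea for a single count query; the epsilon value and the count are arbitrary, and whether this would keep the taxi data useful is not something the thread establishes.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise; a count query has sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only so the example is repeatable
released = noisy_count(true_count=120, epsilon=0.5, rng=rng)
```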
