Passenger Privacy in the NYC Taxicab Dataset (neustar.biz)
132 points by rgrzywinski on Sept 16, 2014 | 24 comments

Amazing. If you said to someone "Hey, I wanted to know where you went after the cab picked you up last year, so I called up the cab company and asked them where they dropped you off and they told me", they would be outraged at (your behavior and) the breach of privacy shown by the cab company. But the city released a dataset that allows exactly this query. What were they thinking?

Something else that could be mentioned in the linked article: suppose someone you were with got in a cab in 2013 and told you where they were going, and you remember the approximate time and location. You can then tell whether that was their true destination, regardless of how many other people were being picked up at the time. You don't have to find the exact ride they took; you only have to see whether any ride went to the place they told you.

This search is also highly resistant to the differential privacy suggested by the post's authors. I'd be much happier simply stating that location data cannot be de-identified, and that no one should use a cab in a city that logs location data unless they are happy with an adversary knowing where they went.
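To make the destination check above concrete, here is a minimal Python sketch. It assumes the trips have already been loaded into records with a pickup time and pickup/dropoff coordinates. The `Trip` class, the function names, the 15-minute window, and the 150 m radius are all illustrative assumptions, not anything taken from the dataset or the linked article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from typing import Iterable, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


@dataclass
class Trip:
    # Hypothetical record layout; the real files use their own column names.
    pickup_time: datetime
    pickup: Point
    dropoff: Point


def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))


def any_trip_matches(
    trips: Iterable[Trip],
    when: datetime,
    pickup: Point,
    claimed_dropoff: Point,
    window: timedelta = timedelta(minutes=15),  # assumed tolerance
    radius_m: float = 150.0,                    # assumed tolerance
) -> bool:
    """True if ANY trip left near `pickup` around `when` and ended near `claimed_dropoff`.

    Note that this never identifies the specific ride; it only asks whether
    the claimed destination is consistent with at least one recorded trip.
    """
    return any(
        abs(t.pickup_time - when) <= window
        and haversine_m(t.pickup, pickup) <= radius_m
        and haversine_m(t.dropoff, claimed_dropoff) <= radius_m
        for t in trips
    )
```

If this returns False, no recorded ride fits the story you were told. If it returns True, the story is merely plausible, since someone else's ride could have produced the match.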

What I wondered about that data set is this: if two people living or working at two locations consistently take taxis to meet at various other locations at the same times, could that pattern be identified in the data?

That is, are there locations A and B such that there are matching trips to locations M1,M2,... at times T1,T2,... i.e. (A,M1,T1),(B,M1,T1),(A,M2,T2),(B,M2,T2) and perhaps reverse trips (M1,A,T1+x) etc?

Further classification of M* -- hotels, for example -- could classify the nature of the meetings. You might be able to identify the addresses of people having affairs, or other deliberately secret rendezvous.
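A rough Python sketch of the (A,M1,T1),(B,M1,T1),... search described above, for anyone who wants to see how little code it takes. It assumes each trip has already been reduced to an (origin, destination, time bucket) triple, for example by snapping coordinates to a grid cell and rounding dropoff times; that preprocessing, the function name, and the `min_meetings` threshold are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def recurring_meetings(trips, min_meetings=2):
    """Find origin pairs whose trips repeatedly end at the same place and time.

    trips: iterable of (origin, destination, time_bucket) triples.
    Returns {(A, B): [(destination, time_bucket), ...]} for every pair of
    origins that share at least `min_meetings` distinct arrivals.
    """
    # Group origins by where and when their trips ended.
    arrivals = defaultdict(set)
    for origin, destination, bucket in trips:
        arrivals[(destination, bucket)].add(origin)

    # For every pair of origins arriving together, record the shared arrival.
    shared = defaultdict(set)
    for meeting, origins in arrivals.items():
        for a, b in combinations(sorted(origins), 2):
            shared[(a, b)].add(meeting)

    return {
        pair: sorted(meetings)
        for pair, meetings in shared.items()
        if len(meetings) >= min_meetings
    }
```

The pair enumeration grows quadratically with the number of origins per arrival bucket, so in busy areas this would produce many coincidental pairs; raising `min_meetings` is the obvious way to filter for a sustained pattern. Reverse trips (M1,A,T1+x) could be handled by running the same function with origin and destination swapped.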

This would be relatively difficult for Manhattan; some parts of the outer boroughs, though, are a different matter.

I was concerned when this first hit HN because I have a friend who lives in a fairly sparsely populated part of town, and his (now ex) girlfriend has a possessive ex-husband who doesn't like her seeing other men. He isn't going to be able to make sense of the data himself, but if someone weaponizes it the way you are talking about, it could be a real problem for people with stalkers/psycho-exes.

I wonder if you could build a small, money-generating business from answering exactly that. If you could get more recent data dumps at some interval, you could even provide an email alert system.

Interesting. I wonder what other questions the data can answer.

Interesting read. The frequent visitors to gentlemen's clubs are probably dancers rather than patrons.

That's almost certainly true. For any longtime NYC resident who knows things[1] it's obvious from that map that the most prominent visible destination from the Hustler Club data is Sin City in the South Bronx.

[1] Specifically things like where notable strip clubs are

If true, that makes this data far more dangerous.

Dancers work on Wall Street?

downtown has some good values, you'd be surprised

I'd be very surprised if dancers were taking cabs to work

I just messaged a friend who used to work in the field, and she confirmed that a lot of her coworkers took cabs to work -- she clarified that there's often nowhere for the dancers to park at the venue (I imagine this would be particularly true in Manhattan) and taking the bus with your makeup on can be an unpleasant experience.

My good friend owns one of those places; his girls all either book a taxi or get picked up by their partners.

As an attractive girl, you do not want to be walking or taking public transport in certain areas of town at 6am. It's a sad reality.

Even more terrifying is that it is going to be trivial to determine the pickup location just by reversing the trip.

Please show this as an example to people that say "Why should I care? I have nothing to hide".

Why? Taking public transit late at night would probably be a very bad idea.

In fairness to the celebrities accused of being cheapskates, I thought it was the case that the trip record in the dataset didn't include a tip amount if it was paid in cash.

Number of trips by tip percentage where payment_type='CSH': http://i.imgur.com/EJh1B2d.png

~85M rides with 0% tip is "interesting".
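For anyone wanting to reproduce a count like the one above, here is a hedged Python sketch. The `payment_type` column and the 'CSH' value come from the query quoted above; the `fare_amount` and `tip_amount` column names, and the choice to measure the tip as a percentage of the fare (rather than of the total), are assumptions, since the thread doesn't say how the chart was computed.

```python
from collections import Counter


def tip_percent_histogram(rows, payment_type="CSH"):
    """Count trips by whole-number tip percentage for one payment type.

    rows: iterable of dict-like records with 'payment_type', 'fare_amount'
    and 'tip_amount' keys (assumed column names).
    """
    counts = Counter()
    for row in rows:
        if row["payment_type"].strip() != payment_type:
            continue
        fare = float(row["fare_amount"])
        if fare <= 0:
            # Skip records where a percentage is undefined or meaningless.
            continue
        tip_pct = round(100 * float(row["tip_amount"]) / fare)
        counts[tip_pct] += 1
    return counts
```

The rows could come from `csv.DictReader` over the fare files. A large spike at 0% for cash rides would be consistent with cash tips simply not being recorded, rather than with riders not tipping.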

I don't know how the data is obtained, but it's more likely that the driver was tipped and the tip amount went unreported than that the actual tip was zero.

Yes, I believe that a cash tip is not included in the data. But compared to other TMZ news we figured it 'could' be interesting.

As a general fan of open data like this, I've been a little worried these analyses would lead to the data not being released in the future. Hopefully if they change anything in the future, it will still be useful and interesting.

This is an excellent (if troubling!) piece which deserves a wide audience. As we say in Sauf Effrica: bleddy lekker stuff!

Anyone know of an existing online front-end connected to Google Maps for this data?

It's pretty big (>5 GB). It is available for download here, though: https://archive.org/details/nycTaxiTripData2013

Outstanding, groundbreaking article.
