
40% of NYC Taxi Trips Are Uniquely Identified by Census Tracts and Hour - lil_tee
https://gist.github.com/toddwschneider/c79ea3272a631ee2fddf
======
danso
Very smart inquiry. I had initially been somewhat skeptical that the
potential danger to privacy outweighed the value of making the data as
transparent as possible, but that was just a guess based on my preconceived
notions of taxi use, which were already inadequate: it still blows my mind
how many taxi trips there are on an average day.

I think an argument can still be made that, even if the OP is right about the
number of trips that can be uniquely identified, the value of keeping the
coordinate data still outweighs the real-life privacy risk. That is, the small
number of people who would hire a private investigator/specialist to analyze
this data to catch a specific person would find it much faster to track the
person the way that PIs normally do. But the rebuttal can't simply be,
"uniquely identifiable trips are probably so rare as to be inconsequential."

------
rahimnathwani
"it turns out that if you know the census tracts for pickups and drop offs,
plus pickup times truncated to the nearest hour, then you can uniquely
identify 40% of NYC taxi trips"

Hmmm... but if you already have those pieces of information (start tract, end
tract, start hour) what would you want to get from the data? How much someone
paid? How much they tipped? Whether they paid with cash or card?

Can anyone see an obvious nefarious use for this data?
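
For anyone who wants to check a figure like the one quoted above on their own
data, here is a minimal sketch of the group-and-count idea. It is not the code
from the linked gist; the field names (pickup_tract, dropoff_tract,
pickup_time) and the toy rows are made up for illustration, and real data would
first need coordinates mapped to census tracts.

```python
from collections import Counter
from datetime import datetime


def fraction_unique(trips, key):
    """Return the share of trips whose key value occurs exactly once."""
    keys = [key(t) for t in trips]
    if not keys:
        return 0.0
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def tract_hour_key(trip):
    # Truncate the pickup time to the hour, then combine it with both tracts.
    hour = trip["pickup_time"].replace(minute=0, second=0, microsecond=0)
    return (trip["pickup_tract"], trip["dropoff_tract"], hour)


def pickup_only_key(trip):
    # Same idea, but ignoring the drop-off tract.
    hour = trip["pickup_time"].replace(minute=0, second=0, microsecond=0)
    return (trip["pickup_tract"], hour)


# Hypothetical toy rows, not real taxi data.
trips = [
    {"pickup_tract": "A", "dropoff_tract": "B", "pickup_time": datetime(2015, 1, 4, 18, 5)},
    {"pickup_tract": "A", "dropoff_tract": "B", "pickup_time": datetime(2015, 1, 4, 18, 40)},
    {"pickup_tract": "A", "dropoff_tract": "C", "pickup_time": datetime(2015, 1, 4, 18, 10)},
    {"pickup_tract": "D", "dropoff_tract": "B", "pickup_time": datetime(2015, 1, 4, 19, 0)},
]

share_full = fraction_unique(trips, tract_hour_key)
share_pickup_only = fraction_unique(trips, pickup_only_key)
print(share_full, share_pickup_only)
```

Passing a key with fewer fields (pickup tract and hour only) shows how quickly
uniqueness drops when you know less about the trip.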

~~~
antimagic
So, you think your partner is cheating on you with their ex. You know they got
a taxi from your apartment last Sunday at 6pm, you look up the data, and sure
enough, the taxi went to the ex's address, and not home as your partner
claimed.

Or, you arrive at work late, claiming that you stopped off at a client's
before heading in to work, but your employer can now verify that you actually
got the taxi from your home address.

I'm assuming in these examples that you just need to know pick-up OR drop-off
- if you need both, then I agree with you that it's not much of a concern.

~~~
rahimnathwani
"if you need both, then I agree with you that it's not much of a concern"

You need both pick-up and drop-off to get a unique row.

Now that I think about your examples, though, perhaps the _absence_ of a
particular row could show you _didn't_ do something.

------
dopamean
I find this kind of analysis to be really awesome and I'd love to learn how to
do even a more basic version of it. Does anyone have some resources they can
point me to?

I'm actually working on a small project that has a much, much smaller dataset
than the NYC Taxi data but some similar attributes (geographic coordinates
mainly). I'd love to produce something like this with what I find (assuming I
can find anything interesting).

------
srean
I wonder how much one loses if the first and the last couple of miles are
fuzzed over. Such data would still be quite useful.
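
One way to read "fuzzed over" is snapping each pickup and drop-off point to the
centre of a coarse grid cell, so the exact address is lost but the rough area
survives. The sketch below assumes that interpretation; the 2-mile cell size,
the 69 miles-per-degree constant, and the function name are illustrative
choices, not anything from the dataset or the gist.

```python
import math

MILES_PER_DEG_LAT = 69.0  # rough approximation, good enough for a sketch


def coarsen(lat, lon, cell_miles=2.0):
    """Snap a point to the centre of a grid cell roughly cell_miles wide."""
    dlat = cell_miles / MILES_PER_DEG_LAT
    lat_c = (math.floor(lat / dlat) + 0.5) * dlat
    # Longitude degrees shrink with latitude, so size the cell from the
    # snapped latitude to keep the grid consistent within a row of cells.
    dlon = cell_miles / (MILES_PER_DEG_LAT * math.cos(math.radians(lat_c)))
    lon_c = (math.floor(lon / dlon) + 0.5) * dlon
    return lat_c, lon_c


# Hypothetical point in Manhattan.
fuzzed = coarsen(40.7505, -73.9934)
print(fuzzed)
```

Everything between the two fuzzed endpoints (fare, distance, duration) could be
left as-is, which is where most of the remaining usefulness would come from.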

------
dbpokorny
> uniquely identified by birthday, gender, and ZIP code

This is not correct; you need to say "full birthday" which includes the year,
otherwise the statement is nonsense.

~~~
nitrogen
I'd still bet that month/day, gender, and zip code are enough to uniquely
identify a very large number of people.

~~~
0xcde4c3db
I don't see how that would work unless you're talking about ZIP+4 and not just
a regular ZIP code. It would work for very sparsely-populated areas and
extremely small towns, but most residents will share a ZIP code with thousands
of others.
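
A back-of-the-envelope way to see this, assuming every combination is equally
likely and people are independent (both simplifications): the chance that a
given person shares their combination with nobody else among n residents is
(1 - 1/k)^(n - 1), where k is the number of possible combinations. The 5,000
residents per ZIP code and the 80 birth years below are illustrative guesses,
not measured figures.

```python
def expected_unique_share(n_people, n_combinations):
    """Chance that one person's combination is shared with nobody else,
    assuming combinations are equally likely and independent."""
    return (1 - 1 / n_combinations) ** (n_people - 1)


residents = 5000                   # hypothetical ZIP code population
month_day_gender = 366 * 2         # month/day of birth x two genders
full_dob_gender = 365 * 80 * 2     # assumes roughly 80 birth years

share_month_day = expected_unique_share(residents, month_day_gender)
share_full_dob = expected_unique_share(residents, full_dob_gender)
print(f"{share_month_day:.4f} {share_full_dob:.4f}")
```

Under these assumptions month/day alone leaves well under 1% of residents
unique, while adding the birth year pushes it to around 90%, which is why the
"full birthday" wording matters.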

