

FOILing NYC’s Taxi Trip Data - danso
http://chriswhong.com/open-data/foil_nyc_taxi/

======
ajju
The post is from March 18th. Did Chris end up hosting the data somewhere?

Edit: Talked to Chris on Facebook. He is in the process of uploading it to
someone who has volunteered to host, and I have offered to host on Summon.com.

If you can host/mirror this data please reply here.

~~~
ajju
Here are the torrents:

[http://chriswhong.com/wp-content/uploads/2014/06/nycTaxiTrip...](http://chriswhong.com/wp-content/uploads/2014/06/nycTaxiTripData2013.torrent)

[http://chriswhong.com/wp-content/uploads/2014/06/nycTaxiFare...](http://chriswhong.com/wp-content/uploads/2014/06/nycTaxiFareData2013.torrent)

~~~
aw3c2
Thank you!

------
kourt
_You’re not only required to provide a large enough hard drive, but it must be
“brand new, still in the box and unopened”, presumably for security reasons.
This requirement is a bit silly in my opinion, and probably prevents a lot of
would-be FOILers from getting this data,_

That's a completely sensible stipulation. It's imperfect like all security,
but it's a legitimate reason rather than just a barrier.

~~~
MichaelApproved
If it's for security, it would be theater. How hard would it be for someone to
open the package, put their spyware on it and repackage the drive? Even easier
are OEM drives that come in unsealed boxes.

I'm guessing it's more to make sure the drive is in working order and clean of
any data. This would limit liability for accidentally deleted data and broken
drives. If the drive doesn't work, it's still under warranty. It also saves
time on trying to make an old drive work.

A new drive equals zero hassle for them.

~~~
foobarian
It also stops Joe Schmoe from coming in with the same drive he's been using for
a few years, one he has no clue is filled with all sorts of malware and
viruses.

------
samirmenon
Wow, I'm impressed with the responsiveness. It could be better, but honestly,
if the other municipalities in New York state were up to at least this
standard, I'd be very happy. In my experience, however, FOIL requests are
often delayed, "forgotten", or fulfilled in ridiculous forms (on reams of
paper, in ancient data formats, etc.).

~~~
judk
Processing issues are usually only a problem if the information implicates the
government in criminal activity.

------
yaur
Am I the only one that sees some significant privacy issues with exact
pickup/drop off times and location being released? It seems like singling out
a single passenger's data (e.g. to/from home) would not be that difficult.

~~~
danso
Not many people take taxis from home to work every day, or even weekly. If you
are the type, then likely, you are someone who lives in Manhattan (getting to
work via cab in a borough is a tenuous situation) and in a dense enough area
where you are one of dozens/hundreds of people who could conceivably be
dropped off at your home spot (think of the density of high-rises).

~~~
yaur
The question is more inspired by someone I know who lives in Manhattan who has
a psycho ex. This data would answer the question (if he were tech savvy enough
to mine it) "Where does her new boyfriend live?" which is rather frightening
IMO.

~~~
danso
OK...but how would this psycho-ex track the new boyfriend down?

Presumably, the ex knows where the girlfriend lives...and I guess, he also
knows what the new boyfriend looks like? So he watches the apartment until the
BF leaves by taxi. The ex then notes the taxi's time of pickup. And then...

The ex waits a full month before calling up the TLC, buying a new hard drive,
transferring a couple of GB, and then doing the data analysis to find that
particular taxi that made a pickup within the vicinity of the girlfriend's
apartment, and finding where that taxi made a dropoff?

And then the ex goes to those coordinates and...then what? Barges into one of
the high-rises and knocks on every door until he finds the new boyfriend?

I think that if the psycho-ex were to act like a psycho, he probably will not
do it through this kind of data analysis.

~~~
yincrash
Not just a full month. Up to six months it seems. The response from TLC made
it sound like new data is only available twice a year.

------
jevinskie
Very cool article! I'm torn on the "bring your own hard drive" issue. In one
way it is very anachronistic given today's cloud technology but the flip side
is that the OP was dealt zero procedural roadblocks along the way. Nobody at
the city said "No." and they seemed helpful at every step. I'd tally that as a
Win in today's bureaucratic and overly secretive world.

I find myself wanting to make a FOI request to my city. I have seen tricked
out Parking Enforcement cars trolling the streets this year. They have license
plate reading cameras mounted along the car's perimeter. I want to know if
that information is stored, for how long, and who has access to it. Have any
law enforcement agencies queried the database?

I would appreciate all the pointers I can get for proceeding with a FOI
request. So far, I have been using MuckRock as my primary source of tutorial.

~~~
Maxious
> who has access to it

You can get the ANPR database under FOI:
[https://github.com/johnschrom/Minneapolis-ALPR-Data](https://github.com/johnschrom/Minneapolis-ALPR-Data)

------
leorocky
Is he allowed to post the data online? If so, maybe we just need to
collectively do the FOIL requests and upload the data to a community-managed
site where it can be made available to anyone.

~~~
iancarroll
Isn't that muckrock?

~~~
leorocky
Google doesn't know what mudrock is. Do you have a URL?

~~~
spleeyah
[https://www.muckrock.com/](https://www.muckrock.com/)

~~~
leorocky
Did you edit a typo or was I just blind? I searched for mudrock, not muckrock.
Thanks. :D

~~~
iancarroll
Yeah, I edited it. Sorry! :)

------
yincrash
The medallion and hack license fields appear to be one-way, unsalted hashes.
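
To show why a missing salt matters when the underlying IDs come from a small keyspace, here is a minimal sketch. It assumes the hash is MD5 (the thread does not name the function) and uses a hypothetical four-character ID pattern of digit, letter, digit, digit; neither detail is confirmed by the released data. Under those assumptions, every possible ID can be hashed up front and the "anonymized" value reversed with a dictionary lookup.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Hex MD5 digest of an ASCII string (MD5 is an assumption, see above)."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup():
    """Precompute hash -> plaintext for every ID in the hypothetical format.

    Hypothetical format: digit, uppercase letter, digit, digit (e.g. "5X55"),
    which gives only 10 * 26 * 10 * 10 = 26,000 candidates.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        plain = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(plain)] = plain
    return table


LOOKUP = build_lookup()


def unmask(hashed_id):
    """Return the plaintext ID for a hash, or None if it is not in the table."""
    return LOOKUP.get(hashed_id.lower())


if __name__ == "__main__":
    masked = md5_hex("5X55")
    print(masked, "->", unmask(masked))
```

A per-record random salt, or replacing each ID with a random token kept in a private mapping, would defeat this kind of precomputed table.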

~~~
mentat
That will make things interesting. I think people elsewhere in the thread
really aren't grasping the privacy implications.

------
brianjolney
Can we get this put on S3? Would love to play around with it.

~~~
tejaswiy
I think a torrent would be better. There will probably be lots of interest in
this, so S3 can get expensive.

~~~
IanCal
Useful tip: if you add "?torrent" to the end of the URL for something stored
in S3, you get a torrent. I think this only works for files < 5GB though.
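
As a rough sketch of that tip, the helper below just appends the `torrent` query to an object URL. The bucket and key names are made up for illustration, and whether S3 actually serves a torrent for a given object (size limit, permissions, current service support) is not something this code checks.

```python
from urllib.parse import urlsplit, urlunsplit


def s3_torrent_url(object_url):
    """Return the object URL with a bare "?torrent" query appended.

    Only handles URLs without an existing query string; anything else is
    rejected rather than guessing how the parameters should combine.
    """
    parts = urlsplit(object_url)
    if parts.query:
        raise ValueError("URL already has a query string")
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "torrent", parts.fragment))


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only.
    print(s3_torrent_url("https://example-bucket.s3.amazonaws.com/trip_data.csv"))
```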

~~~
mikeash
Any idea of the reason for the size limit? That's a pretty weird limit to put
on a torrent feature.

~~~
mentat
Reducing the incentive to host uncompressed DVDs / BluRay?

------
PanMan
Anybody got cool ideas for how to visualize such a dataset? I have a similar set
of data I collected, but haven't gotten beyond the "trips per day, length,
etc." basics. I feel there is something beyond the most basic visualisation, on
a more meta level, but am not sure what.

------
threeseed
Just a thought. You could do some interesting reporting on this data. What
about finding all trips to a particular address, e.g. a politician's house? Or
finding all taxis exiting a known crime scene.

~~~
josephlord
Abortion clinics, drug clinics, psychiatrists, cancer specialists...

Now, Manhattan may be dense enough that it might not leak too much personal
information, but the same granularity of location data in a suburban or rural
area may be very intrusive.

------
danso
Call me gobsmacked.

Since I've lived in NY, I've seen plenty of cool visualizations and
stories...about where pickups happen, time of day, volume, etc., and I've
periodically asked around, where does the data come from? Obviously I didn't
ask well enough...because if all it took was an old-fashioned public request
(and a brand new hard drive)...wow.

The trip data is interesting enough...but the _fare data_ is really
mind-blowing. Every time I get out of a cab, I wonder, "should I have tipped that
much?" The (crowd-based) answer is apparently not that hard to find...

~~~
mikegreco
There are 2 possibilities:

1. People aren't tipping.
2. Cab drivers aren't reporting their tips.

Being a native New Yorker, legality aside, I'd bet most _cash_ tips go
unreported. Reporting a tip makes that tip taxable, so there is a very strong
incentive to bury it if they think they can avoid trouble or suspicion.

~~~
yincrash
Then you can filter the data to show only CC transactions. There might be some
additional variance of CC vs cash tipping, but I think the overall trend will
still be there in just the CC transactions.
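
A minimal sketch of that filter, using only the standard library. The column names (`payment_type`, `fare_amount`, `tip_amount`) and the `CRD`/`CSH` codes are assumptions about the fare file's layout, and the sample rows are made up purely so the snippet runs on its own.

```python
import csv
import io
from statistics import median

# Made-up rows for illustration only; not taken from the real dataset.
SAMPLE = """payment_type,fare_amount,tip_amount
CRD,10.00,2.00
CSH,12.50,0.00
CRD,20.00,3.00
CRD,8.00,2.00
"""


def card_tip_rates(rows, card_code="CRD"):
    """Return tip/fare ratios for card-paid rows with a positive fare.

    `card_code` is the assumed marker for credit-card payments.
    """
    rates = []
    for row in rows:
        fare = float(row["fare_amount"])
        if row["payment_type"] == card_code and fare > 0:
            rates.append(float(row["tip_amount"]) / fare)
    return rates


rates = card_tip_rates(csv.DictReader(io.StringIO(SAMPLE)))

if __name__ == "__main__":
    print(f"{len(rates)} card trips, median tip rate {median(rates):.0%}")
```

For the real file, the `io.StringIO(SAMPLE)` argument would be swapped for an open file handle; the median is used rather than the mean so a few very large tips do not dominate.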

------
Xorlev
I'm curious how large the data is after gzipping it.

