NYC Taxi Trips Data from 2013 (andresmh.com)
72 points by Anon84 on June 18, 2014 | 14 comments

The data fields kind of confuse me. I assumed there would be one medallion per operating cab, right? And I assume that one cab, i.e. one medallion, can make only one pickup in a given second, so pickup_datetime and medallion together should form a unique key, right?

But I get dupes. In the January 2013 dataset, medallion "DD336C4ADA65CCBD2284F63BF348A4F0" makes 3 pickups at 7:09 AM on Jan 13, 2013. Each dropoff time is noted as the exact same second as its pickup, yet the coordinates differ from row to row. There are all kinds of wonky real-life situations (a driver accidentally starting the fare and shutting it off during a trip, perhaps), but this seems to indicate an error in how the data is recorded: even if the machine is doing something that breaks business rules, the timestamp should still be consistent with real time, right? Anyway, it's something to be wary of when doing analysis.

It also looks like the hack licenses are all different. I'm not sure how that works, but it doesn't seem to be simply a case of 3 different drivers being registered to the same medallion...

The three data rows in question:

    DD336C4ADA65CCBD2284F63BF348A4F0  689153C038F51E19A47257E15DF838BD  VTS 1   2013-01-13 07:09:00 2013-01-13 07:09:00 1 0 0.000000  -73.937492  40.758327 -73.937492  40.758327
    DD336C4ADA65CCBD2284F63BF348A4F0  0200F6738B90178B8B75EEF9E3C1988E  VTS 1   2013-01-13 07:09:00 2013-01-13 07:09:00 1 0 0.050000  -73.973656  40.738422 -73.973656  40.738422
    DD336C4ADA65CCBD2284F63BF348A4F0  B8C7594C95BB86A228C785305287583B  VTS 1   2013-01-13 07:09:00 2013-01-13 07:09:00 1 0 0.900000  -73.979225  40.747093 -73.929260  40.850357
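
A minimal sketch of the uniqueness check described above, in case anyone wants to reproduce it. It assumes the file is a CSV with medallion and pickup_datetime columns; the hack_license column name and the trimmed-down sample are assumptions for illustration, and only the three rows quoted above are used as input.

```python
import csv
import io
from collections import Counter

# The three rows quoted above, reduced to the columns that matter here.
# The hack_license column name is assumed; the other two come from the comment.
SAMPLE = """medallion,hack_license,pickup_datetime
DD336C4ADA65CCBD2284F63BF348A4F0,689153C038F51E19A47257E15DF838BD,2013-01-13 07:09:00
DD336C4ADA65CCBD2284F63BF348A4F0,0200F6738B90178B8B75EEF9E3C1988E,2013-01-13 07:09:00
DD336C4ADA65CCBD2284F63BF348A4F0,B8C7594C95BB86A228C785305287583B,2013-01-13 07:09:00
"""


def duplicate_keys(rows, key_fields=("medallion", "pickup_datetime")):
    """Return {key: count} for every key that occurs more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


dupes = duplicate_keys(csv.DictReader(io.StringIO(SAMPLE)))
for key, n in dupes.items():
    print(key, n)
```

For the real monthly file, the same function could be fed a csv.DictReader over the opened file instead of the inline sample.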

Heh, this is far from the worst problem in this dataset. Wait until you find the clusters of lat-long coordinates that are off by 1 degree in either direction (or 0.1 degrees).

Then there are some pickups clustered around weird areas of NJ and CT (NYC cabs are not supposed to pick up passengers outside NY). Lots of weirdness to sink your teeth into.
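
One rough way to flag the coordinate problems mentioned above is a bounding-box test plus a check for whole-degree offsets. This is only a sketch: the box limits are a loose assumption rather than an official boundary, the function names are made up, and a box this coarse will not catch 0.1-degree errors or pickups just across the river in NJ (that would need real borough polygons).

```python
# Rough bounding box around the five boroughs. These limits are an assumption
# chosen for illustration, not an official boundary.
LON_MIN, LON_MAX = -74.3, -73.6
LAT_MIN, LAT_MAX = 40.4, 41.0


def in_box(lon, lat):
    """True if the point falls inside the assumed bounding box."""
    return LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX


def classify(lon, lat):
    """Label a point 'ok', 'shifted' or 'outside'.

    'shifted' means the point is outside the box, but moving it by exactly
    one degree in longitude and/or latitude puts it inside.
    """
    if in_box(lon, lat):
        return "ok"
    for dlon in (-1.0, 0.0, 1.0):
        for dlat in (-1.0, 0.0, 1.0):
            if in_box(lon + dlon, lat + dlat):
                return "shifted"
    return "outside"


# First point is a pickup from the rows quoted earlier in the thread; the
# second is the same point with one degree added to the longitude by hand.
print(classify(-73.979225, 40.747093))
print(classify(-72.979225, 40.747093))
```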

But it's a really fun dataset. You can see Hurricane Sandy, big city events, construction, etc.

I think a medallion represents a group of taxi cabs. Each cab has its own hack license.

One medallion per cab. One license per cab driver.

Just to repeat my comment from a previous post: the hack license and medallion fields appear to be unsalted one-way hashes.
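
To illustrate why an unsalted hash over a small ID space offers little protection, here is a sketch that precomputes every hash and looks values up in reverse. Both the hash function (MD5, which the 32 hex digits in the rows above are consistent with) and the ID format (digit, letter, digit, digit) are assumptions made for the example, not something confirmed in this thread, and "7B42" is a made-up ID.

```python
import hashlib
import itertools
import string


def md5_hex(text):
    """Uppercase hex MD5 of an ASCII string (hash choice is an assumption)."""
    return hashlib.md5(text.encode("ascii")).hexdigest().upper()


def build_lookup():
    """Map hash -> plaintext for every ID of the assumed form: digit, letter, digit, digit."""
    table = {}
    for d1, letter, d2, d3 in itertools.product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        plain = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(plain)] = plain
    return table


lookup = build_lookup()

# Made-up ID, hashed here as a stand-in for a value published in the dataset.
hashed = md5_hex("7B42")
print(len(lookup), "candidates;", hashed, "->", lookup[hashed])
```

With a salt that is kept secret, the same table could not be built in advance, which is the point of the comment above.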

Will be interesting to see the comparison to 2014, when Boro Taxis[0] were introduced to serve the outer boroughs.

[0] http://www.nyc.gov/html/tlc/html/passenger/shl_passenger.sht...

If you clicked through to Chris Whong's site and wondered about the phrase "hack license", this Wikipedia article may be of interest: http://en.wikipedia.org/wiki/Hackney_carriage

I uploaded these datasets to Google BigQuery in case anybody prefers to use that to go data-diving: https://bigquery.cloud.google.com/table/833682135931:nyctaxi... https://bigquery.cloud.google.com/table/833682135931:nyctaxi...

This is great. Is there any way you could use data types other than string? The way it's currently set up makes it hard to analyze.

Sure, it wouldn't be hard to re-load the data in a new dataset/table with more accurate types. Strings are just the default. I just wasn't sure how clean the data was and didn't want to debug bad data before loading. I was lazy basically.

But if you wanted to do it, it's pretty straightforward.

In the meantime I haven't found it too hard to just use INTEGER(field) and FLOAT(field) wherever I need to.
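
For anyone working from the raw CSVs rather than BigQuery, the same cast-on-read idea might look like the sketch below. The field names in CASTS are assumptions about the table layout, and bad values become None rather than raising, since the data may not be clean.

```python
def to_int(value):
    """Parse an integer, returning None for empty or malformed input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def to_float(value):
    """Parse a float, returning None for empty or malformed input."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Assumed field names; extend this with whichever columns need typing.
CASTS = {"passenger_count": to_int, "trip_distance": to_float}


def typed(row):
    """Return a copy of row with known numeric fields cast; other fields stay as they are."""
    return {k: (CASTS[k](v) if k in CASTS else v) for k, v in row.items()}


print(typed({"medallion": "DD33", "passenger_count": "1", "trip_distance": "0.900000"}))
```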

A few interesting facts from the July data:

22% of rides are less than 1 mile long

85% of rides have 1 passenger

More at: http://www.twikstik.com/blog/2
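
Percentages like the two above can be computed with a small helper such as this sketch. The trip_distance and passenger_count field names are assumed, and the four rows are made up purely to exercise the function; they are not taken from the July file.

```python
def share(rows, predicate):
    """Fraction of rows for which predicate is true; 0.0 for an empty input."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for r in rows if predicate(r)) / len(rows)


# Made-up rows for illustration only.
trips = [
    {"trip_distance": "0.9", "passenger_count": "1"},
    {"trip_distance": "2.4", "passenger_count": "1"},
    {"trip_distance": "5.0", "passenger_count": "3"},
    {"trip_distance": "0.5", "passenger_count": "1"},
]

short = share(trips, lambda r: float(r["trip_distance"]) < 1.0)
solo = share(trips, lambda r: int(r["passenger_count"]) == 1)
print(f"{short:.0%} under 1 mile, {solo:.0%} with 1 passenger")
```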

I was playing around with Google Maps and the data: you can see a heatmap of taxi pickups by day for the first 6 days of July. It could also be useful for showing where Uber drivers can find passengers, based on last year's data.

Passenger pickups for the 4th of July: http://akuchlous.github.io/NYC_CAB_ANALYTICS/July/4/

Passenger pickup code: https://github.com/akuchlous/NYC_CAB_ANALYTICS

checksums please :)
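
Until checksums are published, anyone who has downloaded the files could compute and compare their own. A minimal sketch, assuming SHA-256 is acceptable (the thread does not say which algorithm is wanted); the function name and chunk size are arbitrary choices:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```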
