But I get dupes. In the January 2013 dataset, medallion "DD336C4ADA65CCBD2284F63BF348A4F0" makes 3 pickups at 7:09 AM on Jan 13, 2013. Each dropoff time is recorded as the exact same second as the pickup. Yet the pickup and dropoff coordinates differ from row to row. There are all kinds of wonky real-life situations (a driver accidentally starting the fare and shutting it off mid-trip, perhaps), but this looks like an error in how the data is recorded: even if the machine does something that breaks business rules, the timestamps should still be consistent with real time, right? Anyway, it's something to be wary of when doing analysis.
The hack licenses are also all different. I'm not sure how that works, but it doesn't seem to be simply a case of 3 different drivers being registered to the same medallion...
The three data rows in question:
DD336C4ADA65CCBD2284F63BF348A4F0 689153C038F51E19A47257E15DF838BD VTS 1 2013-01-13 07:09:00 2013-01-13 07:09:00 1 0 0.000000 -73.937492 40.758327 -73.937492 40.758327
DD336C4ADA65CCBD2284F63BF348A4F0 0200F6738B90178B8B75EEF9E3C1988E VTS 1 2013-01-13 07:09:00 2013-01-13 07:09:00 1 0 0.050000 -73.973656 40.738422 -73.973656 40.738422
DD336C4ADA65CCBD2284F63BF348A4F0 B8C7594C95BB86A228C785305287583B VTS 1 2013-01-13 07:09:00 2013-01-13 07:09:00 1 0 0.900000 -73.979225 40.747093 -73.929260 40.850357
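If you want to hunt for more cases like this, here is a minimal Python sketch of one way to do it: group rows by medallion and pickup time, then keep any group with more than one row. The field names (medallion, hack_license, pickup, dropoff) are labels I made up for this example, not necessarily the dataset's column headers, and the input is just the three rows quoted above.

```python
from collections import defaultdict

# The three rows quoted above, reduced to the fields that matter here.
# Field names are assumed labels, not necessarily the real column headers.
trips = [
    {"medallion": "DD336C4ADA65CCBD2284F63BF348A4F0",
     "hack_license": "689153C038F51E19A47257E15DF838BD",
     "pickup": "2013-01-13 07:09:00", "dropoff": "2013-01-13 07:09:00"},
    {"medallion": "DD336C4ADA65CCBD2284F63BF348A4F0",
     "hack_license": "0200F6738B90178B8B75EEF9E3C1988E",
     "pickup": "2013-01-13 07:09:00", "dropoff": "2013-01-13 07:09:00"},
    {"medallion": "DD336C4ADA65CCBD2284F63BF348A4F0",
     "hack_license": "B8C7594C95BB86A228C785305287583B",
     "pickup": "2013-01-13 07:09:00", "dropoff": "2013-01-13 07:09:00"},
]


def same_second_pickups(rows):
    """Group rows by (medallion, pickup time); keep groups with more than one row."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["medallion"], row["pickup"])].append(row)
    return {key: grp for key, grp in groups.items() if len(grp) > 1}


suspects = same_second_pickups(trips)
for (medallion, pickup), grp in suspects.items():
    # Count distinct hack licenses and zero-length trips inside each group.
    licenses = {r["hack_license"] for r in grp}
    zero_length = sum(1 for r in grp if r["pickup"] == r["dropoff"])
    print(medallion, pickup, len(grp), "rows,",
          len(licenses), "hack licenses,", zero_length, "zero-length trips")
```

On the full file you would feed it rows from a CSV reader instead of a hard-coded list. Holding every group in memory is fine for a sketch but may be too much for a whole month of trips.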
Then there are some pickups clustered around weird areas of NJ and CT (NYC cabs are not supposed to pick up passengers outside NY). Lots of weirdness to sink your teeth into.
But it's a really fun dataset. You can see Hurricane Sandy, big city events, construction, etc.
But if you wanted to do it, it's pretty straightforward.
In the meantime I haven't found it to be too hard to just INTEGER(field) and FLOAT(field) wherever I need to.
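For anyone working with the raw files outside a query engine, the same cast-where-needed idea looks roughly like this in Python. This is only an analogue of INTEGER(field) and FLOAT(field), not the same syntax, and the column names are assumed labels for the values in one of the rows quoted earlier in the thread.

```python
# Raw values as they might come out of a CSV reader: everything is a string.
# Column names are assumed labels, not necessarily the real headers.
raw = {
    "passenger_count": "1",
    "trip_time_in_secs": "0",
    "trip_distance": "0.900000",
}

# Map each column to the converter it needs; cast only the fields you use.
casts = {
    "passenger_count": int,
    "trip_time_in_secs": int,
    "trip_distance": float,
}

parsed = {name: casts[name](value) for name, value in raw.items()}
print(parsed)
```

Note that int() and float() raise ValueError on empty or malformed strings, so dirty rows need a try/except or a filter first.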
22% of rides are less than 1 mile long
85% of rides have 1 passenger
More at: http://www.twikstik.com/blog/2
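Shares like the two above are simple to compute once the fields are numeric. A minimal Python sketch, assuming each ride is a (passenger_count, trip_distance) pair with distance in miles; the sample is just the three rows quoted earlier in the thread, so it will not reproduce the 22% and 85% figures, and this is not necessarily how the blog post computed them.

```python
# (passenger_count, trip_distance_in_miles) taken from the three quoted rows.
# This tiny sample only illustrates the calculation, not the real shares.
rides = [(1, 0.0), (1, 0.05), (1, 0.9)]


def percent(rows, predicate):
    """Percentage of rows for which predicate(row) is true."""
    if not rows:
        raise ValueError("no rows to summarize")
    return 100.0 * sum(1 for row in rows if predicate(row)) / len(rows)


short_share = percent(rides, lambda r: r[1] < 1.0)  # rides under 1 mile
solo_share = percent(rides, lambda r: r[0] == 1)    # rides with 1 passenger
print(short_share, solo_share)
```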
Passenger pickups for the 4th of July: http://akuchlous.github.io/NYC_CAB_ANALYTICS/July/4/
Passenger pickup code: https://github.com/akuchlous/NYC_CAB_ANALYTICS