Hacker News
NYC's Taxi and Limousine Commission Trip Record Data for 2014-2015 (nyc.gov)
68 points by bko 599 days ago | 44 comments

I downloaded the first year dataset and didn't get very far in my analysis... but this was the first result, which was kind of pretty:


Love this. I took a few minutes and converted your image into an Illustrator/vector file. Not 100% true to your original, but pretty good. Infinitely scalable so people can print it if they want.

Preview in black and white: http://i.imgur.com/OhemZK2.jpg

Bonus! preview in Uber blue: http://i.imgur.com/gvFDtWN.jpg

Source AI files: https://www.dropbox.com/s/zw1oqdwhaqhqb6u/NYC%20Taxi%20Data....

This is quite beautiful! I was thinking of doing something similar when I saw the post but well... Could you give some info on your stack?

I would love to buy a wall print like this.

This really is beautiful and I agree, if you do not want to sell it, is there any chance you could share a higher res image?

The trouble with making a higher res image is that when you make the lat/lon buckets smaller, you have fewer samples in each one, so the image gets noisier. For the best possible image you'd want to download all the years of data.
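To put a rough number on that trade-off, here is a minimal sketch. It assumes the count in each bucket behaves like a Poisson variable, which is an assumption added here and not something checked against the data. The function names are made up for illustration.

```python
import math

def relative_noise(mean_count):
    """Relative fluctuation of a Poisson count: sqrt(n) / n = 1 / sqrt(n)."""
    return 1.0 / math.sqrt(mean_count)

def mean_count_after_refine(mean_count, factor):
    """Mean samples per cell after shrinking the cell side by `factor`
    in both latitude and longitude (so each cell covers 1/factor^2 the area)."""
    return mean_count / (factor * factor)

# Doubling the resolution quarters the samples per cell.
coarse = 100.0
fine = mean_count_after_refine(coarse, 2)
print(relative_noise(coarse), relative_noise(fine))
```

Under that assumption, doubling the resolution doubles the relative noise, and you would need roughly four times as much data to get back to where you started.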

The process of making it was quite simple. I zeroed a 2D array of integers, then took all the pickup/dropoff points and incremented the nearest cell. The pixel values are based on the logarithm of the counts, since otherwise everything outside midtown would be pretty much black.
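The steps above can be sketched roughly as follows. This is not the original C++, just a small Python illustration of the same idea. The bounding box, grid size, the 0..255 output range, and the use of log(1 + count) to avoid log(0) are all assumptions made for the example.

```python
import math

def density_grid(points, lat_min, lat_max, lon_min, lon_max, rows, cols):
    """Count (lat, lon) points per cell; points outside the box are skipped."""
    grid = [[0] * cols for _ in range(rows)]
    for lat, lon in points:
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            continue
        # Row 0 is the northern edge so the image is not upside down.
        r = int((lat_max - lat) / (lat_max - lat_min) * rows)
        c = int((lon - lon_min) / (lon_max - lon_min) * cols)
        # Clamp in case a boundary value or rounding lands on the far edge.
        grid[min(r, rows - 1)][min(c, cols - 1)] += 1
    return grid

def log_pixels(grid):
    """Map counts to 0..255 using log(1 + count) so sparse cells stay visible."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    scale = 255 / math.log1p(peak)
    return [[round(math.log1p(n) * scale) for n in row] for row in grid]

# Toy input: one busy cell and one cell with a single pickup.
points = [(40.75, -73.99)] * 99 + [(40.65, -73.78)]
grid = density_grid(points, 40.6, 40.8, -74.0, -73.7, rows=2, cols=3)
pixels = log_pixels(grid)
```

With a linear scale the single-pickup cell would be about 3 out of 255 next to the busy one. With the log scale it stays clearly visible, which is the effect described for areas outside midtown.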

There are some artifacts, like the thin vertical line down the East River. I think that was because of how the data was rounded, i.e. the number of unique longitude values that map to a certain image column.
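Here is a toy illustration of how rounding alone could produce column artifacts. It assumes the longitudes sit on a fixed rounding step and that the image column width is not a whole multiple of that step. The numbers are invented for the example and are not taken from the real data.

```python
from collections import Counter

def values_per_column(n_values, width_num, width_den):
    """Count how many distinct rounded longitudes fall in each image column.

    Longitudes are k = 0, 1, 2, ... in units of the rounding step.
    Column width is width_num / width_den steps (integers keep the math exact).
    """
    counts = Counter((k * width_den) // width_num for k in range(n_values))
    return [counts[c] for c in range(max(counts) + 1)]

# Columns 2.5 rounding steps wide: alternate columns catch 3 or 2 distinct values.
print(values_per_column(10, 5, 2))
```

Even with perfectly uniform traffic, the columns that catch three distinct values would come out about 50% brighter than their neighbors, so a mismatch like this shows up as vertical striping.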

I wrote this myself with a few hundred lines of C++, though I'm sure there's GIS software out there that will do all this for you with a few clicks.

Can you explain why there is a line going to the airport? Shouldn't the airport be more of an island of light (since I imagine most people aren't getting picked up / dropped off a half mile from the airport)?

Also did you overlay it onto a map? How did you get the angled effect if it's just a grid?

> Can you explain why there is a line going to the airport?

I assume it is because there is a fixed fare to/from JFK so drivers have little incentive to start/stop the meter at the exact pickup/dropoff location.

> Also did you overlay it onto a map?

No. If taxis did not pick up or drop off people on some street, that street does not appear. For example there is an area downtown where there are streets but they have had security barriers since 9/11 thus no taxis.

> How did you get the angled effect if it's just a grid?

None of NYC's grids are exactly north/south/east/west aligned.

I was thinking of overlaying it on a map, but since it's a small cross section being referenced, there is not much distortion (the world appears flat), which makes it quite a clever hack to plot on the grid!
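A quick way to check the "not much distortion" claim: the ground length of one degree of longitude shrinks with cos(latitude), so what matters is how much that factor changes across the city. The latitude range 40.5 to 40.9 below is an assumed rough extent of NYC, not something taken from the dataset.

```python
import math

def lon_scale(lat_deg):
    """Length of one degree of longitude relative to one degree of latitude
    (spherical-earth approximation)."""
    return math.cos(math.radians(lat_deg))

def scale_variation(lat_south, lat_north):
    """Fractional change in east-west scale between the two latitudes."""
    south, north = lon_scale(lat_south), lon_scale(lat_north)
    return (south - north) / south

# Assumed rough latitude range for the city.
variation = scale_variation(40.5, 40.9)
print(f"east-west scale changes by {variation:.2%} across the range")
```

The change comes out well under 1%, so treating lat/lon as a flat grid is reasonable here. Note that if the cells are square in degrees, the whole image is still stretched east-west by a roughly constant factor of about 1 / cos(40.7°), which can be corrected by resizing.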

If you are interested in what the 2013 data looks like at full resolution, I have a web map of it (https://www.mapbox.com/blog/vector-density/). Haven't updated with the new data yet.

Where would one start learning methods of data vis like this?

That is just each point plotted on a map. So, check out software that can make maps? QGIS is nice.

You don't even need dedicated mapping programs. It's just latitudes and longitudes, so any 2D plotting program would be sufficient (as long as the data can fit into memory!).

I made this map a while ago for Instagram photos using R/ggplot2: http://i.imgur.com/IvGox1f.png

Although the country boundaries aren't strictly necessary. Since there appears to be a demand for generating these maps, I'll work on a tutorial for this NYC data set.

you didn't have to install anything but ggplot2 to build that. wanna share the source? :)

You murdered Greenland :D

Apparently, they didn't plot it on the map but on a grid. The density of data gives the well defined contour; a clever hack!

It would look 99% the same if you just plotted the coordinates with some (smartly chosen) transparency
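To see why transparency gives a similar look, here is a small sketch under the standard "over" compositing model, assuming every point is drawn with the same opacity. The alpha value is an arbitrary choice for the example.

```python
def stacked_opacity(alpha, n_points):
    """Combined opacity of n_points overlapping marks, each with opacity alpha.

    Each new mark covers a fraction alpha of whatever background is still visible,
    so the visible background after n marks is (1 - alpha) ** n.
    """
    return 1.0 - (1.0 - alpha) ** n_points

for n in (1, 10, 100, 1000):
    print(n, round(stacked_opacity(0.01, n), 3))
```

Like the logarithm, this curve compresses large counts, so both busy and quiet streets stay visible. Unlike the logarithm it saturates at full opacity, so the busiest cells become indistinguishable from each other if alpha is set too high.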

Beautiful! Love that LaGuardia and JFK glow.

This would be even more beautiful if animated.

very nice - how did you create it?


- 2013 data as FOILed by Chris Whong http://chriswhong.com/open-data/foil_nyc_taxi/

- 2008 to 2013 data as FOILed by me, on BigQuery https://bigquery.cloud.google.com/table/alien-climber-851:ny...

...note that after Whong's request, the TLC redacted the medallion numbers, making it virtually impossible to analyze trips by cabbie.

> 2008 to 2013 data as FOILed by me

Is that data set available for people who do not use Google Accounts as well? Maybe you could upload it to https://archive.org.

Is Google charging you for hosting that data in BigQuery?

I think so...? There was a free period and now I'm getting charged about $1.50 a month (though on my credit card, it's billed as Google AdWords...). However, I just checked the actual invoice and I don't see a line item for the taxi data, just for the 40GB of other data that I have online. The taxi data is about 90GB.

Hosting data on BigQuery isn't free. (only 1TB of processing is free per month)

OK, just so I understand (I am new to BigQuery): even though I am not hosting this dataset (and thus not being charged for storage), I can run up to 1TB worth of queries per month for free using BigQuery?


I work on BigQuery.

Whoever owns Storage, NYC TLC in this case, pays a minimal fee of 2 cents per GB per month for storage. This includes multiple factors of replication/durability.

Whoever is doing the querying - this can be you - pays 5 dollars per TB queried. First 1TB per month is free.
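As a worked example using only the prices quoted in this thread (2 cents per GB per month for storage, 5 dollars per TB queried, first 1TB per month free), here is a small sketch. The function names are made up, and the prices are the ones stated here, which may not match current pricing.

```python
STORAGE_USD_PER_GB_MONTH = 0.02   # as quoted in this thread
QUERY_USD_PER_TB = 5.0            # as quoted in this thread
FREE_QUERY_TB_PER_MONTH = 1.0     # as quoted in this thread

def monthly_storage_cost(size_gb):
    """Paid by whoever owns the stored table."""
    return size_gb * STORAGE_USD_PER_GB_MONTH

def monthly_query_cost(tb_scanned):
    """Paid by whoever runs the queries; only usage above the free tier is billed."""
    return max(0.0, tb_scanned - FREE_QUERY_TB_PER_MONTH) * QUERY_USD_PER_TB

# The roughly 90GB taxi table mentioned earlier in the thread.
print(monthly_storage_cost(90))
print(monthly_query_cost(0.5), monthly_query_cost(3.0))
```

At these rates, storing the roughly 90GB taxi table would cost its owner about $1.80 a month, and someone who scans less than 1TB a month pays nothing to query it.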

I'm not able to review the data yet but I wonder if TLC took in any of the feedback to better anonymize the data. [0]

Also, I would love to see an analysis of whether traffic is actually getting worse compared to last year. Mayor de Blasio made this claim as a reason to cap Uber rides.

[0] http://research.neustar.biz/2014/09/15/riding-with-the-stars...

Also: https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a...

Downloading one of the CSVs to check it out. Each one is about 2GB.

EDIT: Per the BigQuery table schema, medallion is no longer a field.

Local governments should be demanding detailed data from companies like Uber in exchange for legalization, even raw data on locations of available cars. They have the leverage to get it now but they're wasting the chance. This data could eventually be used to avoid a true monopoly.


We already have enough hoops for companies to jump through to prevent competition, so we certainly do not need any more. New York is a perfect example where regulation has so contorted the market that you can make money selling your permission to run the business, to the point it might be more profitable than running the business.

From medallions to food cart and restaurant permits, regulation is keeping competition out while rewarding those who merely sit on permits and rent out their use. It is nearly identical to how badly patents are managed and rewarded.

One of the reasons that taxis are licensed is that they're supposed to pick up everyone and anyone. Uber and Lyft don't seem to have that problem with race (AFAIK) but there are two other situations where people have trouble:

(1) small children, where you need a car seat (or two)

(2) people with disabilities - service animal, wheelchair, etc.

Analyzing the data would hopefully show whether people had to wait for 2 hours.

Also, as a substitute for race, it might be possible to see if certain areas are under-served or not served at all. Perhaps drivers are avoiding picking up in Harlem.

Anecdata: As a wheelchair user and frequent traveler, Uber has never ever failed me, but hailing a cab was nearly impossible (was, because at this point even if uber charged twice what a cab does, I'd still use them every time for the certainty that they'll pick me up).

"One of the reasons that taxis are licensed is that they're supposed to pick up everyone and anyone."

Which is clearly violated thousands of times per day. Sure you can call 311 in NYC but very few people do.

The difference is, the city has the authority to crack down on it. I was just in NYC, and cabs are running a video in the back seat informing passengers that it's illegal for cabs not to pick you up for race/disability. The city doesn't legally have that kind of leverage over Uber.

Leverage that is rarely and reluctantly used, and that no one reasonably expects to be used, and that does not translate into any observable consequences in the lives of black people attempting to get a ride.

I've once heard a quip that "You have no constitutional right to eat at a restaurant, but you do have one to a speedy trial -- but which one feels more secure?"

The city has "leverage" to stamp out discrimination against black people wanting a cab ride, but "no leverage" to stamp out discrimination against black people on Uber -- yet which one will more reliably secure a ride?

I watched that video, too, right after the first cab drove off when he found out where I was going to.

I could report him, but I'm a busy adult with kids of my own to mind, and I don't have time to try to parent someone else's problem.

So we shouldn't try to improve it?

NYC did exactly this. The city threatened all rideshare companies with a cap on their driver growth rate, which Uber successfully "defeated" by giving up access to detailed data to the city.

Uber hailed this as a victory - but the way I see it, when your victory means "maintenance of the status quo while conceding your data to a third party" it probably wasn't actually a victory.

This is a disturbing dataset, in that it seems straightforward to extract personal data from it. Consider the information revealed by a taxi ride between a personal address and a workplace, a sensitive location, or another personal address...

This kind of data was previously an issue with Uber, too.
