
Using GPUs to Speed Through the 1.2B Record Taxi Dataset - jtsymonds
https://www.mapd.com/blog/2016/10/13/speeding-through-nyc-the-billion-row-nyc-taxi-dataset/
======
sxp
One of the big data points missing from this article is the price. Unless you
need features specific to the high end cards such as unlocked 64bit or 16bit
performance or antialiased lines, the consumer cards have much higher
performance per dollar [1]. It would be really interesting if they compared
their 8 K80s ~(8 * ($4K for 8TFLOPS)) against a set of GTX 1080s ~($650 for
8TFLOPS)

[1]
[https://www.youtube.com/watch?v=LC_sx6A5Wko](https://www.youtube.com/watch?v=LC_sx6A5Wko)
&
[http://www.videocardbenchmark.net/gpu.php?gpu=Tesla+C2050](http://www.videocardbenchmark.net/gpu.php?gpu=Tesla+C2050)
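
To make the comparison above concrete, here is a rough sketch of the dollars-per-TFLOPS arithmetic. It uses only the figures quoted in this comment (about $4K per K80 and about $650 per GTX 1080, each for roughly 8 TFLOPS). These are the commenter's approximations, not measured values, and the function name is made up for illustration.

```python
# Approximate figures quoted in the comment above; not verified prices or benchmarks.
K80_PRICE_USD = 4000
K80_TFLOPS = 8
GTX1080_PRICE_USD = 650
GTX1080_TFLOPS = 8
NUM_K80S = 8


def dollars_per_tflops(price_usd: float, tflops: float) -> float:
    """Hardware cost per TFLOPS of quoted peak throughput."""
    return price_usd / tflops


k80_cost = dollars_per_tflops(K80_PRICE_USD, K80_TFLOPS)
gtx_cost = dollars_per_tflops(GTX1080_PRICE_USD, GTX1080_TFLOPS)
cluster_cost = NUM_K80S * K80_PRICE_USD
price_ratio = k80_cost / gtx_cost

print(f"K80:      ${k80_cost:.2f} per TFLOPS")
print(f"GTX 1080: ${gtx_cost:.2f} per TFLOPS")
print(f"8 x K80 cluster: ${cluster_cost}")
print(f"K80 costs about {price_ratio:.1f}x more per TFLOPS")
```

This only compares list price against quoted peak throughput. It ignores memory size, ECC, double-precision performance, and power, which are the features the comment says might justify the high-end cards.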

~~~
jtsymonds
Hi SXP, thanks for your comment. You might want to check out Mark
Litwintschik's posts (independent blogger who has benchmarked this dataset
across many different databases) for performance on GeForce GTX TITAN X's. 4 x
GeForce GTX TITAN X: [http://tech.marksblogg.com/billion-nyc-taxi-rides-
nvidia-tit...](http://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-
tit..). 8 x K80s: [http://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-
tes...](http://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-tes..). He
has additional posts on MapD on Pascal Titan X's and AWS as well. In full
disclosure I work at MapD...

~~~
sp8962
Nice demo, but it would be even nicer if you could get the blooper fixed:
"Mapbox's Openstreetmap"

MapBox is an active and respected participant in the OpenStreetMap project and
uses our data in some of its products, but that is it.

~~~
jtsymonds
Blooper Fixed :)

------
solatic
> we use the GPU to render the image, compress it to a .png (about 100KB) and
> send it to the browser as a tile. This allows for lightning fast rendering
> and the perception by the user that all of this data is actually in their
> browser.

With the enormous caveat that you need to have a low latency to their server
to get this illusion of client-side rendering. Considering that they only have
this cluster of K80's close to them, geographically, and not a number of
clusters spread out globally, this isn't a usable example in much of the
world.

Now, I don't expect them to roll out K80 clusters world-wide just for the sake
of a demo, but it's still pretty important.

~~~
Asooka
I'm in Eastern Europe and it loads up in like half a second. Much faster than
I'd expect the browser to process queries on a 1.2bln dataset and without
taking up untold gigabytes of memory.

------
fogleman
GPUs sound cool, but here's what I did with the same dataset, one using flat
files and another using cassandra:

[https://www.michaelfogleman.com/static/yellow/](https://www.michaelfogleman.com/static/yellow/)

[https://www.michaelfogleman.com/static/density/](https://www.michaelfogleman.com/static/density/)

~~~
tmostak
This is 77M rows, not the full 1.2B dataset shown in the MapD demo (with 60
variables). It also looks like the map is pre-rendered as opposed to being
dynamically rendered with filters applied.

Pretty cool but a different animal.

------
allengeorge
The link to the demo is broken in the blog post. It's actually:
[https://www.mapd.com/demos/taxis/](https://www.mapd.com/demos/taxis/)

------
e9e
They left their source map open. Interesting tech choices:

\- React / Redux / mapbox-gl

I always look at the data table implementation to see how far people go. And
here they made their own implementation based on d3.

Here's the sources for those curious: [https://github.com/d8d/mapd-
sources/tree/master/out](https://github.com/d8d/mapd-sources/tree/master/out)

~~~
zyang
d3/dc.js/crossfilter to be precise. Having worked on something similar,
I found dc.js to be redundant if you already use Redux. It's much cleaner to
use a lighter-weight charting library with crossfilter.

~~~
e9e
Thanks for the tip, I found this table/d3 implementation but it looks like they
use it only for the grouping table: [https://github.com/d8d/mapd-
sources/blob/master/out/home/jen...](https://github.com/d8d/mapd-
sources/blob/master/out/home/jenkins-slave/workspace/mapd2-frontend-
release/libraries/mapdc/overrides/src/mapd-table.js)

------
Twirrim
Is this the data set?
[http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtm...](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml)

Why wouldn't they be hosting this in compressed form? A quick shot through
pigz has it down to < 50% original size.
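
For anyone who wants to check a claim like this on their own sample, here is a minimal sketch that measures a gzip compression ratio with Python's standard library. pigz produces gzip-format output, so the ratio should be in the same ballpark. The CSV rows below are synthetic and the column names are hypothetical, so the number it prints says nothing about the real trip files.

```python
import gzip


def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    compressed = gzip.compress(raw, compresslevel=level)
    return len(compressed) / len(raw)


# Synthetic stand-in for a trip CSV; the column names are assumptions.
header = "pickup_datetime,passenger_count,trip_distance,fare_amount\n"
rows = [
    f"2015-01-{1 + i % 28:02d} 12:{i % 60:02d}:00,"
    f"{1 + i % 4},{(i % 97) / 10:.1f},{5 + (i % 50) * 0.5:.2f}\n"
    for i in range(10_000)
]
sample = (header + "".join(rows)).encode("utf-8")

ratio = compression_ratio(sample)
print(f"compressed size is {ratio:.0%} of the original")
```

To try it on a real file, read the bytes with `open(path, "rb").read()` and pass them to `compression_ratio` instead of `sample`.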

~~~
jtsymonds
Twirrim,

Slightly different. We have appended all of the data from Factual as well.
This includes the location of every business in NYC.

~~~
35bge57dtjku
Why don't you publish the real data then?

~~~
imaginenore
Because it's their business?

------
raverbashing
What are the 'commuter confidential' tricks around bridges? I know some
bridges are tolled...

~~~
jtsymonds
Look at the coloring around the rides near bridges. People take the subway
down to the closest point and then take a cab home. The hybrid trip is both
pocketbook friendly and probably faster.

------
eternalban
Interesting data point for the scale-up vs scale-out debate.

[https://www.mapd.com/assets/static/images/barchart.png](https://www.mapd.com/assets/static/images/barchart.png)

~~~
infinite8s
That's an interesting slide, but without knowledge of the size of the dataset
it could be misleading (especially considering communication costs between
nodes in a cluster).

~~~
jtsymonds
Hi infinite8s, to get additional information on how that chart was made, you
can go to [https://www.mapd.com/product/](https://www.mapd.com/product/),
scroll down to the bar chart, and click “See Details” under the chart. It shows
the machines used, queries, and the source data set and size. Note that the
machine configurations used to generate the chart were normalized for
equivalent cost on AWS, i.e. the chart is hardware-dollar normalized.

------
m0atz
This is fucking awesome.

------
smlacy
This blog post has since been deleted. :(

------
devereaux
Failed to Load Dashboard TypeError: this.painter is undefined

HN effect?

------
vegabook
Very impressive technology, but is there an open source version? Even a
limited one? That one can try on something more modest than 100 grand's worth
of pro GPUs?

~~~
jtsymonds
There is not an open source version as yet, but you can spin up these
instances on an hourly basis on AWS
[https://aws.amazon.com/marketplace/pp/B01M0ZY2OV?qid=1475606...](https://aws.amazon.com/marketplace/pp/B01M0ZY2OV?qid=1475606291055&sr=0-1&ref_=srh_res_product_title)
and on IBM Softlayer.

~~~
vegabook
thanks, but at 5 bucks an hour for an entry-level instance (single 12GB GPU)
I'm looking at 120 bucks a day if I don't want to constantly re-upload my
dataset into MapD (a very slow operation judging by Mark Litwintschik's posts
linked by you). That's a very very high price for such a modest hardware
configuration, not to mention the more credible one which goes for an eye-
watering 30 bucks an hour ie not much change from a grand a day. Not for us
startup folk, clearly.
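
The daily figures above follow directly from the hourly rates. As a quick sketch, assuming the instance is left running around the clock and using only the two hourly prices mentioned in this comment (the function name is illustrative):

```python
HOURS_PER_DAY = 24

# Hourly rates as quoted in the comment above (USD, approximate).
ENTRY_LEVEL_HOURLY_USD = 5
LARGER_INSTANCE_HOURLY_USD = 30


def daily_cost(hourly_usd: float) -> float:
    """Cost of keeping an instance up for a full day."""
    return hourly_usd * HOURS_PER_DAY


entry_daily = daily_cost(ENTRY_LEVEL_HOURLY_USD)
larger_daily = daily_cost(LARGER_INSTANCE_HOURLY_USD)

print(f"entry-level: ${entry_daily} per day")
print(f"larger configuration: ${larger_daily} per day")
```

At $30 an hour the total comes to $720 a day, which is the "not much change from a grand" figure.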

I have to say that your pricing, for such a new entrant that has not yet built
share, seems bound to attract very stiff newcomer competition.
"Interesting" business model.

~~~
tmostak
MapD has a persistent store and normally customers would keep that on an EBS
volume, so they don't have to reload their data every time they spin up an AWS
instance.

~~~
vegabook
fair enough but software cost 3-4x the (already high) hourly hardware cost
seems excessive.

~~~
wingless
If you find this too pricey, then build a cheaper or free competitor.

------
ChoHag
How long before we stop bollocking about and just call them all PUs?

~~~
kalleboo
GPU: Generic Processing Unit

