
1.1B taxi rides benchmarked on distributed GPU-powered MapD - tmostak
http://tech.marksblogg.com/billion-nyc-taxi-rides-aws-ec2-p2-8xlarge-mapd.html
======
arnon
1.1B records = 500GB of raw CSV data. This fits into RAM quite easily on a
machine like the P2.8xlarge, especially when compression is used (like MapD
uses).
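As a rough back-of-the-envelope check (the 4x compression ratio here is an assumption; columnar encodings commonly land in the 3-5x range), the compressed data fits comfortably in a p2.8xlarge's 488 GiB of system memory:

```python
GiB = 2 ** 30

raw_csv_bytes = 500 * 10 ** 9       # ~500 GB of raw CSV, per the post
p2_8xlarge_ram = 488 * GiB          # p2.8xlarge system memory (488 GiB)
compression_ratio = 4               # assumed; typical for columnar encodings

compressed_bytes = raw_csv_bytes / compression_ratio
print(round(compressed_bytes / GiB))      # ~116 GiB
print(compressed_bytes < p2_8xlarge_ram)  # True
```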

I'd like to see how well this performs on a dataset that doesn't fit in the
RAM.

~~~
tmostak
It's honestly a pretty small dataset for us. MapD can easily do sub-100ms
queries on 100B+ records with a single rack.

You can fit a lot of GPU + CPU memory on a small cluster. That RAM goes even
further when you use compression like you mentioned.

We're fast at pulling data off of disk but we've never really aimed to be a
traditional disk-based data warehouse. There are already great systems for
that.

~~~
menegattig
At what point do you think it starts making sense cost-wise to use MapD (or
GPUs in general) instead of Redshift or BigQuery?

~~~
kornish
My guess is that it makes more sense when interactive query latency (hundreds
of ms) is of the utmost concern.

~~~
menegattig
Right, I agree with you.

I was just wondering what kind of companies (apart from the financial sector)
would be willing to spend hundreds of thousands of dollars to get their
latency from hundreds of ms down to dozens of ms. I'm saying that because with
a very well-tuned Redshift cluster, you can easily get dozens of ms for your
queries while spending thousands of dollars, not hundreds of thousands.

~~~
tmostak
You might be surprised.

Telcos need to troubleshoot network problems in real time, automakers and
insurance companies need to track cars in real time, oil companies need to
interactively query and visualize geological data, and the infosec industry
needs real-time packet analysis. We have customers in almost every vertical,
all united by their need for real-time analytics. Some want to use MapD for
visualization, others for programmatic querying for things like fraud
detection, and others still to feed data into machine learning algorithms.

I'm also curious how you envisage paying thousands of dollars per year to get
queries in dozens of ms on datasets this size, much less 10-100X larger (which
customers would often use MapD for). Mark benchmarked a 6-node ds2.8xlarge
cluster of Redshift (> $40/hour) and found it up to 70X slower than MapD on
this dataset. That's similar to our price on Amazon for this 2-node cluster.

I'm not saying Redshift isn't a great system, just that I don't buy the
price/performance numbers you're quoting (for real workloads, not some
specific query that can be indexed well, etc.).

~~~
menegattig
Very clear, thanks so much for the detailed explanation.

I'm still curious about the cost for a scenario where you have 1 billion
users and 200 billion events for a year of data, keep adding 10 billion more
each month (a very real DMP or Telco scenario), and have to run a query like
the one below on top of all that data (200 billion records). I'm wondering how
many MapD servers, and how much infrastructure, I would need in order to get
results in under 100 ms.

Count unique users from "San Francisco" OR "New York" who accessed the pages
"/sports" OR "/news" more than 3 times in the past 12 months.
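For concreteness, here is one way that query could be expressed in SQL, sketched against SQLite on toy data (the `events` table, its `user_id`/`city`/`page`/`ts` columns, and the fixed date cutoff standing in for "past 12 months" are all assumptions for illustration):

```python
import sqlite3

# Hypothetical schema: one row per page view.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, city TEXT, page TEXT, ts TEXT)")

rows = [
    # user a: San Francisco, 4 qualifying views -> counted
    *[("a", "San Francisco", "/sports", "2017-01-0%d" % d) for d in range(1, 5)],
    # user b: New York, only 2 qualifying views -> dropped by HAVING
    ("b", "New York", "/news", "2017-01-01"),
    ("b", "New York", "/news", "2017-01-02"),
    # user c: wrong city -> dropped by WHERE
    *[("c", "Chicago", "/sports", "2017-01-0%d" % d) for d in range(1, 6)],
]
con.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Count distinct users who match the city/page filters and have
# more than 3 qualifying events in the time window.
(n,) = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT user_id
        FROM events
        WHERE city IN ('San Francisco', 'New York')
          AND page IN ('/sports', '/news')
          AND ts >= '2016-05-01'    -- stand-in for "past 12 months"
        GROUP BY user_id
        HAVING COUNT(*) > 3
    )
""").fetchone()
print(n)  # -> 1
```

The expensive part at 200B rows is the `GROUP BY user_id` over a billion-key cardinality, which is why the distributed-memory question matters.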

------
mmrezaie
Is there a bridge for using MapD with a Spark interface, or some way of
combining them? That could be interesting for clusters with a lot of GPUs and
a lot of data to manipulate.

~~~
Bedon292
That sounds amazing; I would totally use that. I am also hoping that they
will create GDAL bindings for geospatial data.

~~~
tmostak
We actually already can import geospatial formats via GDAL (shapefiles,
geojson, kml). We can render points and polygon data. More geospatial
abilities to come!

~~~
Bedon292
Awesome! Thanks for the update, super stoked for that. Can you also export
through GDAL?

------
trafficlight
Found elsewhere on the internet: 'On a system with eight Tesla K80s, which
might cost somewhere between $60,000 to $70,000, the license for the MapD
stack would be “a small multiple” of this hardware cost.'

I guess I'm not playing with this anytime soon.

~~~
Bedon292
Still not cheap, but you could play with it on a p2.xlarge AWS instance for
$4 an hour:
[https://aws.amazon.com/marketplace/pp/B01M0ZY2OV](https://aws.amazon.com/marketplace/pp/B01M0ZY2OV)

------
Bedon292
Amazing to see the improvements that MapD has made over the past few years. I
have been following them for a long time, and was excited to catch wind of 3.0
this morning. Then I get on here to see someone already benchmarking and
working with it.

------
baronseng
On the benchmark page [1], Mark did a good summary of how the various
technologies compare.

[1]
[http://tech.marksblogg.com/benchmarks.html](http://tech.marksblogg.com/benchmarks.html)

------
menegattig
Price comparison between Amazon Redshift, Google BigQuery, ElasticSearch and
SlicingDice using the same dataset:

[https://blog.slicingdice.com/slicingdice-pricing-model-and-c...](https://blog.slicingdice.com/slicingdice-pricing-model-and-competitors-comparison-31f1c9f0f076)

~~~
marklit
Same dataset? Mind telling me the uncompressed byte count?

