

Amazon Redshift is 10x faster and cheaper than Hadoop and Hive - fujibee
http://www.slideshare.net/Hapyrus/amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive

======
cwsteinbach
Disclaimer: I'm a committer on the Apache Hive project.

A couple points in no particular order:

* EMR Hive is a closed source fork of the upstream Apache Hive code base. The EMR docs imply that the latest version of EMR Hive is based on Apache Hive 0.8.1 (which was released more than a year ago), which means EMR users aren't benefitting from the performance improvements that appeared in the 0.9 and 0.10 releases.

* It is implied (though not explicitly stated) that the Hive queries were run against gzip compressed TSV files stored in S3, while Redshift was allowed to spend _17 hours_ converting the same data to its own optimized internal format. Hive supports an optimized columnar format too (RCFile). Why wasn't that used in this performance comparison?

~~~
ajays
_Why wasn't that used in this performance comparison?_

Because then the stupid headline wouldn't be so sensationalist, would it?

// I have no dog in this fight, but hate twisted claims

------
meritt
Comparing a column-oriented RDBMS with parallel query execution versus hadoop
is a joke in the first place. Hadoop is extremely slow. That's nothing new.
This is not an apples-to-apples comparison whatsoever.

How does it compare against Greenplum or Aster or Vertica and is it more cost-
effective? Those are important questions.

~~~
nieksand
Comparing Redshift against Hadoop+Hive is reasonable. As you pointed out...
the technologies are very different. However, there is a large overlap in use
cases.

~~~
annnnd
I strongly disagree. I am (or was) using both Hadoop and HBase and they are
useful for very different purposes (huge amounts of nonstructured data,
possibly with difficult-to-predict use cases versus structured data). Also
note that Hive is just a layer over Hadoop with DB-like syntax; it doesn't
make Hadoop a DB. It is still running MR queries beneath it.

------
free652
So redshift took 155 seconds + 17 hours (17 * 3600) = 61355 secs total

vs 1491 Hadoop

Looks to me like Hadoop is about 40 times faster...

~~~
MBCook
That's like saying a bicycle is faster than a car because I can buy a bike in
10 minutes while it may take me a couple of hours to get through the car's
paperwork.

If you do enough queries, redshift will come out faster (assuming the numbers
are correct).
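
The break-even point behind this argument can be worked out from the figures quoted upthread (155 s per Redshift query, 1491 s per Hive query, 17 hours of load time). A minimal sketch, assuming every query costs the same as the benchmarked one and that Hive needs no load step at all; the function and variable names are illustrative only:

```python
import math

# Figures quoted in the thread, in seconds.
REDSHIFT_LOAD_S = 17 * 3600   # one-time import into Redshift
REDSHIFT_QUERY_S = 155        # per query
HIVE_QUERY_S = 1491           # per query (assumption: no load step for Hive)


def total_time(load_s, query_s, n_queries):
    """Total seconds for a one-time load plus n identical queries."""
    return load_s + query_s * n_queries


def break_even_queries(load_s, fast_query_s, slow_query_s):
    """Smallest query count at which the pre-loaded system is strictly ahead."""
    saved_per_query = slow_query_s - fast_query_s
    return math.floor(load_s / saved_per_query) + 1


n = break_even_queries(REDSHIFT_LOAD_S, REDSHIFT_QUERY_S, HIVE_QUERY_S)
print(n)  # 46
```

Under those assumptions the load time is paid back at the 46th query. Any change to the per-query numbers moves that point, so it is only as good as the benchmark figures themselves.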

~~~
TallGuyShort
If you do enough queries, you should spend the time to use RCFile for Hive, in
which case redshift won't come out _that_ much faster. The point is the 17
hours is not negligible.

~~~
dromidas
That is a good case since customers who typically need a data warehouse aren't
just going to upload data once... they probably are going to upload
frequently.

~~~
TallGuyShort
You're missing my point and resorting to sarcasm - very nice </sarcasm>. My
point is not that Hive is the better choice because everyone is going to
reload their data frequently. My point is that if you want a fair benchmark,
don't use an obviously slow data format for Hive. They spent time importing
data optimized for RedShift, but they took a very naive approach for Hive. I'm
sure RedShift will still be faster, but not 10 times faster.

------
iblaine
There's a motive here. hapyrus.com is pushing themselves as Redshift
consultants. Oh you're using Hadoop? Redshift is better, cheaper and you can
pay us to help you use it.

------
jaytaylor
I haven't tried redshift before, but coming from a MR/Hadoop/Hive background,
this seems to me like quite a sensational claim. I'd be very keen to hear
others' thoughts on how widely these kinds of gains would apply for BigData
processing.

As Carl Sagan said..

"Extraordinary claims require extraordinary evidence"

<http://en.wikipedia.org/wiki/Carl_Sagan>

~~~
saidajigumi
Hive is not particularly fast in and of itself; it just has horizontal scaling
and a SQL-ish front-end. Looking at AWS RedShift's homepage[1] (emphasis
added):

> Amazon Redshift delivers fast query and I/O performance for virtually any
> size dataset by using _columnar storage_ technology and parallelizing and
> distributing queries across multiple nodes.

Column-store databases[2] can be _screamingly_ fast for analytics operations
compared to RDBMS or other DB types (à la assorted NoSQL). See Kdb[3] or
MonetDB[4] for examples of specific implementations. I'd fully expect a
competent column store designed for horizontal scaling to obliterate Hive for
a wide range of problems.
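
To illustrate why a column layout helps analytics, here is a toy sketch. It is not how Redshift or any of the engines named above are implemented; it only shows that an aggregate over one column touches far fewer values when each column is stored separately. The table contents and all names are invented for illustration.

```python
# Toy table of (user_id, country, amount). Invented data.
rows = [
    (1, "US", 10.0),
    (2, "JP", 25.0),
    (3, "US", 5.5),
    (4, "DE", 12.0),
]

# Column layout: one list per column, built from the same data.
columns = {
    "user_id": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def sum_amount_row_store(table):
    """Scan every row; each row drags all of its fields along."""
    values_read = 0
    total = 0.0
    for row in table:
        values_read += len(row)
        total += row[2]
    return total, values_read


def sum_amount_column_store(cols):
    """Read only the single column the query needs."""
    amounts = cols["amount"]
    return sum(amounts), len(amounts)


row_total, row_reads = sum_amount_row_store(rows)
col_total, col_reads = sum_amount_column_store(columns)
print(row_total, row_reads)  # 52.5 12
print(col_total, col_reads)  # 52.5 4
```

Both layouts give the same answer, but the row scan reads three values per row while the column scan reads one. Real column stores add compression and vectorized execution on top of this, which the sketch does not attempt to model.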

The usual big-data caveat: you need to pay attention to the fit of your tools
against your problem and your data. I don't expect RedShift to be any
different. Still, it's pretty exciting to see a new analysis DB tech cropping
up like this. And doubly interesting to see this coming from Amazon.

[1] <https://aws.amazon.com/redshift/>

[2] <https://en.wikipedia.org/wiki/Column-oriented_DBMS>

[3a] <http://kx.com/kdb-plus.php>

[3b]
[https://en.wikipedia.org/wiki/K_%28programming_language%29#K...](https://en.wikipedia.org/wiki/K_%28programming_language%29#K_financial_products)

[4] <http://www.monetdb.org/Home>

~~~
AndyNemmity
SAP HANA has a column store, and a row store, and does OLAP (Analytics) and
OLTP.

There is a lot of new DB tech, Redshift doesn't seem particularly competitive
at the moment unless you only need to use it a portion of the time, where
Amazon excels.

------
jeremyjh
1.2 TB really is not very much data in the context of "Big Data". The supposed
advantage of Hadoop is that it can scale horizontally with linear performance.

~~~
pjscott
They note on slide 9 that this is only biggish data -- but sometimes, that's
what you need to work with.

------
ryanbrush
I wish the post had gone into depth on _why_ Redshift was significantly
faster, but I'm betting it uses in-memory joins (hence the size
limitations it mentions), whereas Hive joins are just MapReduce jobs that keep
only minimal subsets of data in memory at a given point. The upshot is the
Hive/MapReduce strategy isn't limited by physical memory.

Of course, if your data set can fit in memory, then Redshift or similar
technologies are probably a better choice than Hive. But it's important to
remember that the performance gains here come as the result of a tradeoff.
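
The memory tradeoff described here can be sketched with two generic join strategies. This is not a description of how Redshift or Hive actually execute joins, which the thread does not establish; the merge join below is only a stand-in for a reduce-side join, where the shuffle delivers rows sorted by key. All names and data are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def hash_join(left, right):
    """In-memory hash join: the whole left side must fit in a dict."""
    index = defaultdict(list)
    for key, lval in left:
        index[key].append(lval)
    return [(key, lval, rval) for key, rval in right for lval in index.get(key, [])]


def sort_merge_join(left_sorted, right_sorted):
    """Merge join over inputs already sorted by key.

    Only the left-side rows for the current key are held in memory.
    """
    out = []
    left_groups = groupby(left_sorted, key=itemgetter(0))
    right_groups = groupby(right_sorted, key=itemgetter(0))
    lkey, lgrp = next(left_groups, (None, None))
    rkey, rgrp = next(right_groups, (None, None))
    while lgrp is not None and rgrp is not None:
        if lkey < rkey:
            lkey, lgrp = next(left_groups, (None, None))
        elif lkey > rkey:
            rkey, rgrp = next(right_groups, (None, None))
        else:
            lvals = [v for _, v in lgrp]  # buffer one key's worth of left rows
            out.extend((lkey, lv, rv) for _, rv in rgrp for lv in lvals)
            lkey, lgrp = next(left_groups, (None, None))
            rkey, rgrp = next(right_groups, (None, None))
    return out


users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]

hashed = hash_join(users, orders)
merged = sort_merge_join(users, orders)
```

Both functions return the same matches for this input. The hash join avoids sorting but needs the whole build side in memory, while the merge join needs sorted input but buffers only one key at a time, which is the kind of tradeoff the comment above is pointing at.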

~~~
taligent
It was significantly faster because, as was mentioned above, the graph ignores
the 17 HOURS it took for RedShift to import the data.

The comparison is a complete and utter joke.

------
tonfa
I wonder how it compares to BigQuery:
<https://developers.google.com/bigquery/docs/pricing>

------
pytrin
Worth noting this presentation was made by Hapyrus, a Hadoop specialized
startup from 500startups. They know quite a bit about running Hadoop.
Following the results of their tests they are now adding Redshift support to
their services.

~~~
Uchikoma
... and want to sell their Redshift services starting with a bang.

------
ameyamk
They should compare redshift with hadoop + Impala, OR hbase with Phoenix
from Salesforce. Comparing with hadoop + hive is not a correct comparison.

------
BrianEatWorld
I am still new to large data, but isn't a solution like Redshift similar to
Google's Big Query in that it only works with data that has a schema? How
might one use Redshift with a db that's originally in Mongo?

~~~
zeeg
You won't fit that much data into Mongo anyways, so does it matter?

~~~
taligent
People have been apparently storing 3TB of data in MongoDB.

So I guess it does matter.

------
Radim
Indeed, this comparison seems fishy.

Nevertheless, I'll take a moment to predict that articles like this will
only become more frequent over time. Hadoop has entered its
"enterprisey" stage, with massively complex, cumbersome code, arcane
performance tuning, bullshit consulting business built around it (complete
with books and "certificates")...

The more agile competitors will be snapping at its flanks (and ankles),
sometimes without merit, and sometimes with.

------
AndyNemmity
I'd be interested in seeing a comparison between Redshift and SAP HANA, but a
more fair comparison than this one by someone who isn't partisan.

------
z_
Slides 2 & 6: one query every 30 minutes.

It turns out usage based billing can be cheaper if you don't use a resource.

~~~
danudey
I'm willing to bet that's a not-uncommon scenario for a lot of organizations,
however. If you're doing continuous querying of large amounts of data, then
it's probably worth building your own hadoop cluster (physically or via
Amazon), but a lot of people are just going to accumulate data and then make
queries against it. Lots of 'active users per day', 'traffic by hour',
'purchases by popularity', etc. only get run to create data for the CEO every
morning, or by the marketing manager every afternoon, that sort of thing.

------
fujibee
We wrote the blog post about this benchmark.
[http://www.hapyrus.com/blog/posts/behind-amazon-redshift-is-...](http://www.hapyrus.com/blog/posts/behind-amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive-slides)

------
kushti
Seems like stupid marketing shit

------
hobbyist
Are they benchmarking hash join on hadoop and redshift?

------
cmccabe
Anyone interested in SQL queries on Hadoop should be checking out Cloudera
Impala. It's open source.

