

Data-Processing Frameworks Benchmark: Redshift, Hive, Shark, Impala - ceyhunkazel
https://amplab.cs.berkeley.edu/benchmark/

======
espeed
Spark is a big deal
([http://spark.incubator.apache.org/](http://spark.incubator.apache.org/)).
It's a next-gen, open-source cluster-computing system built on top of the
Berkeley Data Analytics Stack (BDAS -
[https://amplab.cs.berkeley.edu/software/](https://amplab.cs.berkeley.edu/software/)),
which includes Mesos, Spark, SparkStreaming, Shark, and GraphX (to name a
few).

Mesos is the foundation of the stack, and Spark started out as a research
project because they needed something to run on Mesos. But you can also run
Hadoop on Mesos, and you can run Spark and Hadoop on the same Mesos cluster.
Twitter runs almost everything on Mesos and works directly with AMPLab on the
project.

See Benjamin Hindman's presentation on "Managing Twitter Clusters with Mesos"
([http://www.youtube.com/watch?v=37OMbAjnJn0&list=PL9F5093F238...](http://www.youtube.com/watch?v=37OMbAjnJn0&list=PL9F5093F238695612&index=5)).

SparkStreaming replaces the need for Storm and handles failures/stragglers
better. As Nathan Marz, the creator of Storm, said, "Spark is interesting
because it extends MapReduce with a new primitive that allows Pregel to be
built on top of it. So Spark is both Hadoop and Pregel"
([http://nathanmarz.com/blog/thrift-graphs-strong-flexible-sch...](http://nathanmarz.com/blog/thrift-graphs-strong-flexible-schemas-on-hadoop.html#comment-334743458)).

GraphX ([https://amplab.cs.berkeley.edu/publication/graphx-grades/](https://amplab.cs.berkeley.edu/publication/graphx-grades/))
is new, and it's GraphLab2 built on Spark, which enables fast processing of
Pregel-like algorithms. GraphLab2 ([http://graphlab.org/](http://graphlab.org/))
includes a suite of machine learning tools (similar to Mahout).

Berkeley's "Analyzing Big Data with Twitter" series
([http://www.youtube.com/playlist?list=PLE8C1256A28C1487F](http://www.youtube.com/playlist?list=PLE8C1256A28C1487F))
includes a couple of presentations related to the project.

The last presentation is by the Spark lead, Matei Zaharia
([http://www.cs.berkeley.edu/~matei/](http://www.cs.berkeley.edu/~matei/)),
and he gives a good high-level overview: "Analyzing Big Data with Twitter:
Spark"
([http://www.youtube.com/watch?v=rpXxsp1vSEs&list=PLE8C1256A28...](http://www.youtube.com/watch?v=rpXxsp1vSEs&list=PLE8C1256A28C1487F&index=15)).

There is another presentation in the series by the GraphLab lead, Joey Gonzalez
([http://www.cs.cmu.edu/~jegonzal/](http://www.cs.cmu.edu/~jegonzal/)):
"GraphLab: Big Learning with Graphs"
([http://www.youtube.com/watch?v=E1LwqtBdPYs](http://www.youtube.com/watch?v=E1LwqtBdPYs)).

See also the GraphLab paper "PowerGraph: Distributed Graph-Parallel
Computation on Natural Graphs"
([https://www.usenix.org/system/files/conference/osdi12/osdi12...](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-167.pdf))
and presentation
([https://www.usenix.org/conference/osdi12/167-powergraph-dist...](https://www.usenix.org/conference/osdi12/167-powergraph-distributed-graph-parallel-computation-natural-graphs)).

AMPLab has plenty of sponsors
([https://amplab.cs.berkeley.edu/sponsors/](https://amplab.cs.berkeley.edu/sponsors/));
both Twitter and Yahoo are adopting the stack, and evidently Facebook and
Amazon may too
([http://www.wired.com/wiredenterprise/2013/06/yahoo-amazon-am...](http://www.wired.com/wiredenterprise/2013/06/yahoo-amazon-amplab-spark/all/)).
The more I learn about BDAS, the more I think it's going to usurp
Hadoop/Storm.

For more on AMPLab, see...

AMPLab Stack Presentations:
[http://www.youtube.com/user/BerkeleyAMPLab/feed?activity_vie...](http://www.youtube.com/user/BerkeleyAMPLab/feed?activity_view=5)

Slides/Summaries:
[http://ampcamp.berkeley.edu/amp-camp-one-berkeley-2012/](http://ampcamp.berkeley.edu/amp-camp-one-berkeley-2012/)

~~~
izendejas
There was some big news this week regarding Spark:
[http://gigaom.com/2013/09/25/databricks-raises-14m-from-andr...](http://gigaom.com/2013/09/25/databricks-raises-14m-from-andreessen-horowitz-wants-to-take-on-mapreduce-with-spark/)

The creators, Matei Zaharia, Ion Stoica, et al., raised $14M from Andreessen
Horowitz. With Tachyon (an in-memory file system that supports lineage and is
thus fairly robust) and, more recently, MLBase, one should look beyond Spark's
excellent performance and consider the overall package and the versatility
it provides.

I think many people somewhat pointlessly get caught up in arguments about what
constitutes big data and what doesn't. As someone who's used Spark
substantially for machine learning as well as other complex types of
processing beyond your standard joins, filters, etc., I find Spark incredibly
useful even for a few GB of data because it allows one to iterate rapidly.

With MLBase and all, I think Spark will really have an impact because your
average engineer will be able to run some standard ML algorithms out of the
box at scale. That is huge. That's what matters with data (big or not) -- it's
the insights you can gain.

Edited: typos. Also, some more on MLBase:
[http://www.mlbase.org](http://www.mlbase.org), which will be released tomorrow
if I'm not mistaken.

~~~
espeed
Congrats to Matei and Prof Stoica on the new company.

------
AmiiJewels
I applaud the effort, but is this really "big data"? The largest data sets
they seem to test are ~150GB, which would fit comfortably on my MacBook Pro a
number of times over. Many of the systems being tested are designed to scale
efficiently once the data exceeds 5TB, so I am dubious about the median
response time results: things that work well for small datasets (where small
is defined as < 1TB) can easily fall apart when you scale them up a little bit
more.

~~~
physcab
These technologies aren't useful just for storing data. Yes, you can do that
on your MacBook. These are useful for when you need answers quickly or need to
perform complex analysis. For example, Spark (the engine behind Shark) allows
you to run things like logistic regressions at scale in just a few seconds. As
far as I know, you can't load 150GB of data into memory in R on your MacBook
and then run a logistic regression a few seconds later.
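
A rough illustration of why this kind of workload suits a map/reduce-style engine: each gradient-descent step of logistic regression is a sum of per-record gradients, so it can be computed as a map over the records followed by a reduce. The sketch below uses plain Python `map` and `functools.reduce` on a tiny made-up data set. It is not Spark's API, and the data, learning rate, and iteration count are assumptions chosen only for illustration.

```python
import math
from functools import reduce

# Toy, made-up data: (features, label) with label in {0, 1}.
# The first feature is a constant 1.0 that acts as the intercept.
POINTS = [
    ([1.0, 0.5], 0),
    ([1.0, 1.0], 0),
    ([1.0, 1.5], 0),
    ([1.0, 3.0], 1),
    ([1.0, 3.5], 1),
    ([1.0, 4.0], 1),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Predicted probability that the label is 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def point_gradient(weights, point):
    """'Map' step: gradient of the log loss for a single record."""
    features, label = point
    error = predict(weights, features) - label
    return [error * x for x in features]


def add_vectors(a, b):
    """'Reduce' step: element-wise sum of two gradient vectors."""
    return [x + y for x, y in zip(a, b)]


def train(points, iterations=2000, learning_rate=0.1):
    weights = [0.0] * len(points[0][0])
    for _ in range(iterations):
        # In a cluster engine the map would run in parallel over partitions
        # of the data; here it simply runs over an in-memory list.
        gradient = reduce(
            add_vectors, map(lambda p: point_gradient(weights, p), points)
        )
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


weights = train(POINTS)
```

Because the data is reused on every iteration, keeping it in memory between passes is where most of the speedup over re-reading from disk would come from.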

------
CurtMonash
The author had a terrible brain cramp in the sentence "Redshift uses columnar
compression which allows it to bypass a field which is not used in the query."

That totally confuses columnar compression with columnar I/O, an error I've
been railing against for several years, e.g. in
[http://www.dbms2.com/2011/02/06/columnar-compression-databas...](http://www.dbms2.com/2011/02/06/columnar-compression-database-storage/)
(i.e., ever since Oracle tried to popularize the confusion). But this is a
particularly bad instance.
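
To make the distinction concrete, here is a small sketch in plain Python. Columnar I/O means each column is stored separately, so a query reads only the columns it names; columnar compression means the values within one column are encoded compactly (run-length encoding is used here as one example). The table, column names, and helper functions are hypothetical and not taken from Redshift or any other product.

```python
# Hypothetical row-oriented table: a scan has to touch whole rows.
ROWS = [
    ("a.com", 10, "US"),
    ("b.com", 10, "US"),
    ("c.com", 20, "US"),
    ("d.com", 20, "DE"),
]
COLUMN_NAMES = ("page_url", "page_rank", "country")


def to_columns(rows, names):
    """Columnar layout: one independent list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def scan(columns, wanted):
    """Columnar I/O: read only the requested columns and skip the rest."""
    columns_read = [name for name in columns if name in wanted]
    data = {name: columns[name] for name in columns_read}
    return data, columns_read


def rle_encode(values):
    """Columnar compression: run-length encode the values of ONE column."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def rle_decode(runs):
    """Expand [value, count] runs back into the original column values."""
    return [value for value, count in runs for _ in range(count)]


columns = to_columns(ROWS, COLUMN_NAMES)

# I/O benefit: a query on page_rank never touches page_url or country.
data, columns_read = scan(columns, {"page_rank"})

# Compression benefit: fewer stored entries for the column that IS read.
compressed_rank = rle_encode(columns["page_rank"])
```

In this sketch the two mechanisms are separate functions: `scan` skips unused columns whether or not the data is compressed, while `rle_encode` only shrinks the column that still has to be read.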

~~~
keithgabryelski
I also don't see a schema that is tuned for Redshift.

There is no description of a sort key or a distribution key (let alone which
compression encoding was used).

And at least for the first query, you could use UNLOAD instead of SELECT.
Depending on how you are managing the data coming out of Redshift, it might be
a reasonable solution, and it doesn't force all your data through the leader
node, constrained by whatever client driver you are using to read the results.

Instead of trying to select this much data out of Redshift, select the data
into another table or (again) use UNLOAD.
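
As a sketch of what such tuning might look like, the Python snippet below just assembles SQL strings. `DISTKEY`, `SORTKEY`, `ENCODE`, `CREATE TABLE ... AS`, and `UNLOAD` are Redshift SQL; the table and column names, the choice of keys and encodings, the filter value, the S3 prefix, and the credentials placeholder are all assumptions for illustration, not the benchmark's actual schema or settings.

```python
# Every identifier and value below is a hypothetical placeholder.

# A table definition that states its distribution key, sort key, and
# per-column compression encodings explicitly.
CREATE_RANKINGS = """
CREATE TABLE rankings (
    page_url     VARCHAR(300) ENCODE lzo,
    page_rank    INT,
    avg_duration INT ENCODE delta
)
DISTKEY (page_url)
SORTKEY (page_rank);
""".strip()

# A scan-style query whose result set could be large.
SCAN_QUERY = "SELECT page_url, page_rank FROM rankings WHERE page_rank > 1000"

# Option 1: keep the result inside the cluster in another table.
CREATE_RESULT_TABLE = f"CREATE TABLE rankings_result AS {SCAN_QUERY};"


def unload_statement(query, s3_prefix, credentials):
    """Option 2: build an UNLOAD that writes the result to S3.

    Queries containing single quotes are rejected, because quoting rules
    inside UNLOAD are out of scope for this sketch.
    """
    if "'" in query:
        raise ValueError("query with single quotes is not handled in this sketch")
    return f"UNLOAD ('{query}') TO '{s3_prefix}' CREDENTIALS '{credentials}';"


UNLOAD_RESULT = unload_statement(
    SCAN_QUERY,
    "s3://example-bucket/rankings_result_",  # hypothetical bucket and prefix
    "<credentials-string>",                  # placeholder, never a real secret
)
```

Either option avoids pulling the full result set back through a client driver.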

------
ceyhunkazel
For a starter guide to Amazon Redshift, there is a book:
[http://www.amazon.com/Getting-Started-Amazon-Redshift-Stefan...](http://www.amazon.com/Getting-Started-Amazon-Redshift-Stefan/dp/1782178082/)

------
ceyhunkazel
SAP HANA would be a better option than Redshift. You can get a cloud version
of HANA. It supports R, JavaScript, ArcGIS, and more SQL data types.

~~~
bcoates
Redshift is $0.43/TB-hour; it looks like HANA on AWS is around $59/TB-hour.
You get a lot for the money (HANA software, in-memory, tons more CPU), but your
workload had better really need it at that price difference!

~~~
ceyhunkazel
I think you calculated wrong.
[https://aws.amazon.com/marketplace/pp/B009KA3CRY](https://aws.amazon.com/marketplace/pp/B009KA3CRY)

EC2 Instance Type: 8XL (cc2.8xlarge)

Software: $0.99/hr, EC2: $2.50/hr, Total: $3.49/hr

EBS Storage Fees: $0.10/GB/month for Standard EBS Storage
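
One way to compare the two figures is to normalize the hourly price by capacity. The sketch below does that arithmetic. The 60.5 GB memory figure for cc2.8xlarge is an assumption not stated in this thread, and treating HANA's capacity as instance memory (since it is an in-memory database) is likewise an assumption. EBS storage fees are ignored.

```python
def price_per_tb_hour(price_per_hour, capacity_gb):
    """Hourly price divided by capacity expressed in TB (1 TB = 1000 GB)."""
    return price_per_hour / (capacity_gb / 1000.0)


# Prices from the marketplace listing quoted above.
SOFTWARE_PER_HOUR = 0.99
EC2_PER_HOUR = 2.50
TOTAL_PER_HOUR = SOFTWARE_PER_HOUR + EC2_PER_HOUR

# Assumption: memory of one cc2.8xlarge instance, used as HANA's capacity.
ASSUMED_MEMORY_GB = 60.5

# Figure quoted for Redshift in the parent comment.
REDSHIFT_PER_TB_HOUR = 0.43

hana_per_tb_hour = price_per_tb_hour(TOTAL_PER_HOUR, ASSUMED_MEMORY_GB)
ratio = hana_per_tb_hour / REDSHIFT_PER_TB_HOUR

print(f"HANA: ${hana_per_tb_hour:.2f}/TB-hour, about {ratio:.0f}x Redshift")
```

Under these assumptions the result is roughly $58/TB-hour, which is close to the "around $59/TB-hour" estimate quoted earlier in the thread; a different capacity assumption would give a different number.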

