

MapReduce and Spark - rxin
http://vision.cloudera.com/mapreduce-spark

======
lmm
I like Spark over Hadoop just from an interface point of view, particularly
the ability to start up a (Scala) shell and start playing around immediately.
Hadoop can be very effective, but even getting "hello world" to run requires
an intimidating amount of setup.
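
Something like this is the whole "hello world" once the shell is up (a rough
sketch; the shell binds `sc` for you, and the file path is made up):

    scala> val lines = sc.textFile("data/access.log")
    scala> lines.filter(_.contains("ERROR")).count()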

~~~
gknight
Have you tried Apache Hive? I believe it was meant to make Hadoop easier to
use by way of SQL-like commands. Something like Qubole might be able to help
too.

~~~
pavanred
Pig is another option. It lets you use SQL-like commands in the Grunt shell,
which makes Hadoop a lot easier to use.

~~~
rxin
Going from Hive/Pig to Spark substantially improves developer productivity
(for non-reporting/BI workloads). You can properly unit test your program,
use a debugger, and keep all your code in the same place in the same language
(rather than, as in the case of Pig, writing UDFs in Java and then using a
pseudo-scripting language for workflow specification).

All of these are just the productivity gains, before even counting the
performance gains you get when you go from MapReduce to Spark.
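
For example, here's a rough sketch of what testable Spark code can look like
(the object names are invented, and it assumes a local-mode SparkContext):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair-RDD implicits (reduceByKey etc.)
    import org.apache.spark.rdd.RDD

    // The transformation under test is ordinary Scala code, living in the
    // same place and language as everything else: no separate UDF jar.
    object WordCount {
      def count(lines: RDD[String]): RDD[(String, Int)] =
        lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
    }

    // A plain unit test against a local SparkContext: no cluster required,
    // and you can step through it in a debugger.
    object WordCountTest extends App {
      val sc = new SparkContext("local[2]", "wordcount-test")
      try {
        val result = WordCount.count(sc.parallelize(Seq("a b", "a"))).collectAsMap()
        assert(result == Map("a" -> 2, "b" -> 1))
      } finally {
        sc.stop()
      }
    }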

------
hobbyist
I often read that Spark avoids the costly synchronization required in
MapReduce, since it uses DAGs. Can someone explain how that is achieved? If
the application allows jobs to be launched together, that can be done even
with Hadoop/MapReduce. And if one job requires the output of another, then
that job has to wait for it whether you use MapReduce or a DAG.

~~~
xtreme
Spark's major benefit comes from keeping intermediate results in memory
instead of writing them to HDFS as Hadoop does. Say a certain query needs to
run three MapReduce jobs A, B, C one after another. In Hadoop, that means
three HDFS reads and three writes. With Spark, there is only one HDFS read
(before launching A) and one write (after C completes). In Spark, the output
of A is kept in RAM, where B reads it, and so on until the final write.

The DAG used by Spark records how one job/partition of data depends on
another and which operations (e.g. filter) must be applied to the parent
data to get the child data. This is useful when a node goes down and that
portion of the data has to be recomputed. Note that users can choose to
persist some intermediate results to HDFS to avoid recomputation in case of
failure.
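
To make that concrete, here's a rough sketch of such an A -> B -> C pipeline
(paths and record layout are invented for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits (reduceByKey etc.)

    object Pipeline extends App {
      // Local mode just for this sketch; a real job would run on a cluster.
      val sc = new SparkContext(new SparkConf().setAppName("pipeline").setMaster("local[*]"))

      // The single HDFS read, before stage A.
      val input = sc.textFile("hdfs:///data/events") // invented path

      // Stage A: parse. cache() keeps A's output in RAM for B to read,
      // instead of writing it back to HDFS as MapReduce would.
      val parsed = input.map(_.split('\t')).cache()

      // Stage B: filter, reading A's output straight from memory.
      val errors = parsed.filter(fields => fields(1) == "ERROR")

      // Stage C: aggregate, then the single HDFS write at the end.
      // (To guard against recomputation after a failure, one could instead
      // persist an intermediate RDD to disk.)
      val counts = errors.map(fields => (fields(0), 1)).reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs:///out/error-counts") // invented path

      sc.stop()
    }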

------
justinkestelyn
Some interesting use cases are also described on Cloudera's developer blog,
at [http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/](http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/).

------
fintler
Although Spark is nice, I'm also looking forward to MPI/orted integration
with Hadoop...

"Performance: Launches ~1000x faster, runs ~10x faster"

"Launch scaling: Hadoop (~N), MR+ (~logN)"

"Wireup: Hadoop (~N2), MR+ (~logN)"

[http://slurm.schedmd.com/slurm_ug_2012/MapRedSLURM.pdf](http://slurm.schedmd.com/slurm_ug_2012/MapRedSLURM.pdf)

------
wheaties
What I would love to know is whether Mahout works out of the box with Spark,
or whether there's a third-party library that bridges the two.

~~~
wandermatt
No. See MLbase [http://mlbase.org](http://mlbase.org)

