

Scrap your MapReduce – Introduction to Apache Spark - Garbage
http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/

======
jnaour
Good introduction. Spark is really a project to watch for data analysis on
distributed architectures. We ran several benchmarks and Spark keeps its
promises: it was 2.5x faster than Pig for the same algorithm on the same
cluster.

For iterative algorithms that can take advantage of in-memory processing,
performance is really good compared to Hadoop.
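
To illustrate why keeping data in memory helps iterative jobs, here is a toy
sketch in plain Python. It is not Spark code and says nothing about real
cluster timings. `load_records` is a hypothetical stand-in for an expensive
read from disk, and the script only counts how often the data is read when
each iteration reloads it versus when it is loaded once and reused.

```python
# Toy model of in-memory reuse for an iterative job (plain Python, not Spark).
# load_records is a hypothetical stand-in for an expensive read from disk.

load_count = 0


def load_records():
    """Pretend to read the dataset from disk and count each read."""
    global load_count
    load_count += 1
    return [1.0, 2.0, 3.0, 4.0]


def step(records, estimate):
    """One iteration: move the estimate halfway toward the mean of the records."""
    mean = sum(records) / len(records)
    return estimate + 0.5 * (mean - estimate)


def run_reloading(iterations):
    """Re-read the data on every iteration, as a chain of separate jobs would."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(load_records(), estimate)
    return estimate


def run_cached(iterations):
    """Read the data once and reuse it from memory on every iteration."""
    records = load_records()
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(records, estimate)
    return estimate


load_count = 0
reloaded_result = run_reloading(10)
reads_when_reloading = load_count

load_count = 0
cached_result = run_cached(10)
reads_when_cached = load_count

print("reads when reloading:", reads_when_reloading)
print("reads when cached:", reads_when_cached)
```

Both variants compute the same estimate. The only difference is how many
times the simulated read happens, which is the cost that in-memory reuse is
meant to avoid.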

The project is still young with several bugs but the documentation is really
good and the code is well commented and robust.

~~~
deadgrey19
As part of our work we have extensively compared Spark with Hadoop MapReduce,
Naiad and several other frameworks across various workloads, clusters and
cluster sizes. We've found Spark to be temperamental, hard to configure, and
wildly variable in performance, suited only to the small set of computations
for which in-memory state reuse is beneficial (mostly it isn't).

In nearly every test Naiad has beaten Spark.

More info on Naiad:
[http://research.microsoft.com/en-us/projects/naiad/](http://research.microsoft.com/en-us/projects/naiad/)

~~~
dekhn
MSFT killed Naiad's predecessor Dryad in favor of Hadoop some time ago,
because Hadoop was becoming popular. The primary author linked on the page
works at "Microsoft Silicon Valley", which was just shut down, and he in fact
now lists himself on LinkedIn as "Researcher At Large, previously at
Microsoft".

So, how do we know Naiad has much of a future? Technologically, it may be
better/more reliable/faster, but if it's a niche product that gets desupported
just because it never took off... it doesn't really matter.

Spark on the other hand has a great deal of momentum and in my experience,
momentum and adoption trump technical elegance in the short run...

(don't get me wrong: I thought Dryad was awesome. Google's Flume is very
similar in some ways. MapReduce's days are numbered except for a small number
of problems which can't be easily ported).

~~~
deadgrey19
All good points. I can't say what the future of Naiad is. What they have done
to Microsoft Research Silicon Valley is disgusting (I worked there too for a
short time).

In our experience, the performance claims for Spark have been more hype than
substance. With Naiad, on the other hand, it has been hard to find a corner
case that trips it up.

Naiad is open source, licensed under the Apache License, so one can only
hope...

~~~
deadgrey19
FYI: Link to Naiad github repo:
[https://github.com/MicrosoftResearch/Naiad](https://github.com/MicrosoftResearch/Naiad)

~~~
dekhn
Thanks, but: this is a dead project.

Also, it appears to be tied to Windows (it's delivered as a VS solution).

------
frak_your_couch
If you are interested in this, you might be interested in my (warning:
shameless plug) five-part blog series located at
[http://blog.caseystella.com/pyspark-openpayments-analysis.ht...](http://blog.caseystella.com/pyspark-openpayments-analysis.html).
I'm using the Python bindings for Spark to illustrate doing data analysis on
healthcare financial data on Hadoop.

------
virmundi
Spark is nice, but its memory model almost requires a full cluster overhaul.
We looked at it at my last project. Our cluster nodes only had 64 GB of RAM.
That was carved into 4 GB workers. In order to use Spark we'd have to halve
our number of workers because of the memory requirements.
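
The arithmetic behind those numbers can be sketched as follows. This assumes
the full 64 GB on each node is divided among workers with nothing reserved
for the OS or daemons, which the comment does not state. The function names
are made up for illustration.

```python
# Worked arithmetic for the cluster numbers quoted above.
# Assumption: all 64 GB per node goes to workers (no OS or daemon reserve).

NODE_RAM_GB = 64
WORKER_RAM_GB = 4


def workers_per_node(node_ram_gb, worker_ram_gb):
    """Number of fixed-size workers that fit on one node."""
    return node_ram_gb // worker_ram_gb


def ram_per_worker(node_ram_gb, workers):
    """Memory each worker gets when node RAM is split evenly."""
    return node_ram_gb / workers


current_workers = workers_per_node(NODE_RAM_GB, WORKER_RAM_GB)
halved_workers = current_workers // 2
halved_ram_gb = ram_per_worker(NODE_RAM_GB, halved_workers)

print("workers per node today:", current_workers)
print("workers per node after halving:", halved_workers)
print("GB per worker after halving:", halved_ram_gb)
```

Under that assumption, halving the worker count doubles the memory available
to each remaining worker.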

Neat project. Has its place. Requires a different cluster configuration which
might limit its utility.

~~~
sitkack
You need to deploy your MapReduce cluster with Mesos, allowing both Spark and
MR to use the cluster at the same time.

------
krigi
I've just started using this at work. It's far easier to jump into than
MapReduce; an order of magnitude easier. Hopefully I'll be able to contribute
back to the project at some point.

------
markivraknatap
Good work. Love the title for your blog too :)

