
Spark: Open Source Superstar Rewrites Future of Big Data - MarlonPro
http://www.wired.com/wiredenterprise/2013/06/yahoo-amazon-amplab-spark/all/
======
phunge
Hadoop is a pile of bad code, a stagnant codebase, crusty APIs and a thick
surrounding layer of hype which obscures what it's really like to use. Spark
might be better or faster, but mostly what you need to beat Hadoop is to make
something practical, which lets developers be expressive rather than wrestle
with overdesigned nonsense.

I know this because it's my full-time job to actually get stuff done inside
Hadoop.

Spark may be a great system, but this article doesn't do much to settle the
issue. When you read fluff like "sweeping software platform", "famously
founded the Hadoop project", "great open source success stories" and machine
learning described as "crunching and re-crunching the same data -- in what's
called a logistic regression", it's time to move on.

~~~
otoburb
>>Hadoop is a pile of bad code, a stagnant codebase, crusty APIs and a thick
surrounding layer of hype which obscures what it's really like to use.

This is where marketing and branding, not technical merit, become the primary factors influencing adoption. Hadoop gathered so much momentum and hype as part of the Big Data buzz of the past few years that it's only now beginning to percolate through to telecommunication carriers and other larger/slower-moving enterprises.*

* I work primarily with wireless carriers; can't say much about broadband, although I'd hazard a guess and say that the majority are only now allocating experimental budgets to see how Hadoop can help them manage their Big Data.

------
izendejas
When I first learned about Spark, I knew this team would go on to build great things, and they've gone beyond my expectations, all while being very friendly and supportive of the community.

Among the things I love most about Spark and the ecosystem:

* repl -- so great for running short little experiments or even full-blown jobs. It saves you the time of recompiling small changes and lets you really get to know your data quickly.

* caching -- in-memory processing opens up many possibilities beyond iterative machine-learning jobs. Quantifind, for example, demoed a system that lets them run ad-hoc Shark queries on the fly across GBs of data (think OLAP, a bit) in seconds or less.

* scala -- makes for very succinct code using closures and built-in operations; check out some examples here: [http://spark-project.org/examples/](http://spark-project.org/examples/), or the word-count sketch just below.
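
For a taste of that succinctness: the classic word count, roughly as you'd type it into the Spark shell (a minimal sketch; sc is the SparkContext the shell predefines, and the input path is just a placeholder):

    val lines = sc.textFile("hdfs://...")                          // placeholder input path
    val words = lines.flatMap(line => line.split(" "))             // one record per word
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)   // sum counts per word
    counts.take(10).foreach(println)                               // peek at a few results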

And some of the upcoming projects are also very cool. Tachyon, for example, will enable users to share data through a very robust in-memory file system. A teammate and I could have used it recently: we were simultaneously running different analyses against the same data, so we had to cache duplicate copies on two clusters.

------
subprotocol
Spark is a wonderful project; I blogged about it just the other day: [http://subprotocol.com/2013/06/17/spark-darling-of-big-data....](http://subprotocol.com/2013/06/17/spark-darling-of-big-data.html)

Spark makes doing MR easy. I've used other frameworks on Hadoop MR, but nothing compares with the ease with which you can express computations in it. And it does both batch and real-time/streaming. It is a very well-thought-out project.

~~~
pkolaczk
Not only is it easy to use, but the source code is a real pleasure to read, in contrast to Hadoop's mess.

------
justinsb
Can someone explain why the in-memory caching is such a big win? Does Hadoop
MapReduce not do caching as well? I'd expect at least filesystem caching when
the computation is running on the same machine as the data block...

~~~
rxin
Spark goes well beyond just in-memory caching. It features a more advanced scheduler and a higher-level programming abstraction.

The programming abstraction treats all data as collections (RDDs in Spark terminology) and lets programmers apply bulk transformations to those collections. Operations you can apply include the traditional map and reduce, the relational filter, join, and outerJoin, and more advanced ones like sample. This abstraction makes it much easier to write distributed programs. As the Wired article mentioned, a distributed program written in Spark often looks identical to a single-node program. This substantially reduces the amount of code one needs to write, and the best part is that the code really expresses the algorithm (rather than being cluttered with JobConf setup).
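
A minimal sketch of what that feels like, typed into the Spark shell (where sc is predefined; the data is inlined purely for illustration):

    // two small keyed collections, standing in for data loaded from files
    val clicks = sc.parallelize(Seq(("alice", 3), ("bob", 1), ("carol", 7)))
    val countries = sc.parallelize(Seq(("alice", "US"), ("bob", "DE")))

    // bulk transformations read like operations on local Scala collections
    val frequent = clicks.filter { case (_, n) => n > 2 }   // relational-style filter
    val joined = frequent.join(countries)                   // join by key
    joined.collect().foreach(println)   // prints (alice,(3,US)); carol has no match, so the inner join drops her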

And the scheduler and the engine itself are aware of the general DAG of
operators, so they can schedule and run those operators better. For example,
if you have multiple maps, the execution gets pipelined; if you are joining
two collections that are partitioned the same way, the execution avoids an
expensive shuffle step.
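
For instance, a rough sketch of the co-partitioned join case (again in the shell; HashPartitioner's package has moved between Spark versions, so treat the path below as approximate):

    // hash-partition both datasets the same way up front, then cache them
    val part = new spark.HashPartitioner(8)   // package name circa Spark 0.7
    val a = sc.parallelize(1 to 1000).map(i => (i % 10, i)).partitionBy(part).cache()
    val b = sc.parallelize(1 to 1000).map(i => (i % 10, i * i)).partitionBy(part).cache()

    // both sides now share the same partitioner, so this join avoids a shuffle
    println(a.join(b).count())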

There are many other benefits too. I'd encourage you to give it a try. Thanks!

Disclaimer: I am on the Spark team at UC Berkeley.

~~~
HCIdivision17
Does Spark address the same problems as Storm?

It doesn't look like there's been any direct comparison of the two, though it
looks like there's overlap. (I've wanted to start a streaming data processing
project, and this looks like it would be good to consider for it.)

~~~
krcz
There are some nice-looking Spark vs Storm graphs in their streaming presentation slides: [http://spark-project.org/talks/strata_spark_streaming.pdf](http://spark-project.org/talks/strata_spark_streaming.pdf). Makes me wonder how biased these might be.

------
qznc
Spark project itself: [http://spark-project.org/](http://spark-project.org/)

------
sandGorgon
I'm not sure what is meant by supporting Scala and Python.

* Here is an example of a simple job written using Scala for hadoop (and uses Mahout libraries) - [https://github.com/sandys/distributed-scala-mahout/tree/wiki...](https://github.com/sandys/distributed-scala-mahout/tree/wiki-1)

* you can embed Pig inside Jython

* you can write UDFs using JRuby or Jython.

* I didn't try to figure out how, but I'm pretty sure you can build a standalone job jar using JRuby and Warbler

There might be a different way of hooking into Spark through Scala or Python, but I'm pretty sure that fundamental _support_ is not the big advantage here.

Pig has a repl - but as I quickly realized from playing around, you end up mucking around with classpath problems (and questions of having your jars in the distributed cache) once you attempt to build UDFs involving a few different libraries.

Plus, I haven't used the Cascalog/Clojure ecosystem - which is as functional as you can get - and so can't comment on it.

------
pvnick
This sounds amazing. I've done iterative jobs in Hadoop before - it's very hacky, and I generally just have it launch job after job after job until the result converges to where I want it. I'll definitely try this out soon.
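
From what I've read, in Spark that pattern collapses into an ordinary loop over a cached dataset -- something like this toy sketch (assuming the Spark shell's predefined sc, with each loop pass standing in for one of those launched jobs):

    val xs = sc.parallelize(1 to 1000).map(_.toDouble).cache()   // loaded and cached once
    var guess = 0.0
    for (i <- 1 to 20) {
      // each pass is a full distributed job, but nothing is re-read from disk
      val err = xs.map(x => x - guess).reduce(_ + _) / xs.count()
      guess += 0.5 * err   // damped update; converges toward the mean
    }
    println(guess)         // approaches 500.5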

Then again, while I absolutely love doing work with big data, I've been having
a bit of an "existential crisis" since the NSA leaks :(

~~~
dualogy
> Then again, while I absolutely love doing work with big data, I've been
> having a bit of an "existential crisis" since the NSA leaks

Book recommendation: "Who owns the future" by Jaron Lanier.

------
fatjokes
I've met Matei in passing through programming contests (where he is also a
star) but even in those brief moments, his brilliance is pretty apparent. Good
on him for the recognition -- he deserves it.

------
DiabloD3
I don't understand why they wrote this in Java instead of a language more
suited to the task, such as Erlang.

Can someone explain this to me?

~~~
penland
According to GitHub, 85% of this is in Scala.

It's written on the JVM for the simple reason that if they had written it in Go or Erlang, no enterprise would adopt it: there isn't a CTO at a non-tech Fortune 500 company who has ever heard of Erlang or Go, or who would know the first thing about trying to hire developers for them. Remember, MapReduce jobs are (typically) written in the same language as the MapReduce code itself.

~~~
shin_lao
Why didn't they go native?

~~~
pkolaczk
Why would they?

~~~
shin_lao
I thought they cared about performance...

------
mikegagnon
The article discusses the challenge of supplanting entrenched software such as Hadoop. I'm actually a bit more optimistic about the ability to swap Hadoop out for technologies like Spark.

At Twitter we don't program using Hadoop directly; we mostly use either
Scalding or Pig, languages that compile down to Hadoop code.
[https://dev.twitter.com/blog/scalding](https://dev.twitter.com/blog/scalding)
[http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-n...](http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009)

I believe this is how many other companies use Hadoop as well.

The benefit here is that it's possible to write new backends for Pig and
Scalding that compile down to Spark or anything else. And then you have
backwards compatibility with all your old big-data code.
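
For reference, the canonical Scalding word count looks roughly like this (adapted from the example in the Scalding README):

    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))                                   // read lines of text
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        .groupBy('word) { _.size }                              // count occurrences per word
        .write(Tsv(args("output")))                             // write tab-separated output
    }

Nothing in it names Hadoop directly, which is exactly why a Spark backend could slot in underneath.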

~~~
shazzdeeds
One of the Cloudera devs told me 80% of all Hadoop users run Hive. This suggests most devs secretly want to keep using SQL, but want more scalable relational solutions. That's why Cloudera is backing Impala, and why Facebook is about to open source Presto.

I worked on an entire team of developers where I was the only one who understood the raw Java MapReduce API. Almost everyone else on my team got by with learning HiveQL and a very minimal understanding of the MapReduce design flow.

Your belief is absolutely correct.

------
shin_lao
I think they focus too much on algorithms and not enough on the technical implementation, which is one of the major reasons Hadoop is so slow.
