
Apache Spark: The Next Big Data Thing? - thinkberg
http://blog.mikiobraun.de/2014/01/apache-spark.html
======
ajtulloch
MLlib [1] is definitely worth checking out as a Spark library for common ML
algorithms - it's crazy how little code is needed to implement efficient,
distributed inference algorithms.

Spark's primitives (especially the RDD abstraction) make it feel like a DSL
for distributed ML algorithms - for example, check out the implementation of
distributed alternating least squares matrix factorization at [2] - ~100 lines
of Scala code once stripped of the Java interoperability boilerplate.
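
As a flavor of that DSL feel, here is a minimal sketch - toy code, not
MLlib's implementation - of a single k-means iteration expressed purely in
RDD primitives (map, reduceByKey, collectAsMap):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair-RDD functions like reduceByKey

    object KMeansStep {
      type Vec = Array[Double]

      def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
      def dist2(a: Vec, b: Vec): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
      def closest(p: Vec, cs: Array[Vec]): Int = cs.indices.minBy(i => dist2(p, cs(i)))

      def main(args: Array[String]) {
        val sc = new SparkContext("local[4]", "KMeansStep")
        val points = sc.parallelize(Seq(
          Array(0.0, 0.0), Array(0.1, 0.2), Array(9.0, 9.0), Array(9.1, 8.8)
        )).cache() // reused every iteration, so keep it in memory
        var centroids = Array(Array(0.0, 0.0), Array(5.0, 5.0))

        // One Lloyd iteration: assign each point to its nearest centroid,
        // then average each cluster - a single map + reduceByKey.
        val cs = centroids
        val updated = points
          .map(p => (closest(p, cs), (p, 1)))
          .reduceByKey { case ((p1, n1), (p2, n2)) => (add(p1, p2), n1 + n2) }
          .mapValues { case (sum, n) => sum.map(_ / n) }
          .collectAsMap()

        centroids = centroids.indices.map(i => updated.getOrElse(i, centroids(i))).toArray
        centroids.foreach(c => println(c.mkString("(", ", ", ")")))
      }
    }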

Over the weekend, I took a crack at implementing some distributed ML
algorithms using the Alternating Direction Method of Multipliers (ADMM) in
Spark, following [3]. In a few hours and ~400 lines of Scala code, it was
possible to implement distributed versions of L^1-regularized logistic
regression, ridge regression, and SVMs [4].
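
For the curious, the overall shape of consensus ADMM on Spark looks roughly
like the sketch below. This is not the code in [4]; the per-block objective
here is a toy quadratic, f_i(x) = ||x - a_i||^2 / 2, chosen only because its
proximal step is a one-liner - an L^1 logistic/ridge/SVM block solver would
slot into the x-update instead:

    import org.apache.spark.SparkContext

    object ConsensusADMM {
      type Vec = Array[Double]
      def plus(a: Vec, b: Vec): Vec  = a.zip(b).map { case (x, y) => x + y }
      def minus(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }
      def scale(a: Vec, s: Double): Vec = a.map(_ * s)

      def main(args: Array[String]) {
        val sc = new SparkContext("local[4]", "ConsensusADMM")
        val rho = 1.0
        // state: one (a_i, x_i, u_i) triple per data block
        var state = sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 0.0), Array(2.0, 4.0)))
          .map(a => (a, Array(0.0, 0.0), Array(0.0, 0.0)))
        var z = Array(0.0, 0.0) // consensus variable, lives on the driver

        for (iter <- 1 to 50) {
          val zb = z
          // x-update: each block's closed-form proximal step, in parallel
          state = state.map { case (a, _, u) =>
            (a, scale(plus(a, scale(minus(zb, u), rho)), 1.0 / (1.0 + rho)), u)
          }.cache()
          // z-update: average x_i + u_i, collected to the driver
          val (sum, n) = state
            .map { case (_, x, u) => (plus(x, u), 1) }
            .reduce { case ((s1, n1), (s2, n2)) => (plus(s1, s2), n1 + n2) }
          z = scale(sum, 1.0 / n)
          val zb2 = z
          // u-update: scaled dual ascent, again in parallel
          state = state.map { case (a, x, u) => (a, x, plus(u, minus(x, zb2))) }
        }
        println(z.mkString(", ")) // converges to the mean of the a_i
      }
    }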

It's a very impressive framework, and very much empowering.

[1]: [http://spark.incubator.apache.org/docs/latest/mllib-guide.html](http://spark.incubator.apache.org/docs/latest/mllib-guide.html)

[2]: [https://github.com/apache/incubator-spark/blob/fdaabdc67387524ffb84354f87985f48bd31cf60/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala](https://github.com/apache/incubator-spark/blob/fdaabdc67387524ffb84354f87985f48bd31cf60/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala)

[3]: [http://www.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf](http://www.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf)

[4]: [https://github.com/ajtulloch/admmlrspark/tree/master/src/main/scala](https://github.com/ajtulloch/admmlrspark/tree/master/src/main/scala)

------
izendejas
I believe the answer to that question is yes. I've raved about Spark many
times on HN before. We're using it and absolutely love it.

Not sure why the author thinks there's no built-in support for iterations.
The support is there, in native Scala/Java, though you might have to collect
to the driver/master to check for convergence. Their simple logistic
regression example doesn't check for convergence, but hard-codes the number
of iterations:
[http://spark.incubator.apache.org/examples.html](http://spark.incubator.apache.org/examples.html)
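
For example, a minimal sketch (made-up names, not the linked example's exact
code) of adding a driver-side convergence check to iterative logistic
regression - the gradient is summed across the cluster with reduce(), and the
stopping test runs on the driver:

    import org.apache.spark.SparkContext

    object ConvergentLR {
      case class Point(x: Array[Double], y: Double) // y is -1 or +1

      def dot(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (u, v) => u * v }.sum

      def main(args: Array[String]) {
        val sc = new SparkContext("local[4]", "ConvergentLR")
        val points = sc.parallelize(Seq(
          Point(Array(1.0, 2.0), 1.0), Point(Array(-1.0, -2.5), -1.0)
        )).cache() // iterated over repeatedly, so cache it

        var w = Array(0.0, 0.0)
        var delta = Double.MaxValue
        var iter = 0
        while (delta > 1e-6 && iter < 1000) { // tolerance plus an iteration cap
          val wb = w
          // distributed gradient of the logistic loss
          val grad = points.map { p =>
            val s = (1.0 / (1.0 + math.exp(-p.y * dot(wb, p.x))) - 1.0) * p.y
            p.x.map(_ * s)
          }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
          val w2 = wb.zip(grad).map { case (wi, gi) => wi - 0.1 * gi }
          // the convergence check happens here, on the driver
          delta = math.sqrt(w2.zip(wb).map { case (u, v) => (u - v) * (u - v) }.sum)
          w = w2
          iter += 1
        }
        println(w.mkString(", "))
      }
    }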

Some quirks I've learned: when doing groupBys or other join operations, it
helps to specify the number of partitions -- actually this is true in
general, but more so for join-based operations -- otherwise you can run into
memory/GC issues. The Hadoop equivalent is, of course, setting the number of
reducers to something that ensures you won't run out of heap space.
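
Concretely (a small sketch; the shuffle operations on pair RDDs all take an
optional partition count):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair-RDD functions

    val sc = new SparkContext("local[4]", "PartitionCounts")
    val left  = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))
    val right = sc.parallelize(1 to 1000000).map(i => (i % 1000, i * 2))

    // The second argument controls how many reduce-side partitions (and
    // thus tasks) are created; more, smaller partitions keep the per-task
    // hash tables small enough to dodge heap and GC trouble.
    val grouped = left.groupByKey(256)
    val joined  = left.join(right, 256)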

Because Spark just serializes the closures that transform your data and
re-applies them lazily, if you don't cache (at least to disk, using persist),
you'll waste cycles recomputing an RDD every time you iterate over or re-use
it.
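
In code, that's just persist with a disk-backed storage level (hypothetical
input path, for illustration):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext("local[4]", "PersistDemo")
    val parsed = sc.textFile("hdfs:///logs/2014-01-01").map(_.split("\t"))

    // MEMORY_AND_DISK keeps what fits in memory and spills the rest to
    // disk, so reuse re-reads the cache instead of recomputing the lineage.
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    val errors = parsed.filter(_(0) == "ERROR").count() // first pass fills the cache
    val warns  = parsed.filter(_(0) == "WARN").count()  // second pass hits it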

Beyond that, as an experienced Scala developer, I've found that Spark feels
incredibly natural. And being able to run things in the REPL cannot be
appreciated enough.

------
jandrewrogers
Spark cleans up some abstractions found in the Hadoop ecosystem but I would
hesitate to call it the "next big thing" because it doesn't really address any
of the core weaknesses of Hadoop. In Big Data, there are three touchstone
applications that the current generation of platforms largely does poorly:
real-time, geospatial, and graph. Spark does not really address any of these. It might
be more accurate to say that Spark is the "next big thing" in the _Hadoop_
universe but still retains many of the same limitations on expressiveness,
thereby creating opportunities for a more general Next Big Thing in Big Data.

Specifically, "faster batch + in-memory" is basically a simple patch on the
batch mode problem but does not really address continuous flow real-time
models. At large scales it is quite difficult to get robustly efficient
behavior out of a parallel system that is batching things on 5 second
intervals if the data flow is anything but trivial.
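
For reference, this is the micro-batch model being critiqued - Spark
Streaming discretizes the stream into RDDs on a fixed interval (a sketch with
a hypothetical socket source):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair-DStream functions

    object MicroBatchDemo {
      def main(args: Array[String]) {
        // Everything below runs as a small batch job once every 5 seconds;
        // the "stream" is really a sequence of RDDs cut on that interval.
        val ssc = new StreamingContext("local[4]", "MicroBatchDemo", Seconds(5))
        val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }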

For geospatial and graph analysis, Spark appears to retain the same limitation
of Hadoop in that it cannot deal with data models and operations without an _a
priori_ optimal partitioning function. Static hash and range partitioning
won't cut it, particularly if the streaming data sources are actually
real-time. The ability to generate uniform partitioning of complex data models with
inherently unpredictable data distributions is critical to parallelizing some
important analysis types but there is no obvious support for such mechanisms
in Spark.
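
To make "static partitioning" concrete: Spark's partitioners are pure, a
priori functions of the key, fixed before any data flows (a sketch; the types
are made up):

    import org.apache.spark.{HashPartitioner, Partitioner, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD functions

    val sc = new SparkContext("local[4]", "StaticPartitioning")
    val byKey = sc.parallelize(1 to 100000).map(i => (i % 1000, i))

    // Built-in static hash partitioning: key -> partition is fixed up front.
    val hashed = byKey.partitionBy(new HashPartitioner(64))

    // A custom Partitioner is still a static function of the key alone;
    // nothing here can adapt to skew that only shows up at runtime.
    class CellPartitioner(cells: Int) extends Partitioner {
      def numPartitions: Int = cells
      def getPartition(key: Any): Int = {
        val m = key.hashCode % cells
        if (m < 0) m + cells else m // keep the partition index non-negative
      }
    }
    val celled = byKey.partitionBy(new CellPartitioner(64))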

One could look at Spark as Hadoop done right, more or less.

~~~
espeed
GraphX ([http://amplab.github.io/graphx/](http://amplab.github.io/graphx/)) is
essentially GraphLab2 ([http://graphlab.org/](http://graphlab.org/)) built on
Spark. Joey Gonzalez, a co-creator of GraphLab, is also the GraphX lead.

Here are the papers for GraphLab and GraphX...

GraphLab:
[http://graphlab.org/home/publications/](http://graphlab.org/home/publications/)

GraphX: A Resilient Distributed Graph System on Spark
([https://amplab.cs.berkeley.edu/publication/graphx-grades/](https://amplab.cs.berkeley.edu/publication/graphx-grades/))

See also "Introduction to GraphX - Presented by Joseph Gonzalez, Reynold Xin -
UC Berkeley AmpLab 2013"
([http://www.youtube.com/watch?v=mKEn9C5bRck](http://www.youtube.com/watch?v=mKEn9C5bRck))

------
benjiweber
Spark does look very promising. It is great being able to experiment with jobs
on distributed collections from a Scala REPL. Consequently it is very quick to
get started.

My first impressions are that there seems to be a fair amount of abstraction
leakage. Some things that compile and look valid fail at runtime - e.g.
referring to other RDDs from within a filter predicate or map function. More
complex jobs cause the nodes to fail, and it gets into infinite loops of
restarting the nodes, replaying the job, and then dying again.
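
The RDD-inside-a-closure case, for instance, type-checks but blows up at
runtime, because the closure is serialized to executors where the inner RDD
handle is meaningless (a minimal sketch):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local[4]", "NestedRDD")
    val small: RDD[Int] = sc.parallelize(1 to 100)
    val big: RDD[Int]   = sc.parallelize(1 to 100000)

    // Compiles, but fails at runtime: RDD operations can't be invoked
    // from inside another RDD's closure.
    // val bad = big.filter(x => small.filter(_ == x).count() > 0)

    // The working pattern: bring the small side back to the driver first
    // (or broadcast it), then close over the plain local collection.
    val lookup = small.collect().toSet
    val good = big.filter(lookup.contains)
    println(good.count())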

I hope that once I have a better understanding of what is going on underneath,
these failures will make more sense.

~~~
jey

        More complex jobs cause the nodes to fail and it gets
        into infinite loops of restarting the nodes, replaying
        the job, and then dying again.
    

This is probably a bug in your code. Debugging these cluster applications does
take some getting used to. You'll want to look at the stderr output of the
failing executor, and you'll probably see that it's dying due to some kind of
exception. You can do this by visiting port 8080 of the master node over HTTP,
i.e. [http://mymaster:8080](http://mymaster:8080). Feel free to email the
Spark users list if you have any questions:
[https://spark.incubator.apache.org/mailing-lists.html](https://spark.incubator.apache.org/mailing-lists.html)

It's true that you do have to understand the programming model and some
details of how it's implemented to use Spark effectively. However, any
abstraction that was "pure" and perfectly non-leaky would necessarily
sacrifice some performance and transparency to achieve that goal. Spark aims
to be both high-level _and_ high-performance.

Full disclosure: I'm on the Spark team at the UC Berkeley AMPLab.

------
gizzlon
I would like to play with Spark - has anyone tried any of the Docker images?
There are too many =/

~~~
tszming
The best way to try out Spark & related tools is to follow the two-day
exercises from AMP Camp 3; they have a script to launch the whole cluster
(Berkeley Data Analytics Stack) on EC2 very easily, so you can get your feet
wet without too much effort.

[1] [http://ampcamp.berkeley.edu/big-data-mini-course/](http://ampcamp.berkeley.edu/big-data-mini-course/)

[2] [http://ampcamp.berkeley.edu/big-data-mini-course/launching-a-bdas-cluster-on-ec2.html](http://ampcamp.berkeley.edu/big-data-mini-course/launching-a-bdas-cluster-on-ec2.html)

~~~
gizzlon
two days? :O

~~~
etrain
I helped write the AMPCamp exercises (and parts of Spark) - the exercises were
done in the context of a two-day seminar held at Berkeley, where the days were
mostly filled with talks/training about the stack. The actual exercises were
designed to take a couple of hours each day, with plenty of time for
questions/etc. You can probably get through them in an evening, and I've seen
people do it in less!

------
gclaramunt
I love Spark - it's about time to start the NoHadoop movement!

~~~
alanctgardner2
In the next year I can imagine most people deploying Spark will be doing it on
Hadoop, since Cloudera 5 will support Spark. It's a natural fit: most people
don't hate HDFS, but their use case doesn't naturally fit the MR programming
model.

~~~
gclaramunt
yeah, I mostly complain about trying to fit everything to the Map/Reduce model

~~~
TallGuyShort
The "Hadoop ecosystem" is getting less and less about MapReduce, and more
about other execution frameworks that can share HDFS with it.

~~~
alanctgardner2
You don't happen to work at Cloudera, do you? I noticed you have some
submissions about Impala and Oracle being evil, which seems to be a pretty
common view among the ex-Oracle DBAs there.

~~~
TallGuyShort
I do happen to work at Cloudera (hence the Impala submissions), although I'm
neither an ex-Oracle DBA nor a huge believer that they're evil. I really don't
have a lot of first-hand experience with Oracle as a company - which, as
you'll see, is why my submission was actually a question about the community's
perception and why that's a common view.

~~~
bcbrown
Hey, I've recently begun the interview process at Cloudera, would you mind
sending me an email? My address is in my profile, and I'd love to ask you a
couple questions.

------
kordless
It is no coincidence that we're seeing the theme of decentralization cropping
up everywhere. Consider an investment approach based on the assumption that
decentralization is getting ready to eat the world.

