

Apache Spark 1.0.0 - steveb
http://spark.apache.org/releases/spark-release-1-0-0.html

======
MoOmer
For new entrants, here's an email I sent out to some colleagues of mine just
getting into ML. I'm wrapping up a project that's using Mahout, and am getting
into Spark & MLlib now. I've regurgitated this on reddit already.

I've been following Apache Spark [0], a new-ish Apache project created at UC
Berkeley to replace Hadoop MapReduce [1], for about a month now, and I finally
got around to spending some time with it last night and early this morning.

Added to the Spark mix about a year ago was a strong machine learning library
(MLlib) [2], similar to Mahout [3], that promises much better performance
(comparable to or better than Matlab [4] and Vowpal Wabbit [5]).
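
For a sense of what MLlib code looks like, here's a minimal sketch, roughly in the style of the 1.0 MLlib docs; the input path, parameters, and app name are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Local mode just for trying it out; each input line is assumed to be
// space-separated doubles, e.g. "0.1 0.2 0.3".
val sc = new SparkContext(
  new SparkConf().setAppName("kmeans-sketch").setMaster("local[2]"))

val points = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 3, 20) // k = 3 clusters, 20 iterations
model.clusterCenters.foreach(println)

sc.stop()
```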

MLlib is a lower-level library, which offers a lot of control/power for
developers. However, Berkeley's AMPLab has also created a higher-level
abstraction layer for end users called MLI [6]. It's still being actively
developed, and although updates are in the works, they haven't been made
available in the public repository for a while [7].

Check out an introduction to MLlib on YouTube here:
[https://www.youtube.com/watch?v=IxDnF_X4M-8](https://www.youtube.com/watch?v=IxDnF_X4M-8)

Getting up to speed with Spark itself is really pain-free compared to tools
like Mahout. There's a quick-start guide for Scala [8], a getting-started
guide for Spark [9], and lots of other learning/community resources available
for Spark [10] [11].
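
As a taste of how pain-free it is, the interactive "hello world" from the quick-start guide [9] is roughly this (run from a Spark checkout, so README.md exists):

```scala
// In the spark-shell, `sc` (the SparkContext) is already provided.
val readme = sc.textFile("README.md")

readme.count()                              // number of lines in the file
readme.filter(_.contains("Spark")).count()  // lines mentioning Spark
```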

[0] [http://spark.apache.org/](http://spark.apache.org/)

[1] [http://hadoop.apache.org/](http://hadoop.apache.org/)

[2] [http://spark.apache.org/mllib/](http://spark.apache.org/mllib/)

[3] [https://mahout.apache.org/](https://mahout.apache.org/)

[4] [http://www.mathworks.com/products/matlab/](http://www.mathworks.com/products/matlab/)

[5] [https://github.com/JohnLangford/vowpal_wabbit/wiki](https://github.com/JohnLangford/vowpal_wabbit/wiki)

[6] [http://www.mlbase.org/](http://www.mlbase.org/)

[7] [http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-td3610.html](http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-td3610.html)

[8] [http://www.artima.com/scalazine/articles/steps.html](http://www.artima.com/scalazine/articles/steps.html)

[9] [http://spark.apache.org/docs/latest/quick-start.html](http://spark.apache.org/docs/latest/quick-start.html)

[10] [http://ampcamp.berkeley.edu/4/exercises/](http://ampcamp.berkeley.edu/4/exercises/)

[11] [https://spark.apache.org/community.html](https://spark.apache.org/community.html)

------
krallin
Note that Spark 1.0.0 makes it possible to trivially submit Spark jobs to an
existing Hadoop cluster.

It leverages HDFS to distribute archives (e.g. your app JAR) and store results
/ state / logs, and YARN to schedule itself and acquire compute resources.

It's pretty amazing to see how you use Spark's API to write functional
applications that are then distributed across multiple executors (e.g. when
you use Spark's "filter" or "map" operations, the work potentially gets
partitioned and executed on totally different nodes).
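
For example, an ordinary-looking pipeline like this (paths and field layout invented for illustration) gets broken into tasks that Spark schedules across whatever executors the cluster provides:

```scala
// `sc` is the SparkContext (provided automatically in the spark-shell).
val logs = sc.textFile("hdfs:///logs/2014-05-30")

val errorHosts = logs
  .filter(_.contains("ERROR"))        // may run on the nodes holding those HDFS blocks
  .map(line => line.split('\t')(0))   // runs wherever the filtered partitions live

errorHosts.saveAsTextFile("hdfs:///logs/error-hosts-2014-05-30")
```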

Great tool — exciting to see it reach 1.0.0!

~~~
metronius
Do you mean SIMR or something else?

------
steveb
I gave a 30-minute overview of Spark yesterday at StampedeCon. Spark is
generating a lot of excitement in the big data community:

[https://speakerdeck.com/stevendborrelli/introduction-to-apache-spark](https://speakerdeck.com/stevendborrelli/introduction-to-apache-spark)

------
eranation
I wonder if anyone with experience with Spark can comment / rebut this post:
[http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html](http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html)

~~~
subprotocol
I use spark a lot and my experience has been quite the opposite. The queries
I run against spark cover billions of events, and results come back sub-second.

I could only speculate as to what this user's issues were. One difference
between hadoop and spark is that spark is more sensitive to partitioning: you
sometimes need to tell it how many tasks to use. In practice it is no big deal
at all.

Perhaps the user was running into this: the data for a task in spark runs all
in memory, whereas hadoop will load and spill to disk within a task. So if you
give a single hadoop reducer 1TB of data, it will complete after a very long
time. If you did the same in spark, you would need 1TB of memory on the
executor. I wouldn't give an executor/JVM anything over 10GB, so if you have
lots of memory, just be sure to balance it across cores and executors.
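
A minimal sketch of what that balancing can look like (the memory size, partition count, and path are made-up values for illustration, not recommendations):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Keep each executor JVM modest and spread the work over many partitions,
// rather than piling a huge amount of data onto a few tasks.
val conf = new SparkConf()
  .setAppName("balanced-job")
  .set("spark.executor.memory", "8g")       // per-executor heap
  .set("spark.default.parallelism", "400")  // default task count for shuffles

val sc = new SparkContext(conf)

// Asking for more partitions up front also keeps per-task memory small.
val events = sc.textFile("hdfs:///data/events", 400)
```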

I have seen spark use up all the inodes on systems before. A job with 1000 map
and 1000 reduce tasks would create 1M spill files on disk (one per map/reduce
task pair). However, that was on an earlier version of spark and I was using
ext3; I think this has since been improved.

For me spark runs circles around hadoop.

~~~
iskander
>The queries I run against spark cover billions of events, and results come
back sub-second.

This is interesting; I haven't gotten Spark to do anything at all in less than
a second. How big is this dataset (what does each event consist of)? How is
the data stored? How many machines / cores are you running across? What sort
of queries are you running?

>I could only speculate as to what this user's issues were.

I'm the author of the above post and unfortunately I can also "only speculate"
about what my issues were. Maybe Spark doesn't like 100x growth in the size of
an RDD using flatMap? Maybe large-scale joins don't work well? Who knows. The
problem, however, definitely doesn't seem to be anything covered by the tuning
guide(s).

~~~
subprotocol
> How big is this dataset (what does each event consist of)?

Standard clickstream data, maybe 50-ish parameters per event.

> What sort of queries are you running? How is the data stored?

Depends on the use-case. For sub-second ad-hoc queries we go against bitmap
indexes. For other queries we use RDD.cache() after a group/cogroup and answer
queries directly from that. For yet others we hit ORC files. Spark is very
memory-sensitive compared to hadoop, so using a columnar store and only
pulling out the data that you absolutely need goes a very long way. Minimizing
cross-communication and shuffling is key to achieving sub-second latency; it's
impossible if you're waiting for TBs of data to shuffle around =)
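
A rough sketch of the cache-after-cogroup pattern (field layout, paths, and the example query are all made up for illustration):

```scala
import org.apache.spark.SparkContext._ // pair-RDD operations like cogroup

// Assumed input: tab-separated (userId, payload) pairs; `sc` comes from the shell/driver.
val events   = sc.textFile("hdfs:///clickstream/events")
                 .map { line => val f = line.split('\t'); (f(0), f(1)) }
val profiles = sc.textFile("hdfs:///clickstream/profiles")
                 .map { line => val f = line.split('\t'); (f(0), f(1)) }

// Cogroup once, keep the result in memory, then answer many queries from it.
val byUser = events.cogroup(profiles).cache()

// Example query served straight from the cached RDD.
val activeUsers = byUser.filter { case (_, (evs, _)) => evs.size > 10 }.count()
```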

> How many machines / cores are running across?

Depends on the use case. Clusters are 10-30 machines; some we run virtualized
on OpenStack. We will grow our 30-node cluster in the next 6 months.

> Maybe Spark doesn't like 100x growth in the size of an RDD using flatMap

You may actually just need to scale the number of partitions for that
particular stage by roughly the same factor. Also, when possible, use
mapPartitions; it is very memory-efficient compared to map/flatMap.
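
As a rough illustration of both suggestions (the 100x expansion and the input path are invented):

```scala
// `sc` comes from the shell/driver; one record per input line.
val smallRdd = sc.textFile("hdfs:///data/records")

// If a flatMap blows each record up ~100x, scale the partition count up by
// roughly the same factor first so each task still fits in memory.
val expanded = smallRdd
  .repartition(smallRdd.partitions.length * 100)
  .flatMap(line => (1 to 100).map(i => s"$line#$i")) // stand-in for a real 100x expansion

// mapPartitions handles a whole partition per call, so per-record setup
// (parsers, buffers, connections) can be done once per partition.
val lengths = expanded.mapPartitions(iter => iter.map(_.length))
```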

> Maybe large-scale joins don't work well

Keep in mind that whatever happens per task happens entirely in memory. For
large joins I created a "bloom join" implementation (not currently open source
=( ) that does this efficiently. It takes two passes over the data, but
minimizes what is shuffled.
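
That implementation isn't public, but the general shape of such a join in Spark might look roughly like this (a simplified sketch that broadcasts an exact key set as a stand-in for a real bloom filter; names and paths are illustrative):

```scala
import org.apache.spark.SparkContext._ // pair-RDD operations like keys/join

// Pass 1: collect the join keys from the (much smaller) side and broadcast
// them. A real bloom join would broadcast a compact bloom filter instead of
// an exact set.
val small  = sc.textFile("hdfs:///data/small").map { l => (l.split('\t')(0), l) }
val keySet = sc.broadcast(small.keys.collect().toSet)

// Pass 2: filter the big side down to rows that can possibly match *before*
// shuffling, so only candidate rows move across the network for the join.
val big      = sc.textFile("hdfs:///data/big").map { l => (l.split('\t')(0), l) }
val filtered = big.filter { case (k, _) => keySet.value.contains(k) }

val joined = filtered.join(small)
```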

~~~
iskander
> For sub-second adhoc queries we go against bitmap indexes.

Did you implement the index yourself? How many events survive the initial
filtering?

------
agibsonccc
Spark is an interesting technology, though from what I've heard it doesn't
actually have much traction in industry yet.

Is anyone here actually using it in production? I know it's blazing fast etc.,
and I like it as a MapReduce replacement. It has all the makings of a great
distributed system; I'm just still waiting to see a major deployment.

~~~
subprotocol
May be of relevance:
[https://cwiki.apache.org/confluence/display/SPARK/Powered+By...](https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark)

I don't know what you would count as a major deployment, but I've deployed a
30-node cluster on physical hardware for running sub-second real-time ad-hoc
queries. I've also run many smaller 10-20 node virtual clusters on OpenStack.
It is a rock-solid platform. Our hosted ops team loves it because it just
works.

The amazing thing about spark is how insanely expressive and hackable it is.
The best way I can describe it is this:

* Hadoop: you spend all of your time telling it how to do what you want (it is the assembly language of big data)

* Spark: you spend your time telling it what you want, and it just does it (see the word-count sketch below)
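
For instance, a complete word count reads like a description of the result you want (paths made up; in the spark-shell, `sc` and the pair-RDD implicits are already available):

```scala
val counts = sc.textFile("hdfs:///data/corpus")
  .flatMap(_.split("\\s+"))     // words
  .map(word => (word, 1))       // (word, 1) pairs
  .reduceByKey(_ + _)           // counts per word

counts.saveAsTextFile("hdfs:///data/word-counts")
```

The equivalent Hadoop MapReduce job needs a mapper class, a reducer class, and a driver just to express the same idea.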

~~~
agibsonccc
This does help, actually. And yes: it doesn't have to be a 1000-node cluster
or anything crazy. I've just talked to a lot of people at bigger companies and
they've all said it still falls over.

Great to hear success stories!

------
kovrik
Any active Clojure bindings?

clj-spark seems to be abandoned (last commit was a year ago)...

~~~
gphil
I'm curious about this too--clj-spark didn't work for me so I'm currently
prototyping a job using the Java bindings in Clojure. If I end up wrapping the
Java bindings in a useful way I would consider putting together some kind of
release if there's community interest in that.

~~~
terranstyler
Me too.

However, shouldn't there be a much more flexible approach where you just send
your functions to an execution server (just like an agent)? You might want to
define some keywords to refer to previously used functions or data.

Then again, such an agent would be pretty much a REPL, so you might just want
to ssh into a REPL that does load balancing and has sub-REPLs (on other
machines) that fail over.

Thoughts on that?

------
alexatkeplar
Great to see Spark hitting 1.0.0. You can actually run Spark on Elastic
MapReduce pretty easily - check out our tutorial project for how:
[https://github.com/snowplow/spark-example-project](https://github.com/snowplow/spark-example-project)

------
Nasiruddin
Great...new era of distributed computing

