
Announcing Apache Spark 1.4 - rxin
http://databricks.com/blog/2015/06/11/announcing-apache-spark-1-4.html
======
eranation
For anyone who wants to pick up Spark basics: Berkeley (Spark was developed at
Berkeley's AMPLab), in collaboration with Databricks (a commercial company
started by Spark's creators), just started a free MOOC on edX:
[https://www.edx.org/course/introduction-big-data-apache-
spar...](https://www.edx.org/course/introduction-big-data-apache-spark-uc-
berkeleyx-cs100-1x)

(If you're wondering what Spark is, in a very unofficial nutshell: it is a
computation / big data / analytics / machine learning / graph processing
engine that runs on top of Hadoop, usually performs much better, and has an
arguably much easier API in Python, Scala, Java, and now R.)

It has more than 5,000 students so far, and the professor seems to answer
every single question on Piazza (a popular student/teacher message board).

So far it looks really good. (It started a week ago, so you can still catch
up; the 2nd lab is due Friday 6/12 EOD, but you have a 3-day "grace"
period... and there is not too much to catch up on.)

I use Spark for work (Scala API) and still learned one or two new things.

It uses the PySpark API, so there is no need to learn Scala. All homework
labs are done in an IPython notebook. Very high quality so far, IMHO.

It is followed by a more advanced Spark course (Scalable Machine Learning),
also by Berkeley & Databricks:

[https://www.edx.org/course/scalable-machine-learning-uc-
berk...](https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-
cs190-1x)

(Not affiliated with edX, Berkeley, or Databricks; I just thought this was a
good place for a PSA to those interested.)

The academic paper that originated Spark, by Matei Zaharia (creator of
Spark), won him the ACM Doctoral Dissertation Award in 2014
([http://www.acm.org/press-room/news-releases/2015/dissertatio...](http://www.acm.org/press-room/news-
releases/2015/dissertation-award-14/))

Spark also set a new record in large-scale sorting, beating Hadoop's previous
mark by a wide margin:
[https://databricks.com/blog/2014/11/05/spark-officially-
sets...](https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-
record-in-large-scale-sorting.html)

* EDIT: typo in "Berkeley", thanks gboss for noticing :)

~~~
0xFFC
I would love to learn about Spark, but as someone who lives in a third-world
country I hate edX; I'm in love with Udacity and Coursera instead. Where I
live we don't get much monthly traffic, but we can download everything we
want between 1am and 6am, so there is no easy way to download a course from
edX and use it later. I wish it were on Udacity or Coursera. Is there a
torrent for the course material?

~~~
sidmitra
I'm doing the Spark course. edX has a download button on the videos, and you
can download PDF files for the lectures. For the rest, like the quizzes that
are embedded, I just take screenshots or save the page as a PDF for posterity.

Are you sure you can't download them? Maybe they've changed things recently.

~~~
0xFFC
Yes, I am aware of the download button, but consider that every course is ~50
distinct videos; given our downloading window, you'll agree that fetching
them one at a time is extremely painful. Why don't they just put up the whole
course material (at least the videos) the way Udacity does?

~~~
jm0
You can download the lectures using the edx-downloader:
[https://github.com/shk3/edx-downloader](https://github.com/shk3/edx-
downloader)

------
fleeno
As someone who doesn't know what Apache Spark is, this article reads like it
could have been randomly generated.

~~~
sixdimensional
Apache Spark is a general-purpose distributed data processing and caching
engine. It evolves the MapReduce concepts into more general "directed
acyclic graph" processing, which is very flexible for defining and executing
data processing work on a distributed cluster. It has some similarities to
PrestoDB, Apache Drill, and Apache Storm (although it is not quite the same
as any of them).

It also has some nice data mining libraries, a library for handling streaming
data, connectivity to external data sources, and a library for accessing
data stored in its generic "data frames" via SQL. "Data frames" are just an
abstraction over a dataset, but they are distributed and can be kept in
memory and/or persisted.

Personally, I like to think of it as an engine for data analysis/processing
and queries, but different in that it is not really a "database" as you would
traditionally consider one. It's almost as if you took the SQL data
processing engine out of your database and made it really flexible.

Edit: Also, all of Apache Spark's functionality is programmatically
accessible in Java, Scala, and Python, or through SQL via its Hive/Thrift
interface.
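
The "directed acyclic graph" idea above can be sketched in plain Python, with
no Spark required: each transformation merely records a step, and nothing
executes until an action forces evaluation. This is a conceptual toy of the
lazy-RDD model, not Spark's actual API; the `ToyRDD` class and its methods
are invented for illustration.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions execute."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps          # the "DAG": a chain of recorded transformations

    def map(self, f):                # lazy: records the step, runs nothing yet
        return ToyRDD(self._data, self._steps + (("map", f),))

    def filter(self, p):             # also lazy
        return ToyRDD(self._data, self._steps + (("filter", p),))

    def collect(self):               # action: the whole recorded chain runs now
        out = list(self._data)
        for kind, f in self._steps:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Because the chain is only data until `collect()` is called, a real engine can
inspect the whole graph and optimize or distribute it before running anything.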

------
chiachun
The release notes: [https://spark.apache.org/releases/spark-
release-1-4-0.html](https://spark.apache.org/releases/spark-
release-1-4-0.html)

Another major change is that Spark now supports Python 3.
[https://issues.apache.org/jira/browse/SPARK-4897](https://issues.apache.org/jira/browse/SPARK-4897)

~~~
choppaface
They've integrated Tungsten / native sorting into shuffle and observed some
decent speedups:

* [https://issues.apache.org/jira/browse/SPARK-7081](https://issues.apache.org/jira/browse/SPARK-7081)

* [https://github.com/apache/spark/pull/5868#issuecomment-10183...](https://github.com/apache/spark/pull/5868#issuecomment-101837095)

However, I guess reduceByKey (and friends) don't benefit yet.

Their SGD implementation still uses TreeAggregate (
[https://github.com/apache/spark/blob/e3e9c70384028cc0c322cce...](https://github.com/apache/spark/blob/e3e9c70384028cc0c322ccea14f19d3b6d6b39eb/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L189)
) so I wonder when they're planning to add some of the "Parameter Server"
stuff (e.g. perhaps butterfly mixing or Kylix
[http://www.cs.berkeley.edu/~jfc/papers/14/Kylix.pdf](http://www.cs.berkeley.edu/~jfc/papers/14/Kylix.pdf)
)

------
minimaxir
I'm excited about SparkR, even though R is shunned in the field of big data.
Between that and dplyr (which inspired the SparkR syntax) for data
manipulation and sanitization, it should be much easier to write sane,
reproducible code and visualizations for big data analysis. (The Python/Scala
tutorials for Spark gave me a headache.)

SparkR appears to have strong integration with RStudio, which is big news:
[http://blog.rstudio.org/2015/05/28/sparkr-preview-by-
vincent...](http://blog.rstudio.org/2015/05/28/sparkr-preview-by-vincent-
warmerdam/)

~~~
eranation
R on Spark is great, but the biggest issue in my view is R's runtime
licensing: isn't it GPL? Am I worried over nothing?

~~~
zmmmmm
I too have been mystified by R's licensing. I actually don't see how anyone
can ship a commercial product using R in its current form. At the very least
you're in a legal gray area; at worst you are involuntarily open-sourcing
your product. Not that there's anything wrong with open-sourcing a product,
but I think there's an enormous potential issue that could foul up a lot of
people down the track. The best discussion I have seen about this pretty much
ends in uninformed speculation. For now, my policy is "explore and prototype
in R, build the real system in something else". Fortunately, the flaws and
limitations of R as a language make this a sensible choice for a host of
other reasons as well.

------
DannoHung
Does anyone know if there's a guide to integrating Spark between a realtime,
write-only database and a historical database?

I've looked into using Spark Streaming, but I can't work out how you could
seamlessly transition data from a streaming batch to the historical db in a
reasonably tight time period.

I'd be willing to pay for training if it came to it, but I don't think I'm
using the right search terms.

~~~
nl
Do this: [http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-
ti...](http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-
sessionization-with-spark-streaming-and-apache-hadoop/)

Then at the end, instead of writing to HBase, you can write via JDBC to do
the insert.

~~~
threeseed
JDBC works great if you have a large RDD that you want to persist in one go.

But if you are writing row by row, you will need to implement your own
batching and connection pooling to get any decent performance.

~~~
gglanzani
You could always use mapPartitions to open one connection per partition.
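
The pattern looks roughly like this. Sketched here in plain Python rather
than on a live cluster: `write_partition` is the function you would pass to
`rdd.mapPartitions(...)` in PySpark, while `FakeConnection` and
`get_connection()` are hypothetical stand-ins for your JDBC driver and sink.

```python
class FakeConnection:
    """Stand-in for a DB connection that supports batched inserts."""
    sink = []                                  # shared "table" for the demo

    def executemany(self, rows):
        FakeConnection.sink.extend(rows)       # one round trip per batch

    def close(self):
        pass

def get_connection():                          # hypothetical: your JDBC/DB driver here
    return FakeConnection()

def write_partition(rows, batch_size=100):
    """Open ONE connection for the whole partition and insert in batches."""
    conn = get_connection()
    batch = []
    try:
        for row in rows:
            batch.append(row)
            if len(batch) >= batch_size:       # flush a full batch
                conn.executemany(batch)
                batch = []
        if batch:                              # flush the final partial batch
            conn.executemany(batch)
    finally:
        conn.close()
    return iter(())                            # mapPartitions must return an iterator

# Simulate two partitions; in Spark: rdd.mapPartitions(write_partition).count()
for partition in ([(i, i * i) for i in range(150)], [(i, -i) for i in range(50)]):
    list(write_partition(iter(partition)))

print(len(FakeConnection.sink))  # 200
```

The key point is that connection setup and per-row latency are paid once per
partition and once per batch, not once per row.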

------
krat0sprakhar
Somehow I always get Spark and Storm confused! Can someone explain the
difference between the two (use cases, etc.) as if explaining to a
five-year-old? Thanks!

~~~
nl
Storm = Streaming data processing, written in Clojure and previously used at
Twitter until they replaced it with Heron.

Spark = Streaming (technically micro-batch) and batch data processing, written
in Scala and used _very_ widely.
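
The micro-batch distinction can be illustrated without Spark: Storm hands
your code one record at a time, while Spark Streaming slices the incoming
stream into small batches and processes each one with the normal batch
engine. A toy sketch (count-based rather than time-based, for simplicity;
`micro_batches` is an invented helper, not a Spark API):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Slice a record stream into small batches, Spark-Streaming style."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:                # stream exhausted
            return
        yield batch                  # each batch goes to the batch engine

records = range(10)
print([sum(b) for b in micro_batches(records, 4)])  # [6, 22, 17]
```

Real Spark Streaming cuts batches by a time interval (e.g. every second)
instead of by count, but the consequence is the same: per-batch latency in
exchange for the throughput and fault-tolerance of the batch engine.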

------
lazzlazzlazz
Is support for user-defined aggregation functions (for DataFrames) slated
for 1.5?

------
Tepix
Too bad the website is so hard to read.

Time for that site to join contrastrebellion.com

