
Announcing Spark 1.6 - rxin
https://databricks.com/blog/2016/01/04/announcing-spark-1-6.html
======
mb22
We've been testing 1.6 since before the release, specifically Spark SQL, and
there are some big performance improvements in this release. We're putting
together a third-party benchmark that I'll post to HN when we're done.

------
minimaxir
It sounds like many of the improvements are not available in PySpark yet,
which is disappointing. (The notes say feature parity for MLlib, so I'll look
into that.) Still, the notes sound promising.

~~~
mziel
To be honest, PySpark and SparkR are always going to be second-class citizens
(because of the serialization/pickling between the JVM and the Python/R
process). Databricks shows nice graphs saying they are equivalent for
DataFrames, but those hold only for built-in functions that essentially
translate into a Catalyst execution plan. For anything bespoke (UDFs, custom
Transformers/Estimators) you're better off using Scala.
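
A minimal PySpark sketch of the distinction, with made-up data and column
name: the built-in call stays in the JVM, while the UDF round-trips every row
through a Python worker.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import col, upper, udf
    from pyspark.sql.types import StringType

    sc = SparkContext(appName="udf-vs-builtin")
    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([("alice",), ("bob",)], ["name"])

    # Built-in function: compiled into a Catalyst plan, runs entirely in the JVM.
    df.select(upper(col("name"))).show()

    # Python UDF: each row is pickled, sent to a Python worker, and sent back.
    upper_udf = udf(lambda s: s.upper(), StringType())
    df.select(upper_udf(col("name"))).show()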

~~~
rxin
This is true when you compare the performance against Java/Scala, but if you
compare it with other tools that are native to Python, it is not really much
worse. For example, Pandas operations that use custom UDFs are substantially
slower than the native operations.
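
To illustrate the Pandas point, a toy comparison (the data and sizes are made
up, not a benchmark): the vectorized expression runs in optimized native code,
while apply makes one Python call per element.

    import pandas as pd

    df = pd.DataFrame({"x": range(1000000)})

    fast = df["x"] * 2                      # vectorized native operation
    slow = df["x"].apply(lambda v: v * 2)   # custom UDF, one Python call per row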

That said, as part of Project Tungsten, we have some ideas about a batch
columnar format that can be shared by Python, R, Scala and Java, and that
should be able to eliminate most of the inefficiency in serialization across
process boundaries.

~~~
mziel
That sounds very interesting. Is there a ticket where I can follow the
progress on the batch columnar format you mentioned?

Btw, I was critical of the issue above, but I do love Spark and use it on a
daily basis. :)

~~~
rxin
I just created a JIRA ticket tracking this:
https://issues.apache.org/jira/browse/SPARK-12635

Thanks for the reminder!

------
kod
So the detailed post on datasets at
https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html uses
groupBy.

I'm pretty sure based on previous comments you've made that groupBy was one of
the things you'd rather eliminate from the RDD API, because of the performance
impact compared to reduceByKey (which is almost always what people should be
using instead).

Are you at all worried about confusion if groupBy now performs ok on datasets,
but not on rdds?

~~~
rxin
Despite our attempts at warning people, a lot of users still use groupByKey on
RDDs. Hopefully over time this won't be a problem, as the engine should be
able to figure this out more intelligently and do the proper rewrite (of
course, we won't be able to do it 100% of the time).

~~~
gshayban
Many people blindly point to the docs to say "don't use groupBy, prefer reduce
because it's faster..." Are there better examples that illustrate the
fundamental differences between the two operations? Surely there is still a
need for both.

~~~
IvanVergiliev
Reduce can perform reductions locally on each machine before shuffling the
data, which decreases both the memory and the network overhead. If you need
all the elements for a given key - e.g. to display them to a user or save them
to a DB - then groupBy is the right tool. If you're going to perform some form
of a reduce afterwards though, it's likely sub-optimal.
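
A minimal PySpark sketch of the difference, with toy data and a per-key sum
just for illustration:

    from pyspark import SparkContext

    sc = SparkContext(appName="group-vs-reduce")
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # groupByKey: every value for a key is shuffled across the network and
    # materialized together before you can do anything with it.
    grouped_sums = pairs.groupByKey().mapValues(sum)

    # reduceByKey: values are combined locally on each partition before the
    # shuffle, so far less data crosses the network.
    reduced_sums = pairs.reduceByKey(lambda x, y: x + y)

    print(grouped_sums.collect())
    print(reduced_sums.collect())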

------
_laf
This post:
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html

In the section titled "Future Directions for Spark Streaming" there is a
paragraph about _Event time and out-of-order data_ and _Backpressure_. Being
able to use these would blow my mind; this is a real pain currently.

~~~
peterstjohn
Backpressure is in 1.5+ by setting spark.streaming.backpressure.enabled=true
(https://spark.apache.org/docs/latest/streaming-programming-guide.html#requirements).
Like you, I'm looking forward to the out-of-order data support.
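
A minimal PySpark sketch of turning it on when building the streaming context
(the app name and batch interval are made up):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("backpressure-demo")
            .set("spark.streaming.backpressure.enabled", "true"))

    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches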

------
mziel
New statistics/machine learning algorithms are always welcome, but the big
plus for productionizing is ML pipeline persistence.

That said, during Spark Summit the Databricks folks themselves were most
excited about the Dataset API. Looking forward to giving it a try.

------
mark_l_watson
Great news, Spark is awesome. The only problem is that I now need to review my
Spark material for an eBook that I released a month ago and update the
examples to work on version 1.6, if required.

------
DannoHung
Any support for nearest neighbor joins? They are very important for aligning
events in time series data sets.

------
huula
Love Spark! Great work, folks!

------
ranjeet_hacker
Excited about the Dataset API and ML pipelines.

