
2016 Spark Summit East Keynote - mydpy
http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
======
eranation
Very excited to hear the plans for GraphFrames - finally GraphX getting some
attention!

[https://spark-summit.org/east-2016/events/graphframes-graph-...](https://spark-summit.org/east-2016/events/graphframes-graph-queries-in-spark-sql/)

~~~
mydpy
At Spark Summit East there are a lot of people evangelizing GraphX and trying
to convince people to think about their problems using graphs.

------
krcz
How advanced is the Structured Streaming functionality? Looking at the JIRA
[1], I can't even find a design prototype there, which is kind of strange if
they want to have it ready by the end of April. But since there was a
presentation on the topic at the summit [2], I hope they're just developing it
without discussion on JIRA.

[1]
[https://issues.apache.org/jira/browse/SPARK-8360](https://issues.apache.org/jira/browse/SPARK-8360)
[2] [https://spark-summit.org/east-2016/events/keynote-day-3/](https://spark-summit.org/east-2016/events/keynote-day-3/)

~~~
mydpy
Did you read the proposal?
[https://issues.apache.org/jira/secure/attachment/12775265/St...](https://issues.apache.org/jira/secure/attachment/12775265/StreamingDataFrameProposal.pdf)

------
mydpy
I think the more exciting announcement was Databricks community edition, which
allows you to use 2.0:

[https://news.ycombinator.com/item?id=11126179](https://news.ycombinator.com/item?id=11126179)

------
azth
Slide 10:

> CPU speeds have not kept up with I/O in the past 5 years.

I presume he means the other way around?

Also, what does he mean by native memory management? Does he mean off-heap
allocation?

And what's he referring to regarding code generation?

~~~
mydpy
He means the other way around: I/O improvements are outpacing CPU
improvements.

Native refers to (I think) the following issues:
[https://issues.apache.org/jira/browse/SPARK-12785](https://issues.apache.org/jira/browse/SPARK-12785)
[https://issues.apache.org/jira/browse/SPARK-8641](https://issues.apache.org/jira/browse/SPARK-8641)

Code generation is enabled by SPARK-8641, but I'm not sure exactly what it
entails. I think it is related to some of the RDD transformation/action
merging they do to optimize runtime operations in 2.0.

Your thoughts?

~~~
haimez
It entails fusing "narrow" operations on data frames into a single method of
generated code to avoid virtual method invocations, help the JIT, and improve
hardware branch prediction among other things.

Basically you take the general, composable API functions and generate specific
equivalent code that avoids the overhead of dealing with abstract interfaces
like Iterators.
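The fusion described above can be loosely sketched in plain Python (Spark's
generated code is JVM bytecode, and the function names here are purely
illustrative): the first version routes every element through one call per
operator, the second is the specialized single loop that fusion produces.

```python
# Composable, iterator-based pipeline: each element passes through one
# generator per operator, analogous to Volcano-style Iterator chains.
def pipeline_iterators(data):
    mapped = (x * 2 for x in data)
    filtered = (x for x in mapped if x > 4)
    return sum(filtered)

# "Fused" equivalent: the same plan collapsed into one tight loop with no
# per-operator indirection, which is what generated code aims for.
def pipeline_fused(data):
    total = 0
    for x in data:
        y = x * 2
        if y > 4:
            total += y
    return total

data = [1, 2, 3, 4]
assert pipeline_iterators(data) == pipeline_fused(data) == 14
```

Both compute the same result; the fused form just removes the abstraction
overhead between operators.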

~~~
vvanders
Nifty. I've always been a big fan of code generation.

How does Spark guarantee data memory contiguity (or do they at all)? Do they
use sun.misc.Unsafe or some other form of memory management?

~~~
haimez
Yeah, since Spark 1.6 (with some optimizations in Spark 1.5) the data being
operated on is managed "off heap", which means it doesn't add to garbage
collection times and is stored in a more contiguous and cache-friendly layout.
It's definitely using Unsafe, but I can't speak to the exact implementation.
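As a loose analogy to the off-heap layout described above (Spark does this on
the JVM via sun.misc.Unsafe; this sketch just uses Python's ctypes to get a
contiguous native buffer outside the interpreter's ordinary object graph):

```python
import ctypes

# Allocate a contiguous buffer of fixed-width native longs. Unlike a list
# of boxed Python ints, the payload sits in one flat C array, which is the
# cache-friendly layout the comment above refers to.
n = 4
buf = (ctypes.c_long * n)()
for i in range(n):
    buf[i] = i * 10

values = [buf[i] for i in range(n)]
assert values == [0, 10, 20, 30]
```

The win in both cases is the same: fixed-width records packed contiguously
can be scanned without pointer chasing or per-object GC bookkeeping.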

------
TheGuyWhoCodes
Has it become easier to run ad hoc queries with Spark? I remember a year ago
the only available solution was the job server by Ooyala, which seems to be a
missing feature of core Spark, and isn't something I was willing to bet my
product on.

Datastax evangelized using Spark to run queries over Cassandra, but it looks
so awkward and time-consuming to copy jars around to the master; basically you
need a devops team to do this, and even more scripting for production.

~~~
eip
>to run queries over Cassandra

Why not just use Presto? It gives you basically full SQL capability for
Cassandra with minimal effort.

~~~
TheGuyWhoCodes
Presto seems like a good product; I haven't really tested it much, though.
When we started looking at Cassandra 1.5 years ago, Spark integration was all
the rage, and the premise was that it was the missing link for doing analytics
on Cassandra.

We tested it thoroughly and came to the conclusion that it wasn't a mature
enough solution.

------
DannoHung
Are Spark streams ever going to reach a point where you can just have a table
sitting in memory aggregating data and then you run queries on the _whole_
thing without having to worry about windowing or anything?

------
mziel
Last Spark Summit the videos were up on YouTube 1-2 hours after each talk.
Does anybody know where to find the ones from this summit?

------
josep2
Started using Spark 1.6 a few months ago. Excited for the Kafka Connector
feature.

~~~
kod
What exactly was the mention of Kafka referring to? Spark has had decent kafka
integration for a while now.

~~~
peterstjohn
I think it's support for Kafka Connect, new in Kafka 0.9:
[http://kafka.apache.org/090/documentation.html#connect](http://kafka.apache.org/090/documentation.html#connect)

~~~
kod
It doesn't have anything to do with Kafka Connect.

Matei was actually talking about the existing Spark Kafka direct stream
implementation, which has been available since Spark 1.3.

The video of the talk is available here:
[http://livestream.com/fourstream/sparksummiteast2016-tracka/...](http://livestream.com/fourstream/sparksummiteast2016-tracka/videos/112612459)

~~~
peterstjohn
Ah, thanks! I have unfortunately been too busy to watch the streams this week,
and assumed it was Connect because I think Confluent is/was doing a talk with
Kafka Connect and Spark at the summit.

