
Spark 2.0 Technical Preview - rxin
https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
======
graffitici
The new Structured Streaming API looks pretty interesting. I have the
impression that many Apache projects are trying to address the problems that
arise with the lambda architecture. When implementing such a system, you have
to worry about dealing with two separate systems: one for low-latency stream
processing and the other for batch-style processing of large amounts of data.

Samza and Storm mostly focus on streaming, while Spark and MapReduce
traditionally deal with batch. Spark leverages its core competency of dealing
with batch data and treats streams as sequences of mini-batches, effectively
treating everything as batch.
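
Roughly like this in the classic DStream API (a minimal sketch; the host,
port, and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MicroBatchSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("micro-batch-sketch").setMaster("local[2]")
        // Every 5-second slice of the stream becomes a small RDD, i.e. a batch.
        val ssc = new StreamingContext(conf, Seconds(5))
        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        words.map((_, 1)).reduceByKey(_ + _).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }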

And I imagine in the following snippet, the author is referring to Apache
Flink, among other projects:

> One school of thought is to treat everything like a stream; that is, adopt a
> single programming model integrating both batch and streaming data.

My understanding is that Structured Streaming also treats everything like
batch, but can recognize that the code is being applied to a stream and
perform some optimizations for low-latency processing. Is this what's going
on?

~~~
rxin
tl;dr is yes (to your last question).

The longer answer is that this is about how to logically think about the
semantics of computation using a declarative API; the actual physical
execution (e.g. incrementalization, record-at-a-time processing, batching) is
then handled by the optimizer.
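
For example (a minimal sketch, with a placeholder socket source and console
sink): the query is written once, declaratively, and the engine figures out
how to run it incrementally over the stream.

    import org.apache.spark.sql.SparkSession

    object StructuredStreamingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("structured-streaming-sketch")
          .master("local[2]")
          .getOrCreate()
        import spark.implicits._

        // Logically just "group by word and count", over an unbounded table.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        val counts = lines.as[String]
          .flatMap(_.split(" "))
          .groupBy("value")
          .count()

        // The optimizer chooses the physical plan (incremental micro-batches);
        // the user never wrote batch-vs-stream specific code.
        counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()
          .awaitTermination()
      }
    }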

------
minimaxir
Given that all the big data talk lately has been about GPU
computing/TensorFlow, I'm glad to see that this Spark update shows in-memory
computing is still viable. (Much cheaper to play with too!)

The key feature for me is the machine learning functions in R, which otherwise
lacks parallelizable and scalable options (without resorting to black magic,
anyway).

~~~
luckydata
Honest question: what does "in memory" mean exactly? Do traditional databases
not process data using memory? Doesn't Spark spill to disk when it reaches the
limit of system memory? The in-memory thing always puzzled me.

~~~
btown
The difference is that every variable is sharded by default. So if you have
100 machines with 8 GB of memory each, you can keep an 800 GB array in memory.
Some
traditional databases can do this for one operation at a time, like querying a
sharded index in Mongo or doing a range query in Cassandra. But usually you
want to do a whole pipeline of these operations. In that case, your
computation engine should own the memory for efficiency, and seamlessly
redistribute based on access patterns. Thus, Spark and friends.
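
A toy sketch of that pipeline idea (the path and record layout are made up;
the point is that the cached dataset lives partitioned across the cluster's
RAM and is reused by every subsequent operation):

    import org.apache.spark.sql.SparkSession

    object ShardedPipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("sharded-pipeline-sketch").getOrCreate()
        val sc = spark.sparkContext

        // One logical collection, physically split into partitions across machines.
        val events = sc.textFile("hdfs:///logs/events/*") // placeholder path
          .map(_.split('\t'))
          .cache() // keep the partitioned dataset in cluster memory

        // Several operations hit the same in-memory shards; nothing is re-read
        // from disk or funneled through a single machine in between.
        val byUser = events.map(f => (f(0), 1L)).reduceByKey(_ + _)
        val errors = events.filter(f => f(2) == "ERROR").count()

        println(s"errors: $errors, distinct users: ${byUser.count()}")
        spark.stop()
      }
    }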

------
harigov
I like the direction Spark is heading. I am happy to see that they look at
Spark in the same way compiler developers look at programming languages. There
are huge optimizations to be made in this space. It's insane how inefficient
our current systems are when it comes to big-data processing.

~~~
riyadparvez
I think, based on experiments on Spark published last year, the inefficiency
stems from the assumption that CPU cycles are abundant relative to RAM and
bandwidth. So nobody focuses on optimizing the program itself as much as on
reducing memory or bandwidth usage.

~~~
acjohnson55
Seeing the talk today, I think the Spark team is very aware that CPU cycles
haven't become more abundant, while I/O has. There was an entire slide to this
effect. I don't know enough about the space to say more than that, but it
seems it's definitely on their mind. One of the things I remember being
mentioned is operating on data in a much more efficient binary representation
than the native Java object layout, allowing better use of cache and such.

------
buryat
For me, the biggest improvement is the unified typed Dataset API [1]. The
current Dataset API gave us a lot of flexibility and type safety, and the new
API lets us use it like the DataFrame API instead of converting to RDDs and
reinventing the wheel with things like aggregators [2].

[1] https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/431554386690871/4814681571895601/latest.html

[2] https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html
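
For illustration, a rough sketch of what the unified API buys you (the case
class and values are made up): typed, per-key aggregation stays in the
Dataset/DataFrame world with no detour through RDDs.

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
    import org.apache.spark.sql.expressions.Aggregator

    case class Sale(city: String, amount: Double)

    // A typed aggregator: sums amounts without leaving the typed API.
    object SumAmount extends Aggregator[Sale, Double, Double] {
      def zero: Double = 0.0
      def reduce(acc: Double, s: Sale): Double = acc + s.amount
      def merge(a: Double, b: Double): Double = a + b
      def finish(acc: Double): Double = acc
      def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    object DatasetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("dataset-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        val sales = Seq(Sale("SF", 10.0), Sale("NY", 20.0), Sale("SF", 5.0)).toDS()

        // Per-city totals, fully type-checked at compile time.
        sales.groupByKey(_.city)
          .agg(SumAmount.toColumn.name("total"))
          .show()

        spark.stop()
      }
    }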

------
doug1001
The micro-benchmarks are impressive, e.g., joining 1 billion records: Spark
1.6 ~ 61 sec; Spark 2.0 ~ 0.8 sec.

I assume results such as this are due to various optimizations under the
Tungsten rubric (code generation, manual memory management), which rely on the
sun.misc.Unsafe API.

~~~
rxin
Some of the performance gain came from the use of Unsafe in earlier versions
of Spark (e.g. Spark 1.5). However, the massive gain you are seeing in Spark
2.0 is not coming from Unsafe. It comes from an idea we call "whole-stage code
generation", which eliminates virtual function calls and keeps intermediate
data in CPU registers as much as possible (versus L1/L2/L3 cache or memory).

We will be writing a deep dive blog post about this in the next week or two to
talk more about this idea.
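
If it helps intuition, here is a hand-written analogy (not Spark's actual
generated code): a Volcano-style iterator chain versus the single fused loop
that whole-stage code generation effectively produces.

    object CodegenSketch {
      // Volcano-style: each operator is an iterator, so every row pays
      // virtual calls and intermediate objects between operators.
      trait Operator { def next(): Option[Long] }

      class Scan(data: Array[Long]) extends Operator {
        private var i = 0
        def next(): Option[Long] =
          if (i < data.length) { i += 1; Some(data(i - 1)) } else None
      }

      class EvenFilter(child: Operator) extends Operator {
        def next(): Option[Long] = {
          var row = child.next()
          while (row.exists(_ % 2 != 0)) row = child.next() // skip odd rows
          row
        }
      }

      // Fused version: one tight loop, no virtual dispatch, and `sum` and `v`
      // can live in CPU registers instead of heap objects.
      def fusedSum(data: Array[Long]): Long = {
        var sum = 0L
        var i = 0
        while (i < data.length) {
          val v = data(i)
          if (v % 2 == 0) sum += v // filter + aggregate, inlined together
          i += 1
        }
        sum
      }

      def main(args: Array[String]): Unit = {
        val data = Array.tabulate(1000000)(_.toLong)
        var sum = 0L
        val op: Operator = new EvenFilter(new Scan(data))
        var row = op.next()
        while (row.isDefined) { sum += row.get; row = op.next() }
        println(s"volcano: $sum, fused: ${fusedSum(data)}")
      }
    }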

~~~
rxin
Yup, similar to that. Our next blog post in the pipeline is going to reference
this paper.

~~~
cachemiss
(Guessing you meant to respond to me)

Excellent, I'm glad that the "big data" world is starting to look at database
literature in terms of how it does execution, as there is much to be learned.

Most of these systems are extremely inefficient (looking at you, Hadoop) when
they don't really have to be. Efficient code generation should be table stakes
for any serious processing framework, IMO.

~~~
rxin
Yup was replying to you. Clicked the wrong button :)

------
oonny
Is there a video of a real live example of how Spark helped solve a specific
problem? I've tried quite a few times to wrap my head around what Spark helps
you solve.

~~~
iskander
In theory, Spark lets you seamlessly write parallel computations without
sacrificing expressivity. You perform collections-oriented operations (e.g.
flatMap, groupBy) and the computation gets magically distributed across a
cluster (alongside all necessary data movement and failure recovery).
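
A minimal sketch of that "write it like local collections" point (the input
path is a placeholder): a word-length histogram over a distributed corpus.

    import org.apache.spark.sql.SparkSession

    object WordLengthSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("collections-sketch").getOrCreate()
        val sc = spark.sparkContext

        // Reads like Scala collections code, but every step runs in parallel
        // across the cluster, with shuffles inserted where needed.
        val histogram = sc.textFile("hdfs:///corpus/*.txt") // placeholder path
          .flatMap(_.split("\\s+"))
          .groupBy(_.length)
          .mapValues(_.size)
          .collect()

        histogram.sortBy(_._1).foreach { case (len, n) => println(s"$len: $n") }
        spark.stop()
      }
    }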

In practice, Spark seems to perform reasonably well on smaller in-memory
datasets and on some larger benchmarks under the control of Databricks. My
experience has been pretty rough for legitimately large datasets (can't fit in
RAM across a cluster) -- mysterious failures abound (often related to
serialization, fat in-memory representations, and the JVM heap).

The project has been slowly moving toward an improved architecture for working
with larger datasets (see Tungsten and DataFrames), so hopefully this new
release will actually deliver on the promise of Spark's simple API.

~~~
oonny
Thanks for the reply, but I was looking for a use case, e.g. "with Spark I was
able to do X." I don't even know where Spark would be applied.

~~~
cldellow
We use it for two things:

* distributed machine learning tasks using their built-in algorithms (although note that some of them, e.g. LDA, just fall over with not-even-that-big datasets)

* as a general fabric for doing parallel processing, like crunching terabytes of JSON logs into Parquet files, doing random transformations of the Common Crawl

As a developer, it's really convenient to spin up ~200 cores on AWS spot
instances for ~$2/hr and get fast feedback as I iterate on an idea.
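
For the log-crunching case, the core of the job is only a few lines (a
bare-bones sketch; the paths and the `date` partition column are made up):

    import org.apache.spark.sql.SparkSession

    object JsonToParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

        // Schema is inferred from the JSON; each executor reads its own slice.
        val logs = spark.read.json("s3a://placeholder-bucket/logs/*.json.gz")

        logs.filter(logs("status").isNotNull)
          .write
          .partitionBy("date") // assumes the logs carry a `date` column
          .parquet("s3a://placeholder-bucket/logs-parquet/")

        spark.stop()
      }
    }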

------
babo
Is compiling from source the only way to test Spark 2.0 without using the
proprietary Databricks package?

~~~
BrandonBradley
There are nightly snapshots. Give those a try!

