
Spark as a Compiler: Joining a Billion Rows per Second on a Laptop - rxin
https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
======
rusanu
To put this into context I would recommend reading 'MonetDB/X100: Hyper-
Pipelining Query Execution' [0]. Vectorized execution has been sort of an open
secret in the database industry for quite some time now.

For me, reading about the Spark achievements is particularly interesting. I was
part of the similar Hive effort (the Stinger initiative [1]) and I contributed
some parts of the Hive vectorized execution [2]. I see the same solution that
applied to Hive now being applied to Spark:

- move to a columnar, highly compressed storage format (Parquet; for Hive it
was ORC)

- implement a vectorized execution engine

- code generation instead of plan interpretation. This is particularly
interesting for me because for Hive this was discussed at the time and actually
_not_ adopted (ORC and vectorized execution had, justifiably, higher priority).

Looking at the numbers presented in the OP, it looks very nice. Aggregates,
Filters, Sort, and Scan ('decoding') show big improvements (I would have
expected these; they are exactly what vectorized execution is best at). I like
that Hash-Join also shows a significant improvement; it is obviously a better
implementation than the one I did in HIVE-4850, of which I'm not too proud. The
SM/SMB join is not affected, no surprise there.

I would like to see a breakdown of how much of the improvement comes from
vectorization vs. how much from code generation. I get the feeling that, the
way they did it, these cannot be separated: I think there are no vectorized
plans/operators to compare against the code generation, since they implemented
both simultaneously. I'm speculating, but I guess the new whole-stage code
generation generates vectorized code, so there is no vectorized execution
without code generation.
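
To make the distinction concrete, here is a rough, non-Spark sketch (all names
are made up): a vectorized operator runs over a whole column batch in a tight
primitive loop, and whole-stage code generation emits loops of that same shape
straight from the plan, which is why the two effects are hard to benchmark
separately.

    // A rough, non-Spark sketch of row-at-a-time interpretation vs. vectorized
    // execution (all names are made up).
    object VectorizationSketch {

      // Interpreted, row-at-a-time: one virtual eval() call per row per operator.
      trait Expr { def eval(row: Array[Any]): Any }
      case class GreaterThan(col: Int, threshold: Int) extends Expr {
        def eval(row: Array[Any]): Any = row(col).asInstanceOf[Int] > threshold
      }

      // Vectorized: the operator processes a whole column batch in a tight
      // primitive loop, amortizing the interpretation overhead across the batch.
      def vectorizedGreaterThan(col: Array[Int], threshold: Int, sel: Array[Boolean]): Unit = {
        var i = 0
        while (i < col.length) { sel(i) = col(i) > threshold; i += 1 }
      }

      // Whole-stage code generation emits loops of the second shape directly
      // from the plan, so vectorization and codegen blur into one another.
      def main(args: Array[String]): Unit = {
        val col = Array(1, 7, 3, 9)
        val sel = new Array[Boolean](col.length)
        vectorizedGreaterThan(col, 5, sel)
        println(sel.mkString(", ")) // false, true, false, true
      }
    }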

All in all, congrats to the Databricks team. This will have a big impact.

[0] [http://oai.cwi.nl/oai/asset/16497/16497B.pdf](http://oai.cwi.nl/oai/asset/16497/16497B.pdf)
[1] [http://hortonworks.com/blog/100x-faster-hive/](http://hortonworks.com/blog/100x-faster-hive/)
[2] [https://issues.apache.org/jira/browse/HIVE-4160](https://issues.apache.org/jira/browse/HIVE-4160)

~~~
sitkack
How much of this work gets the working set off of the JVM heap? They are
generating JVM bytecode? Compact heap layout is the next big win.

~~~
rusanu
The OP points to SPARK-12795 [0] and it is all open source. They generate Java
source code. You can read more in the prototype pull request:
[https://github.com/apache/spark/pull/10735](https://github.com/apache/spark/pull/10735)
(I couldn't find a spec doc). If I understand correctly, they insert a
`WholeStageCodegen` [1] operator into the plan:

        /**
         * WholeStageCodegen compile a subtree of plans that support codegen together into single Java
         * function.
         *
         * Here is the call graph of to generate Java source (plan A support codegen, but plan B does not):
         *
         *   WholeStageCodegen       Plan A               FakeInput        Plan B
         * =========================================================================
         *
         * -> execute()
         *     |
         *  doExecute() --------->   inputRDDs() -------> inputRDDs() ------> execute()
         *     |
         *     +----------------->   produce()
         *                             |
         *                          doProduce()  -------> produce()
         *                                                   |
         *                                                doProduce()
         *                                                   |
         *                         doConsume() <--------- consume()
         *                             |
         *  doConsume()  <--------  consume()
         *
         * SparkPlan A should override doProduce() and doConsume().
         *
         * doCodeGen() will create a CodeGenContext, which will hold a list of variables for input,
         * used to generated code for BoundReference.
         */

[0]
[https://issues.apache.org/jira/browse/SPARK-12795](https://issues.apache.org/jira/browse/SPARK-12795)
[1]
[https://github.com/apache/spark/blob/0e70fd61b4bc92bd744fc44...](https://github.com/apache/spark/blob/0e70fd61b4bc92bd744fc44dd3cbe91443207c72/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala)
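
To make the produce/consume flow above a bit more concrete, here is a rough
sketch of the pattern (the names are illustrative, not the actual Spark
classes): each operator asks its child to produce() the driving loop and gets
rows handed back up through consume(), so the whole subtree collapses into a
single generated function.

    // A rough sketch of the produce/consume contract described above (names are
    // illustrative; the real interfaces are in WholeStageCodegenExec.scala).
    object ProduceConsumeSketch {

      trait CodegenSupport {
        // the operator that will consume rows this operator produces
        protected var parent: CodegenSupport = null

        // Emit the Java source of the driving loop for this subtree.
        def produce(p: CodegenSupport): String = { parent = p; doProduce() }
        protected def doProduce(): String

        // A child hands a produced row (as a variable name) back up the tree.
        def consume(row: String): String = doConsume(row)
        protected def doConsume(row: String): String
      }

      // A filter fused into the generated loop: no per-row virtual call remains.
      class FilterCodegen(child: CodegenSupport, cond: String) extends CodegenSupport {
        protected def doProduce(): String = child.produce(this)
        protected def doConsume(row: String): String =
          s"if ($cond) { ${parent.consume(row)} }"
      }

      // A scan emits the outer loop and calls consume() once per row.
      class ScanCodegen(input: String) extends CodegenSupport {
        protected def doProduce(): String =
          s"while ($input.hasNext()) { InternalRow row = $input.next(); ${parent.consume("row")} }"
        protected def doConsume(row: String): String =
          throw new UnsupportedOperationException("scan is a leaf")
      }

      def main(args: Array[String]): Unit = {
        // A WholeStageCodegen-style driver: ask the top operator to produce code.
        val sink = new CodegenSupport {
          protected def doProduce(): String = ""
          protected def doConsume(row: String): String = s"append($row);"
        }
        val plan = new FilterCodegen(new ScanCodegen("input"), "row.getInt(0) > 5")
        // The scan's loop ends up containing the filter and the append, fused:
        println(plan.produce(sink))
      }
    }

Running it prints one fused loop, roughly `while (input.hasNext()) { InternalRow
row = input.next(); if (row.getInt(0) > 5) { append(row); } }`, which is the
whole point: no per-row virtual calls are left between the operators.
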

------
eggy
>> This style of processing, invented by columnar database systems such as
MonetDB and C-Store

True, since MonetDB came out of the Netherlands around 1993, but KDB+ came out
in 1998, well over 8 years before C-Store.

K4, the language used in KDB, is 230 times faster than Spark/Shark and uses
0.2GB of RAM vs. 50GB of RAM for Spark/Shark, yet it gets no mention in the
article. That seems a strange omission for such a sensational-sounding title
[1]. I don't understand why big data startups don't try to remake the success
of KDB instead of reinventing bits and pieces of the same tech, with the result
being slower DB operations and more RAM usage.

[1] [http://kparc.com/q4/readme.txt](http://kparc.com/q4/readme.txt)

~~~
threeseed
Because Spark is far, far more than just storing data and doing basic queries.
It is also an analytics (e.g. machine learning/modelling) platform and a
framework for building complex applications on top of Hadoop.

~~~
eggy
KDB+/Q/K is all about analytics too. It is used for more than just financial
time series data; power utility usage and resourcing, for one.

Hadoop is also about distributed computing, but if the process requires 250x
as much RAM, you're going to have a very high TCO regardless of how cheap RAM
or servers can be.

J is an open-source APL-derived language (different in some important ways),
but it is not as fast as KDB+/Q/K. J is truly an array-based language, whereas
K is list-based, so K has more in common with Lisp in that one respect.

Kerf [1] is a new language being worked on by one of the creators of the Kona
language, an open-source version of the K3 language.

It seems these types of articles ignore non-open-source solutions even if they
are more efficient in many ways, including TCO. How many man-years need to be
spent trying to duplicate an existing solution, only to still not end up with a
better one?

[1] [https://github.com/kevinlawler/kerf](https://github.com/kevinlawler/kerf)

~~~
scottlocklin
I often feel like the one-eyed man trying to describe color and shape to a
world of blind people when describing the power of these programming systems.
I realize everyone wants free stuff, but folks seem to find it difficult to
comprehend how much more powerful and performant some of the pay-for systems,
in particular the array(ish) systems, are compared to stuff like Spark. In principle,
peer review should make all open source systems superior. In practice, the
people most capable of making something like Spark fast work for Oracle,
Microsoft or Kx on database and machine learning engines people pay money for.
While much software development is just hooking components together, there is
still such a thing as software engineering, and those skills are extremely
rare. Even looking at humble stuff like file systems... I thought EXT4 was
pretty good until reading of the wonders of ZFS. Thanks for mentioning Kerf.
I'll also mention Nial, which has had some exciting developments lately.

------
mastratton3
I've personally been very impressed with Spark's RDD API for easily
parallelizing tasks that are "embarrassingly parallel". However, I have found
the DataFrame API to not always work as advertised, and thus I'm very
skeptical of the benchmarks.

A prime example: I was using some very basic windowing functions, and due to
the data shuffling (the data wasn't naturally partitioned) it seemed very
buggy, and it was not clear why things were failing. I ended up rewriting the
same section of code using Hive, and it had both better performance and none of
the odd failures. I realize this stuff will improve, but I'm still skeptical.
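
For concreteness, the kind of windowing I mean looks roughly like this (a toy
sketch with made-up data and column names; we were on 1.5.1, but the sketch
uses the newer SparkSession entry point for brevity):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    object WindowSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("window-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // toy time series; in our real job this came out of many files and was
        // not partitioned by `id`, so the window forced a full shuffle
        val df = Seq((1, 10L, 1.0), (1, 20L, 2.0), (2, 10L, 3.0)).toDF("id", "ts", "value")

        // previous value per id, ordered by timestamp
        val w = Window.partitionBy("id").orderBy("ts")
        df.withColumn("prev", lag("value", 1).over(w)).show()

        spark.stop()
      }
    }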

~~~
tma-1
I have been using the DataFrame/SQL API extensively and I just love it. Most
of the issues I have had stemmed from the cluster / Spark configuration and
not the API itself. Using SQL is so much more intuitive than using multiple
joins, selects, filters, etc. on an RDD.
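
For what it's worth, the contrast I mean reads roughly like this (a toy sketch
with made-up tables, using the 2.0 SparkSession API):

    import org.apache.spark.sql.SparkSession

    object SqlVsRdd {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sql-vs-rdd").master("local[*]").getOrCreate()
        import spark.implicits._

        val orders = Seq((1, "a", 10.0), (2, "b", 20.0)).toDF("id", "customer", "amount")
        val customers = Seq(("a", "US"), ("b", "DE")).toDF("customer", "country")

        // SQL: the whole join + filter + aggregate reads in one place.
        orders.createOrReplaceTempView("orders")
        customers.createOrReplaceTempView("customers")
        spark.sql(
          """SELECT c.country, SUM(o.amount) AS total
            |FROM orders o JOIN customers c ON o.customer = c.customer
            |WHERE o.amount > 5
            |GROUP BY c.country""".stripMargin).show()

        // The same thing with paired RDDs: keys, tuples, and aggregation by hand.
        val byCustomer = orders.rdd.map(r => (r.getString(1), r.getDouble(2))).filter(_._2 > 5)
        val countries = customers.rdd.map(r => (r.getString(0), r.getString(1)))
        byCustomer.join(countries)
          .map { case (_, (amount, country)) => (country, amount) }
          .reduceByKey(_ + _)
          .collect()
          .foreach(println)

        spark.stop()
      }
    }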

~~~
mastratton3
So I did find it useful for doing additional exploratory aggregations once the
data was already cleaned and denormalized. My comment was more directed at the
upfront initial data processing (In our case, extracting time series data out
of a large amount of files).

I did hit issues w/ multiple joins and shuffling though. Have you not hit
issues w/ shuffling?

I was using Spark 1.5.1 for the record.

~~~
tma-1
Have you tried tuning Spark's memory parameters?
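
For example, the knobs that usually matter for shuffle-heavy jobs look
something like this (a sketch; these are the 1.6+ property names, and the
values are placeholders, not recommendations):

    import org.apache.spark.SparkConf

    object TuningSketch {
      // Typical memory/shuffle settings people end up tuning (1.5 used
      // spark.storage.memoryFraction / spark.shuffle.memoryFraction instead).
      val conf: SparkConf = new SparkConf()
        .set("spark.executor.memory", "8g")          // heap per executor
        .set("spark.memory.fraction", "0.6")         // heap share for execution + storage
        .set("spark.memory.storageFraction", "0.5")  // part of that protected for cached data
        .set("spark.sql.shuffle.partitions", "400")  // post-shuffle partition count
    }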

------
dswalter
It's interesting to see that as further work is done on Spark (and I'm pleased
they're actually improving the system), it behaves more and more like a
database.

------
foota
Reading about databases always makes me sad about the enterprise database I
work with.

~~~
PeCaN
You can read about the awesome theory behind relational algebra. Gave me a
great appreciation for how awesome SQL databases actually are (and why they
work the way they do).

~~~
50CNT
Any good books on that?

~~~
ddispaltro
This Java project gives you a glimpse at how it all works. It was extracted
out of a now-defunct DB vendor (I think?).
[https://calcite.apache.org/](https://calcite.apache.org/)

~~~
Jach
Yeah, it came out of the Eigenbase project.
([http://luciddb.sourceforge.net/](http://luciddb.sourceforge.net/) and
[http://www.eigenbase.org](http://www.eigenbase.org)). DB internals are indeed
quite fun. (I worked on
LucidDB ([https://github.com/LucidDB/luciddb](https://github.com/LucidDB/luciddb))
for a time.)

------
jkot
Typesafe plans to add support for vectorization into Scala compiler for Akka.
Also JVM 8 JIT compiler does it to some extend already.
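
For the HotSpot case, the shape it can auto-vectorize is a simple counted loop
over primitive arrays, something like the sketch below; whether SIMD
instructions are actually emitted depends on the JIT and the hardware.

    object AutoVecSketch {
      // A simple counted loop over primitive arrays: the shape HotSpot's C2
      // superword optimization can auto-vectorize (not guaranteed).
      def axpy(a: Float, x: Array[Float], y: Array[Float]): Unit = {
        var i = 0
        while (i < x.length) {
          y(i) += a * x(i)
          i += 1
        }
      }

      def main(args: Array[String]): Unit = {
        val x = Array.fill(1024)(1.0f)
        val y = Array.fill(1024)(2.0f)
        axpy(0.5f, x, y)
        println(y(0)) // 2.5
      }
    }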

~~~
ddispaltro
Where'd you read this, I'd love to take a look.

~~~
hokkos
[https://github.com/dotty-linker/dotty](https://github.com/dotty-linker/dotty)

------
faizshah
I'll have to benchmark Spark 2.0 against Flink; it seems like it could be
faster than Flink now. It does depend on whether the dataset they used was
living in memory before they ran the benchmark, but these are still some pretty
impressive numbers, and some of the optimizations they made with Tungsten sound
similar to what Flink was doing.

~~~
capkutay
Really depends on the use case. If you're trying to do streaming, Spark will
introduce some latency because it's based on micro-batching. Flink and Storm
are better for low-latency scenarios.

~~~
faizshah
I agree that Flink is better for low latency, but I was also under the
impression that Flink's DataSet API was faster for batch processing than Spark.
Now that they're claiming Spark 2.0 may be up to 10x faster than 1.6, I'll have
to take another look.

