
FastSpark: A New Fast Native Implementation of Spark from Scratch - mkj
https://medium.com/@rajasekar3eg/fastspark-a-new-fast-native-implementation-of-spark-from-scratch-368373a29a5c
======
sandGorgon
A lot of people using PySpark are moving to Dask for significantly faster
performance. Dask is also built for Kubernetes - which is a huge deployment
win.

Spark is still in between YARN and Kubernetes.

~~~
wenc
Spark is still potentially faster for SQL-like workloads due to the existence
of a query optimizer. Dask works at a different level of abstraction and does
not have a query optimizer.

~~~
sandGorgon
Are you talking about the Spark SQL Catalyst optimiser?

That's apples to oranges - because Dask does not expose a SQL syntax that
needs a query optimiser.

Also, PySpark has the additional issue of serialisation between Python and
the JVM. It turns out that just getting rid of that is a huge performance
boost.
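
As a minimal sketch of that cost (assuming a local PySpark session; the
numbers are arbitrary): a plain Python UDF forces every value across the
JVM/Python boundary, while the equivalent built-in column expression never
leaves the JVM.

    # The UDF path pickles each value between the JVM and a Python worker;
    # the native column expression is evaluated entirely inside the JVM.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.range(10_000_000)

    double_py = udf(lambda x: x * 2, LongType())
    df.select(double_py(col("id"))).count()              # crosses boundary
    df.select((col("id") * 2).alias("doubled")).count()  # stays in the JVM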

~~~
wenc
It’s not apples to oranges with respect to my point though.

Most operations on dataframe-like objects can be described as SQL
operations. Spark supports these operations, and Catalyst can optimize the
query plans for them.

You are correct that Dask does not optimize for this: Dask operations are
more primitive, so it does not have the right level of abstraction for query
optimization, only task-graph optimization. This reinforces my point that if
you have a SQL-like workload on Dask dataframes, chances are Dask may not
outperform Spark.
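
As a rough sketch of what I mean (dataset and column names made up), here
is the same aggregation against Dask and against Spark Dataframes. Dask
executes its task graph essentially as written, while Spark first runs the
plan through Catalyst:

    import dask.dataframe as dd

    ddf = dd.read_parquet("events.parquet")
    ddf[ddf.country == "US"].groupby("user").amount.sum().compute()

    # Spark equivalent; Catalyst can rewrite the plan before execution:
    # (spark.read.parquet("events.parquet")
    #       .where("country = 'US'")
    #       .groupBy("user").sum("amount"))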

~~~
sandGorgon
IMHO, I don't agree with you. Spark's SQL is a consequence of the need to
work across languages - Scala and Python - so SQL gives the abstraction
necessary to be usable in both places. It is also the language in which data
scientists communicate with Spark production engineers.

Every Spark production engineer I know translates the SQL written by data
scientists back into high-performance RDD code.

That's the advantage of Dask - there is no SQL abstraction needed. Pandas
Dataframes are already the lingua franca of data scientists... in fact,
orders of magnitude more so than SQL ever will be.

TLDR - Dask doesn't need SQL because the people who will push Dask to
production are already far more comfortable with Dataframes than they ever
will be with SQL.

You may still argue that Spark RDDs are faster than Dask (and you may
indeed be right)... but not having a SQL engine is not a problem for Dask.

~~~
wenc
> Spark's SQL is a consequence of the need to work across languages - Scala
> and Python

Not really. The Spark API has equivalent calls in both Scala and Python,
with Scala being the superset. Spark's SQL is a high-level abstraction that
is internally mapped to these operations.
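
For instance (a minimal sketch; the toy data is made up), the SQL form and
the DataFrame form compile down to the same plan:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)],
                               ["key", "value"])
    df.createOrReplaceTempView("t")

    spark.sql("SELECT key, SUM(value) FROM t GROUP BY key").explain()
    df.groupBy("key").sum("value").explain()  # essentially the same plan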

> Every Spark production engineer I know translates the SQL written by data
> scientists back into high-performance RDD code.

This would be very unusual and rarely advisable with Spark > 2.0. Spark
Dataframes are generally more memory-efficient, type-safe and performant
than RDDs, so most data engineers work directly in Spark Dataframes --
dropping to RDDs only in specific situations requiring more control.
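
To make the contrast concrete (a sketch continuing the session above, not
anyone's production code):

    pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
    pairs.reduceByKey(lambda x, y: x + y).collect()  # hand-coded RDD path

    df2 = spark.createDataFrame(pairs, ["key", "value"])
    df2.groupBy("key").sum("value").collect()        # planned by Catalyst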

If you know data engineers who are somehow translating SQL into RDDs
(except in rare circumstances), you might want to advise them to move to
Spark > 2.0 and change their paradigms. They might be working with older
Spark paradigms [1], having missed the shift that happened around 2.0 and
all the work that has been done since.

> That's the advantage of Dask - there is no SQL abstraction needed.

SQL is only a language for accessing the dataframe abstraction (Spark
Dataframes, Pandas dataframes, etc.) -- the fact that it is higher-level
means it is amenable to certain types of optimization.

If you take the full set of dataframe operations and restrict it to the set
that SQL supports (group-bys, wheres, pivots, joins, window functions,
etc.), you can apply query optimization.
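
Predicate pushdown is an easy one to see for yourself (hypothetical file
name; same local session as above):

    sales = spark.read.parquet("sales.parquet")
    sales.where(sales.region == "EMEA").explain()
    # The printed plan shows the filter pushed into the Parquet scan
    # itself, e.g. "PushedFilters: [EqualTo(region,EMEA)]", rather than
    # applied as a separate step after the scan.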

Dask does not restrict itself to such a set, and hence allows more powerful
lower-level manipulations on data, but it therefore also cannot perform
SQL-level query optimization, only task-graph-level optimizations.

This Spark vs Dask comparison on the Dask website provides more details [2].

> Pandas Dataframes are already the lingua franca of data scientists... in
> fact, orders of magnitude more so than SQL ever will be.

I wonder if this is where our misunderstanding lies -- I sense that you
might be thinking of SQL strictly as the syntax, whereas I use SQL as
shorthand for a set of mathematical operations on tabular structures --
one that is equivalent on that subset.

[1] [https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)

[2] [https://docs.dask.org/en/latest/spark.html](https://docs.dask.org/en/latest/spark.html)

------
raja_sekar
Author of the article here. Sorry about the Medium restriction. I will just
use GitHub to host my content hereafter. For now, you can use this friend
link:
[https://medium.com/@rajasekar3eg/fastspark-a-new-fast-native-implementation-of-spark-from-scratch-368373a29a5c?source=friends_link&sk=4c498ff1410d550048b9871191720997](https://medium.com/@rajasekar3eg/fastspark-a-new-fast-native-implementation-of-spark-from-scratch-368373a29a5c?source=friends_link&sk=4c498ff1410d550048b9871191720997)

------
choppaface
The examples are all in Rust, so it's very hard to make a non-toy demo.
Usually, if one uses the RDD API, one has some sort of library code that's
already in Java or Python, and it's impractical to port that code just to
make the job run. Or, more likely, somebody will write an initial version
using that code and then port / optimize the job later.

Native dependencies usually mean you'll need Docker. Spark pre-dates Docker
and only relatively recently added the Kubernetes runner, which makes
dockerized jobs easy. Historically, though, it hasn't been easy to run a job
in a containerized environment with the native dependencies you need. You
can ship native deps with your job, but that's not easy, especially if you
need a rebuild with each job.

The main advantage of Spark is flexibility and interoperability. You save
time by not having to write something optimized on day 1 (for something you
might throw away). And you get SQL support, something Beam / Hadoop don't
have (certainly not for Python). There are lots of benchmarks where Spark
SQL is not the winner, but the point is that Spark will help you save
development time.

~~~
raja_sekar
The author of the repo here. It is still very much at the POC stage. It
will definitely have Python APIs in the future; one of the primary reasons
to choose a native language is to enable better Python integration. I
intend to keep the APIs almost identical to Spark's, so that it will be
easy to migrate. It is still very early to promise this, but that is the
objective.

------
latenightcoding
Also relevant:
[https://github.com/thrill/thrill](https://github.com/thrill/thrill)

A Spark-inspired framework written in modern C++.

------
Joeri
This sounds too good to be true. If it is this easy to be orders of
magnitude faster than Spark on the JVM, why haven't the Spark developers
ported Spark to native code already?

~~~
Barraketh
I know that Spark has had a lot of work put into it, but my personal
experience with it has been pretty negative. I've spent a lot of time at my
job trying to tune it to our workflows (extremely deep queries), with only
moderate success. I've just POC'd a custom SQL execution engine that was 200x
faster than Spark for the same workflows. Now, our requirements are pretty
non-standard, but I find it pretty easy to believe these benchmarks.

~~~
madhadron
McSherry et al.'s paper "Scalability! But at what COST?" is worth reading.
A single-threaded, single-core implementation typically outperforms Spark.

The best rule of thumb I'm aware of: unless your computation can't fit on a
single machine, or your jobs are likely to fail before completing due to
the size and length involved, you are generally better off without Spark or
similar systems. And if sampling can get you back onto a single machine,
then you're _really_ better off.

~~~
raja_sekar
In my experience too I observed that distributed code introduces a lot of
redundancy and it requires a lot of data to beat the performance of a single-
threaded/single machine implementation. Check out McSherrys' Timely Dataflow,
it is truly an amazing piece of work.

------
mindcrime
That's mondo righteous. I think I may have finally found a reason to learn
Rust.

------
gok
I'm kind of surprised it took this long for someone to do this. It was clear
very early on that the JVM was a bad match for what Spark was trying to do.

~~~
mistrial9
This sounds far too simplistic, so... reference?

~~~
bcbrown
I attended a talk at Strata a few years back by a Spark committer who
talked about how Spark was stretching JVM memory allocation far past what
the JVM was originally designed for. Do a couple of searches for "spark JVM
OOM" and you'll see some discussions of similar things.

~~~
gok
Dead on

------
krcz
I'm wondering how much could be gained if one used all possible
optimizations: e.g., analyzing the data flow graph - expressed using a DSL
- and generating native node programs, using CPU thread pinning and a
user-space network stack (like ScyllaDB does [1]).

[1]
[https://www.scylladb.com/product/technology/](https://www.scylladb.com/product/technology/)

~~~
wmf
A lot. [https://www.weld.rs/](https://www.weld.rs/)

------
MaxBarraclough
For the confused: this is about Apache Spark, not the Ada-based SPARK
language. [0]

Perhaps I'm alone here but I'd prefer the title say _Apache Spark_ explicitly.

[0]
[https://en.wikipedia.org/wiki/SPARK_(programming_language)](https://en.wikipedia.org/wiki/SPARK_\(programming_language\))

~~~
orhmeh09
Spark (and Apache Spark) is a trademark of the Apache Software Foundation.
If the title were SPARK in all caps, I'd understand, but how often do you
read articles about SPARK where the name is written as "Spark"?

------
aabbcc1241
I like to see people re-implement things and share their (better) results.
Even if the results aren't better than the 'battle-tested' existing
solutions, at least we can learn something in the process.

~~~
wenc
Scylla [1], for instance, is a C++ rewrite of, and drop-in replacement for,
JVM-based Cassandra, and from what I've read it is fairly stable and
performs much faster.

[1] [https://www.scylladb.com/](https://www.scylladb.com/)

------
mariusae
See also bigslice ([https://bigslice.io](https://bigslice.io)) for another
take on this.

------
splix
> You’ve reached the end of your free member preview for this month

:( I guess I can read Medium posts only during the first couple of days of
the month.

~~~
vojta_letal
Anonymous windows have found a second use case.

------
wiradikusuma
I wonder if Spark compiled with Graal would produce much better performance
(compared to plain Spark), so there would be no need for a rewrite.

------
missosoup
"You’ve reached the end of your free member preview for this month"

Stop hosting your content on a platform that holds it hostage so that it can
make money off it without giving anything back to you.

~~~
ieatpies
Try incognito mode

~~~
jdminhbg
You can also use Reader Mode on Safari, which not only avoids the modals and
popups but gets rid of the top and bottom bars as well. Long-click on the
Reader Mode button and you can set it to always use it on medium.com.

~~~
koolba
> Long-click on the Reader Mode button and you can set it to always use it on
> medium.com.

Learning this just made my day!

------
truth_seeker
Nice, but I can't find any reason to choose Spark over modern distributed
SQL databases (CockroachDB, CitusDB, TiDB, etc., or cloud-vendor-specific
SQL DBs).

~~~
aloknnikhil
Spark is specifically useful for querying streaming data. How would a
distributed database help with that? You'd have to build your own stream
executor on top of it.
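
For example, a minimal Structured Streaming sketch in PySpark (the socket
source is demo-only; host and port are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    lines = (spark.readStream.format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())
    counts = lines.groupBy("value").count()  # running count per line
    (counts.writeStream.outputMode("complete")
           .format("console")
           .start()
           .awaitTermination())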

~~~
truth_seeker
Agreed. At the same time, building a stream execution pipeline is not
rocket science. I am not saying modern distributed SQL databases are exact
replacements for, or clones of, Spark. I am saying that with a little more
help from the application server, they are much more capable than Spark.

You can use the following options individually or in combination:

Option 1: PipelineDB extension (PostgreSQL).

Option 2: Service Broker in commercial SQL databases, or building a
PUSH/PULL queue if not supported. There are many libraries in each
programming language that try to do this. Also see option 4.

Option 3: Using CDC or replication for synchronous or asynchronous streamed
computation on a single- or multi-node cluster.

Option 4: Transducers. For example, you can compose many SQL functions or
procedures to act on a single chunk of data instead of always doing async
streamed computation after each stage of transformation (see the sketch
after this list).
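
A rough illustration of option 4 in plain Python (the row shapes and the
sample chunk are made up; a real version would pull batches from the
database):

    def compose(*fns):
        """Apply each transformation to one chunk of rows, in order."""
        def run(chunk):
            for fn in fns:
                chunk = fn(chunk)
            return chunk
        return run

    pipeline = compose(
        lambda rows: [r for r in rows if r["amount"] > 0],            # filter
        lambda rows: [dict(r, usd=r["amount"] * 1.1) for r in rows],  # map
    )

    chunk = [{"amount": 5}, {"amount": -1}]  # stand-in for one DB batch
    print(pipeline(chunk))                   # -> [{'amount': 5, 'usd': 5.5}]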

