
Ballista: Distributed Compute with Rust, Apache Arrow, and Kubernetes - andygrove
https://andygrove.io/2019/07/announcing-ballista/
======
s_Hogg
Hang in there mate :) I really don't think you deserve a lot of the crap
you've been given in this thread. Someone has to try something new.

~~~
eb0la
The fact that people oppose your idea/work means it is valuable enough for
people to push back on rather than ignore.

I must confess I miss _native_ execution of (big) data jobs. I know moving JVM
bytecode between nodes to be run is portable, but nowadays nobody has mixed-
architecture (Intel/MIPS/SPARC/ARM) servers, so... why do we need a bytecode
execution layer?

Maybe I am too critical, but bare metal - hypervisor - JVM - app looks like
too many layers to me.

~~~
s_Hogg
Totally. I think there are a lot of cases where all that machinery is
completely superfluous anyway, because thinking about the problem you're
trying to solve can lead you to a way of doing it that doesn't need all that
compute. There's a blog post about this I read a while back where someone did
a simple graph search using Spark, then did it in a slightly smarter way on a
commodity laptop, and the laptop outran Spark on a large-ish amount of data.
Wish I could find it.

~~~
buckminster
This one?

[http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html](http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

~~~
s_Hogg
Yep, that's the one!

------
sandGorgon
How about Dask? It's fairly production-grade and has experimental Arrow
integration.

[https://github.com/apache/arrow/blob/master/integration/dask/Dockerfile](https://github.com/apache/arrow/blob/master/integration/dask/Dockerfile)

Dask deploys pretty well on k8s -
[https://kubernetes.dask.org/en/latest/](https://kubernetes.dask.org/en/latest/)

~~~
wesm
Dask does not have "experimental Arrow integration". It supports using Arrow
to read Parquet files, but has no Arrow-based computational functionality.

~~~
arijun
Thanks for the clarification, Wes!

Semi-related question: How do you expect Arrow to be integrated to the larger
data science landscape?

Will it mostly be used as a go-between format? Will new libraries use it
internally while old libraries just read it and translate it to their native
formats? Do you think established libraries will change their back-ends to
Arrow? Is that even feasible with e.g. Pandas (or are you too far from their
governance now to say)?

~~~
wesm
Too big of a discussion for Hacker News! Come join dev@arrow.apache.org if you
want to talk about it.

~~~
sandGorgon
Thanks for correcting me. I was not aware of this nuance. Would you be open to
posting quick thoughts here for the rest of us?

~~~
wesm
No. If you want to talk about it, come on the Apache Arrow mailing list.

------
dswalter
I'm actually excited about the possibilities. I've watched DataFusion from
afar, and I have spent a decent amount of time wishing the Big Data ecosystem
had arrived during a time when something like Rust was a viable option, both
for memory and for parallel computing.

I use Presto all the time, I love how fully-featured it is, but garbage
collection is a non-trivial component of time-to-execute for my queries.

------
fspear
Are you looking for contributors? I don't have any Rust, Arrow, or k8s
experience, but I've been looking to learn all three. I've also been looking
to contribute to open-source projects, so I'm happy to pick up any low-hanging
fruit if you are interested.

I do have a few years of experience with Spark and Hadoop, if that's worth
anything.

~~~
andygrove
Yes contributors are welcome. I will write up some guidance in the next few
days for those looking to contribute!

------
ohnoesjmr
I applaud the effort. I always thought that Spark is great, but the fact that
it runs on the JVM hinders it quite badly (GC, tons of memory required for the
runtime, jar hell (want to use proto3 in your Spark job? Good luck)).

I do, however, worry that Rust has a high barrier to entry.

~~~
andygrove
I agree. Rust has a very steep learning curve compared to JVM languages. My
hope in building this platform is that it can provide value to other languages
(especially JVM) by taking query plans and executing them efficiently.

~~~
0815test
> compared to JVM languages

Apache Spark is written in Scala, and I wouldn't describe that as having an
'easy' learning curve, even compared to Rust! If you want something 'easy' on
the JVM, Kotlin (and perhaps Eta or Frege) might be more appropriate.

~~~
twic
Coming from Java, i found Rust easier to learn than Scala.

~~~
Recurecur
I expect you were exposed to Scala via a heavily functional programming
approach. That's a whole 'nother (larger) area to pick up along with the
different language.

If Scala is used as a "better Java" to start with, and the developers then
explore FP at their own pace, I think there's a better outcome. Granted, Java
8+ has done something to close the gap with Scala as well.

I wonder if Scala-Native will absorb some Rustish features as other languages
have or are attempting to...

(What I said above echoes Odersky's ideas about Scala developer levels:
[https://www.scala-lang.org/old/node/8610.html](https://www.scala-lang.org/old/node/8610.html))

------
senderista
If you’re looking for an approachable distributed query planner,
[https://github.com/uwescience/raco](https://github.com/uwescience/raco) might
be a good place to start.

------
cozos
Most "big data" distributed compute frameworks that come to mind are written
in a JVM language, so the focus on Rust is interesting.

So then, would Rust be better than a JVM language for a distributed compute
framework like Apache Spark?

Based on what others said in this thread, these are the primary arguments for
Rust:

1. JVM GC overhead

2. JVM GC pauses

3. JVM memory overhead

4. Native code (i.e. Rust) has better raw performance than a JVM language

My take on it:

(1) I believe Spark basically wrote its own memory management layer with
Unsafe that lets it bypass the GC [0], so for DataFrame/SQL we might be OK
here. Hopefully value types are coming to Java/Scala soon.

(2) The majority of Apache Spark use cases are batch, right? In this case, who
cares about a little stop-the-world pause here and there, as long as we're
optimizing the GC for throughput? I recognize that streaming is also a thing,
so maybe a non-GC language like Rust is better suited for latency-sensitive
streaming workloads. Perhaps the Shenandoah GC would be of help here.

(3) What's the memory overhead of a JVM process, 100-200 MB? That doesn't seem
too bad to me when clusters these days have terabytes of memory.

(4) I wonder how much of an impact performance improvements from Rust will
have over Spark's optimized code generation [1], which basically converts your
code into array loops that exploit cache locality, loop unrolling, and SIMD. I
imagine that most of the gains to be had from a Rust rewrite would come from
these "bare metal" techniques, so it might be the case that Spark already has
that going for it...
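For what it's worth, the "array loop" style that codegen aims for can be sketched in a few lines of Rust. This is purely illustrative (not Spark's actual generated code): a filter-and-sum over a column of primitives written as a single tight loop that the compiler is free to unroll and vectorize.

```rust
// Illustrative sketch of the "tight array loop" style that whole-stage
// codegen produces: one pass over a column of primitives, rather than
// interpreting an expression tree row by row.
fn sum_filtered(values: &[f64], threshold: f64) -> f64 {
    values.iter().filter(|&&v| v > threshold).sum()
}

fn main() {
    // A "column" of 1,000 values: 0.0, 1.0, ..., 999.0
    let col: Vec<f64> = (0..1_000).map(|i| i as f64).collect();
    // Sum of values strictly greater than 500.0, i.e. 501 + 502 + ... + 999
    println!("{}", sum_filtered(&col, 500.0)); // prints 374250
}
```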

Having said that, I can't think of any reasons why a compute engine in Rust is
a bad idea. Developer productivity and ecosystem, perhaps?

[0] [https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html](https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html)

[1] [https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html](https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html)

~~~
andygrove
Some good points. Some incredible engineering has gone into Spark to work
around the fact that it runs on the JVM.

Memory overhead of Spark particularly (not just JVM) is very high. In some
cases close to 100x more memory than equivalent query execution with
DataFusion [0]. Also you might be interested to see my original blog post with
some of my thoughts on this [1].

[0] [https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/](https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/)

[1] [https://andygrove.io/2018/01/rust-is-for-big-data/](https://andygrove.io/2018/01/rust-is-for-big-data/)

~~~
cozos
Insightful blog posts! IMO a better memory comparison would be between a Spark
executor and a DataFusion ... container I guess (i.e. graphing query time vs
spark.executor.memory). This would give you a better idea of memory TCO on a
cluster.

------
kyllo
This is really cool! What do you see as ideally the primary API for something
like this?

SQL is great for relational algebra expressions to transform tables, but its
limited support for variables and control-flow constructs makes it less than
ideal for complex, multi-step data analysis scripts. And when it comes to
running statistical tests, regressions, or training ML models, it's wholly
inappropriate.

Rust is a very expressive systems programming language, but it's unclear at
this point how good of a fit it can be for data analysis and statistical
programming tasks. It doesn't have much in the way of data science libraries
yet.

Would you potentially add e.g. a Python interpreter on top of such a
framework, or would you focus on building out a more fully-featured Rust API
for data analysis, and even go so far as to suggest that data scientists start
to learn and use Rust? (There is some precedent for this with Scala and
Spark.)

~~~
andygrove
These are great questions and topics I plan on addressing in a future blog
post. SQL is a great convenience for simple analytical queries and it can be
nice to be able to mix and match SQL and other access patterns (this is one
thing I like with the Apache Spark DataFrame approach).

It's true that the Rust ecosystem for data science is not really there yet and
I am trying to inspire people to start changing that. I think Rust does have
some good potential here.

In the nearer term, though, I am exploring options around supporting user code
in distributed query execution, and I don't want to limit it to Rust.

~~~
kyllo
I like it--I'm a data scientist by day, and I've been following Rust with
interest but I haven't found a good use case for it in my job yet. A Rust-
based Spark competitor sounds like it could be exactly that excuse to use Rust
at work that I've been looking for!

~~~
andygrove
This is exactly the situation I am in. We have workloads that I think we could
deliver with much lower TCO using Rust/DataFusion/Ballista.

We could probably even use a single node for some of our workflows just using
DataFusion directly with same performance as a distributed Spark job.

------
wiradikusuma
So Spark is bloated because of the JVM. Does Graal make the point moot?

~~~
voodootrucker
Not really, no. Even though Graal compiles to native code, these JVM-based
languages are all built on two assumptions:

1. there will be a GC to manage memory

2. memory will be managed

The GC slows things down, so Spark tries to work around this by accessing
"off-heap" memory (which really just means off the JVM's heap and on the
OS's). So you end up getting OOM errors if you give too little to the JVM or
too little to the OS. It's a hacky balancing act to get native access to
memory, which comes for free with Rust.
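The contrast with ownership-based memory management can be sketched in a few lines of Rust. This is purely illustrative, and `ExecBuffer` is a made-up type: the point is that deallocation happens at a statically known point when the owner goes out of scope, with no collector pauses and no heap-sizing knobs to balance.

```rust
// Illustrative only: a buffer whose memory is released deterministically
// when its owner goes out of scope — no GC, no on-heap/off-heap split.
struct ExecBuffer {
    data: Vec<u8>,
}

impl Drop for ExecBuffer {
    fn drop(&mut self) {
        // Runs at a known point: the end of the owning scope.
        println!("freeing {} bytes", self.data.len());
    }
}

fn main() {
    {
        let buf = ExecBuffer { data: vec![0u8; 1024] };
        println!("using {} bytes", buf.data.len());
    } // `buf` is dropped here; the memory is returned immediately
    println!("buffer already freed");
}
```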

------
polskibus
How does this compare to dremio, that also uses Apache Arrow? Is this a
competitor?

~~~
atombender
What languages does Dremio support?

~~~
andygrove
Dremio is JVM based but with compilation down to LLVM and they contributed
Gandiva to the Apache Arrow project. Dremio seems pretty impressive but I
haven't personally used it.

[https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/](https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/)

~~~
atombender
Interesting, thanks. I know the entire industry is built around Hadoop and the
JVM right now (with some Python), but I'm hoping the pendulum will swing in a
different direction soon.

~~~
andygrove
Me too. I have more than 20 years of JVM background and have been using Apache
Spark for several years, and while I have great respect for the engineering
that has gone into Spark, I feel that it just started out with the wrong
language. GC-based languages are not ideal for large-scale distributed data
processing, IMHO.

~~~
pjmlp
Only when said languages don't support value types. It was a mistake for Java
not to offer what Wirth-inspired languages already had in the '90s (Oberon
variants, Modula-3, Eiffel).

The first EA release with value type support is out now.

I firmly believe that, in the end, tracing GC with value types and local
ownership, as D, Swift, and C# are pursuing, will win, at least in the realm
of userspace programming.

~~~
voodootrucker
Do you know the status of value types for Java? I see they were slated to be
in v10 of the JVM, and I can find discussion [0] up to a year ago about them
making it into the language, but that's as far as my googling has taken me.

If Java can have struct access into native memory (without the presently
complicated API [1]), that would open a lot of doors and get it to near
feature parity with the CLR.

[0] [https://news.ycombinator.com/item?id=14583530](https://news.ycombinator.com/item?id=14583530)

[1] [https://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html](https://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html)

~~~
pjmlp
Yes, have a look at

[https://mail.openjdk.java.net/pipermail/valhalla-dev/2019-July/006094.html](https://mail.openjdk.java.net/pipermail/valhalla-dev/2019-July/006094.html)

And follow the links from there, including the wiki page describing the EA
release; unfortunately my submission did not gather much uptake.

It is taking all this time because the team wants pre-value-types jars to keep
running unmodified in the new world, which is not an easy engineering task.

------
eb0la
This project needs a "how to help" section _urgently_

~~~
andygrove
I will write up some guidance in the next few days for those looking to
contribute!

------
snicker7
Would this system support custom aggregates? How would I, for example, create
a routine that defines a covariance matrix and have Ballista deal with the
necessary map-reduce logic?
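For context on the question: the usual way such an aggregate decomposes for map-reduce-style execution is into an associative partial state. Here is a hedged sketch in plain Rust (not Ballista's actual API; `CovPartial` and its methods are hypothetical names) for a scalar covariance, which generalizes component-wise to a covariance matrix.

```rust
// Hypothetical sketch of a custom covariance aggregate decomposed into
// map-reduce: each partition folds rows into a partial state, partials
// merge associatively across executors, and a final step produces the result.
#[derive(Default, Clone, Copy)]
struct CovPartial {
    n: f64,
    sum_x: f64,
    sum_y: f64,
    sum_xy: f64,
}

impl CovPartial {
    // "Map" side: fold one (x, y) pair into the partial state.
    fn update(&mut self, x: f64, y: f64) {
        self.n += 1.0;
        self.sum_x += x;
        self.sum_y += y;
        self.sum_xy += x * y;
    }

    // "Reduce" side: merging partials is component-wise addition, which is
    // what lets the engine combine results from different executors.
    fn merge(self, other: CovPartial) -> CovPartial {
        CovPartial {
            n: self.n + other.n,
            sum_x: self.sum_x + other.sum_x,
            sum_y: self.sum_y + other.sum_y,
            sum_xy: self.sum_xy + other.sum_xy,
        }
    }

    // Finalize: population covariance E[XY] - E[X]E[Y].
    fn finalize(&self) -> f64 {
        self.sum_xy / self.n - (self.sum_x / self.n) * (self.sum_y / self.n)
    }
}

fn main() {
    // Two "partitions" of the same dataset, aggregated independently.
    let p1 = [(1.0, 2.0), (2.0, 4.0)];
    let p2 = [(3.0, 6.0), (4.0, 8.0)];
    let (mut a, mut b) = (CovPartial::default(), CovPartial::default());
    for &(x, y) in &p1 { a.update(x, y); }
    for &(x, y) in &p2 { b.update(x, y); }
    println!("cov = {}", a.merge(b).finalize()); // prints cov = 2.5
}
```

A full covariance matrix is the same idea with one such state per pair of columns, all merged the same way.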

------
StreamBright
Could this be used without Kubernetes?

~~~
andygrove
With further development, yes. It's just Docker containers. However, the
current CLI is specific to Kubernetes.

If you weren't using Kubernetes, what orchestrator would you be interested in?

~~~
StreamBright
Thanks, Andy! To be clear, I really appreciate your effort to create a better
platform for big data. I have spent 10 years trying to make Hadoop & Spark
financially and technically scalable for companies with more than 1 PB of
data, and I totally agree with you that we need a better system. I am just not
ready to trade Hadoop or Spark problems for Kubernetes problems. The question
is: what orchestration do we need? What model do you want to implement? Could
we build a better Kubernetes?

~~~
andygrove
I've only been using Kubernetes for a couple of months and am still learning,
but I am very impressed so far. I love the way it facilitates dev and devops
collaboration, and the fact that it is cloud-agnostic (I can even run a
Kubernetes cluster on my desktop for local testing). My opinion so far is that
the distributed-cluster part is really solved by Kubernetes.

~~~
StreamBright
Yeah, it is absolutely amazing for exactly that, and yes, it has a good
approach to the distributed cluster problem. Once you put it in production,
though, you see a very different picture.

[https://github.com/hjacobs/kubernetes-failure-stories](https://github.com/hjacobs/kubernetes-failure-stories)

I haven't met a single Kubernetes user who had it in production and did not
have stability or performance issues with it. This is why I mentioned that I
am not ready to trade my Hadoop outages and performance issues for Kubernetes
ones.

------
blittable
Super cool. Perhaps a naive question, but how does distributing computation
(with the serialization it entails) square with Arrow's in-memory design?

~~~
andygrove
Each executor within the cluster would use Arrow's in-memory design. If you
have enough cores and memory on a single node, then potentially you wouldn't
need a cluster at all.
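One way to picture this: within a single executor process, operators can share a columnar batch by reference counting, with no copying and no serialization; serialization only enters the picture when a batch has to cross a process or node boundary. The following is an illustrative plain-Rust sketch of that sharing model, not the actual Arrow crate API.

```rust
use std::sync::Arc;

// Illustrative sketch (not the Arrow crate API): a columnar batch shared
// zero-copy between operators inside one executor via reference counting.
struct ColumnarBatch {
    ints: Arc<Vec<i64>>,
}

fn main() {
    let batch = ColumnarBatch { ints: Arc::new(vec![1, 2, 3, 4]) };
    // Two "operators" holding the same physical buffer: cloning the Arc
    // bumps a refcount instead of copying the column.
    let scan_output = Arc::clone(&batch.ints);
    let filter_input = Arc::clone(&batch.ints);
    assert!(Arc::ptr_eq(&scan_output, &filter_input));
    println!("shared column of {} values, {} references",
             scan_output.len(), Arc::strong_count(&batch.ints));
}
```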

------
m0zg
Serious, non-facetious question: who is this for?

~~~
andygrove
This is for people who want to live in a future where we use efficient and
safe system level languages for massively scalable distributed data
processing. IMHO, Python and Java are not ideal language choices for these
purposes.

Rust offers much lower TCO compared to current industry best practices in this
area.

~~~
geezerjay
But claiming that "X software is written in Y language/framework" says nothing
about efficiency or safety. It's just meaningless marketing piggybacking on
popular buzzwords.

And claims about "the future" are simply absurd. Oddly enough, this link
appears right next to another story on how Cobol powers the world's economy.

Frankly, I'm surprised blockchain wasn't shoved somewhere in the announcement.

~~~
andygrove
Check out my previous benchmarks on Rust-based DataFusion vs JVM-based Apache
Spark workloads. It isn't just about raw performance numbers but also about
resource requirements (especially RAM requirements when using JVM or similar
GC-based languages). There are order-of-magnitude differences.

~~~
geezerjay
If you have tangible results, then present your benchmarks. If you limit your
marketing to empty claims about "the future" and vague assertions about
performance, then you're actively working to lower your own credibility.

~~~
andygrove
This is a personal open source project. I'm not sure I'm "marketing" it since
I make zero dollars from this work.

For benchmarks, check out the past 18 months of posts on my blog. Here is the
most recent:

[https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/](https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/)

