Ballista: Distributed Compute with Rust, Apache Arrow, and Kubernetes (andygrove.io)
194 points by andygrove on July 17, 2019 | 99 comments



Hang in there mate :) I really don't think you deserve a lot of the crap you've been given in this thread. Someone has to try something new.


The fact that people oppose your idea/work means it is valuable enough for people to say something against it rather than ignore it.

I must confess I miss native execution of big data jobs. I know moving JVM bytecode between nodes is portable, but nowadays nobody has mixed-architecture (Intel/MIPS/SPARC/ARM) servers, so... why do we need a bytecode execution layer?

Maybe I am too critical, but bare metal - hypervisor - JVM - app looks like too many layers to me.


Totally. I think there are a lot of cases where all that machinery is completely superfluous anyway, because thinking carefully about the problem you're trying to solve can lead you to an approach that doesn't need all that compute. There's a blog post about this I read a while back where someone did a simple graph search using Spark, then did it in a slightly smarter way on a commodity laptop, and the laptop outran Spark on a large-ish amount of data. Wish I could find it.



Yep, that's the one!


I remember 4-5 years ago I tried to slice a big CSV file with Spark, and it took so long that it still hadn't finished after several hours.

Later that day I realized it could be done from the command line with 'cut' and 'grep'. It took less than 15 minutes.


I wonder how long it would have taken with `cut` and `ripgrep`


How about Dask? It's fairly production grade and has experimental Arrow integration.

https://github.com/apache/arrow/blob/master/integration/dask...

Dask deploys pretty well on k8s - https://kubernetes.dask.org/en/latest/


Dask does not have "experimental Arrow integration". It supports using Arrow to read Parquet files but no Arrow-based computational functionality.


Thanks for the clarification, Wes!

Semi-related question: How do you expect Arrow to be integrated into the larger data science landscape?

Will it mostly be used as a go-between format? Will new libraries use it internally while old libraries just read it and translate it to their native formats? Do you think established libraries will change their back-ends to Arrow? Is that even feasible with e.g. Pandas (or are you too far from their governance now to say)?


Too big of a discussion for Hacker News! Come on dev@arrow.apache.org if you want to talk about it


Thanks for correcting me. I was not aware of this nuance. Would you be open to posting quick thoughts here for the rest of us?


No. If you want to talk about it, come on the Apache Arrow mailing list.


Although I know and work with some of the contributors from this project, I have no real world experience with Dask or Python data science tools in general.

Thanks for the link. I will read about their Kubernetes support.


I'm actually excited about the possibilities. I've watched DataFusion from afar, and I have spent a decent amount of time wishing the Big Data ecosystem had arrived during a time when something like Rust was a viable option, both for memory and for parallel computing.

I use Presto all the time, I love how fully-featured it is, but garbage collection is a non-trivial component of time-to-execute for my queries.


Are you looking for contributors? I don't have any Rust, Arrow, or k8s experience, but I've been looking to learn all three. I've also been looking to contribute to OSS projects, so I'm happy to pick up any low-hanging fruit if you are interested.

I do have a few years of experience with Spark and Hadoop if that's worth anything.


Yes contributors are welcome. I will write up some guidance in the next few days for those looking to contribute!


I congratulate the effort, as I always thought that Spark is great, but the fact it was written in Java hinders it quite badly (GC, tons of memory required for the runtime, jar hell (want to use proto3 in your spark job? Good luck)).

I do however worry that rust has a high bar of entry.


I'm quite sure that as soon as there are performant WASM engines deployable outside a browser (e.g. parity-wasm), Ballista will converge on one, and it would hence support anything that can compile to WASM.

In fact, I think the time to begin that type of project is now, because by the time you do the groundwork for such a project, integrate it, etc., a WASM engine will be ready.


The first EA releases for value types are out. Even if it takes a couple more releases, it is much easier to improve existing stacks than to restart from zero.


I agree. Rust has a very steep learning curve compared to JVM languages. My hope in building this platform is that it can provide value to other languages (especially JVM) by taking query plans and executing them efficiently.


> compared to JVM languages

Apache Spark is written in Scala, and I wouldn't describe that as having an 'easy' learning curve, even compared to Rust! If you want something 'easy' on the JVM, Kotlin (and perhaps Eta or Frege) might be more appropriate.


Coming from Java, I found Rust easier to learn than Scala.


I expect you were exposed to a heavily functional programming approach. That's a whole 'nother (larger) area to pick up along with the different language.

If Scala is used as a "better Java" to start with, and then the developers explore FP at their own pace, I think there's a better outcome. Granted, Java 8+ has done a fair amount to close the gap with Scala as well.

I wonder if Scala-Native will absorb some Rustish features as other languages have or are attempting to...

(What I said above echoes Odersky's ideas about Scala developer levels: https://www.scala-lang.org/old/node/8610.html)


Scala is way easier to use, given that it uses a tracing GC, has three good IDEs, and has a more mature ecosystem.


If you’re looking for an approachable distributed query planner, https://github.com/uwescience/raco might be a good place to start.


Most "big data" distributed compute frameworks that come to mind are written in a JVM language, so the focus on Rust is interesting.

So then, would Rust be better than a JVM language for a distributed compute framework like Apache Spark?

Based on what others said in this thread, these are the primary arguments for Rust:

1. JVM GC overhead

2. JVM GC pauses

3. JVM memory overhead.

4. Native code (i.e. Rust) has better raw performance than a JVM language

My take on it:

(1) I believe Spark basically wrote its own memory management layer with Unsafe that lets it bypass the GC [0], so for DataFrame/SQL we might be OK here. Hopefully value types are coming to Java/Scala soon.

(2) The majority of Apache Spark use cases are batch, right? In that case, who cares about a little stop-the-world pause here and there, as long as we're optimizing the GC for throughput? I recognize that streaming is also a thing, so maybe a non-GC language like Rust is better suited for latency-sensitive streaming workloads. Perhaps the Shenandoah GC would be of help here.

(3) What's the memory overhead of a JVM process, 100-200 MB? That doesn't seem too bad to me when clusters these days have terabytes of memory.

(4) I wonder how much of an impact performance improvements from Rust will have over Spark's optimized code generation [1], which basically converts your code into array loops that utilize cache locality, loop unrolling, and SIMD. I imagine that most of the gains to be had from a Rust rewrite would come from these "bare metal" techniques, so it might be the case that Spark already has that going for it (see the sketch below for the kind of loop I mean)...
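
For a concrete picture of those "bare metal" techniques, here is a hand-written Rust sketch of the kind of tight column loop that whole-stage codegen aims to produce. This is purely illustrative: it is not Spark's actual generated code, and the column names are made up.

    // Hand-written sketch (not Spark's generated code) of a tight column
    // loop: one pass over contiguous arrays, simple control flow, friendly
    // to the cache and the auto-vectorizer.
    fn filtered_sum(prices: &[f64], quantities: &[i64], min_qty: i64) -> f64 {
        let mut total = 0.0;
        for i in 0..prices.len().min(quantities.len()) {
            // Predicate folded into the loop body, no per-row objects.
            if quantities[i] >= min_qty {
                total += prices[i] * quantities[i] as f64;
            }
        }
        total
    }

    fn main() {
        let prices = vec![10.0, 20.0, 30.0];
        let quantities = vec![1, 5, 10];
        println!("{}", filtered_sum(&prices, &quantities, 5)); // prints 400
    }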

Having said that, I can't think of any reasons why a compute engine on Rust is a bad idea. Developer productivity and ecosystem perhaps?

[0] https://databricks.com/blog/2015/04/28/project-tungsten-brin...

[1] https://databricks.com/blog/2016/05/23/apache-spark-as-a-com...


Some good points. Some incredible engineering has gone into Spark to work around the fact that it runs on the JVM.

Memory overhead of Spark particularly (not just JVM) is very high. In some cases close to 100x more memory than equivalent query execution with DataFusion [0]. Also you might be interested to see my original blog post with some of my thoughts on this [1].

[0] https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/ [1] https://andygrove.io/2018/01/rust-is-for-big-data/


Insightful blog posts! IMO a better memory comparison would be between a Spark executor and a DataFusion ... container I guess (i.e. graphing query time vs spark.executor.memory). This would give you a better idea of memory TCO on a cluster.


I'm not too familiar with JVM internals or Spark, but I know with Cassandra at least there's a cost to off-heap memory. You gain in GC, but eventually you have to move that data in and out of the JVM's heap.

Even for batch processing, long GCs can be bad. It's not just processing the batch that stops, but the whole world. Anything trying to listen for more data, keep track of time elapsed, etc is going to run across more problems. It can also expose some race conditions that would normally be so unlikely that you'd never hit them.

Static JVM memory overhead isn't bad at all, probably even under your 100-200MB guess. The lack of compact primitives adds some additional proportional overhead, but the biggest factor is just the extra "slack" space needed for good GC performance. Depending on the circumstances and requirements that could be 30-400%.


One of the creators of Spark apparently thinks Rust is worth pursuing:

https://www.weld.rs/


This is really cool! What do you see as ideally the primary API for something like this?

SQL is great for relational algebra expressions to transform tables but its limited support for variables and control flow constructs make it less than ideal for complex, multi-step data analysis scripts. And when it comes to running statistical tests, regressions, training ML models, it's wholly inappropriate.

Rust is a very expressive systems programming language, but it's unclear at this point how good of a fit it can be for data analysis and statistical programming tasks. It doesn't have much in the way of data science libraries yet.

Would you potentially add e.g. a Python interpreter on top of such a framework, or would you focus on building out a more fully-featured Rust API for data analysis, and even go so far as to suggest that data scientists start to learn and use Rust? (There is some precedent for this with Scala and Spark.)


These are great questions and topics I plan on addressing in a future blog post. SQL is a great convenience for simple analytical queries and it can be nice to be able to mix and match SQL and other access patterns (this is one thing I like with the Apache Spark DataFrame approach).
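
To make that concrete, here is roughly what the SQL entry point looks like from Rust via DataFusion. This is a minimal sketch: the table and column names are invented, and the API shown follows a recent DataFusion release, so details may differ between versions.

    // Minimal sketch of running SQL through DataFusion from Rust.
    // Table/column names are invented; requires the `datafusion` and
    // `tokio` crates, and API names may vary between releases.
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();

        // Register a CSV file so it can be queried by name.
        ctx.register_csv("trips", "trips.csv", CsvReadOptions::new()).await?;

        // Plain SQL for the simple analytical case; the resulting DataFrame
        // handle can also be refined programmatically before execution.
        let df = ctx
            .sql("SELECT passenger_count, AVG(fare_amount) FROM trips GROUP BY passenger_count")
            .await?;

        df.show().await?;
        Ok(())
    }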

It's true that the Rust ecosystem for data science is not really there yet and I am trying to inspire people to start changing that. I think Rust does have some good potential here.

In the nearer term, though, I am exploring options around supporting user code in distributed query execution, and I don't want to limit it to Rust.


I like it--I'm a data scientist by day, and I've been following Rust with interest but I haven't found a good use case for it in my job yet. A Rust-based Spark competitor sounds like it could be exactly that excuse to use Rust at work that I've been looking for!


This is exactly the situation I am in. We have workloads that I think we could deliver with much lower TCO using Rust/DataFusion/Ballista.

We could probably even use a single node for some of our workflows, just using DataFusion directly, with the same performance as a distributed Spark job.


> Would you potentially add e.g. a Python interpreter

Why would this be a desired feature? The Python ecosystem is a mess; simply having an interpreter in no way means that you would be able to get your Python script to run. I think it's much better to simply do a system call on a Python script against the Python stack in $PATH, e.g. use a container-based approach.


So Spark is bloated because of the JVM. Does Graal make the point moot?


Not really, no. Even though Graal compiles to native code, these JVM based languages are all based on two ideas:

1. there will be a GC to manage memory

2. memory will be managed

The GC slows things down, so Spark tries to work around this by accessing "off heap" memory (which really just means off the JVM's heap and on the OS's). So you end up getting OOM errors if you give too little to the JVM or too little to the OS. It's a hacky balancing act to get native access to memory, which comes for free with Rust.
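
By way of contrast, here is a minimal Rust sketch (not Ballista code) of the same idea: one buffer owned directly by the process, allocated and freed at well-defined points, with no separate on-heap/off-heap budget to tune.

    // Minimal sketch (not Ballista code): a large buffer owned directly by
    // the process, with no GC heap in between and no on-heap/off-heap split.
    fn process_partition(num_rows: usize) -> f64 {
        // One contiguous allocation, sized exactly for the data.
        let values: Vec<f64> = (0..num_rows).map(|i| i as f64).collect();

        // Work directly on the buffer; no boxing, no object headers.
        let sum: f64 = values.iter().sum();

        sum / num_rows as f64
        // `values` is freed deterministically when it goes out of scope here.
    }

    fn main() {
        println!("mean = {}", process_partition(1_000_000));
    }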


How does this compare to Dremio, which also uses Apache Arrow? Is this a competitor?


No. Dremio is a mature product backed by a company. This is a two-week-old PoC right now. There could be overlap and collaboration for sure via the Apache Arrow project, since Dremio are well represented there and I am an Arrow committer too.


My goal is to kick start some things around the Rust language for data science / data analytics and projects like this help to showcase what Rust can do. See my original blog post for more context: https://andygrove.io/2018/01/rust-is-for-big-data/


Interestingly, I've had the same observations that you've had about the work the Spark project has done to work around the JVM. I've also been idly wondering what a native-first distributed data processing framework would look like and how it would perform. I'll be really interested in seeing where your project goes.


It will be tough, next to impossible, to match Python's ecosystem. Maybe a way into it would be to marry the two.


What languages does Dremio support?


Dremio is JVM based but with compilation down to LLVM and they contributed Gandiva to the Apache Arrow project. Dremio seems pretty impressive but I haven't personally used it.

https://www.dremio.com/announcing-gandiva-initiative-for-apa...


Interesting, thanks. I know the entire industry is built around Hadoop and the JVM right now (with some Python), but I'm hoping the pendulum will swing in a different direction soon.


Me too. I have a JVM background going back more than 20 years and have been using Apache Spark for several years, and while I have great respect for the engineering that has gone into Spark, I feel that they just started out with the wrong language. GC-based languages are not ideal for large-scale distributed data processing, IMHO.


Only when said languages don't support value types. It was an error on Java's part not to offer what Wirth-inspired languages already had in the 90s (Oberon variants, Modula-3, Eiffel).

The first EA release for value types support is out now.

I firmly believe that, in the end, tracing GC with value types and local ownership, as D, Swift and C# are pursuing, will win, at least in the realm of userspace programming.


Do you know the status of value types for Java? I see they were slated to be in v10 of the JVM, and I can find discussion [0] from up to a year ago about them making it into the language, but that's as far as my googling has taken me.

If Java can have struct access into native memory (without the presently complicated API [1]) that would open a lot of doors and get it to near feature parity with the CLR.

[1] https://docs.oracle.com/javase/6/docs/api/java/nio/ByteBuffe...

[0] https://news.ycombinator.com/item?id=14583530


Yes, have a look at

https://mail.openjdk.java.net/pipermail/valhalla-dev/2019-Ju...

And follow the links from there, including the wiki page describing the EA release. Unfortunately, my submission did not gather much uptake.

It is taking all this time because the team wants pre-value-types jars to keep running unmodified in the new world, which is not an easy engineering task.


Why are you hoping that the proverbial pendulum swings away from Hadoop?


I find the Hadoop platform to be heavyweight and difficult to manage. Its design came from the original GFS and MapReduce papers, and was perhaps not great to begin with (name node failover, etc.).

I would like to see the industry move away from the JVM to less memory-hungry, AOT-compiled languages. Too much of the Hadoop world simply assumes that you'll be using a JVM language, and doesn't provide non-JVM APIs at all. It's been a while since I looked, but I think Apex and Flink were basically useless if you didn't use a JVM language or Python.

Of all the Apache projects related to Hadoop, I believe Beam is the only one that has Go support (for implementing processing logic), for example.


SQL


This project needs a "how to help" section urgently


I will write up some guidance in the next few days for those looking to contribute!


Would this system support custom aggregates? How would I, for example, create a routine that defines a covariance matrix and have Ballista deal with the necessary map-reduce logic?


Could this be used without Kubernetes?


With further development, yes. It's just Docker containers. However, the current CLI is specific to Kubernetes.

If you weren't using Kubernetes, what orchestrator would you be interested in?


Thanks Andy! To be clear, I really appreciate your effort to create a better platform for big data. I have spent 10 years trying to make Hadoop & Spark financially and technically scalable for companies with more than 1 PB of data, and I totally agree with you that we need a better system. I am just not ready to trade Hadoop or Spark problems for Kubernetes problems. The question is: what orchestration do we need? What model do you want to implement? Could we build a better Kubernetes?


I've only been using Kubernetes for a couple months so far and am still learning, but I am very impressed so far. I love the way it facilitates dev and devops collaborating and the fact that it is cloud agnostic (I can even run a Kubernetes cluster on my desktop for local testing). My opinion so far is that the distributed cluster part is really solved by Kubernetes.


Yeah, it is absolutely amazing for exactly that, and yes, it has a good approach to the distributed cluster problem. Once you put it in production, it's a very different picture.

https://github.com/hjacobs/kubernetes-failure-stories

I haven't had a single Kubernetes user who had it in production and did not have stability or performance issues with it. This is why I mentioned that I am not ready to trade my Hadoop outages and performance issues for Kubernetes ones.


I'd be interested in a version I could use with Nomad at least.


Super cool. Perhaps naive, but how does distributing computation with serialization square with Arrow's in-memory design?


Each executor within the cluster would use Arrow's in-memory design. If you have enough cores and memory on a single node, then potentially you wouldn't need a cluster.
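
For anyone unfamiliar with what "Arrow in-memory" means in Rust terms, here is a minimal sketch using the Rust arrow crate (not Ballista code; the API shown follows a recent arrow release): each column is a contiguous typed array, and a RecordBatch groups columns under a schema. Serialization only happens when batches cross process boundaries.

    // Minimal sketch of Arrow's columnar layout with the Rust `arrow` crate.
    use std::sync::Arc;
    use arrow::array::{ArrayRef, Float64Array, Int64Array};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;

    fn main() -> Result<(), arrow::error::ArrowError> {
        let schema = Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
            Field::new("value", DataType::Float64, false),
        ]));

        // Each column is one contiguous, typed buffer.
        let ids: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3]));
        let values: ArrayRef = Arc::new(Float64Array::from(vec![1.5, 2.5, 3.5]));

        // A RecordBatch ties the columns to the schema.
        let batch = RecordBatch::try_new(schema, vec![ids, values])?;
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
        Ok(())
    }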


Serious, non-facetious question: who is this for?


This is for people who want to live in a future where we use efficient and safe system level languages for massively scalable distributed data processing. IMHO, Python and Java are not ideal language choices for these purposes.

Rust offers much lower TCO compared to current industry best practices in this area.


I get that Rust is exciting, but you aren't really building this for engineers, right?

When embarking on a non-trivial project, it's good to know who the prospective end user is. Helps shape the product requirements. Looking at your feature list for V1, it can easily take a team of 5-7 excellent engineers with strong, relevant domain expertise a couple of years to do most of the items on it well if your end user is "anybody". Some items on it (such as distributed query optimization) are open research problems. Some (distributed query execution, particularly if you want to support high performance multi-way joins) can take years all on their own.

If it's something more specialized, it could take a lot less time and effort. I say this as someone who's "been there done that" (at Google). We had to support a broad range of use cases, but we also had some of the best engineers in the world, some with decades of experience in the problem domain, and it still took years to get the product (BigQuery) out the door. This was, to a large extent, because the requirements were so amorphous. First version of what later became BigQuery was built in a few months by a much smaller team, because it was targeted at a much more specialized use case: querying structured application logs in (almost) realtime.

It's much easier to not try to be all things to all people.


Some great points there.

I have multiple goals here.

Goal #1 is to demonstrate how Rust can be effective at data science / analytics and evangelize a bit.

Goal #2 is to have fun trying to build some of this myself and improve my skills as a software engineer.

Goal #3 (getting a little more vague now) is to ensure that in some hypothetical future I can get a job working with Rust instead of JVM/Apache Spark.


Got it. Fun side project then! Good luck. Rust does seem like a good fit for something like this.

I think the hard-ass responses in the thread (all unwarranted) simply misunderstand the goals.


FWIW, I'm keeping your name in the back of my head for when my current company inevitably fails at their big data revolution. Our biggest product produces 10-15 PB/day that must be processed. Our existing solution is 20 years old and relies on navigating C code with function line counts on the order of 10k. Rust would've saved me a lot of time from living in gdb.

Really enjoyed the article. Keep it up!


> First version of what later became BigQuery was built in a few months by a much smaller team, because it was targeted at a much more specialized use case: querying structured application logs in (almost) realtime.

Is there more about this story that you could link me to? Or would reading the whitepaper provide most of the interesting detail of how a log query engine became a general-purpose query product?


Rust still has to do some work regarding language ergonomics.

Currently I would rather wait for Java to get its value types story properly done.

For me an ergonomic future is a tracing GC, value types and manual memory management in hot code paths, like D, Eiffel and Modula-3 are capable of, improved with ownership rules.


Conventional tracing GC is antithetical to Rust's zero-cost goals. The community is currently working on things like deterministic collection of reference cycles, e.g. https://github.com/lopopolo/ferrocarril/blob/master/cactusre... which should address the same use cases perhaps even more effectively.
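
For context on how shared ownership is usually handled today without a tracing GC (this is plain standard-library Rust, not the cactusref approach linked above): strong Rc handles own the data, and back-references use Weak so no cycle keeps anything alive.

    // Plain std Rust (not cactusref): Rc for owning edges, Weak for
    // back-edges, so the graph is freed deterministically with no cycle leak.
    use std::cell::RefCell;
    use std::rc::{Rc, Weak};

    struct Node {
        value: i32,
        parent: RefCell<Weak<Node>>,      // back-edge: does not keep parent alive
        children: RefCell<Vec<Rc<Node>>>, // owning edges
    }

    fn main() {
        let root = Rc::new(Node {
            value: 0,
            parent: RefCell::new(Weak::new()),
            children: RefCell::new(vec![]),
        });
        let child = Rc::new(Node {
            value: 1,
            parent: RefCell::new(Rc::downgrade(&root)),
            children: RefCell::new(vec![]),
        });
        root.children.borrow_mut().push(Rc::clone(&child));

        let parent_value = child.parent.borrow().upgrade().map(|p| p.value);
        println!("root = {}, child's parent = {:?}", root.value, parent_value);
        // Everything is dropped deterministically at end of scope; the Weak
        // back-edge means there is no Rc cycle to leak.
    }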


Right, but if productivity suffers versus what tracing GC + value types + manual memory in unsafe code are capable of, then Rust will mostly take over device drivers and kernel code, not userspace apps.

Which is what is currently happening to C++ across all major OS SDKs, being driven down the OS stack.


I hope so! The Rust leadership's quixotic quest to turn it into a language for the server-side may delay this, though.


> Rust offers much lower TCO

Just to satisfy my curiosity, since I’m not that familiar with distributed computing, how is tail call optimization relevant in this area?


It's total cost of ownership. Spark is a beast; half the resources are taken up by the platform itself.


Really? Can you explain "half the resources are taken up by the platform itself" a bit more? Do you have some numbers or experiments?


Running Hadoop nodes, name nodes, YARN masters. These are massive, complicated processes that cumulatively are usually more complicated and expensive than the Spark jobs themselves. At least in the cases I've seen.


Actually, none of these processes is essential for running Spark unless you have to read data from HDFS and have to run Spark on YARN.

Spark supports reading data from different sources, e.g., cloud blob storage, relational databases, NoSQL systems like C* and HBase. NameNode is only required if your data is stored on HDFS, and that is not an essential problem of Spark.

As for scheduling, Spark can run in standalone mode without any YARN components. Actually, that is how Spark clusters run in Databricks.


> Rust offers much lower TCO compared to current industry best practices in this area.

The more commoditised the compute resources are, the more significant the cost of human expertise as part of TCO.

Does Rust really offer an easier path for the coder to model data analytics? E.g., for a given model, is (time to implement × cost/hr) for a Rust coder less than that of a Java one?


> E.g., for a given model, is (time to implement × cost/hr) for a Rust coder less than that of a Java one?

    time_to_implement * cost/hr + cost_of_bug_damage * P(bug) + time_to_fix_bugs * cost/hr + cost_of_deployment_hw
The dream of Rust is that P(bug) and cost_of_deployment_hw are less.


Assuming what you're saying is correct, then the best thing would be to port Apache Spark from Scala to Rust (or Go?).


That's kinda what I'm trying to do here. Ballista is definitely very much inspired by Apache Spark but not a direct port. Spark has some design choices that were very much driven from being a JVM-first platform e.g. the way lambdas are serialized. When I eventually get to support custom user code in distributed execution I don't want it to be limited to Rust, but that's a topic for another blog post later this year.


Andy, I see Spark as essentially a monad that lifts functions into its execution context, then schedules the execution on remote nodes, and gathers the results.

How can you do that in an AOT language? Do you just express your functions at a higher level, above the layer of containers?


Consider just-in-time compilation, which can very much be used from AOT languages [0]. That could be one avenue, as long as the compiled code works on each machine (common instruction set), or maybe something slightly higher-level like LLVM IR could be produced, transmitted, and then compiled on the driver/executor.

Another approach could simply be to transmit the function source to the driver, compile it, then distribute it to the executors assuming they're the same architecture, or compile it on the executors if not.

There are certainly options.

[0]: For example see https://llvm.org/docs/tutorial/BuildingAJIT3.html


It's not possible to do a direct port, since Spark relies on Java serialisation to send executable code and data (closures) over the wire to workers. This relies on bytecode; Rust is compiled ahead of time.

So you can sort of see how the essence of Spark can be ported; it's essentially a monad. But a direct port to Rust is not possible, because Rust uses AOT compilation and the JVM was built to load code dynamically.


But claiming that "X software is written in Y language/framework" says nothing about efficiency or safety. It's just meaningless marketing piggy-backing on popular buzzwords.

And claims about "the future" are simply absurd. Oddly enough, this link appears right next to another story on how Cobol powers the world's economy.

Frankly, I'm surprised blockchain wasn't shoved somewhere in the announcement.


Check out my previous benchmarks on Rust-based DataFusion vs JVM-based Apache Spark workloads. It isn't just about raw performance numbers but also about resource requirements (especially RAM requirements when using JVM or similar GC-based languages). There are order-of-magnitude differences.


If you have tangible results then present your benchmarks. If you limit your marketing to empty claims regarding "the future" and vague assertions on performance then you're actually actively working to lower your credibility.


This is a personal open source project. I'm not sure I'm "marketing" it since I make zero dollars from this work.

For benchmarks, check out the past 18 months of posts on my blog. Here is the most recent:

https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/


[flagged]


Benchmarks are rarely fair. They are often biased in favor of those producing the benchmarks, whether intentionally or not.

Whether you want to explore this more or not is your choice. Nobody is asking you to engage in this conversation. If you are not interested and/or are not enjoying the discourse here, feel free to move along.


Please, stop it.


Don't you think you are being unreasonably harsh? Why are you attacking somebody's open source project with such passion?


Thanks eklavya but it's completely normal on HN to have trolls criticizing other people's work. Best thing is to just ignore them. They can move on to criticize other posts. Much easier than contributing.


"But claiming that "X software is written in Y language/framework" says nothing about efficiency or safety"

Except for the fact that it does in some cases. Rust is definitely safer than C or C++ (unless the Rust code is explicitly using the unsafe keyword).
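
A tiny illustration of what that safety means in practice: this hypothetical snippet is rejected outright by the Rust compiler, whereas the equivalent C/C++ would build and then exhibit undefined behaviour at runtime.

    // This does NOT compile: the borrow checker rejects destroying `s`
    // while the reference `r` is still live.
    fn main() {
        let s = String::from("hello");
        let r = &s;          // `r` borrows `s`
        drop(s);             // error: cannot move out of `s` because it is borrowed
        println!("{}", r);   // in C/C++ this would be a use-after-free
    }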


[flagged]


I'm sensing that you are not familiar with Rust and the safety that it introduces compared to C/C++.


Some things definitely do become magically safe when the language you are using tightly restricts you from doing very common and unsafe things!


I admit I haven't benchmarked against COBOL blockchains.



