I must confess I miss native execution of (big) data jobs.
I know that shipping JVM bytecode between nodes is portable, but nowadays nobody has mixed-architecture (Intel/MIPS/SPARC/ARM) servers, so... why do we need a bytecode execution layer?
Maybe I am being too critical, but bare metal - hypervisor - JVM - app looks like too many layers to me.
Later in the day I realized it could be done from the command line with 'cut' and 'grep'. It took less than 15 minutes.
Dask deploys pretty well on k8s - https://kubernetes.dask.org/en/latest/
Semi-related question: How do you expect Arrow to be integrated into the larger data science landscape?
Will it mostly be used as a go-between format? Will new libraries use it internally while old libraries just read it and translate it to a native format? Do you think established libraries will change their back-end to Arrow? Is that even feasible with e.g. Pandas (or are you too far from their governance now to say)?
Thanks for the link. I will read about their Kubernetes support.
I use Presto all the time and I love how fully-featured it is, but garbage collection is a non-trivial component of time-to-execute for my queries.
I do have a few years of experience with Spark and Hadoop, if that's worth anything.
I do, however, worry that Rust has a high barrier to entry.
In fact, I think the time to begin that type of project is now, because by the time you do the groundwork for such a project, integrate it, etc., a WASM engine will be ready.
Apache Spark is written in Scala, and I wouldn't describe that as having an 'easy' learning curve, even compared to Rust! If you want something 'easy' on the JVM, Kotlin (and perhaps Eta or Frege) might be more appropriate.
If Scala is used as a "better Java" to start with, and the developers then explore FP at their own pace, I think there's a better outcome. Granted, Java 8+ has done some work to close the gap with Scala as well.
I wonder if Scala Native will absorb some Rust-ish features, as other languages have or are attempting to...
(What I said above echoes Odersky's ideas about Scala developer levels: https://www.scala-lang.org/old/node/8610.html)
So then, would Rust be better than a JVM language for a distributed compute framework like Apache Spark?
Based on what others said in this thread, these are the primary arguments for Rust:
1. JVM GC overhead
2. JVM GC pauses
3. JVM memory overhead
4. Native code (i.e. Rust) has better raw performance than a JVM language
My take on it:
(1) I believe Spark basically wrote its own memory management layer with Unsafe that lets it bypass the GC, so for Dataframe/SQL we might be OK here. Hopefully value types are coming to Java/Scala soon.
(2) The majority of Apache Spark use cases are batch, right? In that case, who cares about a little stop-the-world pause here and there, as long as we're optimizing the GC for throughput? I recognize that streaming is also a thing, so maybe a non-GC language like Rust is better suited for latency-sensitive streaming workloads. Perhaps the Shenandoah GC would be of help here.
(3) What's the memory overhead of a JVM process, 100-200 MB? That doesn't seem too bad to me when clusters these days have terabytes of memory.
(4) I wonder how much of an impact performance improvements from Rust will have over Spark's optimized code generation, which basically converts your code into array loops that exploit cache locality, loop unrolling, and SIMD. I imagine that most of the gains from a Rust rewrite would come from these "bare metal" techniques, so it might be the case that Spark already has that going for it...
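As a rough, hypothetical illustration of the kind of loop both approaches are chasing (this is not Spark's actual generated code), a columnar kernel in Rust gets those techniques from the compiler in a plain release build:

```rust
// Hypothetical columnar kernel: compiled with --release, rustc/LLVM will
// typically unroll and auto-vectorize this loop (SIMD plus cache-friendly
// sequential access) - the same "bare metal" techniques Spark's
// whole-stage code generation works to approximate on the JVM.
fn sum_where_positive(values: &[i64]) -> i64 {
    let mut acc = 0i64;
    for &v in values {
        // Branch-free select keeps the loop vectorizable.
        acc += if v > 0 { v } else { 0 };
    }
    acc
}

fn main() {
    let col: Vec<i64> = (-1_000..1_000).collect();
    println!("{}", sum_where_positive(&col));
}
```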
Having said that, I can't think of any reason why a compute engine in Rust is a bad idea. Developer productivity and ecosystem, perhaps?
The memory overhead of Spark in particular (not just the JVM) is very high. In some cases it uses close to 100x more memory than equivalent query execution with DataFusion. Also, you might be interested to see my original blog post with some of my thoughts on this.
Even for batch processing, long GCs can be bad. It's not just processing the batch that stops, but the whole world. Anything trying to listen for more data, keep track of time elapsed, etc is going to run across more problems. It can also expose some race conditions that would normally be so unlikely that you'd never hit them.
Static JVM memory overhead isn't bad at all, probably even under your 100-200 MB guess. The lack of compact primitives adds some additional proportional overhead, but the biggest factor is just the extra "slack" space needed for good GC performance. Depending on the circumstances and requirements, that could be 30-400%.
SQL is great for expressing relational algebra to transform tables, but its limited support for variables and control-flow constructs makes it less than ideal for complex, multi-step data analysis scripts. And when it comes to running statistical tests, regressions, or training ML models, it's wholly inappropriate.
Rust is a very expressive systems programming language, but it's unclear at this point how good of a fit it can be for data analysis and statistical programming tasks. It doesn't have much in the way of data science libraries yet.
Would you potentially add e.g. a Python interpreter on top of such a framework, or would you focus on building out a more fully-featured Rust API for data analysis, and even go so far as to suggest that data scientists start to learn and use Rust? (There is some precedent for this with Scala and Spark.)
It's true that the Rust ecosystem for data science is not really there yet and I am trying to inspire people to start changing that. I think Rust does have some good potential here.
In the nearer term, though, I am exploring options around supporting user code in distributed query execution, and I don't want to limit it to Rust.
We could probably even use a single node for some of our workflows, just using DataFusion directly, with the same performance as a distributed Spark job.
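For context, a minimal single-node DataFusion query looks roughly like this (modelled on the project's README example; exact names such as SessionContext and the async API vary by DataFusion version, and example.csv is a placeholder):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Register a CSV file as a queryable table ("example.csv" is a placeholder).
    ctx.register_csv("example", "example.csv", CsvReadOptions::new())
        .await?;
    // Run a SQL query and print the results - all in-process, no cluster.
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a").await?;
    df.show().await?;
    Ok(())
}
```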
Why would this be a desired feature? The Python ecosystem is a mess; simply having an interpreter in no way means that you would be able to get your Python script to run. I think it's much better to simply make a system call to a Python script against the Python stack in $PATH, e.g. use a container-based approach.
1. there will be a GC to manage memory
2. memory will be managed
The GC slows things down, so Spark tries to work around this by accessing "off heap" memory (which really just means off the JVM's heap and on the OS's). So you end up getting OOM errors if you give too little to the JVM, or if you give too little to the OS. It's a hacky balancing act to get native access to memory, which comes for free with Rust.
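For comparison, here is a minimal sketch of what ownership buys you in Rust: allocation and deallocation are deterministic, with no separate GC heap to size against the OS:

```rust
// In Rust the buffer's memory goes back to the allocator deterministically
// when its owner goes out of scope - there is no GC-heap-vs-OS split to tune.
fn process() {
    let buf: Vec<u8> = vec![0u8; 64 * 1024 * 1024]; // 64 MiB working buffer
    println!("allocated {} bytes", buf.len());
} // `buf` is dropped here and the memory is freed immediately

fn main() {
    process();
    // By this point the 64 MiB is already back with the allocator.
}
```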
The first EA release for value types support is out now.
I firmly believe that, in the end, tracing GC with value types and local ownership, as D, Swift, and C# are pursuing, will win, at least in the realm of userspace programming.
If Java can have struct access into native memory (without the presently complicated API), that would open a lot of doors and get it to near feature parity with the CLR.
And follow the links from there, including the wiki page describing the EA release. Unfortunately, my submission did not gather much uptake.
It is taking all this time because the team wants pre-value-types JARs to keep running unmodified in the new world, which is not an easy engineering task.
I would like to see the industry move away from the JVM to less memory-hungry, AOT-compiled languages. Too much of the Hadoop world simply assumes that you'll be using a JVM language and doesn't provide non-JVM APIs at all. It's been a while since I looked, but I think Apex and Flink were basically useless if you didn't use a JVM language or Python.
Of all the Apache projects related to Hadoop, I believe Beam is the only one that has Go support (for implementing processing logic), for example.
If you weren't using Kubernetes, what orchestrator would you be interested in?
I haven't met a single Kubernetes user who ran it in production and did not have stability or performance issues with it. This is why I mentioned that I am not ready to trade my Hadoop outages and performance issues for Kubernetes ones.
Rust offers much lower TCO compared to current industry best practices in this area.
When embarking on a non-trivial project, it's good to know who the prospective end user is. Helps shape the product requirements. Looking at your feature list for V1, it can easily take a team of 5-7 excellent engineers with strong, relevant domain expertise a couple of years to do most of the items on it well if your end user is "anybody". Some items on it (such as distributed query optimization) are open research problems. Some (distributed query execution, particularly if you want to support high performance multi-way joins) can take years all on their own.
If it's something more specialized, it could take a lot less time and effort. I say this as someone who's "been there done that" (at Google). We had to support a broad range of use cases, but we also had some of the best engineers in the world, some with decades of experience in the problem domain, and it still took years to get the product (BigQuery) out the door. This was, to a large extent, because the requirements were so amorphous. First version of what later became BigQuery was built in a few months by a much smaller team, because it was targeted at a much more specialized use case: querying structured application logs in (almost) realtime.
It's much easier to not try to be all things to all people.
I have multiple goals here.
Goal #1 is to demonstrate how Rust can be effective at data science / analytics and evangelize a bit.
Goal #2 is to have fun trying to build some of this myself and improve my skills as a software engineer.
Goal #3 (getting a little more vague now) is to ensure that in some hypothetical future I can get a job working with Rust instead of JVM/Apache Spark.
I think the hard-ass responses in the thread (all unwarranted) simply misunderstand the goals.
Really enjoyed the article. Keep it up!
Is there more about this story that you could link me to? Or would reading the whitepaper provide most of the interesting detail of how a log query engine became a general-purpose query product?
Currently I'd rather wait for Java to get its value-types story properly done.
For me, an ergonomic future is a tracing GC, value types, and manual memory management in hot code paths, as D, Eiffel, and Modula-3 are capable of, improved with ownership rules.
Which is what is currently happening to C++ across all major OS SDKs: it is being driven down the OS stack.
Just to satisfy my curiosity, since I’m not that familiar with distributed computing, how is tail call optimization relevant in this area?
Spark supports reading data from different sources, e.g., cloud blob storage, relational databases, and NoSQL systems like C* and HBase. The NameNode is only required if your data is stored on HDFS, and that is not an essential problem of Spark.
As for scheduling, Spark can run in standalone mode without any YARN components. Actually, that is how Spark clusters run in Databricks.
The more commoditised the compute resources are, the more significant the cost of human expertise as part of TCO.
Does Rust really offer an easier path for the coder to model data analytics? E.g., for a given model, is (time to implement × cost/hr) for a Rust coder less than that of a Java one?
time_to_implement * cost/hr + cost_of_bug_damage * P(bug) + time_to_fix_bugs * cost/hr + cost_of_deployment_hw
How can you do that in an AOT language? Do you just express your functions at a higher level, above the layer of containers?
Another approach could simply be to transmit the function source to the driver, compile it, then distribute it to the executors, assuming they're the same architecture, or compile it on the executors if not.
There are certainly options.
For example, see https://llvm.org/docs/tutorial/BuildingAJIT3.html
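As a hedged sketch of the "distribute a compiled artifact, then load it" half of that idea, assuming the UDF was built as a cdylib and already shipped to the executor, and using the libloading crate (the library path and the `udf` symbol are hypothetical):

```rust
use libloading::{Library, Symbol};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Loading and calling foreign code is inherently unsafe: the loader
    // cannot verify the symbol's real signature.
    unsafe {
        // "./libudf.so" is a placeholder; the driver would ship this
        // compiled artifact to each executor beforehand.
        let lib = Library::new("./libudf.so")?;
        // The exported name `udf` and its signature are assumptions.
        let udf: Symbol<unsafe extern "C" fn(i64) -> i64> = lib.get(b"udf")?;
        println!("udf(21) = {}", udf(21));
    }
    Ok(())
}
```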
So you can sort of see how the essence of Spark can be ported - it's essentially a monad. But a direct port to Rust is not possible, because Rust uses AOT compilation while the JVM was built to load code dynamically.
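To make the "essentially a monad" point concrete, here is a toy, purely illustrative sketch of lazy transformation chaining in Rust (none of these names are Spark's):

```rust
// A dataset that only records its transformations (the lineage) and defers
// all work until collect() - loosely analogous to an RDD's lazy map.
struct Dataset<T: 'static> {
    compute: Box<dyn Fn() -> Vec<T>>,
}

impl<T: 'static> Dataset<T> {
    fn from_vec(data: Vec<T>) -> Self
    where
        T: Clone,
    {
        Dataset { compute: Box::new(move || data.clone()) }
    }

    // map() builds a new deferred computation; nothing runs yet.
    fn map<U: 'static>(self, f: impl Fn(T) -> U + 'static) -> Dataset<U> {
        Dataset {
            compute: Box::new(move || (self.compute)().into_iter().map(&f).collect()),
        }
    }

    // collect() is the "action" that finally executes the chain.
    fn collect(self) -> Vec<T> {
        (self.compute)()
    }
}

fn main() {
    let result = Dataset::from_vec(vec![1, 2, 3])
        .map(|x| x * 2) // recorded, not executed
        .collect();     // execution happens here
    println!("{:?}", result); // [2, 4, 6]
}
```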
And claims about "the future" are simply absurd. Oddly enough, this link appears right next to another story on how Cobol powers the world's economy.
Frankly, I'm surprised blockchain wasn't shoved somewhere in the announcement.
For benchmarks, check out the past 18 months of posts on my blog. Here is the most recent:
Except for the fact that it does in some cases. Rust is definitely safer than C or C++ (unless the Rust code explicitly uses the unsafe keyword).
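A tiny illustration of where that line is drawn: the compiler rejects raw-pointer dereferences outside an `unsafe` block, so any waived guarantees are always explicitly marked:

```rust
fn main() {
    let x: i32 = 42;
    let p: *const i32 = &x;
    // Dereferencing a raw pointer does not compile in safe Rust;
    // the `unsafe` block marks exactly where the guarantees are waived.
    let y = unsafe { *p };
    println!("{}", y);
}
```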