I must confess I miss native execution of (big) data jobs.
I know that shipping JVM bytecode between nodes is portable, but nowadays nobody has mixed-architecture (Intel/MIPS/SPARC/ARM) servers, so... why do we need a bytecode execution layer?
Maybe I am being too critical, but bare metal - hypervisor - JVM - app looks like too many layers to me.
Later in the day I realized it could be done from the command line with 'cut' and 'grep'. It took less than 15 minutes.
Dask deploys pretty well on k8s - https://kubernetes.dask.org/en/latest/
Semi-related question: How do you expect Arrow to be integrated into the larger data science landscape?
Will it mostly be used as a go-between format? Will new libraries use it internally while old libraries just read it and translate it to a native format? Do you think established libraries will change their back-end to Arrow? Is that even feasible with e.g. Pandas (or are you too far from their governance now to say)?
Thanks for the link. I will read about their Kubernetes support.
I use Presto all the time and I love how fully-featured it is, but garbage collection is a non-trivial component of time-to-execute for my queries.
I do have a few years of experience with Spark and Hadoop, if that's worth anything.
I do, however, worry that Rust has a high barrier to entry.
In fact, I think the time to begin that type of project is now, because by the time you do the groundwork for such a project, integrate it, etc., a WASM engine will be ready.
Apache Spark is written in Scala, and I wouldn't describe that as having an 'easy' learning curve, even compared to Rust! If you want something 'easy' on the JVM, Kotlin (and perhaps Eta or Frege) might be more appropriate.
If Scala is used as a "better Java" to start with, and the developers then explore FP at their own pace, I think there's a better outcome. Granted, Java 8+ has done some work to close the gap with Scala as well.
I wonder if Scala Native will absorb some Rust-ish features, as other languages have or are attempting to...
(What I said above echoes Odersky's ideas about Scala developer levels: https://www.scala-lang.org/old/node/8610.html)
So then, would Rust be better than a JVM language for a distributed compute framework like Apache Spark?
Based on what others said in this thread, these are the primary arguments for Rust:
1. JVM GC overhead
2. JVM GC pauses
3. JVM memory overhead
4. Native code (i.e. Rust) has better raw performance than a JVM language
My take on it:
(1) I believe Spark basically wrote its own memory management layer with Unsafe that lets it bypass the GC, so for Dataframe/SQL we might be OK here. Hopefully value types are coming to Java/Scala soon.
(2) The majority of Apache Spark use cases are batch, right? In that case, who cares about a little stop-the-world pause here and there, as long as we're optimizing the GC for throughput? I recognize that streaming is also a thing, so maybe a non-GC language like Rust is better suited for latency-sensitive streaming workloads. Perhaps the Shenandoah GC would be of help here.
(3) What's the memory overhead of a JVM process, 100-200 MB? That doesn't seem too bad to me when clusters these days have terabytes of memory.
(4) I wonder how much of an impact performance improvements from Rust will have over Spark's optimized code generation, which basically converts your code into array loops that exploit cache locality, loop unrolling, and SIMD. I imagine that most of the gains from a Rust rewrite would come from these "bare metal" techniques, so it might be the case that Spark already has that going for it...
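As a rough, hypothetical illustration of the kind of loop both approaches are chasing (this is not Spark's actual generated code), a columnar kernel in Rust gets those techniques from the compiler in a plain release build:

```rust
// Hypothetical columnar kernel: compiled with --release, rustc/LLVM will
// typically unroll and auto-vectorize this loop (SIMD plus cache-friendly
// sequential access) - the same "bare metal" techniques Spark's
// whole-stage code generation works to approximate on the JVM.
fn sum_where_positive(values: &[i64]) -> i64 {
    let mut acc = 0i64;
    for &v in values {
        // Branch-free select keeps the loop vectorizable.
        acc += if v > 0 { v } else { 0 };
    }
    acc
}

fn main() {
    let col: Vec<i64> = (-1_000..1_000).collect();
    println!("{}", sum_where_positive(&col));
}
```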
Having said that, I can't think of any reason why a compute engine in Rust is a bad idea. Developer productivity and ecosystem, perhaps?
The memory overhead of Spark in particular (not just the JVM) is very high. In some cases it uses close to 100x more memory than equivalent query execution with DataFusion. Also, you might be interested to see my original blog post with some of my thoughts on this.
Even for batch processing, long GCs can be bad. It's not just processing the batch that stops, but the whole world. Anything trying to listen for more data, keep track of time elapsed, etc is going to run across more problems. It can also expose some race conditions that would normally be so unlikely that you'd never hit them.
Static JVM memory overhead isn't bad at all, probably even under your 100-200 MB guess. The lack of compact primitives adds some additional proportional overhead, but the biggest factor is just the extra "slack" space needed for good GC performance. Depending on the circumstances and requirements, that could be 30-400%.
SQL is great for expressing relational algebra to transform tables, but its limited support for variables and control-flow constructs makes it less than ideal for complex, multi-step data analysis scripts. And when it comes to running statistical tests, regressions, or training ML models, it's wholly inappropriate.
Rust is a very expressive systems programming language, but it's unclear at this point how good of a fit it can be for data analysis and statistical programming tasks. It doesn't have much in the way of data science libraries yet.
Would you potentially add e.g. a Python interpreter on top of such a framework, or would you focus on building out a more fully-featured Rust API for data analysis, and even go so far as to suggest that data scientists start to learn and use Rust? (There is some precedent for this with Scala and Spark.)
It's true that the Rust ecosystem for data science is not really there yet and I am trying to inspire people to start changing that. I think Rust does have some good potential here.
In the nearer term, though, I am exploring options around supporting user code in distributed query execution, and I don't want to limit it to Rust.
We could probably even use a single node for some of our workflows, just using DataFusion directly, with the same performance as a distributed Spark job.
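For context, a minimal single-node DataFusion query looks roughly like this (modelled on the project's README example; exact names such as SessionContext and the async API vary by DataFusion version, and example.csv is a placeholder):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Register a CSV file as a queryable table ("example.csv" is a placeholder).
    ctx.register_csv("example", "example.csv", CsvReadOptions::new())
        .await?;
    // Run a SQL query and print the results - all in-process, no cluster.
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a").await?;
    df.show().await?;
    Ok(())
}
```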
Why would this be a desired feature? The Python ecosystem is a mess; simply having an interpreter in no way means that you would be able to get your Python script to run. I think it's much better to simply make a system call to a Python script against the Python stack in $PATH, e.g. use a container-based approach.
1. there will be a GC to manage memory
2. memory will be managed
The GC slows things down, so Spark tries to work around this by accessing "off heap" memory (which really just means off the JVM's heap and on the OS's). So you end up getting OOM errors if you give too little to the JVM, or if you give too little to the OS. It's a hacky balancing act to get native access to memory, which comes for free with Rust.
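For comparison, here is a minimal sketch of what ownership buys you in Rust: allocation and deallocation are deterministic, with no separate GC heap to size against the OS:

```rust
// In Rust the buffer's memory goes back to the allocator deterministically
// when its owner goes out of scope - there is no GC-heap-vs-OS split to tune.
fn process() {
    let buf: Vec<u8> = vec![0u8; 64 * 1024 * 1024]; // 64 MiB working buffer
    println!("allocated {} bytes", buf.len());
} // `buf` is dropped here and the memory is freed immediately

fn main() {
    process();
    // By this point the 64 MiB is already back with the allocator.
}
```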
The first EA release for value types support is out now.
I firmly believe that, in the end, tracing GC with value types and local ownership, as D, Swift, and C# are pursuing, will win, at least in the realm of userspace programming.
If Java can have struct access into native memory (without the presently complicated API), that would open a lot of doors and get it to near feature parity with the CLR.
And follow the links from there, including the wiki page describing the EA release. Unfortunately, my submission did not gather much uptake.
It is taking all this time because the team wants pre-value-types JARs to keep running unmodified in the new world, which is not an easy engineering task.
I would like to see the industry move away from the JVM to less memory-hungry, AOT-compiled languages. Too much of the Hadoop world simply assumes that you'll be using a JVM language and doesn't provide non-JVM APIs at all. It's been a while since I looked, but I think Apex and Flink were basically useless if you didn't use a JVM language or Python.
Of all the Apache projects related to Hadoop, I believe Beam is the only one that has Go support (for implementing processing logic), for example.
If you weren't using Kubernetes, what orchestrator would you be interested in?
I haven't met a single Kubernetes user who ran it in production and did not have stability or performance issues with it. This is why I mentioned that I am not ready to trade my Hadoop outages and performance issues for Kubernetes ones.
Rust offers much lower TCO compared to current industry best practices in this area.
When embarking on a non-trivial project, it's good to know who the prospective end user is. Helps shape the product requirements. Looking at your feature list for V1, it can easily take a team of 5-7 excellent engineers with strong, relevant domain expertise a couple of years to do most of the items on it well if your end user is "anybody". Some items on it (such as distributed query optimization) are open research problems. Some (distributed query execution, particularly if you want to support high performance multi-way joins) can take years all on their own.
If it's something more specialized, it could take a lot less time and effort. I say this as someone who's "been there done that" (at Google). We had to support a broad range of use cases, but we also had some of the best engineers in the world, some with decades of experience in the problem domain, and it still took years to get the product (BigQuery) out the door. This was, to a large extent, because the requirements were so amorphous. First version of what later became BigQuery was built in a few months by a much smaller team, because it was targeted at a much more specialized use case: querying structured application logs in (almost) realtime.
It's much easier to not try to be all things to all people.
I have multiple goals here.
Goal #1 is to demonstrate how Rust can be effective at data science / analytics and evangelize a bit.
Goal #2 is to have fun trying to build some of this myself and improve my skills as a software engineer.
Goal #3 (getting a little more vague now) is to ensure that in some hypothetical future I can get a job working with Rust instead of JVM/Apache Spark.
I think the hard-ass responses in the thread (all unwarranted) simply misunderstand the goals.
Really enjoyed the article. Keep it up!
Is there more about this story that you could link me to? Or would reading the whitepaper provide most of the interesting detail of how a log query engine became a general-purpose query product?
Currently I'd rather wait for Java to get its value-types story properly done.
For me, an ergonomic future is a tracing GC, value types, and manual memory management in hot code paths, as D, Eiffel, and Modula-3 are capable of, improved with ownership rules.
Which is what is currently happening to C++ across all major OS SDKs: it is being driven down the OS stack.
Just to satisfy my curiosity, since I’m not that familiar with distributed computing, how is tail call optimization relevant in this area?
Spark supports reading data from different sources, e.g., cloud blob storage, relational databases, and NoSQL systems like C* and HBase. The NameNode is only required if your data is stored on HDFS, and that is not an essential problem of Spark.
As for scheduling, Spark can run in standalone mode without any YARN components. Actually, that is how Spark clusters run in Databricks.
The more commoditised the compute resources are, the more significant the cost of human expertise as part of TCO.
Does Rust really offer an easier path for the coder to model data analytics? E.g., for a given model, is (time to implement × cost/hr) for a Rust coder less than that of a Java one?
time_to_implement * cost/hr + cost_of_bug_damage * P(bug) + time_to_fix_bugs * cost/hr + cost_of_deployment_hw
How can you do that in an AOT language? Do you just express your functions at a higher level, above the layer of containers?
Another approach could simply be to transmit the function source to the driver, compile it, then distribute it to the executors, assuming they're the same architecture, or compile it on the executors if not.
There are certainly options.
For example, see https://llvm.org/docs/tutorial/BuildingAJIT3.html
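As a hedged sketch of the "distribute a compiled artifact, then load it" half of that idea, assuming the UDF was built as a cdylib and already shipped to the executor, and using the libloading crate (the library path and the `udf` symbol are hypothetical):

```rust
use libloading::{Library, Symbol};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Loading and calling foreign code is inherently unsafe: the loader
    // cannot verify the symbol's real signature.
    unsafe {
        // "./libudf.so" is a placeholder; the driver would ship this
        // compiled artifact to each executor beforehand.
        let lib = Library::new("./libudf.so")?;
        // The exported name `udf` and its signature are assumptions.
        let udf: Symbol<unsafe extern "C" fn(i64) -> i64> = lib.get(b"udf")?;
        println!("udf(21) = {}", udf(21));
    }
    Ok(())
}
```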
So you can sort of see how the essence of Spark can be ported - it's essentially a monad. But a direct port to Rust is not possible, because Rust uses AOT compilation while the JVM was built to load code dynamically.
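To make the "essentially a monad" point concrete, here is a toy, purely illustrative sketch of lazy transformation chaining in Rust (none of these names are Spark's):

```rust
// A dataset that only records its transformations (the lineage) and defers
// all work until collect() - loosely analogous to an RDD's lazy map.
struct Dataset<T: 'static> {
    compute: Box<dyn Fn() -> Vec<T>>,
}

impl<T: 'static> Dataset<T> {
    fn from_vec(data: Vec<T>) -> Self
    where
        T: Clone,
    {
        Dataset { compute: Box::new(move || data.clone()) }
    }

    // map() builds a new deferred computation; nothing runs yet.
    fn map<U: 'static>(self, f: impl Fn(T) -> U + 'static) -> Dataset<U> {
        Dataset {
            compute: Box::new(move || (self.compute)().into_iter().map(&f).collect()),
        }
    }

    // collect() is the "action" that finally executes the chain.
    fn collect(self) -> Vec<T> {
        (self.compute)()
    }
}

fn main() {
    let result = Dataset::from_vec(vec![1, 2, 3])
        .map(|x| x * 2) // recorded, not executed
        .collect();     // execution happens here
    println!("{:?}", result); // [2, 4, 6]
}
```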
And claims about "the future" are simply absurd. Oddly enough, this link appears right next to another story on how Cobol powers the world's economy.
Frankly, I'm surprised blockchain wasn't shoved somewhere in the announcement.
For benchmarks, check out the past 18 months of posts on my blog. Here is the most recent:
Except for the fact that it does in some cases. Rust is definitely safer than C or C++ (unless the Rust code explicitly uses the unsafe keyword).
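A tiny illustration of where that line is drawn: the compiler rejects raw-pointer dereferences outside an `unsafe` block, so any waived guarantees are always explicitly marked:

```rust
fn main() {
    let x: i32 = 42;
    let p: *const i32 = &x;
    // Dereferencing a raw pointer does not compile in safe Rust;
    // the `unsafe` block marks exactly where the guarantees are waived.
    let y = unsafe { *p };
    println!("{}", y);
}
```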