Rust offers a much lower total cost of ownership (TCO) than current industry best practices in this area.
When embarking on a non-trivial project, it's good to know who the prospective end user is; that helps shape the product requirements. Looking at your feature list for V1, it can easily take a team of 5-7 excellent engineers with strong, relevant domain expertise a couple of years to do most of the items on it well if your end user is "anybody". Some items on it (such as distributed query optimization) are open research problems. Some (distributed query execution, particularly if you want to support high-performance multi-way joins) can take years all on their own.
If it's something more specialized, it could take a lot less time and effort. I say this as someone who's "been there, done that" (at Google). We had to support a broad range of use cases, but we also had some of the best engineers in the world, some with decades of experience in the problem domain, and it still took years to get the product (BigQuery) out the door. This was, to a large extent, because the requirements were so amorphous. The first version of what later became BigQuery was built in a few months by a much smaller team, because it was targeted at a much more specialized use case: querying structured application logs in (almost) real time.
It's much easier to not try to be all things to all people.
I have multiple goals here.
Goal #1 is to demonstrate how Rust can be effective at data science / analytics and evangelize a bit.
Goal #2 is to have fun trying to build some of this myself and improve my skills as a software engineer.
Goal #3 (getting a little more vague now) is to ensure that in some hypothetical future I can get a job working with Rust instead of JVM/Apache Spark.
I think the hard-ass responses in the thread (all unwarranted) simply misunderstand the goals.
Really enjoyed the article. Keep it up!
Is there more about this story that you could link me to? Or would reading the whitepaper provide most of the interesting detail of how a log query engine became a general-purpose query product?
Currently I'd rather wait for Java to get its value-types story properly done.
For me, an ergonomic future is a tracing GC, value types, and manual memory management in hot code paths (as D, Eiffel, and Modula-3 are capable of), improved with ownership rules.
That is what is currently happening to C++ across all major OS SDKs: it is being driven down the OS stack.
Just to satisfy my curiosity, since I’m not that familiar with distributed computing, how is tail call optimization relevant in this area?
Spark supports reading data from different sources, e.g., cloud blob storage, relational databases, and NoSQL systems like C* and HBase. A NameNode is only required if your data is stored on HDFS, and that is a property of HDFS, not a problem inherent to Spark.
As for scheduling, Spark can run in standalone mode without any YARN components. Actually, that is how Spark clusters run in Databricks.
The more commoditised the compute resources are, the more significant the cost of human expertise as part of TCO.
Does Rust really offer an easier path for the coder to model data analytics? E.g., for a given model, is (time to implement × cost/hr) for a Rust coder less than that of a Java one?
total_cost = time_to_implement * cost/hr
           + cost_of_bug_damage * P(bug)
           + time_to_fix_bugs * cost/hr
           + cost_of_deployment_hw
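As a back-of-the-envelope illustration, the formula above can be evaluated directly; all figures below are hypothetical placeholders, not real measurements, and the variable names just mirror the formula:

```rust
// Hypothetical TCO model mirroring the formula above. Every number
// here is made up purely to illustrate the arithmetic.
struct CostModel {
    time_to_implement_hrs: f64,
    cost_per_hr: f64,
    cost_of_bug_damage: f64,
    p_bug: f64,
    time_to_fix_bugs_hrs: f64,
    cost_of_deployment_hw: f64,
}

impl CostModel {
    fn total_cost(&self) -> f64 {
        self.time_to_implement_hrs * self.cost_per_hr
            + self.cost_of_bug_damage * self.p_bug
            + self.time_to_fix_bugs_hrs * self.cost_per_hr
            + self.cost_of_deployment_hw
    }
}

fn main() {
    // Hypothetical scenario: Rust takes longer to implement but has a
    // lower bug probability and needs less deployment hardware.
    let rust = CostModel {
        time_to_implement_hrs: 400.0,
        cost_per_hr: 100.0,
        cost_of_bug_damage: 50_000.0,
        p_bug: 0.1,
        time_to_fix_bugs_hrs: 40.0,
        cost_of_deployment_hw: 10_000.0,
    };
    let java = CostModel {
        time_to_implement_hrs: 300.0,
        cost_per_hr: 100.0,
        cost_of_bug_damage: 50_000.0,
        p_bug: 0.3,
        time_to_fix_bugs_hrs: 120.0,
        cost_of_deployment_hw: 25_000.0,
    };
    println!("rust: {}, java: {}", rust.total_cost(), java.total_cost());
}
```

The point of the sketch is that a higher implementation cost can still win once bug probability and hardware costs enter the sum; whether it actually does depends entirely on the numbers you plug in.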
How can you do that in an AOT-compiled language? Do you just express your functions at a higher level, above the layer of containers?
Another approach could simply be to transmit the function source to the driver, compile it there, and then distribute the binary to the executors, assuming they share the same architecture, or compile it on the executors if not.
There are certainly options.
For example, see https://llvm.org/docs/tutorial/BuildingAJIT3.html
So you can sort of see how the essence of Spark can be ported. It's essentially a monad. But a direct port to Rust is not possible, because Rust uses ahead-of-time compilation while the JVM was built to load code dynamically.
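One common workaround for the AOT constraint is to ship a serializable description of the computation rather than compiled code, and have every executor interpret (or JIT-compile) that description. Here is a minimal sketch of the idea; all types and names are hypothetical illustrations, not a real engine's API:

```rust
// Sketch: instead of shipping a compiled closure, ship a small,
// serializable expression tree that every executor can interpret.
#[derive(Debug, Clone)]
enum Expr {
    Column(usize),              // reference an input column by index
    Literal(i64),               // constant value
    Add(Box<Expr>, Box<Expr>),  // a + b
    Mul(Box<Expr>, Box<Expr>),  // a * b
}

impl Expr {
    // An executor evaluates the tree against one row of input.
    fn eval(&self, row: &[i64]) -> i64 {
        match self {
            Expr::Column(i) => row[*i],
            Expr::Literal(v) => *v,
            Expr::Add(a, b) => a.eval(row) + b.eval(row),
            Expr::Mul(a, b) => a.eval(row) * b.eval(row),
        }
    }
}

fn main() {
    // "col0 * 2 + col1" expressed as data, not code.
    let plan = Expr::Add(
        Box::new(Expr::Mul(
            Box::new(Expr::Column(0)),
            Box::new(Expr::Literal(2)),
        )),
        Box::new(Expr::Column(1)),
    );
    // In a real system the plan would be serialized (e.g. to protobuf)
    // and sent over the wire to executors; here we evaluate it locally.
    let row = [10, 5];
    println!("{}", plan.eval(&row)); // 25
}
```

Because the plan is plain data, it sidesteps the "ship a closure" problem entirely: nothing architecture-specific crosses the wire, at the cost of building an interpreter (or JIT) on the executor side.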
And claims about "the future" are simply absurd. Oddly enough, this link appears right next to another story about how COBOL powers the world's economy.
Frankly, I'm surprised blockchain wasn't shoved somewhere in the announcement.
For benchmarks, check out the past 18 months of posts on my blog. Here is the most recent:
Whether you want to explore this more or not is your choice. Nobody is asking you to engage in this conversation. If you are not interested and/or are not enjoying the discourse here, feel free to move along.
Except that it does in some cases. Rust is definitely safer than C or C++ (unless the Rust code explicitly uses the unsafe keyword).
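A minimal illustration of where that line sits: safe Rust turns an out-of-bounds access into a checked `None` or a deterministic panic, while the `unsafe` escape hatch lets you skip the check and reinstate C-style undefined behavior:

```rust
fn main() {
    let v = vec![1, 2, 3];

    // Safe: bounds-checked access returns an Option; no UB possible.
    assert_eq!(v.get(2), Some(&3));
    assert_eq!(v.get(10), None); // out of bounds is just None

    // Safe: indexing with `v[10]` would panic deterministically
    // rather than read arbitrary memory.

    // Unsafe: get_unchecked skips the bounds check. With a bad index
    // this would be undefined behavior, exactly like C.
    let x = unsafe { *v.get_unchecked(1) };
    assert_eq!(x, 2);
}
```

Note that `unsafe` here is a valid, in-bounds use; the point is that the compiler no longer proves it for you, which is exactly the guarantee the parent comment is talking about.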