
This is for people who want to live in a future where we use efficient and safe system-level languages for massively scalable distributed data processing. IMHO, Python and Java are not ideal language choices for these purposes.

Rust offers much lower TCO compared to current industry best practices in this area.




I get that Rust is exciting, but you aren't really building this for engineers, right?

When embarking on a non-trivial project, it's good to know who the prospective end user is. Helps shape the product requirements. Looking at your feature list for V1, it can easily take a team of 5-7 excellent engineers with strong, relevant domain expertise a couple of years to do most of the items on it well if your end user is "anybody". Some items on it (such as distributed query optimization) are open research problems. Some (distributed query execution, particularly if you want to support high performance multi-way joins) can take years all on their own.

If it's something more specialized, it could take a lot less time and effort. I say this as someone who's "been there done that" (at Google). We had to support a broad range of use cases, but we also had some of the best engineers in the world, some with decades of experience in the problem domain, and it still took years to get the product (BigQuery) out the door. This was, to a large extent, because the requirements were so amorphous. First version of what later became BigQuery was built in a few months by a much smaller team, because it was targeted at a much more specialized use case: querying structured application logs in (almost) realtime.

It's much easier to not try to be all things to all people.


Some great points there.

I have multiple goals here.

Goal #1 is to demonstrate how Rust can be effective at data science / analytics and evangelize a bit.

Goal #2 is to have fun trying to build some of this myself and improve my skills as a software engineer.

Goal #3 (getting a little more vague now) is to ensure that in some hypothetical future I can get a job working with Rust instead of JVM/Apache Spark.


Got it. Fun side project then! Good luck. Rust does seem like a good fit for something like this.

I think the hard-ass responses in the thread (all unwarranted) simply misunderstand the goals.


FWIW, I'm keeping your name in the back of my head for when my current company inevitably fails at their big data revolution. Our biggest product produces 10-15 PB/day that must be processed. Our existing solution is 20 years old and relies on navigating C code with function line counts on the order of 10k. Rust would've saved me a lot of time otherwise spent living in gdb.

Really enjoyed the article. Keep it up!


> First version of what later became BigQuery was built in a few months by a much smaller team, because it was targeted at a much more specialized use case: querying structured application logs in (almost) realtime.

Is there more about this story that you could link me to? Or would reading the whitepaper provide most of the interesting detail of how a log query engine became a general-purpose query product?


Rust still has to do some work regarding language ergonomics.

Currently I'd rather wait for Java to get its value types story properly done.

For me, an ergonomic future means a tracing GC, value types, and manual memory management in hot code paths, as D, Eiffel, and Modula-3 offer, improved with ownership rules.


Conventional tracing GC is antithetical to Rust's zero-cost goals. The community is currently working on things like deterministic collection of reference cycles, e.g. https://github.com/lopopolo/ferrocarril/blob/master/cactusre... which should address the same use cases perhaps even more effectively.
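
To make the problem concrete, here is a minimal sketch (an invented example, not code from the linked project) of how a reference cycle leaks under plain Rc today:

    use std::cell::RefCell;
    use std::rc::Rc;

    // A node that can point at another node, which allows cycles.
    struct Node {
        next: RefCell<Option<Rc<Node>>>,
    }

    impl Drop for Node {
        fn drop(&mut self) {
            println!("node dropped"); // never printed for the cycle below
        }
    }

    fn main() {
        let a = Rc::new(Node { next: RefCell::new(None) });
        let b = Rc::new(Node { next: RefCell::new(Some(a.clone())) });
        // Close the cycle: a -> b -> a. Both strong counts are now 2.
        *a.next.borrow_mut() = Some(b.clone());
        // When a and b go out of scope, each count only falls to 1,
        // so neither destructor runs and the pair is leaked.
    }

A deterministic cycle collector can reclaim that pair when it becomes unreachable, without paying for a global tracing GC.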


Right, but if productivity suffers compared to what tracing GC + value types + manual memory in unsafe code are capable of, then Rust will mostly take over device drivers and kernel code, not userspace apps.

Which is what is currently happening to C++ across all major OS SDKs, being driven down the OS stack.


I hope so! The Rust leadership's quixotic quest to turn it into a language for the server side may delay this, though.


> Rust offers much lower TCO

Just to satisfy my curiosity, since I’m not that familiar with distributed computing, how is tail call optimization relevant in this area?


It's total cost of ownership. Spark is a beast; half the resources are taken up by the platform itself.


Really? Can you explain "half the resources are taken up by the platform itself" in more detail? Do you have any numbers or experiments?


Running Hadoop nodes, NameNodes, YARN masters. These are massive, complicated processes that cumulatively are usually more complicated and expensive than the Spark jobs themselves. At least in the cases I've seen.


Actually, none of these processes is essential for running Spark unless you need to read data from HDFS or run Spark on YARN.

Spark supports reading data from different sources, e.g. cloud blob storage, relational databases, and NoSQL systems like C* and HBase. The NameNode is only required if your data is stored on HDFS, so that is not an inherent problem of Spark.

As for scheduling, Spark can run in standalone mode without any YARN components. Actually, that is how Spark clusters run in Databricks.


> Rust offers much lower TCO compared to current industry best practices in this area.

The more commoditised the compute resources are, the more significant the cost of human expertise becomes as part of TCO.

Does Rust really offer an easier path for the coder to model data analytics? E.g., for a given model, is time to implement × cost/hr for a Rust coder less than that of a Java one?


> E.g., for a given model, is time to implement × cost/hr for a Rust coder less than that of a Java one?

    time_to_implement * cost/hr + cost_of_bug_damage * P(bug) + time_to_fix_bugs * cost/hr + cost_of_deployment_hw
The dream of Rust is that P(bug) and cost_of_deployment_hw are lower.
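
Plugging in invented numbers just to make the trade-off concrete (every figure below is made up for illustration):

    Java: 100h * $100/hr + $50k * 0.10 + 40h * $100/hr + $20k = $39.0k
    Rust: 150h * $100/hr + $50k * 0.05 + 20h * $100/hr + $10k = $29.5k

Even with a 50% longer implementation time, a lower bug probability and a smaller hardware bill can make the total come out ahead.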


Assuming what you're saying is correct, then the best thing would be to port Apache Spark from Scala to Rust (or Go?).


That's kinda what I'm trying to do here. Ballista is definitely very much inspired by Apache Spark but is not a direct port. Spark has some design choices that were very much driven by being a JVM-first platform, e.g. the way lambdas are serialized. When I eventually get to supporting custom user code in distributed execution, I don't want it to be limited to Rust, but that's a topic for another blog post later this year.


Andy, I see Spark as essentially a monad that lifts functions into its execution context, then schedules the execution on remote nodes and gathers the results.

How can you do that in an AOT language? Do you just express your functions at a higher level, above the layer of containers?


Consider just-in-time compilation, which can also be used from AOT-compiled languages [0]. That could be one avenue, as long as the compiled code works on each machine (common instruction set). Or maybe something slightly higher-level like LLVM IR could be produced, transmitted, and then compiled on the driver/executor.

Another approach could simply be to transmit the function source to the driver, compile it, then distribute it to the executors (assuming they're the same architecture), or compile it on the executors if not.

There are certainly options.

[0]: For example see https://llvm.org/docs/tutorial/BuildingAJIT3.html
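
For the second approach, a minimal sketch of the driver side (the file paths, flags, and helper name are illustrative assumptions, not anything Ballista actually does):

    use std::fs;
    use std::process::Command;

    // Compile user-submitted source on the driver by shelling out to
    // rustc, assumed to be on PATH. The resulting shared library could
    // then be shipped to executors that share the driver's architecture.
    fn compile_user_fn(source: &str) -> std::io::Result<Vec<u8>> {
        fs::write("/tmp/user_fn.rs", source)?;
        let status = Command::new("rustc")
            .args(["--crate-type", "cdylib", "-O",
                   "-o", "/tmp/user_fn.so", "/tmp/user_fn.rs"])
            .status()?;
        assert!(status.success(), "user code failed to compile");
        fs::read("/tmp/user_fn.so") // bytes to distribute to executors
    }

In practice you'd also want sandboxing, caching, and version checks, but the core of it is just a compiler invocation plus a file transfer.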


It's not possible to do a direct port, since Spark relies on Java serialisation to send executable code and data (closures) over the wire to workers, which in turn relies on bytecode. Rust is compiled ahead of time.

So you can sort of see how the essence of Spark can be ported; it's essentially a monad. But a direct port to Rust is not possible, because Rust uses AOT compilation and the JVM was built to load code dynamically.
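
One way around that is to not ship closures at all: ship a serializable description of the computation and interpret (or compile) it on each worker. A minimal sketch of the idea (this Expr type is invented for illustration; Ballista/DataFusion define their own plan representations):

    // A tiny expression tree standing in for a query plan. Because it
    // is plain data rather than a closure, it can be serialized (e.g.
    // with serde or protobuf) and sent over the wire, unlike JVM bytecode.
    enum Expr {
        Column(usize), // index of an input column
        Literal(i64),
        Add(Box<Expr>, Box<Expr>),
        Mul(Box<Expr>, Box<Expr>),
    }

    // Each worker evaluates the plan against its own partition of rows.
    fn eval(expr: &Expr, row: &[i64]) -> i64 {
        match expr {
            Expr::Column(i) => row[*i],
            Expr::Literal(v) => *v,
            Expr::Add(l, r) => eval(l, row) + eval(r, row),
            Expr::Mul(l, r) => eval(l, row) * eval(r, row),
        }
    }

    fn main() {
        // (col0 + 10) * col1, expressed as data rather than a closure
        let plan = Expr::Mul(
            Box::new(Expr::Add(Box::new(Expr::Column(0)),
                               Box::new(Expr::Literal(10)))),
            Box::new(Expr::Column(1)),
        );
        assert_eq!(eval(&plan, &[5, 2]), 30);
    }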


But claiming that "X software is written in Y language/framework" says nothing about efficiency or safety. It's just meaningless marketing piggy-backing on popular buzzwords.

And claims about "the future" are simply absurd. Oddly enough, this link appears right next to another story on how COBOL powers the world's economy.

Frankly, I'm surprised blockchain wasn't shoved somewhere in the announcement.


Check out my previous benchmarks on Rust-based DataFusion vs JVM-based Apache Spark workloads. It isn't just about raw performance numbers but also about resource requirements (especially RAM requirements when using JVM or similar GC-based languages). There are order-of-magnitude differences.


If you have tangible results then present your benchmarks. If you limit your marketing to empty claims regarding "the future" and vague assertions on performance, then you're actively working to lower your credibility.


This is a personal open source project. I'm not sure I'm "marketing" it since I make zero dollars from this work.

For benchmarks, check out the past 18 months of posts on my blog. Here is the most recent:

https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/


[flagged]


Benchmarks are rarely fair. They are often biased in favor of those producing the benchmarks, whether intentionally or not.

Whether you want to explore this more or not is your choice. Nobody is asking you to engage in this conversation. If you are not interested and/or are not enjoying the discourse here, feel free to move along.


Please, stop it.


Don't you think you are being unreasonably harsh? Why are you attacking somebody's open source project with such passion?


Thanks eklavya, but it's completely normal on HN to have trolls criticizing other people's work. The best thing is to just ignore them. They can move on to criticizing other posts. Much easier than contributing.


"But claiming that "X software is written in Y language/framework" says nothing about efficiency or safety"

Except for the fact that it does in some cases. Rust is definitely safer than C or C++ (unless the Rust code is explicitly using the unsafe keyword).
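
As a concrete (invented) example, this use-after-free pattern compiles silently in C and C++ but rustc refuses to build it at all:

    fn main() {
        let r;
        {
            let s = String::from("hello");
            r = &s;
        } // `s` is dropped (freed) here
        println!("{}", r); // rejected at compile time:
                           // error[E0597]: `s` does not live long enough
    }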


[flagged]


I'm sensing that you are not familiar with Rust and the safety that it introduces compared to C/C++.


Some things definitely do become magically safe when the language you are using tightly restricts you from doing very common and unsafe things!


I admit I haven't benchmarked against COBOL blockchains.



