I'm familiar with Java-based tools for data processing, like Spring Reactor, Apache Beam, etc., and I'm trying to figure out how I can use Hydroflow instead.

To my understanding, I just implement the processing steps as small pieces and connect them through a common bus like Kafka. All those steps are independent binaries, and to build the final processing flow I run multiple instances of them, scaled manually depending on the task. For example, to do an aggregation / reduce I just make sure only one instance of that binary is running, and so on. All of it can be orchestrated and scaled with Kubernetes or similar. Is that how I'm supposed to design the processing?
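To make it concrete, each step I have in mind would be its own small binary roughly like this (just a sketch of the Kafka glue, assuming the rdkafka crate and made-up topic names "in" / "out"; the uppercase transform is only a placeholder for whatever the step actually does):

    use rdkafka::config::ClientConfig;
    use rdkafka::consumer::{Consumer, StreamConsumer};
    use rdkafka::producer::{FutureProducer, FutureRecord};
    use rdkafka::Message;
    use std::time::Duration;

    #[tokio::main]
    async fn main() {
        // One processing step: consume from "in", transform, publish to "out".
        let consumer: StreamConsumer = ClientConfig::new()
            .set("bootstrap.servers", "localhost:9092")
            .set("group.id", "step-1")
            .create()
            .expect("consumer creation failed");
        consumer.subscribe(&["in"]).expect("subscribe failed");

        let producer: FutureProducer = ClientConfig::new()
            .set("bootstrap.servers", "localhost:9092")
            .create()
            .expect("producer creation failed");

        loop {
            let msg = consumer.recv().await.expect("kafka error");
            if let Some(Ok(payload)) = msg.payload_view::<str>() {
                // Placeholder for the actual processing done by this step.
                let transformed = payload.to_uppercase();
                producer
                    .send(
                        FutureRecord::to("out").key("").payload(&transformed),
                        Duration::from_secs(0),
                    )
                    .await
                    .expect("send failed");
            }
        }
    }

Scaling would then just be running more or fewer copies of each such binary.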




Yes, that is correct for how you would have to set up a multi-node system for now. This is in contrast with Beam, Spark, etc., where the runtime deployment management/coordination is perhaps the biggest, most important part of the product. Hydroflow aims to be a lot lower-level than that, with no opinions on coordination and networking. For example, we'd want it to be possible to implement the coordination mechanisms of one of those systems (Beam, Spark, MapReduce, Materialize) in Hydroflow.
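Concretely, each of those per-step binaries would wrap a small self-contained dataflow, something like the word-count style examples in the Hydroflow book (a sketch only; the operator names and the fold_keyed signature are from the surface-syntax docs and may differ between releases, and source_iter stands in for whatever Kafka/network input you wire up yourself):

    use hydroflow::hydroflow_syntax;

    fn main() {
        // A single-process "reduce" step; networking/coordination is up to you.
        let mut flow = hydroflow_syntax! {
            source_iter(vec!["a", "b", "a", "c", "a"])  // stand-in for a Kafka/TCP source
                -> map(|word| (word, 1))
                -> fold_keyed(|| 0, |acc: &mut i32, n| *acc += n)  // per-key aggregation
                -> for_each(|(word, count)| println!("{}: {}", word, count));
        };
        flow.run_available();
    }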

We do have a tool, Hydro Deploy, to set up and manage Hydro cluster deployments on GCP (other clouds later), but it's mainly for running experiments rather than long-running applications.

The long-term idea is that the Hydro stack will determine how to distribute and scale programs as part of the compilation process. Some of that will be rewriting single-node Hydroflow programs into multi-node ones.


Thank you. I actually like the idea that I can split the processing into independent binaries and scale them manually. And especially the fact that I can implement it in Rust.

Btw, another question. Things like Flink provide a dashboard to analyze the processing flow / graph, to see the bottlenecks, etc. To my understanding Hydroflow doesn't provide that yet, but I'm curious whether you're working on something like it, or whether it provides other kinds of metrics, perhaps something compatible with Grafana?


Yeah, nothing like that is provided, and we won't have anything like that for a while. Currently you would have to wire your own instrumentation into the dataflow graph.
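For example (a hand-rolled sketch, not a built-in feature): a pass-through map closure that bumps an atomic counter, which you could then expose to Prometheus/Grafana however you already export metrics.

    use std::sync::atomic::{AtomicU64, Ordering};
    use hydroflow::hydroflow_syntax;

    // Hypothetical hand-rolled metric; export it via whatever you already use
    // (a Prometheus exporter, a periodic log line, etc.).
    static RECORDS_SEEN: AtomicU64 = AtomicU64::new(0);

    fn main() {
        let mut flow = hydroflow_syntax! {
            source_iter(0..1000)
                -> map(|x| { RECORDS_SEEN.fetch_add(1, Ordering::Relaxed); x })  // instrumentation tap
                -> for_each(|x| { let _ = x; });  // downstream processing goes here
        };
        flow.run_available();
        println!("records seen: {}", RECORDS_SEEN.load(Ordering::Relaxed));
    }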


I guess a closer comparison would be with Project Reactor https://projectreactor.io/ which is also a low-level framework for data processing.


Ah, I see. Thank you. I will definitely try it.



