Hydroflow: Dataflow Runtime in Rust (github.com/hydro-project)
94 points by adamnemecek 8 months ago | 14 comments

Hey, thanks for posting. I'm one of the main devs of Hydroflow. It is not built on top of timely or differential (we have those deps for benchmarking). The design goals are a little different: Hydroflow aims to be faster and lower level, with fewer unnecessary clocks. Hydroflow is also single-node; scaling is done with explicit networking rather than through the runtime. It's the lowest level of the Hydro stack.

Hydro homepage: https://hydro.run/

For a fun easy demo check out the Hydroflow surface syntax playground: https://hydro.run/playground

More info on the Hydro project and stack, CIDR '21: https://hydro.run/papers/new-directions.pdf

Info on the "lattice flow" model in Hydroflow: https://hydro.run/papers/hydroflow-thesis.pdf

e: Also happy to answer any questions! :)

I'm familiar with Java-based tools for data processing, like Spring Reactor, Apache Beam, etc., and I'm trying to figure out how I can use Hydroflow instead.

To my understanding, I just implement small pieces of the processing steps and connect them through a common bus like Kafka. All those steps are just independent binaries. To make the final processing flow, I run multiple instances, manually scaled depending on the task. For example, to do an aggregation / reduce, I just make sure that only one instance of that binary is running. And so on. Everything can be orchestrated and scaled using Kubernetes or similar. Is that how I'm supposed to design the processing?

Yes, that is correct for how you would have to set up a multi-node system for now. This is in contrast with Beam, Spark, etc., where the runtime deployment management/coordination is perhaps the biggest, most important part of the product. Hydroflow aims to be a lot lower-level than that, with no opinions on coordination and networking. For example, we'd want it to be possible to implement the coordination mechanisms of one of those systems (Beam, Spark, MapReduce, Materialize) in Hydroflow.
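To make the layout described above concrete, here is a minimal sketch in plain Rust. It is not Hydroflow code and uses no Hydroflow API: threads stand in for the independent binaries, a std `mpsc` channel stands in for the shared bus (Kafka in the question), and the names `map_stage`, `reduce_stage`, and `run_pipeline` are hypothetical. The point it illustrates is only the shape: many stateless workers, exactly one aggregating instance.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a stateless processing step; many copies may run in parallel.
fn map_stage(input: Vec<u64>, out: mpsc::Sender<u64>) {
    for x in input {
        // Each worker squares its inputs and publishes them to the shared "bus".
        out.send(x * x).expect("reducer hung up");
    }
}

/// Stand-in for the aggregation step; exactly one instance consumes the bus.
fn reduce_stage(bus: mpsc::Receiver<u64>) -> u64 {
    // `iter()` ends once every sender has been dropped.
    bus.iter().sum()
}

/// Hypothetical helper: one worker per partition, one reducer for all of them.
fn run_pipeline(partitions: Vec<Vec<u64>>) -> u64 {
    let (tx, rx) = mpsc::channel();
    let workers: Vec<_> = partitions
        .into_iter()
        .map(|part| {
            let tx = tx.clone();
            thread::spawn(move || map_stage(part, tx))
        })
        .collect();
    // Drop the original sender so the bus closes when the workers finish.
    drop(tx);
    let total = reduce_stage(rx);
    for w in workers {
        w.join().expect("worker panicked");
    }
    total
}

fn main() {
    let total = run_pipeline(vec![vec![1, 2], vec![3, 4]]);
    println!("sum of squares = {total}");
}
```

In a real deployment each stage would be its own process and the channel would be a network transport, so the "single reducer" guarantee would come from the orchestrator rather than from the program itself.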

We do have a tool, Hydro Deploy, to set up and manage Hydro cluster deployment on GCP (other clouds later), but it's mainly for running experiments and not long-running applications.

The long term idea is that the Hydro stack will determine how to distribute and scale programs as part of the compilation process. Some of that will be rewriting single-node Hydroflow programs into multi-node ones.

Thank you. I actually like the idea that I can split the processing into independent binaries and scale them manually. And especially the fact that I can implement it in Rust.

Btw, another question. Things like Flink provide a dashboard to analyze the processing flow / graph, to see the bottlenecks, etc. To my understanding Hydroflow doesn't provide that yet, but I'm curious whether you're working on something like it, or whether it provides other kinds of metrics, perhaps something compatible with Grafana?

Yeah, nothing like that is provided, and we won't have anything like that for a while. Currently you would have to wire your own instrumentation into the dataflow graph.
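As a rough illustration of what wiring in your own instrumentation can mean, the sketch below counts items at two points of a small pipeline. It uses ordinary Rust iterators rather than a Hydroflow graph, and `StageCounter` and `run_flow` are hypothetical names made up for this example; the same idea (a side-effecting pass-through step that bumps a counter) is an assumption about how one might instrument a real dataflow, not a documented Hydroflow feature.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical per-stage metric: how many items passed one point in the flow.
#[derive(Default)]
struct StageCounter {
    seen: AtomicUsize,
}

impl StageCounter {
    fn record(&self) {
        self.seen.fetch_add(1, Ordering::Relaxed);
    }

    fn get(&self) -> usize {
        self.seen.load(Ordering::Relaxed)
    }
}

/// A toy flow: keep even numbers, multiply by ten, and count the items
/// entering and leaving the filter.
fn run_flow(input: &[i64], before: &StageCounter, after: &StageCounter) -> Vec<i64> {
    input
        .iter()
        .copied()
        .inspect(|_| before.record()) // items entering the filter
        .filter(|x| x % 2 == 0)
        .inspect(|_| after.record()) // items that survived the filter
        .map(|x| x * 10)
        .collect()
}

fn main() {
    let (before, after) = (StageCounter::default(), StageCounter::default());
    let out = run_flow(&[1, 2, 3, 4, 5, 6], &before, &after);
    println!("in={} out={} result={:?}", before.get(), after.get(), out);
}
```

The counters could then be exported periodically to whatever metrics system you already use; the export step is left out here.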

I guess a closer comparison would be with Project Reactor https://projectreactor.io/ which is also a low-level framework for data processing.

Ah, I see. Thank you. I will definitely try it

Does it / will it support retractions?

Oh, I realize this comes out of the Hydro Project, which produced a previous HN submission I thought was cool:

Katara is a project to synthesize CRDTs from a C++ implementation of a regular plain-old data structure along with a few annotations



I'm looking for this but can't find it, how does this project compare to differential dataflow?

As a sibling commenter mentioned, it's built on timely dataflow (which is lower-level), but that already has differential dataflow[0] built on top of it by the same authors.

How do they differ?

(btw. I think dataflow is very cool as a computing model (not just timely dataflow), to the point of building OctoSQL[1] around it, so I'm really curious about the details here)

[0]: https://github.com/TimelyDataflow/differential-dataflow

[1]: https://github.com/cube2222/octosql

> how does this project compare to differential dataflow?

This may be a stupid question, but from reading the description on the page you linked, how are timely and differential dataflow different from the implementation details of '80s reactive programming such as SIGNAL (https://ieeexplore.ieee.org/document/97301)?

> Differential dataflow programs look like many standard "big data" computations, borrowing idioms from frameworks like MapReduce and SQL. However, once you write and run your program, you can change the data inputs to the computation, and differential dataflow will promptly show you the corresponding changes in its output. Promptly meaning in as little as milliseconds.
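The behaviour described in that quote (change the inputs, get the corresponding output changes) can be sketched with a toy incremental word count. This is plain Rust, not the differential dataflow API; `IncrementalCount` and `update` are hypothetical names, and the `(record, diff)` encoding with `+1` for an addition and `-1` for a retraction is an assumption borrowed loosely from how such systems are usually explained.

```rust
use std::collections::HashMap;

/// Toy incrementally maintained count per word.
struct IncrementalCount {
    counts: HashMap<String, i64>,
}

impl IncrementalCount {
    fn new() -> Self {
        Self { counts: HashMap::new() }
    }

    /// Apply a batch of (word, diff) input changes and return only the
    /// changes to the output, as ((word, count), diff) records:
    /// -1 retracts a previously reported count, +1 asserts the new one.
    fn update(&mut self, batch: &[(&str, i64)]) -> Vec<((String, i64), i64)> {
        // Consolidate the batch so each word is handled once.
        let mut net: HashMap<&str, i64> = HashMap::new();
        for &(word, diff) in batch {
            *net.entry(word).or_insert(0) += diff;
        }

        let mut out = Vec::new();
        for (word, diff) in net {
            if diff == 0 {
                continue; // additions and retractions cancelled out
            }
            let old = self.counts.get(word).copied().unwrap_or(0);
            let new = old + diff;
            if old != 0 {
                out.push(((word.to_string(), old), -1));
            }
            if new != 0 {
                self.counts.insert(word.to_string(), new);
                out.push(((word.to_string(), new), 1));
            } else {
                self.counts.remove(word);
            }
        }
        out.sort(); // deterministic order for display
        out
    }
}

fn main() {
    let mut wc = IncrementalCount::new();
    println!("{:?}", wc.update(&[("a", 1), ("b", 1), ("a", 1)]));
    println!("{:?}", wc.update(&[("a", -1)])); // retract one "a"
}
```

Only the words touched by a batch are revisited, which is the property the quote is pointing at; real systems generalize this to joins, iteration, and timestamps.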

I think it's best to read the paper[0].

[0]: https://sigops.org/s/conferences/sosp/2013/papers/p439-murra...

This seems to be based on timely dataflow from Frank McSherry, Chief Scientist at Materialize.

Another project I’m excited about in this space is VMware’s streaming database processor https://github.com/vmware/database-stream-processor

EDIT: I said that after checking the Cargo.lock for Timely/Differential, but the author says those are just around for benchmarking! Never mind :)
