I like the comparison page in Hamilton, and in their examples they operate in the asset level, whereas Bruin crosses the asset level into the orchestrator level as well, effectively bridging the gap there. What Bruin does is beyond a single asset that might be a group of functions, it is basically being able to build and run pipelines of that.
In terms of distributed execution, it is in our roadmap to support running distributed workloads as simple as possible, and Postgres as a pluggable queue backend is one of the options as well. Currently, Bruin is meant as a single-node CLI tool that will do the orchestration and the execution within the same machine.
great point about the transformation pipeline, that's a very strong part of our motivation: it's never "just transformation", "just ingestion" or "just python", the value lies in being able to mix and match technologies.
as per the lateness: ingestr itself does the fetching itself, which means the moment you run it it will ingest the data right away, which means there's no latency there. in terms of loading files from S3 as an example, you can already define your own blob pattern, which would allow you to ingest only certain files that fit into your lateness criteria, would this fit?
in addition, we will implement the concept of a "sensor", which will allow you to wait until a certain condition is met, e.g. a table/file exists, or a certain query returns true, and continue the pipeline from there, which could also help your usecase.
feel free to join our slack community, happy to dig deeper into this and see what we can implement there.
I didn't know about Ray Data before, but just gave a quick look and it seems like a framework for ML workloads specifically?
Bruin is effectively going a layer above individual assets, and instead takes a declarative approach to the full pipeline, which could contain assets that are using Ray internally. In the end, think of Bruin as a full pipeline/orchestrator, which would contain one or more assets using various other technologies.
I did look into CUE in the very early days of Bruin but ended up going with a more YAML-based configuration due to its support. I am not familiar with their flow package specifically, but I'll definitely take a deeper look. From a quick look, it seems like it could have replaced some of the orchestration code in Bruin to a certain extent.
One of the challenges, maybe specific to the data world, is that the userbase is familiar with a certain set of tools and patterns, such as SQL and Python, therefore introducing even a small variance into the mix is often adding friction, this was one of the reasons we didn't go with CUE at the time. I should definitely take another look though. thanks!
I like the comparison page in Hamilton, and in their examples they operate in the asset level, whereas Bruin crosses the asset level into the orchestrator level as well, effectively bridging the gap there. What Bruin does is beyond a single asset that might be a group of functions, it is basically being able to build and run pipelines of that.
In terms of distributed execution, it is in our roadmap to support running distributed workloads as simple as possible, and Postgres as a pluggable queue backend is one of the options as well. Currently, Bruin is meant as a single-node CLI tool that will do the orchestration and the execution within the same machine.
reply