Yeah, Terraform does have some sharp edges, but we chose it because it's the most widely used IaC language and it lets you review any changes the agent wants to make before they're actually applied in your cloud. One issue with an agent making click-ops changes is that there's little to no visibility into what actually changed, other than going through the console to verify it yourself. Terraform also lets you run more static checks, like cost estimates and policy enforcement, before deploying the changes.
So there was originally BuildFlow[0], then LaunchFlow[1], and now infra.new? How many pivots are you folks going to do? What happened to all the customers you had on these products?
We do want to support clouds beyond AWS, GCP, and Azure. That's one of the reasons we chose Terraform, since it's extensible to any cloud, and it's also why we plan on focusing on Kubernetes next.
Good question! I would say we're more focused on being a data pipeline engine as opposed to a workflow orchestrator. So you could use something like Airflow or Dagster to trigger your BuildFlow pipeline.
Thanks! Currently you can't; right now your only option is to use our Ray runner. But we have talked about supporting different runner options, similar to how Beam can run on Spark, Dataflow, etc. Ultimately it would be nice if folks could implement their own runners, but I think we're still a ways out from that.
Thanks! These are all great questions, apologies for the wall of text.
1. We're definitely more of a generic streaming framework. But I could see ML being one of those use cases as well.
Why Ray?
One of our main drivers was how "pythonic" Ray feels, and that was a core principle we wanted in our framework. Most of my prior experience has been working with Beam, and Beam is great, but it's kind of a whole new paradigm you have to learn. Another thing I really like about Ray is how easy it is to run locally on your machine and get some real processing power. You can easily have Ray use all of your cores and actually see how things scale without having to deploy to a cluster. I could probably go on and on haha, but those are the first two that come to mind.
2. We really want to support a bunch of frameworks / resources. We mainly chose BQ and Pub/Sub because of our prior experience. We have some GitHub issues to support other resources across multiple clouds, and feel free to file issues if you would like to see support for other things! With BuildFlow we deploy the resources to a project you own, so you are free to edit them as you see fit. BuildFlow won't touch already-created resources beyond making sure it can access them. We don't really want to bake environment-specific logic into BuildFlow; I think this is probably best handled with command line arguments to a BuildFlow pipeline. But happy to hear other thoughts here!
3. I'm not sure I understand what you mean by "glue", so apologies if this doesn't answer your question. The BuildFlow code gets deployed with your pipeline, so it doesn't need to run remotely at all. If you were deploying this to a single VM, you could just execute the Python file on the VM and things would be running. We don't have great support for multi-stage pipelines at the moment. What you can do is chain processors together with a Pub/Sub feed, but we do really want to support chaining processors together directly.
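Conceptually, chaining two stages through an intermediate feed looks something like the sketch below. This is not the BuildFlow API; it's a stdlib stand-in where a `queue.Queue` plays the role of the Pub/Sub topic between the two processors, and the stage names are hypothetical:

```python
# Conceptual sketch: two processing stages chained via an intermediate
# topic. A stdlib queue stands in for the Pub/Sub feed between them.
import queue

topic = queue.Queue()  # stand-in for the Pub/Sub topic between stages

def stage_one(records):
    """First processor: parse raw records and publish them to the topic."""
    for record in records:
        topic.put(int(record))

def stage_two():
    """Second processor: consume from the topic and aggregate."""
    total = 0
    while not topic.empty():
        total += topic.get()
    return total

stage_one(["1", "2", "3"])
total = stage_two()
```

With a real Pub/Sub topic in the middle, each stage can be deployed and scaled independently, which is the upside of this workaround until direct processor chaining lands.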
Congratulations on the launch, I love the focus on ease of use and making it easy to get started, and it's exciting to see impressive products being built with Ray!
I'm one of the Ray developers. It is true that Ray focuses a lot on ML applications (in particular, the main libraries built on top of Ray are for workloads like training, serving, and batch processing / inference). That said, one of our long-term goals with Ray is to be a great general-purpose way to build distributed applications, so I hope it is working out for you :)
Thanks for the kind words Robert! Our experience with Ray has been great so far, and we're excited to see how we can use Ray to help improve stream processing.
We don't support any snapshotting or checkpointing directly in BuildFlow at the moment, but these are great features we should support.
But we do have some fault tolerance baked into our I/O operations. Specifically for Google Cloud Pub/Sub, the acks don't happen until the data has been successfully processed and written to the sink, so if there is a bug or some transient failure, the message will be resent later depending on your subscriber configuration.
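The ack-after-sink behavior described above gives you at-least-once delivery. Here is a conceptual sketch of that semantics using only stdlib stand-ins (not the actual BuildFlow or Pub/Sub API; all names are hypothetical): a message is acked only after the sink write succeeds, so a transient failure leaves it pending for redelivery.

```python
# Sketch of ack-after-sink (at-least-once) semantics with stdlib stand-ins.
from collections import deque

class FakeSubscription:
    """In-memory stand-in for a Pub/Sub subscription with redelivery."""
    def __init__(self, messages):
        self.pending = deque(messages)

    def pull(self):
        # Unacked messages stay at the front and are redelivered.
        return self.pending[0] if self.pending else None

    def ack(self):
        self.pending.popleft()

class FlakySink(list):
    """A sink that fails once with a transient error, then succeeds."""
    def __init__(self):
        super().__init__()
        self.failed_once = False

    def append(self, item):
        if not self.failed_once:
            self.failed_once = True
            raise RuntimeError("transient failure")
        super().append(item)

def run_pipeline(subscription, sink, process, max_iterations=100):
    for _ in range(max_iterations):
        message = subscription.pull()
        if message is None:
            break
        try:
            sink.append(process(message))  # write to the sink first...
            subscription.ack()             # ...then ack only on success
        except Exception:
            pass  # no ack: the message stays pending and is redelivered

sink = FlakySink()
run_pipeline(FakeSubscription(["a", "b"]), sink, str.upper)
```

Despite the transient failure on the first write, both messages end up in the sink, because the failed message was never acked and got redelivered.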
All of our processing is done via Ray (https://www.ray.io/). Our early benchmarks show about 5k messages per second on a single 4-core VM, but we believe we can increase that with some more optimizations.
This benchmark was consuming a Google Cloud Pub/Sub stream and outputting to BigQuery.
Great question! We actually looked at using the workflow abstraction for batch processing in our runner, but ultimately didn't because it was still in alpha (we use the Datasets API for batch flows).
I think one area where we differ is our focus on stream processing, which I don't think is well supported by the workflow abstraction, as well as having more resource management and use-case-driven IO.
Makes a ton of sense! I was present at the demo for this at last year's Ray conference and I definitely got the sense that a lot of the orchestration details were still being thought through, and that it was not yet a first-class streaming product.
Definitely like seeing more streaming-focused orchestration tools out there - it's a growing niche with not enough alternatives to Beam.