
Cloud Dataflow SQL and Cloud Dataflow Batch Flexible Resource Scheduling Release - faizshah
https://cloud.google.com/blog/products/data-analytics/streaming-analytics-now-simpler-more-cost-effective-cloud-dataflow
======
faizshah
So I have a bad history with Cloud Dataflow.

I like BigQuery and Flink a lot, but many companies seem to be switching
over to Cloud Dataflow. In my experience Cloud Dataflow's cost is really
unpredictable, and Dataprep's pipelines (on Dataflow) were far less efficient
than an equivalent Dataproc Spark pipeline, Spark on Compute Engine, or
just plain BigQuery.

In addition:

Apache Beam itself adds a lot of overhead:
[https://arxiv.org/pdf/1907.08302.pdf](https://arxiv.org/pdf/1907.08302.pdf)

Switching away from Dataflow can yield massive cost savings:
[https://lugassy.net/how-moving-from-pub-sub-to-avro-saved-us...](https://lugassy.net/how-moving-from-pub-sub-to-avro-saved-us-38-976-year-ec6c33ea7d08)

Right now I'm sorting out a solution using either GKE, Managed Instance
Groups, or Cloud Run to create a completely preemptible, scale-to-0 pipeline
instead. But it's sort of like reinventing the wheel: the compute choice is
just the Task Manager in Flink, and the "orchestration engine" that you set up
to manage your Cloud Tasks/Pub/Sub and Cloud Build solution is reinventing
Flink's Job Manager. I miss the clarity of Flink and Beam, but I can't find a
cheap, scale-to-0, preemptible-instance-only solution.

What's your experience been?

