Only because you have chosen to introduce configuration and maintenance complexi...

annexrichmond · 2024-08-12T22:28:51 1723501731

We only use KubernetesOperators, but this has many downsides, and it's very clearly a 2nd thought of the Airflow project. It creates confusion because users of Airflow expect features A, B, and C, and when using KubernetesOperators they aren't functional because your biz logic is separated. Eg., if your biz logic knows what S3 it talks to in an external task, how can Airflow? So now its Dataset feature is useless.

There are a number of blog posts echoing a similar critique[1].

Using KubernetesOperators creates a lot of wrong abstractions, impedes testability, and makes Airflow as a whole a pretty overkill system just to monitor external tasks. At that point, you should have just had your orchestration in client code to begin with, and many other frameworks made this correct division between client and server. That would also make it easier to support multiple languages.

According to their README: https://github.com/apache/airflow#approach-to-dependencies-o...

> Airflow has a lot of dependencies - direct and transitive > The important dependencies are: SQLAlchemy, Alembic, Flask, werkzeug, celery, kubernetes

Why should biz logic that just needs to run Spark and interact with S3 now need to run a web server?

[1] Anecdotes from various posts - https://medium.com/bluecore-engineering/were-all-using-airfl... - https://eng.lyft.com/orchestrating-data-pipelines-at-lyft-co... - https://dagster.io/blog/dagster-airflow

> Airflow, in its design, made the incorrect abstraction by having Operators actually implement functional work instead of spinning up developer work.

> By simply moving to using a Kubernetes Operator, Airflow developers can develop more quickly, debug more confidently, and not worry about conflicting package requirements.

> Airflow lacks proper library isolation. It becomes hard or impossible to do if any team requires a specific library version for a given workflow

> There is no way to separate DAGs to development, staging, and production using out-of-the-box Airflow features. That makes Airflow harder to use for mission-critical applications that require proper testing and the ability to roll back

> Data pipelines written for Airflow are typically bound to a particular environment. To avoid dependency hell, most guides recommend defining Airflow tasks with operators like the KubernetesPodOperator, which dictates that the task gets executed in Kubernetes. When a DAG is written in this way, it’s nigh-impossible to run it locally or as part of CI. And it requires opting out of all of the integrations that come out-of-the-box with Airflow.

nooorofe · 2024-08-13T02:15:46 1723515346

Airflow is far from perfect, but I don't understand your concerns. I work in a big and messy company and even messier department. We have jobs running in Databricks, Snowflake, sometimes we read data from API end points, or even files uploaded to SharePoint (my group is not building DW). Airflow lets me organize it in a single workflow. At least I know that every failed job is reported by email and I don't need to search multiple systems - all starts from Airflow.

> Why should biz logic that just needs to run Spark and interact with S3 now need to run a web server?

Webserver is mostly UI. Scheduler service triggers the jobs.

We have groups which run everything as Bash Operator, no dependency issues that way.

You maybe have a very specific use case in mind, the main points of using Airflow for me

* Single orchestration center: manual job control (stop, pause, rerun), backfill; automated scheduler/retry; built-in notification

* Framework built around "reporting period" - it enforces correct abstraction, if a data batch is broken, I can rerun it and rerun all dependent downstream. How do you fix data in event driven workflow?

* managing dependencies

In most cases all Airflow does is running your job with passing it "date" parameter. You can test your code without Airflow - just pass it a date and run from command line.