I can't imagine why people use Airflow these days. Its DAG DSL means you have to fit your biz logic to its paradigm, when I'd expect a framework to fit nicely around existing biz logic. That means you're essentially stuck: it has no interoperability, hence the need to build custom operators for everything.
It doesn't scale for teams because the package is incredibly bloated. Once you need to run multiple images via K8s operators, you lose out on a lot of other Airflow functionality, because it all assumes your biz logic is embedded within the DAGs, and I don't know why anyone would want to develop that way in the first place.
DAGs are fine. Their DSL is not, because it abstracts the wrong things in the wrong place. It's a global file of static definitions: why hardcode a KubernetesOperator when you might not want a KubernetesOperator in a test env? There is also no type safety between tasks/operators. And it's an extremely dependency-heavy package with no client/server isolation, so bundling Airflow for multiple teams is just not viable.
Why do you think DAGs are defined in Python? The whole point is to be dynamic, exactly so you can do things like switch between operator types based on the environment. Sorry, you really don't seem to know much about Airflow given how strong your opinions against it are. I'm out, no offense intended.
I'm well aware that's "possible", but if you have to build your own abstractions and CI/CD to make it usable this way, it doesn't seem very well designed.
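For context, the kind of environment-based switching being discussed looks roughly like this; a minimal sketch assuming a recent Airflow 2.x, where the AIRFLOW_ENV variable, image name, and task are all made up:

    # Sketch only: pick the operator class per environment inside the DAG file.
    import os
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def transform():
        # stand-in for the actual business logic entry point
        print("running transform locally")

    with DAG("env_switch_example", start_date=datetime(2024, 1, 1),
             schedule=None, catchup=False):
        if os.environ.get("AIRFLOW_ENV") == "prod":
            # provider import path can differ between cncf-kubernetes versions
            from airflow.providers.cncf.kubernetes.operators.pod import (
                KubernetesPodOperator,
            )

            transform_task = KubernetesPodOperator(
                task_id="transform",
                name="transform",
                image="registry.example.com/transform:latest",  # hypothetical image
                cmds=["python", "-m", "transform"],
            )
        else:
            # in a local/test env, run the same logic in-process instead
            transform_task = PythonOperator(
                task_id="transform", python_callable=transform
            )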
> I can't imagine why people use Airflow these days.
> You lose out on a lot of other Airflow functionality because it all assumes you have your business logic embedded within DAGs.
As an old Airflow user, I can relate to that.
The contributors are great, and the pace of new features and fixes is great as well.
I see two main issues with Airflow: (i) the lack of good interoperability with a cloud-native stack, specifically the k8s operators, which feel quite hackish, and (ii) the lack of proper version control for DAGs, in a way that would give us actual, real lineage.
The point is that most of us veterans of ~ETL~ Data Engineering on cloud-native stacks just want to send a CLI command or a Make target to a container to execute the thing we want.
Apache Airflow can do that, but as with most things in the Python ecosystem, the moment you have a great UI (not to be confused with UX) it becomes way easier to embed business logic in the DAGs and end up with your application entangled with Airflow logic.
I will never bash Airflow, since I've made a lot of money consulting on migrations to and from it, but working with it on a daily basis for almost 7 years made me realize the sad reality around orchestrators. In the past, they used to be bedrock technology that you would never want to migrate off of; today, with all this feature bloat and lack of vision, I've started to see orchestrators as being as disposable as JavaScript frameworks.
Prefect has more polish and is easier to get started with than any of the existing options. We've been running their self-hosted version for over three years and it basically stays out of the way.
We looked at Dagster as well as Airflow. I really, really liked Dagster but the BI team didn't.
I cannot imagine using Airflow for anything meaningful and respecting myself at the end of a work day. The local development experience was abysmal. Deployments sucked.
That being said, if you're not using anything except maybe cron right now, and if you don't care about the solution being a Proper Data Pipeline Orchestration Platform™, I'd recommend starting with Windmill.
I've done a lot of evaluation of such frameworks, and I hope to publish more on it. It really depends on your requirements. I would look at Prefect, Flyte, Dagster, and Temporal ahead of Airflow, though.
Airflow is kind of the default platform. If there is still no go-to alternative that is superior along multiple dimensions today, I think that says the problem itself is hard.
People use it because it comes with a fully functional web UI.
(That's really the only reason why it's used. If it was just ad-hoc bash scripts underneath the UI I don't think anybody would mind.)
> I expect a framework to fit nicely with existing biz logic.
I wouldn't; that's just describing a custom codebase. Airflow comes with bells and whistles that you can take advantage of if you fit your execution into its DAG model, that's all, and those bells and whistles can be an excellent pattern to work with. You can go all in on operators, or you can just have an operator call out to an external task that does all the work, relying on Airflow only for its retry/alerting mechanism while keeping the external task in a language of your choice.
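Concretely, that thin-wrapper pattern is just something like the following; a sketch assuming a recent Airflow 2.x, with the image, command, and alert address as placeholders:

    # Sketch only: Airflow handles scheduling, retries, and alerting;
    # the real work lives in an external image that can be in any language.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        "thin_wrapper_example",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args={
            "retries": 3,
            "retry_delay": timedelta(minutes=5),
            "email_on_failure": True,
            "email": ["data-alerts@example.com"],  # hypothetical alert address
        },
    ):
        # the task itself is just a shell call out to the external job
        BashOperator(
            task_id="run_external_job",
            bash_command=(
                "docker run --rm registry.example.com/etl-job:latest "
                "--date {{ ds }}"  # Airflow injects the logical date via templating
            ),
        )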
Airflow owning the scheduling, retries, and task branching/dependencies is fine. But a few issues: to take advantage of Airflow's core features (XComs, task mapping, etc.), your biz logic/tasks must be in the same venv, which is not sustainable as teams grow. And once you have external tasks, you have a more complex system to operate and test, and your Airflow installation becomes overkill (you need a pool of workers just to monitor external... workers?).
You don't need to have your biz tasks in the same venv as your Airflow worker to take advantage of task mapping or XCom.
You just need a well-defined interface for calling your biz tasks, and then you can use any of ExternalPythonOperator, DockerOperator, DockerSwarmOperator, KubernetesPodOperator, etc., or write your own, to pass values or data into your task however you want (rough sketch below).
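For example, something along these lines, with DockerOperator; a rough sketch where the image name and CLI flag are invented:

    # Sketch only: the container never imports Airflow; it just receives a CLI
    # argument whose value is pulled from XCom via templating.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.docker.operators.docker import DockerOperator

    def pick_partition():
        # lightweight glue that stays inside Airflow; the return value lands in XCom
        return "2024-01-01"

    with DAG("xcom_to_container_example", start_date=datetime(2024, 1, 1),
             schedule=None, catchup=False):
        choose = PythonOperator(task_id="choose", python_callable=pick_partition)

        process = DockerOperator(
            task_id="process",
            image="registry.example.com/biz-task:latest",  # hypothetical image
            # the well-defined interface is just a CLI flag on the container
            command="process --partition {{ ti.xcom_pull(task_ids='choose') }}",
        )

        choose >> process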
Airflow is quite complex, and I don't recommend it as people's go-to, but IMO that's in large part because it is so unopinionated about how you call and run your tasks and leaves the configuration up to you. That also means it ends up being a lot of people's choice, because they can get it to fit what they need.
> hence the need to build custom operators for everything
But isn't that the point? I never got the impression that you were supposed to build everything with the built-in operators; they are just the "batteries included" part you can wrap or extend. I really don't understand the criticism.
My criticism was always XCom, but that's a moot point now that we have TaskFlow. Airflow is awesome and very flexible; you just have to adapt to how it's supposed to be used instead of fighting against it, I find.
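For anyone who hasn't seen it, TaskFlow lets values flow between tasks as ordinary return values instead of explicit xcom_push/xcom_pull calls; a minimal sketch assuming a recent Airflow 2.x:

    # Sketch only: the return value of extract() is passed to load() via XCom
    # behind the scenes; no manual xcom_push/xcom_pull needed.
    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
    def taskflow_example():
        @task
        def extract() -> dict:
            return {"rows": 42}

        @task
        def load(payload: dict) -> None:
            print(f"loading {payload['rows']} rows")

        load(extract())

    taskflow_example()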
That might be fine for small teams, but any 100+ person company will already have its own abstractions and doesn't need to reinvent those wheels just for Airflow. Even if you're smaller, I still think you're setting yourself up for failure if all your biz logic lives inside Airflow, unless you are careful to keep it reusable/shareable for other contexts. E.g., what if you want to convert a scheduled pipeline to an event-driven architecture with other systems? That means refactoring everything out. It's not interoperable or modular, and that's why it should be avoided.
But at its core it is built around the data pipeline concept; an event-driven pipeline would be much more fragile. Airflow intentionally doesn't manage business logic; it works with "tasks".
Yes, but that means you are forced to build EDA on top of Airflow, which may not be ideal in many cases. You are stuck managing your pools/workers within Airflow's paradigm, which means every workload must (a) be written in Python, (b) have Airflow installed in the venv (a very heavy package), and (c) run as a k8s pod or via Celery (unless you write your own).
We only use KubernetesOperators, but this has many downsides, and it's very clearly an afterthought in the Airflow project. It creates confusion, because Airflow users expect features A, B, and C, and with KubernetesOperators those aren't functional, since your biz logic is separated. E.g., if only your biz logic in an external task knows what S3 it talks to, how can Airflow? So now its Dataset feature is useless.
There are a number of blog posts echoing a similar critique[1].
Using KubernetesOperators creates a lot of wrong abstractions, impedes testability, and turns Airflow as a whole into a pretty overkill system just to monitor external tasks. At that point, you should have just kept your orchestration in client code to begin with, and many other frameworks get this client/server division right. That would also make it easier to support multiple languages.
> Airflow, in its design, made the incorrect abstraction by having Operators actually implement functional work instead of spinning up developer work.
> By simply moving to using a Kubernetes Operator, Airflow developers can develop more quickly, debug more confidently, and not worry about conflicting package requirements.
> Airflow lacks proper library isolation. It becomes hard or impossible to do if any team requires a specific library version for a given workflow
> There is no way to separate DAGs to development, staging, and production using out-of-the-box Airflow features. That makes Airflow harder to use for mission-critical applications that require proper testing and the ability to roll back
> Data pipelines written for Airflow are typically bound to a particular environment. To avoid dependency hell, most guides recommend defining Airflow tasks with operators like the KubernetesPodOperator, which dictates that the task gets executed in Kubernetes. When a DAG is written in this way, it’s nigh-impossible to run it locally or as part of CI. And it requires opting out of all of the integrations that come out-of-the-box with Airflow.
Airflow is far from perfect, but I don't understand your concerns. I work in a big and messy company and an even messier department. We have jobs running in Databricks and Snowflake; sometimes we read data from API endpoints, or even files uploaded to SharePoint (my group is not building a DW). Airflow lets me organize it all in a single workflow. At least I know that every failed job is reported by email and I don't need to search multiple systems: everything starts from Airflow.
> Why should biz logic that just needs to run Spark and interact with S3 now need to run a web server?
The webserver is mostly UI; the scheduler service is what triggers the jobs.
We have groups that run everything as a BashOperator; no dependency issues that way.
You may have a very specific use case in mind. The main points of using Airflow, for me:
* Single orchestration center: manual job control (stop, pause, rerun), backfills, an automated scheduler with retries, and built-in notifications
* A framework built around the "reporting period", which enforces the right abstraction: if a data batch is broken, I can rerun it and rerun everything downstream that depends on it. How do you fix data in an event-driven workflow?
* Managing dependencies
In most cases all Airflow does is run your job and pass it a "date" parameter. You can test your code without Airflow: just pass it a date and run it from the command line.
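E.g. the job itself can be a plain script with a --date flag (a sketch; the names are made up), so it runs and tests fine without Airflow, while the DAG just invokes it with the templated date:

    # Sketch only: a standalone job that takes the reporting date as a CLI flag.
    # Airflow would call it with the templated logical date; you can also run it by hand.
    import argparse

    def run(date: str) -> None:
        # stand-in for the real reporting-period logic
        print(f"processing reporting period {date}")

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--date", required=True, help="logical date, YYYY-MM-DD")
        args = parser.parse_args()
        run(args.date)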
If you need to convert a scheduled pipeline into some event-driven architecture then yes, it will need a rewrite. Is there any case in which that wouldn't be true? What does it mean to be "interoperable"? Airflow DAGs can be triggered by events, or they can trigger events if need be. I admit it is not designed for event streaming, though.