Lessons learned from running Apache Airflow at scale (shopify.engineering)
285 points by datafan on May 23, 2022 | 182 comments

We've also been running airflow for the past 2-3 years at a similar scale (~5000 dags, 100k+ task executions daily) for our data platform. We weren't aware of a great alternative when we started. Our DAGs are all config-driven which populate a few different templates (e.g. ingestion = ingest > validate > publish > scrub PII > publish) so we really don't need all the flexibility that airflow provides. We have had SO many headaches operating airflow over the years, and each time we invest in fixing the issue I feel more and more entrenched. We've hit scaling issues at the k8s level, scheduling overhead in airflow, random race conditions deep in the airflow code, etc. Considering we have a pretty simplified DAG structure, I wish we had gone with a simpler, more robust/scalable solution (even if just rolling our own scheduler) for our specific needs.
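To make the config-driven pattern concrete, here's a stdlib-only sketch of stamping out near-identical pipelines from a template (the dataset names and step names are invented; in a real Airflow repo each step would be wrapped in an operator):

```python
# One shared template; each dataset's config just instantiates it.
INGESTION_TEMPLATE = ["ingest", "validate", "publish_raw", "scrub_pii", "publish_clean"]

def build_dag(dataset: str, template: list[str]) -> list[tuple[str, str]]:
    """Return (upstream, downstream) edges for one dataset's linear pipeline."""
    tasks = [f"{dataset}.{step}" for step in template]
    return list(zip(tasks, tasks[1:]))

# Config-driven: adding a dataset is a one-line config change, not new DAG code.
config = {"orders": INGESTION_TEMPLATE, "payments": INGESTION_TEMPLATE}
dags = {name: build_dag(name, tpl) for name, tpl in config.items()}
```

When every DAG is generated like this, the full flexibility of a general orchestrator goes mostly unused, which is the parent's point.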

Upgrades have been an absolute nightmare and so disruptive. The scalability improvements in airflow 2 were a boon for our runtimes since before we would often have 5-15 minutes of overhead between task scheduling, but man it was a bear of an upgrade. We've since tried multiple times to upgrade past the 2.0 release and hit issues every time, so we are just done with it. We'll stay at 2.0 until we eventually move off airflow altogether.

I stood up a prefect deployment for a hackathon and found that it solved a ton of the issues with airflow (sane deployment options, not the insane file-based polling that airflow does). We looked into it about a year ago; I haven't heard a lot about it lately, and I wonder if anyone has had success with it at scale.

If your team is comfortable writing pure python and you're familiar with the concept of a makefile, you might find Luigi a much lighter and less opinionated alternative for workflows.

Luigi doesn't force you into using a central orchestrator for executing and tracking the workflows. Tracking and updating task state is handled through open functions left for the programmer to fill in.

It's probably geared toward more expert programmers who work close to the metal and who don't care about GUIs as much as high degrees of control and flexibility.

It's one of those frameworks where the code that is not written is sort of a killer feature in itself. But definitely not for everyone.

It’s worth noting that Luigi is no longer actively maintained and hasn’t had a major release in a year.

Toil is pure Python, but I'm not sure how the feature set compares https://github.com/DataBiosphere/toil

Really interesting to see a bioinformatics tool be proposed. I've worked in bioinformatics for over 20 years, written several workflow systems for execution on compute clusters, used several other people's, and been underwhelmed by most. I was hoping that Airflow might be better, since it was written by real software engineers rather than people who do systems design as a means to their ends, but Airflow was completely underwhelming.

The other orchestrator besides Toil to check out is Cromwell, but that uses WDL instead of Python for defining the DAG, and it's not a super powerful language, even if it hits exactly the needs for 99% of uses and does exactly the right sort of environment containment.

I'm also hugely underwhelmed by k8s and Mesos and all those "cloud" allocation schemes. I think that a big, dynamically sized Slurm cluster would probably serve a lot of people far better.

I did a proof of concept in luigi pretty early on and really liked it. Our main concerns were that we would have needed to bolt on a lot of extra functionality to make it easy to re-run workflows or specific steps in the workflows when necessary (manual intervention is unavoidable IME). The fact that airflow also had a functional UI out of the box made it hard to justify luigi when we were just getting off the ground.

Very similar experience to yours. Adopted Airflow about 3 years ago. Was aware of Prefect but it seemed a bit immature at the time. Checked back in on it recently and they were approaching alpha for what looked like a pretty substantial rewrite (now in beta). Maybe once the dust has settled from that I'll give it another look.

creator of prefect was an early major airflow committer. anyone know what motivated the substantial rewrite of prefect? i had assumed original version of prefect was already supposed to fix some design issues in airflow?

I'm a heavy Prefect user and was also very confused about the initial rewrite, even after reading several summaries. My best advice is to just try using 2.0 (Orion). Here's how I'd summarize the difference:

Prefect 1.0 feels like second-gen Airflow--less boilerplate, easy dynamic DAGs, better execution defaults, great local dev, etc etc. It's more sane but you still feel the impedance mismatch from working with an orchestrator.

Prefect 2.0 is a first-principles rewrite that removes most of the friction from interacting with an orchestrator in the first place. Finally, your code can breathe.

I think you mean prefect orion/v2[0]. I'm curious too.

[0] https://www.prefect.io/orion/

Yes, the original stack 'Prefect' was written to address issues in airflow. The DAG in Prefect was built using decorators inside a context manager, which was pretty cool and worked well, but they moved to DAG generation as code in Orion.

Prefect is very cleanly written, well designed, and flexible. IMHO it is a platform that will be the next big thing in the area.

How do I know? I deployed prefect as a static config gathering system across 4000 servers, both Linux and Windows. No other software stack came close, as one of the core concepts of prefect is 'expect to fail'. Things like Ansible Tower die really quickly with large clusters due to the normal number of failures and the incorrect assumption that most things will work (as you can assume for a small cluster).

I wish I got to use it in my current work but there is no use case. Yet.

You mean you used prefect to fetch nodes' system parameters/config?

Interesting use case, I use prefect for data pipelines, never thought about that kind of use case.

I had many thousands of machines. I needed to collect disk size, RAM, software inventory, and some custom config, if present. Some machines are Linux, some Windows.

With prefect I created a task 'collect machine details for windows', another 'collect machine details for Linux', another 'collect software inventory'.

I have a list of machines in a database so I create a task to get them. That task is an sqlalchemy query so I can pass the task a filter.

I get a list of linux machines and pass that to a task to run. I get a list of windows machines and pass that to a task.

Note that the above don't depend on each other.

I have a task that filters good results from bad. I have another task that writes a list to a database.

Other tasks have credentials.

Another task puts errors into an error table; the machines that failed get filtered from the results and fed into this task.

I plumb the above together with a Prefect flow, and it builds a DAG and runs it. Everything that can be run in parallel does so; everything that has some other input waits for that input.

Tasks that fail can be retried by Prefect automatically. Intermediate results are cached. And I get a nice GUI for everything; I can even schedule it in the GUI.
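A stdlib-only analogue of the inventory flow described above (the collector function and its fields are stand-ins; in Prefect each function would be a @task and the plumbing below would be a flow):

```python
from concurrent.futures import ThreadPoolExecutor

def with_retries(fn, attempts=3):
    """Mimic automatic task retries: 'expect to fail', retry a few times."""
    def wrapped(machine):
        for _ in range(attempts):
            try:
                return machine, fn(machine)
            except Exception as exc:
                err = exc
        return machine, err  # retries exhausted; keep the error for the error table
    return wrapped

def collect_linux(machine):
    # stand-in for 'collect machine details for Linux' (would SSH in reality)
    return {"machine": machine, "os": "linux", "disk_gb": 100}

def run_inventory(machines, collector):
    # Independent machines are collected in parallel, like independent DAG tasks
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(with_retries(collector), machines))
    good = [r for m, r in results if not isinstance(r, Exception)]  # -> results table
    bad = [m for m, r in results if isinstance(r, Exception)]       # -> error table
    return good, bad
```

The filter-good-from-bad step at the end is the same shape as the tasks described above; the real thing adds credentials handling, caching, and the GUI for free.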

Very interesting! Thank you for the details.

If you could go back and use something else instead what would you choose?

It's a good question. I believe airflow was probably the right choice at the time we started. We were a small team, and deploying airflow was a major shortcut that more or less handled orchestration so we could focus on other problems. With the aid of hindsight, we would have been better off spinning off our own scheduler some time in the first year of the project. Like I mentioned in my OP, we have a set of well-defined workflows that are just templatized for different jobs. A custom-built orchestration system that could perform those steps in sequence and trigger downstream workflows would not be that complicated. But this is how software engineering goes, sometimes you take on tech debt and it can be hard to know when it's time to pay it off. We did eventually get to a stable steady state, but with lots of hair pulling along the way.

dbt tool. getdbt.com

Can dbt run arbitrary code? If it can, it's not well advertised in the documentation. Every time I've looked into dbt, I found that it's mostly a scheduled SQL runner.

The primary reason we run Airflow is because it can execute Python code natively, or other programs via Bash. It's very rare that a DAG I write is entirely SQL-based.

dbt has just opened a serious conversation about supporting Python models. I'm sure they'd value your viewpoint! https://github.com/dbt-labs/dbt-core/discussions/5261

You’re right. I think the strength of dbt is in the T part of ELT. I wrote ELT to make a distinction in principle from the traditional ETL. (E)xtract and (L)oad is the data ingestion phase that would probably be better served by Dagster, where you could use Python.

(T)ransform is decoupled and would be served by set-based operations managed by dbt.

Dbt is great, but solves only a small part of what Airflow does.

I would check out Astronomer.io. They just recently came out with a new managed service that specifically targets these exact pain points.

Feel free to ping at xavier.marcos@astronomer.io if you're open to a chat.

If you’re in kubernetes land, I’d suggest checking out argo workflows. Very similar DAG primitives to airflow's.

It essentially abstracts workflows as custom resources. Works phenomenally well and quite stable.

I've used Airflow for a few years and here's what I don't like about it:

- Configuration as code. Configuration should be a way to change an application's behavior _without_ changing the code. Make me write a workflow as JSON or XML. If I need a for-loop, I'll write my own script to generate the JSON.

- It's complicated. You almost need a dedicated Airflow expert to handle minor version upgrades or figure out why a task isn't running when you think it should.

- Operators often just add an API layer over top of existing ones. For example, to start a transaction on Spanner, Google has a Python SDK with methods to call their API. But with Airflow, you need to figure out what _Airflow_ operator and method wraps the Google SDK method you're trying to call. Sometimes the operator author makes "helpful" (opinionated) changes that refactor or rewrite the native API.

I would love a framework that just orchestrates tasks (defined as a command + an image) according to a schedule, or based on the outcome of other tasks, and gives me a UI to view those outcomes and restart tasks, etc. And as configuration, not code!
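The wished-for tool can be sketched in a few lines: a JSON document declares tasks as image + command + dependencies, and a small runner derives the execution order. The schema here is invented for illustration, not any real tool's:

```python
import json
from graphlib import TopologicalSorter  # stdlib since Python 3.9

workflow_json = """
{
  "schedule": "0 6 * * *",
  "tasks": {
    "extract":   {"image": "etl:latest", "command": ["extract.sh"],   "after": []},
    "transform": {"image": "etl:latest", "command": ["transform.sh"], "after": ["extract"]},
    "load":      {"image": "etl:latest", "command": ["load.sh"],      "after": ["transform"]}
  }
}
"""

def execution_order(doc: str) -> list[str]:
    """Parse the JSON config and return tasks in dependency order."""
    tasks = json.loads(doc)["tasks"]
    ts = TopologicalSorter({name: spec["after"] for name, spec in tasks.items()})
    return list(ts.static_order())
```

A real runner would hand each task's image and command to Kubernetes or Docker instead of just returning names, but the point stands: the workflow itself is data, and a for-loop that generates it can live in whatever script you like.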

>If I need a for-loop, I'll write my own script to generate the JSON.

That's how you end up with an extreme mess in logs, the UI, and metrics.

> But with Airflow, you need to figure out what _Airflow_ operator and method wraps the Google SDK method you're trying to call.

Or you can use PythonOperator with hooks, which generally integrate external APIs with Airflow's connection system.


I think the bigger problem with Airflow's Operator concept is the N*M problem of integrating multiple systems. That's how you end up with GoogleCloudStorageToS3Operator and stuff like that.

What you're asking for is basically Argo Workflows, I think.

Not that I recommend it. It's quite lovely in principle, but really flawed in practice. It's YAML hell on top of Kubernetes hell (and I say that as someone who loves Kubernetes and uses it for everything every day).

Having worked with some of these tools, what I've started to wish for is a system where pipelines are written in just plain code. I'd like to run and debug my pipeline as a normal, compiled program that I can run on my own machine using the tools I already use to build software, including things like debuggers and unit testing tools. Then, when I'm ready to put it into production, I want a super scalable scheduler to take my program and run it across dozens of autoscaling nodes in Kubernetes or whatever.

The only thing I've come across that uses this model is Temporal, but it's got a rather different execution model than a straightforward pipeline scheduler.

That's one of the things I really like about Spark (though it's probably a lot less general than what you're after - running a pipeline of steps that are Scala code rather than arbitrary commands). I can run, test, and debug my pipeline locally in my IDE where it's just plain old Scala. Then I can change one line and run my same pipeline on the company Hadoop cluster.

Looks Python-specific? I don't use Python.

Prefect might work for your use-case:


IMO the issue with things like Prefect (and the sibling comment's suggestion of Toil) is the inextricable coupling with Python.

We use Argo at my work, because we made the “unit of work” just a container that’s run by K8s. There isn’t a Python runtime around to run this stuff, and that’s a good thing.

Same as with Pulumi, I've seen what devs do when you let them write arbitrary code anywhere: you end up with code that lives everywhere and confusingly cuts across different infrastructural layers with wild abandon. It becomes difficult to onboard, difficult to maintain, the blast radius becomes unknown, and just because it's Python, it will randomly break in 6 months because nobody pinned a transitive dependency.

YAML is awful. Argo could do with some trimming in its configs, but I’ll take it, and other “constrained” alternatives over arbitrary-code-in-Python-that-controls-resources-that-run-more-arbitrary code.

I don't see the attraction of a language-specific framework. I don't use Python, and adding yet another language into the mix would just add complexity.

The flexibility of code as configuration is indeed somewhat of a footgun at times. That’s why with Orchest we went with a declarative JSON config approach.

We take inspiration from the Kubeflow project and run tasks as containers. With a GUI for editing pipelines and managing scheduled runs we come pretty close to what you’re asking for (bring an image and run a command). And it’s OSS, of course.


> If I need a for-loop, I'll write my own script to generate the JSON.

OK, so you are using some service that expects JSON config. You have a script to generate the JSON. Now you have to make sure to run the script on changes. You also need to check the JSON into your VCS despite it being a generated artifact. This leads to the perpetual "somebody fixes an issue in the artifact, and it gets wiped after someone uses the script" issue.

What you're saying is theoretically very good! But it doesn't play well with the combo of assumptions made everywhere in hosted services and some PL infrastructure:

- every project is 1 git repository, and vice versa

- commit security is an all or nothing proposition (so it's hard to have permissions like "this file can only be edited by this user" or "this commit can be made directly to `main`, _if_ it only touches this file")

- configuration as configuration

- version numbers have to be written into code

And obviously this is a problem, because everyone keeps on inventing YAML-based programming idioms for each of their services! Because static configuration doesn't map to even the simplest of setups very well.

People then come and mention stuff like Dhall. But really people want to be able to write programs that read files, make HTTP requests, and then make decisions about what to do based on that. And things like "if this test suite fails, do this other orchestration". The ideal CI flow is not statically definable at the start of execution!

There is a killer protocol out there, and it's ... something like Skylark (Starlark), with a standard library that gives file reading, checking of statuses, probably HTTP and the like, and an executable that allows for configuring permissions in certain ways. The bonus would be just re-doing Bazel but with fewer guardrails. If someone showed up with CI/CD that was more in this vein, they would become the dominant CD host very quickly in my opinion.

Strongly disagree.

IMO I think GUI variables are a big weakness of Airflow.

I, simply, never want to be responsible for a typo in a prod config variable. Ever.

I'm all for version controlling every, single, variable.

Lots of the comments here seem to be commenting on their own experiences of complexity and frustration with Airflow, but I'd venture to say that's most data orchestration tools. In fact, that sort of feedback is so consistent that I'm half tempted to start a podcast for "orchestration horror stories" (contact if interested).

What I've found while building out Shipyard, a hosted lightweight orchestration platform, is that teams want something that "just works". Servers that "just scale". Observability that doesn't require digging. Notifications and retries that work automatically. Workflows that don't mix business logic with platform logic. Code and workflows that sync with git. Deployment that only takes a few minutes.

For the straightforward use cases, where you need to run tasks A -> G daily, with a bit of branching logic, Airflow is overkill. Yes, Airflow has a lot of great complex functionality that can help you down the road. But Airflow keeps getting suggested to everyone even if it's not best suited to their use case, resulting in lots of lost time and engineering overhead.

While I definitely have bias, there are a lot of other high quality alternatives out there to explore nowadays!

I’d 100% listen to an “orchestration horror stories” podcast.

I’d also like to throw Matillion into the mix of “horror story tools”. I have far too many matillion-induced bad stories.

* Teammates making inscrutable ETL flows because it lets you build nice visual flows, and instead of laying them out sanely, they draw pictures with them. One was particularly fond of making flowers and fish, which, while amusing, was entirely unhelpful when there were pressing prod issues and you're trying to follow the flow around some flower petals.

* nigh incomprehensible generated SQL

* a design that let users mix orchestration and transformation, so prior teammates created jobs that would stomp on each other’s outputs, because they didn’t orchestrate them sanely

* sometimes the runtime would just…stop for reasons we were still unable to resolve.

* lets you run arbitrary Python/bash/etc scripts, so people would put all sorts of wild scripts in the flow. Oh, also, it's not Python, it's actually Jython, and it could mutate variables in the shared environment; an old teammate would (ab)use this functionality to set the date-time of some variable that another ETL flow would use to make a decision (see issues above about mixing orchestration + transformation).

+1 for the orchestration horror stories podcast.

Unless you have extremely complex dependency graphs, I really don't think airflow is worth it. It's very easy to end up essentially writing an "orchestrator" using airflow; it allows for very flexible low-level operations. The added complexity has minimal benefit, and, like Apache Spark, what looks simple becomes hard to reason about in real-world scenarios. You need to understand how it works under the hood and get the best practices right.

As mentioned elsewhere, AWS step functions are really the best in orchestration.

> As mentioned elsewhere, AWS step functions are really the best in orchestration.

AWS Step Functions is a proprietary service provided exclusively by AWS, which reacts to events from AWS services and calls AWS Lambdas.

Unless you're already neck-deep in AWS, and are already comfortable paying through the nose for trivial things you can run yourself for free, it's hardly appropriate to even bring up AWS Step Functions as a valid alternative. For instance, Shopify's articles explicitly mention they are running their services in Google Cloud. Would it be appropriate to tell them to just migrate their whole services to AWS just because you like AWS Step Functions?

> already comfortable paying through the nose for trivial things you can run yourself for free

But a fault-tolerant workflow engine is not a trivial thing; it may cost you many engineer-hours to build, monitor, and maintain, so outsourcing it to someone else is a totally viable solution.

> But fault tolerant workflow engine is not trivial thing,

The complexity and risk of migrating cloud providers eclipses whatever problem you assign to "fault tolerant workflow engines".

Any mention of AWS Step Functions makes absolutely no sense at all and reads at best like a non-sequitur.

I read it as a comment on the UX / developer experience, which can be superior with Step Functions vs the competition, regardless of whether Step Functions is an appropriate (or even physically possible) option for non-AWS projects.

That was one of the reasons we do "bring your own compute" with https://iko.ai so people who already have a billing account on AWS, GCP, Azure, or DigitalOcean can just get the config for their Kubernetes clusters, link them to iko.ai, and their machine learning workloads will run on whichever cluster they select.

If you get a good deal from one cloud provider, you can get started quickly.

It's useful even for individuals such as students who get free credits from these providers: create a cluster and you're up and running in no time.

Our rationale was that we didn't want to be tied to one cloud provider.

https://github.com/checkr/states-language-cadence allows you to define workflows in states language over cadence.

This is another symptom of a person who doesn’t know what they’re talking about really.

It’s like those stackoverflow answers that tell the user to stop using PHP and rewrite it in Python or something.

> AWS step functions are really the best in orchestration.

At our company, AWS Step Functions is a disaster. You're effectively writing code in JSON/YAML. Anything beyond very simple steps becomes 2 pages of YAML that's very hard to read or write. There is no way to debug, and polling is mostly unusable. Changes need to be deployed with CloudFormation, which can take forever or, worse, hang.

It's one of the most annoying technologies I've used in my 20+ years of engineering.
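To illustrate the verbosity complaint: even a trivial two-step state machine in Amazon States Language is already this wordy (the Lambda ARNs below are placeholders), and real workflows multiply this per step.

```python
import json

# Minimal Amazon States Language document: two Lambda tasks with a retry policy.
state_machine = {
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract",
            "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 2, "MaxAttempts": 3}],
            "Next": "Load",
        },
        "Load": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:load",
            "End": True,
        },
    },
}

definition = json.dumps(state_machine, indent=2)  # what you'd paste into CloudFormation
```

Every branch, catch clause, and input/output path adds more nested JSON like this, which is where the "2 pages of YAML" experience comes from.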

I feel you. That's why we wrote a little library on top of SFN so that we can program SFN with Clojure instead of YAML https://github.com/Motiva-AI/stepwise. Application code sits with SFN definition and SFN Tasks are automatically integrated as polling Activities from Clojure code.

Thoughtworks made a case for this distinction in https://martinfowler.com/articles/cant-buy-integration.html#...

yeah, to me the idea that anyone would want to write orchestration in something that isn't a programming language where you can write tests in the same language is just so backwards.

Definitely this.

Our teams have many 1 or 2 step DAGs that are idempotent. They could have been lambdas, and they're already pulling from SQS. It could be just my misfortune, but in AWS, MWAA is kind of janky. It's difficult to track down problems in the logs (task failures look fine there) and the Airflow UI is randomly unavailable ("document returns empty", "connection reset" kind of things).

Lambdas have resource constraints that Airflow DAGs don't. Most notably, Airflow DAGs can run for any arbitrary length of time. And the local storage attached to the Airflow cluster is actual disk space, not just a fake in-memory disk, making it possible to process files larger than the amount of memory allocated to the DAG.

There's certainly some functionality overlap, but I don't see Lambda and Airflow as competitors. Each has capabilities that the other doesn't.

I remember reading that you can attach EFS to lambdas. That would solve some of the storage issues.

I read some of these massively complex data architecture posts and I almost always come away asking "What the hell is this for?" I know Shopify is a huge business, but I see this kind of engineering complexity and all I think is that it has to cost tens of millions to build and operate, and how could they possibly be getting ROI? There are ten boxes on that diagram and none of them have a user interface for anyone except other developers.

A lot of times this is used for data warehousing so product managers and otherwise can query the database of one app joined with another, especially in an environment with microservices. You might join a table containing orders with another table that was from a totally different DB, like payments, to find out which kinds of items are best to offer BNPL or something.

The author also mentions that it’s used for machine learning models which will ultimately feed back into Shopify’s front end, for instance.

I know what a data warehouse is for, but this whole setup doesn't even cover the reporting system, which itself is a level removed from any actual product decisions and whether those decisions result in incremental revenue to justify this cost. My company is a fraction the size of Shopify, but we have a robust data pipeline and reporting on millions of users run by two people and off-the-shelf software.

I’ll never understand why individuals always default to cloud offerings when they are extremely expensive compared to a dedicated tool.

It’s easy to understand when you have lots of money, but no time. Cloud is simple and expensive, self managed is complex and cheap. Time’s money and all that!

Except for this workflow.

You won’t get around any of the problems of airflow by moving to a cloud offering.

I don’t need to manage the cloud offering and that management time is expensive. Your befuddlement at this simple economic calculation is, well, befuddling.

Why hire an expensive janitorial service to clean your office? Why hire a mechanic to fix your car?

It's shocking that some people cannot fathom that in certain scenarios cloud offerings make sense.

They don't always make sense: in certain scenarios it is worth taking an open-source, cloud-independent tool; in some scenarios you can roll your own; but there are circumstances where using a tool your cloud provider gives you is a good choice.

Because they don’t know what they’re doing and aren’t the ones paying the bill.

“Oh, I have to learn how to use and set up this tool? I think I’ll just pay the equivalent salaries and be locked in…”

You're going to pay the bill regardless, whether the employee is hired by your team or hired by the cloud vendor.

I don't know how management works through this math. Maybe managing people gets exhausting and they just want to outsource it so leadership doesn't have to deal with it and can focus on the core product.

And the above "I don't want to deal with it" reason isn't spoken of; the more commonly touted benefit is cloud's "flexibility". Sure, but this is actually _really_ expensive. Every cloud migration effort I've experienced is only just worthwhile to begin to talk about because the costs are based on long-term contracts for cloud resources, not the per-hour fees. Nice flexibility.

With that said, the cloud may be a good place for prototyping, where the infrastructure isn't the core value add and demand is uncertain. A start-up is a prototype, and so here we are. But for an established company to migrate to the cloud and fire the staff maintaining the on-premise resources... I'm skeptical. More than likely, this leads to maintaining both cloud and on-premise resources, not firing anyone, and thus actually increasing costs for an uncomfortably long time.

And for the folks on the ground, who don't pay the bills, the increase of accidental complexity is rather painful.

I'm very much paying my own cloud bills but there is no chance I would be able to orchestrate some of the workflows I want to orchestrate if it were not for Step Functions.

For a one person shop like me, AWS is a force multiplier. With it, I can do (say) 30% of what a dedicated engineer in a specific role could do. Without it, I'd be doing 0%.

I really like this tradeoff for my particular situation.

It's easy to get started and you don't need to worry about infra.

I've been a one-man army at places because leveraging these cloud offerings allows me to crank out working software that scales to the moon without much thought.

I'd rather pay AWS/GCP to handle infra, so that I can get 2-3x as many projects done.

None of the airflow problems in this thread are due to infrastructure, so how does using a cloud service solve anything?

Fast time to market with a fraction of the effort?

> As mentioned elsewhere, AWS step functions are really the best in orchestration.

Why? Where else is this mentioned?

very basic use cases:

* run job for given reporting_period

* backfill data for reporting_period between N and M

* lookup failed jobs

Nothing in Databricks supports that.

Databricks PM here.

Backfills are on our roadmap. We are previewing looking up failed jobs soon.

Email me at bilal dot aslam at Databricks dot com if you want more info

If you have the headcount for people just to build/support Airflow, please do yourself a favor and give that money to Astronomer.io. Their offering is stupid good. There's 20 different reasons why paying them is a much better idea than managing Airflow yourself (including using MWAA), and it's dirt cheap considering what you get.

Last time I checked, they asked for a significant minimum $ + 1 year commitment.

I wish they had a "start small", self service, clear pricing option.

I'm bullish about Dagster nowadays. Though, I don't have a lot of experience with Airflow. Figured I'd ask if anyone has switched from Airflow to Dagster and has any comments?

I participated in migrating around 100 fairly complicated pipelines from Airflow to Dagster over six months in 2021. We used the k8s launcher, so this feedback does not apply to other launchers, e.g. Celery.

Key takeaways were roughly these:

- Dagster's integration with k8s really shines as compared to Airflow, it is also based on extendable Python code so it is easy to add custom features to the k8s launcher if needed.

- It is super easy to scale the UI/server component horizontally, and since DAGs were running as pods in k8s, there was no problem scaling those as well. The scheduling component is more complicated; e.g. builtin scheduling primitives like sensors are not easily integrated with state-of-the-art message queue systems. We ended up writing a custom scheduling component that read messages from Kafka and created DAG runs via the networked API. It was like 500 lines of Python including tests, and worked rock-solid.

- The networked API is GraphQL while Airflow's is REST; both are really straightforward, but Dagster's felt better designed, maybe due to the tighter governance of Dagster's authors over the design.

- The DAG definition Python API, e.g. solid/pipeline, or op/graph in the newer Dagster API, is somewhat complicated compared to Airflow's operators; however, it is easy to build a custom DSL on top of it. One would need a custom DSL for complicated logic in Airflow as well, and in Dagster's case it felt easier to generate its primitives than to do never-ending operator combinations in Airflow.

- Unit and integration testing are much easier in Dagster, the authors put testing as a first-class citizen, so mocks are supported everywhere, and the code tested with local runner is guaranteed to execute in the same way on k8s launcher. We never had any problems with test environment drift.

The biggest caveat was a full change of internal APIs in 0.13, which forced the team to execute a fairly complicated refactor due to the deprecation of features we were depending on, e.g. execution modes. Had we spent more time on the Elementl Slack, it would have been easier to avoid depending on those features ^__^
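A sketch of what such a custom Kafka-to-DAG-run scheduling component might look like. The mutation name and variable shape are illustrative guesses at a Dagster-style GraphQL API, not its exact schema, and the Kafka consumer is assumed to deliver JSON message bodies:

```python
import json

# Illustrative only -- consult the orchestrator's GraphQL schema for the real shape.
LAUNCH_MUTATION = """
mutation LaunchRun($executionParams: ExecutionParams!) {
  launchPipelineExecution(executionParams: $executionParams) { __typename }
}
"""

def message_to_graphql_request(message: bytes, pipeline: str) -> dict:
    """Turn one Kafka message body into the JSON body of a run-launch GraphQL call."""
    payload = json.loads(message)
    return {
        "query": LAUNCH_MUTATION,
        "variables": {
            "executionParams": {
                "selector": {"pipelineName": pipeline},
                "runConfigData": {"ops": {"config": payload}},  # hypothetical run config
            }
        },
    }
```

The real component would wrap this in a consumer loop and POST each request to the orchestrator's GraphQL endpoint; the point is that a few hundred lines of glue like this can replace built-in sensors when you already have a message queue.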

At my previous employer, we were running self-hosted Airflow in AWS, which really was a nightmare. The engineer that set it up didn't account for any kind of scaling and all the code was a mess. We would also get issues like logs not syncing correctly in our environment or transient networking issues that somehow didn't fail the given Airflow task. Eventually, we did a dual migration: temporarily switching to AWS managed Airflow (their Amazon Managed Workflows for Apache Airflow product) while also rewriting the DAGs in Dagster.

Dagster was a great solution for us. Their notion of software defined assets allowed us to track metadata of the Redshift and Snowflake tables we were working with. Working with re-runs and partitioned data was a breeze. It did take a while to onboard the whole team and get things working smoothly, which was a bit difficult because Dagster is still young and they were often making changes to how parts of the system worked (although nothing that was immediately backwards incompatible).

We also enjoyed some of the out of the box features like resources and unit testing jobs. Overall, I think it made our team focus more on our data and what we wanted to do with it rather than feeling like we had to wrangle with Airflow just to get things running.

Thanks for your comment! Ditto: the last time I ran Airflow locally it took like 5 Docker containers. Then I forgot about the project and for a while was furious at Docker for randomly taking 100% CPU. Then I realized it was the Airflow containers restarting along with Docker. I didn't get much further with Airflow.

Dagster, on the other hand, seems to let you scale from using it locally as a library all the way to running on ECS/K8s etc. Along with that there's unfortunately a ton of complexity in setting it up but that's not Dagster's fault and it seems like Dagster works once you get it set up. Agree about it being young and there being some rough spots but it's got lots of good ideas. We were nearly done setting it up but got pulled off onto more urgent things, so I haven't run it in production yet. I'm glad to hear it worked well for you!

out of curiosity, why was it hard to onboard the team to Dagster?

Dagster is extremely nice to work with. I did a bakeoff of Prefect vs Dagster internally at my current employer, and while we ended up going with Prefect for reasons, I am still so impressed with the way Dagster approaches certain pain points in the orchestration of data pipelines and its solution for them.

> for reasons

I'd love to hear more on this. I've not evaluated Prefect, and am currently keeping an eye on Dagster. What trade-offs does Prefect win?

The reasons were related more to accessibility and the data team's ability to fold the orchestration framework into their workflow and not be constrained by it. A lot of that was on me not having the time to make it easy to adopt, but Prefect just offered immediate adoption (being able to shell out, run notebooks, arbitrary Docker containers or k8s pods, in addition to a very unobtrusive decorating pattern) that was too great to pass up.

What Dagster has going for it in this space is pragmatism. It really nails all of the problem points of data ops (with resources & sensors specifically). If I was consulting for a shop that needed data pipelines and they had good eng, I'd recommend Dagster in a heartbeat.

I did a baby bakeoff internally in my prior role ~18mo ago now. Prefect felt nicer to write code in but perhaps not as easy to find answers in the docs (though their Slack is phenomenal). Ended up going with Prefect so I could focus on biz/ETL logic with less boilerplate, but I'm sure Dagster is not a bad choice either. Curious to hear about parent's experience

Could anyone comment on Temporal vs Airflow?

After having a lot of pain points with an (admittedly older and probably not best-practices) Airflow setup, I am now at a different job running similar types of workflows on Temporal - we're pretty happy with it so far, but haven't done anything crazy with it.

I'm also curious about this. The folks I hear about Temporal from seem to be very disjoint from Airflow users, and Temporal's python client is still alpha-stage.

It seems notable to me that the big Prefect rewrite mentioned elsewhere [0] leans into the same "workflow" terminology that Temporal uses. I have to wonder if Prefect saw Temporal as superseding the DAG tools in coming years and this is them trying to head that off.

That post's discussion of DAG vs workflow also sounds a lot like why PyTorch was created and has seen so much success. Tensorflow was static graphs, pytorch gave us dynamism.

[0] https://www.prefect.io/blog/announcing-prefect-orion/

I know airbyte.io (elt platform) is built on top of Temporal, but I haven't used it.

Yes, Airbyte is using Temporal. Here is a blog post they wrote a few weeks ago that goes into more detail about it: https://airbyte.com/blog/scale-workflow-orchestration-with-t...

I’d love to hear that too :)

Hi, Tom from Temporal here. I don't have a lot of experience with Apache Airflow personally, but I was at Cloudera when it was added to our Data Engineering service, so I learned about it at the time. Here are a few things that come to mind:

* Both Apache Airflow and Temporal are open source

* Both create workflows from code, but the approach is different. With Airflow, you write some code and then generate a DAG that Airflow can execute. With Temporal, your code is your workflow, which means you can use your standard tools for testing, debugging, and managing your code.

* With Airflow, you must write Python code. Temporal has SDKs for several languages, including Go, Java, TypeScript, and PHP. The Python SDK is already in beta and there's work underway for a .NET SDK.

* Airflow is pretty focused on the data pipeline use case, while Temporal is a more general solution for making code run reliably in an unreliable world. You can certainly run data pipeline workloads on Temporal, but those are a small fraction of what developers are doing with Temporal (more here: https://temporal.io/use-cases).
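The "run reliably in an unreliable world" point, in miniature. This is plain Python, not the Temporal SDK: Temporal makes this kind of retry durable across process crashes and machine failures, while this sketch only shows the in-process control flow the SDK hides from you:

```python
import time

def run_reliably(activity, retries=5, base_delay=0.1):
    """Naive in-process retry with exponential backoff.

    A durable workflow engine persists this state so it survives
    process restarts; here it lives only in memory.
    """
    for attempt in range(retries):
        try:
            return activity()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```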

Do you see Temporal as being a super-set of DAG managers like Airflow/Dagster/Prefect, or do you see uses where those tools would be a better choice than Temporal?

All three are relatively similar, in terms of being designed around a DAG model and focused on data pipelines. If your workflow fits neatly into that box and your team already has experience using one of them successfully, then the effort involved in switching might outweigh the gain.

While my experience with those specific tools is limited, I have used other DAG-based systems (notably Apache Oozie) quite a bit. I can't think of anything I did with those that I couldn't do with Temporal.

Temporal is distinct in that it's neither limited to DAGs nor to data pipelines. I suppose it is therefore a super-set, even though I don't really think of it that way (much as I don't think of Go or Rust as a super-set of bash).

while you're here, we'd love for Temporal to have a way to control max concurrent executions of a given workflow type!

I echo other comments. Running and managing Airflow beyond simple jobs is complicated. But if you are running and managing Airflow for simpler jobs, you might not need Airflow at all.

One data center company that I know of uses airflow at scale with docker and k8s. They have a huge team of devops just to manage the orchestrator. They in turn have to fine tune the orchestrator to run smoothly and efficiently. Similar to what shopify has noted here, they have built on top of and extended airflow to take care of pain points like point 4. For companies like this it makes sense to run airflow.

Another issue I see with companies/engineers who adopt Airflow is that they use it as a substitute for a script rather than as an orchestrator. For example, say you want to download files from an API, upload to s3, load it to your warehouse (say Snowflake), and do some transformations to get your final table. Instead of writing separate scripts for each step of fetch/upload/ingest/transform and calling each step from the DAG, they end up writing everything as tasks in a DAG. A huge disadvantage is a lot of code duplication. If you had a script as a CLI, all your DAG/task has to do is call the script with the respective args. I agree that Airflow comes with a lot of convenience wrappers to create tasks for many things, but I feel this results in losing flexibility.
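Concretely, the CLI-first pattern looks something like this (a hypothetical standalone step; the names and flags are just for illustration):

```python
import argparse

# Hypothetical standalone CLI for one pipeline step. The DAG never
# imports this module; a task just shells out to it, e.g. something like
#   bash_command="python ingest.py --source api --date {{ ds }}"
# so the script stays runnable and testable outside Airflow entirely.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Fetch files and load them")
    parser.add_argument("--source", required=True)
    parser.add_argument("--date", required=True)
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    # ... fetch from args.source, upload to S3, load to the warehouse ...
    return args
```

Swapping the orchestrator later then only means changing what invokes the script, not the script itself.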

This also results in tying their workflow to Airflow, and for any change they need, they have to modify their Airflow code directly. If you want to modify how/what you upload to s3, you end up writing/modifying Python functions in the respective DAGs' code. This removes the flexibility to modify/substitute any component of the workflow with something else, or even change the orchestrator from Airflow to something else. Additionally, different teams might write workflows in different ways; standardization of practice is really hard. This in turn results in pouring more investment into maintaining and hiring "airflow data engineers". Companies fall into steep tech debt.

Prefect/dagster are new orchestrators in town. I'm yet to try them out but I've heard mixed reviews about them.

EDIT: Forgot about upgrades. A lot of upgrades are breaking changes, especially the recent 1 -> 2 change. You end up spending a lot of time just trying to debug what went wrong. Just installing and running it is a pain.

We've established a rule that all "custom" code (anything that isn't a preexisting operator in airflow) needs to be contained in a docker image and run through the k8s pod operator. What's resulted is most folks do exactly what you said. They create a repo with a simple CLI that runs a script and the only thing that gets put in our airflow repo is the dependency graph/configuration for the k8s jobs.
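The config-to-task mapping in that airflow repo can stay tiny. A sketch of what such a helper might look like (field names are my own; it returns plain kwargs so it's testable without Airflow installed, and the real DAG file would splat them into a KubernetesPodOperator):

```python
# Sketch: map one job entry from our config to pod-operator kwargs.
# In the DAG file you'd do: KubernetesPodOperator(**pod_task_kwargs(job))
def pod_task_kwargs(job: dict, default_namespace: str = "data-jobs") -> dict:
    return {
        "task_id": job["name"],
        "name": job["name"],
        "image": f'{job["image"]}:{job.get("tag", "latest")}',
        "arguments": job.get("args", []),
        "namespace": job.get("namespace", default_namespace),
        "get_logs": True,               # stream container logs into Airflow
        "is_delete_operator_pod": True,  # clean up finished pods
    }
```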

AFAICT this is the now-recommended way to use Airflow: as a k8s task orchestrator. Even the Astronomer team (original Airflow authors) will tell you to do it this way.

Love your observation about tying the workflow to Airflow.

One of my biggest annoyances in the orchestration space is that teams are mixing business logic with platform logic, while still touting "lack of vendor lock-in" because it's open source. At the point that you're importing Airflow specific operators into your script and changing the underlying code to make sure it works for the platform (XCom, task decorators, etc.), you are directly locking yourself in and making edits down the road even more difficult.

While some of the other players do a better job, their method of "code as workflow" still results in the same problems, where workflows get built as a "mega-script" instead of as modular components.

I'm a co-founder at Shipyard, a lightweight hosted orchestrator for data teams. One of our core principles is "Your code should run the same locally as it does on our platform". That means 0 changes to your code.

You can define the workflow in a drag and drop editor or with YAML. Each task is its own independent script. At runtime, we automatically containerize each task and spin up ephemeral file storage for the workflow, letting you run scripts one after the other, each in their own virtual environment, while still sharing generated files as if you were running them on your local machine. In practice, that means individual tasks can be updated (in app or through GitHub sync) without having to touch the entire workflow.

I'm biased, but it seems crazy to me that so many engineers are willing to spend hours fighting the configuration of their orchestration platform rather than focusing on solving the problems at hand with code.

I tried to run airflow; I found pretty much everything about it to be wrong for my usecase. Why can't I easily upload a workflow through the UI? Why doesn't it handle S3 file staging for me?

It definitely takes some time getting used to the quirks of Airflow. I know it took 6 months of running it at my last gig to really understand what was happening underneath the UI.

With great control comes great responsibility.

actually I concluded it was just not that great a workflow engine. It's probably just intended for a different use case than mine.

Would love to hear more about your use case and your issues -- you can sign up on our website (magniv) or send me an email jon at our domain.

We operate a (small?) Airflow instance with ~20 DAGs but, one of those dags has ~1k tasks. It runs on k8s/aws setup with a MySQL backing it.

We package all the code in 1-2 different Docker images and then create the DAG. We've faced many issues (logs out of order, missing, random race conditions, random task failures, etc.)

But what annoys me the most is that for that 1 big DAG, the UI is completely useless: the tree view has insane duplication, the graph view is super slow and hard to navigate, and answering basic questions like what exactly failed and what nodes are around it is not easy.

At Airbnb, we were using SubDAGs to try to manage a large number of tasks in a single DAG. This allowed organizing tasks and drilling down into failures more easily, but came with its own challenges.

In more recent versions of Airflow, TaskGroups (https://airflow.apache.org/docs/apache-airflow/stable/concep..., https://www.astronomer.io/guides/task-groups/ ) were added to help with this. Hopefully that helps a bit.

At ~1k nodes in the graph introspection becomes hard anyway, as others have suggested, breaking it down if possible might be a good idea.

We had a similar DAG that was the result of migrating a single daily Luigi pipeline to Airflow. I started identifying isolated branches and breaking them off with external task sensors back to the main DAG. This worked but it's a pain in the ass. My coworker ended up exporting the graph to graphviz and started identifying clusters of related tasks that way.

I've not had the best luck with ExternalTaskSensors. There have been some odd errors like execution failing at 22:00:00 every day (despite the external task running fine).

Also, the @task annotation provides no facilities to name tasks. So if you like to build reusable tasks (as I do), you end up with my_generic_task__1, my_generic_task__2, my_generic_task__n. I've tried a few hacks to dynamically rename these, but I just ended up bringing down my entire staging cluster.

`your_task.override(task_id="your_generated_name")` not working for you?

I got pretty excited when I read this response, but no, it doesn't work. I'm not sure how this would work since annotated tasks return an xcom object.

Can you point me to the documentation on this function? It's possible I'm not using it correctly.

I can do something like this, which works locally, but breaks when deployed:

    res = annotated_task_function(...)
    res.operator.task_id = 'manually assigned task id'


    @task
    def my_func():
        ...

This still has the problem that, when you call my_func multiple times in the same dag, the resulting tasks will be labelled my_func, my_func__1, my_func__2, ...

How about the dynamic task mapping that is now available in 2.3?

Does this imply file metadata content can affect the access performance of those files even for operations that do not directly concern the metadata?

Airflow brought one of the best tools, with a nice UI, for running pipelines back in 2014-2016. But nowadays engineers should be aware of easier-to-use options and not choose Airflow blindly as the default. IMHO for 80-90% of cases an orchestration system should not use code at all: it should be DAGs as configuration. Airflow is popular, and teams keep choosing it for building simple DAGs, incurring otherwise avoidable Airflow maintenance costs.

Databricks Orchestration pipeline, AWS Step Functions - good examples of DAGs as a configuration.

I've used AWS Step Functions extensively over the past several years, and I'll take code over the Step Functions JSON config every day of the week. Once you get beyond a few simple steps, it gets very hard to look at the config and understand what's going on, especially when you haven't looked at it in a while. The DAG visualizer definitely helps, but as soon as things get beyond the trivial I long for a different tool.

AWS offers a service for managed Airflow: https://aws.amazon.com/managed-workflows-for-apache-airflow/

Makes me wonder if Amazon internally was using Step Functions, ran into issues trying to scale to larger graphs, realized multiple teams were using Airflow, and created the Managed Airflow service.

Why not just model the json as objects in (insert favorite language) and then use that code to generate the json?

Ah yes, a home-made framework to generate configurations for your framework that's supposed to make your life easier. That way you can maintain your code that maintains your configs that make it easier to run your code that you have to maintain!

Actually, yes. It allows for easier unit and integration testing as well. The original complaint is that things were getting hard to read and they wished there was code for this. It seems logical to create a framework for the json configuration files so that they can be easily mocked and tested. As someone who greatly values spending time on automated testing, it seems weird to not think of it this way.
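As a minimal sketch of the idea: model the states in code and emit the Amazon States Language JSON at build time. The `TaskState`/`chain_to_asl` names are my own; the emitted field names (Type/Resource/Next/End/StartAt) follow the ASL spec.

```python
import json
from dataclasses import dataclass

@dataclass
class TaskState:
    name: str
    resource: str  # e.g. a Lambda ARN

def chain_to_asl(states, comment=""):
    """Emit a linear chain of Task states as Step Functions JSON."""
    asl_states = {}
    for i, s in enumerate(states):
        body = {"Type": "Task", "Resource": s.resource}
        if i + 1 < len(states):
            body["Next"] = states[i + 1].name
        else:
            body["End"] = True
        asl_states[s.name] = body
    return json.dumps(
        {"Comment": comment, "StartAt": states[0].name, "States": asl_states},
        indent=2,
    )
```

The objects are unit-testable, and the generated JSON can still be linted or statically analyzed like any other config artifact.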

Quick google shows that others have done things like this already...

[1] https://noise.getoto.net/2021/10/14/using-jsonpath-effective...

[2] https://aws.amazon.com/about-aws/whats-new/2022/01/aws-step-...

[3] https://docs.aws.amazon.com/step-functions/latest/dg/sfn-loc...

This can be a much better approach than upgrading the DAG description language to a true programming language. It forces anything complex to happen at build time where it can do less damage. Plus, we can often use the same library to do static analysis on the output

Metaflow provides a similar concept to interface with Step Functions and Argo Workflows in Python - https://docs.metaflow.org/going-to-production-with-metaflow/...

I was, until a week or two ago, part of a team that built datasets with extensive dependencies (thus, complicated DAGs).

v1 of the system, built before I joined, was Step Functions and the like. It gets hairy just as you say.

v2 I built and designed with the lead data engineer, we called it Coriaria originally. We're hoping/planning to open source it eventually, although it's a little wrapped around our company's internal needs & systems.

It chooses neither "config" strictly speaking nor "code" for the DAG, instead the primary representation/state is all in the PostgreSQL database which tracks the dataset dependencies and how each dataset is built. It's a DAG in PostgreSQL as well.

To make dataset creation and management easier, I also wrote a custom Terraform provider for Coriaria. This made migrating datasets into the new system dramatically faster. The provider is really nice, supports `terraform import` and all that. Currently we have it set up so that there are separate roles/accounts that can modify an existing dataset, but reading state only requires authentication, not authorization. This enables one team to depend on another team's dataset as an upstream data source for their datasets without granting permission to modify it or create a potentially stale copy of the dataset. Terraform's internal DAG representation of the resource dependencies is leveraged because "parent_datasets" references the upstream datasets directly, including the ones we don't build.

We're able to depend on datasets we don't build ourselves because the system has support for Glue catalog backends to track and register partition availability.

Currently, it builds most of the datasets using AWS Athena & S3, however this is abstracted over a single step function. There's no DAG of step functions, it's just a convenient wrapper for the Athena query execution.

The system also explicitly understands dataset builds and validations as separate steps. The dashboard makes it easy to trace the DAG and see which datasets are blocking a dataset build.

We're adding more integrations to it soon so that other ways of kicking off dataset builds and validations are available.

If people are interested in this I can begin lobbying for open sourcing the system. My colleague wanted to open source it as well.

All else fails, I'll rebuild it from scratch because I don't like the existing solutions for managing datasets. We've been calling it a data-flow orchestration system or ETL orchestration system, not sure what would be most meaningful to people.

I think the main caveat to this system is that I'm not sure how much use it'd be for streaming data pipelines, but it could manage the discretization of streaming into validated partitions wherever streamed data is sunk into. Our operating assumptions are that you want validated datasets to drive business decisions, not raw event data streamed in from Kafka. Making sure the right data is located in each daily (or hourly) partition is part of that validation.

Do you have more examples for better tools, ideally open source (unlike AWS Step functions)?

Flowable is a BPMN system. You can do a lot of async calls with it. https://www.flowable.com/open-source

We use it for a complex pricing process that invokes 30-40 micro services that can take up to minutes per step.

Kamelets and the Karavan UI in combination with k8s and Knative for "serverless" integrations looks interesting.

We at magniv.io are building an alternative.

Our core is open source https://github.com/MagnivOrg/magniv-core

We can set you up with our hosted version if you would like to poke around!

We are using Prefect + dbt and I like it, although they are doing a huge rewrite at the moment.

> teams keep choosing it for building simple DAGs

I am part of one such team. We were using Windows Task Scheduler on a Windows VM to run jobs, and we figured it would be a nice idea to (dramatically) modernise and move to Airflow, but we grossly underestimated the complexity, learning curve, and surrounding tools it requires. In the end we (the data science team) didn't get a single production task up and running. The data engineers had much more success with it though, probably because they dedicated much more time to it.

Will look forward to trying AWS Step Functions.

I tried installing Airflow locally to just play around with it and make sense of what it's good for and finally gave up after a few days - the install alone is insanely complicated, with lots of tricky hidden dependencies.

Did you try installing with Docker? You would just install Docker, run `docker-compose up --build`, and you'll be good to go locally (usually).

I can second this. We were up-and-running with Docker on our dev machines in just a few minutes. A native installation involves substantially more setup (Python, databases, Redis and/or Rabbit, etc.). The published docker-compose file will handle all of that for you. We have a very small data engineering team and have been able to move very quickly with Docker and AWS ECS (for orchestrating containers in test and prod environments).

download the astronomer CLI and it will spin up a local dockerized instance of airflow in about 1 minute tops https://docs.astronomer.io/astro/install-cli

Well written article. One question I always have when reading such an article. Is it really worth it for these kinds of companies to run Airflow on Kubernetes. You could also run it for example on AWS Batch with Spot instances.

Running Airflow on Kubernetes has been one of the most painful data engineering challenges I've worked on.

We kept hearing this from our users. We’ve just released our k8s operator based deployment of Orchest that should give you a good experience running an orchestration tool on k8s without much trouble. https://github.com/orchest/orchest

(We extended Argo, works fantastically well by the way!)

How so? Did you have any existing Kubernetes knowledge? We found it fairly easy to deploy using the community Helm chart (official chart wasn't out yet).

Did you have any previous experience running workloads in k8s before?

Running the Airflow Helm chart is pretty straightforward, even with more "complex" use cases like heterogeneous pods for different task sizes.

One of our engineers had to make 2 contributions to the Bitnami Helm chart for airflow for some reason.

... the service seems to be centrally managed. A lot of the pain points are clearly "everyone running in the same instance" or kind of similar. Sure makes for big brag points in the numbers.

Sounds like basic SaaS needs to be provided as a capability, while the teams spin up their instances and shard to their needs.

One of the problems with enterprise workflows is putting everything together. Workflows are already cacophonous. A cacophony of cacophonies is madness.

Tangential to this thread: what sites, sources, etc. are people who work on modern data pipelines (engineering and analysts) using to follow the latest news, products, techniques, etc.? It's been hard to keep up without Meetups and such the last couple of years. I'm finding a lot of people's comments here pretty interesting, and they're showing me things I hadn't heard of. Thanks.

I've had really great success from engaging with the Locally Optimistic Slack community.

Also, Christophe Blefari has an excellent data newsletter. https://www.blef.fr/

And Modern Data Stack has a newsletter, tool information, Q&A www.moderndatastack.xyz

I'm also interested in this topic, but can't find anything other than "Top 10 things you should STOP doing as a data engineer" etc. content-mill, clickbait on Medium and other sites.

Yes, this. I'd like to get less of the "Marketing sales stuff", and more in the trenches with the actual engineering teams.

Slack groups have filled in the meetup space in my life, mlops.community and locally optimistic are two of the best for what it sounds like you're looking for

I follow the Analytics Engineering Roundup weekly email. It's published by dbt Labs but isn't overtly promotional.


Thanks. We're starting to use DBT, too. I know the forums over at DBT are pretty good, too.

Data Twitter and Linkedin are great, there are a lot of people putting out some really good content. There are also a lot of substacks you can sign up for. Data Engineering Weekly is my fave

If your flow is more linear-looking than a complex DAG and you want a full-featured web editor (with LSP), automatic dependency handling, and TypeScript (Deno) and Python support, I am building an OSS, self-hostable Airflow/Airplane alternative at: https://github.com/windmill-labs/windmill

You write the modules as normal python/deno scripts, we infer the inputs by statically analyzing your script parameters and we take care of the rest. You can also reuse modules made by the community (building the script hub atm).
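The parameter-inference idea looks roughly like this. To be clear, this is my own sketch of the concept using `inspect.signature`, not Windmill's actual implementation (which statically analyzes the source rather than importing it):

```python
import inspect

def infer_inputs(func) -> dict:
    """Derive an input schema for a script's entrypoint from its signature."""
    schema = {}
    for name, param in inspect.signature(func).parameters.items():
        if param.annotation is inspect.Parameter.empty:
            type_name = "any"
        else:
            type_name = getattr(param.annotation, "__name__", "any")
        schema[name] = {
            "type": type_name,
            "required": param.default is inspect.Parameter.empty,
            "default": (None if param.default is inspect.Parameter.empty
                        else param.default),
        }
    return schema

# A user script's entrypoint; the UI form is generated from this signature.
def main(city: str, retries: int = 3):
    ...
```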

Thank you! Exactly what I was looking for.

So interesting that a lot of comments seem to be negative experiences. I haven't used Airflow at scale yet but would love to convert our extremely limited, internally built orchestrator + jobs over to Airflow. I think it would allow us to scale, at least for some time. I think a lot of companies are still really behind the times. Our DAGs are fairly simple, and Airflow has been a major improvement in my testing. The UI is great for helping me debug jobs / monitor feed health / backfill. DAG writing has been a bit frustrating but is a much improved format over the internal systems we have. Am I just naive? Is everyone writing extremely complex graphs? Is this operational complexity due mostly to K8s (I've just been playing with Celery)? Anyone enjoying using Airflow?

The problems in this article and in the comments are some of the things we have heard at Magniv in the past few months when talking to data practitioners. We are focused on solving some subset of these problems.

Personally, I think Airflow is currently being un-bundled and will continue to be with more task specific tools.

At the very least, if un-bundling doesn't occur, Prefect and Dagster are working hard to solve lots of these issues with Airflow.

Evolution of products and engineering practices is not linear and sometimes doesn't even make sense when viewed a posteriori (as much as I would like it to follow some logical process). It will be interesting to see how this space develops in the next year or so.

Is anybody out there doing anything interesting with Airflow monitoring?

At my startup Cronitor we have an Airflow SDK[0] that makes it pretty easy to provision monitoring for each DAG, but essentially we are only monitoring that a DAG started on time and the total time taken. I keep thinking about how we could improve this, and it would be great to hear about what's working well today for monitoring.

[0] https://github.com/cronitorio/cronitor-airflow

I'm working at https://databand.ai — a full-fledged solution for Apache Airflow monitoring, data observability, and lineage. We have Airflow sync, integrations with Spark/Databricks/EMR/Dataproc/Snowflake, configurable alerts, dashboards, and much more. Check it out.

Can someone enlighten me whether Apache Airflow is suitable as a business process engine?

We have something like orders: people put orders into our system, and some are imported from an external system. We have somewhere around 100-1000 orders per day, I think. Each order goes through several states, like CREATED, SOME_INFO_ADDED, REVIEWED, CONCLUSION_CREATED, CONCLUSION_SENT_TO_EXTERNAL_SYSTEM, and so on. Some state transitions are quick, a few milliseconds to call some web services; some take 5 minutes of an operator's time; some take a few days. This logic is encoded in our program code. We have plenty of timers, and every timer usually transfers orders from one state to another. This is further complicated by the fact that the processing is done across several services, so it's not a single monolith but some kind of service architecture.

Our management wants something to have clear monitoring, so you can find a given task by some property values, monitor its lifetime, check logs for every step, find out why it's failing, etc.

What I usually see is that Apache Airflow is used more like cron replacement. I've read some articles but it's still not clear whether it could be used as a business process engine. I had some experience with Java BPMN engines in the past, it was not very pleasant, but I guess time moved on.

Apache Airflow is a tool for *analytical data* pipelines. If you want to create a daily/monthly data feed from your operational DB, move it into a data warehouse, run jobs to process/aggregate that data, and create daily/weekly/monthly reports, Airflow may help with all of those.

This is exactly the use case Temporal is meant for. I've used the Java SDK and it's a pretty pleasant experience. Definitely worth taking a look at it.

A friend of mine wanted an ETL (SQL Server to BQ for analysis and dashboarding) set up and I ended up stumbling across Airflow. I spun up two VMs on GCP, one for Airflow and the other for the Postgres DB to store the metadata.

- One thing I've noticed is Airflow generates a ton of logs that will fill up your disk quite fast. I started with 100GB and I'm now at 500GB; granted, disk space isn't expensive, but even with a few DAGs I'm surprised at how quickly it fills up. Apparently you need a DAG to run to clear those logs, but I was too lazy so I just purge the logs using a cron job.

- The SQL Server Operator is buggy, I filed an issue with the Airflow team but I had to do some hacky stuff to get it to work.

- Even with a few DAGs, Airflow will spike the CPU utilization of the VM to 100% for X minutes (in my case about 15 minutes) which is quite interesting. My tasks basically query SQL Server -> dump to CSV (stored on GCS) -> import to BQ.

- My DAGs execute every hour, and if Airflow is down for X hours and I then resolve the issue, it will try to run all the tasks for the hours it was down, which isn't ideal because it takes hours to catch up. So I've had to delete tasks and only run the most recent ones.

Granted my set up is pretty simple and YMMV, but Airflow has done what it needs to do albeit with some pain.

> Even with a few DAGs, Airflow will spike the CPU utilization of the VM to 100% for X minutes (in my case about 15 minutes) which is quite interesting. My tasks basically query SQL Server -> dump to CSV (stored on GCS) -> import to BQ.

Have you checked why that is? Airflow re-imports DAG files every few seconds. We had an issue where it didn't honor the .airflowignore file, making it execute our tests every few seconds. The easy solution was to put them into the Docker ignore file.
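For anyone hitting the same thing: `.airflowignore` lives in the DAGs folder and lists patterns for paths the scheduler should skip when parsing. If I remember right, the patterns are regexes by default, so something like:

```
# .airflowignore (in the DAGs folder): patterns for files the
# scheduler should not try to parse as DAG definitions
tests/
.*_test\.py
```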

You might also have too much logic at the top level of your DAG files. It's recommended to avoid even doing heavy imports at the top level, to keep parsing fast.
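The deferred-import pattern looks roughly like this. Since the scheduler re-parses every DAG file continuously, anything at module top level runs on every parse loop; `sqlite3` here just stands in for a genuinely heavy dependency like pandas or a cloud SDK, and the names are illustrative:

```python
# Anti-pattern: executed on EVERY scheduler parse of this DAG file.
#   import pandas as pd
#   LOOKUP = pd.read_csv("s3://bucket/lookup.csv")   # network call per parse!

def transform_fn(**context):
    # Pattern: defer the heavy import and any real work until the task
    # actually executes, so DAG-file parsing stays cheap.
    import sqlite3
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rows (v INTEGER)")
    conn.executemany("INSERT INTO rows VALUES (?)", [(1,), (2,), (3,)])
    return conn.execute("SELECT SUM(v) FROM rows").fetchone()[0]
```

The callable would then be handed to a PythonOperator (or similar) so the import cost is paid once per task run instead of once per parse loop.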

Not saying it's not an odd tool though.

FWIW if you don't need Airflow to catch up and backfill missed tasks, you can either set catchup=False on the DAG or use a LatestOnlyOperator.
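The catchup semantics, roughly sketched in plain Python (this is an illustration of the behavior, not Airflow's actual scheduler code): with catchup on, every missed interval gets a run; with it off, only the most recent one does.

```python
from datetime import datetime, timedelta

def runs_to_launch(last_run, now, interval, catchup):
    """Which runs a scheduler would launch after downtime (sketch)."""
    missed = []
    t = last_run + interval
    while t <= now:
        missed.append(t)
        t += interval
    # catchup=True: backfill every missed interval; False: latest only
    return missed if catchup else missed[-1:]

# Hourly DAG, scheduler down for 5 hours:
# catchup=True  -> 5 runs (01:00 through 05:00)
# catchup=False -> 1 run  (05:00 only)
runs = runs_to_launch(datetime(2022, 1, 1, 0), datetime(2022, 1, 1, 5),
                      timedelta(hours=1), catchup=False)
```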

I have catchup=False set on the DAG, but that hasn't stopped Airflow from backfilling missed tasks. Not sure why that is the case?

So, being completely transparent, we're the creators of Direktiv (https://github.com/direktiv/direktiv). We're genuinely curious to have users who have previously used Airflow and other DAG tools (Argo Workflows is mentioned in this thread) try Direktiv and give us feedback.

- direktiv runs containers as part of workflows from any compliant container registry, passing JSON structured data between workflow states

- JSON structured data is passed to the containers using HTTP protocol on port 8080

- direktiv uses a primitive state declaration specification to describe the flow of the orchestration in YAML, or users can build the workflow using the workflow builder UI

- direktiv uses jq JSON processor to implement sophisticated control flow logic and data manipulation through states

- Workflows can be event-based triggers (Knative Eventing & CloudEvents), cron scheduling to handle periodic tasks, or can be scripted using the APIs

- Integrated into Prometheus (metrics), Fluent Bit (logging) & OpenTelemetry (instrumentation & tracing)

If you have time, please give it a try and jump on our Slack to give us feedback!

I think airflow ends up creating as many problems as it solves and kind of warps future development patterns/designs into its black hole when it wouldn't otherwise be the natural choice. There's the sort of promise of network effects -- "well of course it's better if /everything/ is represented and executed within the DAG of DAGs, right?" -- but it ends up being the case that the inherent problems it creates plus the externalities of using airflow for the wrong use cases start to compound, especially as the org grows.

I think it slowly ends up being roughly isomorphic to the set of problems that sharing database access across service and ownership boundaries creates. My view increasingly falls into the "convince me this can't be an RPC call, please" camp, and when it really can't (for throughput reasons, for example), "ok, how about this big S3 bucket as the interface, with object notification on writes?"

I've had the pleasure of working with a world class infra team that managed our Airflow setup on Kubernetes and this made working with it a wonderful experience. At the same time, early in my career, I used Airflow with a team that didn't understand the technology (myself included) and it was a nightmare.

I'm now at a tiny company without the infra to support Airflow, so I tested all the managed Airflow providers (Composer, MWAA, and Astronomer) and ultimately settled on Astronomer. I also spent some time creating a framework that generates DAGs from configuration, and now 90% of my new ETLs just involve creating a config and a SQL query. I'm pretty happy with my current setup; my only complaint is that Astronomer won't let you exec into a task pod, which is actually kind of a big deal, but still better than having to manage a k8s cluster.
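The config-driven idea is simple to sketch. This is a plain-Python illustration of the general pattern, not their actual framework (in real Airflow the factory would build DAG and operator objects instead of task-id strings; all names here are made up):

```python
# Each ETL is a small config entry; a factory expands it into an
# ordered list of tasks following a fixed template.
ETL_CONFIGS = {
    "daily_orders":  {"source": "orders", "schedule": "@daily",
                      "sql": "SELECT * FROM orders"},
    "hourly_events": {"source": "events", "schedule": "@hourly",
                      "sql": "SELECT * FROM events"},
}

TEMPLATE = ["extract", "validate", "load", "publish"]

def build_pipeline(name, cfg):
    # Expand the template into concrete task ids: extract_orders, ...
    return [f"{step}_{cfg['source']}" for step in TEMPLATE]

pipelines = {name: build_pipeline(name, cfg)
             for name, cfg in ETL_CONFIGS.items()}
```

Adding a new ETL then means adding one config entry and a SQL file, with the template guaranteeing every pipeline has the same shape.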

airflow is one piece of software that i hate very much, especially the way my job definition is intertwined with the actual job code. if my job depends on something that conflicts with airflow's dependencies, it gets ugly.

i actually like azkaban a lot better. of course, writing a plain text job config can also be painful. i think ideally you could write the job def in python or another language, but have it translated to a plain text config so it doesn't interfere with your job code in any way.

What sort of workflows do you run in Apache Airflow? Are they automating interactions with partners/clients or internal communications? How does it get so scaled up that they (and many people in the comments here as well) have trouble managing the hardware? How does it become so complex that the workflows need to be expressed as DAGs? What's a workflow?

I don't think I've ever worked anywhere that had automated workflows, though I've only worked for small startups so far.

When it comes to scale and DS work, I'd use the open-source Ploomber (https://github.com/ploomber/ploomber). It allows an easy transition between dev and production, incrementally building the DAG so you avoid expensive compute time and costs. It's easy to maintain and integrates seamlessly with Airflow, generating the DAGs for you.

I ran Airflow at a small scale (50+ DAGs) and have used it for about 5 years. Yeah, it's painful. I consider myself an Airflow veteran after many DB upgrade failures and daily operations... but I still fail to make it work sometimes.

But at the end of the day, what is the alternative? Dagster seems to be a good one, but after I tried it, it seems just not worth it to move.

Can anyone ELI5 the value proposition of airflow?

The biggest pain point in Airflow I have experienced is the horrible and completely lacking documentation. The community support (Slack) won't (or can't) help with anything beyond basic DAG writing.

That sore point makes running and using the software needlessly frustrating, and honestly I won't ever be using it again because of it.

That’s one of the things we’re working on at Astronomer - check out the Astronomer Registry! registry.astronomer.io

Make sure to check out Ploomber; our support is seamless, with tons of docs (https://docs.ploomber.io/), and we take our users seriously. P.S. We integrate with Airflow and other orchestrators if you still need to tackle those.

I agree with this. The Slack is just the core developers discussing further development and tickets. The documentation is lacking big time. The only recourse is to raise PRs to improve the docs.

Agreed. I'd just add that the documentation has gotten a lot better in the past couple of years.

Can I ask more about the use case you couldn't find an answer for?

I am wondering why they still use the Celery executor when the Kubernetes executor is the go-to for large deployments. I used the Celery executor, had so many issues and stuck tasks in the past, and frequently had to fine-tune the Celery configuration in the Airflow config.

In practice it should be better, but it introduces its own awfulness. It requires retuning the scheduler loops, and if a scaling increase is needed, tasks run the risk of timing out.

Surely there is a really simple distributed scheduler out there. Do I need to write one? I.e., has dependencies, no database, flat files, single instance but trivial to fail over to a backup. I can even live without history or output.

What are you trying to do? Distributed scheduler with a single instance? No database? Are you sure you don't just mean "a scheduler" ala Luigi? https://github.com/spotify/luigi

And what kind of scheduler? Again, for "a single instance" it doesn't need to be distributed. For distributed operation, Nomad is as simple and generic as you can get. If you need to define a DAG, that's never going to be simple.

Thanks. Is Luigi still maintained?

Is Airflow good for an ETL pipeline? Right now a client uses Jenkins, but it's quite clunky and difficult to automate, though they've managed to. Cloud is not an option.

Airflow is generally brought in when you have a DAG of jobs with many edges, and where you might want to re-run a sub-graph, or have sub-graphs run on different cadences.

In a simplistic ETL/ELT pipeline you can model things as "Extract everything, then Load everything, then Transform everything", in which case you'll add a bunch of unnecessary complexity with Airflow.

If you're looking for a framework to make the plumbing of ELT itself easier, but don't need sub-graph dependency modeling, Meltano is a good option to consider.

Thanks. Really just looking for a new routine/on-demand scheduler to run jobs with a nice interface. There might be a dependency or two, but not a lot. Also the jobs themselves are thousands of lines of code and not going to be substantially changed.

Isn't there an Uber workflow product? One that also scales on top of Cassandra?

You're probably thinking of Temporal (https://temporal.io/), which is a fork of the Cadence project originally developed at Uber.

I think the main lesson should be not to use it, especially at scale.

we run airflow with ... considerably more dags than this. our main "lesson learned" is that airflow should not be used "at scale".
