Airflow's Problem (stkbailey.substack.com)
261 points by cloakedarbiter 56 days ago | hide | past | favorite | 122 comments

I was at Airbnb when we open-sourced Airflow; it was a great solution to the problems we had at the time. It's amazing how many more use cases people have found for it since then. At the time it was pretty focused on solving our problem of orchestrating a largely static DAG of SQL jobs. It could do other stuff even then, but that was mostly what we were using it for. Airflow has become a victim of its success as it's expanded to meet every problem that could ever be considered a data workflow. The flaws and horror stories in the post and comments here definitely resonate with me. Around the time Airflow was open-sourced I started working on a data-centric approach to workflow management called Pachyderm[0]. By data-centric I mean that it's focused on the data itself: its storage, versioning, orchestration, and lineage. This leads to a system that feels radically different from a job-focused system like Airflow. In a data-centric system your spaghetti nest of DAGs is greatly simplified, since the data itself is used to describe most of the complexity. The benefit is that data is a lot simpler to reason about: it's not a living thing that needs to run in a certain way, it just exists, and because it's versioned you have strong guarantees about how it can change.

[0] https://github.com/pachyderm/pachyderm

I want to be able to trigger datasets to be rebuilt automatically when their dependencies change, which as I understand it is a large part of Pachyderm's value proposition, but it is unclear how to integrate Pachyderm into the larger data ecosystem. My users expect data to be available through a Hive metastore or the AWS Glue Data Catalog. They expect to be able to query it with AWS Athena, Snowflake (as external tables), and other off-the-shelf tools. I need to be able to leverage Apache Iceberg (or Delta Lake, Hudi, etc.) to incrementally update datasets that are costly to rebuild from scratch. It doesn't seem that Pachyderm can do any of these things, but maybe I am just missing how it would do them? I would love to have a scheduler that is just responsible for triggering datasets to update when their dependencies change, but it seems that Pachyderm is built around a closed ecosystem, which makes it incompatible with tools outside that ecosystem.

Cool. Are there any published benchmarks on how the data versioning engine scales?

We are doing a whole bunch of performance testing on the new 2.0 engine that we released at the end of 2021. We'll be publishing those.

The end of 2021?

Yes, Airflow 2 was released last year.

To address a point the author makes: I’m entirely unconvinced that the “shift left” mentality of data democracy (aka business operators should write SQL) is actually shifting left, or a worthy path to pursue for most businesses. More recently this 2010s fad seems to be dying, and in its place we’re seeing centralized data efforts that produce data products.

One of the most significant pitfalls of data is failing to interrogate the value it provides and assuming that if you give everyone access all the time, the magic will happen. The truth is that value does not simply materialize, just as value does not magically spring from a computer by a human powering it on (okay sure, you may have already automated the value, but that’s actually the point I’m about to make). In both cases it requires an experienced practitioner who collaborates with a larger team to intersect their work with the business needs.

Data is tricky, all the more so because it’s often seen as a panacea by business leaders who aren’t connected with the work of extracting that value.

With all credit due to Google's excellent and under-appreciated paper Machine Learning: The High Interest Credit Card of Technical Debt [1], I submit that Big Data is the high interest home equity line of credit of business operations debt.

It's not that big data tools aren't useful. It's that, when you just start amassing huge piles of data without a clear up-front plan for how it will be used, and assume that a whole bunch of people who have never heard of sampling bias or multiple comparisons bias or Coase's Law [2] can figure out what to do with it later, you're setting yourself up for a Bad Time.

  1: https://research.google/pubs/pub43146/ 
  2: "If you torture the data long enough, it will confess."

I'd say that Big Data is the Collateralized Debt Obligations of business operations. It looks fabulous from afar but it can blow things up quickly if there's no understanding of the internals.

Yet, we abide by data-oriented conclusions outside of software engineering all the time, from academic papers to the FDA to crime statistics.

I won't say any of those are perfect. But there's at least a little more effort toward responsible data analysis in academia. The FDA brings an interesting example to mind. Take a look at how, on paper, drugs suddenly magically became less effective when the FDA started requiring clinical trial pre-registration in 2007.

It's also worth noting that, over the past few decades, most academic fields have been getting increasingly skeptical of the value of correlative research on pre-existing data sets. Even among people who have been extensively trained in how to do it properly. And yet, the vast majority of big data business plans I've seen in practice boil down to "collect a huge data set and then let people do correlative research on it."

Agreed, I want more scrutiny than some entity flashing “Here is the data”. It can easily be exploited behind the veneer of data-based-credibility.

> I submit that Big Data is the high interest home equity line of credit of business operations debt.

I like this but it's kinda like the payday loan of business operations.

>that if you give everyone access all the time the magic will happen

There's much ongoing discussion about this in the data world, often revolving around "self-service analytics".

Unless you're talking about "our analysts don't have to clean data all the time", which, for a large enough organization makes sense, "self-service" for non-technical folks is futile and pointless. They need specific answers to specific questions, not the ability to infinitely explore the data. Organizations should desire that kind of focus, not prevent it.

The idea was that they were going to hire an army of data scientists and become Google... magically.

Reality smacked that shit down hard. I left data engineering because the projects were all over the place, wildly undisciplined and unfocused.

You were lucky to have source control, let alone an understanding from the business that these projects were in fact software development.

I switched back to software engineering because at least there is a faint realization that we are...building software.

I might go back when the dust clears.

"Why do we need to hire programmers...I thought we needed data engineers?"

"Because the data pipelines are all built with thousands of lines of code. Java, python, Fortran, you name it...and your job post only mentioned SQL and data modelling"

I could go on forever.

This is the constant argument I have with people about data products.

You don't need to expose more dimensions or get the users more access to the raw data. You need to understand what their business is and what their business problems are and help them answer those specific questions quickly and succinctly.

Yes, there are certainly times when people use huge amounts of raw data to uncover the answer to a question they didn't know they had. But it's rare, it's expensive to support, and most businesses aren't going to be able to do anything with it anyway (a whole org built to do X isn't suddenly going to shift to do Y because you discovered some insight in a random report).

I've seen data errors because of joins and aggregations. Data democratization can be a net negative, especially if people don't question the graphs they see.

Do you know what the author means by 'left' here? Probably not moving bits around in a way that's equivalent to multiplying by powers of 2?

Dismissing Airflow for not being Astronomer is like dismissing Linux for not having the capabilities of a large-scale hypervisor.

Replace “Airflow” with “Linux,” “data engineers” with “systems programmers,” and “Astronomer” with your hypervisor of choice (Xen/VMWare/etc.), and you can see how absurd the author’s point is:

   My problem is that ~Airflow~ Linux was not designed to address [high-level systems architecture] problems. We don’t need a better [Linux], but we need a higher-level one: a system that enables ~data engineers~ systems programmers to think at a platform level.

   In fact, [Linux] is already displaced. [Linux] qua [Linux] is already obsolete, and it happened right within the [Linux] ecosystem. It’s called ~Astronomer~ Xen/VMWare/etc.

  If it sounds like you could simply replace [Linux] with basically any other ~job execution engine~ operating system, that’s because you could.

This is where the argument falls apart. Yes, for very large, complex deployments, higher-level orchestration is important, but the choice of low-level execution engine is also still hugely relevant, just as the choice of guest OS is still hugely relevant when discussing large deployments of VMs.

Furthermore, very few people actually need very large scale deployments; user experience and capabilities at the low-level are what most users actually care about.

Managed Airflow doesn't even solve any of the author's outlined frustrations. It keeps the "obscene" syntax, it's still stateless, it's not "decentralized" etc.

Honestly, the article is so disingenuous that it comes off like a paid-for puff piece for Astronomer. It's the article-equivalent of the late-night infomercial guy who rips open a bag of potato chips like the hulk because he doesn't have this special tool that's just four easy payments of $9.99.

FYI, the infomercials with the strange tools fixing strange problems are usually aimed at old or disabled people. Opening a bag of chips with a ridiculous tool sounds stupid, but it might help a stroke survivor or someone with one arm. The sellers don't want to show those struggles on screen to avoid humiliating people, so you see perfectly healthy-looking young people spilling things as if they had some neurodegenerative disorder or something. Because the target audience might.

Not saying infomercial people are angels, of course, but I wanted to share this somewhat non-obvious context.

(To stretch the metaphor, an Airflow management system that gives everyone their own Airflow might be ridiculous but makes sense for companies where cooperation is difficult :))

Interesting, Astronomer was actually my last choice for orchestrator. We went with Dagster, but I didn't want to make the takeaway "Dagster solves these problems", because it doesn't directly. Astronomer was just the best foil for the "meta-orchestrator" space that seems to be evolving, and which _can_ address these problems.

The new TaskFlow API has been part of Airflow 2.0 since its release in 2020: https://airflow.apache.org/docs/apache-airflow/stable/tutori...

Agreed. The author is blaming Airflow for what are ultimately poor architecture decisions.

I will admit it's not easy to figure out best practices with Airflow, but if you make bad decisions and your system doesn't scale with the problem, you didn't understand the problem or how to solve it in the first place. The tools you chose are second to that.

You may not know very precisely the time constants you are dealing with in your problem until you give it a shot.

Honestly, we have to set up Airflow at my job for some log collection and processing. Which is fine, only I'm pretty sure we had exactly the same issue at my old job and fixed it in half a day, including testing and deployment, with a Perl script. And I think in this particular instance (GitLab logs) it was handled with 90% Awk. Meanwhile my coworkers still have issues after almost a week (not all of this is on Airflow, but still).

I'm not saying Airflow is bad (we did set up a lot of Hadoop clusters and other Apache products at my old job, and our clients used Airflow a lot), but I think the evangelists are so good that they push Airflow for everything, and this is bad. OP did use Airflow for something it was not really designed for, and it sucked, but I do have the impression that tech writers and Apache evangelists deserve some of the blame.

Author here - appreciate the comments and reads. To add a bit of color -- I spent about a month looking into orchestrators to migrate Whatnot's data platform onto earlier this year, and it was a miserable experience. We were on AWS Managed Airflow, but to stay on it and have a solid platform, I would have been writing Github Actions for CI/CD, standing up ECR and IAM roles with Terraform, setting up EKS to run Kubernetes jobs, managing infra monitoring with Datadog, etc., etc.

In fact, I did end up doing all those things, but we opted for Dagster Cloud, because of their focus on improving developer efficiency. Their team provided pre-built Github actions for CI/CD and recently introduced PR-specific branch deployments, which has been amazing. They're moving towards serverless execution, built-in ECR repositories, managed secrets. Prefect and Astronomer I expect are moving in this direction, too, but I liked the Dagster project's energy quite a bit.

As I've waded into the MLOps world as well, it just keeps looking like every platform basically devolves into: an orchestrator that provisions compute resources and logs metadata into an opinionated data model. Catalog tools like Atlan are metadata sinks that are trying to build out orchestration/workflow capabilities. dbt Cloud of course is just an orchestrator for a specific type of data product that is aiming to operationalize metadata with its metrics layer.

Orchestration + a metadata data model is a common denominator here, and I think the fact that Airflow is so inevitable has made it really hard for people to imagine the category as anything other than a scheduler, but perhaps some of these new companies can break new ground.

Thanks for the experience report - I have Dagster and Prefect on my shortlist to evaluate next time I need to build this, and Dagster seems the most promising, so it’s good to get another datapoint.

One Q - it seems to me that another possible solve (and probably how the big guys tend to do it) is to use a dataflow engine like Spark/Flink. Did you compare a managed platform like Google Dataproc? They also have serverless if you don’t want a heavy managed cluster, which might make this approach more viable for non-huge companies that wouldn’t utilize a min-spec cluster. (When I last evaluated this they didn’t have serverless which was a dealbreaker for my small scale).

We didn't look into a dataflow engine specifically, in part because we have a heterogeneous set of workloads. Our core use case is loading mission-critical data in chunks, but it is also coordinating SaaS tools and managed services like Sagemaker. So being able to "just run this arbitrary code" reliably and scalably is an important role in our case, not just the dataflow part of things.

>We were on AWS Managed Airflow, but to stay on it and have a solid platform, I would have been writing Github Actions for CI/CD, standing up ECR and IAM roles with Terraform, setting up EKS to run Kubernetes jobs, managing infra monitoring with Datadog, etc., etc.

This sounds like an issue not with Airflow but with integration.

Yep, that's what I tried to point out in the article.

“We were on AWS Managed Airflow, but to stay on it and have a solid platform, I would have been writing Github Actions for CI/CD, standing up ECR and IAM roles with Terraform, setting up EKS to run Kubernetes jobs, managing infra monitoring with Datadog, etc., etc.”

DAGs can be published to S3 for cutting down on like half of these dependencies. And the nice thing about MWAA is log & stats publishing over cloudwatch, which should flow into any existing amazon integrated tooling.

For our team setting up terraform for iam & mwaa, some deploy pipelines to s3, and connecting some config bits to wire up splunk logs / monitoring pieces was not that much work. Initiating a separated vendor relationship & pricing out data ingress/egress costs would blow that work out of the water but maybe it’s a difference in company size/placement.

In your investigation did you try out Flyte at all?

Nope -- just MWAA, Astronomer, Dagster Cloud, and Prefect Cloud. In the past I used Argo Workflows pretty extensively and have talked about its pros and cons here: https://www.youtube.com/watch?v=-cyr_kL-9fc

I despise Airflow and how cemented it is as data infrastructure. It's such a useful and basic concept but a nightmare to manage, and it works like junk. It's taken me 3 separate jobs over 7 years to realize that it's probably not our fault. Everyone seems to struggle with the same things: a flaky scheduler that is slow to run tasks; confusing and redundant-sounding settings that apply at up to three different levels (environment, job, task). It invites less experienced users to write a sea of spaghetti code in a monolithic DAGs repo. People wind up doing heavy data munging in Python operators, which clobbers scalability and reliability. It also can't handle a large number of parallel tasks or frequent runs. It seems to have miserable scalability for the resources given, and bad controls for autoscaling. The UI feels dated and unintuitive. XComs seem useful to everyone but work like crap and are actually an anti-pattern.

I've also tried it on Cloud Composer (google managed) and automated upgrades always trashed the cluster. It's not well designed for GKE because it writes logs to files and requires stateful sets. Testing the code is a huge burden due to the vast environment and dependencies needed to make it work locally.

I'm eager to rid my life of it and test out temporal for some of the high concurrency/frequency cases we have.

The idea behind Airflow is great. What sucks is people using it to do heavy processing. Maybe with serverless/k8s Airflow could fan out the processing to a cluster to allow for flexibility. But then, I guess you end up rewriting Spark et al.

Try gcsfuse to write logs to a bucket.

It's the one thing I like about our airflow. Everything else you said is echoed.

Also, the toil of dealing with many airflow instances when you have engineers who don't want to automate it.

Not experienced here, but out of genuine interest: can you tell me what problems Airflow solves that can't be handled by Celery and RabbitMQ?

I have not used Celery + RabbitMQ, but I assume that combo is like Sidekiq + Redis, or any other job queue + worker system.

Airflow packages those things together and adds some additional features:

- UI with graph, Gantt, log, and other views of the workflow
- Users and permissions
- Places to store config
- Mechanisms for passing small data between tasks
- Various "sensors" for triggering workflows
- Various operators that interact with common data-oriented systems (BigQuery, Snowflake, S3, you name it); these are basically libraries that expose a config-forward API

Probably the main selling point is the pre-made operators, but in short it is a complete solution with bells and whistles that aligns itself with the data ecosystem.

An analogy is "can you tell what problems Django solves that can't be handled by wsgi and psycopg?" Nothing fundamentally different, but life is a whole lot easier with Django. Honestly if you're doing data engineering and you haven't spent time with a good DAG runner, you're doing yourself a real disservice.

My sibling comment did a good job explaining, but the UI + configurable storage + configurable triggers all out of the box make life a lot easier.

Django is easier when you want to do things only the "Django" way. However once you need something done differently it quickly shows its truly rigid and brittle self, and you'll find yourself fighting a great and challenging battle.

Perhaps unwittingly, you've hit upon people's exact frustrations with Airflow! :)

Expressing the problem you are trying to solve as a DAG is idiomatic in Airflow, but expressing your problem in terms of queue processing is idiomatic in Celery.

    a b c d vs. a (bc) d
They make different design decisions about what to surface via UX and what to make easy as a consequence of thinking of the problems in terms of different data structures.
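The structural difference can be sketched in a few lines of plain Python (illustrative only, not Airflow or Celery code): a queue is a list you consume in order, while a DAG's execution order is derived from its dependency edges.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Queue idiom: a linear chain, a -> b -> c -> d.
queue = ["a", "b", "c", "d"]

# DAG idiom: b and c both depend on a and could run in parallel;
# d joins them. Keys map a node to its set of predecessors.
dag = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}

# A valid execution order falls out of the dependency structure.
order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['a', 'b', 'c', 'd']; b and c may swap
```

In the queue idiom the ordering is the data structure; in the DAG idiom the ordering is computed from declared dependencies, which is what lets a DAG runner parallelize b and c for free.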

Airflow with a celery backend is a pretty sweet combination. In that instance, airflow just gives you a nice scheduler to manage all the celery jobs.

We tried to set up Airflow in our team in the past. The big problem we encountered is that its unit of management (I believe it's called a "job" but I'm rusty on this) is too low level. Our pipeline processes a lot of data and we have millions of jobs per day. Once Airflow has a (planned or unplanned) outage, 10s of thousands of jobs start piling up, and it never recovers from that.

In the end we replaced our data orchestration with a stateless lambda that, for a configured time interval, 1/ looks at what output data is missing, 2/ cross-references that with running jobs (in AWS Batch), and 3/ submits jobs for missing data that has no job. The jobs themselves are essentially stateless. They are never restarted and we don't even look at their status. If one fails we notice, because there will be a hole in the output, and we therefore submit a new one. Some safety precautions are added to prevent a job from repeatedly failing, but that's the exception.
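The reconciliation loop described here can be sketched in plain Python. All the names below (`expected_partitions`, `existing_partitions`, `running_partitions`, `submit_job`) are hypothetical stand-ins for the real S3 listings and AWS Batch calls:

```python
def reconcile(expected_partitions, existing_partitions,
              running_partitions, submit_job):
    """Submit a job for every expected output that neither exists
    nor already has a job in flight. The function is stateless, so
    re-running it after a crash needs no recovery logic."""
    missing = expected_partitions - existing_partitions
    to_submit = missing - running_partitions
    for partition in sorted(to_submit):
        submit_job(partition)
    return to_submit

# Example: hours 00-03 expected, 00-01 already built, 02 in flight.
submitted = []
reconcile(
    expected_partitions={"h00", "h01", "h02", "h03"},
    existing_partitions={"h00", "h01"},
    running_partitions={"h02"},
    submit_job=submitted.append,
)
print(submitted)  # ['h03']
```

A failed job simply leaves its "hole" in the output, so the next pass resubmits it; retry throttling would be layered on top.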

Maybe Airflow has moved on from when we last tried it. But this was our experience.

> The big problem we encountered is that its unit of management (I believe it's called a "job" but I'm rusty on this) is too low level. Our pipeline processes a lot of data and we have millions of jobs per day. Once Airflow has a (planned or unplanned) outage, 10s of thousands of jobs start piling up, and it never recovers from that.

That sounds more like an architecture-at-scale problem than something that is Airflow's 'fault.' Airflow may never have been the right tool for the job but it's getting all the blame.

What do you do for repeated failures? Does it get flagged for a manual debug or does it kick into a different mode of automation?

We notice repeated failures because we have metrics on our "up to dateness", and those metrics will stall. We also send logs to CloudWatch logs and alarm on certain threshold of errors. Once an alarm fires, we investigate manually and see why the job is failing. This happens occasionally but not too much. While we are investigating, we are spinning up repeat jobs with some frequency, but this hasn't proved to be a problem.

The post feels like a bait-and-switch in the sense that it presents itself as about Airflow's shortcomings but focuses mostly on problems that Airflow doesn't attempt to solve.

Airflow can certainly be frustrating and it doesn't solve _all_ workflow orchestration problems. Surely the same thing can be said of many tools? This seems mostly like a mismatch of expectations.

FWIW the author is pretty direct about this. After the cute beginning, he basically says that his problem is with Airflow's scope, not its execution.

Hence the bait and switch. IMHO increasing the scope of Airflow (or any tool) is a challenging proposition. Would you rather use a few mega-tools with very broad scope (and a potentially more challenging domain to navigate) or many more specialised tools that interoperate well together?

Obviously there are trade-offs with either approach, but then I'd argue that making Airflow solve more problems will introduce more trade-offs too.

Having a poor scope is a problem because people will just choose not to use you.

When we engage in complex work, it is important to keep our options open so we can direct our best effort to the hardest problems.

It is rarely clear what the hard problems will be when new to a domain. Only as scale kicks in.

We are constantly pitched frameworks that sell themselves as a good approach to a domain, but then obstruct engagement with the hardest problems when it matters. The developer becomes captive of the system that claimed it would steer them right.

This is particularly true of fields where the hard problems are integration problems which, by their nature, cannot be outsourced to frameworks.

100% agreed. By analogy, an article titled “pthreads’ problem” should be about shortcomings in the POSIX multithreading model, not an article saying that the implementation of machine-level parallelism is irrelevant because Kubernetes exists.

A few years ago a new guy at our DWH team tried to sell Airflow to the rest of the team. They invited me to listen to his talk as well, and I was baffled why something so trivial as Airflow was being sold as a critically important piece of infrastructure.

Why would I need a glorified server-side crontab if something like MS DTS from 1998 could do the same, but better? Sure, Python is probably better than whatever DTS generated, but the ops don't care either way, since Airflow doesn't care what it's running.

Something as simple as "job A must run after job B and job C, but if it doesn't start by 2am, wake up team X. If it doesn't finish by 4am, wake up team Y" isn't Airflow's problem, it's your problem.

"What's the overall trend for job D's finish time, what is the main reason for that?" isn't Airflow's problem, it's your problem. "What jobs are on the critical path for job E?" isn't Airflow's problem, it's your problem.

"Job F failed for date T and then recursively restart everything that uses its results for date T" isn't Airflow's problem, it's your problem.

>and I was baffled why something so trivial as Airflow was being sold as a critically important piece of infrastructure.


>Something as simple as "job A must run after job B and job C, but if it doesn't start by 2am, wake up team X. If it doesn't finish by 4am, wake up team Y" isn't Airflow's problem, it's your problem.

I guess that's one approach to job security. And why not make data egress manual too? Why transfer data through the network, when you can print them, mail the papers, and type them back in? Data input is not the computer's problem, it's your problem!

>Something as simple as "job A must run after job B and job C, but if it doesn't start by 2am, wake up team X. If it doesn't finish by 4am, wake up team Y" isn't Airflow's problem, it's your problem. "What's the overall trend for job D's finish time, what is the main reason for that?" isn't Airflow's problem, it's your problem. "What jobs are on the critical path for job E?" isn't Airflow's problem, it's your problem. "Job F failed for date T and then recursively restart everything that uses its results for date T" isn't Airflow's problem, it's your problem.

The whole idea of writing programs is making things automatable. That is, making them the computer's problem, not our problem. We get the higher level problem of writing the automation once, and fixing any bugs in our code, then we get to enjoy putting it to work for us...

I think you've misunderstood my point. I don't want all these problems to be mine, I want whatever job orchestrator I choose to solve them. Airflow very explicitly doesn't try to solve them, doesn't even try to solve them badly, it just runs jobs when scheduled.

Well, Airflow is not that great, but it does try to solve these problems: it has retries, it shows which jobs are on the critical path for another job and how they're doing, and so on...
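For what it's worth, Airflow 2.x does expose hooks for the "wake up team X" cases via per-task retries, SLAs, and callbacks. A sketch (the `notify_*` callbacks and the `run_job_a.sh` command are hypothetical stubs, not part of Airflow):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

def notify_team_y(context):
    ...  # page team Y once retries are exhausted

def notify_team_x(*args, **kwargs):
    ...  # page team X when the SLA is missed

with DAG(
    dag_id="nightly_load",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 1 * * *",       # run daily at 01:00
    sla_miss_callback=notify_team_x,     # fires when any task misses its SLA
) as dag:
    job_a = BashOperator(
        task_id="job_a",
        bash_command="run_job_a.sh",
        retries=2,
        retry_delay=timedelta(minutes=5),
        sla=timedelta(hours=3),             # must finish within 3h of the run
        on_failure_callback=notify_team_y,  # fires after the final retry fails
    )
```

This covers alerting on lateness and failure; the "critical path for job E" and finish-time-trend questions still mostly require the UI's Gantt/landing-time views or external metrics.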

> isn't Airflow's problem, it's your problem

This is a baffling statement.

It is, that's why I was so baffled by the feature set of Airflow. It doesn't even try to take a stab at these problems.

Oh! I took that to mean that you thought Airflow shouldn't do it, not that Airflow thought Airflow shouldn't do it.

> Shift 1: “We know the lineage” to “We know what in god’s name is happening”

Bro, I can't even get my company to the _first_ part, and we're collectively already having issues with the second? What is everyone else's read on this situation in general? Do you all have row- and table-level lineage for your data? For pipelines that people are actively using? Every company I've ever been in can hardly figure out where finance gets last year's "magical excel sheet", let alone be close to a spot where they're actively using data lineage tools.

I also don't like Airflow, but for somewhat different reasons.

I think it couples orchestration and transformation too tightly, I don't understand the desire to integrate everything with your actual runtime Python code - I think it's markedly the wrong level of abstraction/integration and limits your engineering capacity. There's undoubtedly some good engineering, it's come a long way, and it's mighty popular, but every time I look at a repo that uses it, the only read I get is "cross-cutting-chaos".

In some fields it's more important than in others.

In life sciences research to support synthetic control arms, the FDA cares more about the lineage/manipulation of the data than about the data science models used to predict X/Y/Z.

I.e., what was the data originally, what did it end up as prior to ingestion into AI/ML, why was it changed, what steps were involved, etc.

There are not a ton of good out-of-the-box solutions for data lineage, and it's driving me nuts.

We have Apache NiFi, which promises data lineage out of the box and _appears_ to deliver. I've never implemented it, though.

We have Pachyderm, which has some support here, but I don't know much about it.

Besides that it appears roll-your-own.

I kind of wish there were an accepted best practice for data lineage, but it's, surprisingly, the wild west. And it's completely, 100% required for industry use.

dbt does pretty well?

> magical excel sheet

I honestly have no idea how SaaS billing isn't so buggy that customers leave. Those data pipelines can be pretty complicated, with lots of nuances around the data and hand-wavy consequences for getting it wrong.

More recent tools such as Dagster and Prefect have much more to offer. One simple example is communication between tasks. Airflow has a clunky system for that called XCom. The actual author of XCom says you probably should not use it due to the level of hackery it has under the hood [1]:

On Dagster and Prefect you communicate between tasks as if you were writing pure Python. On Airflow on the other hand ...

[1] https://www.youtube.com/watch?v=TlawR_gi8-Y&t=740s

I'm not sure if linking to a talk from 2018 in 2022 for a project that is being actively worked on (a bunch of Python abstractions for xcom were added in 2.0) is fair.

Yes, curious if the TaskFlow API introduced as part of Airflow 2.0 reduces this pain. It appears much easier/saner than working with XComs directly: less coupling, and it removes the need for lots of boilerplate code.
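For reference, a minimal TaskFlow-style DAG looks roughly like this (Airflow 2.x API; the task bodies and names are made up for illustration). Return values are passed between tasks as ordinary Python values, with Airflow moving them through XCom behind the scenes:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def order_etl():
    @task
    def extract() -> dict:
        return {"order_id": 42, "amount": 19.99}

    @task
    def transform(order: dict) -> dict:
        # Looks like a plain function call chain; Airflow serializes
        # the dict through XCom between the two tasks.
        return {**order, "amount_cents": int(order["amount"] * 100)}

    @task
    def load(order: dict) -> None:
        print(f"loading order {order['order_id']}")

    load(transform(extract()))

order_etl_dag = order_etl()
```

Compare that to the classic style, where each downstream task has to call `ti.xcom_pull(...)` explicitly and wire up `>>` dependencies by hand.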

There's also Flyte, which is natively data aware and schedules tasks around data dependencies. The syntax is essentially pure python too.

I'm kinda meh on this article, but it did lead me to this goldmine[1]. We don't use XCOMs at all so a lot of these aren't applicable but other parts absolutely are. We run Airflow at a pretty massive scale and not all of these boundaries were enforced so now it's a huge mess.

[1] https://towardsdatascience.com/apache-airflow-in-2022-10-rul...

I have my own opinion on Airflow's pain points and created Typhoon Orchestrator (https://github.com/typhoon-data-org/typhoon-orchestrator) to solve them. It doesn't have many stars yet but I've used it to create some pipelines for medium sized companies in a few days, and they've been running for over a year without issues.

In particular I transpile to Airflow code (can also deploy to Lambda) because I think it's still the most robust and well supported "runtime", I just don't think the developer experience is that good.

I really agree with Shift 2 (“We unblock analysts” to “We enable everyone”). The problem is that Airflow (and most other OSS orchestrators) are overkill for the majority of data practitioners. They lock workflow development into Python, forcing you to mix platform logic with executional business logic. The complexity to get started building workflows is too high, infrastructure challenges always crop up, and the system itself is a black box for anyone non-technical.

> The tool data engineers need to be effective in this new world does not run scripts, it organizes systems.

100%. You'll still need to run independent scripts, but today's data challenges focus on "how do I connect the stages of data operations together". Teams need to figure out how to connect data ingestion -> data transformation -> data visualization -> alerting and reporting -> ML model deployment -> metadata + catalogs -> data augmentation -> API actions.

The larger goal of orchestration is to prevent downstream processes from running if the data being processed upstream fails. Each stage could be performed with a series of scripts, a SaaS tool, or a mix. Each team is responsible for their own stages, but they need to know how their work connects to the larger picture so when something goes wrong, there's ownership and clarity that drives a quick resolution. Unfortunately, this still doesn't exist in most organizations because the current tooling isn't solving the orchestration and visualization of connected systems super effectively. It's instead enabling one-off, disconnected data processes.
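That upstream-failure rule is simple to state; here's a toy sketch of it (stage names and structure are illustrative, not any particular tool's API):

```python
def run_pipeline(stages, deps):
    """Run stages in the given (topologically sorted) order,
    skipping anything downstream of a failure."""
    done, failed = set(), set()
    for name, fn in stages:
        if any(d not in done for d in deps.get(name, [])):
            failed.add(name)  # an upstream stage failed or was skipped
            continue
        try:
            fn()
            done.add(name)
        except Exception:
            failed.add(name)
    return done, failed

# e.g. ingestion -> transformation -> reporting
stages = [
    ("ingest", lambda: None),
    ("transform", lambda: 1 / 0),  # simulated failure
    ("report", lambda: None),
]
deps = {"transform": ["ingest"], "report": ["transform"]}
done, failed = run_pipeline(stages, deps)
# "report" never runs because "transform" failed
```

The hard part in practice isn't this loop, it's getting ownership and visibility across the team boundaries the stages cross.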

Disclaimer: I built Shipyard (www.shipyardapp.com) to address many of these concerns of simplifying the ability to connect data tools and quickly automate and action on data.

I don’t understand the desire to describe DAGs in Python… it’s a fine scripting language but pretty horrible for this declarative description stuff.

My current company has been having a lot of success with Dagster. It seems to give a lot more flexibility than Airflow in terms of defining the pipeline and where to run it. It's also a bit friendlier when things fail and need to be backfilled or retried, IMO. Airflow feels like it's somewhat legacy at this point in time. It served a need well, but the needs have changed now.

The problem I've had with Airflow is that it tries to do way too much: UI/logging/config management

I've really enjoyed using taskflow (https://github.com/taskflow/taskflow); it allows us to employ our existing logging and deployment paradigms.

Snowflake is the future. We shouldn't be writing code at all; we should be throwing data into a big hole in the ground and then querying the hole, or attaching a query to the hole so that as data goes in, the query does something with it, and chaining those queries. The fact that anyone is writing anything more complex than SQL to do this is a failure of imagination. Snowflake is intended to remove all the unnecessary engineering so you can just put data somewhere and do something useful with it easily.

Fascinating article, and it goes to the heart of my being as a DE. I came to the same conclusion a while back and wrote about the trends in data orchestration towards declarative pipelines, which more modern tools, as you mentioned, support, but I'm afraid not Airflow.

I'd say data consumers, such as data analysts and business users, care primarily about the production of data assets. On the other hand, data engineers with Airflow focus on modeling the dependencies between tasks (instead of data assets). How can we reconcile both worlds?

In my latest article, I review Airflow, Prefect, and Dagster and discuss how data orchestration tools introduce data assets as first-class objects. I also cover why a declarative approach with higher-level abstractions helps with faster developer cycles, stability, and a better understanding of what’s going on pre-runtime. I explore five different abstractions (jobs, tasks, resources, triggers, and data products) and see if it all helps to build a Data Mesh. If that sounds interesting, make sure to check out https://airbyte.com/blog/data-orchestration-trends.

I've recently had the sales team from Magniv.io pestering me to try it as an alternative to Airflow with a "shift left" perspective for automating jobs. I wasn't convinced enough by the value prop to dive deeper. I think it was a language problem - I was just having trouble understanding and relating to the problems they solve, and then figuring out whether or not I have those problems too.

I'm running into the same issue with this guy's post, although a little less so. The question he seems to ask is "With a complex pattern of data flows, if something breaks, how do you recover?" His argument is that Airflow does not offer enough visibility into the full data trace nor enough tools to apply recovery rules for repairing broken bits.

I think I agree, but Prometheus doesn't really solve that. Nor, necessarily, does better management of automated job queue backlogs and job retries.

He also complains about some syntax and design choices that predate MyPy and Pydantic and modern Async Python coding. Those seem fairly easy things to drag Airflow forward with in future releases.

I need to deal with a system with complex dependency relationships. The system needs to stop executing when some step fails. I used to look at Airflow; I remember at the time it was the first result if you Googled "DAG something". But it has the same problem as other data-engineer-centric systems: it is somewhat over-engineered and relies too much on a central server. Its heavy reliance on a Python runtime is also something I don't like; it feels like everything else is a second-class citizen. The nature of our job is that we have to deal with complex legacy code: some Fortran programs, some things that need a conda environment, etc. Later I found that what I should look at is just a makefile-like system, and I settled on Snakemake, which has a nice DSL that forces you to be explicit about inputs, outputs, etc. Probably Airflow is just not the right tool for a one-man team.

Airflow is super clunky, but it gets the job done (mostly).

I'm kind of a fan of Prefect as an alternative: https://docs-v1.prefect.io/core/about_prefect/why-not-airflo...

Recent perspectives from the creators of Prefect, Dagster, Flyte, and Orchest => https://gradientflow.com/summer-of-orchestation/

Seems to be missing Temporal/Cadence, which I'm very excited about, but I've never heard of Flyte or Orchest.

Having been forced to work with an obsolete version of Airflow at work, I can attest to how narrow-minded the project's focus was when it was originally created. The scheduling quirks and UTC defaults are enough to paint the picture here.

Not completely sure if most of the issues I've faced were resolved in the future releases, but I don't fully agree with the take of the article. Like go with the scheduler that works for your current and potential future needs. The reason why we continue to use Airflow despite the issues is because it works so well with our workflows. This does mean that I would recommend it to another team.

Just put everything in a warehouse (via Fivetran or some such thing) and just use dbt.

Use airflow as cron runner for dbt.

If you don't need realtime metrics, this formula works way better than convoluted airflow dags.

We have had a great experience scheduling Meltano/dbt inside Orchest for our Metabase dashboards. As a pattern, combining these declarative/configuration CLI tools with a flexible orchestration layer (Orchest can run any containerized task, it will containerize transparently for you) really shines.

If you need cron, use cron or a cronjob pod. Airflow is a poor cron scheduler at any sort of scale.

Yea, for sure. Not sure what a good cron runner with a web UI and logs is. Jenkins?

Airflow helped my team out a lot a couple years ago, mainly for the simplicity of the top-down, UI-based view of a complicated ETL AND the ability to retry parts of the ETL.

We had lots of lessons learned. For instance, why does PythonOperator even exist? It takes a callable, and thus you're not likely to see good coding patterns emerge for something that needs to be 1000+ LoC. Instead, we just subclassed BaseOperator and used tried-and-true OO principles.

> If it sounds like you could simply replace Airflow with basically any other job execution engine, that’s because you could.

Has anyone tried Luigi for data engineering pipelines?

I've been using Airflow for about 2 years now in production. It's been mostly good - the few times things go wrong, it's a huge pain in the ass to figure out why... but it's significantly better than just straight cron on Linux. Airflow 2 has fixed a lot of the speed and catchup issues from Airflow 1.x.

I don't have time to investigate other solutions like dagster and prefect and migrate jobs to it for testing.

If it is just you, you are fine, and I'm not sure other tools would have that much benefit.

Trouble with Airflow starts when multiple teams and user types start to share it.

I've definitely noticed more issues after adding users, but it's more that they don't actually understand a lot of what they're trying to do and cause problems when writing dags.

Yeah, Airflow isn't multi-tenant.

People can potentially overwrite each other's DAGs. Credential management is complicated. Broken DAG can stop whole Airflow. Slow DAG can impact performance of whole Airflow. Getting DAGs to wait for each other (like one team prepares data up to a point and then other team builds on that) is kind of a nightmare. Sometimes people want features from newer Airflow, but some other team built DAG that isn't forward compatible. Etc etc.

But I'm not sure there actually is a better solution elsewhere. At least I have not seen it yet, maybe Dagster is on a good road.

But as I said, for centralized solutions it works really well.

Some of these were the core problems that we wanted to address as part of https://flyte.org. We started with a team-first and multi-tenant approach at the core. For example, each team can have separate IAM roles, secrets are restricted to teams, tasks and workflows are shareable across teams without making libraries, and it is possible to trigger workflows across teams. Each team's workflows and tasks are grouped using a construct called projects. It is even possible to separate execution clusters per team, or per workflow, onto separate k8s clusters. Also, the platform is built to be managed and easily deployed.

Wouldn't it make sense to decouple the orchestration layer from the authoring layer for the DAGs? That way you could solve the authoring problems separately from the lower-level orchestration problems. We're trying this over at ZenML (https://zenml.io) but have yet to get feedback.

This is why I thought the shift from "orchestrate jobs" to "keep track of the state of assets" that Dagster is trying to make is pretty important. But it seems it might not be enough - it still keeps a clunky (pythonic) interface, and I don't know how well it does multi-tenancy.

Could you clarify what "keep track of state of assets" means?


Of course, nothing stops Airflow or other tools from thinking this way as well.

It's a mindset shift to a more declarative model. The idea has also popped up in other niche orchestrators.

This is an oversimplification, but IMO the easiest way of picturing it is this: instead of defining your graph as a forward-moving thing, with the orchestrator telling tasks when they can run, you define graph nodes that know their own dependencies and let the orchestrator know when they're runnable.
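A toy illustration of that inversion (the node names are invented): the orchestrator doesn't push a schedule forward, it just asks which nodes' declared dependencies are already satisfied.

```python
def runnable(deps, materialized):
    """Nodes whose declared dependencies are all materialized."""
    return {node for node, needs in deps.items()
            if node not in materialized and needs <= materialized}

# each node declares what it depends on
deps = {"raw": set(), "clean": {"raw"}, "report": {"clean"}}

runnable(deps, set())    # -> {"raw"}
runnable(deps, {"raw"})  # -> {"clean"}
```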

Is there any love for the Argo [1] project suite (Workflows, Events, CD) for this type of use case? I haven’t tried it out myself yet however it does look interesting.

[1] https://argoproj.github.io

I've never used their workflows thing, but having been forced to live with ArgoCD it sounds horrifying.

Argo is another over-engineered "CNCF" thing trying to ride the Kubernetes hype train. It's all "eventually consistent", which makes it extraordinarily difficult to see when any particular thing actually happened. Is my code deployed? Who knows, Argo is "syncing".

Check out these great docs: https://argoproj.github.io/argo-workflows/rest-api/

> API reference docs :

> Latest docs (maybe incorrect)

> Interactively in the Argo Server UI.<https://localhost:2746/apidocs> (>= v2.10)

Yes, that is a localhost URL on their website.

How do you know if anything is deployed if it hasn't come back and confirmed it's deployed? Manual only?

rsync returns status code 0.


Well, no-one's going to accuse that of being overengineered :D

Argo is pretty amazing if you want to take advantage of the work Kubernetes has done to scale resources efficiently across a cluster of compute nodes.

If you’re looking for something that’s a bit more high level and friendly to expose directly to your data team (data scientists/data engineers/data analysts) you can check out https://github.com/orchest/orchest

You can think of it as a browser UI/workbench for Argo scheduled pipelines. Disclaimer: author of the project

The problem with airflow is its scalability (or lack thereof) and DAG dependency management.

Airflow successors must figure out how to distribute the cron and all dependencies should be self contained in a Docker image.

Like a lot of software, it's matured and along the way has put on some weight. I still love it for certain use cases, but tools like Dagster are piquing my interest.

I don't really agree with decentralized ETL. You can't imagine the messes people proudly wrote. And yeah, I have been on the other side of the trench too.

I remember having this feeling a few years ago. What I realized is that airflow has taught us a few bad habits and also brought ahead an interesting paradigm of the vertical workflow engine.

I agree Airflow is old and legacy, and ideally folks should not use it; the reality is there are a lot of pipelines already built with it, sadly. I think as a community we have to start moving away from it for more complicated problems.

Disclaimer: I created Flyte.org and heavily believe in decentralized development of DAGs and centralized management of infrastructure

All the issues described in this post led me to create Kestra [0]. Airflow was a true revolution when it was open-sourced, and we owe it thanks for its innovation. But I totally agree that a large static DAG is not appropriate in today's data world, with data mesh and domain responsibility.

[0] https://github.com/kestra-io/kestra


ETL just seems like one of those perennial challenges that resists humanity's efforts to categorize the world into neat and tidy boxes.

I've had good success with https://dagster.io, which is much more opinionated about your pipelines, including properly typing inputs and outputs.


Is an Apache open source project considered a hipster-corp?

So to confirm, you are annoyed that the name of this project is a word.

Definitely. I'd like the name of this project to be changed to be a name, not a common word. Thanks.

I'd suggest to raise the issue in the devlist or on GitHub Issues to get more visibility.

I'd be okay with the article headline on HN being "Apache Airflow's Problem" so that I know it's about a piece of apache software and not something interesting about airflow.

The Apache Software Foundation would also prefer this, per their trademark policy: https://www.apache.org/foundation/marks/faq/#guide

I think context matters and the title "Airflow's Problem" doesn't make much sense when talking about the physical phenomenon of flowing air.

Not really; the past two years have seen probably an order of magnitude increase in interest and articles about industrial-scale ventilation and air quality.

You still wouldn't phrase it like "Airflow's Problem" because the concept of airflow is incapable of having a problem. It just exists. Some theories _about_ airflow may have problems, but then you would specify that in the title.

I disagree. The concept of airflow is a human model to describe a physical phenomenon. It might be that we have nailed it, or it might be that someone has found a problem with the model, and/or some new insight, that they wanted to share.
