The Unbundling of Airflow (fal.ai)
164 points by gorkemyurt on Feb 15, 2022 | 58 comments



I think you're 100% right that the tasks that can be accomplished in Airflow are currently being unbundled by tools in the modern data stack, but that doesn't erase the need for tools like Airflow. Sure, you can now write less code to load your data, transform it, and send it out to other tools. But as the unbundling occurs, the end result is more fragmentation and fragility in how teams manage their data.

Data teams I talk to can't turn to any single location to see every touchpoint their data goes through. They're relying on each tool's independent scheduling system and hoping that everything runs at the right time without errors. If something breaks, bad data gets deployed and it becomes a mad scramble to verify which tool caused the error and which reports/dashboards/ML models/etc. were impacted downstream.

While these unbundled tools can get you 90% of the way to your desired end goal, you'll inevitably face a situation where your use case or SaaS tool is unsupported. In every situation like this I've ever faced, the team ultimately ends up writing and managing their own custom scripts to account for this situation. Now you have your unbundled tool + your custom script. Why not just manage all of the tools and your scripts from a singular source in the first place?

While unbundling is the reality, this new era of data technology will still need data orchestration tools that serve as a centralized view into your data workflows, whether that's Airflow or any of the new players in the space.

(Disclosure: I'm a co-founder of https://www.shipyardapp.com/, building better data orchestration for modern data teams)


Agreed. Even when they do support one's intended use case, these unbundled tools seem like classic examples of the inner platform effect: https://en.wikipedia.org/wiki/Inner-platform_effect

No amount of tooling will make data transformation a painless process; all you end up doing is burying the business logic under so many layers of abstraction that it becomes impossible for anyone to understand.


Isn't the main selling point of airflow the bundling in the first place? Why would you want many different specialized tools to manage scheduled tasks?


Cue the famous Jim Barksdale quote: "There's only two ways to make money in business: one is to bundle; the other is to unbundle."


I think there are two factors at play here:

1) Specialized tools reduce the amount of engineering overhead. As a business, I primarily care about time to value. If I can use specialized SaaS to get my data centralized, clean, and synced across my tools in a week, why would I want to spend months building all of these processes from scratch?

Sure, I lose control, visibility, and more... but I was able to deliver value 3 months ahead of schedule.

2) Existing tools like Airflow are highly technical to get started with. You can't just focus on building out scripted solutions. You have to set up and manage the infrastructure. You have to sift through the tool's documentation to understand how to effectively build DAGs. You have to mix platform logic into your business logic to make sure your code will run on Airflow (a rough sketch of what that wrapping looks like is at the end of this comment).

Because the demand for data professionals is high and the supply is low, the technology ends up trying to offset the need for those highly technical skills in your organization.
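To make point 2 concrete, here is a minimal sketch (Airflow 2.x style; the DAG and function names are made up) of the platform boilerplate your business logic ends up wrapped in:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def clean_orders():
        # the actual business logic you care about
        ...

    # everything below is platform logic Airflow needs before it will run that function
    with DAG(
        dag_id="orders_pipeline",        # hypothetical name
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="clean_orders", python_callable=clean_orders)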


I get what you're saying but trying to make sure your code will run on airflow is the wrong way of thinking about it IMHO. You should be trying to get airflow to make sure your code runs (could be in airflow, could be anywhere else).

A lot of the stuff we do with airflow is just basically sending commands and looking at the result (and handling any errors), this part is generic enough that you usually only need to implement it once for whatever platform your code is running on.
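For what it's worth, that "send a command and check the result" piece can be a single small helper, something like this sketch (run_platform_command and the example command are made-up placeholders):

    import subprocess
    from airflow.exceptions import AirflowException

    def run_platform_command(cmd):
        """Send a command to the target platform, return its output, fail the task on errors."""
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # raising makes Airflow mark the task failed, so retries/alerting kick in
            raise AirflowException("command %s failed: %s" % (cmd, result.stderr))
        return result.stdout

    # used from a PythonOperator, e.g. run_platform_command(["spark-submit", "job.py"])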

The tricky bit is when your DAG crosses platforms, but that's always a problem. If anything it's easier to solve when the tool scheduling tasks isn't part of the platform (note however that airflow is not a tool for solving dataflow, though some glue code in python does often work wonders).


Exactly my thoughts as well. I have one point where I can see if all the remote services that I am using are operating correctly. I don't need to connect to various other apps to figure this out.


This is why we moved to airflow vs lots of Cron jobs. Centralized place to look, logging, etc.


I ran my last startup on cron jobs. Many times I didn't notice when a job didn't execute. Fixing that alone was an immediate value proposition for me.


As a newcomer to the world of data, I have no strong opinions about Airflow. It replaced a bunch of disparate cron jobs, so it's definitely better than what was there before.

There are things I like and things I don't about it. The UI is awful -- I don't know anyone that likes it, unlike what the article states. I like that it's centralized and that it's all Python code.

Deploying it and fine-tuning the config for a variety of workloads can be a pain. Sometimes sensors don't work right. Tasks sometimes get evicted and killed for obscure reasons. Zombie tasks are a pain big enough you'll see plenty of requests for help online.

That said, replacing it with a bunch of disparate tools again? Seems like a step backwards. And now instead of a single tool, your org has to vet, secure, understand and monitor a bunch of different tools? It's bad enough with only one...

What am I missing?

PS: data analysis/engineering as a field seems new and immature enough that, in my humble opinion, we should be focusing on developing good practices and theory, instead of deprecating existing (and pretty recent) tech at an ever increasing pace.


What you're missing is that for much of enterprise software before Airflow, everything was steaming rubbish.

Airflow is... not amazing. But by the standards of horrible enterprise software we've all been subjected to, it's not that bad.

If you're complaining about Airflow, wait for the day you're forced to use an internally built database client.

That's Afghanistan.

Our proprietary AWS wrapper takes 45 damn minutes on a good day to allocate a VM. The AMI is built in two minutes. TWO.

I'm sure in 5 years Dagster and Prefect will have improved gradually in lots of incremental ways. For now Airflow is pretty solid.


> If you're complaining about Airflow

Wait, maybe I explained myself badly: while I am complaining about some things I dislike about Airflow, at the same time I'm saying it's better than the random assortment of cron jobs we had before, and pushing back against the idea of "unbundling" it and going back to disparate tools by separate vendors.

I like writing Python code, I feel in control.


I have memories of pasting 10 line powershell scripts into one of those tiny windows XP text entry boxes, and being happy I could do so!


Thanks for saying this. I've also been tasked with introducing airflow at my company. I decided to use 2.0 so it's more python dags. But for the most part the dags are JUST triggered via web service by other processes.

so... it's nothing more than processing plus a queue. I mean we already have rabbit and typescript. We also already have Typescript + Agenda (over mongo).

We have gotten to the point where a single company is implementing queuing at least 4 different ways because "microservices".


> data analysis/engineering as a field seems new and immature enough that, in my humble opinion, we should be focusing on developing good practices and theory, instead of deprecating existing (and pretty recent) tech at an ever increasing pace.

I disagree with you; data engineering as a field has been around for a very long time. Good practices exist and are flexible enough to accommodate new ones, like MLOps and data versioning.

However, for every great DE setup, you can find at least ten others that are a complete pile of shit, featuring mission-critical scripted SQL reports that no one understands anymore and closed-source orchestration products with million-dollar support contracts that only one person has access to.

As always, tooling is rarely the issue. Data engineers are rarely working on the overall "big picture" and are often given tasks without context. Embedding data engineering within product and infrastructure teams is the solution to that problem.


This post is hard to follow. But I'll give my unsolicited opinion on airflow:

It's too complex for a single team to run, and there are far better tools out there for scheduling. Airflow only makes sense when you need complex logic surrounding when to run jobs, how to backfill, when to backfill, and complex dependency trees. Otherwise, you are much better off with something like AWS Step Functions.


Everyone's context is different, but I've found the exact opposite to be true. Airflow is simple and dumb enough that it can be easily understood and managed by a small team, but it's also flexible and powerful enough that we can't come up with a good enough reason to switch to anything else.*

*We are, however, becoming more and more reliant on dbt, and the article makes a good point about Airflow providing no visibility for what's going on in a dbt node. So we're ending up with an increasingly simpler Airflow dag, with most of the complexity hidden inside a single dbt node.


This reflects how I often deploy Airflow as well (usually on GCP as Composer).

We use DBT to manage the DAG for the BQ transformations, put this in a container, and deploy it into the Kubernetes cluster that Airflow is running on, as a single Airflow node.

Airflow can then handle the scheduling and DAG nodes for non-DWH dependencies such as loading/checking for files, kicking off tasks that need to run after the DWH refresh, and the like.

I find that once it is set up, it is extremely easy for small teams to follow the pattern, and the single view of all the pipelines running is a great benefit - as is handling the logic around last successful runs etc. that would otherwise need to be implemented manually with simple cron jobs.
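For anyone curious, a rough sketch of that setup, assuming Airflow 2.x with the Google and Kubernetes providers installed (the image, bucket, and object names are made up):

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
    from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

    with DAG("dwh_refresh", start_date=datetime(2022, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:

        wait_for_file = GCSObjectExistenceSensor(
            task_id="wait_for_export",
            bucket="raw-landing-bucket",          # hypothetical bucket/object
            object="exports/orders.csv",
        )

        dbt_run = KubernetesPodOperator(
            task_id="dbt_run",                    # the single "T" node
            name="dbt-run",
            namespace="default",
            image="gcr.io/my-project/dbt:latest", # container holding the dbt project
            cmds=["dbt", "run", "--profiles-dir", "."],
        )

        wait_for_file >> dbt_run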


I'm not too familiar with the use of dbt but what was the reason you chose to have a single dbt node rather than translating the dependencies into an airflow dag?


I understand it is subjective. But I use a forked version of https://github.com/puckel/docker-airflow on our managed K8s cluster and it points to a cloud-managed Postgres. It has worked pretty well for over 3 years with no one actually managing it from an infra POV. YMMV. This is driving a product whose ARR is well into the hundreds of millions.

If you have simple needs that are more or less set, I agree Airflow is overkill and a simple Jenkins instance is all you need.


I run Airflow even for my local trading setup. For large teams, I often go with managed solutions like Astronomer.


Hi, I'm the author of the post. Which parts did you find hard to follow?


> there are far better tools out there for scheduling

Really? Which ones? The only thing vaguely fitting this case is Jenkins, but using Jenkins to run ETL/ELT is a serious impedance mismatch.


Dagster/Prefect are the alternatives.

But yes, I'm confused. Triggering a dag and having it exit based on complex logic is a perfectly normal pattern.


Interesting, I wouldn’t say that I’ve found it difficult to run in even a small team.

The problem I’ve always had with Airflow has been with non-cron-like use cases, for example data pipelines kicked off when some event occurs. Sensors were often an awkward fit, and the HTTP API was quite immature back when I was using it.


Agreed about sensors. We still have some trouble figuring them out and understanding why they sometimes don't trigger when they should.


I manage and run our airflow instance - outside of migrating from 1.x to 2.x I haven't really had any problems. The learning curve was a bit higher than I hoped, but being able to set tasks downstream and backfill is so much nicer than running scripts with regular cron / Windows Task Scheduler.


There really aren't many alternatives out there after cron. Maybe lambda jobs count? What are you thinking of as alternatives?


do you have recommendations on alternatives that are not tied to a cloud provider?


We are trying to build something like this at https://www.magniv.app/.

Would love to have you join our beta if you are interested!


Shipyard, Prefect, Dagster are all good options. Lots of newcomers in the orchestration space.


jenkins


If I put each select statement in its own Airflow task, I get the same lineage dbt gives me, except I can see it and administer to it alongside all my other E and L-type tasks.

Also, I can write my T in plain ol’ SQL (granted, with some jinja) instead of this dbt-QL that I can’t copy and paste into my database console or share with a non-dbt user.

So, folks who have adopted dbt: what am I missing by being a fuddy-duddy?


I don't think you are missing anything, but allowing DBT to contain all the models that make up your various pipelines and reference each other means that you can schedule your various pipelines at different cadences and use tags to refresh the relevant DBT models from a single code base.

It sounds like in your approach this would mean writing that dependency logic into each DAG you schedule on airflow.

In the same way you would interpolate your jinja SQL before copying it into the database, you would use dbt compile (or the output of a dbt run, in the target/ folder) and copy that SQL into your DB console or share it.

EDIT: This means your T is a single airflow node in each DAG, though I then still use airflow for the E/L tasks around it
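A sketch of the cadence/tag idea (the project path and tag names are made up; as mentioned, dbt compile leaves the fully rendered SQL under target/ if you want something to paste into a console):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hourly and daily DAGs both point at the same dbt project; tags decide which models refresh.
    with DAG("dbt_hourly", start_date=datetime(2022, 1, 1),
             schedule_interval="@hourly", catchup=False) as hourly:
        BashOperator(task_id="dbt_run_hourly",
                     bash_command="cd /opt/dbt_project && dbt run --select tag:hourly")

    with DAG("dbt_daily", start_date=datetime(2022, 1, 1),
             schedule_interval="@daily", catchup=False) as daily:
        BashOperator(task_id="dbt_run_daily",
                     bash_command="cd /opt/dbt_project && dbt run --select tag:daily")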


Biggest factors in dbt adoption are:

- Automatic DAG generation based on dbt-QL declared dependencies.

- The structure of where (db/schema) and how (table/view/temporary) things are built is defined in a YAML configuration, not the individual SQL statements.

- Testing/documentation baked in.

Sure, you can manage every select statement as its own task, but it becomes pretty infeasible once things scale.

dbt can still be administered alongside all other E and L-type tasks. It's just a Python CLI wrapped around SQL SELECT statements.


Looks like dbt Labs is working on a dbt Server that can translate your dbt-QL type queries into SQL. That way you can use different "building blocks" of SQL with the rest of your team. This becomes pretty powerful with the metrics layer that dbt just introduced.


I love Airflow. Plenty of data businesses I've built are nothing more than just one DAG.

As for the article, I don't think we are yet at the point where a competing stack composed of individual specialized components does things better, since Airflow is more than the sum of its parts imho.


Well written. I think that Airflow is being enforced in organizations as the main orchestrator even though it's not always the right tool for the job. In addition, organizations have to enforce a micro-services approach to get modular components. Besides that, managing those frameworks is a nightmare. We built Ploomber (https://github.com/ploomber/ploomber) specifically for this reason: modular components and easy deployments. It standardizes your pipelines and allows you to deploy seamlessly on Airflow, Argo (Kubernetes), Kubeflow, and cloud providers.


I strongly disagree, having been part of efforts to use microservices in this space.


I have used airflow with two different organizations over the past couple years. When we had a complex orchestration with critical pipelines and enough human-power to manage the system, it was great. Trying to deploy it for a small team with no critical pipelines has been overkill and we recently migrated to dagster which is still in beta but accomplishes 90% of what airflow does with a much smaller footprint.


You can "unbundle" Airflow into different components. What is it called when you take one thing and break it into many pieces? Distributed (sometimes decentralized) computing. What do you get when you take a single system and distribute/decentralize it? Complexity. And what's the best way to simplify complexity? Consolidate the complexity into one system.

The Circle of Computing Complexity.


The author even mentions they hope to see this consolidated into dbt Cloud right there at the end of the article!


I like this post, because in many ways it highlights the importance of how Airflow has helped shape the modern data stack.

As mentioned in this thread, managing Airflow can quickly become complicated. Its flexibility means that you can stretch Airflow in pretty interesting ways, especially when trying to pair container orchestrators like k8s with it.

To combat that complexity and reduce the operational burden of letting a data team create & deploy batch processing pipelines we created https://github.com/orchest/orchest

We suspect that many standardized use cases (like reverse ETL) will start disappearing from custom batch pipelines. But there’s a long tail of data processing tasks for which having freedom to invoke your language of choice has significant advantages. Not to mention stimulating innovative ideas (why not use Julia for one of your processing steps?).


"You can use these 24 proprietary paid tools instead of Airflow"

Thanks.


Tools like prefect.io, IMHO, are just this: a 'modular' Airflow where you can pick and choose to use just the DAG with no GUI, the workflow GUI, scheduling, runners of all types from local to k8s.
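For example, in the Prefect 1.x style (roughly from memory, so treat it as a sketch; the 2.x API is different), a flow is plain Python you can run locally with no scheduler or GUI involved:

    from prefect import task, Flow

    @task
    def extract():
        return [1, 2, 3]

    @task
    def load(rows):
        print(len(rows), "rows loaded")

    with Flow("etl") as flow:   # building the DAG
        load(extract())

    flow.run()  # run locally; registering with a backend/agent adds scheduling and the GUI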


It's ok, but seems to be a bit too complex for what it does. It was pretty janky running it locally (pegged the CPU), and now that we have it in MWAA we've got several support issues on it with AWS for unkillable task instances and scheduler problems.


I also suffered that cpu bug many years ago but I'd hope it has been fixed by now! The scheduler stealing all of the cycles.....


I found Airflow to be extremely buggy when I started deploying it for use with a small/medium sized team. Did anybody else have a similar impression, or has it gotten better with later versions?


2.0 fixed a lot


This is fine and will allow Airflow to focus on its core functionality of being a distributed job scheduler.

FWIW, last I looked at Airflow I thought the schedule+task model could be made tighter, as there were numerous ways to enter inconsistent states. For example, changing the schedule after tasks had already run would let you rerun jobs (in the past) at dates that were never scheduled in the first place.


High Resolution Version of the diagram if anyone is interested

https://drive.google.com/file/d/1btZ0yck9SdgsUdNom0WXgHcSQvO...


Funny enough this post mirrors quite a bit of our thinking over at Dagster! https://dagster.io/blog/rebundling-the-data-platform


I had a pretty terrible experience doing devops to automate the setup of an Airflow setup in 2020. This was before 2.0; I assume a lot of the bugs and issues may have been at least partially addressed.

My main gripes:

- The out of the box configuration is not something you should use in production. It's basically using python multiprocess (yikes) and sqlite like you would on a developer machine. Instead, you'll be using dedicated workers running on different machines and either a database or redis in between.

- Basically the problem is that Python is single threaded (the infamous GIL) and has synchronous IO. And that kind of sucks when you are building something that ought to be asynchronous and running on multiple threads, cores, CPUs, and machines. It's not a great language for that kind of job. Mostly in production it acts as a facade for stuff that is much better at such things (Kubernetes, YARN, etc.).

- Most of the documentation is intended for people doing stuff on their laptops, not for people trying to actually run this in a responsible way on actual servers. In our case that meant referring to third party git repositories with misc terraform, aws, etc. setup to figure out what configuration was needed to run it in a more responsible way.

- Python developers don't seem to grasp the notion that installing a lot of python dependencies on a production server is not a very desirable thing. Doing that sucks, to put it mildly. Virtual environments help. Either way, that complicates deployment of new dags to production. That severely limits what you should be packaging up as a dag and what you should be packaging up with e.g. docker.

- What that really means is that you should be considering packaging up most of your jobs using e.g. Docker. Airflow has a docker runner and a kubernetes runner. I found using that to be a bit buggy but we managed to patch our way around it.

- Speaking of docker, at the time there was no well supported dockerized setup for Airflow. We found multiple unsupported bits of configuration for kubernetes by third parties though. That stuff looked complicated. I quickly checked and at least they now provide a docker-compose for a setup with postgresql and redis; so that's an improvement.

- The UI was actually worse than Jenkins, and that's a bit dated to say the least. Very web 1.0. I found myself hitting F5 a lot to make it stop lying about the state of my dags. At least Jenkins had auto reload. I assume somebody might have fixed that by now but the whole thing was pretty awful in terms of UX.

- Actual dag programming and testing was a PITA as well. And since it is python, you really do need to unit test dags before you deploy them and have them run against your production data. A small typo can really ruin your day.
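The cheapest guard against that class of typo is a test that just loads the DagBag before deploying, roughly like this sketch (run it with pytest):

    from airflow.models import DagBag

    def test_dags_import_cleanly():
        # parses every file in the dags folder; syntax errors and bad imports land in import_errors
        bag = DagBag(include_examples=False)
        assert not bag.import_errors, "DAG import failures: %s" % bag.import_errors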

We got it working in the end but it was a lot of work. I could have gotten our jobs running with jenkins in under a day easily.


> Python developers don't seem to grasp the notion that installing a lot of python dependencies on a production server is not a very desirable thing. Doing that sucks, to put it mildly.

I do find this part particularly annoying, since this project sits uncomfortably between library and appliance.

In an appliance, yeah sure you can pick and lock down whatever dependencies you want. But as a library you need to be lean and hyper flexible in what’s an acceptable dependency.

Airflow invites you to put a lot of logic into what runs in their venv, which may mean your project’s dependencies must include all of theirs. Being in that state is rather unfun.


What's the state of the art in terms of "distributed cron" these days?


Since this might catch the eye of someone knowledgeable …

Does anyone know if the community docker images for airflow can be run using podman?


Do you mean like the Docker operator? Or the actual Airflow images themselves?

On the first one, I suspect "probably", on the second "yes".


Tl;dr 2nd to last paragraph says that Airflow's unbundling is better than writing a better Airflow. Final paragraph says that DBT Cloud will become the better Airflow.



