There's not a lot of writing about this. Folks seem content to fight Airflow's deficiencies. Most of them are too young to know any better. The critiques you'll find are generally written by competitors, or folks adopting a competitor.
Here's the big one I see lots of folks get wrong: do NOT run your code in Airflow's address space.
Airflow was copied from Facebook Dataswarm and comes with a certain set of operational assumptions suitable for giant companies. These assumptions are, helpfully, not documented anywhere. In short, it is assumed that Airflow runs all the time and is ~never restarted. It is run by a team that is different from the team that uses it. That ops team should be well-staffed and infinitely-resourced.
Your team is probably not like that.
So instead of deploying a big fleet of machines, you are probably going to do a simple-looking thing: make a Docker container, put Airflow in it, then add your code. This gives you a single-repo, single-artifact way of deploying your Airflow stuff. But, since that's not how Airflow was designed to work, you have signed yourself up for a number of varieties of Pain.
First, you are now very tightly coupled to Airflow's versioning choices. Whatever version of Python runs Airflow runs your code. Whatever versions of libraries Airflow uses, you must use. This is bad. At one point I supported a data science job that used a trained model serialized with joblib. That serialization was coupled to Python 3.6 and some precise version of scikit-learn. We wanted to upgrade Python! We couldn't! Don't use PythonOperator. You need separation between Airflow itself and Your Code. Use a virtualenv, use another container, use K8s if you must, but please please do not run your own code INSIDE Airflow.
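To make that concrete, here's a hedged sketch (Airflow 2.x assumed; the callable, model path, and version pins are made up) of the kind of separation I mean, using PythonVirtualenvOperator so the task's dependencies don't have to match Airflow's own:

    # Sketch only: the task gets its own throwaway virtualenv instead of
    # running inside Airflow's interpreter and dependency set.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonVirtualenvOperator

    def score_customers():
        # Imports live inside the callable because it runs in a separate
        # interpreter inside the virtualenv, not in Airflow's own process.
        import joblib
        model = joblib.load("/models/churn.pkl")  # hypothetical path
        print(type(model))

    with DAG(
        dag_id="scoring_example",            # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        PythonVirtualenvOperator(
            task_id="score_customers",
            python_callable=score_customers,
            requirements=["scikit-learn==0.24.2", "joblib==1.0.1"],  # illustrative pins
            system_site_packages=False,
        )

DockerOperator or KubernetesPodOperator gets you even harder isolation, at the cost of more moving parts.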
Second, you cannot deploy without killing jobs. Airflow's intended "deployment" mechanism is "you ship DAG code into the DAG folder via mumble mumble figure it out for yourself". The docs are silent. It is NOT intended that you ship by balling up this mega-container, terminating the Airflow that's running, and starting up the new image. You can do this, to be sure. But anything running will be killed. This will be fine right up until it isn't. Or maybe not, maybe for you it'll be fine forever, but just please please realize that as far as Airflow's authors are concerned, even though they didn't say so, you are Doing It Wrong.
Color me very interested in footguns in Airflow as well.
We're currently considering it as a self-hosted process engine to automate technical business processes and to coordinate a few automation systems like Jenkins. Crap like: trigger a database restore via System A, wait until complete, update tickets, trigger some data migration from that database via System B, update tickets. Maybe bounce things back to human operators on errors and wait for fixing / clarification. Trigger a bunch of deployments in parallel and report on success.
Systems in this space are either (a) huge, hugely expensive enterprise applications designed to consume all business processes, which is a bit overkill for our current needs (Camunda, SAP, StackStorm, ...), (b) overfitted onto a very specific data analysis setup (aka: if you don't have Hadoop, don't touch it), or (c) overly simplistic and offering no real benefit beyond investing in self-hatred, Guinness and making Jenkins work.
Airflow seemed like a decent middle ground there for workflows a bit beyond what you can sensibly do via Jenkins jobs.
If you want an Airflow-ish approach without punishing your future self, pick Prefect. Otherwise go with Temporal. Above all, do not adopt Airflow in 2023 for the use cases you describe.
Prefect seems like a really good suggestion, thank you.
We're pretty much committing to Python as our language of choice in the infra layer. Most of the team is being sent on courses over the next month, too. So I have a whole lot of Python scripts popping up across the infrastructure.
And this approach of slapping some @task and some @flow onto scripts or helper functions seems to work really well with what the team is doing. It took me like 30-40 minutes to convert one of those scripts into what seems like a decently fine workflow. Very intrigued.
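Roughly, the conversion looks like this (a minimal sketch assuming Prefect 2.x; the function names and arguments are made-up stand-ins for our own helpers):

    from prefect import flow, task

    @task(retries=2, retry_delay_seconds=60)
    def restore_database(db_name: str) -> str:
        # call out to "System A" here and return some handle for later steps
        return f"restore-job-for-{db_name}"

    @task
    def update_ticket(ticket_id: str, note: str) -> None:
        print(f"[{ticket_id}] {note}")

    @flow
    def restore_and_report(db_name: str, ticket_id: str):
        job = restore_database(db_name)
        update_ticket(ticket_id, f"restore finished: {job}")

    if __name__ == "__main__":
        restore_and_report("analytics", "OPS-1234")

The existing script bodies mostly stay as-is; the decorators just add retries and logging and give you a place to wire steps together.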
Jenkins is honestly Really Good as a scheduling system. The best, frankly.
If you can combine it with a library- or tool-style DAG system (vs. server-style like Airflow), like make or Luigi or Nextflow or even Step Functions, that is a great sweet spot.
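For concreteness, a minimal Luigi sketch (task names, parameters and file paths are made up) of the library-style shape I mean; a Jenkins job can run it as a plain Python process on whatever schedule you like:

    import luigi

    class ExtractUsers(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(f"users_{self.date}.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("id,name\n")

    class BuildReport(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return ExtractUsers(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"report_{self.date}.txt")

        def run(self):
            with self.input().open() as inp, self.output().open("w") as out:
                out.write(f"{len(inp.readlines()) - 1} users\n")

    if __name__ == "__main__":
        luigi.run()  # e.g. python pipeline.py BuildReport --date 2023-01-01 --local-scheduler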
In my experience Jenkins is like shell scripting: in the hands of someone who understands its strengths and weaknesses and is very disciplined in how it is used and maintained, it's both performant and flexible. If you follow the path of least resistance it becomes a mess.
I agree. For some reason, Airflow really sucks at running something on a cron-like timer. I can't remember all the issues I had with it exactly (this was several jobs ago), but getting it to run a job at the same time every day was a nightmare. One issue I do recall was that it treated the first run differently, basically ignoring your defined schedule. And I think "first run" meant every time Airflow was restarted. So if Bad Things happened when that job ran at the wrong time of day, you had to add extra logic into the job itself to abort if it was being called at the wrong time. How they managed to make this so difficult is mind boggling. By contrast, Jenkins will reliably execute a job on a timer without issues.
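For anyone stuck with Airflow anyway: the closest thing I know of to plain cron behaviour is pinning everything down explicitly, something like this sketch (Airflow 2.x assumed; the DAG id and command are made up). catchup=False stops it from backfilling "missed" intervals when the DAG is first enabled, which is one flavour of the surprise first run:

    import pendulum
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="nightly_report",                    # hypothetical
        start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
        schedule_interval="0 6 * * *",              # 06:00 UTC every day
        catchup=False,                              # don't backfill missed intervals
        max_active_runs=1,
    ) as dag:
        BashOperator(task_id="run_report", bash_command="python /opt/jobs/report.py")

Also keep in mind that a scheduled run fires at the end of its data interval, not the start, which trips up a lot of first-time users.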