
Introduction to Apache Airflow - bhavaniravi
https://bhavaniravi.com/blog/apache-airflow-introduction
======
dmayle
I use Airflow, and am a big fan. I don't think it's particularly clear,
however, as to _when_ to use airflow.

The single best reason to use airflow is that you have some data source with a
time-based axis that you want to transfer or process. For example, you might
want to ingest daily web logs into a database. Or maybe you want weekly
statistics generated on your database, etc.

The next best reason to use airflow is that you have a recurring job that you
want not only to happen, but to track its successes and failures. For
example, maybe you want to garbage-collect some files on a remote server with
spotty connectivity, and you want to be emailed if it fails for more than two
days in a row.

Beyond those two, Airflow might be very useful, but you'll be shoehorning your
use case into Airflow's capabilities.

Airflow is basically a distributed cron daemon with support for reruns and
SLAs. If you're using Python for your tasks, it also includes a large
collection of data abstraction layers such that Airflow can manage the named
connections to the different sources, and you only have to code the transfer
or transform rules.
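To make that concrete, a daily ingest DAG along those lines is only a few
lines of Python (the DAG id, script, and email address below are made up, not
from the article):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,                             # rerun failed tasks automatically
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,                 # e.g. alert if the job keeps failing
    "email": ["oncall@example.com"],
}

with DAG(
    dag_id="daily_weblog_ingest",
    schedule_interval="@daily",               # the time-based axis mentioned above
    start_date=datetime(2020, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = BashOperator(
        task_id="ingest_yesterdays_logs",
        # {{ ds }} is the run's logical date, so each run processes its own day
        bash_command="ingest_logs.sh --date {{ ds }}",
    )
```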

~~~
javajosh
Yes, this seems to be yet another tool that falls prey to what I think of as
"The Bisquick Problem". Bisquick is a product that is basically pre-mixed
flour, salt, baking powder that you can use to make pancakes, biscuits, and
waffles. But why would you buy this instead of its constituent parts? Does
Bisquick really save that much time? Is it worth the loss of component
flexibility?

Worst of all, if you accept Bisquick, then you open the door to an explosion
of Bisquick options. It's a combinatorial explosion of pre-mixed ingredients.
In a dystopian future, perhaps people stop buying flour or salt, and the ONLY
way you can make food is to buy the right kind of Bisquick. Might make a kind
of mash up of a baking show and Black Mirror.

Anyway, yeah, Airflow (and so many other tools) feel like Bisquick. It has all
the strengths, but also all the weaknesses, of that model.

~~~
jacobr1
The art of software engineering is all about finding the right abstractions.

Higher-order abstractions can be a productivity boon but have costs when you
fight their paradigm or need to regularly interact with lower layers (in ways
the designs didn't presume).

Airflow and similar tools are doing four things:

A) Centralized cron for distributed systems. If you don't have a unified
runtime for your system, the old ways of using Unix cron, or a "job system"
become complex because you don't have centralized management or clarity for
when developers should use one given scheduling tool vs another.

B) Job state management. Jobs can fail and may need to be retried, people
alerted, etc. Most scheduling systems have some way to deal with failure too,
but these tools are now treating this as stored state.

C) DAGs, complex batch jobs are often composed of many stages with
dependencies. And you need the state to track and retry stages independently
(especially if they are costly)

D) What many of these tools also try to do is tie the computation performing
a given job to the scheduling tool. This now seems to be an antipattern. They
also try to have "premade" job stages or "operators" for common tasks. These
are a mix of wrappers to talk to different compute systems and actual compute
mechanisms themselves.

If you have the kind of system that is either sufficiently distributed, or
heterogeneous enough that you can't use existing schedulers, you need
something with #A, but if you also need complex job management, you need #A,
#B and #C, and having rebuilt my own many times, using a standard system is
better when coordinating between many engineers. What seems necessary in
general is #D.
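To make (C) concrete in Airflow terms (the task names here are invented), each
stage is its own task with its own retry policy, and the `>>` operator wires
up the dependencies, so a costly stage can be retried or cleared on its own:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    dag_id="multi_stage_batch",
    schedule_interval="@weekly",
    start_date=datetime(2020, 1, 1),
    catchup=False,
) as dag:
    # Each stage is tracked (and can be rerun) independently.
    extract = DummyOperator(task_id="extract")
    transform = DummyOperator(task_id="transform", retries=3,
                              retry_delay=timedelta(minutes=5))
    load = DummyOperator(task_id="load")
    report = DummyOperator(task_id="report")

    extract >> transform >> load >> report
```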

~~~
jacobr1
I meant to say D seems unnecessary

------
sdepablos
There're so many alternatives to Airflow nowadays that you really need to make
sure that Airflow is the best solution (or even a solution) to your use case.
There's plenty of use cases better resolved with tools like Prefect or
Dagster, but I suppose the inertia to install the tool everyone knows about is
really big.

BTW, here's [https://github.com/pditommaso/awesome-pipeline](https://github.com/pditommaso/awesome-pipeline), a list of almost 200
pipeline toolkits.

~~~
walleeee
I've had a wonderful experience with Dagster so far. I love that it can deploy
to Airflow, Celery, Dask, etc, I love the Dagit server and UI and that I can
orchestrate pipelines over HTTP, I love the notebook integration via
Papermill, I love that it's all free (looking at Prefect here...), and the
team is extremely responsive on both Slack and GitHub

~~~
_frkl
Didn't Prefect open source their orchestration component recently, or am I
mistaken? What part of Prefect is still closed?

~~~
walleeee
Oh, I was saying it wasn't free. I think you're right and it is fully open
source

~~~
rywalker
It's not quite "open source", better labelled as "source available"

------
anordin95
I've been using Airflow for nearly a year in production and I'm surprised by
all the positive commentary in this thread about the tool. To be fair, I've
been using the GCP offered version of Airflow - Composer. I've found various
components to be flaky and frustrating. For instance, large scale backfills
don't seem to be well supported; I find various components of the system break
when attempting them at scale, for instance with 1M DAG runs. As another
note, the scheduler also seems rather fragile and prone to crashing. My team
has generally found the system good, but not rock-solid nor something to be
confident in.

~~~
rywalker
Yes, we need to improve the backfills in Airflow.

We're working on making the scheduler HA and more performant; reach out to me
if you'd like to collaborate on your use case (ry at astronomer dot io).

------
Peteris
Airflow is an incredibly powerful framework to use in production, but a little
unwieldy for anything else.

You can use something like Kedro
([https://github.com/quantumblacklabs/kedro](https://github.com/quantumblacklabs/kedro))
to get started building pipelines with pure Python functions. Kedro has its
own pipeline visualiser and also has an Airflow plugin that can automatically
help you generate airflow pipelines from Kedro pipelines.

[https://github.com/quantumblacklabs/kedro-airflow](https://github.com/quantumblacklabs/kedro-airflow)

------
zukzuk
Airflow is great, right up to the point where you try to feed date/time-based
arguments to your operators (a crucial bit of functionality not covered in the
linked article). The built-in API for that is a random assortment of odd
macros and poorly designed python snippets, with scoping that never quite
makes sense, and patchy and sometimes misleading documentation.

~~~
diogofranco
Agree that this is a bit confusing. I ended up writing a small guide on how
date arguments work in Airflow
([https://diogoalexandrefranco.github.io/about-airflow-date-ma...](https://diogoalexandrefranco.github.io/about-airflow-date-macros-ds-and-execution-date/))
and I always end up consulting it myself, as I just can't seem to memorize any
of these macros.
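For a quick picture, here is a minimal sketch of the two usual routes for
getting dates into operators (task names are invented and a `dag` object is
assumed to be in scope): templated fields get Jinja macros like `{{ ds }}`,
while PythonOperator callables can read the same values from the context:

```python
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

# Templated fields (e.g. bash_command) get the run's logical date, not "now".
export = BashOperator(
    task_id="export_partition",
    bash_command="export.sh --date {{ ds }} --ts {{ ts }}",
    dag=dag,
)

# The callable receives ds, execution_date, etc. via the context
# (provide_context=True is needed on Airflow 1.x).
def process(ds, execution_date, **context):
    print(f"processing partition {ds} (logical timestamp {execution_date})")

process_task = PythonOperator(
    task_id="process_partition",
    python_callable=process,
    provide_context=True,
    dag=dag,
)
```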

------
Grimm1
Airflow is great. Honestly, the biggest gotchas are passing time to operators,
which someone has mentioned in the thread already, and setting up the initial
infra is a bit annoying too. Other than that, though, as a batch-ETL scheduler
and all-around job scheduler it's pretty great; it's really very user friendly,
and its graphical interface simplifies a lot of the management process. I see a
lot of people here prefer the non-graphical libs like Luigi or Prefect, and to
each their own, but I really do prefer having that interface in addition to the
pipelines-as-code line of thinking.

I also see a lot of people saying it's a solution for big companies and the
like. I heavily disagree; it's useful for any size of company that wants to
have better organization of its pipelines and provide an easy way for
non-technical users to check on their health.

~~~
hn2017
FYI - Prefect has a GUI too.

I also agree Airflow is good for smaller companies too if they're familiar
with Python.

~~~
Grimm1
That's good to know, I had been trying to figure that out but the doc pages I
landed on didn't really make it clear when I was looking at it.

------
naveedn
I don’t think this blogpost provides any value over the official
documentation. You can run through the airflow tutorial in about 30 minutes
and understand all the main principles pretty quickly.

~~~
bthomas
I recently went through the airflow docs for the first time - agree. But this
comment thread has been much more helpful than any of the docs!

------
zomglings
I'm actually considering using Airflow. Have never used it before, and I have
the impression that setting up the required infrastructure could be
problematic.

Since a lot of you use Airflow, I am curious about your experience with it:

1\. Are you hosting Airflow yourselves or using a managed service?

1\. a. If managed, which one? (Google Cloud Composer, Astronomer.io, something
else?)

1\. b. If self-hosted, how difficult was the setup? It seems daunting to get a
stable setup (external database, rabbit or redis, etc.).

2\. Do you use one operator (DockerOperator looks like the right choice) or do
you allow yourself freedom in operators? Do you build your own?

3\. How do you pass data from one task to the next? Do the tasks themselves
have to be aware of external storage conventions or do you use the built-in
xcom mechanism? It seems like xcom stores messages in the database, so you run
the risk of blowing through storage capacity this way?

~~~
pyrophane
1\. Managed, Cloud Composer. Cloud Composer is getting there. It feels much
less buggy than just 8 months ago when I started using it, and it is improving
rather quickly.

One downside with Composer, though, is that it must be run in its own GKE
cluster, and it deploys the Airflow UI to App Engine. These two things can
make it a bit of a pain to use alongside infrastructure deployed into another
GKE cluster if you need the two to interact.

I would probably still recommend Composer over deploying your own Airflow into
GKE, as having it managed is nice.

2\. Freedom. For some tasks we run containers in GKE, for others we use things
like the PythonOperator or PostgresOperator.

A note here: Using containers with Airflow is not trivial. In addition to
needing some CI process to manage image building/deployment, having the
ability to develop and test DAGs locally takes some extra work. I would only
recommend it if you are already invested in containers and are willing to
devote the time to ops to get it all working.

3\. X-com is useful for small amounts of data, like if one task needs to pass
a file path, IDs, or other parameters to a downstream task. For everything
else have a task write its output to something like S3 or a database that
another task will read from.
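A rough sketch of that split (the bucket, task names, and `dag` object below
are placeholders): return values travel through XCom automatically, while the
bulky output goes to storage and only its path is passed along:

```python
from airflow.operators.python_operator import PythonOperator

def produce(**context):
    # Heavy output goes to external storage; only its path goes through XCom.
    path = f"s3://my-bucket/exports/{context['ds']}/data.parquet"  # illustrative
    # ... write the data to that path with your storage client of choice ...
    return path  # return values are pushed to XCom automatically

def consume(**context):
    path = context["ti"].xcom_pull(task_ids="produce_export")
    # ... read from `path` and continue processing ...
    print(f"reading {path}")

produce_task = PythonOperator(task_id="produce_export", python_callable=produce,
                              provide_context=True, dag=dag)
consume_task = PythonOperator(task_id="consume_export", python_callable=consume,
                              provide_context=True, dag=dag)
produce_task >> consume_task
```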

All in all, I would say use Airflow if you need the visibility and dependency
management. Don't use it if you could get away with something like cron and
some scripts or a simple pool of celery workers.

Also, don't use it if your workflows are highly dynamic. For example, if you
have a situation where you need to run a task to get a list of things, then
spawn x downstream tasks based on the contents of the list. Airflow wants the
shape of the DAG to be defined before it is run.

Hope that helps.

~~~
hn2017
Your last point about highly dynamic workflows was a particular pain point for
me and I think for many others. One recommendation for Airflow is to create a
list of use cases with sample DAGs to show best practices.

------
nojito
Airflow is a great example of technology being used at a massive company for
massive company problems...which is now being pushed as a solution to
_everything_

Papermill is another example.

~~~
detaro
Any hints at solutions that help if you're a small company having small
company sized problems but don't want to DIY the entire flow execution logic?

~~~
walleeee
I'm the lone developer on a project which will likely never scale beyond a few
thousand users and I'm really liking Dagster. You can deploy it on a bunch of
other platforms (Airflow, Dask, Celery, K8s) which is _really_ nice for my use
case (automating workflows in HPC environments from the browser) or run it
standalone

------
unixhero
Thread from a few weeks back on Apache Airflow's cousin, Apache NiFi [0]. A
lot of great discussion in that thread, just like in this one.

[0]
[https://news.ycombinator.com/item?id=23144450](https://news.ycombinator.com/item?id=23144450)

------
slap_shot
> Airflow is an ETL(Extract, Transform, Load) workflow orchestration tool,
> used in data transformation pipelines.

Apologies if this is pedantic, but the orchestration of jobs transcends ETL
workflows. There are countless use cases for scheduling dependent jobs that
aren't ETL workloads.

~~~
carlosf
I've been quite happy with the following pattern:

\- Encapsulate your business logic in microservices and expose ETL actions
with APIs.

\- Call your microservices using Airflow.

That way my Airflow jobs are very lightweight (they only call HTTPS APIs) and
only contain the logic of when to do things, not how. All the core business
logic for a specific domain lives in a single container that can be tested and
deployed independently.

Doing ETL using Airflow jobs exclusively or Lambda would spread business logic
and make it a nightmare to test and reason about.
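A rough sketch of that pattern (the connection name, endpoint, and payload are
made up, a `dag` object is assumed to be in scope, and the exact import path
differs between Airflow versions): the DAG only says when to call the service,
and Airflow holds the named connection:

```python
import json

# Airflow 1.10 import path; in 2.x this operator lives in the http provider.
from airflow.operators.http_operator import SimpleHttpOperator

trigger_rollup = SimpleHttpOperator(
    task_id="trigger_nightly_rollup",
    http_conn_id="orders_api",              # named connection managed in Airflow
    endpoint="/v1/rollups",                 # hypothetical microservice endpoint
    method="POST",
    data=json.dumps({"date": "{{ ds }}"}),  # templated: each run sends its own date
    headers={"Content-Type": "application/json"},
    response_check=lambda response: response.status_code == 200,
    dag=dag,
)
```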

------
chartpath
Happy user of Prefect here. I prefer it for being more programmable and able
to run on Dask. If you just want dynamic distributable DAGs and not
necessarily an "ops" appliance feel (like Airflow), check them out:
[https://docs.prefect.io/core/getting_started/why-not-airflow...](https://docs.prefect.io/core/getting_started/why-not-airflow.html)

Not knocking Airflow, it is great. Luigi too.

------
hn2017
Airflow isn't perfect, but it's in active development, and one of the biggest
pros compared to other toolkits is that it's an Apache top-level project AND
it's being offered by Google as Cloud Composer. This should make sure it
sticks around and keeps being developed for some time.

[https://cloud.google.com/composer/](https://cloud.google.com/composer/)

------
gtrubetskoy
If you are using BigQuery and your "workflow" amounts to importing data from
Postgres/MySQL databases into BQ and then running a series of SQL statements
into other BigQuery tables - you might want to look at Maestro. It's written
in Go and is SQL-centric, so there is no Python dependency hell to sort out:

[https://github.com/voxmedia/maestro/](https://github.com/voxmedia/maestro/)

With the SQL-centric approach you do not need to specify a DAG because it can
be inferred automatically; all you do is maintain your SQL, and Maestro takes
care of executing it in the correct order.

------
throwaway7281
Having used a variety of modern ETL frameworks in the past years, I am
considering writing a hands-on book about what I have learned along the way.

If I may ask, what questions do you find most difficult to solve in the
context of real-world ETL setups?

~~~
ramraj07
Primary problem for me is spending so much time setting up these monsters to
do what's basically a set of cron jobs. What's the simplest system out there
that can be highly available and deployed as easily as possible?

Another question is, I strongly feel like the definition of pipelines should
not be in code, but in the database. I keep coming back to that design pattern
every time I start coding my own simple scheduling solution. Is there merit to
this thought?

~~~
throwaway7281
Yes, cron is a bit undervalued, in that for one-off (well-locked) tasks it's
perfectly fine to create a crontab entry. And simplicity is king. I feel
people throw frameworks at problems where a simple shell/Go script in a cron
job would be just enough.

As for the pipeline definition. One goal is to have a notion of pipelines that
is both comprehensive and declarative.

As for a database, what would you store there? Container image to run? Past
execution data (e.g. output path, time, errors)?

The software world has many pipeline-y things, such as CI definitions and
these definitions usually live in configuration files.

What is difficult at times is the tracking of done tasks. Is the output a file
or a new row in some database, or many files or many rows, or anything else?

------
aequitas
> KubernetesExecutor runs each task in an individual Kubernetes pod. Unlike
> Celery, it spins up worker pods on demand, hence enabling maximum
> usage of resources.

You'll probably use up a lot of resources indeed; depending on how big your
tasks are, you will have quite some overhead running each and every one in a
separate pod, compared to running them in a Celery multiprocessing "thread" on
an already-running worker container.

------
ForHackernews
Airflow has major limitations that don't become obvious until you're already
deep into it. I'd advise avoiding it myself.

It's only useful if you have workloads that are very strictly time-bounded
(Every day, do X for all the data from yesterday). It's virtually impossible
to manage an event-driven or for-each-file-do-Y style workflow with Airflow.

------
kfk
I always felt neither Airflow nor Superset solved any of the foundational
problems with data analytics today. If we take Airflow, it is relatively easy
to schedule runs of scripts using cron (or more fancy Nomad jobs with a period
stanza). What else does Airflow give me that cron doesn't? Is the
parallelization stuff working? Dask is built from the ground up with
parallelization in mind, so it seems to solve a more foundational problem
than Airflow. Is triggering and listening to events working? Doesn't look
like it. Is collaboration working? Doesn't seem to be the case, since after
writing your Python script you need to basically rewrite it into an Airflow
DAG.

~~~
bosie
dependency management. if task #3 fails, any task depending on it shouldn't
run. not easy to do with cron based triggers

~~~
kfk
Yes but that is a feature of parallel and Dask does it well.

~~~
bosie
I don't know too much about Dask; how would you build a node in a Dask DAG to
execute a Java app and analyse its results (i.e., say, database entries) to
evaluate the success of that step?

------
jillesvangurp
I've spent the past month+ setting airflow up. To be honest, I don't like it
for a lot of reasons:

1) it's not cloud native in the sense that running this on e.g. AWS is an easy
and well trodden path. Cloud is left as an exercise to the reader of the
documentation and at best vaguely hinted at as a possibility. This is weird
because that kind of is the whole point of this product. Sure, it has lots of
things that are highly useful in the cloud (like an ECS operator or EMR
operator); but the documentation is aimed at python hackers running this on
their laptop; all the defaults are aimed at this as well. This is a problem
because essentially all of that is wrong for a proper cloud native type
environment. We've looked at quite a few third party repos for terraform,
kubernetes, cloudformation, etc that try to fix this. Ultimately we ended up
spending non-trivial amounts of time on devops. Basically, this involved lots
of problem solving for things that are a combination of wrong, poorly
documented, or misguided by default. Also, we're not done by a long shot.

2) The UX/UI is terrible and I don't use this word lightly. Think
hudson/jenkins, 15 years ago (and technically that's unfair to good old Hudson
because it never was this bad). It's a fair comparison because Jenkins kind of
is a drop in replacement or at least a significant overlap in feature set. And
it arguably has a better ecosystem for things like plugins. Absolutely
everything in Airflow requires multiple clicks. Also you'll be doing CMD+R a
lot as there is no concept of autorefresh. Lots of fiddly icons. And then
there's this obsession with graphs and this being the most important thing
ever. There are two separate graph views, only one of which has useful ways of
getting to the logs (which never requires less than 4-5 mouse clicks). And of
course the other view is the default under most links so you have to learn to
click the tiny graph icon to get to the good stuff.

3) A lot of the defaults are wrong/misguided/annoying. Like catchup
defaulting to true. There's this weird notion of tasks (DAGs in Airflow speak)
running on a cron pattern and requiring a start date in the past. Using a
dynamic date is not recommended (i.e. now would be a sane default). So
typically you just pick whatever fixed time in the past. When you turn a DAG
on it tries to 'backfill' from that date unless you set catchup to false (see
the sketch after this list). I don't know in what universe that's a sane
default. Sure, I want to run this task 1000 times just because I unpaused it
(everything is paused by default). There is no way to unschedule that. Did I
mention the default parallelism is 32? That in combination with the Docker
operator is a great way to instantly run out of memory (yep, that happened to
us).

4) The UI lacks ways to group tasks like by tag or folders, etc. This gets
annoying quickly.

5) DAG configs as code in a weakly typed language without a good test harness
leads to obvious problems. We've sort of cobbled together our own tests to
somewhat mitigate repeated deploy screw-ups.

6) Implementing a worker architecture in a language that is still burdened
with the global interpreter lock, and that has no good support for threading,
lightweight threads (aka coroutines), or doing things asynchronously, leads to
a lot of complexity. The Celery worker is a PITA to debug.

7) IMHO the python operator is a bad idea because it gives data scientists the
wrong idea about, oh just install this library on every airflow host please so
I can run my thingy. We use the Docker operator a lot and are switching to the
ECS operator as soon as we can figure out how to run airflow in ECS (we
currently have a snow flaky AMI running on ec2).

8) the logging UI is terrible compared to what I would normally use for
logging. Looking at logs of task runs is kind of the core business the UI has
to do.

9) Airflow has a DB where it keeps track of state. Any change to dags
basically means this state gets stale pretty quickly. There's no sane way to
get rid of this stale data other than a lot of command-line fiddling or just
running some sql scripts directly against this db. I've manually deleted
hundreds of jobs in the last month. Also there's no notion of having a sane
default for the number of execution runs to preserve. Likewise, there is no
built-in way to clean up logs. Again, Jenkins/Hudson always had that. I have
jobs that run every 10 minutes and absolutely no need to keep months of
history on them.
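(Re point 3, for anyone hitting the same thing: the backfill-on-unpause
behaviour can at least be turned off per DAG; a minimal sketch, with a made-up
DAG name:)

```python
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="no_backfill_please",          # made-up name
    schedule_interval="*/10 * * * *",     # every 10 minutes
    start_date=datetime(2020, 1, 1),      # fixed date in the past
    catchup=False,                        # don't backfill every missed interval on unpause
)
```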

There are more things I could list. Also, there are quite a few competing
products; this is a very crowded space. I've given serious thought to using
Spring Batch or even just firing up a Jenkins. Frankly the only reason we
chose airflow is that it's easier for data scientists who are mostly only
comfortable with python. So far, I've been disappointed with how complex and
flaky this setup is.

If you go down the path of using it, think hard about which operators you are
going to use and why. IMHO dockerizing tasks means that most of what Airflow
does is just ensuring your dockerized tasks run. Limiting what it does is a
good thing. Just because you can doesn't mean you should in Airflow. IMHO most
of the operators naturally lead to your Airflow installs being snowflakes.

Not dockerizing means you are mixing code and orchestration. Installing
dependencies on CI servers is not a great idea, and for the same reason doing
the same on an Airflow system is a bad idea.

~~~
opportune
Other than 5+6 this seems like basically a spec for a managed airflow product.
So basically run Airflow on public cloud, manage all the operational bits,
create a better UI, and fix some upstream bugs.

~~~
rotten
Which is something both Google Cloud Composer and Astronomer.io are trying to
do.

------
Vaslo
Has anyone who has used this also used SSIS? Curious as to how the two compare
as I use SSIS currently and have gained some experience with Python.

------
somurzakov
couple questions re airflow from a guy coming from Informatica/SSIS world:

1\. Does Airflow have native (read: high-speed) connectors to destination
databases (Oracle, MySQL, MSSQL)?

2\. How does the typical ETL in Airflow compare to one in Informatica/SSIS in
terms of speed of development, performance (throughput and latency), and
memory consumption? Is it the same speed, or slower due to using the Python
interpreter?

3\. Is it easy or hard to use parallel transformations with
processes/threads/async? For example, ingesting data from your source in 20
threads at once, as opposed to serial processing.

~~~
oxfordmale
1\. Airflow uses the "default" connectors for destination databases, for
example psycopg2 for Postgres. You can easily write your own hooks using
whatever connector you fancy.

2\. It depends on your set up. I move most of the heavy lifting to SQL or
Spark, so it is as performant as Informatica/SSIS.

3\. I have written multi-threaded ETL processes using the PythonOperator. That
basically starts off any Python script you want, allowing you the full
flexibility of Python.
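As a rough sketch of point 3 (the slice count, fetch function, and `dag`
object are placeholders), the PythonOperator just runs a callable, so the
parallelism is plain Python:

```python
from concurrent.futures import ThreadPoolExecutor

from airflow.operators.python_operator import PythonOperator

def fetch_slice(slice_id):
    # placeholder: pull one slice of the source data
    print(f"fetching slice {slice_id}")

def ingest_in_parallel(**context):
    # e.g. 20 source slices ingested concurrently instead of serially
    with ThreadPoolExecutor(max_workers=20) as pool:
        list(pool.map(fetch_slice, range(20)))

ingest = PythonOperator(
    task_id="parallel_ingest",
    python_callable=ingest_in_parallel,
    dag=dag,
)
```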

~~~
somurzakov
Can I ask a question about your #2? It seems you do ELT, with Spark/SQL doing
the T part after loading. Is your loading part high performance or do people
even care whether it is fast or not? In my experience, when I extract and load
data as is (for example into SQL Server) - it is kinda slow, because the
columns have to be wide and generic, to accommodate all the crap that can come
in. For example, I noticed that loading 1M rows into nvarchar(2048) is way
slower, than into varchar(50). Let's say you have one column that usually does
not exceed 50 chars, but sometimes it can be crap data and be 2000 chars. What
is the best scenario to ELT it quickly?

What I found is that if data is high quality, then ELT is totally fine;
oftentimes it ends up being just EL without much T. But if the data is crap,
and you have a lot of wide columns, then even loading it takes time, before we
even get to the processing stage. In this scenario ETL works much faster.

~~~
oxfordmale
There are two approaches we follow. The first, which is quite slow, bulk
extracts the data to S3 and then runs our transforms on top of that. If we
need high performance, we write a delta streamer that only streams modified
records.

------
aouyang2
Has anyone tried using Airflow to build out an entirely new AZ/region in AWS?
Most common use cases have been for data pipelines, but how about deployments?

~~~
rywalker
You could certainly execute a Terraform script from an Airflow DAG, but I
wouldn't say it's a common use case.

~~~
unixhero
It's a pretty recursive idea. Creator of worlds.

------
niyazpk
We have a lot of spark applications that run on AWS EMR. Right now we use
Oozie to create and coordinate the workflows. Any reason to switch to Airflow?

------
alexchantavy
The article says a webserver is involved; are there other dependencies that I
need to deploy as well?

------
memosstilvi
How on earth did this post reach #1?

~~~
unixhero
Airflow is pretty interesting

------
ibains
Has anyone used enterprise schedulers - ControlM, Atom - and know how they
compare to Airflow?

------
frankdilo
How is this different from Celery?

------
rb808
I've tried to use Airflow but it was way more complicated than I expected. I
just want to run a couple hundred jobs; why do I need a database? Surely a few
files would suffice.

It seems a big gap in the market. I can't rely on cron as it's a single point
of failure. I have my own hardware so I don't want to use AWS Batch or GCP
Cloud Scheduler; any other ideas?

~~~
BiteCode_dev
Wait, you only have a couple hundred jobs, but don't want a single point of
failure, but think a database is too much, but talk about cloud hosting?

This all seems contradictory.

Personally, using Python, I go for Celery (www.celeryproject.org): it's a
persistent daemon that can run tasks, provide queues and schedule work like
cron.

A lot of people prefer Python-RQ, as it seems simpler, but the truth is you
can start using Celery with just the file system for storing tasks and results:

[https://www.distributedpython.com/2018/07/03/simple-celery-s...](https://www.distributedpython.com/2018/07/03/simple-celery-setup/)

If your needs grow, you can plug it into Redis, RabbitMQ and/or a database
later.

It can expose an API so that other languages can talk to it and trigger tasks
or retrieve results (but not write tasks, they must be in Python).
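A minimal sketch of the cron-like part (the module name, broker URL, and task
body are placeholders):

```python
# tasks.py -- minimal Celery app with a cron-style schedule
# (run with: celery -A tasks worker --beat)
from celery import Celery
from celery.schedules import crontab

app = Celery("tasks", broker="redis://localhost:6379/0")  # or the filesystem transport

@app.task(bind=True, max_retries=3)
def nightly_cleanup(self):
    print("cleaning up")  # placeholder for the actual job

app.conf.beat_schedule = {
    "nightly-cleanup": {
        "task": "tasks.nightly_cleanup",
        "schedule": crontab(hour=3, minute=0),  # every day at 03:00
    },
}
```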

~~~
ramraj07
Celery for scheduled jobs doesn't seem to be a supported design pattern at
all, and any job that starts to come close to the 1-hour timeout gets annoying
to work with in Celery. It seems primarily designed to send emails in response
to web requests, which is not the use case most people are discussing here.

~~~
BiteCode_dev
I don't see how you came to this idea. The jobs can be as long as you want;
you can have retries, persistent queues, priorities, and dependencies.

Of course, I would advise putting a dedicated queue for very long-running
tasks, and setting worker_prefetch_multiplier to 1 as the docs recommend for
long-running tasks:
[https://docs.celeryproject.org/en/stable/userguide/optimizin...](https://docs.celeryproject.org/en/stable/userguide/optimizing.html)
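Concretely, that looks something like this (queue and task names are made up);
the long jobs get their own queue and their own worker:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

# Route the long jobs to their own queue, and stop workers from prefetching
# a backlog of them while one is still running.
app.conf.task_routes = {"tasks.long_report": {"queue": "long"}}
app.conf.worker_prefetch_multiplier = 1

# Then run a dedicated worker (or a separate daemon with its own config) per queue:
#   celery -A tasks worker -Q celery -c 8   # default queue
#   celery -A tasks worker -Q long -c 2     # long-running jobs only
```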

With Flower
([https://flower.readthedocs.io/en/latest/](https://flower.readthedocs.io/en/latest/)),
you can even monitor the whole thing or deal with it manually.

I assume your comment is reporting on other comments, but not direct
experience?

~~~
ramraj07
Direct experience very fresh in memory :)

The issue with long-running tasks is that you have to change the timeout to
longer than the default value of one hour (otherwise the scheduler assumes the
job is lost and requeues it). But this is a global parameter across all
queues, so this means we essentially lose the one good feature of Celery for
small tasks, which is retrying lost tasks within some acceptable timeframe.

Further, Flower seems weird - half the panels don't work when connecting
through our servers; our VPC settings are a bit bespoke but not completely out
there, so it's not fully useful. Also Flower only keeps track of tasks queued
after you start the dashboard (but then it accumulates a laundry list of dead
workers across deployments if you keep it running continuously).

We were also excited to use its chaining and chord features but ran into a
series of bugs we couldn't dig ourselves out of when tasks crashed inside a
chord (they went into permanent loops). I just declared bankruptcy on these
features and we implemented chaining ourselves.

Point is, I'm sure we got some parameters wrong, but another engineer and I
spent WEEKS wrangling with Celery to at least get it running somewhat
acceptably. That seems a bit too much. We are not L10 Google engineers for
sure, but we aren't stupid either. The only stupid decision we made was
probably choosing Celery, from what I can see.

In the end we still keep Celery for the on-demand async tasks that run in a
few minutes. For scheduled tasks that run weekly, we just implemented our own
scheduler (that runs in the background in our webservers in the same Elastic
Beanstalk deployment) that uses a regular RDBMS backend and does things as we
want. Turns out it's just a few hundred lines of simple Python.

~~~
BiteCode_dev
Fair enough and very honest.

> But this is a global parameter across all queues, so this means we
> essentially lose the one good feature of Celery for small tasks, which is
> retrying lost tasks within some acceptable timeframe.

Oh, for this you just set up two Celery daemons, each one with their own
queues and config. I usually don't want my long-running tasks on the same
instance as the short ones anyway.

> We were also excited to use it's chaining and chord features but went into a
> series of bugs we couldn't dig ourselves out of when tasks crashed inside a
> chord (went into permanent loops). I just declared bankruptcy on these
> features and we implemented chaining ourselves.

Granted on that one, they're not the best part of Celery.

Just out of curiosity, which broker and result backend did you use for Celery?

I mostly use Redis as I had plenty of problems with RabbitMQ, and wonder if
you didn't have those because of it.

~~~
ramraj07
Our use case was that the timescale of any of our tasks (depending on how
complex a query the user makes) can go from 1 minute to 45 minutes. We demoed
a new task that occasionally went over the 1-hour mark. It's definitely
annoying to have separate queues for these tasks, but that might be what we
need to do!

We use Redis. FWIW, within the narrow limits of the task properties it's
remarkably stable, so no complaints on that!

------
fxtentacle
I can't figure out if this is satire or not.

I believe that says a lot about open source projects released in recent years.

~~~
fmakunbound
From out of nowhere it jumps into a list of problems with cron being cron. I
kind of see your point.

~~~
fxtentacle
I think looking at the discussions here is even more surreal.

Airflow, Prefect, Dagster, Kedro,... it appears there are now a lot of tools
that I never heard of and never needed, despite me doing exactly what all of
them try to solve with Hadoop and MapReduce.

~~~
mattmcknight
Airflow is analogous to Oozie in the Hadoop ecosystem.

~~~
fxtentacle
Thanks for clarifying :)

