
How to Manage Apache Airflow with Systemd on Debian or Ubuntu - njanakiev
https://janakiev.com/blog/apache-airflow-systemd/
======
tyingq
ETL is a funny space. At least in the "Enterprise" world, it's dominated by Ab
Initio, which is crazy expensive.

They seem to be coasting too, for quite some time. Their website is probably
the most terrible site I've ever seen for an expensive piece of software. You
can't even tell what it is, how to buy it, or even how to contact them.
[https://www.abinitio.com/en/](https://www.abinitio.com/en/)

~~~
thenaturalist
What's your source for the fact that Ab Initio is dominating?

Other than that, there are several other tools in the Enterprise Analytics
space that fall into a similar pattern, like Alteryx or Collibra. But from
their perspective it makes perfect sense, I guess. When your sales are done
through relationship building and there isn't much competition once you're in,
there isn't really a need to boast a fancy website or make an effort.

If anyone has a good resource on how enterprise IT procurement is done or the
dynamics around it, I'd love to read up on that.

~~~
GordonS
I was going to ask the same question - I've never even heard of Ab Initio.

~~~
tyingq
See this Quora Q&A:

[https://www.quora.com/Which-companies-in-the-USA-use-Ab-Init...](https://www.quora.com/Which-companies-in-the-USA-use-Ab-Initio-ETL-too)

My suspicion is that their customers are mostly companies that use Teradata,
because it has a fair amount of Teradata specific features. Probably not good
news for their future, but lucrative for now.

------
villasv
I have found that using KillMode=mixed is very useful when running Airflow
with Celery workers. Coupled with TimeoutStopSec, it allows the system to shut
down gracefully: the workers stop receiving new jobs but finish their current
jobs before exiting, which is nice for auto scaling or spot instances on AWS
(coupled with EC2 lifecycle hooks).
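For reference, a minimal sketch of what such a worker unit might look like (the service name, user, and paths below are my own assumptions, not taken from the comment):

```ini
[Unit]
Description=Airflow Celery worker
After=network.target

[Service]
User=airflow
ExecStart=/usr/local/bin/airflow worker
# mixed: on stop, SIGTERM goes only to the main worker process,
# which stops accepting new tasks and drains the ones in flight.
KillMode=mixed
# If draining takes longer than this, SIGKILL the whole cgroup.
TimeoutStopSec=600
Restart=on-failure

[Install]
WantedBy=multi-user.target
```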

~~~
1996
Interesting. Why do you care about graceful shutdown?

Why not kill ASAP and restart?

Do you have problems with load booming up sometimes?

~~~
akramar
In some cases we need to do an update-deploy-restart while a DAG is still
running (not even the one being updated). Several minutes or hours later, the
child processes raise a segfault and the jobs they were working on fail,
requiring those jobs to be restarted. I imagine a graceful shutdown would
allow the jobs to finish and the DAG to continue with the remaining ones.
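The graceful path can be sketched in plain Python; this is just the generic drain-on-SIGTERM pattern, not Airflow's or Celery's actual machinery:

```python
import signal

stop_requested = False

def request_stop(signum, frame):
    # systemd sends SIGTERM on stop; with KillMode=mixed only the
    # main process receives it, so we can drain instead of dying mid-job.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, request_stop)

def run_worker(jobs):
    finished = []
    for job in jobs:
        if stop_requested:
            break            # stop *between* jobs, never mid-job
        finished.append(job())
    return finished
```

A hard kill would interrupt `job()` wherever it happens to be; here the in-flight job always completes.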

------
kfk
My team and I tried Airflow but found it didn't fit well with our analytics
workflow. For instance, you must rewrite your Jupyter notebook into an Airflow
DAG, doing basically the same work twice. We use Dask and will soon deploy
dask.distributed. I have yet to figure out where Airflow actually fits in the
BI/data science architecture.

~~~
jmngomes
Not sure what you mean by "BI/data science architecture" but Airflow is
essentially a scheduler and orchestrator for data processing jobs.

These activities are usually managed by cron or, more often, by advanced
scheduler tools (depending on the vendor), so it's quite a core part of any
architecture that needs to e.g. load/reload/refresh data periodically.

If the requirement is simply to connect notebooks to a data lake, then the
only scheduling required is to load the data lake, and something like Airflow
may be overkill for this, depending on what/how the data is processed and
loaded.

~~~
kfk
I mean the same thing you mean. My issue with Airflow is that it's complicated
and doesn't adapt well to cloud computing. Dask runs on AWS EMR and EKS,
Kubernetes, etc. Unfortunately, orchestration is a lot more complicated than
it looks: parallel execution, retries, logs, status tracking, email
notifications. Airflow doesn't really tackle _all_ orchestration work.

~~~
smooc
Airflow maintainer here: what you are describing is exactly what Airflow takes
care of (or should).

I wonder what your issue is/was? Notebooks are supported by means of a
Papermill operator (equivalent to how Netflix operationalizes notebooks), or
by a PythonOperator/BashOperator that simply wraps your notebook.

However, to parallelize tasks Airflow needs to know a bit more, hence you
might have found it necessary to break up your notebook into individual tasks
that combine into a DAG. Is that what you meant?

~~~
kfk
With Dask we code the workflow in the notebook and run it in the notebook. We
don't have to fiddle with operators, as every task is Python code. Dask is
easy to install, which is important since each analyst has to be able to test
the workflows before sending them to production. Finally, by programming our
own scheduler we can build the things we need; for instance, we are able to
listen to SQL tables and API changes and trigger work based on that. Anyway, I
am sure I could make Airflow work too, but it's a harder fit than Dask.
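The "listen to SQL tables and API changes" part can be sketched as a simple polling trigger in plain Python. The names here are made up for illustration; this is the general pattern, not the commenter's actual code:

```python
def watch(fetch_state, on_change, ticks):
    """Poll fetch_state once per tick; call on_change when it differs.

    fetch_state: anything cheap and comparable -- e.g. the result of
    SELECT max(updated_at) FROM some_table, or an ETag from an API.
    A real poller would sleep between ticks; omitted here for brevity.
    """
    last = fetch_state()
    for _ in range(ticks):
        current = fetch_state()
        if current != last:
            on_change(current)   # trigger the downstream workflow here
            last = current
```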

------
Lucasoato
(Sorry if it's a little off topic.) We were close to adopting Airflow at our
company but were let down by one detail: the scheduler doesn't natively run in
a high-availability mode. There was an article from Clairvoyant about how to
make it HA, but it didn't look safe at all. That was a serious issue for us,
and in the end we went with NiFi. Has anyone else had this problem?

~~~
robbyt
Use K8s to schedule the work.

~~~
Filligree
And make sure you hire two engineers to keep it working.

------
jpollock
Why are all these tools DAGs?

I'm guessing I misunderstand what is meant by a "Workflow"?

My assumption is that these are managing a state machine, where workflow is
stand-in for "Business Process"? If it's doing that sort of job, I'd expect
timers and loops?

However, it seems these are aimed at data conversion pipelines?

~~~
jpau
Heyo! Data guy here. Airflow and its DAG-managing peers are important for us.

Data transformations are one thing. For us, it’s the most important thing. Our
data warehouse runs as a massive DAG of nightly batched transformations over
app-generated data.

We also use DAG-managing tools to call external APIs and get new data (eg for
weather and geocoding) and batched ML training/inference pipelines too.

Why something like Airflow? Dependencies are easier to manage reliably. If you
have hundreds or thousands of nodes in your DAG, then it is a lifesaver to be
able to easily 1) run many threads of independent nodes; 2) re-run on
failures; and 3) find nodes impacted by failure.
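The three points above can be sketched as a toy scheduler in plain Python (this is the general idea, not Airflow's API): run whatever is runnable, record failures, and mark everything downstream of a failure as impacted.

```python
from collections import deque

def run_dag(deps, run_task):
    """deps maps task -> list of upstream tasks it depends on.
    run_task(name) returns True on success.
    Returns (done, failed, impacted) sets."""
    downstream = {t: [] for t in deps}
    for t, ups in deps.items():
        for u in ups:
            downstream[u].append(t)
    done, failed, impacted = set(), set(), set()
    ready = deque(t for t, ups in deps.items() if not ups)
    while ready:
        t = ready.popleft()
        if run_task(t):
            done.add(t)
            # point 1: independent tasks become ready as soon as
            # all of their upstreams are done
            for d in downstream[t]:
                if all(u in deps and u in done for u in deps[d]):
                    ready.append(d)
        else:
            failed.add(t)   # point 2: rerun candidates
            # point 3: everything reachable downstream is impacted
            stack = list(downstream[t])
            while stack:
                d = stack.pop()
                if d not in impacted:
                    impacted.add(d)
                    stack.extend(downstream[d])
    return done, failed, impacted
```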

~~~
jpollock
Sorry, I most definitely didn't want to make light of the problem!

Pulling data from all the various teams' locally created data stores and
external systems to push to analytics is definitely a large problem.

I was trying to figure out if these are aimed at data transformation
pipelines, or state management systems - I've got state management problems,
not data transformation problems.

Slightly different problems, but both fit with "Workflow".

------
Peteris
Side note: the best way to build Airflow pipelines is through Kedro
[https://github.com/quantumblacklabs/kedro-airflow](https://github.com/quantumblacklabs/kedro-airflow).

------
carlosf
Good article! I actually like systemd, but nowadays I generally run stuff as
containers, so there is less and less opportunity to use it.

Here is my current setup for Airflow:

\- 1 container for the webserver

\- 1 container for the scheduler

\- 1 managed database (I use Postgres, it's a fairly small instance.)

\- 1 S3 bucket to deploy DAGs. I mount it on my containers using s3fs-fuse.

\- You can monitor the scheduler using a PID file, whereas the webserver can
be monitored by probing your admin URL.

\- Most configuration can be done using environment variables, which is
perfect for containers.

\- I also configure DAG logs to be shipped to an S3 bucket.
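The PID-file check from the list above can be sketched in Python (the path is hypothetical; signal 0 is the standard existence check that delivers no actual signal):

```python
import os

def scheduler_alive(pid_file="/run/airflow/scheduler.pid"):
    """Return True if the PID recorded in pid_file is a running process."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False          # missing, unreadable, or garbage PID file
    try:
        os.kill(pid, 0)       # signal 0: existence/permission check only
        return True
    except ProcessLookupError:
        return False          # stale PID file, process is gone
    except PermissionError:
        return True           # process exists but is owned by another user
```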

------
EddieCPU
I would have thought a scripting language would be a better choice than unit
files. The script engine would take care of verifying the script, starting the
processes in the correct order, and allowing/restricting access to other
components, thereby negating the need for directives such as After= or
PrivateTmp=.

~~~
onefuncman
Unit files are one of the things systemd got right.

A scripting language isn't well suited for process management, which is a very
well specified task.

Now, if you mentioned [https://ammonite.io/](https://ammonite.io/) you might
have got my attention...

