
Effective Airflow Development - llambda
https://curology.com/blog/tech/posts/effective-airflow?hn
======
theboat
Making use of airflow's plugin architecture by writing custom hooks and
operators is essential for well-maintained, well-developed data pipelines
(that use airflow). A little upfront investment in writing components (and the
most painful part, writing tests) will go a long way to helping data engineers
sleep at night.
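
A minimal sketch of what that upfront investment can look like, under some
loud assumptions: `normalize_records` and `ApiToWarehouseTask` are
hypothetical names, the class is a plain-Python stand-in rather than a real
Airflow operator (a real one would subclass Airflow's `BaseOperator` and keep
the same `execute(context)` entry point), and the `fetch`/`load` callables
stand in for whatever custom hooks would provide. The point is only that the
logic stays testable without a scheduler:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


def normalize_records(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Pure transformation: drop rows without an id, lowercase/trim emails."""
    cleaned = []
    for row in rows:
        if row.get("id") is None:
            continue
        cleaned.append({**row, "email": str(row.get("email", "")).strip().lower()})
    return cleaned


@dataclass
class ApiToWarehouseTask:
    """Hypothetical stand-in for a custom operator (not an Airflow class)."""

    fetch: Callable[[], Iterable[Dict[str, Any]]]  # role a source hook would play
    load: Callable[[List[Dict[str, Any]]], None]   # role a warehouse hook would play

    def execute(self, context: Dict[str, Any]) -> int:
        # Thin wrapper: fetch, transform with the pure function, load.
        rows = normalize_records(self.fetch())
        self.load(rows)
        return len(rows)


def run_example() -> int:
    """Exercise the task with in-memory fakes instead of real connections."""
    loaded: List[Dict[str, Any]] = []
    task = ApiToWarehouseTask(
        fetch=lambda: [
            {"id": 1, "email": " A@Example.com "},
            {"id": None, "email": "x"},
        ],
        load=loaded.extend,
    )
    count = task.execute(context={})
    assert loaded == [{"id": 1, "email": "a@example.com"}]
    return count
```

Because the fetch and load steps are injected, the same class can be tested
with fakes, as in `run_example`, and wired to real hooks in a DAG.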

That said, I make a point of using ETL-as-a-service whenever it's available,
because there's no use solving a problem someone else has solved already.

------
julee04
Seeing as this article is from 2019, would people still recommend Airflow for
ETL from APIs into data warehouses or data lakes, or is there something better
on the market?

~~~
atomicity
Nothing is more popular yet, but there are better-architected options out
there. Airflow has hit 17k GitHub stars and was used by the team I was
previously on. I don't think anything will beat it unless something from the
CI/CD or "cloud native" world moves in unexpectedly.

The operators and scalability are somewhat useful. I was happy with the UI
compared to cron. Testing is a mess. Also, Airflow isn't CI/CD-friendly (but
it's possible to get it to work).

I'd recommend a managed option unless you have a skilled ops team. It reminds
me of Hadoop in terms of how exciting it is to get set up, which isn't a good
thing.

~~~
jamestimmins
Can you expand on "testing is a mess"? Do you mean testing your own DAGs and
operators?

~~~
atomicity
Yeah, like the other reply, I'd mostly say testing DAGs was an issue. Airflow-
related configuration is easy to get wrong and it silently fails a lot.

Now that I think about it though, most of the time I spent on testing wasn't
caused by Airflow. Testing data pipelines just isn't easy with the current
well-known tooling.

------
walrus01
Based on the title I clicked this thinking it was something related to OSI
layer 1, for hot aisle/cold aisle separation in high density datacenters,
compartmentalized-per-cabinet cooling or something.

As a side note, how do you effectively google for a piece of software or
product with a name as generic as "airflow"?

~~~
Grimm1
"airflow etl", "airflow scheduler", "airflow software"

------
shankysingh
Testing pipelines in Airflow is a bit of a pain, but Great Expectations makes
runtime pipeline validation much easier.
https://greatexpectations.io
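
To illustrate the idea of runtime pipeline validation, here is a hand-rolled
sketch. It deliberately does not use the Great Expectations API; the function
names, the `ValidationError` class, and the `order_id`/`amount` columns are
all made up for the example. It only shows the general shape: declare checks
on a batch and fail the task loudly when one is violated.

```python
from typing import Any, Dict, List, Sequence


class ValidationError(Exception):
    """Raised when a batch fails one or more runtime checks."""


def expect_not_null(rows: Sequence[Dict[str, Any]], column: str) -> List[int]:
    """Return indexes of rows where `column` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def expect_between(
    rows: Sequence[Dict[str, Any]], column: str, low: float, high: float
) -> List[int]:
    """Return indexes of rows whose value lies outside [low, high].

    Missing or None values are skipped here and left to expect_not_null.
    """
    return [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]


def validate_batch(rows: Sequence[Dict[str, Any]]) -> None:
    """Run all checks; raise instead of letting bad data flow downstream."""
    results = {
        "order_id not null": expect_not_null(rows, "order_id"),
        "amount in [0, 10000]": expect_between(rows, "amount", 0, 10_000),
    }
    failed = {name: idx for name, idx in results.items() if idx}
    if failed:
        raise ValidationError(f"batch failed checks: {failed}")
```

Called at the start or end of a task, a check like this turns a silent data
problem into a failed task run that shows up in the scheduler UI.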

For unit/integration tests we ended up doing a lot of Docker-in-Docker setup.

------
james_woods
I have been using Apache Airflow for a couple of years now, and the biggest
improvement was the addition of the Kubernetes Operator. You basically keep a
vanilla Airflow installation and the custom code is encapsulated and tested in
containers. This simplifies it a lot.
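
One way to picture the container side of this setup is a small command-line
entry point that knows nothing about Airflow. The sketch below is an
assumption about how such a task might be structured, not anything described
in the thread: the `--date` and `--values` flags and the `summarize` function
are hypothetical. Airflow's only job would be to launch the image with the
right arguments.

```python
import argparse
import json
import sys
from typing import Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Arguments the scheduler would pass when launching the container."""
    parser = argparse.ArgumentParser(description="Hypothetical containerized task")
    parser.add_argument("--date", required=True, help="logical date, e.g. 2020-01-01")
    parser.add_argument("--values", type=float, nargs="+", required=True)
    return parser


def summarize(values: List[float]) -> Dict[str, float]:
    """The actual work, kept free of any Airflow imports."""
    return {"count": len(values), "total": sum(values)}


def main(argv: Optional[List[str]] = None) -> int:
    """Parse arguments, run the job, print a JSON result, return an exit code."""
    args = build_parser().parse_args(argv)
    result = {"date": args.date, **summarize(args.values)}
    json.dump(result, sys.stdout)
    sys.stdout.write("\n")
    return 0


# In the image, the entry point would call: sys.exit(main())
```

Since `main` accepts an explicit argument list, the task can be tested with a
plain `main(["--date", "2020-01-01", "--values", "1", "2.5"])` call, with no
scheduler or cluster involved.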

~~~
recov
This is what we do - it's great for decoupling the two. Any heavy work is run
in GKE with some custom operators. It also makes onboarding non-engineers
much easier as they don't have to worry about connections/credentials/etc.

------
rllin
You only need two operators: the Kubernetes operator and the Dataflow
operator.

Then every task is either a standalone, vertically scalable service on k8s or
a giant, horizontally scalable compute job.

