
Airflow and the Future of Data Engineering: A Q&A - scapecast
https://medium.com/the-astronomer-journey/airflow-and-the-future-of-data-engineering-a-q-a-266f68d956a9
======
nl
_Airflow. This framework is used by numerous companies and several of the
biggest unicorns — Spotify, Lyft, Airbnb, Stripe, and others to power data
engineering at massive scale._

Is that correct? I've been using (and enjoying) Luigi [1], which came out of
Spotify. I haven't seen anything about them switching to Airflow.

Edit: Now I see in the interview there is this:

 _About Luigi, it is simpler in scope than Airflow, and perhaps we’re more
complementary than competition. From what I gather, the main maintainer of the
product has left Spotify and apparently they are now using Airflow internally
for [at least] some of their use cases. I do not have the full story here and
would like to hear more about it. I’m thinking that many of the companies
choosing Luigi today might also choose Airflow later as they develop the need
for the extra set of features that Airflow offers._

But there are two-day-old commits in the Luigi repository, so I don't know. I
like Airflow too, but it did seem a lot more complicated than Luigi when I
played with it.

[1] [https://github.com/spotify/luigi](https://github.com/spotify/luigi)

~~~
ktamura
If simplicity and non-Python-centricity matter, I encourage folks to look into
Digdag [1][2].

It's Ansible for Workflow Management.

While both Luigi and Airflow (somewhat rightfully) assume that the user knows
or has an affinity for Python, Digdag focuses on ease of use and on helping
enterprises move data around many systems.

If we learned one thing from today's S3 outage, it's not enough to use
multiple cloud infrastructure providers: you should probably have your data in
multiple cloud providers as well.

[1] [https://www.digdag.io](https://www.digdag.io)

[2] [https://github.com/treasure-data/digdag](https://github.com/treasure-data/digdag)

~~~
caravel
[author] Oh cool, I didn't know you guys had released your solution yet; I
remember demoing Airflow to you guys early on. Looks like it turned out great,
congrats on the release!

~~~
ktamura
Thanks! Airflow was/has been a great source of inspiration =)

------
jaz46
One of the important paradigms that I think Luigi and Airflow miss is that
they treat pipelines as a DAG of tasks, when a pipeline really should be
thought of as a DAG of data.

It's a subtle difference, but it has huge impacts when you're trying to
dynamically scale tasks based on cluster resources and track data lineage
throughout your system. (Disclosure: I'm the founder of Pachyderm [0], a
containerized data pipeline framework where we version-control data in this
way.)

Check out Samuel Lampa's post[1] about dynamically scaling data pipelines for
more details.

[0] [https://github.com/pachyderm/pachyderm](https://github.com/pachyderm/pachyderm)

[1] [http://bionics.it/posts/dynamic-workflow-scheduling](http://bionics.it/posts/dynamic-workflow-scheduling)

~~~
caravel
[Airflow author] The task is central to the workflow engine. It's an important
entity, and it's complementary to the data lineage graph (which, btw, isn't
necessarily a DAG).

At Airbnb we have another important tool (not open source at the moment) that
is a UI and search engine to understand all of the "data objects" and how they
relate. It includes datasets, tables, charts, dashboards and tasks. The edges
are usage, attribution, sources, ... This tool shows [amongst other things]
data lineage and is complementary to Airflow.

~~~
jaz46
Does the other tool you're talking about that works with Airflow allow you to
scale your Airflow tasks based on, for example, the amount of new input data
that needs to be processed? That's one of the major challenges we see in
bioinformatics workloads. Sometimes you have a few new samples to run and
other times there are thousands -- so your task scheduler, even with tasks as
its central entity, needs to have an understanding of the data too.

~~~
caravel
At a high level [for Airflow specifically] the scheduler or workflow engine
cares most about the tasks and their dependencies, and is somewhat agnostic
about the units of work it triggers.

It's possible to use a feature called XCom as a message bus between tasks, but
I would typically direct people toward stateless, idempotent, "contained"
units of work, avoiding cross-task communication as much as possible.
[https://airflow.incubator.apache.org/concepts.html#xcoms](https://airflow.incubator.apache.org/concepts.html#xcoms)
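
For illustration, a minimal sketch of what XCom usage looks like (Airflow 1.x-era API; the task names and values here are hypothetical):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract(**context):
        # The return value is implicitly pushed to XCom.
        return 42  # stand-in for real work, e.g. a row count

    def load(**context):
        # Pull the value pushed by the upstream task.
        row_count = context['ti'].xcom_pull(task_ids='extract')
        print('rows to load: %s' % row_count)

    dag = DAG('xcom_example', start_date=datetime(2017, 1, 1),
              schedule_interval='@daily')

    t1 = PythonOperator(task_id='extract', python_callable=extract,
                        provide_context=True, dag=dag)
    t2 = PythonOperator(task_id='load', python_callable=load,
                        provide_context=True, dag=dag)
    t1.set_downstream(t2)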

For your case [which I have little input on] I think singleton DAGs described
in another post on this page may work.

------
tedmiston
Hey HN, I co-wrote the article with Maxime. A little late to the party here
but happy to answer any questions you might have about it. I'll send him the
link as well.

 _By the way, thank you to Maxime for sharing his thoughts, and the Astronomer
team for contributing great questions._

~~~
caravel
Maxime reporting for duty here, I'll go through the thread and answer
questions / add comments.

------
kozikow
Airflow works well for "static" jobs, but I miss something like Airflow for
dynamic jobs.

By dynamic, I mean something like "a user sent us some new data to process;
create a custom graph just for this data". I can create a new Airflow graph
with a new DAG ID for each processing pipeline, but Airflow was not created
for a use case like this and doesn't work well in such a scenario.

~~~
gdulli
I've started working on a code generator for Airflow. Not primarily because I
needed dynamic jobs, but more because I didn't want to keep writing the
Airflow boilerplate.

I imagine I'll eventually need to add some sort of management system to move
these dynamic jobs in and out of Airflow to keep them from bloating the
database or cluttering the UI.

~~~
tedmiston
I'd be interested in hearing more about your project and whether you plan to
open source it. Are you using something like Cookiecutter or Yeoman, or a
different level of abstraction?

~~~
gdulli
My effort is a collection of custom operators for operations we use very
commonly (run a query, Python script, or MapReduce job) and a code generator
that takes a simple text spec describing a set of operations and generates the
Python DAG script. I'm not familiar with Cookiecutter or Yeoman, but generating
Python code myself hasn't been complicated. Open-sourcing it isn't an option
right now with my current employer.

~~~
tedmiston
Thanks, this sounds really interesting. If this changes or you decide to write
personal code with a similar structure, I'd definitely like to discuss further.
I'm also currently exploring DAG structure, but with class and function
abstractions, with the goals of increasing code reuse and minimizing
boilerplate around our custom operators.

Cookiecutter takes a Python code file with Jinja interspersed (the template)
as input. When you want to make a new instance from the template, it gives
command-line prompts, and then evaluates the Jinja logic (custom variables,
loops, etc.) to output Python code. It took me 15–30 minutes to get started
and it has already paid off.
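
To make that concrete, a minimal sketch of what such a template could look like (the variable names and layout are hypothetical, not from an actual project):

    # cookiecutter.json -- defines the command-line prompts and defaults:
    #   {"dag_id": "my_dag", "schedule": "@daily"}
    #
    # {{cookiecutter.dag_id}}/{{cookiecutter.dag_id}}.py -- the template;
    # running `cookiecutter <template-dir>` renders it into a plain DAG file:
    from datetime import datetime

    from airflow import DAG

    dag = DAG('{{ cookiecutter.dag_id }}',
              start_date=datetime(2017, 1, 1),
              schedule_interval='{{ cookiecutter.schedule }}')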

------
classybull
We're currently in a PoC phase of implementing Airflow and testing it out
versus Luigi. So far, what I've liked is that Airflow seems to be much more
extensible and modular than Luigi. Getting Luigi to play nicely with our
particular set of constraints was painful, and subclassing the Task was also
tricky because it was way more opinionated about the structure of the class.
Airflow seems way less so. There also seems to be way more right out of the
gate in terms of built-in task types. And the UI looks nicer.

We haven't made our final determination yet, but Airflow at the moment feels
better.

~~~
caravel
[Airflow author here] One of the main differences between Airflow and Luigi is
the fact that in Airflow you _instantiate_ operators to create tasks, whereas
with Luigi you _derive_ classes to create tasks. This means it's more natural
to create tasks dynamically in Airflow. This becomes really important if you
want to build workflows dynamically from code [which you should sometimes!].

A very simple example of that would be an Airflow script that reads a YAML
config file with a list of table names and creates a little workflow for each
table, which may do things like loading the table into a target database,
perhaps applying rules from the config file around sampling, data retention,
anonymization, ... Now you have an abstraction where you can add entries to
the config file to create new chunks of workflow without doing much work. It
turns out there are tons of use cases for this type of approach. At Airbnb the
most complex use case for this is around experimentation and A/B testing.
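
A minimal sketch of that pattern (the file name, table list, and load_table helper are hypothetical):

    from datetime import datetime

    import yaml

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def load_table(table_name):
        # Stand-in for the real load logic.
        print('loading %s into the target database' % table_name)

    # e.g. tables.yaml contains: {tables: [users, listings, bookings]}
    with open('tables.yaml') as f:
        config = yaml.safe_load(f)

    dag = DAG('load_tables', start_date=datetime(2017, 1, 1),
              schedule_interval='@daily')

    # One task per config entry: add a line to the file, get a new task.
    for table in config['tables']:
        PythonOperator(task_id='load_%s' % table,
                       python_callable=load_table,
                       op_kwargs={'table_name': table},
                       dag=dag)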

I gave a talk at a Python meetup in SF recently on "Advanced data engineering
patterns using Apache Airflow", which was all about dynamic pipeline
generation. The ability to do that is really a game changer in data
engineering and part of the motivation behind writing Airflow the way it is.
I'm planning on giving this talk again, but maybe I should just record it and
put it on YouTube. It's probably a better outlet than any conference/meetup...

------
gregn610
How relevant are Airflow and similar tools to those of us who aren't operating
at unicorn scale but are shuffling hundreds of CSVs & Excel files and
wrangling RDBMSes with SQL?

~~~
amalag
I have used Talend in the past, sounds like a fit. But this seems to fit a
different need around job management.

~~~
gregn610
Talend makes my teeth grind. I don't understand why an ETL tool uses a
strongly typed language as its foundation. The number of fun, productive hours
I've spent swapping chars to varchars, ints to decimals & vice versa. In 2017
computers can read a registration plate from a blurry photograph and spot a
criminal in a stadium, but a user puts an apostrophe in a CSV file and schmoo
leaks everywhere.

~~~
vira
Strong typing helps. Keep in mind enterprise ETL tools are designed to move
data from OLTP to OLAP databases, which are often strongly typed as well.

~~~
gregn610
When a tool insists that you cast char fields to varchars before you can test
two fields for equality, or keeps changing all your decimals to floats, how is
that helping? I'm saying that if the underlying language were loosely typed,
those kinds of productivity saps & bug fountains would not happen. In the few
instances where you care about type, a loosely typed language can usually
offer something.

Last time I checked, enterprise ETL tools were sold as capable of a lot more
than simple OLAP to OLTP. I find the reality provided is somewhat
underwhelming. Given that facebook can tell the difference between a photo of
Dave and one of Jim, why do I have to manually provide a mask for every single
date field flowing through an enterprise?

~~~
busterarm
I don't use enterprise ETL tools. I pretty much write my own every time and
occasionally I'll supplement that with things like Kiba.

------
botswana99
[Bias Alert: I'm Head Chef of DataKitchen] Our perspective is that the DAG
abstraction should not apply only to data engineering, but to the whole
analytic process: data engineering, data science, and data visualization.
Analytic teams love to work with their favorite tools -- Python, SQL, ETL,
Jupyter, R, Tableau, Alteryx, etc. The question is how do you get those
diverse teams and tools to work together to deliver quickly, with high quality
and reusable components?

We've identified seven steps taken from DevOps, CI, Agile and Lean
Manufacturing
([https://www.datakitchen.io/platform.html#sevensteps](https://www.datakitchen.io/platform.html#sevensteps))
that you can start to apply today. We also created a 'DataOps' platform that
incorporates those principles into software:
[https://www.datakitchen.io](https://www.datakitchen.io).

The challenge is that there are many separate DAGs (and code and
configuration) involved in producing complete production analytics embedded in
each of the tools the team has selected. So what is needed is a “DAG of DAGs”
that encompasses the whole analytic tool chain.

~~~
caravel
[Bias Alert: author of Airflow] can confirm that Airflow allows you to
incorporate all seven steps, and more, as an open platform.

At Airbnb, Airflow is far from being limited to data engineering. All the
batch scheduling goes through Airflow, and many teams (data science, analysts,
data infra, ML infra, engineering as a whole, ...) use Airflow in all sorts of
ways.

Airflow has a solid story in terms of reusable components, from extensible
abstractions (operators, hooks, executors, macros, ...) all the way to
computation frameworks.

------
jcalabro
[Disclaimer: I work for Composable] My team and I are working on a project
that I would consider a competitor to Airflow. I'm not overly familiar with
Airflow, but Composable seems to fit a much wider variety of use cases.

In Composable's DAG execution engine, you can pull data from various sources
(SQL, NoSQL, CSV, JSON, RESTful endpoints, etc.) into our common data format.
You can then easily transform, orchestrate, or analyze your data using our
built-in Modules (blocks), or you can easily write your own. You can then view
your resulting data all within the webapp.

Reading the comments, it seems like Composable supports a lot of the things
people are asking for here that Airflow is lacking. Maybe check us out and let
us know what you think!

For more information:

Composable Site - [https://composableanalytics.com/](https://composableanalytics.com/)

Try it yourself - [https://cloud.composableanalytics.com/](https://cloud.composableanalytics.com/)

Composable's Blog - [http://blog.composable.ai/](http://blog.composable.ai/)

~~~
caravel
[author of Airflow here] as I wrote in another comment, I'd argue for a
programmatic approach to workflows/dataflows as opposed to drag and drop. It
turns out that code is a better abstraction for software:
[https://medium.freecodecamp.com/the-rise-of-the-data-engineer-91be18f1e603](https://medium.freecodecamp.com/the-rise-of-the-data-engineer-91be18f1e603)

I'd also argue for open source over proprietary, mostly to allow for a
framework that is "hackable" and extensible by nature. You can also count on
the community to build a lot of the operators & hooks you'll need (Airflow
terms).

~~~
larsf
You can use Composable's Fluent API to author and run dataflows in code. No
GUI required. Composable's platform is open and completely extensible in the
sense that anyone can add applications or first-class modules to the system,
including swapping out some of the more internal components.

------
batmansmk
A lot of name dropping and unprovable statements. I'm really interested in the
domain and its progress, but Airflow needs to be more generous with _real_
information and less with marketing bs. Can someone share a more introductory
article about what makes Airflow different from the current state of the art?

~~~
smooc
Here is a slightly outdated article that compares several ETL workflow tools:
[http://bytepawn.com/luigi-airflow-pinball.html](http://bytepawn.com/luigi-airflow-pinball.html).
We chose Airflow for the following reasons:

* Scheduler that knows how to handle retries, skipped tasks, failing tasks

* Great UI

* Horizontally scalable

* Great community

* Extensible; we could make it work in an enterprise context (Kerberos, LDAP, etc.)

* No XML

* Testable and debuggable workflows

------
glial
+1 for Airflow. I use it every day to handle tasks with many components and
dependencies. I love that everything is code & version-controlled.

I do wish it had a REST API though.

~~~
papercruncher
I've been keeping close tabs on the project for a while now, and it seems that
version 1.8, which should be released in a few days, has the beginnings of a
rudimentary API. It also looks like more endpoints are being planned for
subsequent releases.

------
dataops
Data engineering is converging under the umbrella of DataOps. For those
interested, there's a DataOps Summit in Boston this June
[https://www.dataopssummit.com/](https://www.dataopssummit.com/)

~~~
batbomb
This is just Data Management, a term which predates "DataOps" by more than a
decade in both research and enterprise. I don't really think it needs a
rebranding.

~~~
dataops
That's not really the case... There's a nice, short podcast that was just
posted that defines DataOps more broadly - spend 20 minutes listening to it,
and see what you think. Interested to hear your thoughts.

[https://thenewstack.io/delving-dataops-matters/](https://thenewstack.io/delving-dataops-matters/)

------
cardosof
I'm used to running R scripts with cron to handle some batch jobs, and I'm no
Python dev. Would it be easy to start using Airflow?

~~~
smooc
DAGs in Airflow can be just a few lines. Some understanding of Python syntax
is required, but you can start simple and add complexity as you need it.

~~~
classybull
To add to that, you can create a DAG that's really just a wrapper that
executes your R script and waits for its return code.
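
For example, a minimal sketch using the stock BashOperator (the script path and schedule are hypothetical):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('r_batch_job', start_date=datetime(2017, 1, 1),
              schedule_interval='0 6 * * *')  # same syntax as crontab

    # BashOperator fails the task on a non-zero exit code, which in
    # turn drives Airflow's retry and alerting machinery.
    run_r = BashOperator(task_id='run_r_script',
                         bash_command='Rscript /path/to/daily_job.R',
                         dag=dag)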

------
mtrn
Luigi is nice because it is really simple to get started, and it gradually
allows you to do more complex things: (custom) parameter types, custom
targets, enhanced super classes, dynamic dependencies, event hooks, task
history and more.

One thing that I missed a bit was automatic task output naming based on the
parameters of a task, so I wrote a thin wrapper for that [1]. This helps, but
mostly for smaller deployments.

Airflow and Luigi seemed to me like two sides of the same thing: fixed graphs
vs. data flow. One fixes the DAG up front; the other puts more emphasis on
composition.

That said, I am excited about the data processing tools to come - I believe
this is an exciting space, and choosing or writing the _right tool_ can make a
real difference between a messy data landscape and an agile part of the
business and business development.

[1] [https://github.com/miku/gluish](https://github.com/miku/gluish)

~~~
jaz46
Definitely agree that that's one of the great points of Luigi. Airflow's UI,
of course, blows everything out of the water IMO.

As for organizing your data, my personal and very biased opinion is that
version control semantics similar to Git [0] are a pretty good way to help
tame the complexity of ever-changing data sets. We already version code; with
versioned data too, everything becomes completely reproducible.

[0] Our Data Science Bill of Rights:
[http://www.pachyderm.io/dsbor.html](http://www.pachyderm.io/dsbor.html)

~~~
mtrn
The git angle would be a huge step forward. What I found is that
reproducability is not always on people's mind when they designing such
systems, whereas I believe it's one of the most important properties.

Pachyderm is on my TODO list for a while, so thanks for reminding me, I'll try
to implement something real with it soon.

------
throwaway_374
Minor gripe - why can I not execute an entire DAG (end to end) from the UI?
Also trying to execute single tasks from the UI using the "run" functionality
gives a CeleryExecutor requirement error... sorry, I know this isn't the help
forums but it sounds like the most trivial tasks were overlooked.

~~~
detroitcoder
Actually you can, but it is a bit clunky. Go to Browse > Dag Runs and select
the 'Create' tab. This pulls up a form where you type in the DAG ID, enter the
start time (set to now, but keep in mind it is the local time of the web
server), and a Run ID.

Totally agree there should be both a simple Start Run and Stop Run button.

~~~
tedmiston
What I've been doing when debugging is flipping the DAG switch to off,
clearing the first task in the DAG so the whole thing re-runs, then flipping
the switch back on.
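
If I remember correctly, the CLI equivalent is `airflow clear my_dag -s 2017-01-01 -e 2017-01-02` to wipe the task instances so the scheduler re-runs them (DAG ID and dates hypothetical), or `airflow trigger_dag my_dag` to kick off a fresh run outside the schedule.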

------
LevonK
Disney Studios uses Azkaban because it's language agnostic, and it believes
that in the data space there is a huge advantage to static (and strong, but
that's beside the point in this argument) typing.

The language-agnostic aspect means that non-software-engineers can also use
the orchestration platform for runbook automation.

------
sandGorgon
IMHO, there seems to be quite some conceptual overlap between Airflow's DAG
and Spark's RDD.

It seems to me that Airflow is Spark-on-a-DB... or rather, Spark is
Airflow-on-Hadoop.

Does anyone know what the difference is?

~~~
gallamine
Airflow doesn't have anything to do with data storage, movement or processing.
It's a way to chain commands together in such a way that you can define "do Z
after X and Y finish", for example. Many people use it like a nice version of
cron with a UI, alerting, and retries.
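
To illustrate the cron-with-retries point, a minimal sketch (the job name, command, and address are hypothetical):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # default_args are applied to every task in the DAG.
    default_args = {
        'retries': 2,
        'retry_delay': timedelta(minutes=5),
        'email': ['oncall@example.com'],
        'email_on_failure': True,
    }

    dag = DAG('nightly_job', default_args=default_args,
              start_date=datetime(2017, 1, 1),
              schedule_interval='@daily')

    # Trailing space keeps Jinja from treating the .sh as a template file.
    BashOperator(task_id='do_z', bash_command='run_z.sh ', dag=dag)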

~~~
sandGorgon
So - Celery + Spark? Or just Celery Canvas?
([http://docs.celeryproject.org/en/latest/userguide/canvas.htm...](http://docs.celeryproject.org/en/latest/userguide/canvas.html))

P.S. I'm not trolling - I'm genuinely trying to get a sense of why and when I
would use Airflow. Is it a point of scalability, of productivity, etc.?

For example, the positioning of Spark is simple: scalability. Celery is also
very clear: simplicity with good-enough robustness if using the RabbitMQ
backend.

What does Airflow do differently?

~~~
caravel
In a modern data team, Spark is just one of the types of jobs you may want to
orchestrate. Typically, as your company gets more tangled up in data
processing, you'll have many storage and compute engines to orchestrate: Hive,
MySQL, Presto, HBase, map/reduce, Cascading/Scalding, scripts, external
integrations, R, Druid, Redshift, microservices, ...

Airflow allows you to orchestrate all of this and keep most of the code and
high-level operations in one place.

Of course Spark has its own internal DAG and can somewhat act as Airflow and
trigger some of these other things, but typically that breaks down as you have
a growing array of Spark jobs and want to keep a holistic view.
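
To make the orchestration point concrete, a rough sketch of one DAG coordinating two different engines (the query, script path, and names are hypothetical):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.hive_operator import HiveOperator

    dag = DAG('heterogeneous_example', start_date=datetime(2017, 1, 1),
              schedule_interval='@daily')

    # A Hive aggregation feeding a Spark training job.
    aggregate = HiveOperator(task_id='aggregate_events',
                             hql='INSERT OVERWRITE TABLE agg '
                                 'SELECT ds, count(1) FROM events GROUP BY ds',
                             dag=dag)

    train = BashOperator(task_id='train_model',
                         bash_command='spark-submit /path/to/train.py',
                         dag=dag)

    aggregate.set_downstream(train)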

~~~
sandGorgon
That is an incredibly lucid answer. It should be the first line on the Airflow
project page.

------
fuzzylearner
Amazing framework with a lot of functionality. A tool built to be extensible
is just what open source software should look like :)

------
somewhatoff
How would you see Airflow in relation to Apache Beam / GC Dataflow?

~~~
caravel
[author] Airflow is not a data flow engine, though you can use it to do some
of that, but we typically defer on doing data transformations
using/coordinating external engines (Spark, Hive, Cascading, Sqoop, PIG, ...).

We operate at a higher level: orchestration. If we were to start using Apache
Beam at Airbnb (and we very well may soon!), we' use Airflow to schedule and
trigger batch beam jobs alongside the rest of our other jobs.

~~~
somewhatoff
Thanks, that's really interesting. The usage of 'pipeline' to describe both
sorts of system made me think there was a lot of overlap, but I'm
understanding now how they are complementary.

------
vicaya
Who has used both Airflow and Spinnaker? Any feedback?

------
cocoflunchy
The actual article is here: [https://medium.com/the-astronomer-journey/airflow-and-the-future-of-data-engineering-a-q-a-266f68d956a9](https://medium.com/the-astronomer-journey/airflow-and-the-future-of-data-engineering-a-q-a-266f68d956a9#.xtu264vu2)

