Airflow. This framework is used by numerous companies and several of the biggest unicorns — Spotify, Lyft, Airbnb, Stripe, and others to power data engineering at massive scale.
Is that correct? I've been using (and enjoying) Luigi[1] which came out of Spotify. I haven't seen anything about them switching to Airflow.
Edit: Now I see in the interview there is this:
About Luigi, it is simpler in scope than Airflow, and perhaps we’re more complementary than competition. From what I gather, the main maintainer of the product has left Spotify and apparently they are now using Airflow internally for [at least] some of their use cases. I do not have the full story here and would like to hear more about it. I’m thinking that many of the companies choosing Luigi today might also choose Airflow later as they develop the need for the extra set of features that Airflow offers.
But there are 2 day old commits in the Luigi directory, so I don't know. I like Airflow too, but it did seems a lot more complicated the Luigi when I played with it.
I'm one of the original Luigi authors, and used to maintain it for a while, merging PR's daily etc. I left Spotify a couple of years ago. Since then it's been maintained mostly by Arash Rouhani, who is increasingly getting busy with other things.
But it's very much in active development and there are multiple pull requests merged every day.
I haven't had a lot of time to check out Airflow, but it seems great. Data engineering and thinking of data processing as functional pipelines is a great paradigm, think we're going to see a lot of future development in this area. Luigi will probably evolve a lot over the next few years. Eventually I think there will be better frameworks. No idea if Airflow is a step change, I think there are still projects yet to be built that unifies everything beautifully
If simplicity and non-Python-centricity matter, I encourage folks to look into Digdag [1][2].
It's Ansible for Workflow Management.
While both Luigi and Airflow (somewhat rightfully) assume the user to know/have affinity for Python, Digdag focuses on ease of use and helping enterprises move data around many systems.
If we learned one thing from today's S3 outage, it's not enough to use multiple cloud infrastructure providers: you should probably have your data in multiple cloud providers as well.
[auhtor] Oh cool, I didn't know you guys released your solution yet, I remember demoing Airflow to you guys early on. Looks like it turned out great, congrats on the release!
As far as we know Airflow is used in one of the teams (I think it maybe advertising). It does not mean the whole company has switched to Airflow, but some team decided it fitted their job better.
One of the important paradigms that I think Luigi and Airflow miss is that they treat pipelines as a DAG of tasks, when it really should be thought of as a DAG of Data.
It's a subtle difference, but has huge impacts when you're trying to dynamically scale tasks based on cluster resources and track data lineage throughout your system.(Disclosure: I'm the founder of Pachyderm[0], a containerized data pipeline framework where we version control data in this way).
Check out Samuel Lampa's post[1] about dynamically scaling data pipelines for more details.
[Airflow author] The task is centric to the workflow engine. It's an important entity, and it's complementary to the data lineage graph (not necessarily a DAG btw).
At Airbnb we have another important tool (not open source at the moment) that is a UI and search engine to understand all of the "data objects" and how they relate. It includes datasets, tables, charts, dashboards and tasks. The edges are usage, attribution, sources, ... This tool shows [amongst other things] data lineage and is complementary to Airflow.
Does the other tool you're talking about that works with Airflow allow you to scale your Airflow tasks based on, for example, the amount of new input data that needs to be processed? That was one of the major challenges we see in bioinformatics workloads. Sometimes you have a few new samples to run and other times there are thousands -- so your task scheduler, although it is centric, needs to have an understanding of the data too.
At a high level [for Airflow specifically] the scheduler or workflow engine cares most about the tasks and their dependencies, and is somewhat agnostic about the units of work it triggers.
It's possible to use a feature called XCom as a message bus between tasks, but would typically direct people in the direction of having stateless, idempotent, "contained" units of work and avoid cross task communication as much as possible.
https://airflow.incubator.apache.org/concepts.html#xcoms
For your case [which I have little input on] I think singleton DAGs described in another post on this page may work.
Apache Nifi. I always make the comparison between the two when introducing Airflow as it often times takes awhile to under stand the nuanced (but critical) difference
Hey HN, I co-wrote the article with Maxime. A little late to the party here but happy to answer any questions you might have about it. I'll send him the link as well.
By the way, thank you to Maxime for sharing his thoughts, and the Astronomer team for contributing great questions.
Airflow works well for "static" jobs, but I miss something like airflow for dynamic jobs.
By dynamic, I mean something like "user sent us some new data to process, create a custom graph just for this data". I can create new airflow graph per each processing pipeline with new dag id every time, but airflow was not created for use case like this and it's not working well in such scenario.
I work on such a system, to organize data processing (HPC) projects in the oil & gas industry, and I try to follow this space. I remember I got excited when I heard of Airflow for the first time, but quickly got frustrated with its "static flow" nature: many "flow" systems are like this, you first design the flow, then "deploy" it and let it run (usually many times).
What our tool does is allow users to organize the flow of their processing jobs on an infinite 2D layout, have some jobs run at the beginning of the flow while they organize another part to run later.
Unfortunately it's a big pile of messy code that depends too much on other "internal" systems so we can't open source it... I'd like to add "yet" because I try to gradually clean it up, simplify and make it more generic, but I'm not sure I'll see that day myself.
I've started working on a code generator for Airflow. Not primarily because I needed dynamic jobs, but more because I didn't want to keep writing the Airflow boilerplate.
I imagine I'll eventually need to add some sort of management system to move these dynamic jobs in and out of Airflow to keep them from bloating the database or cluttering the UI.
I'd be interested in hearing more about your project and if you plan to open source. Are you using something like Cookiecutter or Yeoman, or a different level of abstraction?
My effort is a collection of custom operators for operations that we use very commonly (run a query, python script, or mapreduce job) and a code generator that takes a simple text spec describing a set of operations and generates the Python DAG script. I'm not familiar with Cookiecutter or Yeoman but generating Python code myself hasn't been complicated. Open-sourcing it isn't an option right now with my current employer.
Thanks, this sounds really interesting. If this changes or you decide to write personal code with similar structure, I'd definitely like to discuss further. I'm also currently exploring DAG structure but with class and function abstractions with the goals of increasing code reuse and minimizing boilerplate around our custom operators.
Cookiecutter takes a Python code file as input with Jinja interspersed (the template). When you want to make a new instance from the template, it gives command-line prompts, and then evaluates the Jinja logic — custom variables, loops, etc to output Python code. It took me 15–30 minutes to get started and has already paid off.
This allows to build ETL jobs which react to all sorts of external application triggers (such as an upload made by a user in that case, but it could be an API notification / webhook etc).
While it's not the exact use case you've described (and it's hard to pinpoint without a concrete example), you could handle related tasks by writing a more generic static job that pulls a dynamic config, eg from a mongo db, and uses the branch operator.
How about a custom sensor class that never timesout and you define its poke method to listen for data and processes it? Granted, you'll have to make sure you allocate enough workers for that..
Airflow is just the workflow management layer on top of your data pipeline. The flexibility to generate custom graphs based on user-specific parameters should be handled within a pipeline task.
Based on your example, I would have a single dag that would 1. get user data and 2. generate a graph.
All the flexibility should be defined in whatever function, script or program you define to generate the graph.
I think they are referring to "event-driven" DAGs which would be both shaped and triggered dynamically. You can accomplish this now but it feels a bit hacky and is pretty clear that it goes against the Airflow paradigm of static, slowly-changing workflows
[Airflow author here] in general, and when thinking in terms of best practices, we like to think of a DAG's shape as slowly changing, in a similar way that a database's tables definition is slowly changing. In general, you don't want to change your table's structure dynamically. This constraint brings a certain clarity and maintainability, and most use cases can be expressed this way.
Now. Airflow allows you to do what you're describing as well and will explain how to. If you were my coworker I'd dig deeper and try to understand whether the design you want is the design that is best, but let's assume it is. So first we support "externally triggered DAGs", which means those workflows don't run on a schedule, they run when they are triggered, either by some sensor, or externally in some way. A use case for that would be some company processing genomes files, and everytime a new genome file shows up, we want to run a static DAG for it.
https://airflow.incubator.apache.org/scheduler.html#external...
Now if your DAG's shape changes dramatically at every run [a shapeshifting DAG!], I would argue that conceptually they are different DAGs, and would instruct to build "singleton" DAGs dynamically. Meaning you have python code that creates a dag object [with its own dag_id] for each "instance", with the schedule_interval='@once', meaning each DAG will run only once. You can shape each DAG individually, from that same script, and craft whatever dependency you might like for each one.
Though all of this is not only possible and easy-ish to do, it may not be the best approach. Try to think of your DAGs and tables as static [or slowly changing] if you can, and the data as the variable.
As an analogy, try to think of an oil pipeline that changes shape based on the quality of the oil it processes. Crazy?! It's easier to think of the pipeline as static and infrastructure, and to have components that can sort and direct the flow in [existing and static] pipes.
Starting from 1.8 you will be able to trigger dags through a rest API, that is fully supported.
Shaping DAGs dynamically poses a challenge to the scheduler on how to 'predict' what tasks need to run in the future. The scheduler needs to evaluate which tasks will need to run, without actually executing these tasks themselves. For Airflow in its current state that is a chicken and egg problem.
For the future, I can think of allowing dynamic dags being described through the Rest API, but that is definitely further out and has not really popped up yet on the horizon.
We're currently in a PoC phase of implementing Airflow and testing it out versus Luigi. So far, what I've liked, is that Airflow seems to be much more extensible and modular than Luigi. Getting Luigi to play nicely with our particular set of constraints was painful, and subclassing the Task was also tricky because it was way more opinionated about the structure of the class. Airflow seems way less so. There also seems to be way more right out of the gate in terms of built in task types. And the UI looks nicer.
We haven't made our final determination yet, but Airflow at the current moment feels better.
[Airflow author here] one of the main differences between Airflow and Luigi is the fact that in Airflow you instantiate operators to create tasks, where with Luigi you derive classes to create tasks. This means it's more natural to create tasks dynamically in Airflow. This becomes really important if want to build workflows dynamically from code [which you should sometimes!].
A very simple example of that would be an Airflow script that reads a yaml config file with a list of table names, and creates a little workflow for each table, that may do things like loading the table into a target database, perhaps apply rules from the config file around sampling, data retention, anonymisation, ... Now you have this abstraction where you can add entries to the config file to create new chunks of workflows without doing much work. It turns out there are tons of use cases for this type of approach. At Airbnb the most complex use case for this is around experimentation and A/B testing.
I gave a talk at a Python meetup in SF recently talking about "Advanced data engineering patterns using Apache Airflow", which was all about dynamic pipeline generation. The ability to do that is really a game changer in data engineering and part of the motivation behind writing Airflow the way it is. I'm planning on giving this talk again, but maybe I should just record it and put it on Youtube. It's probably a better outlet than any conference/meetup...
How relevant are Airflow and similar to those of us who aren't operating at unicorn scale but are shuffling hundreds of CSVs & Excels and wrangling RDBMS with SQL?
I would say it can be used at it's simplest as a replacement for cron. It supports running programs on a schedule and you can set concurrency rules, SLAs, and triggers around even single commands or programs. You also get nice graphs of task run time and email alerts if jobs take longer than the SLA you've set.
My opinion is: not necessarily, and probably unlikely. Airflow and Luigi are overkill unless you have a certain level of complexity in your system.
For perspective, the company I work at has tried both (as in we built products using each, and the one with Luigi is still in use). We operate on data in the < 10TB space used primarily for machine learning applications. Luigi and Airflow both introduced complexity that simply wasn't useful relative to our data flow. They both ended up getting in the way more than they helped and introduced developer overhead that wasn't justifiable.
However they are both very nice tools, and it's easy to see how they can help reduce complexity overall with very large numbers of distributed mostly-static task graphs. If that's how you consume/transform your files and data, either tool might be worth looking into.
Many [most] of the companies using Airflow are small-ish scale, maybe less than a half a dozen people writing jobs that need to be scheduled. Airflow will bring clarity even to modest efforts, and will allow to scale the processes as needed.
In terms of architecture, it's pretty straightforward to setup Airflow on one box and run all the services there until you have to grow out of it and scale out to having multiple workers.
Airflow is more about scaling your number of processes, not the size of your data. It's relevant if you want to have a higher-level system for managing processes independent of what those processes are.
Talend makes my teeth grind. I don't understand why an ETL tool uses a strongly typed language for a foundation. The number of fun productive hours I've spent swapping chars to varchars, int to decimal & vice versa. In 2017 computers can read a registration plate from a blurry photograph and spot a criminal in a stadium, but a user puts apostrophe in a CSV file and schmoo leaks everywhere.
I've been building ETL tools in my day to day work for the past 18 months and weak types are a disaster. In all of my tools as soon as I extract data, the first thing I do is establish its type. I have great, reusable tools for type conversation that are set by rules for load.
Any CSV you're working with should be properly escaped anyway or you're bound for a world of pain.
Maybe we have drastically different use cases, but dynamic and weak typing are a disaster in the data space. Not sure why people build production systems in such languages.
If it matters anywhere it matters in the data space. We don't see the disconnect between decimal and int, but when you're expecting a character and you get varchar, (not sure about the apostrophe case, but I suspect your talking about quotes and embedded commas) and the number of fields or composition of fields changes (e.g. col1:sttring "jack, dorsey" col2:int 156 and the parser sees col1:string "jack" col2: "dorsey, 156" you want to know that is broken ASAP.
If you enjoy making sure your peas don't touch the carrots, then sure, strong typing is great in the data space. But when you get woken up because a spreadsheet comes through as 11.0 rather than 11 or you have to type
Double.parseDouble(x) == Double.parseDouble(y)
/* instead of pythons */
x == y
27 times to get a feed file parsed, then I'd say that the tools are lacking basic useability. And especially primitive in light of the handwriting reading, photo tagging, go playing,supermario winning possibilities of ML.
That means you don't have a clean "domain model" (or business logic layer) and a data filtering layer that creates the domain objects. You should apply business logic on the pure objects/entities, and you should make sure that the filtering handles the representational problems (parsing, data integrity, partial data, and so on).
And ideally if something makes it through the data filtering layer into the logic layer and does not make sense there, then that should be handled. And that's where strong types help. It forces you to handle these cases, even if that means logging/alerting/ignoring, but at least you'll have to make a decision when you write the logic, instead of 3AM in the morning.
When a tool insist on you cast char fields to varchars before you can test two fields for equality, or keeps changing all your decimals to floats, how is that helping? I'm saying that if the underlying language was loosely typed, those kind of productitvy saps & bug fountains would not happen. In the few instance where you cared about type, a loosely typed language can usually offer something.
Last time I checked, enterprise ETL tools were sold as capable of a lot more than simple OLAP to OLTP. I find the reality provided is somewhat underwhelming. Given that facebook can tell the difference between a photo of Dave and one of Jim, why do I have to manually provide a mask for every single date field flowing through an enterprise?
[Bias Alert: I'm Head Chef of DataKitchen]. Our perspective is that the DAG abstraction should not apply only to data engineering, but the whole analytic process of data engineering, data science, and data visualization. Analytic teams love to work with their favorite tools -- Python, SQL, ETL, Jupyter, R, Tableau, Alteryx, etc. The question is how do you get those diverse teams and tools to work together to deliver fast, with high quality, and reusable components?
The challenge is that there are many separate DAGs (and code and configuration) involved in producing complete production analytics embedded in each of the tools the team has selected. So what is needed is a “DAG of DAGs” that encompasses the whole analytic tool chain.
[Bias Alert: author of Airflow] can confirm that Airflow allows you to incorporate all of the seven steps, and more as an open platform.
At Airbnb Airflow is far from being limited to data engineering. All the batch scheduling goes through Airflow and many team (data science, analysts, data infra, ML infra, engineering as a whole, ...) uses Airflow in all sorts of ways.
Airflow has a solid story in terms of reusable components, from extendable abstractions (operator, hooks, executors, macros, ...) all the way to computation frameworks.
[Disclaimer: I work for Composable] My team and I are working on a project that I would consider a competitor to Airflow. I'm not overly familiar with Airflow, but Composable seems to be fit for a much wider variety of use cases.
In Composable's DAG execution engine, you can pull in data from various sources (SQL, NoSQL, csv, json, restful endpoints, etc.) into our common data format. You can then easily transform, orchestrate, or analyze your data using our built-in Modules (blocks) or you can easily write your own. You can then view your resulting data all within the webapp.
Reading the comments, it seems like Composable supports a lot of the things people are asking for here that Airflow is lacking. Maybe check us out and let us know what you think!
[author of Airflow here] as I wrote in another comment, I'd argue for a programmatic approach to workflows/dataflows as opposed to drag and drop. It turns out that code is a better abstraction for software:
https://medium.freecodecamp.com/the-rise-of-the-data-enginee...
I'd also argue for open source over proprietary, mostly to allow for a framework that is "hackable" and extensible by nature. You can also count on the community to build a lot of the operators & hooks you'll need (Airflow terms).
You can use Composable's Fluent API to author and run dataflows in code. No GUI required. Composable's platform is open and completely extensible in sense that anyone can add applications or first class modules to the system, including swapping out some of the more internal components.
A lot of name dropping and unprovable statements. I'm really interested by the domain and progress, but Airflow needs to be more generous in real information and less in marketing bs. Can someone share a more introductory article about what makes Airflow different from the current state of the art?
Here is, a slightly outdated, article that compares several ETL workflow tools http://bytepawn.com/luigi-airflow-pinball.html . Why we choose Airflow was because of the following reasons:
* Scheduler that knows how to handle retries, skipped tasks, failing tasks
* Great UI
* Horizontal scaleable
* Great community
* Extensible; we could make it work in an enterprise context (kerberos, ldap etc)
[author] sorry you feel that way. I understand that the section on other similar tools is controversial, especially if you work on one of those. I'm repeating what I hear being very active in this community, and answering the question that was asked to the best of my knowledge. I'm open to editing the article with better information if anyone wants to share more insight.
As mentioned about Luigi, I do not have the whole story, one fact I know is that someone from Spotify gave a talk I did not attend in NYC at an Airflow meetup, and I've heard the original author had left the company. Those are provable statements, happy to debunk on the article if needed.
I've been keeping close tabs on the project for a while now and it seems that version 1.8, which should be released in a few days, has the beginning of a rudimentary API. It also looks like more endpoints are being planned for subsequent releases
You can always expose the REST API. Its pretty easy considering they are just flask blueprints. Since, you can make your own custom plugin - You can build a lot using existing infrastructure that Airflow provides.
Data engineering is converging under the umbrella of DataOps. For those interested, there's a DataOps Summit in Boston this June https://www.dataopssummit.com/
This is just Data Management, a term which predates "DataOps" by more than a decade in both research and enterprise. I don't really think it needs a rebranding.
That's not really the case...
There's a nice, short podcast that was just posted that defines DataOps more broadly - spend 20 minutes listening to it, and see what you think. Interested to hear your thoughts.
DAGS in Airflow can just be a few lines. Some understanding of the syntax of python is required. But you can start simple and add complexity as you require it.
Luigi is nice, because it is really simple to get started and it gradually allows you to do more complex things like (custom) parameter types, custom targets, enhanced super classes and dynamic dependencies, event hooks, task history and more.
One thing that I missed a bit was automatic task output naming based on the parameters of a task, so I wrote a thin wrapper for that [1]. This helps, but mostly for smaller deployments.
Airflow and luigi seemed to me like two side of the same thing: fixed graphs vs data flow. One fixates the DAG, the other puts more emphasis on composition.
That said, I am excited about the data processing tools to come - I believe this is an exciting space and choosing or writing the right tool can make a real difference between a messy data landscape and an agile part of business and business development.
Definitely agree that that is one of the great points with Luigi. Airflow's UI of course blows everything out of the water IMO.
As for organizing your data, my personal and very biased opinion is that version control semantics similar to Git [0] are a pretty good way to help tame the complexity of ever-changing data sets. We already version code, but with versioned data too, now everything becomes completely reproducible.
The git angle would be a huge step forward. What I found is that reproducability is not always on people's mind when they designing such systems, whereas I believe it's one of the most important properties.
Pachyderm is on my TODO list for a while, so thanks for reminding me, I'll try to implement something real with it soon.
Minor gripe - why can I not execute an entire DAG (end to end) from the UI? Also trying to execute single tasks from the UI using the "run" functionality gives a CeleryExecutor requirement error... sorry, I know this isn't the help forums but it sounds like the most trivial tasks were overlooked.
Actually you can but it is a bit clunky. Go to Browse > Dag Runs and select the 'Create' tab. This pulls up a form where you type in the Dag ID, enter the start time (set to now but keep in mind it is the local time of the web server), and a Run ID.
Totally agree there should be both a simple Start Run and Stop Run button.
Disney Studios uses Azkaban because it's language agnostic, and it believes that in the data space there is a huge advantage to static (and strong, but that's beside the point in this argument) typing in the data space.
The language agnostic aspect means that non software engineers can also use the orchestration platform for runbook automation.
Airflow doesn't have anything to do with data storage, movement or processing. It's a way to chain commands together in such a way so that you can define "do Z after Y a Z finish", for example. Many people use it like a nice version of cron with a UI, alerting, and retries.
P.S. I'm not trolling - I'm genuinely trying to get a sense of why and when would I use Airflow. Is it a point of scalability, of productivity , etc ?
For example - the positioning of spark is simple: scalability. Celery is also very clear: simplicity with good enough robustness if using the rabbitmq backend.
In a modern data team, Spark is just one of the type of job you may want to orchestrate. Typically as your company gets more tangled in data processing, you'll have many storage and compute engines that you'll have to orchestrate. Hive, MySQL, Presto, HBASE, map/reduce, Cascading/Scalding, scripts, external integrations, R, Druid, Redshift, miroservices, ...
Airflow allows you to orchestrate all of this and keep most of code and high level operation in one place.
Of course Spark has its own internal DAG and can somewhat act as Airflow and trigger some of these other things, but typically that breaks down as you have a growing array of Spark jobs and want to keep a holistic view.
Airflow uses Celery to horizontally scale its execution. The Airflow scheduler takes care of what tasks to run in what order, but also what to do when they fail, need to retry, don't need to run at all, backfill the past etc.
Spark for Airflow is just one of the engines where a transformation of data can happen.
I definitely get where you're coming from. At Astronomer, we use both Airflow and Spark, though Spark is very new to me.
For us, Airflow manages workflows and task dependencies but all of the actual work is done externally. Each task (operator) runs whatever dockerized command with I/O over XCom. Note that we use a custom Mesos executor instead of the Celery executor. An Airflow DAG might kick off a different Spark job based on upstream tasks.
[author] Airflow is not a data flow engine, though you can use it to do some of that, but we typically defer on doing data transformations using/coordinating external engines (Spark, Hive, Cascading, Sqoop, PIG, ...).
We operate at a higher level: orchestration. If we were to start using Apache Beam at Airbnb (and we very well may soon!), we' use Airflow to schedule and trigger batch beam jobs alongside the rest of our other jobs.
Thanks, that's really interesting. The usage of 'pipeline' to describe both sorts of system made me think there was a lot of overlap, but I'm understanding now how they are complementary.
Is that correct? I've been using (and enjoying) Luigi[1] which came out of Spotify. I haven't seen anything about them switching to Airflow.
Edit: Now I see in the interview there is this:
About Luigi, it is simpler in scope than Airflow, and perhaps we’re more complementary than competition. From what I gather, the main maintainer of the product has left Spotify and apparently they are now using Airflow internally for [at least] some of their use cases. I do not have the full story here and would like to hear more about it. I’m thinking that many of the companies choosing Luigi today might also choose Airflow later as they develop the need for the extra set of features that Airflow offers.
But there are 2 day old commits in the Luigi directory, so I don't know. I like Airflow too, but it did seems a lot more complicated the Luigi when I played with it.
[1] https://github.com/spotify/luigi