Is that correct? I've been using (and enjoying) Luigi which came out of Spotify. I haven't seen anything about them switching to Airflow.
Edit: Now I see in the interview there is this:
About Luigi, it is simpler in scope than Airflow, and perhaps we’re more complementary than competition. From what I gather, the main maintainer of the product has left Spotify and apparently they are now using Airflow internally for [at least] some of their use cases. I do not have the full story here and would like to hear more about it. I’m thinking that many of the companies choosing Luigi today might also choose Airflow later as they develop the need for the extra set of features that Airflow offers.
But there are two-day-old commits in the Luigi repository, so I don't know. I like Airflow too, but it did seem a lot more complicated than Luigi when I played with it.
But it's very much in active development and there are multiple pull requests merged every day.
I haven't had a lot of time to check out Airflow, but it seems great. Data engineering and thinking of data processing as functional pipelines is a great paradigm, and I think we're going to see a lot of future development in this area. Luigi will probably evolve a lot over the next few years. Eventually I think there will be better frameworks. No idea if Airflow is a step change; I think there are still projects yet to be built that unify everything beautifully.
It's Ansible for Workflow Management.
While both Luigi and Airflow (somewhat rightfully) assume the user to know/have affinity for Python, Digdag focuses on ease of use and helping enterprises move data around many systems.
If we learned one thing from today's S3 outage, it's that it's not enough to use multiple cloud infrastructure providers: you should probably have your data in multiple cloud providers as well.
It's a subtle difference, but it has huge impacts when you're trying to dynamically scale tasks based on cluster resources and track data lineage throughout your system. (Disclosure: I'm the founder of Pachyderm, a containerized data pipeline framework where we version control data in this way.)
Check out Samuel Lampa's post about dynamically scaling data pipelines for more details.
At Airbnb we have another important tool (not open source at the moment) that is a UI and search engine to understand all of the "data objects" and how they relate. It includes datasets, tables, charts, dashboards and tasks. The edges are usage, attribution, sources, ... This tool shows [amongst other things] data lineage and is complementary to Airflow.
It's possible to use a feature called XCom as a message bus between tasks, but I would typically direct people in the direction of having stateless, idempotent, "contained" units of work, and avoiding cross-task communication as much as possible.
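For anyone who hasn't seen it, here's a minimal sketch of what XCom usage looks like; the DAG and task names are made up, and this assumes the 1.x PythonOperator API:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG("xcom_demo", start_date=datetime(2017, 1, 1),
              schedule_interval=None)

    def produce(**context):
        # The return value is pushed to XCom automatically.
        return {"row_count": 42}

    def consume(**context):
        # Pull whatever the upstream task pushed.
        payload = context["ti"].xcom_pull(task_ids="produce")
        print(payload["row_count"])

    t1 = PythonOperator(task_id="produce", python_callable=produce,
                        provide_context=True, dag=dag)
    t2 = PythonOperator(task_id="consume", python_callable=consume,
                        provide_context=True, dag=dag)
    t1 >> t2

XCom values go through the metadata database, which is one reason it works best for small bits of state rather than actual data.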
For your case [which I have little input on] I think singleton DAGs described in another post on this page may work.
By the way, thank you to Maxime for sharing his thoughts, and the Astronomer team for contributing great questions.
By dynamic, I mean something like "a user sent us some new data to process, create a custom graph just for this data". I can create a new Airflow graph for each processing pipeline with a new dag_id every time, but Airflow was not created for a use case like this, and it doesn't work well in such a scenario.
What our tool does is allow users to organize the flow of their processing jobs on an infinite 2D layout, have some jobs run at the beginning of the flow while they organize another part to run later.
Unfortunately it's a big pile of messy code that depends too much on other "internal" systems so we can't open source it... I'd like to add "yet" because I try to gradually clean it up, simplify and make it more generic, but I'm not sure I'll see that day myself.
In the meantime... maybe Node-RED? https://nodered.org/
I imagine I'll eventually need to add some sort of management system to move these dynamic jobs in and out of Airflow to keep them from bloating the database or cluttering the UI.
Cookiecutter takes a Python code file as input with Jinja interspersed (the template). When you want to make a new instance from the template, it gives command-line prompts, then evaluates the Jinja logic (custom variables, loops, etc.) to output Python code. It took me 15-30 minutes to get started and has already paid off.
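As a sketch, a template file might look something like this (the variable names and the load.sh script are mine, not Cookiecutter's), with Cookiecutter filling in the Jinja bits at generation time:

    # {{cookiecutter.dag_id}}.py, rendered by Cookiecutter into a real DAG file
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("{{ cookiecutter.dag_id }}",
              start_date=datetime(2017, 1, 1),
              schedule_interval="{{ cookiecutter.schedule_interval }}")

    {% for table in cookiecutter.tables.split(',') %}
    BashOperator(task_id="load_{{ table }}",
                 bash_command="load.sh {{ table }}",  # hypothetical script
                 dag=dag)
    {% endfor %}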
This allows you to build ETL jobs that react to all sorts of external application triggers (such as an upload made by a user in this case, but it could be an API notification, webhook, etc.).
Based on your example, I would have a single dag that would 1. get user data and 2. generate a graph.
All the flexibility should be defined in whatever function, script or program you define to generate the graph.
Now, Airflow allows you to do what you're describing as well, and I will explain how. If you were my coworker I'd dig deeper and try to understand whether the design you want is the design that is best, but let's assume it is. So first, we support "externally triggered DAGs", which means those workflows don't run on a schedule; they run when they are triggered, either by some sensor or externally in some way. A use case for that would be some company processing genome files: every time a new genome file shows up, we want to run a static DAG for it.
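A rough sketch of that genome example (the script name and conf key here are invented): a DAG with no schedule that only runs when something triggers it.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # schedule_interval=None: this DAG only runs when triggered.
    dag = DAG("process_genome", start_date=datetime(2017, 1, 1),
              schedule_interval=None)

    BashOperator(task_id="process",
                 # dag_run.conf carries the payload passed at trigger time
                 bash_command="process_genome.sh {{ dag_run.conf['path'] }}",
                 dag=dag)

You'd then kick it off with something like `airflow trigger_dag process_genome --conf '{"path": "/data/sample.fa"}'`, or from a sensor.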
We also support branching, meaning you can take different paths down the DAG based on what happened upstream.
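In Airflow terms that's the BranchPythonOperator: a callable returns the task_id of the branch to follow, and the other branches get skipped. A toy sketch:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import BranchPythonOperator

    dag = DAG("branch_demo", start_date=datetime(2017, 1, 1))

    def choose_path(**context):
        # Decide based on anything available at runtime.
        if context["execution_date"].weekday() < 5:
            return "weekday_path"
        return "weekend_path"

    branch = BranchPythonOperator(task_id="branch",
                                  python_callable=choose_path,
                                  provide_context=True, dag=dag)
    branch >> DummyOperator(task_id="weekday_path", dag=dag)
    branch >> DummyOperator(task_id="weekend_path", dag=dag)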
Now if your DAG's shape changes dramatically at every run [a shapeshifting DAG!], I would argue that conceptually they are different DAGs, and would recommend building "singleton" DAGs dynamically. Meaning you have Python code that creates a DAG object [with its own dag_id] for each "instance", with schedule_interval='@once', so each DAG will run only once. You can shape each DAG individually, from that same script, and craft whatever dependencies you might like for each one.
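A sketch of that pattern (the instances and step names are invented); the one subtlety is that Airflow discovers DAGs by scanning module-level globals, hence the globals() trick at the end:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    def make_singleton_dag(instance_id, steps):
        dag = DAG("onetime_%s" % instance_id,
                  start_date=datetime(2017, 1, 1),
                  schedule_interval="@once")  # runs exactly once
        prev = None
        for step in steps:
            task = BashOperator(task_id=step,
                                bash_command="%s.sh" % step,  # hypothetical scripts
                                dag=dag)
            if prev is not None:
                prev >> task
            prev = task
        return dag

    # One DAG per incoming "instance"; each can have a different shape.
    for instance_id, steps in [("a1", ["extract", "load"]),
                               ("b7", ["extract", "clean", "load"])]:
        globals()["dag_" + instance_id] = make_singleton_dag(instance_id, steps)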
Though all of this is not only possible and easy-ish to do, it may not be the best approach. Try to think of your DAGs and tables as static [or slowly changing] if you can, and the data as the variable.
As an analogy, try to think of an oil pipeline that changes shape based on the quality of the oil it processes. Crazy?! It's easier to think of the pipeline as static and infrastructure, and to have components that can sort and direct the flow in [existing and static] pipes.
Shaping DAGs dynamically poses a challenge to the scheduler: how to 'predict' what tasks need to run in the future. The scheduler needs to evaluate which tasks will need to run without actually executing the tasks themselves. For Airflow in its current state, that is a chicken-and-egg problem.
For the future, I can imagine allowing dynamic DAGs to be described through the REST API, but that is definitely further out and has not really appeared on the horizon yet.
We haven't made our final determination yet, but Airflow at the current moment feels better.
A very simple example of that would be an Airflow script that reads a YAML config file with a list of table names, and creates a little workflow for each table that may do things like loading the table into a target database, perhaps applying rules from the config file around sampling, data retention, anonymisation, ... Now you have this abstraction where you can add entries to the config file to create new chunks of workflows without doing much work. It turns out there are tons of use cases for this type of approach. At Airbnb the most complex use case for this is around experimentation and A/B testing.
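A minimal sketch of that idea (the config format and shell scripts are invented for illustration):

    from datetime import datetime
    import yaml
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("load_tables", start_date=datetime(2017, 1, 1),
              schedule_interval="@daily")

    # tables.yaml might contain entries like:
    #   - name: events
    #     retention_days: 90
    with open("tables.yaml") as f:
        tables = yaml.safe_load(f)

    for table in tables:
        load = BashOperator(task_id="load_%s" % table["name"],
                            bash_command="load_table.sh %s" % table["name"],
                            dag=dag)
        purge = BashOperator(task_id="purge_%s" % table["name"],
                             bash_command="purge.sh %s %d"
                                          % (table["name"],
                                             table["retention_days"]),
                             dag=dag)
        load >> purge

Adding a table to the pipeline is now a one-line config change rather than new DAG code.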
I gave a talk at a Python meetup in SF recently talking about "Advanced data engineering patterns using Apache Airflow", which was all about dynamic pipeline generation. The ability to do that is really a game changer in data engineering and part of the motivation behind writing Airflow the way it is. I'm planning on giving this talk again, but maybe I should just record it and put it on Youtube. It's probably a better outlet than any conference/meetup...
For perspective, the company I work at has tried both (as in we built products using each, and the one with Luigi is still in use). We operate on data in the < 10TB space used primarily for machine learning applications. Luigi and Airflow both introduced complexity that simply wasn't useful relative to our data flow. They both ended up getting in the way more than they helped and introduced developer overhead that wasn't justifiable.
However they are both very nice tools, and it's easy to see how they can help reduce complexity overall with very large numbers of distributed mostly-static task graphs. If that's how you consume/transform your files and data, either tool might be worth looking into.
In terms of architecture, it's pretty straightforward to set up Airflow on one box and run all the services there until you have to grow out of it and scale out to multiple workers.
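As I understand it, the move from one box to many mostly comes down to swapping the executor in airflow.cfg and pointing the workers at a shared broker; a sketch (the broker URL is an example value):

    # single box
    [core]
    executor = LocalExecutor

    # scaled out: run `airflow worker` on each additional machine
    [core]
    executor = CeleryExecutor

    [celery]
    broker_url = redis://my-broker:6379/0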
Any CSV you're working with should be properly escaped anyway or you're bound for a world of pain.
If it matters anywhere, it matters in the data space. We don't see the disconnect between decimal and int, but when you're expecting a character and you get a varchar (not sure about the apostrophe case, but I suspect you're talking about quotes and embedded commas), or the number of fields or composition of fields changes (e.g. col1:string "jack, dorsey" col2:int 156 and the parser sees col1:string "jack" col2:int "dorsey, 156"), you want to know that it's broken ASAP.
    Double.parseDouble(x) == Double.parseDouble(y)
    /* instead of Python's */
    x == y
And ideally, if something makes it through the data filtering layer into the logic layer and doesn't make sense there, then that should be handled too. That's where strong types help: they force you to handle these cases, even if that means logging/alerting/ignoring, but at least you'll have to make a decision when you write the logic, instead of at 3 AM.
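You can get some of that fail-fast behavior in Python too by parsing defensively at the boundary; a sketch, with an invented two-column schema:

    import csv

    EXPECTED_FIELDS = 2  # hypothetical schema: name (string), count (int)

    def parse_row(row):
        # Fail fast instead of letting a shifted row flow downstream.
        if len(row) != EXPECTED_FIELDS:
            raise ValueError("expected %d fields, got %d: %r"
                             % (EXPECTED_FIELDS, len(row), row))
        name, count = row
        return name, int(count)  # raises if "dorsey, 156" lands in this column

    with open("people.csv") as f:
        # The csv module honors quoting, so "jack, dorsey" stays one field.
        rows = [parse_row(row) for row in csv.reader(f)]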
Last time I checked, enterprise ETL tools were sold as capable of a lot more than simple OLAP to OLTP. I find the reality provided is somewhat underwhelming. Given that Facebook can tell the difference between a photo of Dave and one of Jim, why do I have to manually provide a mask for every single date field flowing through an enterprise?
We've identified seven steps taken from DevOps, CI, Agile and Lean Manufacturing (https://www.datakitchen.io/platform.html#sevensteps) that you can start to apply today. We also created a 'DataOps' platform that incorporates those principles into a software: https://www.datakitchen.io.
The challenge is that there are many separate DAGs (and code and configuration) involved in producing complete production analytics embedded in each of the tools the team has selected. So what is needed is a “DAG of DAGs” that encompasses the whole analytic tool chain.
At Airbnb, Airflow is far from being limited to data engineering. All the batch scheduling goes through Airflow, and many teams (data science, analysts, data infra, ML infra, engineering as a whole, ...) use Airflow in all sorts of ways.
Airflow has a solid story in terms of reusable components, from extensible abstractions (operators, hooks, executors, macros, ...) all the way to computation frameworks.
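A custom operator, for instance, is just a subclass with an execute() method; here is a minimal sketch (the operator and what it publishes are hypothetical):

    import logging
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class PublishMetricOperator(BaseOperator):
        """Hypothetical operator that publishes a metric somewhere."""

        @apply_defaults
        def __init__(self, metric_name, *args, **kwargs):
            super(PublishMetricOperator, self).__init__(*args, **kwargs)
            self.metric_name = metric_name

        def execute(self, context):
            # Real work goes here; a hook would typically wrap the
            # connection/credentials logic so it can be reused across operators.
            logging.info("publishing metric %s", self.metric_name)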
In Composable's DAG execution engine, you can pull in data from various sources (SQL, NoSQL, CSV, JSON, RESTful endpoints, etc.) into our common data format. You can then easily transform, orchestrate, or analyze your data using our built-in Modules (blocks), or you can easily write your own. You can then view your resulting data all within the webapp.
Reading the comments, it seems like Composable supports a lot of the things people are asking for here that Airflow is lacking. Maybe check us out and let us know what you think!
For more information:
Composable Site - https://composableanalytics.com/
Try it yourself - https://cloud.composableanalytics.com/
Composable's Blog - http://blog.composable.ai/
I'd also argue for open source over proprietary, mostly to allow for a framework that is "hackable" and extensible by nature. You can also count on the community to build a lot of the operators & hooks you'll need (Airflow terms).
* Scheduler that knows how to handle retries, skipped tasks, failing tasks
* Great UI
* Horizontally scalable
* Great community
* Extensible; we could make it work in an enterprise context (Kerberos, LDAP, etc.)
* No XML
* Testable and debuggable workflows
As mentioned about Luigi, I do not have the whole story. One fact I know is that someone from Spotify gave a talk (which I did not attend) at an Airflow meetup in NYC, and I've heard the original author has left the company. Those are provable statements; happy to correct the article if needed.
What do you mean by [the current state of the art]?
I do wish it had a REST API though.
One thing that I missed a bit was automatic task output naming based on the parameters of a task, so I wrote a thin wrapper for that. This helps, but mostly for smaller deployments.
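Roughly, the wrapper looks like this (the paths and the example task are illustrative, not the actual code):

    import luigi

    class AutoNamedTask(luigi.Task):
        """Derives the output path from the task's own parameters."""

        def output(self):
            suffix = "_".join("%s-%s" % (k, v)
                              for k, v in sorted(self.param_kwargs.items()))
            return luigi.LocalTarget("data/%s_%s.tsv"
                                     % (self.task_family, suffix))

    class Aggregate(AutoNamedTask):
        date = luigi.DateParameter()
        region = luigi.Parameter()
        # output() becomes e.g. data/Aggregate_date-2017-03-01_region-eu.tsv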
Airflow and Luigi seemed to me like two sides of the same thing: fixed graphs vs. data flow. One fixes the DAG up front; the other puts more emphasis on composition.
That said, I am excited about the data processing tools to come - I believe this is an exciting space and choosing or writing the right tool can make a real difference between a messy data landscape and an agile part of business and business development.
As for organizing your data, my personal and very biased opinion is that version control semantics similar to Git are a pretty good way to help tame the complexity of ever-changing data sets. We already version code; with versioned data too, everything becomes completely reproducible.
Our Data Science Bill of Rights: http://www.pachyderm.io/dsbor.html
Pachyderm has been on my TODO list for a while, so thanks for reminding me; I'll try to implement something real with it soon.
Totally agree there should be both a simple Start Run and Stop Run button.
The language-agnostic aspect means that non-software-engineers can also use the orchestration platform for runbook automation.
It seems to me that Airflow is Spark-on-a-db ... or rather Spark is Airflow-on-Hadoop.
Does anyone know what the difference is?
P.S. I'm not trolling; I'm genuinely trying to get a sense of why and when I would use Airflow. Is it a point of scalability, of productivity, etc.?
For example, the positioning of Spark is simple: scalability. Celery is also very clear: simplicity with good-enough robustness if using the RabbitMQ backend.
What does Airflow do differently?
Airflow allows you to orchestrate all of this and keep most of code and high level operation in one place.
Of course Spark has its own internal DAG and can somewhat act as Airflow and trigger some of these other things, but typically that breaks down as you have a growing array of Spark jobs and want to keep a holistic view.
For us, Airflow manages workflows and task dependencies, but all of the actual work is done externally. Each task (operator) runs an arbitrary dockerized command, with I/O over XCom. Note that we use a custom Mesos executor instead of the Celery executor. An Airflow DAG might kick off a different Spark job based on upstream tasks.
For Airflow, Spark is just one of the engines where a transformation of data can happen.
We operate at a higher level: orchestration. If we were to start using Apache Beam at Airbnb (and we very well may soon!), we'd use Airflow to schedule and trigger batch Beam jobs alongside the rest of our other jobs.