Airflow by Airbnb is a nice alternative to Luigi. We've been using it for our ETL and it's been working well so far. http://airbnb.io/projects/airflow/

Can you write a little more about your evaluation of the two?

We've been using the Python Celery task queue library to coordinate Python ETL-esque jobs, but as our setup gets more heterogeneous, both language- and job-wise, we're looking for something built more explicitly for the purpose. Luigi and Airflow are both contenders.

I don't think Luigi has a scheduler that can run tasks hourly/daily on its own. The docs suggest using cron to trigger tasks periodically. This was the major reason why we didn't pick Luigi. Airflow provides that capability and also has a nice UI where you can manage the DAGs (visualize the task instances, set task states, etc.). It's a young project, and the codebase is clean and easy to understand. Airbnb is also actively developing it, as far as I know.

Thank you for your response!

One of the major differences is how tasks are defined. In Luigi, you have to subclass a base task class to create a task. In Airflow, you instantiate a task by calling an operator (you can think of an operator as a task factory). If generating tasks dynamically is important to you, Airflow is a better option in that regard, since you'd have to do metaprogramming on the Luigi side.
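To make the difference concrete, here is a rough, library-free sketch in Python. None of these names (BaseTask, make_task_class, Operator) are real Luigi or Airflow APIs; they are made up to mimic the two styles described above, and the table names are just placeholders.

```python
# Hypothetical sketch: these classes only imitate the two styles of task
# definition. They are NOT the Luigi or Airflow APIs.

class BaseTask:
    """Class-based style: every task is a subclass of a base task."""

    def run(self):
        raise NotImplementedError


def make_task_class(table):
    """Build a new BaseTask subclass at runtime (metaprogramming via type())."""

    def run(self):
        return f"load {table}"

    return type(f"Load_{table}", (BaseTask,), {"run": run})


class Operator:
    """Factory style: every task is an instance created by calling the operator."""

    def __init__(self, task_id, table):
        self.task_id = task_id
        self.table = table

    def run(self):
        return f"load {self.table}"


tables = ["users", "orders", "events"]  # placeholder metadata

# Class-based: one generated class per table, then one instance of each.
class_based = [make_task_class(t)() for t in tables]

# Factory-based: a plain loop over the metadata is enough.
instance_based = [Operator(task_id=f"load_{t}", table=t) for t in tables]

for task in class_based + instance_based:
    print(type(task).__name__, task.run())
```

Both lists do the same work; the point is only that the second one needs nothing beyond an ordinary loop, while the first has to create classes on the fly.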

Data engineering is moving in a direction where pipelines are generated dynamically. "Analysis automation", "analytics as a service", and "ETL frameworks" all require dynamic pipeline generation. Providing services around aggregation, A/B testing and experimentation, anomaly detection, and cohort analysis requires metadata-driven pipeline generation.

Airflow is also more state-aware: job dependencies are tracked independently for every run and stored in the Airflow metadata database. The fact that this metadata is handled by Airflow makes it much easier to, say, rerun a task and all of its downstream tasks for a date range. You can perform very precise surgery on false positives/false negatives and easily rerun subsections of a workflow for a given time window.
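A toy illustration of why per-run state makes this easy, in plain Python. This is not how Airflow stores or clears state internally; the DAG, the task names, the dates, and the clear() helper are all invented for the sketch. The only idea carried over is that state is keyed by (task, run date), so a task plus everything downstream of it can be reset for just a slice of dates.

```python
from collections import deque
from datetime import date, timedelta

# Hypothetical DAG: task -> tasks that directly depend on it.
downstream = {
    "extract": ["clean"],
    "clean": ["aggregate", "export"],
    "aggregate": ["report"],
    "export": [],
    "report": [],
}


def tasks_from(start):
    """Return `start` plus every task reachable downstream of it (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        for child in downstream[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def date_range(first, last):
    """Inclusive list of days from `first` to `last`."""
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


def clear(state, start, first, last):
    """Reset `start` and its downstream tasks to 'pending' for each day in range."""
    affected = sorted(tasks_from(start))
    cleared = []
    for day in date_range(first, last):
        for task in affected:
            if (task, day) in state:
                state[(task, day)] = "pending"
                cleared.append((task, day))
    return cleared


# Stand-in for the metadata database: one state entry per (task, run date).
all_days = date_range(date(2017, 1, 1), date(2017, 1, 5))
state = {(task, day): "success" for task in downstream for day in all_days}

# Rerun "aggregate" and everything after it, for Jan 2 through Jan 3 only.
cleared = clear(state, "aggregate", date(2017, 1, 2), date(2017, 1, 3))
for task, day in cleared:
    print(day.isoformat(), task)
```

Upstream tasks and runs outside the date range keep their "success" state, which is the "precise surgery" part.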

I'm not sure if this would fit your needs yet, but Pachyderm.io might be interesting if being language-agnostic matters to you and you're using containers. Disclosure: I'm one of the founders, and we don't have any of the UI features of Luigi or Airflow yet.
