
Luigi vs. Airflow vs. Pinball - Maro
http://bytepawn.com/luigi-airflow-pinball.html
======
vtuulos
We have been using Luigi in production for a year now at AdRoll to manage a
graph of tens of data processing tasks. We have been really happy with it.

You can read more about our setup in these two blog posts:

[http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html](http://tech.adroll.com/blog/data/2015/09/22/data-pipelines-docker.html)

[http://tech.adroll.com/blog/data/2015/10/15/luigi.html](http://tech.adroll.com/blog/data/2015/10/15/luigi.html)

~~~
suresk
Just curious - have you guys used the Hadoop contrib stuff with Luigi? We use
Luigi almost exclusively to kick off Hadoop jobs, and when I went to refactor
some of my predecessor's code - which just kicked off a raw process that
called 'hadoop jar' - to use the Hadoop contrib stuff instead, I ran into a
lot of weird issues (largely around the way arguments get passed to the
Hadoop job).

I was just curious whether many other people are using the Hadoop contrib
stuff successfully, or if I am trying to use something that isn't very well
supported.

~~~
vtuulos
No, we haven't used Luigi with Hadoop. For batch processing we use
containerized jobs with a simple job queue, as described in the blog
article.

------
vikiomega9
I find that a lot of the use cases end up using Hadoop anyway, and I was
wondering why tools like Oozie are not used. It appears as if such projects
are feats of engineering and nothing more. I might be gravely mistaken, but
that's how I see it. Comments that suggest otherwise would be greatly
appreciated. EDIT: I also find it odd that one might write a workflow manager
just because they can't find an equivalent one for Python.

~~~
suresk
Common complaints I've heard about Oozie are that it has a steep learning
curve and not a great UI, and that people hate the fact that it is XML-based.
This is a pretty decent comparison of Oozie vs. Luigi (and Azkaban):

[http://www.slideshare.net/jcrobak/data-engineermeetup-201309](http://www.slideshare.net/jcrobak/data-engineermeetup-201309)

~~~
vikiomega9
That presentation was pretty good, with both the good and bad takes. Do you
think frameworks like Cascading or Spark make things a lot easier, either as
a higher abstraction on Hadoop or as a different compute model?

~~~
suresk
I haven't tried Cascading, but I've started doing some stuff with Spark and
really like it. I feel like it is _usually_ an easier abstraction to work with
and it is a lot easier to prototype and experiment with.

------
mtrn
We have been using Luigi for a larger project and it works fine. Some people
have a bit of a hard time understanding what it is about and why at least some
software for scheduling is needed. I find the notion of "make for data"
useful.
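
The "make for data" idea can be sketched in a few lines of plain Python (no Luigi dependency; the task and file names below are hypothetical): each task declares an output target and only runs when that target does not yet exist, so re-running the pipeline skips work that is already done. Luigi expresses the same pattern with `Task` subclasses and their `requires()`, `output()`, and `run()` methods.

```python
# Stdlib-only sketch of make-style semantics: run a producer only when
# its output target is missing, otherwise treat the task as complete.
import os
import tempfile

def run_task(output_path, producer):
    """Run `producer` and write its result, unless the target exists."""
    if os.path.exists(output_path):
        return False  # target is up to date; nothing to do
    with open(output_path, 'w') as f:
        f.write(producer())
    return True

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, 'report.tsv')

first = run_task(target, lambda: 'some derived data\n')   # runs
second = run_task(target, lambda: 'some derived data\n')  # skipped
```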

After a presentation on Luigi at a Python user group, we had a lively
discussion about certain features. One issue that came up was the fact that
downstream tasks are not necessarily recomputed once you change something in
the code. For that to happen, you would have to keep track of the source code
as well. Similarities with Nix came up, where a change in the code leads to a
different ID, so all changes can be tracked.
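
The Nix-like idea mentioned above can be sketched with the stdlib alone (this is not part of Luigi; the function and path names are hypothetical): derive the output filename from a hash of the task's own code, so that editing the code changes the target path and downstream work is redone instead of being considered complete.

```python
# Stdlib-only sketch: version a task's output path by hashing the
# task function's compiled bytecode, as a stand-in for tracking its
# source. A change to the function's logic yields a new digest, hence
# a new (missing) target that forces recomputation.
import hashlib

def transform(record):
    # Hypothetical processing step.
    return record.upper()

def versioned_output_path(func, base='data/transform'):
    digest = hashlib.sha1(func.__code__.co_code).hexdigest()[:8]
    return '%s-%s.tsv' % (base, digest)

path = versioned_output_path(transform)
```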

Shameless plug: When I started using Luigi, I missed some auto-generated
filename feature for task outputs, so I wrote a utility library for that (and
a few other things): [https://git.io/vg4D0](https://git.io/vg4D0)

------
twunde
The most interesting part of this is the links to the actual reviews.

------
harlowja
Don't forget another one (used by openstack projects):

[http://docs.openstack.org/developer/taskflow/](http://docs.openstack.org/developer/taskflow/)

Comments/questions welcome!

------
suresk
Good write-up. We've been pretty happy with Luigi, but built-in scheduling
would be really, really nice, so I'm going to have to take a look at Airflow.

------
thesorrow
We are using Airflow to schedule backups for our servers and it's been really
stable so far!

