
Toil – A Python Workflow Engine - nikolay
http://toil.readthedocs.io/en/latest/
======
tetron
After Arvados ([http://arvados.org](http://arvados.org)), Toil has the most
complete support for the Common Workflow Language
([http://commonwl.org](http://commonwl.org)), which is an emerging standard for
writing portable workflows that can run on different platforms, instead of
being tied to a particular engine or grid/cluster/cloud technology.

------
rtpg
What exactly are workflows here? I tried following some links but can't quite
figure it out

~~~
daveguy
Appears to me to be a series of commands used for processing data. The
workflow description language (WDL) probably has the best overview of the 1-2
level links:

[https://github.com/broadinstitute/wdl](https://github.com/broadinstitute/wdl)

Essentially a formalization of a unix command pipe. With the ability to tee
and recombine components. Also, the ability to define valid inputs and
outputs.
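
To make the pipe analogy concrete, here is a minimal sketch in plain Python. It is not WDL, CWL, or Toil; the function names (`split_words`, `tee`, `summarize`, `run_pipeline`) are made up for illustration. One stage feeds two branches (the "tee"), the branch results are recombined, and the input is validated up front.

```python
from typing import Callable, Dict, List, Tuple


def split_words(text: str) -> List[str]:
    """First stage: turn raw text into a list of words."""
    return text.split()


def count_words(words: List[str]) -> int:
    """Branch A: how many words there are."""
    return len(words)


def longest_word(words: List[str]) -> str:
    """Branch B: the longest word (first one wins on ties)."""
    return max(words, key=len)


def tee(value, *branches: Callable) -> Tuple:
    """Feed the same value to every branch, like `tee` in a shell pipe."""
    return tuple(branch(value) for branch in branches)


def summarize(count: int, longest: str) -> Dict[str, object]:
    """Recombine the two branch outputs into a single result."""
    return {"count": count, "longest": longest}


def run_pipeline(text: str) -> Dict[str, object]:
    # Declared input contract: a non-empty string.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")
    words = split_words(text)
    count, longest = tee(words, count_words, longest_word)
    return summarize(count, longest)
```

Calling `run_pipeline("a bb ccc")` runs the split stage once and both branches on its output. A workflow language expresses the same wiring declaratively so that an engine can decide where each step runs.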

~~~
protomikron
I do not want to be snarky, but are these languages actually used and what do
they bring to the table?

I have to admit, if I hear the term "workflow" I get very suspicious as there
does not seem to be a consensus on what exactly a "workflow" is. The only
thing people seem to agree on is the idea that "a workflow describes some kind
of data flow" which is pretty broad as this includes all programs.

So a "workflow engine" is then ... an interpreter for a (mostly domain-
specific) language?

~~~
tetron
(Toil contributor here)

Workflow systems are generally used to orchestrate parallel batch jobs on a
cluster, which may have hundreds or thousands of individual tasks. So the
emphasis is on defining what units of work are independent that can run on
separate nodes without needing to communicate with other tasks, and how
outputs of one task feed into the inputs of another.
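
As a rough illustration of that shape (not Toil's actual API), here is a sketch using only the Python standard library. Threads on one machine stand in for cluster nodes, and `count_lines`, `merge`, and `run_workflow` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def count_lines(chunk: str) -> int:
    """An independent task: it needs only its own input chunk."""
    return len(chunk.splitlines())


def merge(counts: Iterable[int]) -> int:
    """A downstream task: its input is the output of every count_lines task."""
    return sum(counts)


def run_workflow(chunks: List[str], max_workers: int = 4) -> int:
    # Scatter: the per-chunk tasks never talk to each other, so they can
    # run in parallel (here on threads, on a cluster on separate nodes).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = list(pool.map(count_lines, chunks))
    # Gather: the merge step runs only after all of its inputs exist.
    return merge(partial_counts)
```

A real engine adds what this sketch leaves out: scheduling across machines, moving files between tasks, and retrying or resuming failed tasks.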

~~~
SEJeff
Could Toil be extended to work on something like say Mesos maybe using the
Marathon scheduler, or would you be better off to write a toil framework
instead?

~~~
braincode
AFAIK, there's some support in there already:

""" Develop and test on your laptop then deploy on any of the following:

Commercial clouds:
- Amazon Web Services (including the spot market)
- Microsoft Azure

Private clouds:
- OpenStack

High Performance Computing Environments:
- GridEngine
- Apache _Mesos_
- Parasol

- Individual multi-core machines """

[http://toil.readthedocs.io/en/latest/](http://toil.readthedocs.io/en/latest/)

------
mooseburger
Not Python 3 compatible eh? Thought it was disheartening at first, but then it
seems the project started in 2011.

~~~
zzleeper
True, but at the same time it's odd for open source projects to not have py3k
compatibility. Hell, the stuff I have on github has no python 2 compatibility
because honestly, after spending some time in Python 3, going back really feels
like a step back (bytes vs strings, etc.). Unless you are getting paid or
depend on libraries that are only on py2, why would you use python 2 instead
of 3?

~~~
coredog64
Because RHEL6 and a-hole sysadmins who are too lazy to install SCL are all too
common.

~~~
bpchaps
I'm a sysadmin who uses python27, centos and rhel on a daily basis and I've
never personally installed it - simply for not knowing anything about it, nor
has anybody asked me to install it. The name rings a slight bell, but it
doesn't sound like anything a python package would sound like, so I can kinda
see how I could have just overlooked it many, many times. Besides, installing
numpy/scipy/etc is just a few commands away to having a similar build in
comparison, anyway, so I've never even needed to think about alternatives.

I understand your frustration, but calling us assholes and lazy is just a
little harsh. At this point, I'd have hoped it was understood by everyone that
this wasn't the sysadmin's fault. How insistent or pushy or hoity toity
were you when asking? I've been woken enough times while on call after a 2.7
upgrade to pretty much give up on it, so even the request gives me a blinding
headache these days. So yeah, I'll probably be a little grumpy at that request
and I _will_ take my time with it on top of the other things going on. If I
was called lazy or an asshole for bringing these issues up, be ready to wait.

Hell, to flip it back on y'all, would it hurt to have proper shebangs? For
every developer who asks for a 2.7 upgrade, I get three more asking to
downgrade until they fix their badly self-deployed code, despite plenty of
preparation. It sucks pretty hardcore.
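
For anyone unsure what a "proper shebang" means here, a minimal sketch: the first line names the interpreter version the script was written for instead of relying on whatever a bare `python` resolves to after an upgrade. The 2.7 floor and the `interpreter_ok` helper are illustrative assumptions, not taken from any particular project.

```python
#!/usr/bin/env python2.7
# The shebang above asks for a `python2.7` binary on PATH, so swapping the
# system default `python` does not silently change what runs this script.
import sys

REQUIRED = (2, 7)  # assumed minimum version for this example


def interpreter_ok(version=None, required=REQUIRED):
    """Return True if (major, minor) of `version` is at least `required`."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= required


if __name__ == "__main__":
    if not interpreter_ok():
        sys.exit("need Python %d.%d or newer" % REQUIRED)
    print("running under Python %d.%d" % tuple(sys.version_info[:2]))
```

The runtime check is a second line of defense for the case where someone launches the script as `python script.py` and bypasses the shebang.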

That said, I have a flood of baseline rhel/centos machines to install soon, so
I'll try to include it in my builds.

(most of these issues I've had are in larger environments where best practices
are best-effort, through legacy swamps.)

~~~
th0br0
I believe SCL refers to this:
[https://en.wikipedia.org/wiki/Scientific_Linux](https://en.wikipedia.org/wiki/Scientific_Linux)

It's an unusual abbreviation though.

~~~
Spiritus
More likely
[https://www.softwarecollections.org/](https://www.softwarecollections.org/)

------
ktamura
Toil is very much reminiscent of Luigi. I hope the author(s) will elaborate on
this here (or elsewhere): there's very little on either readthedocs or their
GitHub repo:
[https://github.com/BD2KGenomics/toil](https://github.com/BD2KGenomics/toil)

------
cevaris
Seems like a lot of work to set up. Also, I spent several minutes on the docs
and did not see one line of code; examples?

~~~
zxv
An 8 line Makefile translates to an 80 line common workflow (.cwl) file.

[https://github.com/common-workflow-language/workflows/tree/m...](https://github.com/common-workflow-language/workflows/tree/master/workflows/compile)

~~~
dmytroi
Honestly the format looks a bit over-engineered IMHO. The task is quite
simple: make a build system that executes jobs on clusters. So why not take the
best practices from build configuration formats and just make them run over the
network? For example, the ninja build system [1] format is quite good in my
opinion, so just make the runtime execute commands over the network. Or
travis-ci [2] is another example of a well-designed configuration format, and
it really enables developers to write small and powerful configurations. Sure,
it has even been done before (though mostly for C/C++ stuff), like IncrediBuild
[3] for example, or FASTBuild [4] or distcc [5]. Precise control of pipes could
be improved in current build systems, but I'm not sure how important it is for
this application.

\- [1] [https://ninja-build.org/](https://ninja-build.org/)
\- [2] [https://travis-ci.org/](https://travis-ci.org/)
\- [3] [https://www.incredibuild.com/](https://www.incredibuild.com/)
\- [4] [http://fastbuild.org/](http://fastbuild.org/)
\- [5] [https://github.com/distcc/distcc](https://github.com/distcc/distcc)

~~~
samuell
Haven't checked ninja, but I've blogged a bit on limitations in common build
systems, such as make and its various derivatives:

"The problem with make for scientific workflows":

[http://bionics.it/posts/the-problem-with-make-for-scientific...](http://bionics.it/posts/the-problem-with-make-for-scientific-workflows)

"Workflow tool makers: Allow defining data flow, not just task dependencies"

[http://bionics.it/posts/workflows-dataflow-not-task-deps](http://bionics.it/posts/workflows-dataflow-not-task-deps)

The last of which is a limitation of even most of the "very much-engineered"
ones, as the post goes on to explain.

~~~
dmytroi
From the first blog post:

> Files are represented by strings

I think it's especially true for make - looks like it was designed to
efficiently express operations for transformations of the same type (like .cpp
-> .o/.obj). So in a different use case it may become a bit clumsy to use. Ninja
should help a bit in this case - you can define a rule, and just use rule name
when defining inputs and outputs of a build statement, though it still
operates on files.

>[Problems with] combinatorial dependencies

Yes, partially this could be fixed with wildcards in make. Ninja doesn't have
wildcard support, so I've created the buildfox [1] to fix it :)

>Non-representable dependency structures

I think it's a limitation of this type of build system: their configuration
languages are oriented toward expressing "how" to achieve things, not "what"
to achieve.

\- [1]
[https://github.com/beardsvibe/buildfox/](https://github.com/beardsvibe/buildfox/)

------
mnkmnk
Is there a GUI tool to work with CWL? This could take off if an ETL-like GUI
tool could generate the config for you.

~~~
tetron
Try [https://github.com/rabix/cottontail](https://github.com/rabix/cottontail)

------
fermigier
Nice.

A presentation on this (or similar) subject would be nice at PyData Paris
([http://pydata.fr/](http://pydata.fr/)), the CFP is open.

