
Introducing Drake, a kind of ‘make for data’ - dirtyvagabond
http://blog.factual.com/introducing-drake-a-kind-of-make-for-data
======
madhadron
I wrote a workflow processing system (<http://github.com/madhadron/bein>)
that's still running around the bioinformatics community in southern
Switzerland, and came to the conclusion that something like make isn't
actually what you want. Unfortunately, what you want varies with the task at
hand. The relevant parameters are:

\- The complexity of your analysis.

\- How fixed your pipeline is over time.

\- The size of a data set.

\- How many data sets you are running the analysis on.

\- How long the analysis takes to run.

If you are only doing one or two tasks, then you barely need a management
tool, though if your data is huge, you probably want memoization of those
steps. If your pipeline changes continuously, as it does for a scientist
mucking around with new data, then you need executions of code to be objects
in their own right, just like code.

Make-like systems are ideal when:

\- Your analysis consists of tens of steps.

\- You have only a couple of data sets that you're running a given analysis on.

\- The analysis takes minutes to hours, so you need memoization.

Another Swiss project, openBIS, is ideal for big analyses that are very fixed,
but will be run on large numbers of data sets. It's very regimented and
provides lots of tools for curating data inputs and outputs. The system I
wrote was meant for day-to-day analysis where the analysis would change with
every run, was only being run on a few data sets, and the analysis took
minutes to hours to run. Having written it and had a few years to think about
it, there are things I would do very differently today (notably, make
executions much more first class than they are, starting with an omniscient
debugger integrated with memoization, which is effectively an execution
browser).

So bravo for this project for making a tool that fits their needs beautifully.
More people need to do this. Tools to handle the logistics of data analysis
are not one size fits all, and the habits we have inherited are often not what
we really want.

~~~
zmmmmm
Heh, all the bioinformaticians come out of the woodwork :-)

Here's yet another project for bioinformatics workflows that I've been
involved in. This one is based on Groovy:

<http://bpipe.org>

I agree with your sentiments about the nature of pipelines vs build systems a
la make. Many, many people start down the path of putting classic DAG
dependency analysis at the foundation of their needs when in fact this isn't
so much of a problem in real situations, and is even somewhat
counterproductive because it forces you to declare a lot of things in a static
way that actually aren't static at all. I've found tools like this completely
break down when your data starts determining your workflow (e.g. if the file
is bigger than X I will break it into n parts and run them in parallel,
otherwise I will continue on and do it entirely in memory using a different
command).
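
(To make that concrete, here's a tiny Python sketch of the kind of data-driven
branching I mean - the threshold and the two branch descriptions are made up
purely for illustration:)

    
    
        import os
    
        SIZE_LIMIT = 1 << 30  # arbitrary 1 GB threshold, purely illustrative
    
        def plan(path):
            # The shape of the "graph" depends on the data itself,
            # so it can't be declared statically up front.
            if os.path.getsize(path) > SIZE_LIMIT:
                return "split into n parts and run them in parallel"
            return "process in memory with a different command"
    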

In my experience the problems in big data analysis are more about the
complexity of managing the process, achieving as much parallelization with as
little effort and craziness as possible (I don't see any mention of that in
Drake), documenting what actually happened when something ran so you can
figure it out later, and most of all, flexibility in modifying it since it
changes every day of the week.

One mistake that Drake appears to make (again, from my quick skim), is
interweaving the declaration of the "stages" of the pipeline (what they do)
and the dependencies between them (the order they run in). This makes your
pipeline stages less reusable and the pipeline harder to maintain. Bpipe
completely separates these things out, which is something I like about it.

~~~
aboytsov
Thanks for your feedback. We do mention parallelization in the design doc; it's
just not implemented yet. It's quite easy to add, though. We have a lot of
features spec'ed out, but not implemented.

I would appreciate it if you elaborated on separating step definitions from
dependency definitions. In my mind, they are the same thing. If you mean that
steps might not be connected by an input-output relationship, but still have
dependencies, Drake fully supports that via tags. If you mean that steps might
be connected through input-output files, but not depend upon each other, I
frankly don't see how that's possible. And if you mean some other syntax which
more clearly separates the two, Drake supports methods, which achieve exactly
that. If you mean something else, I would love to see an example.

Thanks!

~~~
zmmmmm
> I would appreciate if you elaborated on separating step definitions from
> dependency definitions

As I said, I only very quickly skimmed since I'm busy, I might have overlooked
information, and apologies in that case. But take the example from the front
page:

    
    
        evergreens.csv <- contracts.csv
          grep Evergreen $INPUT > $OUTPUT
    

So now suppose a new requirement comes along - Evergreen is also called
"Neverbrown" sometimes. It's decided the best way is to convert all references
at input so nothing else gets confused downstream. So I need an extra step
now:

    
    
        renamed.csv <- contracts.csv:
            sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT
    
        evergreens.csv <- renamed.csv:
            grep Evergreen $INPUT > $OUTPUT
    

Adding this step forced me to modify the declaration of the original command,
even though what I added had nothing to do with that command. With Bpipe, for
example, you say

    
    
        extract_evergreens = { 
          exec "grep Evergreen $input > $output" 
        }
    
        fix_names = { 
          exec "sed 's/Neverbrown/Evergreen/g' $input > $output"
        }
    

Then you define your pipeline order separately -

    
    
        run { fix_names + extract_evergreens }
    

If I get contracts from a different source that don't need the renaming, I can
still run my old version and I'm not changing the definition of anything:

    
    
        run { extract_evergreens }
    

Hope this explains what I mean, and again apologies if this is all clearly
explained in your docs and I just jumped to conclusions from the simple
examples!

~~~
aboytsov
I see. Thank you very much. I think this is very cool. I can see several
problems with this approach, and I would greatly appreciate it if you could
comment on that. After all, I don't know Bpipe.

The fundamental issue is why you have to repeat the filename, and I did
give it some thought.

1\. What your example does is allow you to assign dependencies based on
position. It's pretty cool. This seems to be easily reproducible in Drake, if
we add a special symbol that would just mean "a temporary file" for the
output, and "the last temporary output" for the input (by the way, you don't
need colons):

    
    
        _ <- contracts.csv
            sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT
    
        evergreens.csv <- _
            grep Evergreen $INPUT > $OUTPUT
    

or even:

    
    
        <<- contracts.csv
            sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT
    
        evergreens.csv <<-
            grep Evergreen $INPUT > $OUTPUT
    

2\. One of the problems, as you can see, is that it only works if you don't
care about the filenames, i.e. you use a temporary file. Similarly, your Bpipe
expression:

    
    
        run { fix_names + extract_evergreens }
    

doesn't care about filenames either. How do you add a filename there? What if
you need this file for debugging purposes, or if it's an input to some further
step down the road? In this case, you'd have to do what you want to avoid
doing (i.e. modify the original step).

3\. I'm even more concerned with multiple inputs and multiple outputs. As long
as your workflow is simple, you can get away with a + b. But when it's more
complicated, you would have to do something like:

    
    
        run { (((fix_names + extract_evergreens) * and_some_otheroutput) + some_other_step) * some_other_output }
    

(I used * as an operator that puts two outputs together to create an input
with two files for the next command. Mathematically, + would be better suited
for that, and * for what + is used for in your examples. :))

As you can see, it gets unreadable so fast that you'd want to use some sort
of identifiers to specify dependencies, and would end up with a scheme pretty
much equivalent to filenames. The fact that some file might be temporary is
a related, but parallel, problem.

4\. Even worse, I'm not quite sure how this syntax could accommodate
multiple _outputs_. If fix_names creates several outputs, and
extract_evergreens uses only one, you can't get around it without some weird
syntax and specifying a numeric position. It also gets out of hand pretty
quickly and you're back to using some sort of identifiers, be it filenames or
not.

5\. Speaking of identifiers, you can use variables in Drake instead of
filenames, so you can abstract filenames away. But it seems to me there's a
more fundamental problem in play.

6\. If you're concerned with coupling implementation and input and output
names, Drake has methods for this:

    
    
        fix_names()
            sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT
    
        extract_evergreens()
            grep Evergreen $INPUT > $OUTPUT
    
        renamed.csv <- contracts.csv [method:fix_names]
        evergreens.csv <- renamed.csv [method:extract_evergreens]
        

or even, as discussed above:

    
    
        <<- contracts.csv [method:fix_names]
        evergreens.csv <<- [method:extract_evergreens]
    

To summarize, I think your example is cool, but it seems to only be practical
for rather simple workflows. And I can also see how Drake can easily be
extended to support such syntactic sugar. For more complicated dependencies
though, I don't really see a better approach.

I would love to hear your further thoughts on the matter, and whether you'd
like to see something similar to what I proposed in Drake. Or something else.

Artem.

~~~
zmmmmm
Sorry for the late reply - I was really busy yesterday and didn't have time to
do it justice.

> One of the problems, as you can see, is that it only works if you don't care
> about the filenames

This is a really insightful point - it touches on one of the ways Bpipe
differs philosophically from other tools. Bpipe absolutely says you don't want
to manage the file names. Not that you don't care about them, but it takes the
position that naming the files is a problem it should help you with, not a
problem you should be helping it with. It enforces a systematic naming
convention for files, so that every file is named automatically according to
the pipeline stages it passed through. So, for example, after coming through
the 'fix_names' stage, 'input.csv' will be called 'input.fix_names.csv'. It
does sometimes give you names that aren't correct by default, but it gives you
easy ways to "hint" at how to produce the right name. Eg - if we want the
output to end with ".txt" we write:

    
    
        fix_names = {
            exec "sed 's/Neverbrown/Evergreen/g' $input > $output.txt
        }
    

Similarly, if there are a lot of inputs and you need the one ending with ".txt"
you will write "$input.txt"; if you want the second input ending with ".txt"
you will write "$input2.txt", and so on. Part of this stems from the huge
number of files that you can end up dealing with. When you start having
hundreds or thousands of outputs, naming them quickly goes from being something
you want to do to a chore that drives you completely crazy and that you want a
tool to help you with. Bpipe's names definitively tell you all the processing
that was done on a file, which is extremely helpful for auditability as well.
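
For instance (a sketch based on the description above rather than on Bpipe's
docs - the stage name and the paste command are my own), a stage consuming two
specific inputs might look roughly like:

    
    
        combine_csvs = {
            // picks the first and second inputs ending in ".csv"
            exec "paste $input1.csv $input2.csv > $output.csv"
        }
    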

> I'm even more concerned with multiple inputs and multiple outputs

As I touch on above, it's really not too hard. Bpipe gives you ways to query
for inputs in a flexible manner to get the ones you want. The commands you
write imply what files you need, and Bpipe searches backwards through the
pipeline to find the most recent output files that satisfy those needs.
Multiple outputs are similar ...

    
    
        fix_names = {
            exec "sed 's/Neverbrown/Evergreen/g' > $output1.txt 2> $output2. txt"
        }
    

If you need to reach further back in the pipeline to find inputs there are
more advanced ways to do it, but this works for 80% of your cases (the whole
idea of a pipeline is that each stage usually processes the outputs from the
previous one - so this is what Bpipe is optimized to give you by default).

> I think your example is cool, but it seems to only be practical for rather
> simple workflows. And I can also see how Drake can easily be extended to
> support such syntactic sugar.

It depends what you mean by "simple". I use it for fairly complicated things -
20 - 30 stages joined together with 3 or 4 levels of nested parallelism. It
seems to work OK. I'd argue that it's more than syntactic sugar, though - it's
a different philosophy about what problems are important and what the tool
should be helping you with.

Thanks for the great discussion!

~~~
aboytsov
Thank you very much for your response.

Actually, I don't think there are any philosophical differences, and I'll try
to make my case.

> Bpipe absolutely says you don't want to manage the file names.

I think this is too strong a statement as I try to show below.

> So, for example, after coming through the 'fix_names' stage, 'input.csv'
> will be called 'input.fix_names.csv'.

fix_names _is_ the identifier in this case. There's really not much of a
difference whether you use identifiers to come up with filenames, or you use
filenames to come up with identifiers. If anything, I think filenames are
preferable, because the user doesn't have to be aware of the scheme the tool
uses to convert identifiers to filenames. The fact that identifiers are just a
little bit shorter (e.g. don't have a .txt extension or something) does not
outweigh the inconvenience of having to figure out where the files are. The
problem with this approach is that figuring out where the files are requires
knowledge of the tool's inner workings, which can only be acquired from
reading the code or documentation.

There's another problem with these naming conventions: if you use the
same code in multiple steps, things can become quite confusing. How will BPipe
name them? Or is the only way to handle it to copy and paste the code and
create another rule?

It seems like an insufficiently clear separation between the code and the
filenames can be a source of problems... Please correct me if I'm wrong.

When I compare:

    
    
       _ <- contracts.csv
            sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT
    
        evergreens.csv <- _
            grep Evergreen $INPUT > $OUTPUT
    

with

    
    
       contracts:
            sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT.csv
    
       evergreens:
            grep Evergreen $INPUT.csv > $OUTPUT.csv
    
       contracts + evergreens
    

I strongly prefer the first option, because there are fewer implicit things
going on, and the code is more clearly separated from the file naming. Besides,
it's even shorter.

> Similarly if there are a lot of inputs and you need the one ending with
> ".txt" you will write "$input.txt", if you want the second input ending with
> ".txt" you will write "$input2.txt", and so on.

This can work for very simple workflows with maybe several cases of multiple
inputs and outputs, but it's unmanageable when complexity grows.

Imagine a step which takes 3 inputs - one separate, one which is output #2 of
a previous step, and one which is output #6 of yet another step. You can't use
numbers to resolve that. You will end up coming up with some sort of semantic
identifiers, which will almost completely replace BPipe's naming convention.
And what's worse, they will be hard-coded in your step's commands, which means
you'll have to edit the code if you want to change the filenames, or re-use
this step's implementation somewhere else.

> When you start having hundreds or thousands of outputs naming them quickly
> goes from being something you want to do to a chore that drives you
> completely crazy and you want a tool to help you with.

I'm not sure I agree here. Here's how I see it:

Instead of naming hundreds of files, you have to name hundreds of methods
(commands). Yes, you don't have to repeat the filenames to create
dependencies, but you have to repeat the method names (in "contracts +
evergreens"), and in a way which quickly breaches the boundaries of
readability.

This doesn't work for complicated workflows, and for simple ones, I would
prefer positional linking rather than coming up with names, like in the
example I provided above.

There's nothing that prevents Drake from coming up with filenames from more
abstract identifiers. We could come up with some syntax where you'd just give
an identifier (say, "~contracts"), and we'll take care of the file location
and name, just like BPipe does. The major difference is not this. The major
difference is that we think you need to identify inputs and outputs to build
the graph, and the method name is insignificant until you want code re-use,
and BPipe seems to take the opposite position - that you need to give method
names, and then use a separate expression to build the graph.

I think I provided at least a few strong arguments why BPipe is wrong on this
one. I would really love to hear your further thoughts.

> As I touch on above, it's really not too hard. Bpipe gives you ways to query
> for inputs in a flexible manner to get the ones you want.

I'm sorry, I understood neither this nor the example you provided. Could
you please elaborate? In the example you provided you identify different
outputs by adding a number to their names. Is that how subsequent steps are
supposed to refer to them as inputs - by the positional output number from the
step that used to generate them?

> I'd argue that it's more than syntactic sugar, though - it's a different
> philosophy about what problems are important and what the tool should be
> helping you with.

I appreciate your opinion. But the way I see it is this:

1) As far as different philosophies go, I find BPipe's one to be a bit
problematic for complicated cases.

2) And for simple cases, it all comes down to syntactic sugar.

I understand it's hard to argue in the abstract, so I'll tell you what. Give me
an example of a BPipe workflow that you particularly like, and I'll put it in
Drake. I might need to invent some Drake features on the fly, but that's a good
thing. This is what these discussions are for. I'll try to show you that
there's no philosophical difference, and that Drake has a more flexible
approach overall. I am looking forward to this challenge, because your opinion
is important to me.

Thank you!

Artem.

~~~
zmmmmm
Hey, just want to say thanks for the great discussion again. I'm a bit humbled
at the length & depth of thought you're putting into it.

> The problem with this approach is that figuring out where the files are
> requires knowledge of the tool's inner workings, which can only be acquired
> from reading the code or documentation

I suppose this is true but it's really not an issue I have in practice. I run
the pipeline and it produces (let's say) a .csv file as a result. I execute

    
    
        ls -lt *.csv
    

And I see my result at the top. There's really not a huge inconvenience in
trying to find the output. Having the pipeline tool automatically name
everything instead of me having to specify it is definitely a win in my case.
I suspect we're using these tools in very different contexts and that's why we
feel differently about this. It sounds like you need the output to be well
defined (probably because there's some other automated process that then takes
the files?) You _can_ specify the output file exactly with Bpipe, it's just
not something you generally _want_ to do. There's nothing wrong with either
one - right tool for the job always wins!

> if you use the same code in multiple steps, things can become quite
> confusing. How will BPipe name them

It just keeps appending the identifiers:

    
    
       run { fix_names + fix_names + fix_names }
    

will produce input.fix_names.fix_names.fix_names.csv. So there's no problem
with file names stepping on each other, and it'll even be clear from the name
that the file got processed 3 times. One problem is you _do_ end up with huge
file names - by the time it gets through 10 stages it's not uncommon to have
gigantic 200-character file names. But after getting used to that I actually
like the explicitness of it.

> Imagine a step which takes 3 inputs - one separate, one which is output #2
> of a previous step, and one which is output #6 of yet another step

Absolutely - you can get situations like this. We're sort of into the 20% of
cases that need more advanced syntax (eventually we'll explore all of Bpipe's
functions this way :-) ). But basically Bpipe gives you a query language that
lets you "glob" the results of the pipeline output tree (not the files in the
directory) to find input files. So to get files from specific stages you could
write:

    
    
        from(".xls", ".fix_names.csv", ".extract_evergreens.csv") {
            exec "combine_stuff.py $input.xls $input1.csv $input2.csv"
        }
    

It doesn't solve _everything_, but I guess the idea is, make it work right
for the majority of cases ("sensible defaults") and then offer ways to deal
with harder cases ("make simple things easy, hard things possible"). And when
you really get in trouble, it's actually Groovy code, so you can write any
programmatic logic you like to find and figure out the inputs if you really
need to.

> Instead of naming hundreds of files, you have to name hundreds of methods
> (commands)

Not at all - if my pipeline has 15 stages then I have 15 commands to name.
Those 15 stages might easily create hundreds of outputs though.

> The major difference is that we think you need to identify inputs and
> outputs to build the graph, and the method name is insignificant until you
> want code re-use, and BPipe seems to take the opposite position - that you
> need to give method names, and then use a separate expression to build the
> graph

Again, a really insightful comment, but I'd take it further (and this goes
back to my very first comment). Bpipe isn't just not trying to build a graph
up front, it really doesn't think there is a graph at all! At least, not an
interesting one. The "graph" is a _runtime product of the pipeline's
execution_. We don't actually know the graph until the pipeline has finished. An
individual pipeline stage can use if / then logic _at runtime_ to decide
whether to use a certain input or a different input and that will change the
dependency graph. You have to go back and ask why you care about having the
graph up front in the first place, and in fact it turns out you can get nearly
everything you want without it. By not having the graph you lose some ability
to do static analysis on the pipeline, but to _have_ it you are giving up
dynamic flexibility. So that's a tradeoff Bpipe makes (and there _are_
downsides, it's just in the context where Bpipe shines the tradeoff is worth
it).

> In the example you provided you identify different outputs by adding a
> number to their names. Is that how subsequent steps are supposed to refer to
> them as inputs - by the positional output number from the step that used to
> generate them

I think the "from" example above probably illustrates it. The simplest method
is positional, but it doesn't have to be, you can filter with glob style
matching to get inputs as well so if you need to pick out one then you just do
so.

> 1) As far as different philosophies go, I find BPipe's one to be a bit
> problematic for complicated cases.

I can't argue with that - but that's sort of the idea: simple things easy,
hard things possible. Complicated cases are complicated with every tool. I
guess I would say that pipeline tools live at a level of abstraction where
they aren't meant to get _that_ complicated.

> 2) And for simple cases, it all comes down to syntactic sugar.

I guess I'd have to disagree with this, as I really think there are some
fundamental differences in approach that go well beyond syntactic sugar.

> Give me an example of a BPipe workflow that you particularly like, and I'll
> put it in Drake

I wouldn't mind doing that - I'll need to look around and find an example I
can share that would make sense (what I do is very domain specific - unless
you have familiarity with bioinformatics it will probably be very hard to
understand). I'll pm you when I manage to do this, but it may take me a little
while (apologies).

Thanks as always for the interesting discussion. I think this is a fascinating
space, not least because there have been _so many_ attempts at it - I would
say there are probably dozens of tools like this going back over 20 years or
so - and it seems like nobody has ever nailed it. Bpipe has problems, but so
does every tool I've ever tried (I'm probably up to my 8th one or so now!).

~~~
aboytsov
...continued from part1. read part1 first!...

> It doesn't solve everything, but I guess the idea is, make it work right for
> the majority of cases ("sensible defaults") and then offer ways to deal with
> harder cases ("make simple things easy, hard things possible").

My contention is that while BPipe makes simple things easy and hard things
possible, Drake makes both easy _and_ possible. I think I've made some points
to that effect, and given you examples of Drake code which is just as easy to
write as the corresponding BPipe code without compromising on functionality.
But to really conclusively prove this, I'm looking forward to more BPipe
examples. So far, I haven't seen anything that is simpler (or even shorter) in
Bpipe.

> Not at all - if my pipeline has 15 stages then I have 15 commands to name.
> Those 15 stages might easily create hundreds of outputs though.

When I first read it I thought this was a great point and you were onto
something. But as I thought about it more, I realized that it only seems this
way.

Here's the thing: if you have 15 stages but hundreds of files, it can mean
only two things:

1) The vast majority of those files are leaf files, that is - they are either
inputs (with pre-determined names) or outputs whose names you don't really
care about (surprisingly). Drake can generate filenames for leaf output files
with ease, as they don't affect the dependency graph.

2) The vast majority of those files are _not_ leaves, but it means that the
steps either:

2a) pass dozens of inputs and outputs to each other, and you have to
either give them identifiers (as described above, Drake can do it too) or use
positions (unmanageable).

2b) even worse, have a big and complicated dependency graph with many more
than 14 edges, in which case your syntax of { a + b + c } will almost
certainly be inadequate to describe such a complex thing (15 vertices and
several dozen edges).

So, any way you look at it, Drake can do the same thing in the same way or
better. Am I missing something?

> Bpipe isn't just not trying to build a graph up front, it really doesn't
> think there is a graph at all! At least, not an interesting one. The "graph"
> is a runtime product of the pipeline's execution.

I don't understand this. I'm afraid it doesn't work this way. You can't have
the graph as a runtime product of the execution (i.e. after the execution),
because it cripples your ability to do partial evaluation of targets. That is,
you have to have the dependency graph before you can even answer the question
- "is target A up-to-date?". If you need to run the workflow to arrive at a
conclusion, there's no guarantee how much time it will take. I also believe it
unnecessarily blurs the distinction between the commands and the workflow. If
your code needs to care about its dependencies, it can't be used out of
context. So, maybe an example?

But if all you need to do is re-run everything every time, then it means
you're really doing something trivial, and it also raises the question of why
we need a tool like BPipe in the first place.

> An individual pipeline stage can use if / then logic at runtime to decide
> whether to use a certain input or a different input and that will change the
> dependency graph.

I don't see how it could work this way. Could you please give me an example
along with the explanation of how BPipe will handle it on the control level?

> You have to go back and ask why you care about having the graph up front in
> the first place, and in fact it turns out you can get nearly everything you
> want without it.

I'm confused, I think nothing could be further from the truth. The dependency
graph specifies what steps depend on what steps. If you don't know it, you
don't even know how to start evaluating the workflow, because you don't know
which step to build first. I don't understand this statement _at all_. Could
you please elaborate or give me an example?

> By not having the graph you lose some ability to do static analysis on the
> pipeline, but to have it you are giving up dynamic flexibility.

I need to see an example of this.

> I can't argue with that - but that's sort of the idea: simple things easy,
> hard things possible. Complicated cases are complicated with every tool.

I don't think having 3 inputs is a very complicated case. And neither is
having any dependency graph which is not a linear step1, step2, step3. My
point is as soon as you get any of those, BPipe starts to slowly evolve into
Drake, with some very weird syntax and inconsistencies (like having "implicit"
dependencies in steps' implementations but having to also specify some or all
of the dependencies in the "run" statement).

It's possible that I'm misunderstanding BPipe. Maybe some more examples would
fix this.

> I guess I'd have to disagree with this, as I really think there are some
> fundamental differences in approach that go well beyond syntactic sugar.

I don't really see them. And you can't just disagree, you have to provide
arguments. :) I understand you can see it differently, but it seems like so
far, there could be a Drake workflow for every BPipe example, which uses the
same ideas and is equally easy to write (but not necessarily the reverse).
This means it all comes down to syntax, no?

Again, I might be misunderstanding BPipe.

I think it's really, really hard to argue abstract concepts. I would very much
appreciate some examples. It doesn't even have to be your favorite workflow.
Just give me anything. Write something and ask - "how would you put it in
Drake?". I think my response would make it clear whether there are syntactic
or philosophical differences. We've already established that there are some
things BPipe cannot do as well as Drake can. I'd like to see the reverse to be
true. Because in that case we can really identify philosophical differences,
but if it's the opposite - i.e. Drake can do everything BPipe can with the
same ease - then it's not a question of philosophy any more but of design.

I'm not trying to attack BPipe. I just want to make the best tool possible,
and if we make compromises, I want to make sure they are informed. We must
consciously choose some things not to be as easy or possible in Drake for some
other greater good. So far, I can't identify any of those things.

Show me. :)

Artem.

P.S. You don't have to give a real-world example. I think that would actually
unnecessarily constrain and slow you down. Just demonstrate a basic concept, a
feature, name your steps A, B, C - I don't care what they do. Only if it's
something extremely exotic might I ask whether there's a real-world use case
for it, but I think I can come up with use cases for pretty much anything. :)

P.P.S. Please include what you do to run the workflow in your examples. I
suspect I might have misconceptions about what the "run" statement does and how
Bpipe resolves dependencies.

P.P.P.S. I appreciate the dialog as well, especially since BPipe is your 8th
tool. I would like Drake to be your 9th, and better than anything you used
before, including Bpipe.

~~~
zmmmmm
I'm sorry I don't have time to answer in full. I'm just going to respond to
this one point because I think it's pretty fundamental and perhaps explaining
it will clear up other things!

> The dependency graph specifies what steps depend on what steps. If you don't
> know it, you don't even know how to start evaluating the workflow, because
> you don't know which step to build first. I don't understand this statement
> at all. Could you please elaborate or give me an example?

I can see this is really, really hard to grok if you're basing everything on
the idea of a DAG, and since so many tools are, it's very natural to think you
couldn't do it any other way. Think of it as imperative vs declarative if you
like. In Bpipe the user declares the pipeline order explicitly (as you've
seen) - so that's the first part of the answer to your question. Bpipe knows
which part to execute first because the user said so explicitly. But this
isn't used for figuring out dependencies - dependencies arise as actual
commands are executed. Back to our famous example:

    
    
        fix_names = {
          exec "sed 's/Neverbrown/Evergreen/g' $input > $output"
        }
    
        extract_evergreen = ...
    
        run { fix_names + extract_evergreen }
    

We run it like this:

    
    
        bpipe run pipeline.groovy input.csv
    

If you run it once, Bpipe builds input.fix_names.csv. If you run it twice,
Bpipe is clever enough not to build input.fix_names.csv again! How is that if
it doesn't know about the dependency graph?! Well, it does it "just in time".
It executes the "fix_names" pipeline stage (or "method") and that calls the
"exec" command. The "exec" command sees that all the inputs referenced ($input
variables) are older than the outputs referenced ($output variables). So it
knows it doesn't have to rebuild those outputs, and skips executing the
command. So what about transitive dependencies? If C depends on B, which
depends on A (so dependencies are A => B => C), what happens if you delete
file B? Technically you don't need to build C because it's still newer than A,
but Bpipe can't see it any more. Well, Bpipe knows this too, because it keeps a
detailed manifest of all the files created. So when the call to create B is
executed it can see that although B was deleted, it _did_ exist, and in its
last known state was newer than its input files, so there's no need to rebuild
it, as long as downstream dependencies are OK.
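
(In other words, roughly the following Python sketch of that just-in-time
check - the manifest file and helper names here are made up for illustration;
this is not Bpipe's actual code:)

    
    
        import json, os
    
        MANIFEST = "manifest.json"  # made-up record of outputs and their last known mtimes
    
        def mtime(path, manifest):
            """Modification time of path, falling back to the last recorded
            time if the file has since been deleted."""
            if os.path.exists(path):
                return os.path.getmtime(path)
            return manifest.get(path)  # None if it was never built
    
        def up_to_date(inputs, outputs, manifest):
            out_times = [mtime(p, manifest) for p in outputs]
            in_times = [mtime(p, manifest) for p in inputs]
            if None in out_times or None in in_times:
                return False  # something was never built or recorded: must run
            return min(out_times) >= max(in_times)
    
        def exec_stage(cmd, inputs, outputs):
            manifest = json.load(open(MANIFEST)) if os.path.exists(MANIFEST) else {}
            if up_to_date(inputs, outputs, manifest):
                return  # skip the command "just in time", no up-front graph needed
            os.system(cmd)
            for p in outputs:
                manifest[p] = os.path.getmtime(p)
            with open(MANIFEST, "w") as f:
                json.dump(manifest, f)
    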

So in this way Bpipe handles dependencies for you. What it does _not_ do is
figure out which order to execute things in. It does them in exactly the order
you tell it. This is one of those things that conventional tools solve which
isn't actually that important (in my uses) but which occasionally is very
annoying - I actually _want_ to control the order of things sometimes. I want
to be able to tell it "do this first, then that, then the next thing"
regardless of dependencies. Usually it's pretty obvious what the right order
things should be in and there are other externalities that influence how I
like to do it ("I know this part uses a lot of i/o so try to do it in parallel
with another bit that's mainly using CPU", or "Let's run this part last
because it will be after hours and the other jobs will have finished"). Having
the tool think this stuff up by itself can save you a bit of time but it can
lose you a lot because you don't have the ability to really control what's
going on.

~~~
aboytsov
> I'm sorry I don't have time to answer in full.

We're not getting anywhere. Just give me goddamn examples! :) Please!
Examples!

> I can see this is really, really hard to grok if you're basing everything on
> the idea of a DAG, and since so many tools are, it's very natural to think
> you couldn't do it any other way.

 _There is no other way_. BPipe is based on the idea of a DAG. You just don't
see it.

> In Bpipe the user declares the pipeline order explicitly.

And this is a _big mistake_. The reason is simple - explicit order is very
hard to manage once you have multiple inputs and outputs, and as a
consequence, complicated (instead of linear) dependency relationships.

What you don't seem to realize is that by "declaring the pipeline order
explicitly" _you create a dependency graph_. It's a part of your workflow
definition. Your workflow contains the full definition of the dependency
graph. Even if it didn't, you would still use it. There is no other way.

This is what I meant when I said - you create your dependency graph in "run".
And this is a bad idea.

> dependencies arise as actual commands are executed.

What does that mean exactly? That the first command will somehow tell Bpipe
what to run next? If not, then I don't understand this statement at all.

> How is that if it doesn't know about the dependency graph?! Well, it does it
> "just in time".

 _It does not matter_ if you calculate the dependency graph before you run the
first command, or as you run the commands. It makes absolutely no difference.
The only difference is whether it is _computable_ or not. If you say it's not
computable until run-time, please elaborate on that.

> So in this way Bpipe handles dependencies for you.

So far I see that this is very standard and doesn't differ in any way from
what Drake or any other tool does. The only thing that differs, and I am
repeating myself, is how you define your dependency graph - through inputs and
outputs, or in "run". So far it seems that "run" is quite unfortunate. But
please give me examples.

> So in this way Bpipe handles dependencies for you. What it does not do is
> figure out which order to execute things in. It does them in exactly the
> order you tell it.

This is a meaningless statement. Drake also executes steps in the order you
tell it. The only difference is _how_ you tell it. In Drake, you tell it
through specifying a list of steps each step depends on individually (once
again, it doesn't matter that filenames are used for that - Drake also
supports tags, or it could be some other identifiers). In Bpipe, you tell it
in "run", collectively and sequentially. Drake's way supports the whole
variety of graphs, while Bpipe's way supports only a very limited subset. And
for this limited subset, Drake can give you (I think) a syntax just as good if
not better than Bpipe's. If you don't quite understand what I'm talking about,
give me an example, and I will demonstrate.

> I actually want to control the order of things sometimes.

This is fine, the only question is _how_. You say Bpipe's way is _convenient_.
I say give me an example and I'll show you that Drake's way is not any less
convenient. I'm sorry to keep repeating myself, I thought I stressed the
importance of examples quite a bit in my previous email and I want to stress
it again. Examples, please!

> I want to be able to tell it "do this first, then that, then the next thing"
> regardless of dependencies.

This statement is self-contradictory. You don't seem to realize that by
telling it "do this first, then that" you are _defining_ dependencies. It's
fine, and it's OK, and it can be convenient, but you can't say _regardless_ of
them.

Again - give me examples! Our conversation is becoming useless without
examples.

You did not give one, but I'll just grab whatever you threw my way:

    
    
        fix_names = {
          exec "sed 's/Neverbrown/Evergreen/g' $input > $output"
        }
    
        extract_evergreen = ...
    
        run { fix_names + extract_evergreen }
    
        $ bpipe run pipeline.groovy input.csv
    

Drake can support this perfectly:

    
    
        _ <- $[in]
          exec "sed 's/Neverbrown/Evergreen/g' $INPUT > $OUTPUT"
    
        $[out] <- _
          ........
    
        $ drake -v out=pipeline.groovy,in=input.csv
    

Isn't that much nicer? What disadvantages can you see?

Tell me what it is that you would like to do with this script, and I'll tell
you a better way to do it in Drake. Is it multiple versions of "run" that you
want to have? Easy. Are you concerned about inserting a step in the middle?
Trivial. Tell me why Drake's code is worse, and I'll listen. So far it seems
like it's better because it's shorter and more flexible at the same time.

> Having the tool think this stuff up by itself can save you a bit of time but
> it can lose you a lot because you don't have the ability to really control
> what's going on.

What _exactly_ are you losing?

I am sorry if I sound irritated. I am. I've just been begging for examples,
and you keep talking in the abstract, and it would be fine, but you're making a
lot of mistakes. So, instead of looking at concrete things that would make my
point apparent to you (or the opposite, prove that I'm wrong), I keep pointing
to flaws in your reasoning, which, frankly, is _irrelevant_. One picture is
worth a thousand words.

I really want your feedback. But please give me examples.

~~~
zmmmmm
> There is no other way. BPipe is based on the idea of a DAG. You just don't
> see it.

So if you think Bpipe uses a DAG, then I wonder how you would think it deals
with:

    
    
      run { fix_names + fix_names + fix_names }
    

In terms of the pipeline stages that run, this is cyclic, so it cannot be a
DAG. On the other hand, the files created do _usually_ form a DAG dependency
relationship, but even there, in the most general case, it's not at all
impossible in an imperative pipeline to read a file in and write the same file
out again in modified form (or more likely, to modify it in place), so the
file depends on itself - another non-DAG relationship. I'm sure you'll object
to this in a purist sense, and tell me it is a horribly broken idea, but as a
practising bioinformatician, when I have a 10TB file and modifying it in place
will save me hours and huge amounts of space, I'm much more interested in
getting my job done than being pure about things.

I think you're right that we're at diminishing returns here, and I'm sorry
I've frustrated you. We're trying to bite off more than we can chew in a forum
like this.

I wish you all the best with Drake and I'll definitely check it out down the
track (when it supports parallelism, since that's too important to me right
now). For now, though, I don't intend to read / respond to any more replies in
this thread.

~~~
aboytsov
This is not a cyclic dependency graph!!! This is syntax for copying
vertices, nothing else. It creates a DAG of three vertices and two edges, but
uses only one step definition to do so. It automatically replicates the step
definition as needed. It would be extremely easy to reproduce in Drake:

    
    
      fix_names()
         ...
    
      _ <- $[in]  [method:fix_names]
      _ <- _      [method:fix_names]
      $[out] <- _ [method:fix_names]
    

Is there any difference between Bpipe's version and Drake's version that I am
failing to see?

> I'm much more interested in getting my job done than being pure about
> things.

It's funny coming from someone I have been _BEGGING_ for examples, only to
get abstract philosophical reasoning in return.

I repeat. Give me an example. So far you haven't given me one example of what
Bpipe can do that Drake couldn't do in the same way or better, and yet you
continue claiming philosophical differences.

If we concentrate on _examples_ and discuss how they would work, whether there
are differences, and what these differences are, I guarantee you, we'll make
progress. But then again, I'm repeating myself.

Artem.

------
jboggan
I really wish that I had a tool like this back in grad school. I was doing
bioinformatics work and merging, chopping, and processing various datasets
over many months. When a new version of the underlying data came out it was
not an easy task to go back and re-process it through dozens of steps in Perl
and R. Having a tool like this would have made it a single command to do so
and also ensured repeatability and transparency in my data, something which is
often sorely lacking in an academic setting.

I am one of the data engineers at Factual, and though I didn't have a role in
creating it, I definitely enjoy using it on a day-to-day basis. You begin to
see the utility of it when you have a dozen people working up and down a data
pipeline and need to coordinate as product specs evolve or schemas change.

I also really like the tagging features - you can add specific tags to
different steps in the build and run different "flavors" of your workflow
depending upon what is needed. For example, you might build a workflow that
collects, cleans, filters, and performs calculations on data from all over the
world - but you might also want alternative versions of the build that only
work on specific regions or smaller debug datasets. Tags make that really
simple to do, even when many steps are shared by the different versions or the
dependencies are complicated.
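
(As a purely hypothetical sketch - the %tag notation is how I remember Drake's
tag syntax and the commands and file names are invented, so treat the details
as assumptions - a tagged "flavor" might look something like this, run with
something like `drake %us_debug`:)

    
    
        ; full build
        world.csv, %full <- raw.csv
          clean_and_filter $INPUT > $OUTPUT
    
        ; smaller debug flavor, selected by tag
        us_only.csv, %us_debug <- raw.csv
          grep ',US,' $INPUT > $OUTPUT
    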

~~~
xaa
As a fellow bioinformatician I can agree that this looks quite useful.

Although (since you mention R), I wonder why there's no love for R in Drake,
given that R is perhaps _the_ quintessential data processing language.

~~~
dirtyvagabond
There is love for R in Drake! As of about an hour ago:
<https://github.com/Factual/drake/commit/f63dd2630ca3e5e4a6a6baa4296d62dcd078690e>

------
aaronjg
I've spent a lot of time working with pipelining software, first for my last
job doing bioinformatics research, and now for handling analytics workflows at
Custora. We ultimately decided to write our own (which we are considering open
sourcing, email me if you are interested in learning more).

The initial system that I used was pretty similar to Paul Butler's technique,
with a whole bunch of hacks to inform Make as to the status of various MySQL
tables, and to allow jobs to be parallelized across the cluster.

At Custora, we needed a system specifically designed for running our various
machine learning algorithms. We are always making improvements to our models,
and we need to be able to do versioning to see how the improvements change our
final predictions about customer behavior, and how these stack up to reality.
So in addition to versioning code and rerunning analysis when the code is out
of date, we also need to keep track of different major versions of the code
and figure out exactly what needs to be recomputed.

We did a survey of a number of different workflow management systems such as
JUG, Taverna, and Kepler. We ended up finding a reasonable model in an old
configuration management program called VESTA. We took the concepts from VESTA
and wrote a system in Ruby and R to handle all of our workflow needs. The
general concepts are pretty similar to Drake, but it is specialized for our
Ruby and R modeling.

Some more useful links for those interested:

JUG <https://github.com/luispedro/jug>

Taverna <http://www.taverna.org.uk/>

Kepler <https://kepler-project.org/>

VESTA <http://vesta.sourceforge.net/>

------
ori_b
It looks like all of the drakefiles could be replaced pretty trivially with
Makefiles. Replacing '<-' with ':', ';' with '#', and '$INPUT', '$OUTPUT' with
'$<' and '$@', and inserting shell invocations of the Python interpreter looks
like it would do the job.
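
For instance, applying those substitutions to the front-page grep example
gives roughly this Makefile rule (a sketch; note the recipe line must be
tab-indented):

    
    
        evergreens.csv: contracts.csv
        	grep Evergreen $< > $@
    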

The major differences I see are:

    
    
        - Inline support for Python et al.
        - Confirming the steps that will be taken.
        - HDFS support.
    

Are there any other big differences?

~~~
aboytsov
The example in the blog post is understandably trivial, and it can be
implemented in almost any Make-like system.

The concept of Make is not unique. Everything that has dependencies and
executes steps is similar to Make in concept. Drake is no exception, and it
can be replaced with Make, but no more so than Rake, Ant or Maven can be
replaced by Make. That is, if it's trivial - yes. Just a bit more complicated
- no.

Some things are merely painful to implement with Make, some are just
impossible:

    
    
      - multiple outputs
      - no-input and no-output steps
      - HDFS support
      - Hadoop's partial files support (part-?????)
      - forced execution of any subbranch, up or down the tree or any individual targets (crucial for debugging and development)
      - target exclusions
      - protocol abstraction - inline Python is just one example
      - tags
      - branching
      - methods
    

These are just what's implemented already. Other things are planned such as:

    
    
      - automated data versioning (backup and revert)
      - parallelization
      - real-time status console
      - retries, email notifications
      - etc.
    

Requirements for building executables and for working with large, complicated
and expensive data workflows are quite visibly different, and the most
important thing about Drake is that it provides the platform for convenient
features (such as versioning or email notifications) to be implemented. And
once they are, every data workflow can take advantage of them.

I guess if Make were really, really extensible, we could have considered it as
a platform for all this. But it's not, and hacking all of that into Make's
source code in C would be, I'm sure, a much greater pain than writing Drake.

Artem.

~~~
blablabla123
Retries and email notifications is a good one. Currently I do something
similar with cron jobs, rsync, shell scripts and some custom tools -- on
multiple boxes. (Email notification with mailx.) It works pretty well in
theory; in practice, race conditions become a problem, making it sometimes
annoying because I need to run things manually when I need up-to-date
processed data. If I had retries, this would be an improvement.

~~~
aboytsov
Got ya. Please voice your opinion about the priority in which features should
be implemented by submitting a feature request at
<https://github.com/Factual/drake/issues>, or +1'ing an existing one.

There are so many potential features to be added to Drake, and a lot of them
have already been thought about and spec'ed out, that we need some sort of a
way to figure out what to do first.

Of course, if you'd like to actively contribute, we'd be ecstatic.

Artem.

------
danpalmer
With an empty workflow, this is the result of `drake --version`.

    
    
      $ time drake --version
        Drake Version 0.1.0
        Target not found: ...
        drake --version  5.42s user 0.18s system 188% cpu 2.969 total
    

For short scripts that you should be running in the shell, this is really bad.
I expect basic make commands on small projects to be effectively instant.
Compilation might take a bit longer, but 5.4s to print the version points to a
5s overhead on all executions.

I'm guessing this is due to the JVM overhead, so that pretty much says this
project isn't suited to the JVM. The JVM is great for long-running processes
and applications where the overhead is a very small percentage of the total
running time, but if it takes 5s longer than `make` to print its version,
that's really not a good sign.

This is a fantastic idea, and I will definitely be using it. But this overhead
needs fixing.

~~~
aboytsov
Hey, thanks for trying out our tool!

First of all, --version shouldn't try to run any targets. This seems like a
bug. Thanks.

Yes, you guessed correctly - this is the JVM startup time. I just hate the JVM
for that. We experimented with Nailgun and Drip to eliminate it - Nailgun is
problematic because it uses a shared JVM for all runs, and it can get quite
hairy sometimes. In the long run, Nailgun is almost certainly not the answer,
since it assumes that things we have no control over (i.e. the Clojure runtime)
don't do destructive teardown. Drip is a bit more promising, but we didn't
succeed in running Drake under it (simpler things worked fine, though).

So, we're still looking into it, and we're looking for other ideas, too.

In the meantime, you could run Drake under a REPL:

(-main "...")

The only problem is that Drake calls System/exit, but we can add a flag
("--repl") that would prevent it from doing so, and you'll stay in the REPL.
Thoughts?

P.S. JVM is unfortunate but Clojure is a fantastic language for something like
Drake.

~~~
danpalmer
Thanks for the detailed and well explained reply.

I have limited experience with Clojure, but it does seem to be a good match for
this sort of task due to its structure. However, the JVM seems to be a real
drawback to me. Perhaps with something like Scheme or Lisp you might get a
similar program structure, and be able to compile to faster binaries?

The REPL is a solution, but as many developers are using tools like make with
many other tools in the shell, running a REPL like that would prevent them
from using other things efficiently. Ultimately I think the overhead time
needs to be removed.

If it takes far longer than something like make, that's not necessarily an
issue. The key point is making it fast from the user's perspective. As long as
it runs in a fraction of a second, I can't see much of a difference between
0.1s and 0.0001s, so I don't think that sort of difference really matters;
it's when it gets over 1s that it becomes an issue.

Running something like Nailgun in the background may be a good solution, I
don't have any experience with it. But if it requires starting a daemon in the
background, that could get in the way of using the tool in a normal way.

I don't really know what the best solution to this problem is. I'm not sure
Clojure is the best tool for the job.

~~~
aboytsov
I can certainly see your point about using Drake in an automated environment
where this delay would still matter, but running a daemon is not practical. I
think you have a lot of good arguments against JVM. There were some moments
when I thought it might not have been the best choice as well - for example,
Java world is notoriously poor with dealing with child processes.

So, I agree, but there are several arguments that it's not that bad after all:

\- Drake is fundamentally an interactive tool. If you run it as a part of an
automated process, all its flexibility is not quite needed. You could have
Drake print a list of all shell commands it would execute, and save it to get
your automated script.

\- Most data workflows Drake is good for are quite expensive. Minutes,
sometimes hours. Definitely much more than 5 seconds. The reason is simple -
if your workflow takes so little time, you're really not gaining much by using
a complicated tool like Drake, instead of just putting it all in a linear
shell script, and simply re-running everything every time you need it.

\- Maybe we'll find a good solution like Nailgun and Drip.

\- Maybe someone will make a Java-code compiler that would create a stand-
alone executable out of a JAR.

\- Maybe Sun will eliminate JVM startup overhead. Or somebody will release a
3rd party JVM without it.

\- Maybe we'll have a compiled version of Clojure one day.

\- Other maybes. :)

We certainly would support any effort to port Drake to Lisp, C++, Ruby,
Python or any language you desire. Porting it to Common Lisp might not be
that much easier than porting it to Ruby. We might not consider it ourselves,
since the effort would be quite substantial.

Does it sound reasonable to you?

~~~
wink
I would say if a startup overhead of < 10 seconds bothers you, you're not
working with "data". Of course sed and grep have less overhead, but I wouldn't
even think of trying out a new tool for files/datasets larger than, say, a
gigabyte. (Rough guess, I know you can use grep and sed in under 10 seconds
for larger files; the point is about perspective and complexity.)

Clojure is sadly a really bad choice for fire-and-forget CLI scripts, but
"large scale data processing" doesn't fit this criterion for me.

~~~
danpalmer
I'm mostly going to use this for parsing XML into some other formats and
getting it into SQLite databases I think. The reason I would like to use Drake
over 'raw' Python scripts is because it supports a lot of the mundane stuff
that goes around the actual processing of the data, and I want to automate the
processes.

I typically deal with sub-100MB XML documents, so processing them takes very
little time, but having the quick iteration of changing the format and re-
outputting is a key part of the development cycle for me, and I think very
useful when you are experimenting with new data and seeing how it could be
used. Doing quick transforms is awesome.
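
For a concrete, hypothetical sketch of what that could look like as a Drake
workflow, going by the `output <- input` step convention in the project's
README - the `xml_to_csv.py` helper and the table name are made up, and the
`$INPUT`/`$OUTPUT` variables should be double-checked against the docs:

      records.csv <- dump.xml
        python xml_to_csv.py $INPUT > $OUTPUT

      records.db <- records.csv
        printf '.mode csv\n.import %s records\n' $INPUT | sqlite3 $OUTPUT

Re-running the workflow after changing the XML-to-CSV step should then only
redo the steps whose inputs are out of date, which is exactly the quick
iteration loop described above.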

~~~
aboytsov
Drip now works with Drake! Yes, it's still less than ideal if you're calling
Drake hundreds of times from an automated script which you need to run
quickly, but for interactive development, it should work just fine:

<https://github.com/Factual/drake/wiki/Faster-startup:-Drake-with-Drip>
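
The change itself is small: Drip (<https://github.com/flatland/drip>) is
designed as a drop-in replacement for the `java` launcher that keeps a
pre-spawned JVM around, so a Drip-flavoured version of the convenience script
from the README might look roughly like this (untested sketch - the wiki page
above has the supported instructions):

      #!/bin/bash
      # same launcher as in the README, but run through drip so repeat
      # invocations reuse a pre-warmed JVM instead of paying startup cost
      drip -cp "$(dirname "$0")/drake.jar" drake.core "$@"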

------
gojomo
I could imagine a bash shell that helps create Drake files by remembering, in
a richer history structure, all files read or modified by subprocesses.

(A degenerate drake file, one line per 'step', would almost be a 1:1
representation of this richer history... though you then might want to
coalesce and reorder atomic steps to represent the real shape of your workflow
and dependencies.)

------
moonboots
Djb redo[1], a make alternative, feels like a good fit for these type of data
manipulation and dependency representations. Below is a port of the first
example. The build script is just shell, so you can do stuff like embed python
with a heredoc. One bit of syntactic sugar is that redo assumes stdout is the
desired contents of the generated file, so you don't need to explicitly pipe
to an OUTPUT variable.

    
    
      #!/bin/sh
      # default.do: redo invokes this with the target name in $1 and captures
      # stdout as the new contents of that target; redo-ifchange declares
      # dependencies and rebuilds them first if they are out of date.
      case $1 in
      contracts.csv)
        curl http://www.ferc.gov/docs-filing/eqr/soft-tools/sample-csv/contract.txt
        ;;
      evergreens.csv)
        redo-ifchange contracts.csv
        grep Evergreen contracts.csv
        ;;
      report.txt)
        input=evergreens.csv
        redo-ifchange $input
        # heredoc body and terminator stay at the left margin so the shell
        # finds EOF and the Python sees no stray indentation
        python2 <<EOF
      linecount = len(file("$input").readlines())
      print("File $input has {0} lines.\n".format(linecount))
      EOF
        ;;
      esac
    

[1] <https://github.com/apenwarr/redo>
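
Going from memory of redo's conventions (so treat this as a sketch): you'd
save the script above as `default.do` and then just ask for the target you
want; the dependencies recorded by `redo-ifchange` get rebuilt first when they
are out of date.

      redo report.txt    # fetches contracts.csv, filters evergreens.csv, then writes report.txt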

~~~
aboytsov
Please see my response to Make comparison:

<http://news.ycombinator.com/item?id=5111527>

I suspect most of the points I made would apply to redo as well, if not more
so. Trivial things don't require Drake. Heck, oftentimes they don't require
Make either - just put it all in a linear shell script if the steps are not
too expensive. It's when things get complicated that you need something like
Drake.

~~~
moonboots
Redo lacks features baked into Drake, especially the Hadoop integration, but I
believe it would be easier to incorporate custom functionality into redo than
to hack Make or write a custom build system. I haven't used Drake, so I would
be interested in a small but complicated Drake script that tackles a problem
intractable in Make. I don't claim redo can provide a cleaner solution than a
purpose-built system, but I think it would be surprisingly simple.

~~~
aboytsov
The most crucial things Make lacks are multiple outputs and precise control
over execution. When you're debugging or developing a large and expensive
workflow, you absolutely must be able to say things like:

\- run only this step, I'm debugging it

\- I've changed this step's implementation, rebuild it and everything that
depends on it

\- build everything except this branch, it's expensive and I don't need to
rebuild it that often (example: model training)

Another example of a problem that's intractable in Make is timestamped
dependency resolution between local and HDFS files. If Make can't look at
HDFS, it can't tell whether a step needs to be rebuilt, and I don't think you
can fix that with external commands.

But generally, the search for intractable problems is a futile one. Remember,
everything you can code in Java, you can code on a Turing machine. :)

~~~
lars512
Make can certainly generate multiple outputs, and can trivially be coerced to
redo any step you like.

Provided you add your code as a dependency in the analysis, it will happily
redo only what's changed, giving you nice tight iterations.

I think its real limitations are with multi-machine setups, as in the HDFS
problem you're mentioning. Then you need a new tool.

~~~
aboytsov
Sorry, I might be very ignorant of make - could you please give me a command
to re-build a particular target and everything that depends on it?

~~~
lars512
So make's default behaviour with "make somefile.csv" is to build the whole
tree of dependencies. To force a rebuild of everything, run "make -B
somefile.csv"; it then assumes everything is out of date.

To force rebuild of one step, just delete its output or run "touch" on one of
its dependencies before running make. Then that step will get redone.

I like to have generated data in a separate folder, say "output/" which you
can then snapshot, blow away, or do what you like with. Basically though, I
keep it separate from data and code inputs.
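
Concretely, assuming a Makefile describing a chain raw.txt -> contracts.csv ->
evergreens.csv -> report.txt, those tricks look like this:

      make -B report.txt                        # assume everything is out of date; rebuild the whole chain
      rm evergreens.csv && make report.txt      # redo just the evergreens step, plus anything now stale after it
      touch contracts.csv && make report.txt    # pretend contracts.csv changed, so every step downstream is redone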

~~~
aboytsov
Thanks! This much I know. But it doesn't answer my question. Let me repeat it:
could you please give me a command to re-build a particular target and
everything that depends on it?

~~~
lars512
That's exactly what "make -B mytarget" does...

Are you thinking of a particular problematic scenario?

~~~
aboytsov
No, make -B mytarget rebuilds mytarget and everything mytarget _depends on_.
A more common scenario is when you need to rebuild mytarget and everything
that _depends on it_, without rebuilding other parts of the workflow that you
don't need.
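
To make the distinction concrete, take the same hypothetical Makefile chain
raw.txt -> contracts.csv -> evergreens.csv -> report.txt, and suppose you've
just changed the code behind the evergreens step:

      make -B evergreens.csv   # forces evergreens.csv and its *prerequisites* (contracts.csv);
                               # report.txt, which depends on it, is never touched
      make report.txt          # does propagate the change downstream, but only by starting from the
                               # final target and re-checking (and possibly rebuilding) every other
                               # branch that feeds it
      # there is no single invocation for "evergreens.csv plus everything that
      # depends on it, and nothing else"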

~~~
tedunangst
This is a really weird request. make won't rebuild things that haven't
changed, so the default make all rule will only rebuild the things depending
on mytarget. Every time you change mytarget, just run make (all) and
everything that depends on mytarget (and only those things) will be rebuilt.

~~~
aboytsov
Really?

This is not a weird request, this is one of the most common things we do when
we're developing a workflow. You need to do this every time you make changes
to code and you want these changes to propagate.

You can't run "make all", because it literally builds everything. You might be
working on a specific branch of the workflow, and the overall workflow could
be huge. And out-of-date in a lot of places. Or it could contain steps that
are very expensive, but not necessary to build for your development purposes
(for example, generating a model). This is why exclusions are also important,
and make also does not support them.

Make also does not support multiple outputs, and I gave you a prooflink
before. And a lot of other things which we think are important, too (I could
make a list. I did, actually).

If you like Make, you should continue using it. I think it is a little
arrogant on your part to try to explain to us that we simply wasted our time.
We built the tool to address the problems _we_ were facing. If you do not face
similar problems, by all means, use Make.

~~~
tedunangst
Sorry, I didn't mean to imply "you're doing it wrong". Didn't even realize you
made the tool. Oops. Personally, if large chunks of my output are out of date,
I don't like the idea of commingling them with new stuff, but obviously I
don't know a whole lot about what you're doing.

~~~
aboytsov
Thanks. I think you're missing the point. Imagine a big, complicated data
workflow, like the one whose diagram I showed in my video (a real-life
workflow):
[http://www.youtube.com/watch?feature=player_detailpage&v...](http://www.youtube.com/watch?feature=player_detailpage&v=BUgxmvpuKAs#t=1016s)

Now imagine you're not the only one working on it. You may never even have run
it in its entirety, since it takes 10 hours. Imagine there's a branch which
you, a developer, are currently working on. This branch depends on some other
files in the workflow. Let's say it generates synonyms from the sentence
dataset, or does some complicated cleaning of intermediate data. This is not a
small task and you will spend a couple of days on it, re-running your code
dozens of times in the process.

You don't care about other parts of the workflow. You only care about what
you're developing and how it propagates. _Does_ it propagate? Does it break
something down the road? What is the final output? Did all this synonym
collection help? Did the changes you made in learning code improve the
results?

When you're done, you may commit your code and somewhere else somebody will
build a nice new dataset, but while you're working on it, you really need to
be able to run any target individually, with dependencies or without, as well
as forcibly rebuild all steps down the tree to see the final result.

Makes sense?

~~~
tedunangst
Thanks.

------
madMilo
Reminds me of Makeflow: A Portable Abstraction for Data Intensive Computing on
Clusters, Clouds, and Grids, Workshop on Scalable Workflow Enactment Engines
and Technologies (SWEET) at ACM SIGMOD, May, 2012.

<https://www3.nd.edu/~ccl/software/makeflow/>

~~~
aboytsov
Nice. Surprisingly, we weren't aware of Makeflow and kinda missed it
completely. On the first look, it seems like Drake is quite a bit more
feature-rich than Makeflow. Please see the designdoc and/or the tutorial video
for details.

------
jeffdavis
Cool project. I expected to be underwhelmed, but when I saw the dependency
stuff, I was impressed. Maybe it should include a hook so that it can detect
dataset changes automatically by running a separate command (or did I miss
it?).

With a bit of creativity, I think there may be a lot of applications here.

~~~
aboytsov
This is an awesome idea. Currently Drake only supports timestamped and forced
evaluations, but it would be great to have an evaluation abstraction where you
could provide your own implementation of whether a target has changed and/or
whether a target should be considered fresher/younger than another target.
Timestamped evaluation would compare modification times, forced would always
return true, and it could be extended indefinitely.

If you're serious about it, please submit a feature request
(<https://github.com/Factual/drake/issues>), and describe more specifically
what you would like to be able to do in your case.

Thank you for a great thought.

Artem.

------
daemon13
Artem, the approach you guys are using is really EXCELLENT!

I think a bit of the disconnect here may be that some commenters are used to
'compiling' code, versus the 'compiling' data angle that you are taking.

This is especially evident in the make dependencies discussion with lars512.

To give a simple, specific example: I have a dataset of, say, 5000-50000 SKUs
aggregated across 9-12 dimensions. My final report/analysis uses 3 scenarios.
Now one sub-set of one scenario has changed [that's the raw input] - of course
running the 'data compilation' using only the data that changed and ONLY what
depends on it is the most effective and efficient approach.

Just my 2 financial cents...

~~~
aboytsov
Thank you very much for your kind words and support, and we certainly are
looking forward to your feedback, feature requests and bug reports, as well as
your code contributions, should you so desire.

We built this based on our own pain points with a larger audience in mind. We
hope we got some things right, because the success of any tool is defined by
its users. So, if you like it, let's build a thriving community together!

Artem.

------
swalsh
Whoa, this is the first time I'm hearing of "Factual", but playing around I'm
impressed! There was a side project I had a while ago, which I eventually gave
up because I couldn't source some data. These guys found it!

------
jcromartie
I like the idea that the tasks can be implemented in any language, but I feel
like this has limitations compared to something like Rake, where the step
definition is code, too. What this means is that in Rake I am not just limited
to defining new task bodies, but new ways of defining tasks themselves.

I see that Drake is implemented in Clojure, so I'd imagine you understand the
value of homoiconicity and extensible languages. So I wonder why you didn't
just use Clojure all the way through?

~~~
aboytsov
This is a great question. Our approach to this is described here:

[http://www.youtube.com/watch?feature=player_detailpage&v...](http://www.youtube.com/watch?feature=player_detailpage&v=BUgxmvpuKAs#t=2393s)

In short, we don't feel it's an either-or question. We want to have Drake as a
command-line frontend to the core functionality, but we would love to see/have
other frontends developed as well. Currently there's no Clojure DSL for Drake,
but I think it'd be totally awesome.

The reason we started with a command-line frontend is that our workflows are
heterogeneous, and we also didn't want to limit Drake to developers and
associate it with coding. Clojure can be quite a steep learning curve if you
only need it to specify steps and link them together through file
dependencies.

We had an important design goal in mind: Drake should be as simple as writing
a shell script. If it's not, our experience shows that most workflows start as
trivial shell scripts with one or two steps, and by the time they grow into
something unmanageable, it's kinda too late. :)

On a related note, Drake supports Clojure code inlining for manipulation of
the parse tree. It's not an equivalent, just a somewhat related feature. It
allows you to modify the steps, dependencies, and anything else in the parse
tree directly from Clojure.

------
Xion
There seem to be few differences between Drake and just rolling your own
Makefiles for data processing, but I definitely see that this project has
potential. Distributed processing over AWS/Compute Engine/etc. clusters would
be one nice thing to have, as a kind of simpler alternative to Hadoop.

I really like the inline, multi-language scripting though.

~~~
aboytsov
Thanks! We feel that in practice there are quite a lot of differences between
Drake and most Make-like systems. See this response for details:
<http://news.ycombinator.com/item?id=5111527>

------
fnbr
Perhaps I am the only one having issues here, but I cannot seem to get Drake
to run. Is there anything that is supposed to be done after building the
uberjar?

Further, I don't understand how I'm supposed to alter my PATH to be able to
run Drake by simply entering 'drake' - would it be possible to get some help?

(I'm sorry if this is really obvious)

~~~
aboytsov
The project's README file (<https://github.com/Factual/drake> \- scroll down)
contains building and running instructions, as well as how to create a simple
script to run Drake which you can put on your PATH.

~~~
fnbr
Ah, sorry, I should have been more clear. I've actually gone through the
readme a few times, to no avail. I'll triple-check it though.

~~~
aboytsov
Read the "Installation" section, there's "A nicer way to run Drake"
subsection. But I would advise to read the whole "Installation" section
carefully.

~~~
fnbr
I actually did that, several times.

My mistake was that I didn't realize I was supposed to have Drake.jar in the
same folder as the workflow that I was trying to execute (I'd keep getting the
error 'Unable to access jarfile drake.jar'). Naive error, I suppose.

However, I'm still having trouble executing the 'A nicer way to run Drake'
instructions. I created a file named 'drake' on my path, and inserted the
given text. However, I keep getting the error

'Exception in thread "main" java.lang.NoClassDefFoundError: drake/core'

Was I supposed to alter the script in any way? I just naively copy/pasted.

~~~
aboytsov
You _don't_ have to have Drake.jar in the same folder as the workflow you're
trying to execute.

You create the script as described in the documentation, and you put it
somewhere on your PATH _along_ with the JAR file. The JAR file has to be in
the same directory as the script.

Sorry if it wasn't clear. I'll fix the doc.

~~~
aboytsov
Actually, it was in the doc. If you followed the instructions below precisely,
just send us your terminal log so that we can see what you're missing.

A nicer way to run Drake

We recommend you "install" Drake in your environment so that you can run it by
just typing "drake". Here's a convenience script you can put on your path:

    
    
      #!/bin/bash
      java -cp "$(dirname "$0")/drake.jar" drake.core "$@"
    

Save that as `drake`, then do `chmod 755 drake`. Move the uberjar to be in the
same directory. Now you can just type `drake` to run Drake from anywhere.

~~~
fnbr
I'm embarrassed- you're completely right. My apologies. It's working perfectly
now. Thanks for putting up with me!

------
circa
When you run it. It tells you, "you're the fuckin' best, you da fuckin' best."

------
jonathanjaeger
Am I the only one who immediately thought of Drake the rapper? He's pretty
famous; I'm not sure if this was considered during the naming process. Even if
it's not a legal problem, it's an SEO/social media problem.

~~~
sehugg
A drake is a male duck. They were pretty famous back in the day.

~~~
jonathanjaeger
True, but I wouldn't call my product 'Queen', 'Cream', 'Journey', or another
noun that could be confused with someone or something famous. This distracts
from the conversation about the product, though, so perhaps I shouldn't have
brought it up.

------
roolio_
Kudos for your work! Do you plan to integrate Amazon S3 the same way you did
HDFS?

~~~
aboytsov
Thank you. Why not? We would love to see it, but we're not actively using
Amazon S3 ourselves at the moment. We would be more than happy to review code
contributions, though.

First of all, you can file a feature request:
<https://github.com/Factual/drake/issues>

Adding a new filesystem to Drake's source is very easy. You just create a
filesystem object that implements a handful of methods - listing a directory,
removing a file, renaming a file, and getting a file's timestamp - and then
register it under the corresponding prefix in the filesystem map. That's
pretty much it. Assuming there's a client JAR for Amazon S3, written either in
Clojure or in Java, it should be quite simple to do.

Artem.

------
abraininavat
Why Clojure?

~~~
aboytsov
We love Clojure. Lisp is an extremely powerful language, and Clojure brings
all of that to the practical JVM world. And Lisp is quite good at operating on
lists and graphs, which is a big part of Drake.

~~~
pencilcode
Out of curiosity, why did you go the Clojure route instead of the Scala route?
From what I understand, Scala has more libraries available, including AI and
NLP libraries, but maybe my impression is not correct?

~~~
aboytsov
It's hard to compare Clojure and Scala. Scala is a multi-paradigm programming
language with strong OOP and functional support. It's arguably more verbose
than Clojure, but looks much more similar to Java.

Clojure is a Lisp. Lisp stands apart from all other programming languages,
first of all because it supports syntactic abstraction (a.k.a. "code is
data"). Hardcore addicts (I'm not one of them) say there are only two kinds of
programming languages - Lisp and non-Lisp.

Here's a good comparison of Scala and Clojure:
<http://stackoverflow.com/questions/1314732/scala-vs-groovy-vs-clojure>

When we made the decision to switch to Clojure, several things affected it, in
no particular order:

\- we had some people who were already very proficient in Lisp

\- we liked how expressive and compact it was

\- Lisp is considered to possess immense expressive power (see
<http://www.paulgraham.com/lisp.html>)

\- we were enamoured with Cascalog
(<http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html>),
which is written in and for Clojure. This one paid off very well.

\- Lisp has a reputation for being great at manipulating data: lists, graphs,
etc.

Here's a good answer from one of our engineers:
<http://www.quora.com/Clojure/Why-would-someone-learn-Clojure>

As for libraries, both Clojure and Scala are JVM-based, and Clojure has very
good syntax for Java interop, so all Java libraries are available to us. But,
of course, the Clojure community also spits out libraries like crazy - for
example, take a look at this marvel, which we use in Drake for parsing:
<https://github.com/joshua-choi/fnparse>.

~~~
pencilcode
Thanks for your feedback. I've been playing around with both languages, and
was leaning towards Scala since it seemed more likely I could use it
professionally, even though I liked Clojure a bit more - I sort of like the
Lisp-like syntax.

