I've (ab)used apenwarr's redo for a couple of big data processing projects, with mixed results.

One was a news recommendation engine. We pulled down and parsed RSS feeds, crawled every new link they referred to, crawled thumbnails for each page, identified and scraped out textual content from pages, ran search indexing on the content, ran NLP analysis, added them to a document corpus, ran classifiers and statistical models, etc.

Every step of the way took some input files and produced an output file. We used programs written in many different languages -- whatever was best for the job.

So a build system was the obvious way to structure all of this, and we needed a build system we could push pretty hard. Our first version used make and quickly ran into limitations: essentially, we needed more control over the dependency graph than make's static, declarative approach allows. So we turned to redo, which lets you write build scripts in the language of your choice.

One thing we needed almost immediately was more powerful pattern matching rules than make's % expansion. No problem: invent a pattern syntax and a special mode where every .do script simply announces what patterns it can handle. Collect patterns, iteratively match against what's actually in the filesystem, and then you've got the list of target files you can build. (This already differs from make, which wants you to either specify the targets explicitly up front as "goals," or enumerate their dependencies via a $(shell ...) expansion and then string transform them into a list of targets which are ALSO matched by some pattern rule somewhere...okay you get it, it's make, it's really disgusting.)
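To make the "collect patterns, iteratively match" idea concrete, here's a minimal sketch in Python. It is not our actual code and not anything redo ships: the `{name}` placeholder syntax, the `RULES` table, and the function names are all invented for illustration, and it assumes the rules don't feed into each other in a cycle.

```python
import re

# Hypothetical rule table. In a real setup each .do script would announce
# its own (input pattern, output template) pair; here they are hard-coded.
# The "{name}" placeholder syntax is invented for this sketch.
RULES = [
    ("feeds/{name}.rss", "links/{name}.txt"),
    ("links/{name}.txt", "pages/{name}.html"),
    ("pages/{name}.html", "text/{name}.txt"),
]


def pattern_to_regex(pattern):
    """Turn 'feeds/{name}.rss' into a regex with a named group per placeholder."""
    parts = re.split(r"\{(\w+)\}", pattern)  # literal, name, literal, ...
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))           # literal text
        else:
            pieces.append(f"(?P<{part}>[^/]+)")      # placeholder, no slashes
    return re.compile("".join(pieces))


def buildable_targets(existing, rules):
    """Return every target reachable from the existing files via the rules."""
    compiled = [(pattern_to_regex(src), dst) for src, dst in rules]
    known = set(existing)   # files that exist or could be built
    targets = set()         # files some rule knows how to build
    changed = True
    while changed:          # iterate until no rule produces anything new
        changed = False
        for path in list(known):
            for regex, dst in compiled:
                match = regex.fullmatch(path)
                if not match:
                    continue
                target = dst.format(**match.groupdict())
                targets.add(target)
                if target not in known:
                    known.add(target)
                    changed = True
    return targets
```

Given just `feeds/bbc.rss` on disk, `buildable_targets(["feeds/bbc.rss"], RULES)` walks the chain and reports the links, pages, and text files as buildable, without anyone having to list them up front.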

Another thing we needed was a way to say: here's a list of target files that appear in the dependency graph, give them back to me in topologically sorted order. This allowed us to "compact" datasets as they became fragmented, without disturbing things downstream from them in the dependency graph. Again, this was not difficult with redo once we had some basic infrastructure.
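The topological ordering part is easy to sketch, assuming you can already dump the dependency graph as a plain mapping from each target to the files it depends on. The `deps` shape and the `sorted_targets` name below are hypothetical; the sort itself uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def sorted_targets(deps, wanted):
    """Return the wanted targets ordered so dependencies come first.

    deps maps each target to the set of files it depends on (an assumed
    format for this sketch). The whole graph is sorted and then filtered,
    so indirect dependencies between wanted targets are respected.
    """
    wanted = set(wanted)
    order = TopologicalSorter(deps).static_order()  # predecessors first
    return [target for target in order if target in wanted]


# Example graph: links -> pages -> text -> index
deps = {
    "pages/a.html": {"links/a.txt"},
    "text/a.txt": {"pages/a.html"},
    "index/all.idx": {"text/a.txt"},
}

print(sorted_targets(deps, ["index/all.idx", "pages/a.html"]))
```

Sorting the full graph and filtering afterwards is the lazy option; it keeps the ordering correct even when two wanted targets are only connected through files you didn't ask about.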

Now, was all of this maintainable, or was it just kind of insane? I think it ended up somewhat insane, and most importantly, it was an unfamiliar kind of insane. The insanity that you encounter in traditional Makefiles is at least well understood. And treatable.

With redo, you can do almost anything with your build. You can sail the seven seas of your dependency graph. It's awesome. It's also terrifying, because there is very little to guide you, and you may very well be in uncharted waters.

But give it a shot anyway. YMMV.