First, the link: http://wiki.github.com/documentcloud/cloud-crowd
I recently started working for a nonprofit where we're encouraged to open-source our codebase. We're going to be getting into some heavy document processing, and need a reasonably scalable and efficient way to go about it. In practice, this means scripting together different command-line programs (GraphicsMagick, pdftk, Tesseract) across a large number of servers, and gluing it all together with Ruby. CloudCrowd is our first take on a pleasant parallel processing experience. To define an "action", you write a Ruby class that responds to 'process', which handles the portion of the computation that can be parallelized, and optionally to 'split' and 'merge'. The central server and job queue, the worker daemons, the web interface for monitoring, and the automatic retry of failed work units are all handled for you.
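To give a feel for the shape of an action, here is a rough sketch of the split / process / merge idea in plain Ruby. It is not CloudCrowd's actual API: the WordCount class, the run_locally helper, and the form-feed page separator are all made up for illustration, and the helper runs everything serially in one process where the real system would hand work units to worker daemons.

```ruby
# Hypothetical action: counts the words in a document, one page per work unit.
# This is a plain Ruby class, not CloudCrowd's real base class or API.
class WordCount
  # Optional: break one input into independently processable pieces.
  # A form feed ("\f") is assumed to separate pages.
  def split(input)
    input.split("\f")
  end

  # Required: the parallelizable portion, run once per work unit.
  def process(page)
    page.split.size
  end

  # Optional: combine the per-unit results into one final result.
  def merge(results)
    results.sum
  end
end

# Hypothetical stand-in for the central server and workers.
# It runs every work unit serially in the current process.
def run_locally(action, input)
  units = action.respond_to?(:split) ? action.split(input) : [input]
  results = units.map { |unit| action.process(unit) }
  action.respond_to?(:merge) ? action.merge(results) : results
end

document = "first page text\fsecond page has four\flast"
total = run_locally(WordCount.new, document)
puts total # prints 8
```

Since each call to 'process' only sees its own work unit, the calls don't depend on each other, which is what lets them be farmed out to different machines.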
In the wiki (linked above), there's a fairly comprehensive explanation of the architecture. I'd love to hear suggestions, or ideas that could be pilfered from Hadoop or even Grand Central Dispatch to improve it.