Hacker News new | comments | show | ask | jobs | submit login

Maybe containers like Docker are useful for your use case?

Docker can distribute the software needed to run the job well, which is definitely part of the issue and something I should use more.

However, I also have in mind a pipeline of scripts where one script may be a prerequisite to the other. Condor has some nice abstractions for this by organizing the scripts/jobs as a directed acyclic graph according to their dependencies. I was thinking other batch systems might support this as well. But some of my challenge comes in learning how each batch system would run these DAGs. Each one will have some commands to launch jobs, to wait for a job to finish before running some other job, to check if jobs failed, to rerun failed jobs in the case of hardware failure, etc.

It seems like the DAG representation would contain enough detail for any batch system but there may be some nuances. For example, I tend to think of these jobs as a command to run, the arguments to give that command, and somewhere to put stdout and stderr. But Condor also will report information about the job execution in some other log files. Cases like this illustrate where my DAG representation (or at least the data tracked in nodes) might break down, but I haven't used these other systems like PBS enough to know for sure.

Apache Airflow defines and runs DAGs in a sane way, IMO. Takes some configuration, but worth it for more complicated projects.

[Luigi](https://github.com/spotify/luigi) out of Spotify sounds exactly like what you're looking for. It allows you to specify dependent tasks, pipe input/output between them, and more.

You should check out pachyderm [1] for setting up automated data pipelines. Also great for reproducibility.

[1]: http://www.pachyderm.io

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact