As in, I like to get my experiment setup (usually distributed, with many different components interacting with each other) to a point where one command resets all components, starts them in screen sessions on all of the machines with the appropriate timing and setup commands, runs the experiment(s), moves the results between machines, generates intermediate results, and exports publication-ready plots to the right folder.
Upside: once it's ready, iterating on the research part of the experiment is great. No need to focus on anything else anymore: just the actual research problem, and not a single unnecessary click to start something (even 2 clicks become irritating when you do them hundreds of times).
Need another ablation study, or want to explore another parameter or idea? Just change a flag/line/function, kick it off once, and have the plots the next day. No fiddling around.
Downside: full orchestration takes a long time to set up initially, but a bit into my research career I now have tons of utilities for all of this. It has also made me much better at the command line and general setup nonsense.
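To make the idea concrete, here's a minimal sketch of what such a one-command driver could look like. Everything here is invented for illustration (host names, script names, the screen session name); the actual utilities aren't shown in the comment.

```python
#!/usr/bin/env python3
"""Hypothetical one-command experiment driver (illustrative only)."""
import subprocess

HOSTS = ["node1", "node2"]  # made-up machine names

def run_remote(host, cmd, dry_run=False):
    """Run a command on a remote host inside a detached screen session."""
    full = ["ssh", host, f"screen -dmS experiment {cmd}"]
    if dry_run:
        return " ".join(full)  # just report what would run
    subprocess.run(full, check=True)
    return " ".join(full)

def pipeline(dry_run=False):
    """Reset and start every component, in order, on every machine."""
    steps = []
    for host in HOSTS:
        steps.append(run_remote(host, "./reset_components.sh", dry_run))
        steps.append(run_remote(host, "./run_experiment.sh", dry_run))
    # ...then collect results, build intermediate data, export plots
    return steps
```

The point isn't the specific tooling (ssh + screen here), it's that every manual step lives in one script you never have to think about again.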
Do you or the parent or others take steps to abstract away the distributed computing steps so that others may run the pipelines in their distributed computing environments? More specifically, I use Condor (http://research.cs.wisc.edu/htcondor/) but other batch systems like PBS are also popular. Ideally my pipeline would support both systems (and many others).
Ultimately though, I rely on some common environment (such as a particular binary existing in PATH) that lives outside the build system and would need to be recreated by whoever would like to reproduce the calculations. I never looked into abstracting that away with something like Docker.
(I plan to document the build system once it's at least roughly stable and I finish my PhD.)
However, I also have in mind a pipeline of scripts where one script may be a prerequisite to the other. Condor has some nice abstractions for this by organizing the scripts/jobs as a directed acyclic graph according to their dependencies. I was thinking other batch systems might support this as well. But some of my challenge comes in learning how each batch system would run these DAGs. Each one will have some commands to launch jobs, to wait for a job to finish before running some other job, to check if jobs failed, to rerun failed jobs in the case of hardware failure, etc.
It seems like the DAG representation would contain enough detail for any batch system but there may be some nuances. For example, I tend to think of these jobs as a command to run, the arguments to give that command, and somewhere to put stdout and stderr. But Condor also will report information about the job execution in some other log files. Cases like this illustrate where my DAG representation (or at least the data tracked in nodes) might break down, but I haven't used these other systems like PBS enough to know for sure.
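A batch-system-agnostic DAG could be sketched roughly like this: a common job representation (command, args, stdout/stderr, dependencies), plus a per-backend renderer, with an `extras` escape hatch for backend-specific details like Condor's per-job event logs that don't fit the common model. The class and field names are made up; the rendered output uses (simplified) Condor DAGMan syntax.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    """Backend-neutral description of one job in the pipeline."""
    name: str
    command: str
    args: list
    stdout: str
    stderr: str
    deps: list = field(default_factory=list)   # names of prerequisite jobs
    extras: dict = field(default_factory=dict) # backend-specific leftovers,
                                               # e.g. Condor's event log path

def topo_order(jobs):
    """Return jobs in an order where every job follows its prerequisites."""
    by_name = {j.name: j for j in jobs}
    seen, order = set(), []
    def visit(j):
        if j.name in seen:
            return
        seen.add(j.name)
        for dep in j.deps:
            visit(by_name[dep])
        order.append(j)
    for j in jobs:
        visit(j)
    return order

def to_condor_dag(jobs):
    """Render as a simplified Condor DAGMan file; a PBS backend would
    get its own renderer emitting e.g. qsub dependency options."""
    lines = [f"JOB {j.name} {j.name}.sub" for j in jobs]
    for j in jobs:
        for dep in j.deps:
            lines.append(f"PARENT {dep} CHILD {j.name}")
    return "\n".join(lines)
```

The `extras` dict is exactly where the abstraction strains: each backend accumulates its own keys there, and a renderer that ignores them may lose information the user cares about.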
Classic HN followup (at least I hope): What's currently getting in your way or annoying? What problem would you currently like to just disappear?
When talking about reproducibility in science, it's usually about the availability of the original data to verify that the original conclusions were correct, but one level higher there's also computational reproducibility, with its own challenges (freezing the original environment of the experiment).
So-called orphan repositories like Zenodo (i.e. repositories for content that doesn't fit into any other bucket) welcome your articles, datasets, and code.
My components aren't strictly microservices (they're a mixture of open-source components and handwritten tools) and they interact with each other over all sorts of protocols (importing JSON, CSV, gRPC, HTTP), but I essentially treat the configuration flags as their API, so there are no implicit configurations that I could forget about. The rest is just naming things well, e.g. descriptive names for experiment result files etc.
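A small sketch of the flags-as-API idea, assuming Python's argparse; the flag names and the filename scheme are invented for illustration. The point is that every knob is an explicit flag (nothing hidden in config files), and result filenames are derived only from those flags.

```python
import argparse

def make_parser():
    """Every setting is an explicit flag, so the flag set *is* the API.
    (All flag names here are hypothetical.)"""
    p = argparse.ArgumentParser(description="hypothetical experiment component")
    p.add_argument("--input", required=True, help="input CSV/JSON file")
    p.add_argument("--output-dir", required=True, help="where result files go")
    p.add_argument("--grpc-port", type=int, default=50051)
    p.add_argument("--run-name", required=True,
                   help="descriptive name, embedded in every result filename")
    return p

def result_path(args):
    # descriptive result filenames built purely from explicit flags
    return f"{args.output_dir}/{args.run_name}.results.csv"
```

Because nothing is implicit, the full invocation (and hence the experiment configuration) is recoverable from the command line alone, which is exactly what the orchestration script records.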
Initially I thought everyone was doing this, but from talking to PhDs in other domains I noticed a strong bias: people working in any kind of complex distributed setting are far more likely to have these pipelines.
My friends who devise ML models and just test them on datasets on their laptop never had a real need to get a pipeline in place because they never felt the pain points of setting up large distributed experiments.