Hacker News new | comments | show | ask | jobs | submit login

I take enormous pleasure in automating every part of my research pipelines (comp sci).

As in, I like to get my experiment setup (usually distributed and many different components interacting with each other) to a point where one command resets all components, starts them in screen processes on all of the machines with the appropriate timing and setup commands, runs the experiment(s), moves the results between machines, generates intermediate results and exports publication ready plots to the right folder.

Upside: once it's ready, iterating on the research part of the experiment is great. No need to focus on anything else any more, just the actual research problem, not a single unnecessary click to start something (even 2 clicks become irritating when you do them hundreds of times). Need another ablation study/explore another parameter/idea? Just change a flag/line/function, kick off once, and have the plots the next day. No fiddling around.

Downside: full orchestration takes very long initially, but a bit into my research career I now have tons of utilities for all of this. It also has made me much better at command line and general setup nonsense.

Another nice thing about setups like yours is reproducibility. So long as you've got your setup in git and you've stored the flags/lines/functions, you can instantly redo the same experiment.

I agree and I have been working to do this with some of my pipelines as well. One challenge I have been facing is that my compute environment may be quite different than others'. This is mainly the case with respect to distributed computing that seems to be an essential part of the pipelines: I often wish to experiment with multiple hyperparameter settings which creates a lot of processes to run.

Do you or the parent or others take steps to abstract away the distributed computing steps so that others may run the pipelines in their distributed computing environments? More specifically, I use Condor (http://research.cs.wisc.edu/htcondor/) but other batch systems like PBS are also popular. Ideally my pipeline would support both systems (and many others).

I ended up writing a simple distributed build system (https://github.com/azag0/caf, undocumented) that abstracts this away. (I use it at a couple of clusters with different batch systems). You define tasks as directories with some input files and a command, which get hashed and define a unique identity. The build system then helps with distributing these tasks to other machines, execution, and collection of processed tasks.

Ultimately though, I rely on some common environment (such as a particular binary existing in PATH) that lives outside the build system and would need to be recreated by whoever who would like to reproduce the calculations. I never looked into abstracting that away with something like Docker.

(I plan to document the build system once it's at least roughly stable and I finish my Phd.)

Maybe containers like Docker are useful for your use case?

Docker can distribute the software needed to run the job well, which is definitely part of the issue and something I should use more.

However, I also have in mind a pipeline of scripts where one script may be a prerequisite to the other. Condor has some nice abstractions for this by organizing the scripts/jobs as a directed acyclic graph according to their dependencies. I was thinking other batch systems might support this as well. But some of my challenge comes in learning how each batch system would run these DAGs. Each one will have some commands to launch jobs, to wait for a job to finish before running some other job, to check if jobs failed, to rerun failed jobs in the case of hardware failure, etc.

It seems like the DAG representation would contain enough detail for any batch system but there may be some nuances. For example, I tend to think of these jobs as a command to run, the arguments to give that command, and somewhere to put stdout and stderr. But Condor also will report information about the job execution in some other log files. Cases like this illustrate where my DAG representation (or at least the data tracked in nodes) might break down, but I haven't used these other systems like PBS enough to know for sure.

Apache Airflow defines and runs DAGs in a sane way, IMO. Takes some configuration, but worth it for more complicated projects.

[Luigi](https://github.com/spotify/luigi) out of Spotify sounds exactly like what you're looking for. It allows you to specify dependent tasks, pipe input/output between them, and more.

You should check out pachyderm [1] for setting up automated data pipelines. Also great for reproducibility.

[1]: http://www.pachyderm.io

Yes, this comes especially handy when reviewers ask for additional experiments.

This is really interesting, and what I hope all research starts to shape into in the future.

Classic HN followup (at least I hope): What's currently getting in your way or annoying? What problem would you currently like to just disappear?

I actually had to think about this for a few minutes. I suppose having written a number of custom orchestration tools both in local clusters and the cloud, debugging distributed services is still incredibly tedious.

It's great that you do that, thank you. Do you also publish your data, programs and setup to allow others to reproduce/build on your research?

When talking about reproducibiluty in science, it's usually about the availability of original data to verify that the original conclusions were correct, but one level higher there's also computational reproducibility with its own challenges (freezing the original environment of the experiment).

So-called orphan repositories (i.e. content doesn't fit into any other bucket) like Zenodo welcome your articles, datasets and code.

I do this as well (research in computational chemistry.) Incredibly useful once set up, plus forcing oneself to think about a project in abstract terms to automate it can give valuable perspective.

This. I do the same, although these days I've narrowed down my scope somewhat to stuff I _must_ use rather than my old Ansible/Hadoop stack ensemble.

How much time does a full reset take?

A few minutes - killing processes on one machine, restarting a database on another machine/cluster, reloading a schema, importing data (for some experiments), deploying a new service, loading tensorflow models, warming up benchmark clients. I found that the key of good research pipelines is having a really consistent way of passing around configurations between components.

My components arent strictly microservices (a mixture of open source components and handwritten tools) and they interact in all sorts o fprotocols with each other (importing jsons, csv, GRPC, HTTP), but I essentially treat the configuration flags as their API, so there are no implicit configurations that I could forget about. The rest is just naming things well, e.g. descriptive names for experiment result files etc.

Initially I thought everyone is doing that, but from talking to PhDs in other domains I noticed that there is a strong bias towards people working in any kind of complex distributed setting having these pipelines.

My friends who devise ML models and just test them on datasets on their laptop never had a real need to get a pipeline in place because they never felt the pain points of setting up large distributed experiments.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact