
Supercomputer leaders come together on new open-source framework - CrankyBear
http://www.zdnet.com/article/supercomputer-leaders-come-together-on-new-open-source-framework/
======
batbomb
I work on the astronomy side of these things (job workflow management), and
it's kind of frustrating to develop against a supercomputer. We're still stuck
with schedulers and batch systems as opposed to provisioning.

If you have a job and it just needs one core (your code is embarrassingly
parallel), that's all well and good, except you can't just get one core; you
get 16 or 24 as the smallest unit. That's fine, except now you end up writing
a task queue, and a system which spins up new workers whenever the old ones
are killed due to wall clock time. But most of these supercomputing centers
aren't too keen on having services running locally, so now your task queue is
a remote system, and you've got to do clever things wherever you can to make
all this work the way you want it to (leverage any remote APIs, write SSH
tunnel wrappers, etc.). Now you've got another batch system on top of a batch
system.
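
In practice the worker end of that is something like the sketch below. This is
a rough illustration, not anyone's production code: the queue host, key name,
and walltime numbers are made up, and it assumes a Redis instance that the
center lets compute nodes reach.

    # Pilot worker: runs under the scheduler, pulls tasks from a remote
    # queue, and exits cleanly before its wall-clock allocation runs out.
    import subprocess
    import time

    import redis  # pip install redis

    QUEUE = "tasks"               # hypothetical queue key
    WALLTIME = 24 * 3600          # seconds granted by the scheduler
    MARGIN = 10 * 60              # stop taking work this close to the limit

    def main():
        start = time.time()
        conn = redis.Redis(host="queue.example.org", port=6379)
        while time.time() - start < WALLTIME - MARGIN:
            # Block for up to 30 s waiting for the next task.
            item = conn.blpop(QUEUE, timeout=30)
            if item is None:
                continue
            _, payload = item
            # Each task is just a shell command in this sketch.
            subprocess.run(payload.decode(), shell=True, check=False)
        # Exiting before the kill lets a freshly submitted pilot take over.

    if __name__ == "__main__":
        main()

You then submit a pile of these as ordinary batch jobs and let them drain the
queue until their wall clock runs out.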

The solution to this at the system level is to also provide serial queues
which schedule one job to one core. But if too many serial jobs are in the
queue, especially if they aren't confined to a subset of the nodes, the true
MPI jobs will struggle to get scheduled. This isn't too much of an issue so
long as everyone has a wall clock limit, but it's definitely not ideal.

Ideally, you'd provision your machines, run a few services, keep them as long
as you need them, do a bunch of shit, and then release them. But it doesn't
work that way.

More pain comes along should you need to stage massive amounts of data in and
out.

~~~
cing
My lab in computational structural biology has had to develop software to deal
with similar challenges using HPC resources. One PhD student in our lab
probably spent a year writing a client-server application to execute hundreds
of compute jobs using the system's job scheduler (instead of working on actual
problems in biology). Due to restrictions in the HPC environment, the code had
to be fault-tolerant to both the clients and the server being killed when the
wall clock exceeded 24 hours. As you might expect, the student had no
experience with distributed computing, so the "software" is actually thousands
of lines of incomprehensible Bash and C code.
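
The heart of that fault tolerance is usually just a resumable journal: record
each finished job, and when the scheduler kills everything at 24 hours, the
resubmitted run skips what's already done. A rough sketch of the idea (file
names, job IDs, and run_job are made up for illustration):

    import os
    import subprocess

    JOBS = [f"job-{i:04d}" for i in range(500)]   # hypothetical job IDs
    JOURNAL = "completed.log"

    def load_completed():
        if not os.path.exists(JOURNAL):
            return set()
        with open(JOURNAL) as f:
            return set(line.strip() for line in f)

    def run_job(job_id):
        # Stand-in for the real compute job.
        subprocess.run(["echo", "running", job_id], check=True)

    def main():
        done = load_completed()
        with open(JOURNAL, "a") as journal:
            for job_id in JOBS:
                if job_id in done:
                    continue                  # finished in an earlier run
                run_job(job_id)
                journal.write(job_id + "\n")  # record success as we go
                journal.flush()               # survive being killed mid-run

    if __name__ == "__main__":
        main()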

I do like the sound of an open-source framework for HPC, though; I run
computations across dozens of national supercomputers and they're all very
different.

~~~
lenish
I don't have access to an environment like this, but wasn't GNU Parallel
designed for job scheduling across clusters? I'm not sure how tolerant it is
of having the processes on the remote nodes killed, since I've never had to
deal with that. I'm curious how much of the problem it solves, though.

