
I would think that getting simulation time on a top-ranked supercomputer would be something you would schedule months, if not years, in advance. It's hard to imagine that access to a multi-petaflop machine wouldn't be highly desirable (and useful) for many, many researchers, let alone all the graduate students who spend months on lower-tier computers churning out simulations...

I'd be interested to hear from someone who knows or has experience whether any of the top 100 supercomputers ever have idle windows.



I worked on a system that was at one time #3 on the list. There were no substantial idle periods. If the machine was out of maintenance, jobs were being run on it. Depending on the internal structure of the machine, some parts of it might be idle as the scheduler drained other parts to allow space for larger jobs to start, but I wouldn't call it idle.

Having done HPC exclusively for almost 16 years now, I can state that there is one and only one immutable law of HPC: grad students will find a way to squeeze out every second of available CPU time on a large system. A tremendous amount of effort goes into taking advantage of the empty spots that open up while the scheduler drains resources for a large job, so that smaller jobs can run quickly in the gaps (this is called backfilling).
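
Roughly, the core idea looks like this (a toy sketch with made-up names; real schedulers like Slurm's backfill plugin also juggle per-node reservations, priorities, fairshare, and wildly unreliable user runtime estimates):

    from collections import namedtuple

    Job = namedtuple("Job", ["name", "nodes", "est_hours"])

    def backfill(queue, idle_nodes, hours_until_big_job_starts):
        """Start small queued jobs that fit in the hole left while the
        scheduler drains nodes for the large job at the head of the queue,
        without delaying that job's reserved start time."""
        started = []
        for job in queue:
            if job.nodes <= idle_nodes and job.est_hours <= hours_until_big_job_starts:
                idle_nodes -= job.nodes
                started.append(job)
        return started

    queue = [Job("grad-student-md", 4, 2.0), Job("post-proc", 1, 0.5), Job("wide-cfd", 512, 12.0)]
    print(backfill(queue, idle_nodes=16, hours_until_big_job_starts=3.0))
    # grad-student-md and post-proc slot into the gap; wide-cfd waits for the drain.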

The thing that sticks with me is what a fantastically complicated problem HPC job scheduling is. I've seen dozens of fresh-faced undergrads or first year grad students come in with a full head of steam and decide that they're going to "solve" HPC job scheduling, but every single one has gotten bogged down in the minutiae and the muck, and I've never seen software that "gets it right". Every scheduler sucks in its own unique way.

Whether all of those jobs need an old-fashioned supercomputer is certainly up for debate, but they certainly get used, at least in my experience.


>> grad students will find a way to squeeze out every second of available CPU time on a large system

>> they certainly get used, at least in my experience

As a taxpayer, this made me feel good, thank you for sharing.


Interesting. Do you feel that the lack of idle periods was because the machine was being used for tasks that truly demanded that compute power and could not be performed without it?

Or was it a combination of that with "well, it's there, and we want to do this thing that would happen quicker on the big machine"? I.e., not necessarily enabling new research, just adding a nice speedup to existing work.


You don't generally build a large supercomputer simply to run a task faster than on a smaller cluster. The goal is to enable more realistic simulations by solving problems with a greater number of unknowns, increasing the fidelity of the output.

So no, these machines are often used for problems which really do utilize the large-scale nature of the computers.


I agree with tanderson92. Whether the actual science from all the simulations is "worthwhile" is for people much smarter than me to judge, but to get time on a "leadership"-class machine, you have to prove in your application that (a) it will run properly on the hardware and (b) it's science you couldn't get done on smaller machines.

So a lot of the "big data"-type problems, with lots of uncoordinated jobs, aren't going to take advantage of the primary component of these big computers: the node interconnect. The machines really shine on big fluid-dynamics simulations where there's constant chatter between different parts of the simulation to communicate boundary conditions and the like.
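
That chatter looks roughly like this per-time-step neighbour exchange (a minimal mpi4py sketch with a hypothetical 1-D domain split; real CFD codes swap multi-dimensional ghost layers, which is exactly the traffic the interconnect is built for):

    # run with e.g.: mpirun -n 4 python halo.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local = np.full(1000, float(rank))                     # this rank's slab of the domain
    left  = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    ghost_left, ghost_right = np.empty(1), np.empty(1)
    # Swap edge values with both neighbours every time step
    # (exchanges with PROC_NULL at the domain ends are no-ops).
    comm.Sendrecv(sendbuf=local[-1:], dest=right, recvbuf=ghost_left,  source=left)
    comm.Sendrecv(sendbuf=local[:1],  dest=left,  recvbuf=ghost_right, source=right)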



