Here is my argument: when I worked for DOE, everybody told me I had to run my MD simulations on a supercomputer using all the processors, and I would be judged on my parallel efficiency. This meant using a code that used MPI to communicate at every timestep (or every N timesteps). I asked, instead, "Why not just run N independent simulations, and pool the results?" In this case, you run an M-thread simulation on each machine (where M = number of cores on the machine) with no internode communication at all except to read input files and write output files.
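Concretely, the per-node launch is nothing more exotic than this sketch (run_md, the file layout, and the job-array wiring are all made-up placeholders, not any particular MD engine):

    # Sketch: each node runs one M-thread simulation to completion; the only
    # "communication" is reading inputs and writing outputs. run_md and the
    # input/output paths are hypothetical placeholders.
    import os
    import subprocess
    import sys

    def run_replica(replica_id: int, n_threads: int) -> None:
        env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
        subprocess.run(
            ["run_md",
             "--input", f"inputs/replica_{replica_id}.in",
             "--output", f"outputs/replica_{replica_id}.traj"],
            env=env,
            check=True,
        )

    if __name__ == "__main__":
        # Invoked once per node (e.g. from a job array); M = cores on the box.
        run_replica(int(sys.argv[1]), os.cpu_count())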
The short answer is that approach works just fine, but the DOE supercomputer people won't let you run embarrassingly parallel codes because they already spent the money on the interconnect to run tightly coupled codes.
In response to this, I went to Google, built Exacycle (loosely coupled HPC), and published this well-cited paper: http://www.ncbi.nlm.nih.gov/pubmed/24345941 which in my opinion put the last nail in the coffin of DOE-style physics simulations for molecular dynamics.
That said, there are systems which are so large you can't practically simulate a single instance of the system on a single machine, so you have to partition. Simulating the ribosome is a nice example. However, simulating the ribosome currently provides no valuable scientific data except to tell us that we have major problems with our simulation systems (force field errors, missing QM, electrostatic approximations, etc.).
Interesting! Would it be accurate to say that as the amount of computing power and memory per CPU has increased over the years, so also has the percentage of scientific problems where a single simulation instance will fit on a single CPU? Certainly if you can do so, it's more efficient (in both machine and human resources) to partition by one job per CPU.
Yes, for example when I did my PhD work ~2001 with a T3E, I could run a simulation of a duplex DNA in a box of water by running it in parallel. This was true for both memory and CPU reasons. It limited me to studying a single sequence at a time, or 2-3, which was the practical limit on the number of concurrent jobs. This used the well-balanced design of the T3E, which had a great MPI system.
Eventually it reached the point (~2007) where I could fit the whole simulation on a single 4-core Intel box with similar performance. Then, I ran one "task" per machine, and scaled to the number of available machines. This uses only intra-node communication, which goes over a hub or crossbar on the motherboard. Much faster.
Now, I can fit many copies of DNA on a single machine (one task per core). This is far and away the best, because each processor just accesses its own memory, greatly reducing motherboard traffic, so the problem is basically CPU-bound instead of communication-bound (this also now applies to GPUs: a single GPU can run one large simulation within its own RAM and not have to spill data back and forth over the CPU/GPU communication path).
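As a sketch of "one task per core" (simulate here is a hypothetical stand-in for a real single-threaded MD kernel, not anything I actually shipped):

    # Sketch: a pool with one worker per core, each running an independent
    # copy of the system on its own memory. simulate() is a placeholder for
    # a single-core MD engine.
    from multiprocessing import Pool
    import os

    def simulate(copy_id: int) -> str:
        out = f"outputs/traj_{copy_id}.dat"
        # ... run the single-core engine here, seeded per copy ...
        return out

    if __name__ == "__main__":
        n_cores = os.cpu_count()
        with Pool(processes=n_cores) as pool:
            trajectories = pool.map(simulate, range(n_cores))
        print(trajectories)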
This moves the challenge to the I/O subsystem: I generate so much simulation data that I need a fat MapReduce cluster to analyze the trajectories.
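The analysis has the same embarrassingly parallel shape, schematically a map over trajectories and a reduce over per-trajectory statistics (the one-float-per-line file format and pooled mean below are made up for illustration):

    # Schematic map/reduce over trajectories: map computes a per-trajectory
    # (sum, count), reduce pools them into one mean. The file format is a
    # made-up placeholder.
    from functools import reduce
    from multiprocessing import Pool

    def partial_mean(path: str) -> tuple[float, int]:
        with open(path) as f:
            values = [float(line) for line in f]
        return (sum(values), len(values))

    def combine(a: tuple[float, int], b: tuple[float, int]) -> tuple[float, int]:
        return (a[0] + b[0], a[1] + b[1])

    if __name__ == "__main__":
        paths = [f"outputs/traj_{i}.dat" for i in range(64)]
        with Pool() as pool:
            total, count = reduce(combine, pool.map(partial_mean, paths))
        print("pooled mean:", total / count)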
none of this is news - what you're describing is really just strong scaling. and sure, most systems already have subsets of nodes set aside for post-simulation cleanup.
I'm not just describing strong scaling. I'm describing a cost-effective way to achieve it; that's what really matters.
Why have subsets of nodes for post-simulation cleanup? Why not just run that cleanup on the same nodes you used for simulation? Or other general nodes? Otherwise, you've got two sets of nodes, each running at lower utilization than a single shared pool would.
I know some people in the life sciences who were strongly encouraged to get Titan time. When they applied and presented ORNL with their embarrassingly parallel code, they were told to go away.
yes, precisely my point. If I wanted to run BLAST by partitioning it to run embarrassingly parallel, they wanted me to use mpiBLAST, but mpiBLAST isn't actually any better for any real-world workload.
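(The partitioned version is just query splitting, something like the sketch below; the chunk paths and the "nt" database are placeholders, and I'm assuming the standard NCBI BLAST+ blastn command line.)

    # Sketch of embarrassingly parallel BLAST: run an independent blastn per
    # query chunk, then concatenate the hits. Paths and database name are
    # placeholders; assumes NCBI BLAST+ is installed.
    import subprocess
    from multiprocessing import Pool

    def blast_chunk(chunk_path: str) -> str:
        out = chunk_path + ".hits"
        subprocess.run(
            ["blastn", "-query", chunk_path, "-db", "nt", "-out", out],
            check=True,
        )
        return out

    if __name__ == "__main__":
        chunks = [f"queries/chunk_{i}.fa" for i in range(16)]
        with Pool() as pool:
            results = pool.map(blast_chunk, chunks)
        with open("all_hits.txt", "w") as fout:
            for path in results:
                with open(path) as fin:
                    fout.write(fin.read())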
This is because 50% of the cost of the machine was the interconnect, and if they let those codes run, it means they wasted budget and will get less next time.
Until I hear that the funders/builders are spending the same amount of budget on machines that let biologists run embarrassingly parallel codes as they spend on TOP500 machines, it's not going to change.