How a Supercomputer Helps Fight Flies That Ravage Crops for 700M People (gizmodo.com)
15 points by MarlonPro on Oct 20, 2015 | 9 comments



I'm puzzled why you'd need a supercomputer to build a phylogenetic tree. The economical speedups for phylogenetic tree generation on supercomputers top out at around 64 cores, which doesn't justify a supercomputer. Further, supercomputers usually have deep queues because they're so heavily oversubscribed. Instead, you'd just buy some time on a cloud machine with a ton of cores and RAM, use it when you wanted, and shut it down when you didn't need it.


Oversubscription is not a technical hurdle, and your "cloud machine with a ton of cores" is just a supercomputer with a shitty interconnect.

I'd be interested in how you arrived at a 64 core threshold. There's more than one way to skin that particular cat.


Sure, oversubscription is not a technical hurdle; it's a practical consequence of supercomputers having to run at close to 100% utilization to justify their expenditure. You can wangle a higher priority so your jobs run faster, but that just means somebody else has to suffer.

As for the 64-core threshold, I'm referring to the publications for MPI phylogenetic codes (I'm inferring this is what they ran, as I can't find their experimental details), because supercomputer groups don't normally let people run embarrassingly parallel codes on fast interconnect boxes.

Phylogenetic tree generation doesn't need a fast interconnect - the best methods use coarse-grained parallelism, meaning you can trivially scale the tree evaluation.
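
Roughly what I mean by coarse-grained, as a toy sketch (the scoring function and replicate count are made-up stand-ins, not any real package's API) -- each worker evaluates its replicate independently, so there's essentially no inter-node traffic:

  # Toy sketch of coarse-grained phylogenetic parallelism: score each
  # bootstrap replicate (or candidate tree) independently, then combine.
  # score_replicate is a made-up stand-in for a real likelihood evaluation.
  from multiprocessing import Pool
  import random

  def score_replicate(seed):
      rng = random.Random(seed)
      # pretend this is a likelihood evaluation over one resampled alignment
      return sum(rng.random() for _ in range(100_000))

  if __name__ == "__main__":
      with Pool() as pool:              # could just as well be 1000 cloud VMs
          scores = pool.map(score_replicate, range(200))
      print(max(scores))                # pick the "best" replicate in this toy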


Sounds like none of these are reasons to avoid a supercomputer, if you've got the grant to provide the compute time. You're kind of arguing both sides of the same coin here -- both that there's no reason to run the job wide, and that it's hard to run wide jobs on public-science installations.

Reading her past publications, she seems to keep her jobs at reasonable sizes (around 10k cores in her latest publication). I can see using <30% of the system for such a short run (less than a week) as something that would be challenging for, say, an early-career researcher to lock in. But as you said, this is a job that lends itself extremely well to being broken up into smaller batches of jobs, so that should be easy to do -- especially as more and more supercomputer-class platforms open up at more and more computational science departments.

She seems strangely fond of Aries, incidentally, so she, at least, thinks the interconnect is important. I personally don't (in this specific task) but I'd be interested in knowing why she does. Her TED talk is, sadly and unsurprisingly, light on technical information.


I checked -- the code they're running is ExaBayes.

In my experience, centers like this are under pressure to find biologists to justify their systems, but most of the bio users don't have codes that scale very much. The "supercomputer" she's running on isn't very super.

Anyway, I used to be in that class of user, but I found that supercomputers never had the throughput to carry out real science. After arguing with DOE that they spent too much money on interconnect (and not really being able to run my codes because they "only" scaled to 1K cores), I moved to Google and built a system called Exacycle which runs embarrassingly parallel codes (of which phylo tree generation is an instance; it's basically Metropolis Monte Carlo sampling, although the MPI versions typically do some clever tricks to create long MC chains efficiently).

Personally, I think our approach is far superior for all but the most tightly coupled codes -- for example, our work doing huge collections of MD simulations of proteins generated far more useful data to analyze than if you'd run just one long trajectory (this is a long-running argument in the MD field, but I think we pretty much nailed it in our paper: http://www.ncbi.nlm.nih.gov/pubmed/24345941).
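
To make the "embarrassingly parallel" point concrete, here's a toy Metropolis sampler run as many independent chains and pooled afterwards -- nothing to do with ExaBayes' internals, just the general pattern (the Gaussian target is a stand-in for a real tree posterior):

  # Toy Metropolis sampling of a standard normal: every chain is independent,
  # so thousands of them can run with zero inter-node communication.
  import math, random
  from multiprocessing import Pool

  def run_chain(seed, steps=50_000):
      rng = random.Random(seed)
      x, samples = 0.0, []
      for _ in range(steps):
          proposal = x + rng.gauss(0.0, 1.0)
          # accept with probability min(1, p(proposal)/p(x)), p(x) ~ exp(-x^2/2)
          log_accept = (x * x - proposal * proposal) / 2.0
          if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
              x = proposal
          samples.append(x)
      return samples

  if __name__ == "__main__":
      with Pool() as pool:               # or one chain per cheap cloud node
          chains = pool.map(run_chain, range(64))
      pooled = [s for chain in chains for s in chain]
      print(sum(pooled) / len(pooled))   # should come out close to 0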

Frankly, I don't see many computational science depts offering more supercomputer-like services. It's hard to run these systems, they cost a bunch, and they are hard to keep utilized at a rate that justifies their expense. Commercial clouds are eating away at them -- as you say, they are basically computers with a shitty interconnect. But the reality is that their interconnect is "good enough" for nearly all bio codes I know of. It's far easier for a PI these days to pull out their credit card and pay for grad student cloud use out of pocket than to spend time applying for and using supercomputers.


I'm glad Exacycle exists, but the tasks at which it excels are indeed not tasks for which supercomputers are usually employed. The things that are easy for a PI to pay for with Visa are generally not the things that are interesting to do with supercomputers. I see where you're coming from with the DOE stuff, though; those guys are all about high-visibility "status" computing, and I don't really find that scene appealing either.

I'm not sure what you consider a supercomputer, if you don't consider a petaflops machine with a 3PB filesystem and >50Gb/s i/o "supercomputing." I guess we'll just have to disagree.


I used to work for NERSC when they ran Seaborg. Before that, I ran my PhD simulations on the T3E Mjolnir (PSC) and on another machine at SDSC whose name I forget. Back in those days, those were supercomputers, but a single cloud machine replaces one of those now.

The only thing I consider a supercomputer is a device designed for capability computing, where the capability is the ability to run codes that exhibit strong scaling thanks to the interconnect. The petaflops you quote are calculated from LINPACK, which scales strongly by design. I think you need a large number of petaflops these days to really be considered a supercomputer, and those petaflops come from having a combination of low latency and high bandwidth -- you couldn't just point at Exacycle and say "that's a supercomputer".

The 3PB filesystem and 50Gb/s IO (you didn't say whether that was per-node, or the bisection bandwidth of the central switch, or what) aren't particularly "super".

In some sense this is just arguing about the definition of a supercomputer, which isn't super interesting; I have a preconceived notion of what one is. If Amazon put a bunch of machines on an MPI switch (IIRC they actually have) and scored well in the Top500, and you could buy time on it, and it scaled your code, they'd have one too.


Why "strangely fond"?


Aries has specific limitations, such as poor TCP emulation and reliability concerns. Cray is working hard to ameliorate them, and some of them can be worked around, but there are those who do not like Aries, and they tend to be loud about it. You don't tend to see people strongly praising it (aside from Cray salespeople).



