That's the Sentinel system. I worked on it when I was at Cray, and we did some covid stuff[1][2] with a researcher at UAH. We accelerated a docking code using some cool tech I created (in Perl, so there!) and some mods my teammates did to the queuing system.

The work won some award at SC20[3] (fka Supercomputing conference). I had considered submitting for the Gordon Bell prize, which had been specifically requesting covid work, though I thought the stuff we had done wasn't terribly sexy. We were getting ~250-500x better performance than single CPU runs.

Looking back over these, I gotta chuckle, as this (press releases) is pretty much the only time I'm called "Dr.". :D

Back to the OP's points, they are right. In most cases, cloud doesn't make sense for traditional HPC workloads. There are some special cases where it does; those tend to be large, ephemeral analysis pipelines, as in bioinformatics and related fields. But for hardcore distributed (mostly MPI) code, running for a long time on a set of nodes interconnected with low-latency networks, dedicated local nodes are the better economic deal.
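
(If you haven't lived with this stuff: the usual way people quantify that interconnect difference is a tiny ping-pong microbenchmark. A rough sketch in plain C + MPI below, nothing vendor specific; the message size and iteration count are arbitrary, just enough to show the shape of the measurement.)

    /* ping_pong.c - rough MPI latency probe between ranks 0 and 1.
       Build: mpicc -O2 ping_pong.c -o ping_pong, run with 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 10000;   /* arbitrary iteration count */
        char buf[8];               /* tiny message: latency-bound */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("half round-trip latency: %.2f us\n",
                   (t1 - t0) / iters / 2 * 1e6);
        MPI_Finalize();
        return 0;
    }

On a good InfiniBand fabric you expect low single-digit microseconds here; over plain TCP between cloud instances it's typically tens of microseconds or worse, and tightly coupled codes feel every bit of that.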

During my stint at Cray, I was trying (quite hard) to get supercomputers, real classical ones, into cloud providers, or to become a supercomputing cloud provider ourselves. The Met Office system in Azure is a Cray Shasta, but that was more of a special case. I couldn't get enough support for this.

Such is life. I've moved on. Still doing HPC, but more throughput maximized.

[1] https://www.uah.edu/science/departments/math/news/14954-uah-...

[2] A whole marketing writeup was done here https://www.hpe.com/us/en/newsroom/journey-to-accelerate-dru... . I tried very hard to correct the errors in the writeups. Sadly I wasn't successful.

[3] https://baudry-lab.uah.edu/news#h.121c63ayp0k0




The Azure Met Office win left me very conflicted. As someone who is relatively positive about cloud adoption for science, it was good to see some forward thinking. On the other hand, what I've heard about how the procurement was run, plus my taxpayer-based views on where critical national infrastructure should be housed, makes me rather less happy about the outcome.


So in Azure it's possible to get access to an InfiniBand cluster somehow? Bare metal?


I don't speak for them (never have), but I believe it to be possible. MSFT do a number of things right (and a few really badly wrong), but you can generally spin up a decent bare metal system there. IO is going to be an issue with any cloud; real performance will cost you. Between that and networking, clouds could potentially throw in the compute for free ...

Reminds me of a quip I made back in my SGI-Cray (1st time) days. A Cray supercomputer (back then) was a bunch of static RAM that was sold, along with a free computer ... Not really true, but it gave a sense of the costs involved.

This said, Azure had (last I checked) real Mellanox networking kit for RDMA access. At Cray we placed a cluster in Azure for an end user (who shall remain nameless), and used several of Mellanox's largest switch frames for 100G InfiniBand across > 1k nodes, each with many V100 GPUs. The unit would have been in the mid single digits on the Top500 list that year.

AWS is doing their own thing network-wise. Not nearly as good from a performance standpoint (latency or bandwidth) as the Mellanox kit. I don't know if Google Cloud is doing anything beyond TCP.
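
(If you want to sanity-check what a provider actually gives you, before trusting the marketing, the simplest thing is to enumerate verbs devices with libibverbs from inside an instance. A rough sketch, assuming the rdma-core headers are installed; nothing Azure- or AWS-specific about it.)

    /* list_rdma.c - list RDMA (verbs) devices visible on this node.
       Build: gcc list_rdma.c -o list_rdma -libverbs */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void) {
        int n = 0;
        struct ibv_device **devs = ibv_get_device_list(&n);
        if (!devs || n == 0) {
            printf("no RDMA devices found (TCP-only instance?)\n");
            return 1;
        }
        for (int i = 0; i < n; i++)
            printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));
        ibv_free_device_list(devs);
        return 0;
    }

If that comes back empty, you're on the TCP path no matter what the instance page says.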

You can do bare metal at most/all of these. You can do some version of NVMe/local disk at all of them. Some/most let you spin up a parallel file system (network charges, so beware), either their own Lustre flavor, or one of BeeGFS, Weka, etc.



