That's the Sentinel system. I worked on it when I was at Cray, and we did some covid stuff[1][2] with a researcher at UAH. We accelerated a docking code using some cool tech I created (in Perl, so there!) and some mods my teammates did to the queuing system.

The work won some award at SC20[3] (fka Supercomputing conference). I had considered submitting for the Gordon Bell prize, which had been specifically requesting covid work, though I thought the stuff we had done wasn't terribly sexy. We were getting ~250-500x better performance than single CPU runs.

Looking back over these, I gotta chuckle, as this (press releases) is pretty much the only time I'm called "Dr.". :D

Back to the OP's points, they are right. In most cases, cloud doesn't make sense for traditional HPC workloads. There are some special cases where it does; those tend to be large, ephemeral analysis pipelines, as in bioinformatics and related fields. But for hardcore distributed (mostly MPI) code, running for a long time on a set of nodes interconnected with low-latency networks, dedicated local nodes are the better economic deal.
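
(If you haven't lived with this stuff: the usual way people quantify that interconnect difference is a tiny ping-pong microbenchmark. A rough sketch in plain C + MPI below, nothing vendor specific; the message size and iteration count are arbitrary, just enough to show the shape of the measurement.)

    /* ping_pong.c - rough MPI latency probe between ranks 0 and 1.
       Build: mpicc -O2 ping_pong.c -o ping_pong, run with 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 10000;   /* arbitrary iteration count */
        char buf[8];               /* tiny message: latency-bound */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("half round-trip latency: %.2f us\n",
                   (t1 - t0) / iters / 2 * 1e6);
        MPI_Finalize();
        return 0;
    }

On a good InfiniBand fabric you expect low single-digit microseconds here; over plain TCP between cloud instances it's typically tens of microseconds or worse, and tightly coupled codes feel every bit of that.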

During my stint at Cray, I was trying (quite hard) to get supercomputers, real classical ones, into cloud providers, or to become a supercomputing cloud provider ourselves. The Met Office system in Azure is a Cray Shasta, but that was more of a special case. I couldn't get enough support for this.

Such is life. I've moved on. Still doing HPC, but more throughput maximized.

[1] https://www.uah.edu/science/departments/math/news/14954-uah-...

[2] A whole marketing writeup was done here https://www.hpe.com/us/en/newsroom/journey-to-accelerate-dru... . I tried very hard to correct the errors in the writeups. Sadly I wasn't successful.

[3] https://baudry-lab.uah.edu/news#h.121c63ayp0k0




The Azure Met Office win left me very conflicted. As someone who is relatively positive about cloud adoption for science, it was good to see some forward thinking. On the other hand, what I've heard about how the procurement was run, plus my taxpayer-based views on where critical national infrastructure should be housed, makes me rather less happy about the outcome.


So in Azure it's possible to get access to an InfiniBand cluster somehow? Bare metal?


I don't speak for them (never have), but I believe it to be possible. MSFT do a number of things right (and a few really badly wrong), but you can generally spin up a decent bare metal system there. IO is going to be an issue with any cloud; real performance will cost you. Between that and networking, clouds could potentially throw in the compute for free ...

Reminds me of a quip I made back in my SGI-Cray (1st time) days. A Cray supercomputer (back then) was a bunch of static RAM that was sold, along with a free computer ... Not really true, but it gave a sense of the costs involved.

This said, Azure had (last I checked) real Mellanox networking kit for RDMA access. At Cray we placed a cluster in Azure for an end user (who shall remain nameless), and used several of Mellanox's largest switch frames for 100G InfiniBand across > 1k nodes, each with many V100 GPUs. The unit would have been in the mid single digits on the Top500 list that year.

AWS is doing their own thing network-wise. Not nearly as good from a performance standpoint (latency or bandwidth) as the Mellanox kit. I don't know if Google Cloud is doing anything beyond TCP.
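
(If you want to sanity-check what a provider actually gives you, before trusting the marketing, the simplest thing is to enumerate verbs devices with libibverbs from inside an instance. A rough sketch, assuming the rdma-core headers are installed; nothing Azure- or AWS-specific about it.)

    /* list_rdma.c - list RDMA (verbs) devices visible on this node.
       Build: gcc list_rdma.c -o list_rdma -libverbs */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void) {
        int n = 0;
        struct ibv_device **devs = ibv_get_device_list(&n);
        if (!devs || n == 0) {
            printf("no RDMA devices found (TCP-only instance?)\n");
            return 1;
        }
        for (int i = 0; i < n; i++)
            printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));
        ibv_free_device_list(devs);
        return 0;
    }

If that comes back empty, you're on the TCP path no matter what the instance page says.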

You can do bare metal at most/all of these. You can do some version of NVMe/local disk at all of them. Some/most let you spin up a parallel file system (network charges, so beware), either their own Lustre flavor, or one of BeeGFS, Weka, etc.



