
As someone not involved in HPC at all, but who hears things about it occasionally: is there really anything that defines a supercomputer these days? Or is a supercomputer basically just a whole bunch of servers with low-latency networking running MPI?



Supercomputers can be differentiated from server farms in three ways:

1. Performance, measured in FLOPS. Think of a supercomputer as a gigantic graphics card. The bandwidth between the nodes is incredible, and the network topology is different.

2. Problem type: one with lots of interaction between the partial solutions.

3. Problem-solving capability instead of problem-solving capacity.

Supercomputer design is whole-system design for the capability to solve one huge problem in a short time. With just a 'bunch of servers,' you add more servers and get more capacity to solve a larger number of limited-size problems.

A supercomputer usually has a maximum design problem size, and every aspect of the system is optimized and balanced together to reach that maximum. Even if the supercomputer is sold in pieces and can be extended, extensions scale only toward this design maximum, where the capability peaks. Installing two maxed-out supercomputers side by side does not double the capability for most problems.

In this way a supercomputer is like a single computer: you can add components until you reach its limits, and if you need to solve bigger problems, you build a bigger supercomputer from scratch.


I guess I can imagine how the topology can limit expansion after a certain point. What sort of speedup are we talking about, though? Say I have some HPC code that takes a week to run on a purpose-built supercomputer. If I threw a bunch of servers together, with the appropriate hardware to do RDMA or whatever's needed, how much slower are we talking? A 10x slowdown? 100x? I realize that's probably wildly under-specified; I'm just trying to get a sense of the scale of the difference.


I'll give a concrete example. Around 2000 I was running molecular dynamics simulations. The goal is to get the longest trajectory you can in a reasonable time. You can add more processors and speed up the simulation, but at some point adding more processors doesn't speed things up unless you can also speed up the network, because there is a computational barrier you can't cross until the network delivers some data to the other nodes (btw, this is the same as allreduce in Horovod).
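
To make the pattern concrete, here is a minimal MPI sketch of that communication step (not my actual MD code, just its shape): each rank computes a partial quantity and then blocks in an allreduce until the network has combined and redistributed the result.

    /* Minimal sketch of the step described above: each rank computes a
     * partial result (say, its contribution to the total energy), then
     * every rank blocks in MPI_Allreduce until the network delivers the
     * combined value. A slow network stalls the whole simulation here.
     *
     * Build: mpicc allreduce_sketch.c -o allreduce_sketch
     * Run:   mpirun -np 8 ./allreduce_sketch
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Stand-in for this rank's locally computed partial result. */
        double local_energy = 1.0 + rank;
        double total_energy = 0.0;

        /* The step no rank can get past until the network has moved
         * the data. */
        MPI_Allreduce(&local_energy, &total_energy, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total energy across %d ranks: %f\n", nprocs, total_energy);

        MPI_Finalize();
        return 0;
    }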

I had access to the fastest supercomputer at the time (a T3E with 64 nodes) as well as a small Linux cluster.

The speedup I saw on the supercomputer was about 60x on 64 nodes; on my Linux cluster it was about 4x on 8 nodes. The gap was entirely due to the slow network (10 Mbit Ethernet with TCP) and congestion (allreduce hammers the network). However, I didn't have to wait a week for my job to start running, and I could add more nodes and run different jobs in parallel.
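
To put those numbers side by side: treating everything that doesn't scale (mostly the allreduce over the network) as a serial fraction s, Amdahl's law gives speedup(p) = 1 / (s + (1 - s)/p). Backing s out of the measurements, 4x on 8 nodes implies s is roughly 1/7, while 60x on 64 nodes implies s is roughly 0.001. A throwaway sketch of that arithmetic (a back-of-the-envelope model, not a measurement):

    /* Amdahl-style reading of the speedups above. The serial fractions
     * are backed out from the measured 4x-on-8 and 60x-on-64 numbers. */
    #include <stdio.h>

    static double amdahl_speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        double s_cluster = 1.0 / 7.0; /* 4x on 8 nodes   */
        double s_t3e = 0.00106;       /* ~60x on 64 nodes */

        for (int p = 2; p <= 256; p *= 2)
            printf("p=%4d   cluster %6.1fx   T3E %6.1fx\n",
                   p, amdahl_speedup(s_cluster, p),
                   amdahl_speedup(s_t3e, p));
        return 0;
    }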

I concluded that, unless I had an MD job that couldn't fit on the Linux cluster, it was always better to use the Linux cluster instead of the supercomputer.

It really depends a lot on the simulation and how much you can adjust your system to run faster on smaller machines.


I think the scale of compute speed really muddies the water around what we traditionally consider a supercomputer. A DGX pod is probably a supercomputer, imo.

But we're seeing that some problems can be solved with petabyte-scale lookup tables. Is that a supercomputer?

I'd really like to hear if there even is a coherent definition anymore or if that line is just blurred forever.


A full DGX pod, like a full TPU pod, is an "ML supercomputer". The big caveat is that they don't typically support all CPU functionality and tend to support only smaller datatypes (float instead of double). As a result, neither can act as a full replacement for current HPC codes, although that is a very exciting area of research.


It depends who you talk to, but I would say "a supercomputer is a machine designed for high-performance computing that has capabilities several standard deviations above the mean".

There are really only about 10 supercomputers in the world at any time (the most that multiple nations can afford to run), and the capabilities keep being pushed up.

So no, I don't think it's just a bunch of servers with a low-latency network. Typically much more integration has to be done to achieve peak performance.





