Supercomputers can be differentiated from server farms in a few ways:
1. Performance, measured in FLOPS. Think of a supercomputer as a gigantic graphics card: the bandwidth between nodes is enormous, and the network topology is different.
2. Problem type is one with lots of interaction between partial solutions.
3. Problem solving capability instead of problem solving capacity.
Supercomputer design is whole-system design for the capability to solve one huge problem in a short time. With just a 'bunch of servers,' adding more servers gives you more capacity to solve a larger number of limited-size problems.
A supercomputer usually has a maximum design problem size, and every aspect of the system is optimized and balanced to reach that maximum. Even if the supercomputer is sold in pieces and can be extended, extensions only scale toward that design maximum, where the capability peaks. Installing two maxed-out supercomputers side by side does not double the capability for most problems.
In this way a supercomputer is like a single computer: you can add components until you reach its limits, and if you need to solve bigger problems, you build a bigger supercomputer from scratch.
I guess I can imagine how the topology can limit expansion after a certain point. What sort of speedup are we talking about though? Say I have some HPC code that takes a week to run on a purpose-built supercomputer. If I threw a bunch of servers together, with the appropriate HW to do RDMA or whatever's needed, how much slower are we talking? 10x slowdown? 100x? I realize that's probably wildly under-specified, I'm just trying to get a sense of the scale of the difference.
I'll give a concrete example. Around 2000 I was running molecular dynamics simulations. The goal is to get the longest trajectory you can in a reasonable time. You can add more processors and speed up the simulation, but at some point adding more processors doesn't speed things up unless you can speed up the network, because there is a computational barrier you can't cross until the network delivers data to the other nodes (by the way, this is the same pattern as allreduce in Horovod).
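For context, here's a minimal MPI sketch of that pattern (not the original MD code; the "energy" computation is a stand-in). The point is that every timestep ends with a global sum, so no rank can move on until the network has carried everyone's contribution:

```c
/* Minimal sketch of the per-timestep allreduce pattern in a parallel
 * MD-style loop. Build: mpicc allreduce_sketch.c -o allreduce_sketch
 * Run:   mpirun -np 8 ./allreduce_sketch
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int step = 0; step < 100; step++) {
        /* Each rank works on its own subset of atoms (toy computation here). */
        double local_energy = 1.0 / (rank + 1) + step;

        /* Global synchronization point: nobody starts the next timestep
         * until the sum has travelled across the network. On a slow or
         * congested network this wait dominates and extra CPUs stop helping. */
        double total_energy = 0.0;
        MPI_Allreduce(&local_energy, &total_energy, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0 && step % 25 == 0)
            printf("step %d on %d ranks: total energy %.3f\n",
                   step, nprocs, total_energy);
    }

    MPI_Finalize();
    return 0;
}
```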
I had access to the fastest supercomputer at the time (a T3E with 64 nodes) as well as a small Linux cluster.
The speedup I saw on the supercomputer was 60x for 64 nodes. The speedup I saw on my Linux cluster was 4x for 8 nodes. All the cluster's problems were due to the slow network (10 Mbit Ethernet with TCP) and congestion (allreduce hammers the network). However, I didn't have to wait a week for my job to start running, and I could add more nodes and run different jobs in parallel.
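As a rough way to quantify that gap (my framing, not a measurement from back then), you can fit Amdahl's law S(p) = 1 / (s + (1 - s)/p) to those two speedups and solve for the effective serial-plus-communication fraction s:

```c
/* Back-of-envelope Amdahl fit to the quoted speedups: 60x on 64 nodes
 * vs. 4x on 8 nodes. Solving S(p) = 1 / (s + (1 - s)/p) for s gives
 * s = (p/S - 1) / (p - 1). Purely illustrative arithmetic.
 */
#include <stdio.h>

static double serial_fraction(double p, double speedup) {
    return (p / speedup - 1.0) / (p - 1.0);
}

int main(void) {
    double s_t3e     = serial_fraction(64.0, 60.0); /* ~0.001 */
    double s_cluster = serial_fraction(8.0, 4.0);   /* ~0.14  */

    printf("effective serial fraction, T3E:           %.4f\n", s_t3e);
    printf("effective serial fraction, Linux cluster: %.4f\n", s_cluster);
    printf("parallel efficiency, T3E:           %.0f%%\n", 100.0 * 60.0 / 64.0);
    printf("parallel efficiency, Linux cluster: %.0f%%\n", 100.0 * 4.0 / 8.0);
    return 0;
}
```

In other words, the slow network made the cluster behave as if roughly 14% of the work were unparallelizable, versus about 0.1% on the T3E, which is why adding nodes to the cluster stopped paying off so quickly.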
I concluded that, unless I had an MD job that couldn't fit on the Linux cluster, it was always better to use the Linux cluster instead of the supercomputer.
It really depends a lot on the simulation and how much you can adjust your system to run faster on smaller machines.