I guess I can imagine how the topology can limit expansion after a certain point. What sort of speedup are we talking about, though? Say I have some HPC code that takes a week to run on a purpose-built supercomputer. If I threw a bunch of servers together, with the appropriate HW to do RDMA or whatever's needed, how much slower are we talking? A 10x slowdown? 100x? I realize that's probably wildly under-specified; I'm just trying to get a sense of the scale of the difference.
I'll give a concrete example. Around 2000 I was running molecular dynamics simulations. The goal is to get the longest trajectory you can in a reasonable amount of time. You can add more processors and speed up the simulation, but at some point adding more processors stops helping unless you can also speed up the network, because every time step hits a synchronization barrier you can't cross until the network delivers data to the other nodes (btw, this is the same operation as allreduce in Horovod). It's roughly the pattern sketched below.
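For anyone who hasn't seen it, here's a minimal sketch of that barrier in MPI, which is what MD codes of that era typically used. The array contents and sizes are made up for illustration; the point is that MPI_Allreduce blocks every rank until the network has delivered everyone's contribution:

    #include <mpi.h>
    #include <stdio.h>

    #define N 3  /* e.g. one 3-component vector; size is illustrative */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each node computes its partial contribution (made-up values). */
        double local[N]  = { rank * 1.0, rank * 2.0, rank * 3.0 };
        double global[N];

        /* Every rank blocks here until the sums over all ranks arrive.
           No matter how fast the per-node compute is, the time step
           cannot advance until the network finishes this exchange. */
        MPI_Allreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("summed: %f %f %f\n", global[0], global[1], global[2]);

        MPI_Finalize();
        return 0;
    }

Build and run with something like mpicc allreduce.c -o allreduce && mpirun -np 8 ./allreduce. The compute between those calls can be arbitrarily fast; the wall-clock time per step is floored by how long the collective takes.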
I had access to the fastest supercomputer at the time (a T3E with 64 nodes) as well as a small Linux cluster.
The speedup I saw on the supercomputer was about 60x on 64 nodes (~94% parallel efficiency). The speedup I saw on my Linux cluster was about 4x on 8 nodes (50% efficiency). All of the lost efficiency was due to the slow network (10 Mbit Ethernet with TCP) and congestion (allreduce hammers the network). However, I didn't have to wait a week for my job to start running, and I could add more nodes and run different jobs in parallel.
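To put rough numbers on why the cheap network hurts so much, here's a back-of-the-envelope sketch using the standard latency/bandwidth (alpha-beta) cost model, assuming a ring allreduce. Every parameter below is an illustrative guess, not a measurement from my setup:

    #include <stdio.h>

    /* Rough ring-allreduce cost: T = 2(P-1)*alpha + 2*((P-1)/P)*n*beta,
       where alpha = per-message latency (s), beta = per-byte time (s),
       n = message size in bytes.  Parameters are illustrative only. */
    static double allreduce_time(int p, double n, double alpha, double beta) {
        return 2.0 * (p - 1) * alpha + 2.0 * ((double)(p - 1) / p) * n * beta;
    }

    int main(void) {
        double n = 1e6;  /* say 1 MB exchanged per step, a made-up figure */

        /* ~10 Mbit Ethernet over TCP: high latency, ~1.25 MB/s */
        double t_eth = allreduce_time(8, n, 200e-6, 1.0 / 1.25e6);

        /* T3E-class interconnect: microsecond latency, ~300 MB/s (rough) */
        double t_t3e = allreduce_time(64, n, 2e-6, 1.0 / 300e6);

        printf("8 nodes, 10 Mbit TCP : %.3f s per allreduce\n", t_eth);
        printf("64 nodes, T3E-class  : %.6f s per allreduce\n", t_t3e);
        return 0;
    }

With numbers in that ballpark, the Ethernet cluster spends on the order of a second per allreduce while the T3E-class network takes milliseconds, so past a few nodes the commodity cluster is waiting on the wire instead of computing, which matches the 50% efficiency above.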
I concluded that, unless I had an MD job that wouldn't fit on the Linux cluster, it was always better to use the Linux cluster instead of the supercomputer.
It really depends a lot on the simulation and how much you can adjust your system to run faster on smaller machines.