Yes, to a degree, but probably not quite like you're thinking. Supercomputers and HPC clusters are highly tuned for the hardware they use, which can have thousands of CPUs. But ultimately the "OS" that controls them takes on a bit of a different meaning in those contexts.
Ultimately, the OS has to be designed for the hardware/architecture it's actually going to run on, not just an abstract concept like "lots of CPUs". How the hardware does interprocess communication, cache and memory coherency, interrupt routing, etc. is going to be the limiting factor, not the theoretical design of the OS. Most of the major OSs already do a really good job of utilizing the available hardware for typical workloads, and can be tuned pretty well for custom ones.
I added support for up to 254 CPUs on the kernel I work on, but we haven't taken advantage of NUMA yet; we don't really need to, since the performance hit for our workloads is negligible. The Linuxes and BSDs do, though, and they can already get as much performance out of the system as the hardware will allow.
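For reference, Linux exposes its NUMA awareness to userspace through libnuma, so applications can place memory deliberately rather than relying on first-touch. A minimal sketch, assuming the numactl/libnuma package is installed and the machine actually reports NUMA nodes (node 0 is just picked for illustration):

    /* Build with: gcc -o onnode onnode.c -lnuma
     * Sketch only: assumes libnuma is present and the kernel is NUMA-enabled. */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not supported on this system\n");
            return 1;
        }

        printf("Configured NUMA nodes: %d\n", numa_max_node() + 1);

        /* Ask for memory local to node 0, instead of wherever the
         * default first-touch policy happens to land the pages. */
        size_t len = 64 * 1024 * 1024;
        void *buf = numa_alloc_onnode(len, 0);
        if (!buf) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }

        memset(buf, 0, len);   /* touch it so the pages are actually faulted in */
        numa_free(buf, len);
        return 0;
    }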
Modern OSs are already designed with parallelism and concurrency in mind, and with the move towards making as many of the subsystems as possible lockless, I'm not sure there's much to be gained by redesigning everything from the ground up. It would probably look a lot like it does now.
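To make "lockless" concrete, here's a toy sketch using plain C11 atomics (not any particular kernel's API) of the pattern modern kernels use for hot statistics counters: each CPU/thread bumps its own slot with relaxed atomics and readers sum the slots, so the write fast path never takes a lock.

    /* Toy lockless per-CPU-style counter using C11 atomics.
     * Illustrative only: writers never lock, readers aggregate. */
    #include <stdatomic.h>
    #include <stdio.h>

    #define NR_SLOTS 64   /* pretend this is "one slot per CPU" */

    static _Atomic unsigned long slots[NR_SLOTS];

    /* Writer fast path: relaxed add to "our" slot; writers on different
     * slots never contend with each other. */
    static void count_event(int cpu)
    {
        atomic_fetch_add_explicit(&slots[cpu % NR_SLOTS], 1,
                                  memory_order_relaxed);
    }

    /* Reader: sum all slots. The total is approximate while writers are
     * running, which is usually fine for statistics. */
    static unsigned long read_total(void)
    {
        unsigned long sum = 0;
        for (int i = 0; i < NR_SLOTS; i++)
            sum += atomic_load_explicit(&slots[i], memory_order_relaxed);
        return sum;
    }

    int main(void)
    {
        count_event(0);
        count_event(1);
        count_event(1);
        printf("total = %lu\n", read_total());
        return 0;
    }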
Another aspect of supporting widely dispersed CPUs, à la the SGI Origin 3000 series (say, an Origin 3400), is support for performance telemetry from the interconnect hardware that ties it all together.
On SGI, the CrayLink hardware had performance counters visible via Performance Co-Pilot (PCP).
On Linux, NUMA systems have similar tooling (numastat, Intel's PCM, and others). Depending on the workload it may or may not matter, but if the OS/tooling doesn't expose the counters, it isn't even possible to quantify the impact.
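Those per-node counters are sitting right in sysfs on a NUMA-enabled kernel. A quick sketch that dumps node 0's hit/miss statistics (the path is assumed to exist; it won't on a non-NUMA kernel):

    /* Sketch: dump the per-node NUMA counters Linux exposes in sysfs. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/devices/system/node/node0/numastat";
        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }

        /* Lines look like "numa_hit 123456", "numa_miss 42", etc. */
        char name[64];
        unsigned long long val;
        while (fscanf(f, "%63s %llu", name, &val) == 2)
            printf("%-16s %llu\n", name, val);

        fclose(f);
        return 0;
    }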
SGI's IRIX, due to the sheer physical size of its larger ccNUMA systems (AFAIK, AMD's NUMA descends from SGI's ccNUMA), had the option to auto-migrate workloads when certain CPU-to-memory latency thresholds were reached.
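Linux's rough analogue today is automatic NUMA balancing (the kernel.numa_balancing sysctl), and you can also migrate pages explicitly with move_pages(2) if you want to do it yourself. A hedged sketch of the explicit route, assuming libnuma's numaif.h is available and that a second node (node 1) actually exists on the box:

    /* Sketch: explicitly migrate one page to another NUMA node with
     * move_pages(2), roughly what an auto-migration policy does for you.
     * Build with -lnuma; assumes at least two NUMA nodes exist. */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        void *buf = aligned_alloc(page_size, page_size);
        if (!buf)
            return 1;
        memset(buf, 0, page_size);      /* fault the page in somewhere */

        void *pages[1]  = { buf };
        int   nodes[1]  = { 1 };        /* target node 1 (assumed to exist) */
        int   status[1] = { -1 };

        /* pid 0 means "this process"; MPOL_MF_MOVE moves only pages we own. */
        long rc = move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
        if (rc < 0) {
            perror("move_pages");
            return 1;
        }

        /* status[0] is the node the page ended up on, or -errno on failure. */
        printf("page status: %d\n", status[0]);

        free(buf);
        return 0;
    }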