Making a chip this large is difficult, expensive, and error prone.
It blows past the reticle limit, so you end up having to do multiple (carefully aligned) exposures for adjacent chunks of the design. Signals can't really travel more than a few mm to a cm or so without significant degradation, so you end up needing to add buffers all over the place. And good yield gets exponentially harder with larger designs, since a larger area has a higher probability of overlapping with a defect, so you need to harden the design with redundancy and parts that can be selectively disabled if they turn out to be defective, so you don't have to throw out the whole part.
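To put rough numbers on the yield point: under the simple Poisson model, yield falls off exponentially with area as exp(-D*A). A quick sketch (the defect density here is an assumed illustrative value, not a real fab figure):

    # Yield vs. die area under the simple Poisson model: yield = exp(-D * A).
    # D is an assumed illustrative defect density, not a real fab number.
    import math

    D = 0.1  # defects per cm^2 (assumption)
    for area_cm2 in [1, 5, 10, 50, 460]:  # ~460 cm^2 is roughly wafer scale
        y = math.exp(-D * area_cm2)
        print(f"{area_cm2:4d} cm^2 -> {y:.2%} chance of a defect-free die")

At anything approaching wafer scale the chance of a defect-free part is essentially zero, which is why the design has to tolerate defects rather than rely on yield.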
FPGAs have been pushing towards this kind of craziness, but there aren't huge advantages to doing this with a CPU, since the only thing you can do with so much more area is add cores and cache. The cost of going off-die to additional cores isn't so bad. On the other hand, trying to synthesize a multi-FPGA design (while meeting timing) is torture.
Why not, then, build a motherboard with many processor slots that can handle the linking, and then use traditionally sized processors? I don’t have the background here and am curious.
That's possible already, but once you get over certain limits it is usually done at the machine level rather than at the processor level (2-, 4- and even 8-socket machines exist, but they are expensive). The kinds of problems people tend to solve on such installations (typically clusters of commodity hardware) are quite different from the programs you run on your day-to-day machine: think geological analysis, weather prediction and so on.
At some point the cost of the interconnect hardware dominates the cost of the CPUs. Lots of parties, for instance Ivan Sutherland (https://news.ycombinator.com/item?id=723882), have tried their hand at this, but so far nobody has been able to pull it off successfully.
Eventually it will happen though; this idea is too good to remain without sponsors for long.
2-socket servers are actually the norm and dominate datacenters. 4-socket is still common, but usually only for the ability to address a large amount of RAM in a single machine or for niche commercial workloads. 8-socket x86 servers are very unusual.
To add context, this is mostly because the NUMA properties get weird. With 2 sockets, all of the inter-socket links can go directly to the other processor, and current Xeons have 2-3 of those. With 4 and 8 sockets you end up with strange memory topologies whose hop counts are less predictable unless you know your application was written for them.
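To make the hop-count point concrete, here's a toy comparison of worst-case socket-to-socket hops; the 8-socket link layout below is made up for illustration, not any specific vendor's topology:

    # Worst-case socket-to-socket hops for fully connected 2/4-socket boxes
    # vs. a partially connected 8-socket topology (ring plus cross links).
    # The 8-socket layout is a made-up example, not a real product's design.
    from collections import deque
    from itertools import combinations

    def max_hops(n, links):
        adj = {i: set() for i in range(n)}
        for a, b in links:
            adj[a].add(b)
            adj[b].add(a)
        worst = 0
        for src in range(n):
            dist = {src: 0}
            q = deque([src])
            while q:
                u = q.popleft()
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        q.append(v)
            worst = max(worst, max(dist.values()))
        return worst

    two = [(0, 1)]                             # every socket one hop away
    four = list(combinations(range(4), 2))     # fully connected
    eight = [(i, (i + 1) % 8) for i in range(8)] + [(i, i + 4) for i in range(4)]
    print(max_hops(2, two), max_hops(4, four), max_hops(8, eight))  # -> 1 1 2

With 2 (or fully connected 4) sockets every remote access is one predictable hop; once some socket pairs are two hops apart, memory latency depends on where the OS happened to place your pages.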
We are actually moving away from wafer-scale integration, because sticking to smaller chips makes you a lot more resilient to defects, which are becoming even more of a problem at recent nodes. Recent chips from both Intel and AMD are based on tiny "chiplets", connected via some sort of fast in-package interconnect.
Chiplets are just an admission that it is a hard problem. In the longer run the cost of interconnecting chiplets will go up again as their number grows; at some point the crossover point will be reached and we're off to some variation on self-healing hardware, all on one die.
Two kettles in the UK, closer to three in the US, where circuits typically top out at 1800 W. I actually looked into the feasibility of installing some 240 V circuits in my kitchen and ordering kettles/blenders/etc from Europe... too much work lol.
It's very easy with immersion cooling. The article mentions roughly 20 kW of heat per 15U, which doesn't even come close to the 100 kW+ that immersion cooling has been designed for.
It was considered quite seriously for CPUs back in the 1980s (the article touches on this). The problem is that only perhaps 30% of the dies on the wafer will work, so you end up with lots of dead silicon which needs routing around.
A current problem, which it's unclear how Cerebras is handling, is that CPUs and DRAM use different fabrication processes which I guess can't be mixed on a single wafer, and you don't want your CPU to be too far away from its RAM. Edit: It seems they're using SRAM, not DRAM, so that explains it, but it must be low-density and power-hungry memory.
SRAM tends to use less power per bit than DRAM, since it doesn't need to be refreshed, especially with 8T cells; that is, less power when idle and less per bit read. The difference in speed is big enough, though, that SRAM running at max bandwidth will use more power than DRAM running at max bandwidth. And there are cases, possibly even the Cerebras chip, where the cost of getting the bit to where you need it outweighs the cost of reading it, and the greater density of DRAM might make it more efficient since you don't have to spend energy transporting the bit as far.
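A back-of-envelope version of that tradeoff; every energy figure below is an order-of-magnitude assumption for illustration, not a measurement of any particular part:

    # "Access energy vs. transport energy" sketch. All numbers are assumed
    # order-of-magnitude values for illustration, not measured figures.
    SRAM_READ_PJ_PER_BIT = 1.0     # assumed on-die SRAM read energy
    DRAM_READ_PJ_PER_BIT = 15.0    # assumed DRAM access energy
    WIRE_PJ_PER_BIT_MM = 0.1       # assumed on-die transport energy per mm

    def pj_per_bit(read_pj, distance_mm):
        return read_pj + WIRE_PJ_PER_BIT_MM * distance_mm

    print(pj_per_bit(SRAM_READ_PJ_PER_BIT, 2))    # nearby SRAM: ~1.2 pJ/bit
    print(pj_per_bit(SRAM_READ_PJ_PER_BIT, 200))  # SRAM across a wafer: ~21 pJ/bit
    print(pj_per_bit(DRAM_READ_PJ_PER_BIT, 20))   # denser DRAM sitting closer: ~17 pJ/bit

With numbers like these, the cheaper read loses once the data has to travel far enough, which is the crossover being described.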
But does that need to be a problem for ML applications? In theory the training should be able to compensate for some broken hardware by simply moving weights around.
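Wafer-scale parts are usually described as handling this in hardware, by mapping out defective cores and routing around them, rather than relying on training. But the basic idea of just not placing work on bad units is easy to sketch (toy NumPy example, purely hypothetical, not how Cerebras actually schedules work):

    # Toy sketch: place weight blocks only on cores that passed test, skipping
    # defective ones. Purely illustrative; not any real scheduler.
    import numpy as np

    rng = np.random.default_rng(0)
    n_cores = 16
    core_ok = rng.random(n_cores) > 0.3              # assume ~30% of cores are bad
    weight_blocks = [rng.standard_normal((64, 64)) for _ in range(8)]

    good_cores = np.flatnonzero(core_ok)
    # Round-robin the blocks across the surviving cores.
    placement = {blk: int(good_cores[blk % len(good_cores)])
                 for blk in range(len(weight_blocks))}
    print(f"{len(good_cores)}/{n_cores} cores usable, placement: {placement}")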