I have a feeling it'd be hard to find something that is easier to route and schedule over multiple cores than it is to just add that extra unit to a single core. AMD's latest CPUs are a good example: the L3 cache isn't even contiguous across all cores in the same core complex anymore (hitting another core's slice has the same access pattern as going to a completely different chiplet).
“Under heavy load, with multiple cores executing RDRAND in parallel, it is possible, though unlikely, for the demand of random numbers by software processes/threads to exceed the rate at which the random number generator hardware can supply them.”
I don’t think it would be that hard to find something worth routing over multiple cores. If certain operations were unrolled and buffered, it could end up more efficient overall.
Pretty much every other piece of the CPU needs to consume inputs and produce outputs without race conditions or cache-coherency problems. That is where connecting and scheduling get difficult, and it's why AMD ends up with 16×16 MB of L3 instead of 1×256 MB: trying to pump coherency over an interconnect, instead of just output, comes with an ENORMOUS penalty.
Avoiding speculation on the output of high-latency instructions seems prudent.