The problem with GPUs is that I/O latency is very high compared to your average supercomputer. You can do a crazy amount of computation locally on one card, but for problems that aren't "embarrassingly parallel", i.e. those that require a lot of low-latency inter-node communication, you'll immediately be limited by latency.
If Nvidia or AMD release GPU-based stream processors with onboard or daughterboard interconnects directly accessible from the code running on the GPU, THAT's when they'll start eating into CPU market share.
If you're buying a supercomputer, make sure you spend at least 50% of the budget on the interconnect, or you're in for a big disappointment.
Supercomputers are already focused on "embarrassingly parallel" problems; otherwise 300,000 cores aren't going to do much for you anyway. However, I agree that interconnect speed would be a major issue for many supercomputer workloads. Still, I suspect that if you had access to a $10+ million supercomputer built from a million GPU cores, plenty of people would love to work with such a beast.
No, these are not just racks and racks of individual machines. They present the programmer with a single system image - it "looks" like one huge expanse of memory.
We have a Blue Gene at Argonne, and it's not SSI. It's also not designed for embarrassingly parallel workloads; you use libraries like MPI to run tightly coupled message-passing applications (which are very sensitive to latency). You can, and people have, run many-task-type applications on it too.
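To illustrate what "tightly coupled" means here: every iteration of the computation blocks on a message from a neighbor, so per-message latency lands directly on the critical path. A toy stdlib-Python sketch of that pattern, with threads and queues standing in for MPI ranks and the network (the worker and its arguments are made up for illustration, not a real MPI API):

```python
import threading
import queue

def worker(inbox, outbox, value, steps, results, idx):
    """One 'rank' of a toy relaxation: exchange a boundary value with the
    neighbor every step, then average with it. The blocking get() is what
    makes the computation latency-sensitive."""
    for _ in range(steps):
        outbox.put(value)            # send our boundary to the neighbor
        neighbor = inbox.get()       # block until the neighbor's arrives
        value = (value + neighbor) / 2.0
    results[idx] = value

# Two "ranks" connected by a pair of one-way channels.
q01, q10 = queue.Queue(), queue.Queue()
results = [None, None]
t0 = threading.Thread(target=worker, args=(q10, q01, 0.0, 20, results, 0))
t1 = threading.Thread(target=worker, args=(q01, q10, 100.0, 20, results, 1))
t0.start(); t1.start()
t0.join(); t1.join()
print(results)  # both values relax to the mean: [50.0, 50.0]
```

With 20 communication rounds on the critical path, total runtime scales with round-trip latency no matter how fast each local step is - which is exactly why these codes punish a slow interconnect.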
The basic speed-of-light limitation means that accessing distant nodes is going to have high latency even if there is reasonable bandwidth. Ignoring that is a bad idea from an efficiency standpoint. And, unlike PC programming, the cost of the machine makes people far more focused on optimizing their code for the architecture than on abstracting the architecture away to help the developer out.
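The back-of-envelope numbers make the point: even with an idealized signal at vacuum light speed (real links in fiber or copper are slower, and switches and NICs add much more), just crossing a large machine room costs on the order of a hundred nanoseconds - hundreds of clock cycles. The distance and clock rate below are illustrative assumptions:

```python
# Light-speed floor on cross-machine latency, ignoring all switching/NIC
# overhead (which in practice dominates).
c = 3.0e8            # speed of light in vacuum, m/s
distance_m = 30.0    # assumed span of a large machine room
one_way_s = distance_m / c

one_way_ns = one_way_s * 1e9       # 100.0 ns one way
cycles = one_way_s * 3.0e9         # ~300 cycles on an assumed 3 GHz core
print(one_way_ns, cycles)
```

So a single remote access burns hundreds of cycles before any real hardware overhead is counted, which is why nobody building these machines can afford to pretend the topology doesn't exist.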
It takes care of it to some extent, but you still have to be aware of it as the programmer. MPI and its associated infrastructure are set up to pick nodes so that the network topology and your code's communication topology are well matched. But you have to do your best as a programmer to hide the latency by doing other useful work while messages are in flight.
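In MPI terms, hiding latency usually means the nonblocking pattern: post MPI_Irecv/MPI_Isend early, compute on the data you already have, and only then MPI_Wait. A stdlib-Python sketch of the same idea, with a thread and a sleep standing in for the network (function and variable names here are made up for illustration):

```python
import threading
import time

def fetch_halo(result):
    """Stand-in for a nonblocking receive: the sleep models network
    latency, the append models the boundary data arriving."""
    time.sleep(0.05)
    result.append([1.0, 2.0])

halo = []
req = threading.Thread(target=fetch_halo, args=(halo,))
req.start()                                     # "post the receive" early

# Meanwhile, do the interior work that doesn't need the boundary data.
interior = sum(x * x for x in range(100_000))

req.join()                                      # "wait" only for what's left
boundary = sum(halo[0])
print(interior, boundary)
```

If the interior computation takes longer than the message is in flight, the communication cost disappears from the critical path entirely; if not, you still only pay for the uncovered remainder.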