Another case I saw personally was a simulation of part of the visual cortex. The neurons were connected to their neighbors, but there were also a lot of connections to far-away neurons, so the bandwidth between the processors simulating different parts of the cortex became the bottleneck. The huge supercomputers that had enough bandwidth were (a) expensive, and (b) had relatively slow processors for the number crunching within each region.
In this case, though, I found that the physical delay on a long connection between neurons allowed us to buffer the messages and send a single notification for the whole train of impulses, effectively compressing the data. Together with some other simple changes, this made the simulation run 10 to 100 times faster, and it could use clusters instead of supercomputers.
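To make the buffering idea concrete, here is a minimal sketch in Python. It is not the code from that project. The names `SpikeBatcher`, `delay_steps` and `run` are made up for illustration. It assumes time is discretized into steps, that every long connection served by one batcher has the same delay of `delay_steps` steps, and that a batch sent at the end of a step reaches the remote processor before the next step starts. Under those assumptions, a spike fired at step t is not needed remotely until step t + `delay_steps`, so all spikes fired within a window of `delay_steps` steps can travel in one message.

```python
from collections import defaultdict


class SpikeBatcher:
    """Buffers spikes bound for a remote processor and sends them in batches."""

    def __init__(self, delay_steps):
        # Conduction delay of the long connection, in simulation steps (assumed uniform).
        self.delay_steps = delay_steps
        self.pending = defaultdict(list)  # neuron id -> steps at which it fired
        self.window_start = 0

    def record(self, neuron_id, step):
        """Remember a spike instead of sending it right away."""
        self.pending[neuron_id].append(step)

    def flush_if_due(self, step):
        """At the end of `step`, return one batched message if the window is full."""
        if step - self.window_start + 1 < self.delay_steps:
            return None
        # Store spike times as small offsets from the start of the window.
        trains = {
            nid: tuple(t - self.window_start for t in times)
            for nid, times in self.pending.items()
        }
        batch = (self.window_start, trains)
        self.pending.clear()
        self.window_start = step + 1
        # Skip the message entirely if nothing fired in this window.
        return batch if trains else None


def run(spikes, delay_steps, n_steps):
    """Drive the batcher; `spikes` maps a step to the neuron ids firing at that step."""
    batcher = SpikeBatcher(delay_steps)
    batches = []
    for step in range(n_steps):
        for nid in spikes.get(step, ()):
            batcher.record(nid, step)
        batch = batcher.flush_if_due(step)
        if batch is not None:
            batches.append(batch)
    return batches


# Four spikes over eight steps, with a delay of four steps.
batches = run({1: [7], 2: [9], 3: [7], 5: [7]}, delay_steps=4, n_steps=8)
```

With these inputs the four spikes travel in two messages instead of four, and each message carries a whole train per neuron rather than one notification per impulse. The saving grows with the firing rate and with the length of the delay.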
In general, there are not that many cases in which you really can't get rid of the requirement for fast non-local memory access; if there were, these supercomputers wouldn't have died out. But they were useful in some cases, and they also freed people from having to think about how to localize their memory accesses, which sped up development.