Apparently, the Cray T90 in the 32-processor configuration had 360 GB/s of shared memory bandwidth - still many times more than any shared-memory configuration you can get on the desktop today. Of course, supercomputers have largely moved on from shared memory to clusters, which have larger aggregate bandwidth - but shared memory still has its uses. Just a point for all the comparisons to smartphones and whatnot.
EDIT: just to expand on this a bit, this means there are workloads that these old-school supercomputers will run much faster than a modern high-end desktop. This particularly applies to workloads with a lot of shared-memory access at random locations - a very difficult case for modern systems, which depend on high cache hit rates. Also, the GFLOPS ratings of these supercomputer processors are in many ways more real than the ratings of commodity processors, which depend on the pipelines being filled in very specific ways. So no, you can't replace this system (at least the 32-processor version) with a smartphone or even a desktop. Which is not to say it would have been cost-effective at $39 million in 1996 dollars.
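To make the random-access point concrete, here's a toy Python sketch (my own illustration, not a real benchmark - interpreter overhead masks most of the hardware effect, and the absolute numbers mean nothing): the same sum over the same array, visited sequentially vs. in a shuffled order that defeats caches and prefetchers.

```python
import random
import time

N = 1 << 20  # ~1M elements, enough to spill out of small caches
data = list(range(N))

seq_order = list(range(N))
rand_order = seq_order[:]
random.shuffle(rand_order)  # same indices, cache-hostile order

def traverse(order):
    """Sum the array in the given visiting order, timing the walk."""
    t0 = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - t0

seq_sum, seq_t = traverse(seq_order)
rand_sum, rand_t = traverse(rand_order)
assert seq_sum == rand_sum  # identical work, different access pattern
print(f"sequential: {seq_t:.3f}s  random: {rand_t:.3f}s")
```

On a compiled language with a big working set the gap between the two walks is dramatic; on the old vector machines with flat SRAM it largely wasn't there.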
Most simulations can actually be parallelized in such a way that most memory accesses are local - you can divide the object or system in such a way that various parts don't communicate with each other too much, and map that to a cluster computer. But there are some exceptions in which parts of a simulated system can affect each other remotely. For example, in a lot of nuclear simulations, particles produced in one part of the system can very quickly travel to other parts; you have a problem when you have particles in your system which operate on different time scales, e.g. neutrons, heavy nuclei, and photons. This is a big reason why the DOE and nuclear laboratories liked these supercomputer systems.
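The "divide the system so parts don't communicate much" idea can be sketched in a few lines (a hypothetical toy, assuming a 1D diffusion problem with periodic boundaries): each chunk stands in for one cluster node, and the only non-local data a chunk needs per step is one ghost cell from each neighbor.

```python
# Toy domain decomposition: a 1D periodic diffusion grid split into
# chunks ("nodes"); each step, a chunk reads only one ghost cell from
# each neighbor -- every other access is local to the chunk.
def step(padded):
    # explicit diffusion update on the interior of a ghost-padded chunk
    return [padded[i] + 0.1 * (padded[i-1] - 2*padded[i] + padded[i+1])
            for i in range(1, len(padded) - 1)]

def run_global(u, steps):
    # reference: whole grid updated in one shared address space
    n = len(u)
    for _ in range(steps):
        u = [u[i] + 0.1 * (u[i-1] - 2*u[i] + u[(i+1) % n])
             for i in range(n)]
    return u

def run_decomposed(u, parts, steps):
    n = len(u) // parts
    chunks = [u[k*n:(k+1)*n] for k in range(parts)]
    for _ in range(steps):
        # the ghost-cell "messages": one value per neighbor per step
        chunks = [step([chunks[k-1][-1]] + chunks[k]
                       + [chunks[(k+1) % parts][0]])
                  for k in range(parts)]
    return [x for c in chunks for x in c]

u0 = [float(i % 7) for i in range(24)]
g = run_global(u0, 50)
d = run_decomposed(u0, 4, 50)
assert all(abs(a - b) < 1e-9 for a, b in zip(g, d))
```

The nuclear case breaks exactly this pattern: a fast particle can jump from one chunk to a distant one in a single step, so the communication stops being one-ghost-cell-per-neighbor.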
Another case I saw personally was a simulation of part of the visual cortex of the brain: the neurons were connected to their neighbors, but there were also a bunch of connections to far-away neurons, and the bandwidth between processors simulating different parts of the cortex became a limitation. The huge supercomputers that had the bandwidth were (a) expensive, and (b) had relatively slow processors for the number crunching within each region.
Except in this case, I found that the physical delay which existed on a long connection between neurons allowed us to buffer the messages and send a notification about the whole train of impulses, effectively compressing the data. Together with some other simple changes, the simulation ran 10 to 100 times faster, and could use clusters instead of supercomputers.
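Roughly, the trick looks like this (hypothetical code, not the actual simulator): because a spike on a long connection only arrives after a fixed propagation delay anyway, the sender can accumulate a whole window of spike times and ship them as one delta-encoded batch instead of one message per spike.

```python
# Sketch of delay-buffered spike batching. All names are illustrative.
def encode_window(spike_ticks):
    """Delta-encode a sorted list of spike times within one window."""
    out, prev = [], 0
    for t in spike_ticks:
        out.append(t - prev)  # small deltas compress well
        prev = t
    return out

def decode_window(deltas, arrival_offset):
    """Reconstruct absolute arrival times at the receiving neuron."""
    times, t = [], 0
    for d in deltas:
        t += d
        times.append(t + arrival_offset)
    return times

spikes = [3, 5, 6, 9]        # ticks at which the source neuron fired
delay = 100                  # propagation delay on the long connection
msg = encode_window(spikes)  # one message for the whole spike train
assert decode_window(msg, delay) == [103, 105, 106, 109]
```

The physics gives you the slack: nothing downstream needs any of these spikes until `delay` ticks have passed, so batching costs no simulated-time accuracy.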
In general, there are not that many cases in which you really can't get rid of the requirement for fast non-local memory access; if there were many, these supercomputers wouldn't have died out. But they were useful in some cases, and they also freed people from thinking about how to localize their memory accesses - which sped up development.
>"I'm ecstatic," Blank said. "In 50 years, people are just going to go, 'That was the pinnacle of military computing. These machines are going to be as important as the first PC or the first minicomputer.'" ... Regardless of where the machine ends up, Blank said he plans to keep it in working condition. He said he was arranging with Cray and the supercomputing center to ship the machine in a climate-controlled environment.
>It sat in my barn next to the tractors and manure for five years. I had the only farm capable of nuclear weapons design.
Cray called two years ago and bought it back for parts for an unnamed customer still running one.
That's kind of sad. But he did help ensure another Cray remained in working condition, so I guess that's a positive aspect of the story!
It seems as though bringing memristor technology to market is a sure thing at this point. And that will again revolutionize computing technology in several steps. First, what happens when durable storage has the speed of RAM? Well, everything gets faster, of course. But then sleep functionality gets a whole crap-ton better (because waking from sleep can be effectively instantaneous). Which means longer battery life for mobile devices, yadda yadda.
Then there's the third memristor revolution, where memristors are used directly to implement logic instead of transistors. How this could play out is anybody's guess, but when I think about it even a little, the idea of the technological singularity quickly comes to mind. The implications for raw computing power and for machine-based learning are truly astounding.
Thank you! I appreciate you pointing out this substantive and fundamental error in my post. I am so glad that you are engaged with the important aspects of the subject at hand instead of dragging down the discussion through petty, mindless pedantry on matters of little to no consequence.
We don't have the benchmarking data, but I strongly suspect that these kinds of implementations will actually be significantly slower than what compilers are capable of doing on x86-64. This is almost certainly true for stack-based VMs: stack operations are ridiculously slow compared to registers, and because the push/pop sequence performed by the instructions is implicit and linear, you can't take advantage of parallelization techniques like superscalar execution, branch prediction, and out-of-order execution - even pipelining is less effective.
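To make the implicit-dependency point concrete, here's a toy stack VM (my own sketch, not any particular VM's bytecode): every instruction consumes what the previous one pushed, so the whole sequence is one serial chain through the stack top, where a register ISA would expose independent operands.

```python
# Minimal stack-machine interpreter. Note the serial dependency: each
# op can only run after the previous op has updated the stack.
def run(code):
    stack = []
    for op, *arg in code:
        if op == "push":
            stack.append(arg[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

# (2 + 3) * 4 -- the mul cannot even locate its operands until the add
# has popped and pushed, unlike "r1 = 2 + 3; r2 = r1 * 4" in registers.
prog = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
assert run(prog) == 20
```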
To get the same level of performance as modern CPUs, you really need to take advantage of all the parallelization techniques I mentioned above. This is extremely difficult to design, because you need to ensure that every permutation possible in valid instruction sequences produces the correct results in the presence of all the reordering and parallel execution going on. Modern CPUs actually have a lot of bugs relating to this, but they only occur in very unusual code, and they get fixed either by patching the microcode on the CPU or by OS workarounds, so you rarely encounter them. And this is despite the amazingly thorough testing and huge amounts of formal verification that go into CPU designs.
I think FPGA-based designs will continue to be very algorithm specific for those reasons, even if we get FPGAs everywhere.
Vector supers were bandwidth machines, though. The memory is SRAM, and an 8 GFLOPS T94 probably(*) has about 100 GB/sec of theoretical memory bandwidth. Compare that to the 6 GB/sec theoretical bandwidth of LPDDR3.
(*) Cray lists the fully loaded 32-CPU T90 as doing 800 GB/sec; an 8 GFLOPS T94 has 4 CPUs.
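Spelling out the footnote's arithmetic (this assumes bandwidth scales linearly with CPU count, which is only a rough guess):

```python
# Back-of-the-envelope check of the figures above.
t90_total_gb_s = 800          # fully loaded 32-CPU T90, per Cray
t90_cpus, t94_cpus = 32, 4
t94_gb_s = t90_total_gb_s / t90_cpus * t94_cpus  # assume linear scaling
lpddr3_gb_s = 6

assert t94_gb_s == 100.0
print(f"T94 ~{t94_gb_s:.0f} GB/s vs LPDDR3 ~{lpddr3_gb_s} GB/s: "
      f"{t94_gb_s / lpddr3_gb_s:.0f}x")
```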
Really gives you some context of how far we've come in so little time, eh?
I sometimes find it sad that pieces like this T94 wind up as museum pieces or conversation pieces, but it's just one of the casualties of our rapid rate of progression. I mean, the Qualcomm Snapdragons in cell phones are probably more powerful, and they run off a battery worth a couple thousand mAh.
That is a bit of history - not as attractive as the X-MP or Y-MP (to my taste), but still a neat machine. Folks have been comparing it to a smartphone today, which is inaccurate; it would be better to compare it to the GPU in a smartphone, where the GPU shader engines don't have nearly as rich an instruction set as the CLUs of the T94 (and the T94 only had 4 CPUs rather than 16 or 32). But an amazing I/O bandwidth of 8 GB/sec was pretty cool. This thing would have been capable of an awesome render of most 3D scenes.
And how well-built it is! Look at the construction: all-machined bolts and panels and stuff. Looks like something you'd find in the avionics bay of a spaceship. I love how the panels have milled labels, too: "Common Memory", etc. Looks like it'd sing "Daisy, Daisy" as you opened it up. If I had the means I'd pick this thing up and put it in my living room in a heartbeat. Alas.
Someday I want to see the Hacker Dojo buy an old Cray or IBM mainframe and set it up for people to run random jobs on. Not for any practical purpose, just so people can get a feel for what the old systems were like.