The T94 has a peak theoretical performance of about 8 GFLOPS, 1 gig of ram, half a meg of L2 cache and an SSD up to 4 gigs in size.
Compare this to the Galaxy S III, which has at least 1 GB of ram, 64 GB of flash storage, up to 8 MB of L2 cache per core (4x), and nearly 20 GFLOPS of real world performance (on the GPU).
It seems as though bringing memristor technology to market is a sure thing at this point. And that will again revolutionize computing technology in several steps. First, what happens when durable storage has the speed of RAM? Well, everything gets faster, of course. But then sleep functionality gets a whole crap-ton better (because waking from sleep can be effectively instantaneous). Which means longer battery life for mobile devices, yadda yadda.
Then there's the 3rd memristor revolution, where memristor's are used directly to implement logic instead of transistors. How this could play out is anybody's guess, but when I think about it even a little the idea of the technological singularity quickly comes to mind. The implications for raw computing power and for machine based learning are truly astounding.
We don't have the benchmarking data, but I strongly suspect that these kinds of implementations will actually be significantly slower than what compilers are capable of doing on x86-64. This is almost certainly going to be true for stack-based VMs (stack operations are ridiculously slow compared to registers, and because the push/pop sequence done by instructions is implicit and linear you can't take advantage of any parallelization techniques like superscalar execution, branch prediction, out-of-order execution, and even pipelining is less effective).
Even for register-based ones the obvious things like type checking are actually extremely efficient on modern processors (I wrote an explanation on Quora: http://www.quora.com/What-is-a-lisp-machine-and-what-is-so-g...).
To get the same level of performance as modern CPUs you really need to take advantage of all the parallelization techniques I mentioned above. This is extremely difficult to design because you need to ensure all the permutations that are possible in valid instruction sequences produce the correct results in the presence of all the reordering/parallel execution going on. Modern CPUs actually have a lot of bugs that are found relating to this, but they only occur in very unusual code, and these bugs are fixed either by patching the microcode on the CPUs or by the OS, so you rarely encounter them. And this despite the amazingly thorough testing and huge amounts of formal verification that go into CPU designs.
I think FPGA-based designs will continue to be very algorithm specific for those reasons, even if we get FPGAs everywhere.
) Cray lists the fully loaded 32-cpu T90 has doing 800 GB/sec, a 8 GFLOPS T94 has 4 CPUs