"every (non-hand-held) computer’s CPU chip will contain 1,000 fairly homogeneous cores."
There are two problems with these visions: one is memory and the other is the interconnect. 1000 cores, even at a modest clock rate, can easily demand 1 terabyte per second of memory bandwidth. But memory has the same economies as 'cores' in that it's more cost effective when it is in fewer chips, and a chip is limited in how fast it can send signals over its pins to neighboring chips (see Intel's work on Light bridge).
So you end up with what are currently exotic chip-on-chip types of deals, or little Stonehenge-like motherboards where this smoking hot chip is surrounded by a field of RAM shooting lasers at it.
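To put a rough number on the memory claim, here is a back-of-envelope sketch. The 1 GHz clock and the one byte of memory traffic per core per cycle are my own assumptions for illustration, not figures from the prediction being discussed.

```python
def memory_demand_bytes_per_sec(cores, clock_hz, bytes_per_cycle):
    """Aggregate memory traffic if every core touches memory each cycle.

    All three inputs are illustrative assumptions, not measurements.
    """
    return cores * clock_hz * bytes_per_cycle


# Assumed: 1000 cores, a modest 1 GHz clock, 1 byte per core per cycle.
demand = memory_demand_bytes_per_sec(cores=1000, clock_hz=1_000_000_000, bytes_per_cycle=1)
print(f"{demand / 1e12:.1f} TB/s")
```

Caches would cut the real figure, but wider loads would push it back up, so a terabyte per second is a plausible order of magnitude.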
The problem with that vision is that to date, the 'gains' we've been seeing have been when the chips got better but the assembly and manufacturing processes stayed more or less the same.
So when processors got better the existing manufacturing processes were just re-used.
That doesn't mean that at some point in the future we won't have 1000 core machines, it just means that other stuff will change first (like packaging) before we get them. And if you are familiar with the previous 'everything will be VLIW (very long instruction word)' prediction, you will recognize that a lack of those changes sometimes derails the progress. (In the VLIW case there have been huge compiler issues.)
The interconnect issue is that 1000 cores can not only consume terabytes of memory bandwidth, they can also generate 10s of gigabytes of data per second to and from the compute center. That data, if it is being ferried to the network or to non-volatile storage, needs channels that run at those rates. Given that the number of 10GbE ports on 'common computers' is still quite small, another barrier to this vision coming to pass is that these machines will be starved for bandwidth to get to fresh data to work on, or to put out data they have digested or transformed.
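A quick sketch of what that means in ports. The 20 GB/s target is a hypothetical stand-in for '10s of gigabytes', and the calculation uses the raw 10 Gbit/s line rate, ignoring protocol overhead.

```python
def ports_needed(bytes_per_sec, link_bits_per_sec=10_000_000_000):
    """Number of links needed to carry a given byte rate.

    Uses the raw line rate of the link (10 Gbit/s for 10GbE) and
    ignores framing and protocol overhead, so it is a lower bound.
    """
    bits_per_sec = bytes_per_sec * 8
    # Ceiling division with integers only.
    return -(-bits_per_sec // link_bits_per_sec)


# Assumed workload: 20 GB/s of data in or out of the machine.
print(ports_needed(20_000_000_000))
```

Under those assumptions the machine needs 16 saturated 10GbE ports, which is far more than a 'common computer' carries.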
The obvious solution to the memory bandwidth problem is to partition it and connect the pieces directly to the cores. Even if there is a shared region, cores don't have to worry about coherence when using their own private memory.
Yes, computing gets very interesting when what we have is no longer an overgrown IBM 5150.
Each SPE has a tiny amount of memory (256K, IIRC), severely limited connectivity to other SPEs and a downright cruel instruction set. A friendlier ISA and Transputer-like connectivity between the nodes would alleviate some of these problems.
Could you point me to a description of the VLIW compiler problems? In 1981 a small group of us coerced the Unix version 7 portable C compiler to generate VLIW assembler as a senior project. There was nothing astonishing going on; the pcc had perhaps a couple dozen things it needed to be able to generate (conditional execution, arithmetic, pointers), and it was a simple matter of not using stuff before the (very primitive and shallow) pipeline was able to deliver it. After graduating I lost touch with that kind of fun tech - I was hired to modify accounting software written in BASIC. I've recovered ;-).
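For readers who haven't seen it, the 'don't use a result before the pipeline delivers it' rule can be sketched as a toy bundle packer. This is not the pcc modification described above, just a minimal illustration; the op format, the latencies and the `schedule` name are all made up, and it assumes every register is written only once, so it tracks read-after-write dependencies only.

```python
def schedule(ops, width=4):
    """Pack ops into VLIW bundles, one bundle per cycle.

    ops: list of (dest, sources, latency) in program order.
    Returns a list of bundles; an empty bundle is a cycle of NOPs.
    Hypothetical toy model: each dest is written once, and sources
    that no op produces are treated as ready at cycle 0.
    """
    ready_at = {}  # register -> first cycle its value can be read
    bundles = []   # bundles[c] = dests of the ops issued in cycle c
    for dest, sources, latency in ops:
        # Earliest cycle at which every input has come out of the pipeline.
        cycle = max((ready_at.get(s, 0) for s in sources), default=0)
        # Take the first bundle at or after that cycle with a free slot.
        while True:
            while len(bundles) <= cycle:
                bundles.append([])
            if len(bundles[cycle]) < width:
                break
            cycle += 1
        bundles[cycle].append(dest)
        ready_at[dest] = cycle + latency
    return bundles


ops = [
    ("a", [], 2),          # load, result available 2 cycles later
    ("b", [], 2),          # load
    ("c", ["a", "b"], 1),  # add, must wait for both loads
    ("d", ["c"], 1),       # store of c
]
print(schedule(ops, width=4))
```

With a shallow pipeline and fixed latencies this really is simple. The hard part on later machines is everything the toy leaves out, such as latencies that are not known at compile time.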
The most readily accessible one is the Itanium compiler. Intel (and SGI) worked to create a compiler that could maximize the use of the VLIW instruction set of the processor to achieve application specific performance.
This was presented by Intel to its enterprise customers as the 'secret sauce' that would give Itanium the edge over SPARC and other 64 bit architectures. They have invested millions in making this effort work.
However, reception of the workflow for the Itanium compiler was mixed at best. On some workloads it outperformed, on others it simply matched. The process for training the compiler, which seemed to me at the time to be an outgrowth of the branch prediction efforts, involved running synthetic benchmarks, collecting information about utilization of the execution units, and then synthesizing new instruction mixes for the applications. The imposition of such a heavyweight process, which needed to be repeated for nearly every change to the code base, worked against the benefits promised. Since code is likely to change often, there were proposals to wait until you were 'ready to ship' before optimization and tuning. But once shipped, patches are made, bugs fixed, anomalies corrected. Changing any line of code could create a massive stall in the pipeline and crush performance until the system was re-tuned.
I don't know if that experience was universal, but it was common enough that such stories were everywhere at places like the Microprocessor Forum and other conferences.
One of the problems with VLIW architectures is the lack of binary compatibility between CPU generations.
Suppose you had a 4-way VLIW architecture and the next generation became 8-way. Even if the new CPU is able to run old 4-way code, that code fills only half of the issue slots, so it can run up to twice as slow as it should, i.e. you need to recompile your software.
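The arithmetic behind that, as a sketch. It assumes perfectly parallel code (every slot can be filled) and a new CPU that issues exactly one old bundle per cycle without re-packing it; both are simplifying assumptions, and the `cycles` helper is made up for illustration.

```python
def cycles(num_ops, bundle_width):
    """Cycles to issue num_ops independent ops, one full bundle per cycle."""
    return -(-num_ops // bundle_width)  # ceiling division


NUM_OPS = 1024                  # assumed: independent ops, no stalls
old_binary = cycles(NUM_OPS, 4) # 4-way bundles, one per cycle on the 8-way CPU
recompiled = cycles(NUM_OPS, 8) # same work repacked into 8-way bundles
print(old_binary, recompiled, old_binary / recompiled)
```

Real code rarely has enough parallelism to fill eight slots, so the gap is usually smaller than 2x, but the old binary can never use more than half the machine.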
In a sense, they did. And they did famously fail to meet expectations.
But considering that nowadays other stacks which rely on jitting regularly achieve real-world performance that is competitive with much native-compiled software, it seems safe to presume that Transmeta's performance problems stemmed from reasons beyond the basic idea behind CMS.
True, a dual card GPU system can have 1000 cores today; however, the cores provide primarily computation on a small (or procedurally generated) data set. This makes them great for simulating atomic explosions, CFD, and Bitcoin mining, but for systems which read data, do a computation on it, and then write new results, they don't have the I/O channels to support that sort of flow effectively. Back when I was programming GPUs, one issue was that channel bandwidth between the CPU's memory and the video card was insufficient, so textures had to be maximally compressed or procedurally generated or you would starve TPUs waiting for their next bit of texel data. I'm not sure if that is still the case with 16x PCIe slots, however.