There are two problems with these visions one is memory and the other is the interconnect. 1000 cores, even at a modest clock rate, can easily demand 1 Terabyte of memory accesses per second. But memory has the same economies as 'cores' in that it's more cost effective when it is in fewer chips. But the chip is limited in how fast it can send signals over its pins to neighboring chips (see Intel's work on Light bridge).
So you end up with what are currently exotic chip on chip types of deals, or little Stonehenge like motherboards where this smoking hot chip is surrounded by a field of RAM shooting lasers at it.
The problem with that vision is that to date, the 'gains' we've been seeing have been when the chips got better but the assembly and manufacturing processes stayed more or less the same.
So when processors got better the existing manufacturing processes were just re-used.
That doesn't mean that at some point in the future we might have 1000 core machines, it just means that other stuff will change first (like packaging) before we get them. And if you are familiar with the previous 'everything will be VLIW (very large instruction world)' prediction you will recognize that a lack of those changes sometimes derail the progress. (in the VLIW case there have been huge compiler issues)
The interconnect issue is that 1000 cores can not only consume terabytes of memory bandwidth they can generate 10s of gigabytes of data to and from the compute center. That data, if it is being ferried to the network on non-volatile storage needs channels that run at those rates. Given that the number of 10GbE ports on 'common computers' is still quite small, another barrier to this vision coming to pass is that these machines will be starved for bandwidth to get to fresh data to work on, or to put out data they have digested or transformed.
Yes, computing gets very interesting when what we have is no longer an overgrown IBM 5150.
This was presented by Intel to its enterprise customers as the 'secret sauce' that would give Itanium the edge over SPARC and other 64 bit architectures. They have invested millions in making this effort work.
However, reception of a workflow for the Itanium compiler was mixed at best. Some workloads it out performed, others it simply matched. The process for training the compiler, which seemed to me at the time to be an outgrowth of the branch prediction efforts, involved running synthetic benchmarks, collecting information about utilization of the execution units and then synthesizing new instruction mixes for the applications. The imposition of such a heavy weight process which needed to be repeated for nearly every change to the code base, worked against the benefits promised. Since code is likely to change often, proposals of waiting until you are 'ready to ship' before optimziation and tuning. But once shipped patches are made, bugs fixed, anomalies corrected. Changing any line of code could create a massive stall in the pipeline and crush performance until the system was re-tuned.
I don't know if that experience was universal, but it was common enough that such stories were everywhere at places like the Microprocessor Forum and other conferences.
But considering that nowadays other stacks which rely on jitting regularly achieve real-world performance that is competitive with much native-compiled software, it seems safe to presume that Transmeta's performance problems stemmed from reasons beyond the basic idea behind CMS.
GPU drivers dynamically recompile "Virtual ISA" into actual ISA when running kernel or shader first time.
Of course SIMD is not the same as 1K fully independent cores but they're both valid interpretations of parallel computing.
That kind of architecture only lends itself to a certain class of problems but they're a good indication of what can be done to circumvent the interconnect limitations.
A typical GPU hides access to slow resources by doing more computations.