History tells us that the x86 architecture won, but the P4 lost out to AMD and later the Intel Core line. Furthermore, PowerPC is still used. From the article, I can't tell which of these architectures would have trouble scaling up clock speed and power consumption. AMD getting almost 50% of the market share was equally shocking.
Lastly, there is the nagging thought that Windows' market dominance facilitated the investment that solved the shortcomings of x86, while the G4 stagnated.
I think the winners and losers were decided by factors other than architecture. x86 beat Power because of Windows, and Intel floated on their past success for a few years until they made a better architecture.
Hard for whom? 99% of the world bet on x86 winning. x86 won.
> AMD getting almost 50% of the market share was equally shocking.
They assembled a good team, made a good product, priced it reasonably, and it sold well. There isn't anything particularly shocking about that; it's how things are supposed to work.
As for the actually unpredictable: at the time of the article's writing, a person would probably have had trouble predicting how well and how quickly AMD and Intel would integrate one another's features, including 64-bit instructions, but it shouldn't be regarded as shocking. The moment in 2002 when Dave Cutler started making public pronouncements about how good the Opteron was, it was clear that Microsoft was happy to have x86 manufacturers competing on features and price. (Microsoft had already abandoned PPC as a big waste of time; even the DEC Alpha port of NT lasted longer.)
> x86 beat Power because of Windows
Power was just never much of a contender as a desktop CPU. People apparently forget how hard a time even Apple had getting enough chips, and how hard a time IBM had ramping up PPC performance. PPC ended up being great in so many areas, but the economics of desktop PC CPU production are pretty brutal.
Really, Apple jumped ship because they were the only major customer for the PowerPC CPUs they were buying, and not really large enough to securely sustain their suppliers.
Now they are large enough, hence their own ARM CPUs.
If you're comparing "narrow and deep" to "wide and shallow", neither won, really. Modern high-performance designs are basically "wide and deep" with lots of speculation, better branch prediction, and various tricks to almost eliminate the need to flush the pipeline.
Admittedly, none have quite the pipeline length of the P4, but Haswell's 14-19 stage pipeline definitely qualifies as "deep". Even ARM has gone deep - Cortex-A15 has a 15 stage integer pipeline, plus up to an additional 10 stages for some NEON/FPU instructions.
If anything won, it was designing to the reality of silicon transistor frequency scaling for 90nm and smaller.
PowerPC at the high end still exists in the form of POWER8 for servers, but those chips are very expensive and don't perform all that well, in either absolute terms or energy efficiency - it looks like the POWER series has gone down the same path as the P4, with an emphasis on high clock frequency and (extreme) hyperthreading, resulting in ~200W TDPs and frequencies in the 4-5GHz range.
4-5 GHz is actually a low frequency today; a NetBurst/Power6-like processor would be running at 8-10 GHz today while consuming 500 W or more (if it were possible).
Given that we're speaking hypothetically (since physics gets in the way), I'm not sure why you're saying NetBurst would be consuming 500W - as manufacturing processes shrink, so does power consumption.
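For reference, the first-order relation for dynamic power is

$$ P_{\text{dyn}} \approx \alpha\, C\, V_{dd}^{2}\, f $$

where α is the activity factor, C the switched capacitance, V_dd the supply voltage, and f the clock frequency. Under classic Dennard scaling, C and V_dd both shrank with each process node, which is why a shrink could raise f and still cut power. The catch (and probably the GP's point) is that V_dd mostly stopped scaling around 90nm, so past that point power grows at least linearly with frequency - and worse in practice, since reaching a higher f usually means raising V_dd, which enters squared.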
For what it's worth, while Intel did win the architecture wars here, the "deep and narrow" pipeline being described here died out with NetBurst. The Core microarchitecture that replaced it (which is the predecessor of today's Intel CPUs) used a much more "wide and shallow" pipeline, and benefited greatly from it.
Indeed, the P4 was a very odd CPU to program and optimise for - in some ways, it's the "most RISC-like" microarchitecture Intel has attempted. It was far more sensitive to things like instruction alignment and branch prediction than its successors and predecessors, and, likely in pursuit of higher clock frequencies, some instructions (e.g. shifts/rotates) were made several times slower. This meant that near-optimal code sequences for the P6 family (and often earlier) would perform horribly on the P4, and vice versa. It could beat the PIII in "straight-line" execution of simple integer instructions with no branches, but the PIII was faster (even at a lower clock frequency) on more complex, branch-heavy code. It's probably the only x86 where a significant speed advantage can be obtained by extreme loop unrolling, a practice that is mostly counterproductive on post-Nehalem cores.
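To make "extreme loop unrolling" concrete, here's a hypothetical sketch (my own example, not from any P4 optimisation guide) - on a P4 you'd unroll much further and split the work across independent accumulators to keep the long pipeline fed, while on modern cores the same transform mostly just bloats the code:

```c
#include <stddef.h>

/* straightforward reduction - one long dependency chain on s */
long sum_rolled(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* unrolled 4x with independent accumulators, so the adds can
   overlap in the pipeline instead of serialising on one register */
long sum_unrolled(const int *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```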
One of the more amusing P4 oddities is that certain 32-bit add/sub instructions will have a very slightly higher latency if there is a carry/borrow between the two 16-bit halves - it's very difficult to detect (I believe it's ~0.5 cycle), but it's there. This is probably a consequence of pipelining in the ALU itself.
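If you wanted to observe this yourself, the usual trick is to time a long chain of dependent adds and compare operands that do and don't generate a carry out of the low half. A rough sketch (hypothetical code - you'd only see a difference on an actual P4, and averaging over a long chain is essential since the effect is ~0.5 cycle):

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() - x86 only */

#define N 100000000L

/* average latency of a dependent chain of 32-bit adds */
static double chain(uint32_t start, uint32_t addend) {
    uint32_t x = start;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < N; i++) {
        x += addend;
        __asm__ volatile("" : "+r"(x));  /* keep the chain in a register */
    }
    uint64_t t1 = __rdtsc();
    return (double)(t1 - t0) / N;
}

int main(void) {
    /* adding 1 to 0: a carry from bit 15 into bit 16 almost never occurs */
    printf("no cross-half carry: %.2f cycles/add\n", chain(0, 1));
    /* adding 0xFFFF repeatedly: the low half overflows on nearly every add */
    printf("cross-half carry:    %.2f cycles/add\n", chain(0xFFFF, 0xFFFF));
    return 0;
}
```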
> One of the more amusing P4 oddities is that certain 32-bit add/sub instructions will have a very slightly higher latency if there is a carry/borrow between the two 16-bit halves
This kind of data-dependent delay gives crypto people hives, for what it's worth. It's the sort of thing that can make timing attacks possible.
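The standard defense is branch-free, fixed-work code - e.g. a constant-time comparison (an illustrative sketch, not from any particular library). What makes the P4 quirk nasty is that it undermines even this approach: the instruction's own latency varies with operand values, not just the control flow:

```c
#include <stddef.h>
#include <stdint.h>

/* leaks: runtime depends on where the first mismatch is, so an
   attacker can discover a secret byte-by-byte via timing */
int leaky_eq(const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}

/* constant-time: always touches every byte, no data-dependent branches */
int ct_eq(const uint8_t *a, const uint8_t *b, size_t n) {
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= a[i] ^ b[i];
    return diff == 0;
}
```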
> This is probably a consequence of pipelining in the ALU itself.
Yep, the P4 "fast ALU" was double-pumped and did 16-bit chunks of a 32-bit add/sub in adjacent half-cycles [1]. This meant that dependent chains of, e.g., adds could still issue back-to-back, since the first half of a dependent instruction requires only the low 16 bits of its sources. Always struck me as very clever!
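In code, the staggering looks roughly like this toy model (my own sketch of the idea, not the actual hardware): a dependent add only needs the producer's low half to start its own low half, so chained adds can begin every half-cycle.

```c
#include <stdint.h>
#include <stdio.h>

/* result of the first half-cycle: low 16 bits plus the carry out */
typedef struct { uint16_t lo; unsigned carry; } lo_half;

/* first half-cycle: add the low 16 bits (only the low halves of the
   sources are needed, so a consumer can start before the full result) */
static lo_half add_lo(uint32_t a, uint32_t b) {
    uint32_t s = (a & 0xFFFF) + (b & 0xFFFF);
    return (lo_half){ .lo = (uint16_t)s, .carry = s >> 16 };
}

/* second half-cycle: add the high 16 bits plus the carry from the first */
static uint32_t add_hi(uint32_t a, uint32_t b, lo_half first) {
    uint16_t hi = (uint16_t)((a >> 16) + (b >> 16) + first.carry);
    return ((uint32_t)hi << 16) | first.lo;
}

int main(void) {
    uint32_t a = 0x0001FFFF, b = 0x00000001;
    lo_half first = add_lo(a, b);
    printf("%08X\n", add_hi(a, b, first));  /* prints 00020000 */
    return 0;
}
```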
While I agree, the possibility is remote. Condé Nast is generally not interested in deep content (and I think Stokes has moved on to other ventures).
"For a look at two instructions as they travel through the G4e, check out this animated GIF. Modem users should beware, though, because the GIF weights in at 355K."
That is because we embed fonts in the CSS file to cut down on HTTP requests. It's about 100k without the fonts. Sure, it's still much larger than what existed 10 years ago, but it's pretty standard these days.
Also note that we're gzipping, so the transmitted size is much smaller. And we also correctly return 304 responses after the first request.