The Pentium 4 and the G4e: An Architectural Comparison (2001) (arstechnica.com)
47 points by CoolGuySteve on Sept 1, 2014 | 24 comments



It was hard to tell who would win.

History tells us that the x86 architecture won, but the P4 lost out to AMD and later to the Intel Core line. Furthermore, PowerPC is still used. From the article, I can't tell which of these architectures would have trouble scaling up clock speed and power consumption. AMD getting almost 50% of the market share was equally shocking.

Lastly, there is the lingering thought that Windows' market dominance facilitated the investment that solved x86's shortcomings, while the G4 languished.

I think the winners and losers were decided by factors other than architecture. x86 beat Power because of Windows, and Intel floated on their past success for a few years until they made a better architecture.


I completely agree. When it comes to competing on architecture, it's unfortunate that Intel enjoys such incredible economies of scale.

According to Wikipedia, while the G4 was being manufactured at 200 or 180 nm, the P4 went from 180 nm to 130 nm and then 90 nm.

Even now, AMD and Nvidia are stuck around 28 nm until next year, while Intel is expected to release their first 14 nm Broadwell chip this quarter.


> It was hard to tell who would win.

Hard for whom? 99% of the world bet on x86 winning. x86 won.

> AMD getting almost 50% of the market share was equally shocking.

They assembled a good team, made a good product, priced it reasonably, and it sold well. There isn't anything particularly shocking about that; it's how things are supposed to work.

As for the actually unpredictable: at the time of the article's writing, a person would probably have had trouble predicting how well and how quickly AMD and Intel would integrate one another's features, including 64-bit instructions, but it shouldn't be regarded as shocking. From the moment in 2002 when Dave Cutler started making public pronouncements about how good the Opteron was, it was clear that Microsoft was happy to have x86 manufacturers competing on features and price. (Microsoft had already abandoned PPC as a big waste of time; even the DEC Alpha port of NT lasted longer.)

> x86 beat Power because of Windows

Power was just never much of a contender as a desktop CPU. People apparently forget how hard a time even Apple had getting enough chips, and how hard a time IBM had ramping up PPC performance. PPC ended up being great in many areas, but the economics of desktop PC CPU production are pretty brutal.


Also remember what happened on the PowerPC side; it wasn't just x86 victories but also fateful PowerPC moves:

Motorola bowed out of the market, focusing on embedded.

IBM put out the PPC970, aka the G5, a cut-down version of their POWER4 server chip. It had competitive performance vs. x86 on the desktop.

PA Semi made a good laptop chip which, despite Apple buying PA Semi, never made it to a laptop before Apple's x86 transition.

PA Semi chip coverage: http://arstechnica.com/uncategorized/2005/10/5486-2/

Really, Apple jumped ship because they were the only major customer for the PowerPC CPUs they were buying, and not really large enough to securely sustain their suppliers.

Now they are large enough, hence their own ARM CPUs.


If you're comparing "narrow and deep" to "wide and shallow", neither won, really. Modern high-performance designs are basically "wide and deep" with lots of speculation, better branch prediction, and various tricks to almost eliminate the need to flush the pipeline.

Admittedly, none have quite the pipeline length of the P4, but Haswell's 14- to 19-stage pipeline definitely qualifies as "deep". Even ARM has gone deep: the Cortex-A15 has a 15-stage integer pipeline, plus up to an additional 10 stages for some NEON/FPU instructions.

If anything won, it was designing to the reality of silicon transistor frequency scaling at 90 nm and smaller.


PowerPC at the high end still exists in the form of POWER8 for servers, but they're very expensive and don't perform all that well, in either absolute terms or energy efficiency. It looks like they've gone down the same path as the P4 with the POWER series, with an emphasis on high clock frequency and (extreme) hyperthreading, resulting in ~200 W TDP and frequencies in the 4-5 GHz range.


4-5 GHz is actually a low frequency today; a NetBurst/POWER6-like processor would be running at 8-10 GHz while consuming 500 W or more (if it were possible).


Given that we're speaking hypothetically (since physics gets in the way), I'm not sure why you're saying NetBurst would be consuming 500 W: as manufacturing processes shrink, so does the power consumption.


Last few die shrinks have done little for power consumption.
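Rough back-of-envelope, using textbook scaling rather than anything from this thread: dynamic power goes as

    P ≈ α · C · V² · f

and classic Dennard scaling cut the supply voltage V along with feature size. Voltages have been stuck near ~1 V since roughly the 90 nm era, so a shrink now mostly buys transistor budget rather than proportional power savings.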


It's because they keep shoving those pesky transistors in there.


There is some advantage, but

going from 32 nm Sandy Bridge to 22 nm Ivy Bridge took the 2700K-class chip from 95 W to 77 W.

Which seems like OK progress, but at the high end:

Sandy Bridge: 32 nm Core i7-3820 (4 cores, 10 MB cache) @ 3.6 GHz = 130 W

Ivy Bridge: 22 nm Core i7-4820K (4 cores, 10 MB cache) @ 3.7 GHz = 130 W

So you gained 0.1 GHz; granted, they're not identical, but they have very similar transistor counts.


For what it's worth, while Intel did win the architecture wars, the "deep and narrow" pipeline described here died out with NetBurst. The Core microarchitecture that replaced it (the predecessor of today's Intel CPUs) used a much more "wide and shallow" pipeline, and benefited greatly from it.


Indeed, the P4 was a very odd CPU to program and optimise for; in some ways, it's the "most RISC-like" microarchitecture Intel has attempted. It was far more sensitive to things like instruction alignment and branch prediction than its successors and predecessors, and, likely in pursuit of higher clock frequencies, some instructions (e.g. shifts/rotates) were made several times slower. This meant that code sequences that were near-optimal for the P6 family (and often earlier) would perform horribly on the P4, and vice versa. It could beat the PIII in "straight-line" execution of simple integer instructions with no branches, but the PIII was faster (even at a lower clock frequency) on more complex and branch-heavy code. It's probably the only x86 where a significant speed advantage can be obtained by extreme loop unrolling, a practice that is mostly counterproductive on post-Nehalem cores.
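To make the unrolling point concrete, here's a sketch of my own (nothing from Intel's optimisation manuals) of the kind of transformation that paid off on the P4's long pipeline:

    #include <stddef.h>

    /* Plain reduction loop: one taken branch per element. Fine on the
       P6 family; on the P4 the tiny loop body left the deep pipeline
       underfed. */
    long sum_simple(const int *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Unrolled 4x with independent accumulators: fewer branches and
       more independent work in flight per iteration. On post-Nehalem
       cores this mostly just bloats the code. */
    long sum_unrolled(const int *a, size_t n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)  /* leftover elements */
            s0 += a[i];
        return s0 + s1 + s2 + s3;
    }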

One of the more amusing P4 oddities is that certain 32-bit add/sub instructions will have a very slightly higher latency if there is a carry/borrow between the two 16-bit halves - it's very difficult to detect (I believe it's ~0.5 cycle), but it's there. This is probably a consequence of pipelining in the ALU itself.


> One of the more amusing P4 oddities is that certain 32-bit add/sub instructions will have a very slightly higher latency if there is a carry/borrow between the two 16-bit halves

This kind of data-dependent delay gives crypto people hives, for what it's worth. It's the sort of thing that can make timing attacks possible.
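Concretely, the defensive pattern crypto code uses looks like this; a minimal sketch of my own, not anything specific to the P4:

    #include <stddef.h>
    #include <stdint.h>

    /* Early-exit compare: runtime reveals the position of the first
       mismatch, which an attacker can measure. */
    int leaky_eq(const uint8_t *a, const uint8_t *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (a[i] != b[i]) return 0;
        return 1;
    }

    /* Constant-time compare: always touches every byte and uses only
       XOR/OR, which have no carry chains for latency to depend on --
       unlike the P4 add/sub case above. */
    int ct_eq(const uint8_t *a, const uint8_t *b, size_t n) {
        uint8_t diff = 0;
        for (size_t i = 0; i < n; i++)
            diff |= a[i] ^ b[i];
        return diff == 0;
    }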


> This is probably a consequence of pipelining in the ALU itself.

Yep, the P4 "fast ALU" was double-pumped and did 16-bit chunks of a 32-bit add/sub in adjacent half-cycles [1]. This meant that dependent chains of, e.g., adds could still issue back-to-back, since the first half of a dependent instruction requires only the low 16 bits of its sources. Always struck me as very clever!
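As a toy model of the scheme (my own sketch, not from the paper):

    #include <stdint.h>
    #include <stdio.h>

    /* Staggered 32-bit add, as in the P4's double-pumped ALU: the low
       16 bits complete in one half-cycle, the high 16 bits plus the
       carry out of the low half in the next. A dependent add needs
       only the low half of its sources to start, so dependent adds
       can issue back-to-back. Result is correct modulo 2^32. */
    uint32_t staggered_add(uint32_t a, uint32_t b) {
        uint32_t lo = (a & 0xFFFF) + (b & 0xFFFF);    /* half-cycle 1 */
        uint32_t carry = lo >> 16;                    /* crosses the halves */
        uint32_t hi = (a >> 16) + (b >> 16) + carry;  /* half-cycle 2 */
        return (hi << 16) | (lo & 0xFFFF);
    }

    int main(void) {
        /* 0x0000FFFF + 1 forces a carry between the 16-bit halves --
           the case the parent comment says costs extra latency. */
        printf("%#x\n", (unsigned)staggered_add(0x0000FFFFu, 1u));  /* 0x10000 */
        return 0;
    }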

[1] http://www.ecs.umass.edu/ece/koren/ece568/papers/Pentium4.pd...


I miss Jon Stokes's articles. They were the best thing on Ars, along with Siracusa's writing.


I must admit that while the article is fantastic (and inspired me to get into low level development), I had an ulterior motive in posting it.

I wish Ars would bring back these in-depth architecture overviews. Maybe by bringing traffic to them, Ars will notice there is still demand.


While I agree, the possibility is remote. Condé Nast is generally not interested in deep content (and I think that Stokes has moved on to other ventures).



Unfortunately, RWT's David Kanter was recently hired by MPR and announced a near-hiatus from writing in-depth RWT articles.


"For a look at two instructions as they travel through the G4e, check out this animated GIF. Modem users should beware, though, because the GIF weights in at 355K."


Funny thing is, the CSS file for Ars weighs in at 381 KB.


That is because we embed fonts in the CSS file to cut down on HTTP requests. It's about 100 KB without the fonts. Sure, it's still much larger than what existed 10 years ago, but it's pretty standard these days.

Also note that we're gzipping, so the transmitted size is much smaller. And we also correctly return 304 responses after the first request.


Funny? Sad, rather.

(There was no 381 KB CSS back in 2001, obviously.)



