The IBM Pentium 4 64-Bit CPU (cpushack.com)
207 points by sohkamyung 12 days ago | 78 comments

You may wonder, what was in this for IBM? The answer is fairly straightforward. IBM used to make proprietary chipsets for Intel chips!

The pride of the xServer/xSeries systems were "complex" setups -- multiple chassis -- with up to 32 sockets, and 512 GB of RAM. These required a lot of IBM internal engineering, where they made pin-compatible sockets for what Intel was offering at the time, and glued those chips into _hugely_ different topologies than Intel had in mind.

These systems sound small today, but back in 2001, this was a really big deal for x86. The IBM-proprietary chipsets were much more expensive than off-the-shelf systems, but still a fair bit cheaper than going with NCR or Unisys, competing vendors with proprietary x86 MP designs.

IBM had a lot at stake when that socket changed. Achieving pin-compatibility is hard! Intel is very jealous of their documentation, and prefers to offer paper only, with water-marked copies. It's like owning a gutenberg bible. Engineering something pin-compatible with an Intel x86 CPU has never been easy.

It was no doubt worth it to their "big" x86 server business to ask Intel to make a special run of chips with the old socket layout, but the new EM64T extensions. I bet it was a complete no-brainer compared to the costs of integrating a new socket!

In the era of dual socket, single core xeons, ServerWorks also produced non Intel chipsets, North and South bridge, for motherboard makers to use. There were tyan and supermicro boards that used them. Common at the time in 1ru and 2ru size servers. Also found on big quad socket supermicro boards.

From 2002: https://www.extremetech.com/extreme/73498-serverworks-ships-...

I actually think the ibm systems used the grandchampion chipset.


Serverworks was later acquired by broadcom.

IBM was locked in a brutal fight with HP and Dell to take share in the early x86 server market. Their chipsets were neat but they looked silly next to an HP Opteron box. Integrated memory control ended the chipset race.

Still their chipsets had neat features worth remembering, such as the ability to use local main memory as a last-level cache of memory read from remote nodes. And of course they went to 64 sockets which was respectable.

These are pretty bold words, my friend.

Opteron had vastly better architecture than the Intel chipsets of the day, but it topped out at four or eight sockets, I forget.

IBM, at the time, could offer you Opteron-like architecture and performance, with up to 32 sockets, using Intel chips. That was worthwhile to some customers. "Intel" wasn't the selling position. It was x86 or x86-64, with "big" as the selling position.

I'm not here to apologize for Intel. I'm just saying, those IBM proprietary chipsets had their nice bits.

Early Opteron (with Socket 940) topped at 8 sockets glueless - with the same chipset one could drive a socket 754 chip.

However, with custom glue logic, one could expand it very far - AMD offering in that space was "Horus" chipset which connected 4 sockets with external fabric (infiniband, iirc) to create 64 socket systems. Similar tactic was (and still is) offered by SGI in their UltraViolet systems which utilize the same principle using NUMAlink fabric and Xeon cpus.

But in the end, Xeon systems with anything more than 4 sockets, with 8, 16 or 32 CPUs were a rare weird niche market compared to on one hand, things like zseries mainframes and big Sun and SGI machines, and on the other hand, people who learned to write software to distribute workloads across a couple of dozen $2000 to $4000 1RU dual socket servers rather than buying one beastly proprietary thing with a costly support contract.

As I recall HP also used serverworks for a lot of dual socket 2ru systems, for people who didn't want to buy opteron (and from 2000 to 2003 before the release of opteron)

I started following hardware news maybe a year or two before these IBM P4 were manufactured. I remember all the buzz on sites like Geek.com and TheInquirer.net about the soon to be released AMD Clawhammers, Intel constantly pushing Itanium, and the slow feature crawl of Intel Xeons marching over Big Iron regardless of what the Itanium team wanted.

IBM flirted with Itanium. They spent a lot of money on software support for Itanium. They sold a few hardware Itanium systems to customers who really wanted one.

There just wasn't enduring customer demand to make an ongoing hardware/software product out of it.

I think theregister and theinquirer helped steer a lot of people away from itanium in that era, showing the $/performance that could be achieved with a larger number of much less costly dual socket Xeon and opteron boxes. People who really wanted to centralize everything on one godly giant machine went to things like zseries mainframe, not itanium. Everyone else started running Linux on x86...

I don't think the press had anything to do with it.

Customers got engineering samples in their hands and they were very, very slow.

Everyone was promised sufficiently-smart compilers would make great use of them, but that never works out as a plan.

They very much underestimated the complexity of such compilers. But the concept is fine. Itanium had a lot of raw power, but you had to do demoscene level trickery to get that performance.

> But the concept is fine.

The concept is not fine. Itanium was predicated on saving transistors in the OoO machinery, and spending those on making the machine wider and thus performance. However, it turns out that without any OoO, the machine is terrible at hiding memory latency, and the only way to get this back was to ship it with heroically large and low latency caches. And implementing those caches was harder and many times more expensive than just using OoO.

In the end, Itanium saved transistors and power from one side, only to have to spend much more on another side to recoup even part of the performance that was lost.

The concept was fine based on knowledge available at the time. Processor design occurs on a long cycle, sometimes requiring some guesswork about surrounding technology. The issue of having to hope/guess that compilers would figure out how to optimize well for "difficult" processors had already arisen with RISC and its exposed pipelines. In fact, reliance on smart compilers was a core part of the RISC value proposition. It had worked out pretty well that time. Why wouldn't it again?

VLIW had been tried before Itanium. I had some very minimal exposure to both Multiflow and Cydrome, somewhat more with i860. The general feeling at the time among people who had very relevant knowledge or experience was that compilers were close to being able to deal with something like Itanium sufficiently well. Turns out they were wrong.

Perhaps the concept is not fine, but we should be careful to distinguish knowledge gained from hindsight vs. criticism of those who at least had the bravery to try.

> Perhaps the concept is not fine, but we should be careful to distinguish knowledge gained from hindsight vs. criticism of those who at least had the bravery to try.

So how many times do you have to fail before being brave is just a bad business decision. The "concept" wasn't fine for Multiflow or the i860 (I used both, and would call it terrible for the i860). It didn't work for Cydrome. Trimedia is gone. Transmeta flamed out. There's, what, a couple of DSP VLIW chips that are actually still sold?

But, hey, let's bet the company on Itanium and compilers that will be here Real Soon Now. I remember the development Merced boxes we got.

> The general feeling at the time among people who had very relevant knowledge or experience was that compilers were close to being able to deal with something like Itanium sufficiently well.

That's revisionism. There was a general feeling we were getting good at building optimizing compilers, but I don't recall any consensus that VLIW was the way forward. The reaction to Itanium was much less than universally positive, and not just from the press.

> Turns out they were wrong.

Very, very wrong. Again.

> how many times do you have to fail before being brave is just a bad business decision

That's a very good question. More than once, certainly. How many times did Edison fail before he could produce a working light bulb? How many times did Shockley/Bardeen/Brattain fail before they could produce a working transistor? Even more relevantly, how many ENIAC-era computer projects failed before the idea really took off? Ditto for early consumer computers, mini-supers, etc. Several times at least in each case, sometimes many more. Sure, Multiflow and Cydrome failed. C6X was fairly successful. Transmeta was contemporaneous with Itanium and had other confounding features as well, so it doesn't count. There might have been a couple of others, but I'd say three or four or seven attempts before giving up is par for the course. What kind of scientist bases a conclusion on so few experiments?

> The reaction to Itanium was much less than universally positive, and not just from the press.

Yes, the reaction after release was almost universal disappointment/contempt, but that's not relevant. That was after the "we can build smart enough compilers" prediction had already been proved false. During development of Itanium, based on the success of such an approach for various RISCs and C6X, people were still optimistic. You're the one being revisionist. It would be crazy to start building a VLIW processor now, but it really didn't seem so in the 90s. There were and always will be some competitors and habitual nay-sayers dumping on anything new, but that's not an honest portrayal of the contemporary zeitgeist.

Mhm, yeah, it depends on how you look at it, I guess. I meant if you fine-tuned your code to take advantage of the strengths, it could be very good for those workloads. But maybe they built something which from afar, if you squint a lot, has more of the strengths of a programmable GPU today, while they pitched it as a general CPU.

> But the concept is fine.

Is it? What’s the point of a processor that we don’t know how to build compilers for? We still don’t know how to schedule effectively for that kind of architecture today.

The point is that they didn't know it's impossible. They had good reasons for believing otherwise (see my other comment in this thread) and it's the nature of technological progress that sometimes even experts have to take a chance on being wrong. Lessons learned. We can move on without having to slag others for trying.

So we would never build a computer because back in 1950 we didn't know how to make compilers, only raw bytes of machine code (we couldn't even "compile" assembly language). Life sometimes requires that you create a prototype of something that you think will work to see if it really does in the real world.

But with 1950s-era machines, it was expected that programmers were capable of manually scheduling instructions optimally, because compilers simply didn't exist back then.

VLIW architectures are often proposed to simplify the superscalar logic, but the problem with VLIW is that it forces a static schedule, which is incompatible with any code/architecture where the optimal schedule might be dynamic based on the actual data. In other words, any code that involves unpredictable branches, or memory accesses that may hit or miss the cache--in general CPU terms, that describes virtually all code. VLIW architectures have only persisted in DSPs, where the set of algorithms that are trying to be optimized is effectively a small, closed set.

> So we would never build a computer because back in 1950 we didn't know how to make compilers

No that's different - the big idea with the Itanium was specifically to shift the major scheduler work to the compiler. We didn't build the first computers with the idea we'd build compilers later.

But we did build an awful lot of RISC machines with exactly that idea. And it worked.

It begs the question whether or not any current compiler optimizations for a new theoretical VLIW-ish machine (Mill?) would prove to be an effective leg-up on the Itanium.

Being a bit more charitable, I think the problem is that people look at VLIW generated code and think 'wow that's so wasteful look at all the empty slots' without realising those 'slots' (in the form of idle execution units pipeline stages) are empty in OOO processors right now anyway. The additional cost is in the ICACHE, as already described.

Also, these days you would pretty much just need to fix LLVM, C2, ICC, and the MS compiler, and almost everyone would be happy.

The focus on vertical scale servers and FP performance was a mistake. Had they focused on single socket servers with INT performance the history might have been different. Also today’s compilers are much more capable so maybe Itanium was simply too early.

This article makes me nostalgic for the days when every office building on the planet was carpet bombed with black and charcoal Dell Dimensions.

I spent a lot of time pulling drives and reselling systems that were end of life. I mostly dealt with switches and networking gear, but I was still encountering old 32-bit dual and quad socket X-series semi-regularly when I moved on in 2015ish.


"Having such a unique processor at your disposal, it’s absurd not to build a powerful x64-retro system on it. One of the options for using such a system in general can be to build a universal “PC-harvester” that supports all Microsoft operating systems from DOS to Windows 10."

All Microsoft OS'es on the same PC? Cool! Also, if that's indeed the case, my guess is that most versions of x86 Linux and other x86 OS'es, historic to present, would work too... which is no small feat for a single PC...

I really admire all the lengths that the author went to for this. It's hard to see these types of efforts nowadays.

Right. I imagined this as a loose script to a Netflix special.

I was once looking for a used Intel Core2Duo processor on eBay, not sure, I think it was the E6300, 7x266Mhz, great FSB overclocking potential, and then settled on buying a CPU that was advertised as such, but its lid had "Intel Confidential" on it.

IDK, did the seller delid and relid it with something exotic? Or did I snag one of the first prototypes? What was up with that lid? Does anyone have a clue?

MB identified an E6300, and it had amazing overclocking potential, it went 7x333Mhz without any voltage increase. Not sure what the max was, but considering only a few people were bidding on it (I'm guessing most were turned away due to that lid), I was quite lucky to get a CPU with a lot of potential for very little cash.

I've had Intel engineering samples from work (we got them under NDA and such). They were tossing some Sandy Bridge engineering samples at some point and let us take them home. The hardware was buggy and didn't get microcode updates IIRC. The case and mobo was leaf blower loud and very unwieldy. I could run Linux for a few hours before it would segfault. I ended up trashing it. So I don't think an Intel engineering sample is better, I think it is worse.

Did those have "Intel Confidential" on them?

The CPU I bought was working fine though, the seller guaranteed it and he had the reputation on eBay to back it.

My (possibly) naive logic then, was that an early sample was likely to be made from the best silicon, which often correlates with good OC potential...

Take a look at the picture in the Intel Slashes Prices article posted here. It has "Intel Confidential" etched on it.

> https://news.ycombinator.com/item?id=21132809

Yep, that's exactly it, except it was Core2Duo, IDK, must've been around 2009 or so, when I bought it used.

> Having such a unique processor at your disposal, it’s absurd not to build a powerful x64-retro system on it. One of the options for using such a system in general can be to build a universal “PC-harvester” that supports all Microsoft operating systems from DOS to Windows 10.

I thought all modern intel/amd cpus are backwards compatible back to the 8086, and so capable of doing that?

Unfortunately not, the biggest change being "pure" UEFI-without-legacy-BIOS firmware that a lot of motherboards already have.

That and the question of drivers for OSs newer than DOS (which is not as big of a problem, since they can still be written and doing so is easier than changing the BIOS. The existence of USB drivers for DOS, and HD Audio for Windows 3.1x[1] are examples of that.)

Modern x86 CPUs are theoretically still backwards compatible, but I suspect they don't test things like 16-bit mode and VME[2] much anymore.

[1] http://www.vcfed.org/forum/showthread.php?50867-Windows-3-1-...

[2] https://news.ycombinator.com/item?id=14328237

Those disturbingly happy articles should really be titled something like "the openness of the PC is about to end forever" --- because that was likely the end-goal all along.

These new machines are not PCs. It's not out of the question for some company to build a new BIOS-architecture machine in the future, though it will probably eventually require sourcing x86 chips from vendors other than the Big Two as crypto keys will probably be required to even boot the chip if they aren't already.

You would need the firmware to also be backwards compatible.

It's presumably easier to find a socket 478 motherboard with ISA slots and very old-style firmware.

I had to do this. Needed ISA, and a newer OS. Search for 'industrial motherboard' and it gets way easier, but somewhat more expensive.

Please tell me what you ended up with, I have a waferprober that needs some love.

What specifically does it need to interface with? An ISA slot? Non-USB-dongle RS232?

I need three ISA slots for Weird Proprietary 80s Shit, probably parallel+serial as well

Agh, didn't see your reply :(

A couple seconds of digging finds motherboards that cost an average of $300. Assuming you (unfortunately) _have_ to support whatever it is you're supporting...

- Socket 478 (P4/Celeron), 2GB max, RS422/485, MiniPCI, $???: https://adek.com/product/MB-800V

- Socket 478, 2GB max, $???: https://www.bressner.co.uk/products/motherboards/mb-g4v620-b...

- Socket 478, 6xUSB2, $???: http://vegashine.sell.everychina.com/p-104455439/showimage.h...

- Socket 478, 2GB max, CF, 6xUSB2, $???: http://vegashine.sell.everychina.com/p-104455440-4-pci-3-isa...

- LGA775 (Core 2 Quad), 4GB max, 1xPCIx16, M-SATA, 2xLAN, IrDA, 5xRS232, $320: https://www.ebay.com/itm/283550673178

- LGA775, 6xRS232, 8xUSB2.0, CF, RS422/485, $179: https://www.aliexpress.com/item/32892452763.html

- LGA775 (Intel G41), 4GB max, DVI, 4xSATA 3G, 2xLAN, 8xRS232, (does not support ISA bus mastering), $???: http://vegashine.sell.everychina.com/p-104455435/showimage.h...

All of the above have 3xISA, at least one RS232 and at least one LPT.

What's striking is that 3.4GHz 64 bit Intel CPUs were shipping 15 years ago. That's still about where we are, but with more cores per package.

The single core IPC has gone up >3.5 times in that time. So even if the advancement isn't as fast as it was in the gigahertz race times, it's not exactly standing still either.


IPC has improved significantly over the last 15 years. You cannot compare the raw clocks.

Frequency stalled because we stopped being able to efficiently cool the CPUs at that point. Heat output is proportional to power consumption, which is roughly proportional to the cube of frequency (dynamic power scales with frequency times the square of voltage, and to drive higher frequencies you usually need to drive up the voltage as well).
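
A rough sketch of that cube rule, assuming the usual first-order CMOS model where dynamic power is P = C * V^2 * f and assuming voltage has to rise roughly in proportion to frequency. Both are simplifications, and the function name is made up for illustration:

```python
def relative_dynamic_power(freq_ratio: float, voltage_tracks_frequency: bool = True) -> float:
    """Relative dynamic power after scaling the clock by freq_ratio.

    First-order CMOS model: P = C * V^2 * f, with capacitance C held constant.
    Assumption: when voltage_tracks_frequency is True, V scales linearly with f.
    """
    voltage_ratio = freq_ratio if voltage_tracks_frequency else 1.0
    return voltage_ratio ** 2 * freq_ratio


for ratio in (1.0, 1.25, 1.5, 2.0):
    scaled = relative_dynamic_power(ratio)
    fixed_v = relative_dynamic_power(ratio, voltage_tracks_frequency=False)
    print(f"{ratio:.2f}x clock -> {scaled:.2f}x power ({fixed_v:.2f}x at fixed voltage)")
```

Under those assumptions, doubling the clock costs about 8x the power, versus 2x if the voltage could stay put. Real chips also have leakage power, which this ignores.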

That can't be the only reason, with every shrink power requirements have been reduced. P4 had up to 115W TDP. The same frequency and raw power could probably be achieved on 15W today. But you can't get most current CPUs to run stable beyond 4GHz base clock, even with liquid cooling.

Not sure what cpus you're looking at but everything in the desktop space is over 4ghz nowadays. My last cpu was at 4.6ghz its entire life and my 3900x stays at over 4ghz on all 12 cores when doing a render.

Your CPU has 3.8GHz base clock (guaranteed), it can't sustain 4.6GHz turbo clock on all cores.

Yes, I know, but it never actually goes below 4.0 for me on all cores when doing a render. The 4.6 was referring to my 4770k which was at 4.6 on all cores all the time.

Prescott was 64-bit from the first stepping (C0), but EM64T was disabled with fuses. It wasn't just introduced for E0, and it was disabled in the G steppings that were 32-bit the same way. That big a change is too large to do in a mask change (the letter incrementing in a stepping indicates a mask change, the number a metal change) rather than a microarchitectural one. There are probably engineering samples of C and D steppings floating around out there that don't have those fuses blown.

The same is true for the LGA775-based 32-bit Prescotts. 64-bit disabled by fuses.

Was actually a pretty good read. How far are we from a 128-bit architecture?

IBM's AS/400 (and all of its renames) is a 128bit architecture. The huge address space is beneficial for implementing capability security on memory itself, plus using single level store for the whole system (addresses span RAM and secondary storage like disks, NVMe, etc)

One of my mentors is one of the IBM engineers who developed the original AS/400’s capability-based security architecture way back in the early eighties. I can confirm that (according to her) the 128-bit addressing was indeed a very convenient manner of implementing the system. However, nobody ever expected (nor expects, I suspect) that those addresses will ever be used to actually access that amount of memory. It’s a truly astronomical amount of memory, on the order of grains-of-sand-on-countless-planets...

To put it another way, it's not just enough to count the grains of sand on a beach, it's enough to count all the atoms in all the grains of sand on planet Earth. Give or take a few orders of magnitude[1].


$ echo '2^128' | bc | rev | sed 's/.../&,/g;s/,$//' | rev



Actually let me line up the https://en.wikipedia.org/wiki/Orders_of_magnitude_(data)

          ..  ??  YB  ZB  EB  PB  TB  GB  MB  KB
(There's no meaningful notation of size here - the denotations are just to show just how much data you can fit in 128 bits of space.)
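
For anyone without `bc` handy, the same digit grouping can be sketched in Python. This is just an equivalent of the pipeline above, not what the commenter ran, and the variable name is arbitrary:

```python
address_space = 2 ** 128  # number of distinct 128-bit addresses

# Same thousands grouping as the rev | sed pipeline, via Python's format spec.
print(f"{address_space:,}")
# 340,282,366,920,938,463,463,374,607,431,768,211,456

# Number of decimal digits in 2^128.
print(len(str(address_space)))
# 39
```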

Blinks a few times

Ultimately fails to mentally grasp and make useful sense of the number due to its sheer size

As an aside, apparently DNA can store a few TB/PB (I don't remember which). The age of optimizing for individual bytes as a routine part of "good programming" is definitely over, I guess. (I realize this discussion is about address space and not capacity, but still)

> plus using single level store for the whole system (addresses span RAM and secondary storage like disks, NVMe, etc)

wouldn't 64bit (16 Exabytes theoretical max) already allow for this? are there any projects in that direction?

We're still VERY far from utilizing the 16 exabytes of addressable space that 64-bit offers us. If we talk about RAM, CPU architecture limits it to 4PB.
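
The arithmetic behind those two figures, as a quick sketch. The 52-bit physical address width is an assumption here (it is what a 4 PB ceiling corresponds to), and the constant names are only illustrative:

```python
EIB = 2 ** 60  # bytes in one exbibyte
PIB = 2 ** 50  # bytes in one pebibyte

virtual_bytes = 2 ** 64   # full 64-bit address space
physical_bytes = 2 ** 52  # assumption: 52-bit physical addresses

print(virtual_bytes // EIB, "EiB")   # 16 EiB
print(physical_bytes // PIB, "PiB")  # 4 PiB
```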

What for? AVX512 instructions can work on, well, 512bits of data at a time, while 64bits worth of address lines offer way more ram than available today. What's your use case?

AVX512 is not comparable to 128-bit native. AVX/SSE are split into lanes of at most 64 bits. You cannot compute a 128-bit result, only multiple 64-bit ones in parallel.

That's a microarchitectural detail, point is you can view modern CPUs as more than 64bit wide, at least databus-wise.

By that logic the Pentium 3 was a 128-bit CPU. Vector width isn’t how these things are measured because it's much harder to make a 128-bit ALU than to add two 64-bit ones.

It's not just a detail if you are looking for hardware acceleration on floating point operations on quads (float128 type). AFAIK, nobody has a hardware quad FPU, but there are certainly applications for one. I know that things like big integer arithmetic could greatly be accelerated by a 128-bit computer.

I don't think big integer could be greatly accelerated by 128-bit computer. If a 128-bit add takes two cycles of latency, or it causes the frequency to drop (since you need to drive a longer critical path in the ALU in terms of gate delay, which means you need longer cycle times), then you're going to lose a lot in any code that isn't directly involved in the computation, such as loading memory values.

Furthermore, the upside is only at best 2x. It's likely to be worse, because you're still going to be throttled by waiting for memory loads and stores. Knowing that we have 2 64-bit adders available to us to use each cycle, we can still do 128-bit additions at full throughput, although it requires slightly more latency for the carry propagation.
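
A minimal sketch of what that looks like, modelling the two 64-bit adders as plain Python arithmetic on (low, high) limbs. The helper names are hypothetical; on real hardware this would be an add followed by an add-with-carry:

```python
MASK64 = (1 << 64) - 1


def split(x):
    """Split a 128-bit integer into (low, high) 64-bit limbs."""
    return x & MASK64, (x >> 64) & MASK64


def add128(a_lo, a_hi, b_lo, b_hi):
    """Add two 128-bit values given as 64-bit limbs.

    Returns (sum_lo, sum_hi, carry_out).
    """
    lo = a_lo + b_lo
    carry = lo >> 64          # carry out of the low 64-bit add
    hi = a_hi + b_hi + carry  # the high add has to wait for that carry
    return lo & MASK64, hi & MASK64, hi >> 64


a = (1 << 64) - 1  # low limb all ones, so adding 1 carries into the high limb
b = 1
lo, hi, carry_out = add128(*split(a), *split(b))
print(hex(lo), hex(hi), carry_out)
```

The dependency of the high add on the low add's carry is the extra latency mentioned above. Independent 128-bit additions can still be overlapped, which is why throughput does not have to suffer.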

Hardware quad-precision floating point is a more useful scalar 128-bit value to support.

I agree hardware quad is more useful. The big int problem I was talking about would benefit from hardware quad but not hardware 128-bit ALU. The big int problem I have a bit of knowledge of is squaring for the Lucas-Lehmer algorithm to find huge primes (Mersenne primes). The best algorithm in this space is the IBDWT (irrational base discrete weighted transform). You perform an autoconvolution (compute the FFT, square each term in the frequency domain, and then take the IFFT). You want the FFT lengths to be as short as possible, since FFT is an O(N log N) algorithm. Quads would let you use shorter FFTs since you have more bits available for each element.
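
To make the autoconvolution step concrete, here is a small sketch of squaring an integer with a floating-point FFT. It is a plain zero-padded FFT convolution rather than the IBDWT (which, as I understand it, avoids the zero padding and folds in the reduction mod 2^p - 1), and the function names and the 8-bit digit size are arbitrary choices for illustration:

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def square_via_fft(x, digit_bits=8):
    """Square a non-negative integer by autoconvolving its base-2^digit_bits digits."""
    base = 1 << digit_bits
    digits = []
    while x:
        digits.append(x & (base - 1))
        x >>= digit_bits
    digits = digits or [0]
    size = 1
    while size < 2 * len(digits):  # zero-pad so the cyclic convolution does not wrap
        size *= 2
    spectrum = fft(digits + [0] * (size - len(digits)))
    squared = fft([v * v for v in spectrum], invert=True)
    result = 0
    for i, v in enumerate(squared):
        coeff = round(v.real / size)  # inverse above is unnormalized, so divide here
        result += coeff << (i * digit_bits)  # big-int addition handles the carries
    return result


m127 = 2 ** 127 - 1
print(square_via_fft(m127) == m127 * m127)
```

The `digit_bits` parameter is where precision comes in: each convolution coefficient must round correctly, so wider floats would allow more bits per element and therefore a shorter transform for the same number.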

Even though it is a big int problem, floating point is used. There are integer FFT algorithms that are usable (NTTs), but they are much slower than floating point FFTs on modern CPUs.

> AFAIK, nobody has a hardware quad FPU

Power9 has for a few years.

Today I learned... That's awesome, I'd love to try one out in the cloud sometime and get benchmarks.

No particular use case. I think it's one of those things where they first build the machine, and then all of a sudden a bunch of different use cases are found.

Why do I need to build a 128-bit machine to imagine a usecase for it? What you refer to is not applicable to usecases, only to business cases. We first had to build broadband internet before video streaming websites became a viable business model, but surely someone thought of video streaming during the dialup era?

If I recall correctly, Apple was shipping Mac Pros with 128-bit PowerPC processors around two decades ago.

I think they had AltiVec instructions operating on 128 bits, but that's somewhat like Intel SSE - it's not referring to the overall address bus.

This really isn't anything special, at the time of release it was a low end single socket 1ru server. It overlapped with available 64bit lga775 versions of the same product.
