
I've never been able to figure out if IA64 was poorly designed, or if the migration path was just too hard. AMD64 seemed like a bit of a hack when it came out, but at the same time it was just a better solution for most.

We had two Itanium servers for a project I worked on, both donated by HP. They weren't anything special, certainly didn't feel like an upgrade over our Sun machines.




IA64 is effectively a VLIW architecture, and... VLIW architectures just keep ending up not working out. There are a few things that make VLIW architectures hard to do well:

* You're basically encoding microarchitectural details (which operations each execution port can run, how many execution ports, etc.) into the ISA, which makes changing that microarchitecture difficult. (See also branch delay slots, which have a similar issue).

* Several instructions have data-dependent execution time, and are very difficult to statically schedule. Dynamic scheduling can handle it much better. The common classes are division, branches, and memory accesses, and the latter two are among the most frequently executed instructions.

* Static scheduling is limited by the inability to schedule around barriers, such as function calls. Dynamic scheduling can overlap in these scenarios.

At the end of the day, the idea that you can rip out all of this fancy OoO-execution hardware and make it the compiler's problem just turns out worse than having the OoO hardware, with a smarter compiler arranging the instruction stream so that the OoO hardware can extract the best performance.
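
To put the scheduling point concretely, here's a plain-C sketch (nothing Itanium-specific, the names are mine): in a linked-list walk the compiler can't know at build time whether each load hits L1 or goes all the way to DRAM, so a static schedule has to assume some fixed latency, while an OoO core just keeps executing independent work while the miss is outstanding.

    /* Each iteration does a load whose latency depends on where the data
       happens to live: a few cycles if n->next is in L1, hundreds if it
       has to come from DRAM. A static scheduler must pick one latency and
       live with it; dynamic scheduling adapts per access. */
    struct node { struct node *next; long val; };

    long sum_list(const struct node *n) {
        long sum = 0;
        while (n) {
            sum += n->val;
            n = n->next;   /* next iteration can't start until this load lands */
        }
        return sum;
    }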


> At the end of the day, the idea that you can rip out all of this fancy OoO-execution hardware and make it the compiler's problem just turns out worse than having the OoO hardware

With Itanium and then NetBurst, it seemed to me that Intel had really bad tunnel vision with the overall designs. It's like they got overly focused on a couple of use cases and then designed the architecture around those.

As an example, I remember the marketing for Itanium and then NetBurst really hammered on tasks like media encoding. The chips could tear through MPEG macroblocks! Wow! Of all workloads, what percentage is tearing through MPEG encoding or other highly ordered, cache-friendly things? Most real-world code is cache-hostile, super-branchy pointer chasing.

The philosophy of making the compiler do all the instruction scheduling serves the cache-friendly predictable data stream model. The only way it can serve the real world model of code is to explode memory requirements by having the compiler emit hundreds of variants of routines and use some sort of function multi-versioning to select one appropriate for the current data shape.
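
A rough sketch of what that would look like in plain C (hypothetical names, and two variants instead of hundreds): pick a routine tuned for the data shape at hand, at run time. Modern compilers do offer something adjacent, e.g. GCC/Clang's target_clones attribute, but that versions on ISA features rather than data shape.

    #include <stddef.h>

    typedef long (*sum_fn)(const long *a, size_t n, size_t stride);

    /* Variant tuned for the cache-friendly, unit-stride case. */
    static long sum_dense(const long *a, size_t n, size_t stride) {
        (void)stride;                 /* known to be 1 for this variant */
        long s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    /* Generic fallback for everything else. */
    static long sum_strided(const long *a, size_t n, size_t stride) {
        long s = 0;
        for (size_t i = 0; i < n; i++) s += a[i * stride];
        return s;
    }

    /* "Multi-versioning" dispatch keyed on data shape, not CPU features. */
    static sum_fn pick_variant(size_t stride) {
        return stride == 1 ? sum_dense : sum_strided;
    }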

This is an unscientific observation on my part but it's how the situation seemed to me two decades ago. When Intel didn't have marketeers demanding "moar media encoding" they ended up with genuinely good chips like the Tualatin (which became the Core line).


I'm finishing up writing a (very short) book on lessons from Itanium and I don't think I really appreciated all the VLIW backstory at the time. While I sort of understand why HP got excited about EPIC, I don't really understand how they got Intel to go along ESPECIALLY given that Intel had its i860 experience and could have presumably just maintained their market position with 64-bit x86 easy peasy.


Curious to read the book when it's out. I did work on some IA-64 software some time ago before Merced taped out so I'm interested!


> You're basically encoding microarchitectural details (which operations each execution port can run, how many execution ports, etc.) into the ISA, which makes changing that microarchitecture difficult. (See also branch delay slots, which have a similar issue).

The HP/Intel people who designed the Itanium did have an answer for this one: stop bits, which let software mark which operations are independent, even across words, so that processors with more parallelism could run instructions from multiple words at once. Itanium was built around EPIC, or Explicitly Parallel Instruction Computing, conceived as a next-generation VLIW that took into account lessons learned from previous VLIW designs:

https://en.wikipedia.org/wiki/Explicitly_parallel_instructio...

> Each group of multiple software instructions is called a bundle. Each of the bundles has a stop bit indicating if this set of operations is depended upon by the subsequent bundle. With this capability, future implementations can be built to issue multiple bundles in parallel.
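
A non-IA-64 sketch of the idea in C (my example, not from the article): the first three statements are independent of one another, so a compiler can mark them as a single instruction group, while the final sum depends on all of them and has to sit in a new group after a stop. A wider future implementation is then free to issue the whole first group at once.

    int groups(int x, int y, int z) {
        int a = x + 1;      /* independent: same group, no stop needed */
        int b = y * 2;      /* independent: same group, no stop needed */
        int c = z - 3;      /* independent: same group, no stop needed */
        return a + b + c;   /* uses a, b, c: new group after the stop  */
    }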

Also:

> Several instructions have data-dependent execution time, and are very difficult to statically schedule. Dynamic scheduling can handle it much better.

They tried to handle this with prefetching and speculative loads, but, you're right, they didn't handle it well enough.
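
For the prefetching half, the closest thing you can still write in portable-ish C today is an explicit prefetch hint a few iterations ahead. This uses the real GCC/Clang __builtin_prefetch, but it's only a loose analogue of Itanium's speculative/advanced loads, not the same mechanism:

    #include <stddef.h>

    long sum_ahead(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8]);  /* hint: start pulling a future element in now */
            s += a[i];
        }
        return s;
    }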

> Static scheduling is limited by the inability to schedule around barriers, such as function calls. Dynamic scheduling can overlap in these scenarios.

Itanium had speculative execution and delayed exceptions to try to get around this. Again, though, not good enough.

Itanium was an interesting design, but it seems that VLIW plus modern superscalar techniques isn't as good as superscalar techniques alone.


Both. It lived off the assumption that a magic compiler, generating perfect assembly ideally fitted to the architecture, would eventually come out, and so the CPU could skip the whole "crap" of assigning the incoming instruction stream to its execution units. Which just didn't work in practice.

The current approach, where the CPU is basically an optimizing compiler that takes assembly in and generates uops to drive whatever execution units it has, might be wasteful silicon-wise, but... caches take far more transistors than that anyway, and it decouples the CPU design from the incoming assembly, so any improvements there are instantly visible to most existing code.


It failed on price/performance compared to x64. Even if it had performed 2x what it did, that still wouldn't have overcome the low commodity pricing of x64.

Intel figured they had control over that by not releasing a 64-bit x86. AMD ruined that for Intel.


Even without AMD, Intel would've needed to release cheap server, desktop, and mobile 64-bit CPUs at some point, with reasonable compatibility with x86 code. It may have happened only around 2010 without AMD's pressure. Itanium could never have done the job for consumer-grade machines or smaller servers. Too expensive, too exotic, and no possibility of running x86 code fast.


>Even without AMD, Intel would've needed to release cheap server, desktop, and mobile 64-bit CPUs at some point, with reasonable compatibility with x86 code.

I worked there at the time, and this was not the thinking in the company at all. The thinking was that if you needed 64-bit computing, you needed to buy an Itanium, full stop. Otherwise, 32 bits (with PAE) was all you needed. This is what Intel told all its customers. They really thought that IA64 was going to eventually replace x86 altogether.


Yes, at that time Intel believed IA64 could be the new architecture. But do you think that plan had a chance of working out? I think it would've become clear after a few years that IA64 was not a suitable platform to replace the complete CPU lineup, for example replacing the mobile Centrino CPUs with IA64. I think it would've failed in a similar way to how Apple failed to put their G5 PPC into a laptop (and then switched to Intel).

Or there would've been pressure, because there is a strong business need to run new 64-bit code side by side with old x86 code (which is very slow on Itanium).

And 32-bit + PAE might have been fine in 2000, but by 2010 the de facto 4GB limit on memory space per process would've become a huge issue.


I'm not addressing whether things would have worked in reality, I'm telling you what the corporate leadership was telling us at the time. You can claim all you want that it wasn't realistic or whatever, but this is the same corporate leadership that thought everyone would be perfectly happy to spend $$$$ on patented RAMBUS memory. Yes, they really did seem to think that IA64 was going to be everywhere eventually.

x86 code was slow on Itanic because it was just a bolt-on PentiumPro IIRC. It was only there for compatibility mode, and was never meant to be high performance. Code needing performance was going to be compiled for IA64 using the magic compiler that didn't exist yet.


When Intel announced that Itanium was going to be VLIW, the computer architects at DEC cheered. They knew full well from painful experience that VLIW was an absolute dead end.

Of course, that was before the DEC Hostile Giveaway.


Nah, it was just poorly designed. Intel promised two things for IA64: more instructions per clock and much higher clock rates. The first promise failed to materialize due to suitable compilers being much too difficult to implement, which in turn stopped investments in silicon improvements (which, in hindsight, would have been extremely challenging as well).


I'd offer a slightly modified take:

- the promised compilers are _impossible_, since we can't predict all branches and all cache misses in the _general_ case (it works better for floating-point-heavy code; see the sketch after this list),

- the failure to clock higher was IMO largely due to a ridiculously bloated and over-complicated ISA. In other words, EPIC was doomed from birth.
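
To make the branch half of the first point concrete (trivial C, my example): whether this branch is taken is a property of the input data, so no compile-time schedule covers both directions well for every input, whereas hardware prediction adapts at run time.

    #include <stddef.h>

    long count_over(const long *a, size_t n, long threshold) {
        long hits = 0;
        for (size_t i = 0; i < n; i++)
            if (a[i] > threshold)   /* taken or not depends entirely on the data */
                hits++;
        return hits;
    }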

I've written about IA-64 many times; it actually had many neat ideas, but in the end Intel yet again failed spectacularly in moving away from their archaic legacy.


> the failure to clock higher was IMO largely due to a ridiculously bloated and over-complicated ISA.

Perhaps mostly due to having too many architected registers AND making most of them rotating/part of register windows... that means more work to do per instruction. Wide superscalar => you need lots and lots of forwarding paths between the ALUs. Combine that with no out-of-order execution to spread the work out a bit (longer latencies per instruction, but largely hidden by other work) and you get hard limits on the clock speed.
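
Back-of-the-envelope on the forwarding point (my numbers, just to show the scaling): a full bypass network that can forward any of N results into either source operand of N issue slots needs on the order of 2*N^2 bypass paths, so a 6-wide machine is wiring up roughly 72 of them on a timing-critical route where a 4-wide one needs about 32. That quadratic growth is a big part of why wide in-order designs struggle to clock up.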

The ISA wasn't actually that bloated.



