
People let their (understandable) hatred of Intel-the-company colour their technical judgement. Itanium was one of the more interesting architectures of its time, it fairly flew on expert-tuned assembly; I still believe we'll see a return to its ideas once the computing world finally moves on from C.

(Netburst is also unfairly maligned if you ask me; contrary to the article, enthusiasts have clocked those P4s up to 12GHz. As far as I know they're still, over a decade later, the fastest CPU for single-threaded sequential integer workloads that has ever been made; certainly the fastest x86-compatible processor for such. They're kind of the equal and opposite failure to the Itanium, ironically enough)



> People let their (understandable) hatred of Intel-the-company colour their technical judgement. Itanium was one of the more interesting architectures of its time, it fairly flew on expert-tuned assembly;

I know only a few people who maintained software for Itanium, but from their reports it was a nightmare to debug code on. To have a chance of seeing what was going on, you had to use special debug builds that disabled all explicit parallelism. Debugging optimized code was almost impossible, and user-provided crash dumps were similarly useless. Your only hope was that the issue was reproducible in debug builds or on other architectures.

Needless to say, they hated it and were happy when ia64 was finally phased out.

> once the computing world finally moves on from C.

Yeah, it moves on from C... to JavaScript. Making compilers slow and complex doesn't mix well with JIT compilation.

One thing I have to give Itanium credit for is that due to EPIC it was totally safe from the speculative execution vulnerabilities like Spectre/Meltdown/etc. That was certainly a forward-looking aspect of it.


> you would have to use special debug builds that disabled all explicit parallelism

Oh god. Let me guess, when it crashes, you get a pointer to the word with the failed instruction in ... but no elaboration on which of the 3 instructions it was? Or is it worse than that and it fails to maintain the in-order illusion?


> Making compilers slow and complex doesn't mix well with JIT compilation.

Funny, I was just thinking the opposite: Compiler-driven parallelism loses against CPU-driven parallelism because the CPU has live profiling. With a JIT the compiler can have it too.

The debugging problem on the machine-code level becomes less of an issue when most people write higher-level code too.


> it fairly flew on expert-tuned assembly

There's your problem.

Given the bajillion programs out there already, how many companies wanted to dig into assembly instead of just waiting 18-24 months for Moore's Law to speed up their software?

It's all very well and nice to have nice hardware in theory, but if you can't compile existing code to be fairly fast, then in practice you just have some expensive sand (silicon) in the shape of a square.

> People let their (understandable) hatred of Intel-the-company colour their technical judgement.

So getting back to your first statement: no they didn't. Everyone was basically all-in on Itanium. All the Unix vendors (except Sun) dropped their own architectures and steered their customers toward Intel. Microsoft released software for it.

But it seems the market didn't like what they saw, and just kept on with x86—and then amd64 came out and gave 64-bits to everyone in a mostly compatible way.


> how many companies wanted to dig into assembly instead of just waiting 18-24 months for Moore's Law to speed up their software?

The people that bought Itanium-powered servers certainly weren't replacing them every 18-24 months. At the price they paid, you were looking at 5-8 years of computing before replacement. Or more.

My employer bought a pair of the final batch of Itanium servers. To replace 10-year old ones. This was an insurance purchase. The original plan was to shift all of that workload into the cloud, but that's neither going quickly enough nor is it saving any money. If you have a workload for which Itanium does well, it does it really well.


> The people that bought Itanium-powered servers certainly weren't replacing them every 18-24 months.

I was referring to the software vendors: why would they go through the effort of optimizing their code for this new architecture when they could simply wait a little while for the "old" one to get faster via Moore's Law?


> All the Unix vendors (except Sun)

cough IBM cough

They were never going to ditch POWER... Did they ever even have an Itanium product? I know they've had x86/x86_64, all sorts of POWER variants like Cell, and god knows what else.

I did briefly work on an Itanium system at IBM, but it was an HP box.



Oh interesting, they were going to roll it into xSeries.


I imagine there will then be a great resurgence of interest once Moore's Law hits the atomic-scale wall.


I can't find any hits for a 12 GHz P4. I thought the record was around ~8 GHz (and you can push modern processors into that ballpark).

I doubt that even an 8 GHz P4 would be able to beat a lower-clocked, more modern design even on single-threaded integer workloads. The P4 had a lot of glass jaws (the non-constant shifter, load replays on misses, a very narrow decoder when running out of trace cache).


I've heard about 8+ GHz Celerons (Netburst-based ones) and they were definitely on top a few years ago. I haven't kept track lately, though, and those records may have been beaten by now.


https://valid.x86.fr/records.html

I think that's still pretty much the bible for frequency records.


That is crazy fascinating. It seems Windows XP, Celerons, and AMD FX chips with 2 to 4 gigs of RAM are where it's at.


> the computing world finally moves on from C.

The computing world has moved on from C, mostly. To Javascript. The main impact of that seems to be a couple of numeric conversion instructions on ARM?

(OK, not entirely fair: the computationally heavy stuff has moved away to GPUs. But if you ask the question for every button press a human makes on a computer where the dominant execution time is you might have some interesting answers, and for a lot of them it is going to be JITted Javascript)

I think it's fairly clear that for general purposes VLIW is not what either the programmer or the compiler writer wants to deal with. In-order execution is such a convenient mental model that people are willing to accept any tricks that keep it working.


The numeric conversion instructions you're thinking of are branded "JavaScript", but actually exist to emulate Intel x86 floating-point behavior. It just so happens that the ECMA specs call for said behavior because existing code relied upon it.
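
For context, the behavior the spec requires is roughly ECMAScript's ToInt32: truncate toward zero, then wrap modulo 2^32 into the signed 32-bit range. Here's a rough C sketch of those semantics (an illustration of what the instruction, presumably FJCVTZS, has to compute, not of how the hardware does it):

    #include <math.h>
    #include <stdint.h>

    /* Rough sketch of ECMAScript's ToInt32: NaN and infinities map to 0;
       otherwise truncate toward zero, reduce modulo 2^32, and map the
       result into the signed 32-bit range. */
    int32_t to_int32(double x) {
        if (!isfinite(x))
            return 0;
        double t = trunc(x);                /* round toward zero   */
        double m = fmod(t, 4294967296.0);   /* in (-2^32, 2^32)    */
        if (m < 0)
            m += 4294967296.0;              /* non-negative modulo */
        if (m >= 2147483648.0)
            m -= 4294967296.0;              /* wrap into int32 range */
        return (int32_t)m;
    }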


Nonsense. A ton of code is still written in C/C++. What do you think runs all of that Javascript?

The C world isn't moving to Javascript, it's moving to Rust, Zig and Go.


Kinda veering a bit off topic, but I've always seen Go marketed as a systems language alongside C, Rust, etc., yet in practice I've only really ever seen it used to develop high-level web applications.


Docker and k8s, flannel, etc. are all written in Go and are something I'd consider "systems programming" - I mean, they have to do some pretty complex coordination with the kernel to do their work.


My understanding is that the problem with VLIW is that it exposes too much. Anything you expose via the instruction set becomes fixed permanently, so if you have say 4X wide VLIW there is no way to ever make it wider or change how things like dispatching work. The only way to do that would be to start pipelining and scheduling VLIW chunks, in which case you are back where you started.

Instruction level parallelism achieved by decoding a single stream and then sorting and scheduling requires more silicon and a bit more power, but low power high performance superscalar chips like modern ARM64 CPUs have shown that the cost is not that high and that you can go very wide. The M1's Firestorm cores are 8X wide from what I read, which is better than Itanium.

Since the whole superscalar architecture is hidden, it can evolve freely.

That being said I don't think VLIW was a horrible idea at the time, and it might still have a chance if it were revived in specialized high performance or ultra-low-power use cases. The mistake was betting the farm on it.

The other big thing we learned since then is that the important part of RISC wasn't the reduced instruction set size, but the uniform instruction size and encoding. That allows you to decode arbitrarily wide chunks of instructions in parallel without crazy brute-force hacks like those required to do parallel decoding of the variable-length x86 instruction stream. The problem with CISC isn't how many instructions there are, but the complexity of the encoding and the presence of a lot of confounding requirements that arise from instructions that do very different things at once (e.g. complex math with memory operands). You want the instruction stream to be trivial to decode and easy to schedule.

In the end the best approach seems to be a simple general purpose instruction set augmented with special instructions for common special cases that can be greatly accelerated this way (e.g. vector operations, floating point, cryptography, etc.), and all with a logical fixed length encoding that is easy to decode in parallel. Load-store architecture and a relaxed memory ordering model seem to also be performance wins since separation of concerns simplifies the scheduler. The future (for conventional CPUs) looks a lot like ARM64 and RISC-V.


My professor in college back in 1997 was doing research on maintaining binary compatibility between different generations of the same VLIW architecture, for cases where you had a different number of execution units and things like that. He had a few ideas. One was preprocessing the compiled binaries and rewriting them; another was having flags in the architecture for which generation of chip a binary targeted, so the OS could make on-the-fly changes.


In a world where all software is JITted (Java gang rise up), the fixedness of a VLIW ISA doesn't matter, because you always compile specifically for the target machine anyway. What you describe sounds like applying that strength of JITting to AOT-compiled code.

Vaguely related ideas from the distant past are ANDF:

https://en.wikipedia.org/wiki/Architecture_Neutral_Distribut...

And TaOS's VP Code:

https://sites.google.com/site/dicknewsite/home/computing/byt...


This is what IBM did with IBM i / AS/400 / System/38 and https://en.wikipedia.org/wiki/IBM_i#TIMI.

IBM i is on a POWER CPU today, but can still run System/38 binaries from the 70s, thanks to install-time compilation to whatever CPU the system is running this decade.


What types of languages do you see would enable more efficient VLIW compilers?

From my limited perspective, I find C one of the easier languages to write optimizing compilers for, and would therefore expect optimizing compilers to be the most efficient there. 40 years of collective experience of optimizing for C-like languages also helps of course.

Or is it the lack of explicit parallelism in the language that is limiting? Somehow I suspect the limited uptake of better-suited languages is a sign that they aren't very helpful most of the time, and that most of the parallel work people do is more like serving a lot of individually sequential transactions per second, which is something C and Unix are pretty good at.


C is hard to optimise. Graydon Hoare gave a nice introductory compilers talk that went into some of the reasons why:

http://venge.net/graydon/talks/CompilerTalk-2019.pdf


One well known obstacle to optimizing C is the difficulty of alias analysis. It's easier to do that for languages that don't have C's unrestricted pointers.
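
A minimal sketch of the problem, with a hypothetical function name: without knowing whether the two pointers can overlap, the compiler has to assume every store through dst might change what src points to, which blocks hoisting loads past stores, reordering, and vectorization.

    /* Hypothetical example: the compiler must assume dst and src may
       overlap, so it cannot batch loads from src ahead of the stores
       to dst, nor safely vectorize the loop. */
    void scale(float *dst, float *src, float k, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }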


Doesn't the restrict keyword solve this?


The paper Why Programmer-specified Aliasing is a Bad Idea[0] evaluated the effectiveness of restrict in 2004. They found that adding optimal restrict annotations provided only a minor performance improvement, on average less than 1% across the SPEC2000 benchmarks.

[0] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94....


How much of this is because nobody puts effort into these optimizations?

The Rust compiler has repeatedly found critical bugs in LLVM's restrict / noalias support, bugs that would impact C / C++ as well if any real-world C / C++ programs actually used it.

If compilers produce straight-up broken code in these situations, I can only imagine they're not putting a lot of effort into these optimization strategies.


> How much of this is because nobody puts effort into these optimizations?

restrict is rare in C and C++ but common in Fortran; array parameters in Fortran aren't allowed to alias. Intel and IBM both have great Fortran compilers so I would expect their C and C++ compilers to have good support for restrict.


I don't think anyone has ever used the restrict keyword, or understood what it does.


What? I use it whenever I have a function that takes two or more pointers if I know they can't refer to overlapping memory. And it's part of the signature for memcpy since C99
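
For reference, the C99 prototype plus a hypothetical copy loop; the restrict qualifiers are a promise from the programmer that the ranges don't overlap, which is what lets the compiler reorder loads and stores or vectorize freely:

    #include <stddef.h>

    /* C99 declares memcpy with restrict-qualified parameters: */
    void *memcpy(void * restrict dest, const void * restrict src, size_t n);

    /* Hypothetical example: the restrict qualifiers assert the ranges
       don't overlap, so the compiler may load ahead and vectorize. */
    void copy_ints(int * restrict dst, const int * restrict src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }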


When he said "anyone" he meant "almost anyone". You're an outlier if you use `restrict` regularly.

I checked on grep.app: there are 10k results for `restrict` in C code, compared to 700k for `struct` (I know they're not directly comparable, but that gives an idea).


That seems like about the right proportion to me - struct solves a much more common problem than restrict does. And to be fair, it is a lesser known feature. But user-the-name is implying that restrict is somehow difficult to use or understand, which I don't agree with at all.


That probably explains it. In many shops, C may as well have stopped at C89.

I've been working with C since the early 90's. I've never seen any code use restrict.


Also large chunks of libc use it as well, e.g. the printf family of functions.


derp.


So Rust comes to mind, right? Anything else?


FORTRAN or any language with lots of arrays and matrices.


I suspect if you want a HW architecture for running array operations you'll end up with something like a vector machine (e.g. ARM SVE(2)) or a GPU rather than a VLIW CPU?


VLIW is basically a more flexible kind of vector machine.


And a traditional scalar architecture is more flexible still. The trick is to pick the correct set of tradeoffs for the targeted applications. I claim that for most array style workloads vector/GPU architectures are flexible enough, and offer better perf/watt and perf/chip area.


So, APL?


Absolutely yes that would make sense (or more likely the modern "derivatives" like J & K)


C is very "pointer heavy", and much code involves chasing linked lists and the like. This tends not to suit VLIW well.
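
A hypothetical illustration of the pointer-chasing problem: in a list walk, every load depends on the result of the previous one, so a compiler scheduling statically (as VLIW requires) has no independent work to pack into the wide bundles.

    /* Hypothetical example: each node pointer is only known after the
       previous load completes, leaving a serial chain of dependent loads
       that a static scheduler cannot overlap. */
    struct node { struct node *next; int value; };

    int sum_list(const struct node *n) {
        int sum = 0;
        while (n) {
            sum += n->value;   /* depends on the load of n      */
            n = n->next;       /* next load depends on this one */
        }
        return sum;
    }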

Modern languages like Rust tend to produce more instructions for the same high-level logic, but those instructions are easier to schedule for superscalar CPUs. It typically ends up as a bit of a wash on CISC processors, but could be better than C/C++ on VLIW.

I guess we'll never know now...


C is pointer heavy if you write pointer heavy code


> What types of languages do you see would enable more efficient VLIW compilers?

I was thinking of languages where dependencies are more explicit and the idea of a global evaluation order isn't there in the first place. I'd be very interested to see a reduceron-style effort that implemented graph-reduction evaluation on a VLIW processor.

> Somehow I suspect the limited uptake of better suited languages to be a sign that they aren't very helpful most of the time, and most of the parallel operations people do is more like serving a lot of individually sequential transactions per second, which is something C and unix is pretty good at.

Heh, that was the idea that those barrel-processor SPARCs were designed around. But they weren't so successful in the market either in the end.


The TMS320C6678 (C66x architecture) DSPs still use VLIW and work pretty well. Like most DSPs, they're typically programmed using C, for which TI supplies optimised libraries for processor-intensive operations. IIRC, the compiler itself was fairly standard.


I had a netburst P4 for a while.

MATLAB simulations were comically faster on my lower clocked Pentium M laptop.


I malign Netburst because I owned one (actually still own it) and it was slower than the previous generation of processors (under certain loads) despite costing more.


> I still believe we'll see a return to its ideas once the computing world finally moves on from C.

If only for the reason that Itanium was one of the few architectures not affected by Spectre-family attacks.


The later ones are out of order, so they are very likely affected; no one cares enough to prove it.


Going "faster" by doubling pipeline stages doesn't gain anything but fat bonuses for marketing.



