The first of Intel's many expensive lessons about the problems with extremely complicated ISAs dependent on even more sophisticated compilers making good static decisions for performance.
Then they did it again with the i860.
Then they did it again with Itanium.
iAPX 432 was sort of a different failure from i860 and Itanium, no? My understanding is that the iAPX 432 architecture provided object-oriented instructions that turned out to be slow in practice, and the compiler didn't know how slow they were, so it abused them in situations where it should have used scalar ops instead; on top of that, the ABI relied too heavily on pass-by-value. Basically, the iAPX was explained to compiler authors as an object-oriented CPU when it should have been treated as a CPU with object-oriented extensions.
Whereas i860 and Itanium were just trying to shoehorn VLIW into general-purpose computing, which is incredibly challenging. VLIW is great in places like DSP, where you have a well-defined real-time stream of data and limited context switching. There, you can put the spare die space you didn't spend on dispatch, prediction, and retirement toward more MACs or ALUs or vector units, and the compiler can accurately predict the latency of a given operation because the data source is well defined. Fundamentally, compiler scheduling is intractable in a multi-user or task-switching environment, because you have _no idea_ what will be in cache ahead of runtime, so you always end up with the i860/Itanium problem: you stall your entire execution pipeline every time you miss cache unexpectedly.
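To make the DSP-vs-general-purpose contrast concrete, here's a minimal C sketch (hypothetical kernels of my own, not anyone's real code): the FIR loop does loads whose latency a VLIW compiler can schedule around statically, while the linked-list walk does loads whose latency depends entirely on what happens to be in cache at that moment.

```c
#include <stddef.h>

/* DSP-style FIR kernel: every load hits a small, predictable buffer, so a
 * VLIW compiler can statically pack the loads and multiply-accumulates into
 * wide instruction words with known latencies. (Assumes n >= taps - 1.) */
float fir(const float *x, const float *h, size_t n, size_t taps)
{
    float acc = 0.0f;
    for (size_t i = 0; i < taps; i++)
        acc += h[i] * x[n - i];        /* independent MACs, easy to schedule */
    return acc;
}

/* General-purpose pointer chase: whether each load takes a few cycles (L1
 * hit) or hundreds of cycles (DRAM) depends on what else has been running,
 * so no static schedule can hide the latency -- the whole bundle stalls on
 * a miss. */
struct node { struct node *next; int payload; };

int sum_list(const struct node *p)
{
    int sum = 0;
    while (p) {
        sum += p->payload;             /* serialized on the previous load */
        p = p->next;
    }
    return sum;
}
```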
Have we (finally) realized the dream by basically putting the "smart" part of the compiler in the chip itself, or do we still run relatively simple ISAs?
I argue about this a lot. Some reasonably substantiated opinions:
1. Highly sophisticated large-scale static analysis keeps getting beaten by relatively stupid tricks built into overgrown instruction decoders, working on relatively narrow windows of instructions.
2. The primary reason for (1) is that performance is now almost completely dominated by memory behavior, and making good static predictions about the dynamic behavior of fancy memory systems in the face of multitasking, DRAM refresh cycles, multiple independent devices competing for the memory bus, layers of caches, timing variations, etc. is essentially impossible (the sketch after this list tries to make that concrete).
3. You can give up on a bunch of your dynamic tricks and build much simpler, more predictable systems that can be statically optimized effectively. You could probably find a good local maximum in that style. The dynamic tricks are, however, unreasonably effective for performance, and have the advantage that they let the same binaries perform well on multiple different implementations of an ISA. That's not insurmountable (e.g. AOT compilation for ART on Android), but the ecosystem isn't fully set up to support that kind of thing.
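To put a rough number on (2), here's a minimal, self-contained C microbenchmark sketch (the 64 MB working set and timing details are my own assumptions): both loops issue the same loads and adds, but the shuffled traversal typically runs many times slower purely because of cache behavior that no static schedule can see ahead of time.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                      /* ~64 MB of ints: far bigger than cache */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    int    *data = malloc(N * sizeof *data);
    size_t *idx  = malloc(N * sizeof *idx);
    for (size_t i = 0; i < N; i++) { data[i] = 1; idx[i] = i; }

    /* Crude Fisher-Yates shuffle of the index array */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }

    long sum = 0;
    double t0 = seconds();
    for (size_t i = 0; i < N; i++) sum += data[i];        /* sequential: prefetcher-friendly */
    double t1 = seconds();
    for (size_t i = 0; i < N; i++) sum += data[idx[i]];   /* shuffled: mostly cache misses */
    double t2 = seconds();

    printf("sum=%ld sequential=%.3fs shuffled=%.3fs\n", sum, t1 - t0, t2 - t1);
    free(data); free(idx);
    return 0;
}
```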
Note that AOT compilation on Android is mixed with JIT compilation and driven by PGO metadata, and the generated AOT binary covers only a subset of the application.
Changes in execution flow, or app updates, invalidate the generated binary, and the cycle starts again: assembly-based interpreter, JIT, gathering PGO metadata, and finally a new AOT compilation when the device is idle.
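For anyone unfamiliar with that flow, here's a toy C model of the lifecycle described above (my own simplification, not ART's actual implementation): methods start interpreted, hot ones get JIT-compiled while profile data accumulates, the profile drives AOT compilation when the device is idle, and an update throws the AOT artifacts away and restarts the cycle.

```c
#include <stdio.h>

enum tier { INTERPRETED, JITTED, AOT_COMPILED };

struct method {
    const char *name;
    enum tier   tier;
    int         call_count;       /* stands in for the PGO metadata */
};

static void call(struct method *m)
{
    m->call_count++;
    if (m->tier == INTERPRETED && m->call_count > 100)
        m->tier = JITTED;         /* hot enough: JIT it, keep profiling */
}

static void device_idle(struct method *m)
{
    if (m->tier == JITTED)
        m->tier = AOT_COMPILED;   /* profile-guided AOT of hot methods only */
}

static void app_updated(struct method *m)
{
    m->tier = INTERPRETED;        /* AOT artifacts invalidated; cycle restarts */
    m->call_count = 0;
}

int main(void)
{
    struct method m = { "onDraw", INTERPRETED, 0 };
    for (int i = 0; i < 200; i++) call(&m);
    device_idle(&m);
    printf("tier after idle:   %d\n", m.tier);  /* AOT_COMPILED */
    app_updated(&m);
    printf("tier after update: %d\n", m.tier);  /* INTERPRETED  */
    return 0;
}
```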
It's only an HN comment, and I don't see why it honestly matters. At the end of the day, more people will see his tweet and learn about these failed architectures than will ever read some random comment on some random HN post. Significantly more people read Twitter than HN.
The way you're reacting to this is like it's 2007 and he stole the blueprints to the iPhone.