

Linus Torvalds: x86 versus other architectures (2003) - api
http://www.yarchive.net/comp/linux/x86.html

======
aartur
I find it confirmed today as well, by an Intel engineer just one month ago:
[http://www.reddit.com/r/IAmA/comments/15iaet/iama_cpu_archit...](http://www.reddit.com/r/IAmA/comments/15iaet/iama_cpu_architect_and_designer_at_intel_ama/c7mpcn6)

~~~
api
Honestly it seems like the advantage boils down to two points:

(1) x86's baroque encoding is actually a data compression mechanism, making
code size relatively small. Code size killed RISC and VLIW.

(2) A _lot_ of work has been done to make x86's cache very, very good in order
to compensate for its small register file and other minor shortcomings. This
in turn resulted in the whole issue being sidestepped in a way that yielded
better overall performance.

There's also kind of a third, broader point:

(3) There are all kinds of exotic architectures that could theoretically be
faster with a very smart compiler, but that very smart compiler never
materializes. This is also an issue with arguments about high level languages
potentially being as fast as C. The mythical super-smart compiler never
materializes. On the x86 angle, x86's "hacks" made it very fast for the kind
of code real-world compilers generate, which makes it perform well on real
workloads instead of just contrived benchmarks.

Interestingly, x86 even beats vanilla ARM for code size.

http://vanshardware.com/2010/08/mirror-the-coming-war-arm-versus-x86/

But ARM has an encoding called Thumb-2 that turns the tables. Hence the
upcoming ARM/x86 war. It really seems like code size is one of the big
parameters affecting a CPU architecture's performance.

What's interesting though is this: now that code size is understood to be the
huge issue it is, could we see new experimental architectures that do radical
things in that department, such as the use of actual data compression
algorithms in instruction encodings?
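
If you want to eyeball the code-density gap yourself, one crude way is to
compile the same little function with -Os for a few targets and compare the
.text sizes. A rough sketch is below; the cross-compiler names are just the
usual Debian/Ubuntu defaults, so substitute whatever toolchains you actually
have, and obviously don't read too much into one tiny function:

    /* density.c -- a small, ordinary function; build it for several targets
     * with size optimization and compare the .text sizes reported by `size`:
     *
     *   gcc -Os -c density.c && size density.o                              # x86-64
     *   arm-linux-gnueabihf-gcc -Os -marm   -c density.c && size density.o  # ARM, fixed 32-bit encoding
     *   arm-linux-gnueabihf-gcc -Os -mthumb -c density.c && size density.o  # Thumb-2, 16/32-bit mix
     */
    unsigned long checksum(const unsigned char *buf, unsigned long len)
    {
        unsigned long sum = 0;
        for (unsigned long i = 0; i < len; i++)
            sum = sum * 31 + buf[i];   /* plain scalar loop, nothing exotic */
        return sum;
    }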

~~~
cube13
>What's interesting though is this: now that code size is understood to be the
huge issue it is, could we see new experimental architectures that do radical
things in that department, such as the use of actual data compression
algorithms in instruction encodings?

I'm not sure that's going to result in that much improvement. Linus points out
in one of the later emails that instruction fetching and decoding end up being
the major bottleneck in high-performance code, especially when the
architecture contains optimizations like out-of-order execution. Higher
compression means more cycles to decompress, and you still need cache space to
store the decoded instructions so that you don't pay the decode cost again
each time the same instructions run in a loop.

The additional prefetch hit can certainly be mitigated by additional
pipelining, but it's still more silicon that every instruction needs to go
through.

Realistically, I think the best approach may be more RAM and larger onboard
caches.
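
To make the "cache the decoded form" point concrete, here's a toy model in C.
It isn't any real microarchitecture and every number in it is invented:
decoding costs a few cycles per instruction, but a small direct-mapped cache
of already-decoded entries keyed by PC means a loop only pays that cost on its
first pass.

    #include <stdio.h>

    #define UOP_CACHE_SIZE 64   /* made-up capacity, direct-mapped by PC */
    #define DECODE_COST     3   /* made-up cycles to decode one instruction */

    static long cached_pc[UOP_CACHE_SIZE];  /* PC held by each slot, -1 = empty */

    /* Returns the decode cost paid for the instruction at 'pc'. */
    static int fetch_decode(long pc)
    {
        int slot = (int)(pc % UOP_CACHE_SIZE);
        if (cached_pc[slot] == pc)
            return 0;              /* already decoded: reuse the cached uops for free */
        cached_pc[slot] = pc;      /* decode it and remember the decoded form */
        return DECODE_COST;
    }

    int main(void)
    {
        for (int i = 0; i < UOP_CACHE_SIZE; i++)
            cached_pc[i] = -1;

        long with_cache = 0;
        /* A 16-instruction loop body executed 1000 times: only the first
         * iteration pays the decode cost; the rest hit the decoded-uop cache. */
        for (int iter = 0; iter < 1000; iter++)
            for (long pc = 0x400000; pc < 0x400000 + 16; pc++)
                with_cache += fetch_decode(pc);

        printf("decode cycles with a uop cache:       %ld\n", with_cache);
        printf("decode cycles re-decoding every time: %ld\n", 1000L * 16 * DECODE_COST);
        return 0;
    }

With these made-up numbers the loop pays 48 decode cycles instead of 48,000,
which is roughly the intuition behind uop caches and loop buffers.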

~~~
filereaper
Pardon me, I'm not a CPU architect but I have a couple of questions.

Why would prefetch and decode affect out of order execution?

Prefetch and decode are done on the frontend of the chip, so you'll have
prefetch happening into L3, L2, and L1, along with decode and u-op breakdown.
This is a matter of memory throughput: getting instructions from the bus into
the caches.

The out-of-order execution should be a matter of issuing to the backend
execution units.

It is true that the backend will be stalled if we don't issue quickly, but
that's where you have staging and prefetch on the front end side.

At least, the above is what I thought to be the case. I could be very wrong,
so please educate me if I've gotten something wrong.

Also, we do have the matter of throughput vs. completion time for an
instruction, which is basically a tradeoff. Deeper pipelines can give more
throughput but take the hit of longer completion times. Again, that might be
bad for certain tasks, but going through more silicon isn't necessarily always
bad.
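
To put some toy numbers on that tradeoff (all invented, not any real chip):
splitting a fixed amount of logic across more stages shortens the cycle time,
so peak throughput goes up, but each stage adds latch overhead and a
mispredicted branch flushes roughly a pipeline's worth of work, so the penalty
grows with depth. Quick sketch:

    #include <stdio.h>

    int main(void)
    {
        /* Every number below is invented purely for illustration. */
        const double logic_delay_ns    = 10.0;  /* total combinational logic per instruction */
        const double latch_overhead_ns = 0.2;   /* fixed cost added by each pipeline stage */
        const double mispredict_rate   = 0.10;  /* mispredicted branches per instruction */

        printf("%6s %12s %14s %16s\n", "depth", "cycle (ns)", "latency (ns)", "avg ns/instr");

        for (int depth = 5; depth <= 40; depth += 5) {
            double cycle   = logic_delay_ns / depth + latch_overhead_ns;
            double latency = cycle * depth;     /* time for one instruction to complete */
            /* Ideal throughput is one instruction per cycle; each mispredict costs
             * roughly a full pipeline refill, i.e. about 'depth' cycles. */
            double per_instr = cycle * (1.0 + mispredict_rate * depth);
            printf("%6d %12.3f %14.3f %16.3f\n", depth, cycle, latency, per_instr);
        }
        return 0;
    }

With these particular made-up numbers, the average time per instruction
improves with depth up to a point and then the mispredict penalty starts
winning, which is roughly why deeper isn't automatically better.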

Thanks.

