
X86 versus other architectures (Linus Torvalds) (2003) - tambourine_man
http://yarchive.net/comp/linux/x86.html
======
bitwize
Ask any high-level language compiler implementor; they will tell you that
x86's register-starvedness and anisotropic instruction encoding are huge
losses. A number of high-quality Lisp implementations (e.g., T) never saw the
light of day on x86 -- and never will -- because their compilers assumed a
register file of at least a sensible size (16 registers or so).

The world would have been better off if the 68k architecture had won back in
the 80s.

~~~
ANTSANTS
>they will tell you that x86's register-starvedness and anisotropic
instruction encoding are huge losses.

Eh, not sure I buy that. Since at least the Pentium, x86 implementations have
had very fast stack variable access compared to contemporary RISC
architectures, practically treating some number of bytes around the stack
pointer as registers. As Linus points out, every architecture has to confront
this problem sooner or later, because some stack access is unavoidable; the
x86 implementers just had to confront it sooner because their ISA had half or
fewer of the registers of the RISC architectures. So
while of course it's better to keep everything in registers if possible, I'm
sure that the old Lisp compilers would have performed reasonably well on x86
if they'd tried instead of throwing their hands in the air and saying "we
can't do it!"

And the "anisotropic instruction encoding" makes so much sense that ARM
adopted it (in a much simplified fashion) with Thumb-2. Fixed-width
instruction encodings waste a lot of space because the common operations need
2-3 more bytes than they ought to, and because absolute addresses and
constants above a certain size must be loaded indirectly from a nearby
location in the instruction stream (a literal pool) instead of just being
inline with the instruction. Instead of 1 byte of instruction + 4/8 bytes of
data/address,
you get 2-4 bytes of instruction including relative address + the 4-8 bytes of
data you wanted to access in the first place. All this adds up after a while,
and at some point, the waste of code cache hurts more than the more
complicated ISA.
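
To put rough numbers on that, here's a hedged sketch of mine (not from the
thread; exact byte counts depend on the addressing mode, and the fixed-width
column assumes a classic 32-bit RISC encoding like pre-Thumb ARM) for
materializing a single 32-bit constant:

    
    
        // C++ for illustration; the size comparison is in the comments.
        unsigned load_const() {
            return 0x12345678u;
            // x86:  b8 78 56 34 12  (mov eax, 0x12345678)
            //       = 5 bytes total: 1 opcode byte + 4 inline immediate bytes.
            // Fixed-width RISC:  ldr r0, [pc, #offset]  = 4 instruction bytes,
            //       plus 4 more bytes holding 0x12345678 in a literal pool
            //       placed near the code.
        }
    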

I love 68k too (which had a somewhat "anisotropic instruction encoding"
itself, by the way), but let's give credit where it's due: it took a lot of
engineering talent and confidence, in the teeth of the "RISC is the future!"
consensus, for Intel to prove that CISC chips could beat RISC chips on
performance. It might never have been possible for Motorola to overcome their
management problems and infighting to advance the 68k at the zenith of RISC
hype.

~~~
comex
> And the "anisotropic instruction encoding" makes so much sense that ARM
> adopted it (in a much simplified fashion) with Thumb-2.

And then abandoned it in their 64-bit architecture in favor of the old uniform
32-bit encoding, FWIW.

~~~
ANTSANTS
Oops. I haven't worked with ARM in a while, wasn't aware of that. I knew it
was simplified a bit (like no more funky barrel-shifter stuff), but I'm kinda
surprised they went back to a fixed-length encoding scheme when they're
already taking much greater complexity hits with the high-performance OoO
implementations.

------
Spidler
Fun to see ten years later that this approach by AMD won.

 _AMD's x86-64 approach is a lot more interesting not so much because of any
technical issues, but because AMD _can_ try to avoid the "irrelevant" part. By
having a part that _can_ potentially compete in the market against a P4, AMD
has something that is worth hoping for. Something that can make a difference._

------
FullyFunctional
For Linus things are always black and white and people who disagree with him
are "stupid". That's a shame. It's amusing to read this today because x86
compatibility is obviously of declining value; that is, his "XScale" (ARM)
prediction came true but markets do not change overnight.

I would invite some factual data on the "cold cache" issue of RISC. With the
switch to solid-state storage, is the binary load time still really an issue?
(I assume the L2 -> L1 transport can be made arbitrarily fast.)

~~~
jeffreyrogers
It's true that he's often abrasive, but he usually has his facts straight.

~~~
mwfunk
I think the abrasiveness thing is a separate issue from the black-and-white
opinions thing.

For the most part he is extremely knowledgeable within his field and has a
pragmatic outlook that I like a lot.

At the same time, anyone who presents incredibly complicated topics as black
and white rarely has their facts straight. That by itself is a huge red flag.
They're either misinformed at best or lying to you at worst.

In Linus' case, I usually give him the benefit of the doubt and assume that
his stated black-and-white opinion is a vastly simplified version of his much
more nuanced and informed opinion that he keeps to himself. Sometimes he makes
me think I'm giving him too much credit though. Abrasiveness always makes me
question someone's credibility when used in defense of a binary opinion on a
complicated topic (this applies to anyone, not just Linus).

~~~
vacri
Where is the abrasiveness in this article?

~~~
mwfunk
The post I was responding to was talking about Linus' behavior in general, not
anything specific to this article.

~~~
vacri
My point is more that the talk of Torvalds's abrasiveness is largely a strawman.
People take selected excerpts of his conversations, show them out of context,
and then a tsunami of "what a bastard!" results. Almost every time I've seen
this happen on HN, the excerpt is taken from a longer discussion, where he's
already spent some time talking to the person in question before 'turning on
them'. And then when an article like this comes along, with no offensiveness
in it, people just have to talk about his 'issue with being abusive'.

The funny thing is that here on HN, Jobs generally gets a nod for behaving
like a bastard ("because that's what's needed to make a good product"), but
Torvalds gets demonised. Even de Raadt, famed for his abrasiveness, is
accorded respect for it. It puzzles me on two fronts: Torvalds really isn't
that abrasive unless you really push him, and for some reason he doesn't get
the respect that other tech 'names' do for (supposedly) being so.

------
millstone
I remember Apple's transition from PPC to x86. The software certainly became
much faster, but debugging became more difficult. For example, on PPC, all
instructions were 4-byte aligned, so if you saw a PC that wasn't a multiple of
4, you knew right away that something had gone off the rails. Or if you wanted
to disassemble around an address, it was easy to know where to start. Not so
with x86 on either count.

Another example is that developers liked to use -fomit-frame-pointer to get an
extra register. This means that arguments were at some varying offset from the
stack pointer that only the compiler could keep track of. You'd have to hunt
for them on the stack. With PPC, they'd generally be in registers, and it was
often easy to trace their movement through the function since they'd stay in
registers.

Lastly, the way that x86 stores the return address on the stack makes
exploiting buffer overflows especially easy. PPC has a dedicated register for
the return address, which by no means gives immunity, but at least doesn't
hand the attacker the keys on a silver platter.
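
To make that concrete, here's a minimal sketch (my code, not from the original
discussion) of the classic failure mode:

    
    
        #include <string.h>
        
        // On x86, the caller's return address sits on the stack just past
        // the locals, so an unchecked copy lets the input overwrite it and
        // redirect execution when the function returns.
        void vulnerable(const char *input) {
            char buf[16];
            strcpy(buf, input);  // no bounds check: anything longer than 15
                                 // chars runs past buf toward the saved
                                 // return address
        }
    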

Linus may or may not be right about the performance issues, but I am certain
that x86's "charming oddities" made low-level debugging slower and thereby
contributed to buggier software.

~~~
Scaevolus
PPC only leaves the return address in a register for leaf functions.
Otherwise, it's saved to the stack and is exploitable just like on any other
architecture.

~~~
ANTSANTS
EDIT: I'm misread your post because I'm tired, but I'll leave this explanation
of -fomit-frame-pointer up anyway.

With most architecture+compiler combinations, a register is dedicated to
holding the frame pointer, a pointer to a fixed position in a function's stack
frame, just below the saved return address to the caller. For x86, ebp (rbp in
64-bit mode) is used for this purpose, and a stack frame looks something like
this:

    
    
      var_2 at [ebp-8] <-- top of stack, lowest address (yes, this is counter-intuitive)
      var_1 at [ebp-4]
      saved_ebp at [ebp] <-- ebp points here
      return_address at [ebp+4]
      arg_1 at [ebp+8]
      arg_2 at [ebp+12]
    

No matter how much data you push onto the stack in local variables (even
variable-length structures, e.g., with alloca or variable-length arrays in C),
the parameters are always accessible at positive offsets from ebp while the
(fixed-size) local variables are at negative offsets.

Without a frame pointer, parameters must be accessed relative to the stack
pointer instead. Disadvantages of this include:

- Both parameters and local variables end up at positive offsets relative to
esp, so it's harder to tell the two apart.

- The offsets can change throughout the function as variables are pushed and
sometimes popped from the stack, making analysis even more difficult.

- You can't allocate variable-width structures on the stack, because then the
compiler would not know at what offsets the parameters/variables would lie
(see the sketch after this list).

And it's not even necessarily a performance win, because x86 uses longer
instruction encodings for sp-relative addressing than for bp-relative
addressing.
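
To make the variable-length point concrete, here's a sketch of mine (not from
the thread): after the alloca below, esp has moved by an amount known only at
runtime, so only ebp-relative addressing keeps n and src at fixed offsets.

    
    
        #include <alloca.h>   // glibc-specific header, for illustration
        #include <string.h>
        
        int sum_scratch(int n, const int *src) {
            // esp drops by roughly n * sizeof(int) here, an amount unknown
            // at compile time; without ebp, the compiler could no longer
            // address n and src at static offsets.
            int *tmp = (int *)alloca(n * sizeof *tmp);
            memcpy(tmp, src, n * sizeof *tmp);
            int total = 0;
            for (int i = 0; i < n; i++)
                total += tmp[i];
            return total;
        }
    

(In practice, compilers quietly force a frame pointer back on for functions
like this even under -fomit-frame-pointer, which is exactly the point.)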

x86_64 and most RISC architectures don't have this problem: there are enough
spare registers that the negatives of -fomit-frame-pointer outweigh the
positives.

~~~
ANTSANTS
>I'm misread your post because I'm tired

Ok, it's officially past my bedtime.

------
JakaJancar
> by the time Itanium 2 is up at 2GHz, the P4 is apparently going to be at
> 5GHz, comfortably keeping that 25% lead.

Funny looking back...

~~~
leetNightshade
Some Japanese dude overclocked a P4 to 5GHz in 2004, and there are reports of
people getting a P4 to 8GHz in 2007. There's also this cooler that Intel made
in 2006, though I'm not sure of its success:
[http://www.bit-tech.net/news/hardware/2006/03/08/intel_liqui...](http://www.bit-tech.net/news/hardware/2006/03/08/intel_liquid_cooler/1)

------
pcwalton
It's interesting to see how the rise of ARM has vindicated Linus' point in
some ways (as it rose to success despite accruing a fair amount of the cruft
in x86) and refuted it in other ways (as the cruft in x86, notably its power
cost, seems to have had real effects on Intel's ability to make headway in the
mobile market).

~~~
DSingularity
Honestly, I think we will eventually see Intel dominating mobile markets.

The thing is, Intel are process kings. Nobody else comes close, in either the
economics or the performance. And when it comes to processor design that has
to respect many physical limits (wire delay, transistor behavior, ...), tuning
process knobs will gain you much more than tuning architectural knobs. I think
in the end process technology wins this fight.

The reason we haven't seen competitive Intel products is that they just
haven't cared about power for like ever. Their company is built for speed.
Their income is from performance-sensitive markets. All these power-sensitive
markets were pretty new and, until recently, pretty inconsequential in terms
of revenue. Their process is not tuned for this. But I guess now that all
these power-sensitive users are starting to want more performance, maybe Intel
can rely on some of its strengths.

When you have a massive company like Intel it takes time to change directions.
I honestly think by the time Broadwell or Skylake comes out you will start to
see Intel power-competitive in the low-performance segments.

~~~
csirac2
I hope they get there soon. ARM SoCs have a bad habit of running only heavily
hacked, out-of-tree, unmaintained kernels, abandoned so long ago (or,
inexplicably, launched with an already 2-year-old kernel) that they'll never
boot systemd. Yet even the weakest/slowest of Intel's Bay Trail lineup still
uses at least 80% more power than the ideal candidate ARM solution I'm
evaluating.

~~~
papaf
This is not always true. Arch support for ARM SoCs is good and you get a shiny
new kernel with systemd everywhere:

[http://archlinuxarm.org/](http://archlinuxarm.org/)

~~~
csirac2
I'm not talking about Linux distributions. Arch can only boot whatever kernel
happens to be available for a particular SoC, just like all the other distros.

Here are some TI chips; they're still pushing kernel 2.6 for some processor
families:
[http://processors.wiki.ti.com/index.php/Sitara_Linux_SDK_Sup...](http://processors.wiki.ti.com/index.php/Sitara_Linux_SDK_Supported_Platforms_and_Versions)

------
Gravityloss
He sees the big value in keeping interfaces the same. There must be a lesson
there.

~~~
minimax
Last year I saw Bjarne Stroustrup give a talk about C++ and one of the things
he pointed out was that a modern C++ compiler can still compile 40 year old C
code. He thought the backwards compatibility was a strength and was one of the
reasons C++ is so popular. It was a good talk and I wish I could find a video
of it somewhere.

~~~
mikeash
Did he mean because C++ compilers usually include C compilers as well? C++
compilers will reject a lot of good C code, mostly because it disallows
implicit casts from void *.

~~~
sjolsen
> C++ compilers will reject a lot of good C code, mostly because it disallows
> implicit casts from void *

When did code that implicitly casts from void * become "good"? Implicit
conversions from A to B should _never_ happen unless it's statically known
that A can be represented as B, and that's generally not the case when A is
type-erased, as when it's referred to through void *.

~~~
mikeash
The typical case is when you call malloc. There's a ton of C code that looks
like:

    
    
        sometype *ptr = malloc(...);
    

A C++ compiler will reject that.
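
For reference, the form C++ accepts spells the cast out (mirroring the elided
snippet above):

    
    
        sometype *ptr = (sometype *)malloc(...);
    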

~~~
sjolsen
> The typical case is when you call malloc.

And if there were a proper type-parameterized entry point to malloc (be it a
function or macro) for general-purpose contiguous-object allocation, you
wouldn't have to compromise the already weak type system with what is possibly
the single most absurd implicit conversion possible from a type system
perspective just to avoid typing out the type name three times.

I suppose it's more of a language quality issue than a program quality issue,
but reducing verbosity by undermining the type system rather than providing a
simple "allocate<T>(n)" function, especially in a language that doesn't care
about verbosity enough to provide even the thinnest of syntactic abstractions,
is just ridiculous.
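
For what it's worth, such an entry point is only a few lines; here's a sketch
(the name allocate and its shape are mine, not a standard API):

    
    
        #include <cstdlib>
        
        // Type-parameterized malloc wrapper: the type is written once, and
        // the conversion from void * happens in exactly one audited place.
        template <typename T>
        T *allocate(std::size_t n) {
            return static_cast<T *>(std::malloc(n * sizeof(T)));
        }
        
        // usage: int *xs = allocate<int>(100);
    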

~~~
mikeash
I agree completely, but none of that changes the nature of C nor the code
written in it. If you want to change the API like this, more power to you, but
don't then say that existing code still compiles fine.

~~~
sjolsen
> but don't then say that existing code still compiles fine.

Oh, I'm not disputing that a lot of _existing_ C code won't compile as C++;
I'm just nitpicking the claim that _good_ C code won't compile as C++. ;)

But to be fair, there are perfectly good, valid C programs that don't compile
as C++, void * aside.

------
GFK_of_xmaspast
Employee of company making x86-compatible chips vigorously defending x86? Wow.

------
fizixer
Please. This is not relevant almost 12 years later!

Moore's law is saturating. Benchmark numbers are more and more a function of
core count, not GHz count. It's time to embrace the RISC-y tiny speed-demon
cores (so you can have them by the dozens) over complex brainiac ones (too
much space). It's been almost 9 years since Intel introduced dual cores, and
the mainstream is still quad-core (with 6 and 8 on the fringes), while GHz is
less than what the P4 had more than 10 years ago! What is going on? If Moore's
law still held, then 9 years on from 2005 Intel should be in the 24/32/48-core
ballpark, something that is clearly not the case.

In that vein I would also state that Linux needs a major overhaul, i.e., to
shift itself from a single-core kernel to one that leverages n cores.

good read:
[http://www.lighterra.com/papers/modernmicroprocessors/](http://www.lighterra.com/papers/modernmicroprocessors/)

~~~
mikeash
Moore's Law is only about transistor density. Any relationship to clock speed
or core count is merely because of how transistor density can enable those.

If you look at transistor density, Moore's Law has held up fine so far.
There's no "if," it has held.

~~~
fizixer
So here's the transistor count from 2004 and 2014:

- 2004: Pentium 4 on a 112 mm2 die: 125M transistors [0]

- 2014: Core i7 on a 177 mm2 die: 1.4B transistors [1]

let's scale the P4 count up to a 177 mm2 die so we can make a better
comparison:

- 2004: Pentium 4 on a 177 mm2 die: ~200M transistors

- 2014: Core i7 on a 177 mm2 die: 1.4B transistors

On the surface it looks like a lot of progress in 10 years, a 7-fold increase
in transistor count. But we're talking about an exponential law, a doubling of
the count every two years. This is what it should have given us:

- 2006: 400M

- 2008: 800M

- 2010: 1.6B

- 2012: 3.2B

- 2014: 6.4B

- 2016: 12.8B ... !!!

If we consider 3 years to be the doubling time instead of two (which is
already a failure of Moore's law, which originally stated 18 months), we
should get this:

- 2007: 400M

- 2010: 800M

- 2013: 1.6B

- 2016: 3.2B

Not only were we not at 1.6 billion transistors a year ago, I don't think
Intel has made any announcement that they'll be at 3.2 billion (on a 177 mm2
die equivalent) a year and a half from today.

But who counts the transistors? If Intel itself, then I'm afraid I can't take
the number very seriously. What other measure do we have? Benchmarks, mostly.
Unfortunately the problem with benchmarks is that over the years the auxiliary
hardware improves a lot, not just the processor. So even modest processor
improvement still results in overall system speed-up thanks to the other
components. For example, in P4 days the front-side bus to RAM was a lot slower
than on the latest motherboard with the latest RAM and a Core i7. And not many
people are interested in benchmarking a P4 with the latest motherboard and
RAM, because for one thing it's very difficult to set up, and for another,
you'd only be annoying a Silicon Valley giant as a result, not a very good
marketing strategy.

I'm willing to bet that if you could make a 2004 P4 run on a modern
motherboard and RAM and compare it to a 2013 Haswell, the single-core
performance gain would be meh at best, and the multi-core gain would look
linear (even sub-linear) rather than exponential.

Moore's law has become quite a marketing gimmick over the past few years, and
the public seems to be okay with it.

interesting read:
[http://spectrum.ieee.org/semiconductors/devices/the-status-o...](http://spectrum.ieee.org/semiconductors/devices/the-status-of-moores-law-its-complicated)

[0]
[http://en.wikipedia.org/wiki/List_of_Intel_Pentium_4_micropr...](http://en.wikipedia.org/wiki/List_of_Intel_Pentium_4_microprocessors#Prescott_.2890.C2.A0nm.29_2)

[1]
[http://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Des...](http://en.wikipedia.org/wiki/Haswell_\(microarchitecture\)#Desktop_processors)

~~~
mattnewport
We've probably seen a bit of a slowdown in Moore's law over the last 10 years,
at least when it comes to Intel CPUs, but I think you're exaggerating the
extent of it.

Some benchmarks are heavily dependent on RAM latency and/or bandwidth, but
improvements in those have been slower than improvements in CPU performance;
if CPU performance is what you're interested in, then pick a benchmark that is
largely independent of main memory (a compute-bound benchmark whose working
set fits in L2). Haswell will still do a lot better than a P4 in a
compute-bound benchmark, perhaps even beating it by more than in a
memory-bound benchmark.

The other thing to consider is that over time more attention is being paid to
performance per watt than to raw FLOPS, and in that regard Haswell looks
rather better next to the P4.

~~~
fizixer
Okay, I've done some thinking/browsing and I've come up with an approximate
way to figure out (edit: relative) Moore's law progress without relying on
official transistor counts from Intel:

1 - Take this Haswell image (Intel claim: 2.6 billion in 355 mm2):

[http://images.anandtech.com/doci/8426/HSW-E%20Die%20Mapping%...](http://images.anandtech.com/doci/8426/HSW-E%20Die%20Mapping%20Hi-Res.jpg)

(source:
[http://www.anandtech.com/show/8426/the-intel-haswell-e-cpu-r...](http://www.anandtech.com/show/8426/the-intel-haswell-e-cpu-review-core-i7-5960x-i7-5930k-i7-5820k-tested))

2 - Take this Prescott 2M image (Intel claim: 169 million in 135 mm2):

[http://images.anandtech.com/reviews/cpu/intel/pentium4/6xx/p...](http://images.anandtech.com/reviews/cpu/intel/pentium4/6xx/prescott2m.jpg)

(source: [http://www.anandtech.com/show/1621/3](http://www.anandtech.com/show/1621/3))

3 - Process these images: remove extra text and resize one so that the ratio
of the areas of the two dies is 355 to 135.

4 - Extract one core from Haswell (e.g., top left one)

5 - Remove the per-core cache (e.g., top left square in the top left core)

6 - Estimate the area of the resulting core.

7 - For Prescott, remove the visible left side (which is L2 cache)

8 - Remove about 30% from the top, likely overhead of the Netburst
architecture not present in the Haswell core. (You might disagree with this
step, in which case you can do the calculation without removing it.)

9 - Estimate the area of what remains.

10 - The number from step 6 (area-6) should be smaller than the one from step
9 (area-9); the difference represents the true shrinkage over 4 technology
nodes.

11 - area-9 divided by area-6 is the factor by which the transistor count (at
equal area) increased over those 4 nodes.

This wouldn't give absolute numbers and won't corroborate Intel's quotes for
either Haswell or Prescott. But it will give the ratio, which can be compared
with the claim of going from 90nm to 22nm, i.e., 4 generations, over 8 years.

According to that claim, the factor in step 11 should be 2^4 = 16, i.e., 1600%
(my guess: it will come out way less than 1600%), even ignoring the fact that
the time difference between the two articles is 9.5 years rather than 8.

(P.S.: I don't have time atm but I'll try to do this myself)

