Ask any high-level language compiler implementor; they will tell you that x86's register-starvedness and anisotropic instruction encoding are huge losses. A number of high-quality Lisp implementations (e.g., T) never saw the light of day on x86 -- and never will -- because their compilers assumed a register file of at least a sensible size (16 or so).
The world would have been better off if the 68k architecture had won back in the 80s.
>they will tell you that x86's register-starvedness and anisotropic instruction encoding are huge losses.
Eh, not sure I buy that. Since at least the Pentium, x86 implementations have had very fast stack-variable access compared to contemporary RISC architectures, practically treating some number of bytes around the stack pointer as registers. As Linus points out, every architecture has to confront this problem sooner or later because some stack access is unavoidable; the x86 implementers just had to confront it sooner because their ISA had half or fewer of the registers of the RISC architectures. So while of course it's better to keep everything in registers if possible, I'm sure that the old Lisp compilers would have performed reasonably well on x86 if they'd tried instead of throwing their hands in the air and saying "we can't do it!"
And the "anisotropic instruction encoding" makes so much sense that ARM adopted it (in a much simplified fashion) with Thumb-2. Fixed-width instruction encodings waste a lot of space, because the common operations need 2-3 more bytes than they ought to, and because absolute addresses and constants above a certain size must be loaded indirectly from a literal pool near the instruction stream instead of just sitting inline with the instruction. Instead of 1 byte of instruction + 4/8 bytes of data/address, you get 2-4 bytes of instruction including a relative address + the 4-8 bytes of data you wanted to access in the first place. All this adds up after a while, and at some point the waste of code cache hurts more than the more complicated ISA does.
I love 68k too (which had a somewhat "anisotropic instruction encoding" itself, by the way), but let's give credit where it's due: it took a lot of engineering talent and confidence for Intel to prove, at the height of the "RISC is the future!" consensus, that CISC chips could beat RISC chips on performance. It might never have been possible for Motorola to overcome their management problems and infighting to keep advancing the 68k at the zenith of the RISC hype.
Oops. I haven't worked with ARM in a while, wasn't aware of that. I knew it was simplified a bit (like no more funky barrel-shifter stuff), but I'm kinda surprised they went back to a fixed-length encoding scheme when they're already taking much greater complexity hits with the high performance OoO implementations.
I don't think a large number of architectural registers is necessary - look at what people have been able to do with the highly-constrained 6502 in the C64, for example. If anything, it's the compilers and compiler writers that need to be more intelligent about how they approach the problem, since humans have been better at writing Asm than compilers for... ever since compilers existed.
I think MIPS is a good example of how trying to make a "dumb and simple" architecture to cater to compiler writers, with the promise of cheap, tiny, and fast CPUs that required far less effort to design for the same performance, didn't work out so well:
On the 2nd page of that article are the interesting results, and there's also this odd sentence: "Loongson also struggles, possibly due to poor compiler optimization."
...but isn't MIPS supposed to be one of the most compiler-friendly ISAs, with a large, orthogonal register file (32 registers!) and a uniform instruction format? In many compiler courses it is the "model architecture" of choice for its easy code generation and "better performance", but in the real world its only real advantage seems to be being small and cheap -- and ARM, which is a bit more complex, is not far behind it there but surpasses it easily in efficiency.
Not sure which architecture you're looking at, but my x86 machines all have 16 symmetric GPRs. I'm a little confused at the idea of a high-quality Lisp implementation refusing to build in 64 bit mode.
And I'm especially confused at the idea of an architecture with explicitly asymmetric registers being so much of an improvement that "the world would have been better off". I can see that argument being made over MIPS, I guess. But not 68000...
I don't think this is true, for two reasons. The first is that register allocation algorithms have gotten really good at splitting live ranges and producing near-optimal spill code. The second is that most of the time the penalty you take is L1_ACCESS_TIME - REG_ACCESS_TIME + push time + pop time. On a modern OoO machine with modern L1 caches and optimized HW prefetching, this is almost 0. Seriously.
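To make the spill-cost argument concrete, here's a small C sketch (hypothetical code, for illustration only; which values actually spill is up to the compiler):

    /* More simultaneously-live values than 32-bit x86 has general-purpose
       registers, so the allocator must park some of them in stack slots. */
    long many_live(long a, long b, long c, long d,
                   long e, long f, long g, long h)
    {
        long t1 = a * b, t2 = c * d, t3 = e * f, t4 = g * h;
        /* a..h and t1..t4 are all still live here: twelve values against
           roughly seven usable registers.  The spilled ones get stored to
           the stack and reloaded below; on a modern OoO core that reload
           typically hits the store buffer or L1, which is the "almost 0"
           penalty described above. */
        return (t1 + t2 + t3 + t4) + (a + b + c + d + e + f + g + h);
    }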
'68k is easier to write a compiler for' and 'writing a good compiler for x86 is very hard, but if you have one, x86 is faster' need not rule each other out. Maybe both are true.
Maybe the denser x86 instruction set helps more than its complex instruction decoder hurts. Now that CPUs with a billion transistors are 'small', that decoder doesn't look that large anymore.
Did you actually read any of the linked messages in which Linus makes a strong case for why this oft-repeated claim about x86 turned out not to be a real problem in practice? If you're going to disagree with what Linus is saying, can you explain why you think he's wrong?
The 68k address/data register split is ugly, and the double-indirect addressing modes probably cause more difficulties for a fast implementation than any x86 quirk (though this can be cured with R&D money).
Fun to see ten years later that this approach by AMD won.
> AMD's x86-64 approach is a lot more interesting not so much because of any technical issues, but because AMD _can_ try to avoid the "irrelevant" part. By having a part that _can_ potentially compete in the market against a P4, AMD has something that is worth hoping for. Something that can make a difference.
For Linus things are always black and white and people who disagree with him are "stupid". That's a shame. It's amusing to read this today because x86 compatibility is obviously of declining value; that is, his "XScale" (ARM) prediction came true but markets do not change overnight.
I would invite some factual data on the "cold cache" issue of RISC. With the switch to solid state storage, is the binary load time still really an issue? (I assume the L2 -> L1 transport can be made arbitrarily fast).
It's ironic that the people who call Torvalds black-and-white are only seeing him in black and white themselves. That page is a long, well-written back-and-forth over different aspects of several architectures, and he has issues with all of them (hardly black-and-white).
If you search for 'stupid', there's only one place where it's directed at people rather than technical issues, and then it's about a 'stupid question' rather than a stupid person. And then he goes on to describe why he thinks it's stupid - because the question implies no-one had tried before, when there had actually been a lot of effort for no gain.
TLDR: You're using Torvalds as a punching bag. Which is ironic, given that you're calling him out for using others as a punching bag.
I think the abrasiveness thing is a separate issue from the black-and-white opinions thing.
For the most part he is extremely knowledgeable within his field and has a pragmatic outlook that I like a lot.
At the same time, anyone who presents incredibly complicated topics as black and white rarely has their facts straight. That by itself is a huge red flag. They're either misinformed at best or lying to you at worst.
In Linus' case, I usually give him the benefit of the doubt and assume that his stated black-and-white opinion is a vastly simplified version of his much more nuanced and informed opinion that he keeps to himself. Sometimes he makes me think I'm giving him too much credit though. Abrasiveness always makes me question someone's credibility when used in defense of a binary opinion on a complicated topic (this applies to anyone, not just Linus).
My point is more that the talk of Torvalds's abrasiveness is largely a strawman. People take selected excerpts of his conversations, show them out of context, and then a tsunami of "what a bastard!" results. Almost every time I've seen this happen on HN, the excerpt is taken from a longer discussion, where he's already spent some time talking to the person in question before 'turning on them'. And then when an article like this comes along, with no offensiveness in it, people just have to talk about his 'issue with being abusive'.
The funny thing is that here on HN, Jobs generally gets a nod for behaving like a bastard ("because that's what's needed to make a good product"), but Torvalds gets demonised. Even de Raadt, famed for his abrasiveness, is accorded respect for it. It puzzles me on two fronts: Torvalds really isn't that abrasive unless you really push him, and for some reason he doesn't get the respect that other tech 'names' do for (supposedly) being so.
The instruction cache gets cold every few hundred ms, when processes are rescheduled because of a tick, or trying to do I/O. Disk speed never had much relevance here.
I remember Apple's transition from PPC to x86. The software certainly became much faster, but debugging became more difficult. For example, on PPC, all instructions were 4-byte aligned, so if you saw a PC that wasn't a multiple of 4, you knew right away that something had gone off the rails. Or if you wanted to disassemble around an address, it was easy to know where to start. Not so with x86 on either count.
Another example is that developers liked to use -fomit-frame-pointer to get an extra register. This means that arguments were at some varying offset from the stack pointer that only the compiler could keep track of. You'd have to hunt for them on the stack. With PPC, they'd generally be in registers, and it was often easy to trace their movement through the function since they'd be kept in registers.
Lastly, the way that x86 stores the return address on the stack makes exploiting buffer overflows especially easy. PPC has a dedicated register for the return address, which by no means gives immunity, but at least doesn't hand the attacker the keys on a silver platter.
Linus may or may not be right about the performance issues, but I am certain that x86's "charming oddities" made low-level debugging slower and thereby contributed to buggier software.
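To make that last point concrete, here is a deliberately unsafe toy program (illustrative only, not anyone's real code). On x86 the saved return address sits on the stack just past the locals, so a plain overflow of buf reaches it directly:

    #include <string.h>

    /* Deliberately unsafe: the x86 frame for greet() is roughly
       [buf (16 bytes)] [saved ebp] [return address], so a long enough
       'name' overwrites the address that "ret" will jump to.
       (Compilers mitigate this nowadays, e.g. with -fstack-protector.) */
    static void greet(const char *name)
    {
        char buf[16];
        strcpy(buf, name);      /* no bounds check, on purpose */
    }

    int main(int argc, char **argv)
    {
        if (argc > 1)
            greet(argv[1]);
        return 0;
    }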
PPC only leaves the return address in a register for leaf functions. Otherwise, it's saved to the stack and exploitable just like any other architecture.
EDIT: I misread your post because I'm tired, but I'll leave this explanation of -fomit-frame-pointer up anyway.
With most architecture+compiler combinations, a register is dedicated to holding the frame pointer, a pointer to a fixed position in a function's stack frame, sitting just below the return address to the caller. For x86, ebp (rbp in 64-bit mode) is used for this purpose, and a stack frame looks something like this:
var_2          at [ebp-8]    <-- top of stack, lowest address (yes, this is counter-intuitive)
var_1          at [ebp-4]
saved ebp      at [ebp]      <-- ebp points here
return_address at [ebp+4]
arg_1          at [ebp+8]
arg_2          at [ebp+12]
No matter how much data you push onto the stack in local variables (even variable-length allocations, e.g. with alloca or variable-length arrays in C), the parameters are always accessible at positive offsets from ebp while the (fixed-size) local variables are at negative offsets.
Without a frame pointer, parameters must be accessed relative to the stack pointer instead. Disadvantages of this include:
Both parameters and local variables are at positive offsets in comparison to esp, so it's harder to tell the two apart.
The offsets can change throughout the function as variables are pushed and sometimes popped from the stack, making analysis even more difficult.
You can't allocate variable width structures on the stack because then the compiler would not know at what offsets the parameters/variables would lie.
And it's not even necessarily a performance win either, because x86 uses longer instruction encodings for sp-relative addressing vs. bp-relative addressing.
x86_64 and most RISC architectures don't have this problem: there are enough spare registers that the negatives of -fomit-frame-pointer outweigh the positives.
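A tiny C sketch of the difference (illustrative only; exact offsets are the compiler's choice):

    /* Built as 32-bit code at -O0 with a frame pointer, the arguments sit
       at fixed positive offsets from ebp and the local at a fixed negative
       one:
           a -> [ebp+8],   b -> [ebp+12],   t -> [ebp-4]
       With -fomit-frame-pointer they all become esp-relative instead, and
       those offsets shift whenever the compiler adjusts esp. */
    int sum_scaled(int a, int b)
    {
        int t = a * 2;
        return t + b;
    }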
That's right. But it's more difficult to exploit, because you have to smash the caller's return address, not your own, and then wait for the caller to return. Anyways, my point is merely that if you wanted to design an ISA that maximizes exploitability of buffer overflows, you'd be hard pressed to do better than x86.
Some Japanese dude overclocked a P4 to 5GHz in 2004, and there are reports of people getting a P4 to 8GHz in 2007. There's also this cooler that Intel made in 2006, though I'm not sure of its success: http://www.bit-tech.net/news/hardware/2006/03/08/intel_liqui...
It's interesting to see how the rise of ARM has vindicated Linus' point in some ways (as it rose to success despite accruing a fair amount of the cruft in x86) and refuted it in other ways (as the effects of the cruft in x86 -- notably power consumption -- seem to have had real effects on Intel's ability to make headway in the mobile market).
Honestly, I think in the end we will see Intel dominating mobile markets.
The thing is, Intel are process kings. Nobody else comes close, on either the economics or the performance. And when it comes to processor design that has to respect many physical limits (wire delay, transistor behavior, ...), tuning process knobs will gain you much more than tuning architectural knobs. I think in the end process technology wins this fight.
The reason we haven't seen competitive Intel products is that they just haven't cared about power for, like, ever. Their company is built for speed; their income is from performance-sensitive markets. All these power-sensitive markets were pretty new and, until recently, pretty inconsequential in terms of revenue. Their process is not tuned for this. But I guess now that all these power-sensitive users are starting to want more performance, maybe Intel can rely on some of its strengths.
When you have a massive company like Intel it takes time to change directions. I honestly think by the time Broadwell or Skylake comes out you will start to see Intel power-competitive in the low performance segments.
You make a good argument, but you forget one thing - app compatibility. There are two dominant platforms, and they're both based on ARM. "Competitive" processors won't be enough to make a difference - even clearly superior processors aren't going to be compelling.
Apple has shown a willingness to switch horses in the past, but now that they're designing their own chips I think we've seen the last of it.
"You make a good argument, but you forget one thing - app compatibility. There are two dominant platforms, and they're both based on ARM."
Android can be compiled for x86 (and in fact there are even x86 gadgets deployed with it on the market¹), and Android applications don't run native. So x86 Android doesn't face the compatibility issues that other architectures did when they were trying to compete with Intel's desktop monopoly.
I hope they get there soon. ARM SoCs have a bad habit of running only heavily hacked, out-of-tree, unmaintained kernels abandoned so long ago (or inexplicably they launch with an already 2-year-old kernel) that they'll never boot systemd. Yet even the weakest/slowest of Intel's Bay Trail lineup still uses at least 80% more power than the ideal candidate ARM solution I'm evaluating.
I'm not talking about linux distributions. Arch can only boot whatever kernel happens to be available for a particular SoC just like all the other distros.
It's based around a Cortex-A9. Admittedly the modules presented with Bay Trail had a few features the ARM solution didn't, so it wasn't apples-to-apples.
Last year I saw Bjarne Stroustrup give a talk about C++ and one of the things he pointed out was that a modern C++ compiler can still compile 40 year old C code. He thought the backwards compatibility was a strength and was one of the reasons C++ is so popular. It was a good talk and I wish I could find a video of it somewhere.
Did he mean because C++ compilers usually include C compilers as well? C++ compilers will reject a lot of good C code, mostly because C++ disallows implicit conversions from void *.
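For a concrete two-liner (the classic malloc idiom):

    #include <stdlib.h>

    int main(void)
    {
        /* Idiomatic C: malloc returns void *, which converts implicitly
           to int *.  A C++ compiler rejects this line unless you add an
           explicit (int *) cast. */
        int *p = malloc(16 * sizeof *p);
        free(p);
        return 0;
    }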
It's much more likely that I am misquoting what Bjarne Stroustrup said than that Bjarne Stroustrup doesn't know what is or isn't valid C++. Anyway I found the slide deck from the talk. He was making this point while discussing the slide on page 7.
> C++ compilers will reject a lot of good C code, mostly because it disallows implicit casts from void *
When did code that implicitly casts from void * become "good?" Implicit conversions from A to B should never happen unless it's statically known that A can be represented as B, and that's generally not the case when A is type-erased, as when it's referred to through void *.
And if there were a proper type-parameterized entry point to malloc (be it a function or macro) for general-purpose contiguous-object allocation, you wouldn't have to compromise the already weak type system with what is possibly the single most absurd implicit conversion possible from a type system perspective just to avoid typing out the type name three times.
I suppose it's more of a language quality issue than a program quality issue, but reducing verbosity by undermining the type system rather than providing a simple "allocate<T>(n)" function, especially in a language that doesn't care about verbosity enough to provide even the thinnest of syntactic abstractions, is just ridiculous.
I agree completely, but none of that changes the nature of C nor the code written in it. If you want to change the API like this, more power to you, but don't then say that existing code still compiles fine.
Please. This is not relevant almost 12 years later!
Moore's law is saturating. Benchmark numbers are more and more a function of core count, not GHz count. It's time to embrace RISC-y, tiny, speed-demon cores (so you can have them by the dozens) over complex brainiac ones (too much space). It's been almost 9 years since Intel introduced dual cores, and the mainstream is still quad-core (with 6 and 8 on the fringes), while GHz is less than what the P4 had more than 10 years ago! What is going on? If Moore's law still held, 9 years on from 2005 Intel should be in the 24/32/48-core ballpark, something that is clearly not the case.
In that vein I would also state that linux needs a major overhaul, i.e., to shift itself from a single core kernel to one that leverages n-cores.
Moore's Law is only about transistor density. Any relationship to clock speed or core count is merely because of how transistor density can enable those.
If you look at transistor density, Moore's Law has held up fine so far. There's no "if," it has held.
Let's scale the P4 count up to a 177 mm2 die so we can make a better comparison:
- 2004: Pentium 4, scaled to a 177 mm2 die: ~200M transistors
- 2014: Core i7 on a 177 mm2 die: 1.4B transistors
On the surface it looks like a lot of progress in 10 years: a 7-fold increase in transistor count. But we're talking about an exponential law, a doubling of the count every two years. This is what it should've given us:
- 2006: 400M
- 2008: 800M
- 2010: 1.6B
- 2012: 3.2B
- 2014: 6.4B
- 2016: 12.8B ... !!!
If we consider 3 years to be the doubling time instead of two (which is already a failure of Moore's law, which originally stated 18 months), we should get this:
- 2007: 400M
- 2010: 800M
- 2013: 1.6B
- 2016: 3.2B
Not only were we not at 1.6 billion transistors a year ago, I don't think Intel has any announcement that they'll be at 3.2 billion (on a 177 mm2 die equivalent) a year and a half from today.
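(For reference, both lists above just evaluate the doubling law $N(t) = N_{2004} \cdot 2^{(t-2004)/T}$ with $N_{2004} \approx 200\text{M}$: $T = 2$ years gives the first list, $T = 3$ years the second.)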
But who counts the transistors? If it's Intel itself, then I'm afraid I can't take it very seriously. Then what other measure do we have? Benchmarks, mostly. Unfortunately the problem with benchmarks is that over the years the auxiliary hardware improves a lot, not just the processor, so a modest processor improvement still results in overall system speed-up due to other components. For example, during the P4 days the front-side bus to RAM was a lot slower than it is on the latest motherboard with the latest RAM and a Core i7. And not a lot of people are interested in benchmarking a P4 with the latest motherboard and RAM, because for one thing it's very difficult to set up, and for another, you're only annoying a Silicon Valley giant as a result -- not a very good marketing strategy.
I'm willing to bet that if you could make a 2004 P4 run on a modern motherboard with modern RAM and then compare it to a 2013 Haswell, the single-core improvement would be meh at best, and the multi-core improvement would look more linear (even sub-linear) than exponential.
Moore's law has become quite a marketing gimmick over the past few years, and the public seems to be okay with it.
We've probably seen a bit of a slowdown in Moore's law over the last 10 years, at least when it comes to Intel CPUs, but I think you're exaggerating the extent of the slowdown.
Some benchmarks are heavily dependent on RAM latency and/or bandwidth, but improvements there have been slower than improvements in CPU performance. If CPU performance is what you're interested in, then pick a benchmark that is largely independent of main memory (a compute-bound benchmark whose working set fits in L2). Haswell will still do a lot better than a P4 in a compute-bound benchmark, perhaps even beating it by more than in a memory-bound benchmark.
The other thing to consider is that over time more attention is being paid to performance per Watt than raw FLOPS and in that regard Haswell looks rather better compared to the P4.
Okay, I've done some thinking/browsing and I've come up with an approximate way to figure out (edit: relative) Moore's law progress without relying on official transistor counts from Intel:
1 - Take this Haswell image (Intel claim: 2.6 billion in 355 mm2):
3 - Process these images: remove extra text and resize one so that the ratio of the areas of the two dies is 355 to 135.
4 - Extract one core from Haswell (e.g., top left one)
5 - Remove the per-core cache (e.g., top left square in the top left core)
6 - Estimate the area of the resulting core.
7 - For Prescott, remove the visible left side (which is L2 cache)
8 - Remove about 30% from the top, likely overhead of the Netburst architecture (not present in the Haswell core); you might disagree with this step, in which case you can do the calculation without removing it.
9 - Estimate the area of this.
10 - The number in step 6 (area-6) should be smaller than in step 9 (area-9) and would represent the true shrinkage over 4 technology nodes.
11 - area-9 divided by area-6 gives the factor of transistor-count increase over 4 nodes.
This wouldn't give absolute numbers and won't corroborate Intel's quotes for either Haswell or Prescott. But it will give the ratio which can be compared with the claim of going from 90nm to 22nm, i.e., 4 generations, over 8 years.
According to that, the factor in step 11 should be 2^4 = 16, i.e. 1600% (my guess: it will be way less than that), even ignoring the fact that the time difference between the two articles is 9.5 years instead of 8.
(P.S.: I don't have time atm but I'll try to do this myself)
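If anyone wants to script step 11, here is a trivial sketch (the area values are placeholders, not measurements):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Placeholder values -- substitute your own estimates from steps 6
           and 9, in any consistent units after the rescaling in step 3. */
        double area_haswell_core  = 20.0;   /* step 6: trimmed Haswell core  */
        double area_prescott_core = 80.0;   /* step 9: trimmed Prescott core */

        double measured = area_prescott_core / area_haswell_core;
        double expected = pow(2.0, 4.0);    /* 4 nodes at 2x per node = 16x  */

        printf("measured shrink: %.1fx vs. Moore's-law expectation: %.0fx\n",
               measured, expected);
        return 0;
    }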