- Fetches 16-32 bytes
- Finds instruction boundaries (because of variable-length encoding)
- Branch predicts any branches in the fetch group
- Decodes 3-4 instructions in the fetch group
- Issues 3-4 decoded operations to the rest of the pipeline
This whole process has to happen every cycle (though it's pipelined over 3-5 cycles in the front end), and the performance of this process gates the rest of the pipeline. As a result, many of the micro-architectural "gotchas" in modern x86 processors tend to be in the fetch/decode phase.
We see our gotchas typically as not being able to issue instructions due to data dependencies and filling up the RS (reservation station) with things that can't make progress (informally, a big tree of "ORs" where one source for everything has missed L1 and held everyone else up).
Other people will be hopelessly bound on cache or memory bandwidth, or floating point multiply, or whatever. Or disk access :-)
If you're curious whether this applies to your own code, you don't have to speculate - just bust out the Optimization Guide (http://www.intel.com/content/www/us/en/architecture-and-tech...) and figure out where your bottlenecks are by looking at the counters. VTune can do some of this stuff for you.
Not only that, but it seems from other stuff I've read that you'd have a hard time making x86 code performant on more than one microarchitecture or family thereof, which means re-writing the same code for different manufacturers and different generations from the same manufacturer.
It seems that if you accept the façade that says "x86 is x86" you're better off writing in C or even Haskell, but if you peel the façade back to write fully performant code you're doomed to either keep re-writing it or doomed to see it languish as processor technology moves on.
I'm immediately reminded of the early RISC design concept of "moving the microcode into the compiler": extremely simple hardware that explicitly exposed design details (pipeline length, say, in the form of delay slots) as ISA features, with the expectation that compilers would handle all the resulting complexity much as microcode handled it in the System/360 and similar machines.
It turns out architectural branch delay slots were a mistake, but compilers have gotten better at turning obvious source code into performant machine code unless you need so much performance you have to dig into architectural details anyway.
It's unavoidably complicated. And now we're moving towards multi-core CPUs, which bring to the fore another area which is unavoidably complicated.
We had a former performance guru who once put a huge #ifdef I_DONT_CARE_ABOUT_PERFORMANCE around the "naive" C version of a loop and had his favourite x86 asm version on the #else branch of the ifdef. Needless to say, much hilarity resulted when defining this macro improved performance. I suspect that the naive version may have been a bad idea at one stage (Pentium 4, perhaps) but became a better version from Core 2 onwards...
Occasionally I'll get on a wild tear and think "Oh, I can do better than the compiler" after it generates some ridiculous-looking instruction sequence that doesn't seem to be using enough registers (or whatever). I then find that hand-writing and hand-scheduling everything in accordance with what looks really good to me actually works worse than the insane-looking code from the compiler. Put it down to stuff like store-to-load forwarding, dynamic scheduling, register renaming, etc.
Compilers are very good at some things, like instruction scheduling and register allocation, but only poor to mediocre at others, like instruction selection. In addition, compilers are not allowed to perform some optimizations, like reordering calls to functions in other translation units or doing arithmetic transformations on floating-point numbers.
So if you have a piece of performance-critical code, you can still be smarter than the compiler alone by cooperating with it: write compiler-friendly, close-to-the-metal code. Inline assembly is not good for the compiler; it can't really optimize around it at all. But by using CPU- and compiler-specific intrinsics to exploit your hardware's capabilities (SIMD, special CPU instructions), you can get your code to run close to the speed of light without compromising readability or maintainability.
Yes, compilers these days are smart but thinking that you can't do better than that is a lazy man's fallacy.
Actually it can, if you have a way to specify register usage (gcc does) and the programmer gives the right specification. Unfortunately, that almost never seems to happen in practice.
Yeah, I found this out the hard way. Turns out it's really ridiculously difficult to get GCC to put the rdtsc instruction where you want it :)
But you get the point, the compiler can optimize better if you use C and intrinsics than inline asm.
I know a lot of folks who have not reacted well to compilers becoming smarter than they are. =)
Also, wouldn't it be possible to ship the code in an intermediate form that closely resembles the hardware, but lets the OS finish compiling it with the correct details?
That's what you can do with LLVM or Java byte code.
I could be wrong, but I think that was part of the original plan: Machine code would be thrown away along with the hardware, only the (implicitly C) sources would be saved, and you'd rebuild the world with your shiny new compiler revision to make use of a shiny new computer. Implicit in this model is the idea that compilers are smart and fast and can take advantage of minor hardware differences.
Compare this to the System/360 philosophy, where microcode is meant to 'paper over' all differences between different models of the same generation and even different generations of the same family so machine code is saved forever and constantly reused. (This way of doing things was introduced with the System/360, as a matter of fact.) Implicit in this model is the idea that compilers are slow, stupid, and need a high-level machine language where microcode takes advantage of low-level machine details.
A half-step between these worlds is bytecode, which can either be run in an interpreter or compiled to machine code over and over again. The AS/400, also from IBM, takes the latter approach: Compilers generate bytecode, which is compiled down to machine code and saved to disk when the program is first run and whenever the bytecode is newer than the machine code on disk; when upgrading, only the bytecode is saved, and the compilation to machine code happens all over again. IBM was able to transition its customers from CISC to RISC AS/400 hardware in this fashion.
As you said, the world didn't work like the RISC model, and we now have hardware designed on the System/360 model along with compilers even better than the ones RISC systems had designed for them. Getting acceptable performance out of C code has never been easier, but going the last mile to get the absolute most means making increasingly fine distinctions between types of hardware that all try hard to look exactly the same to software.
Sure... but also remember that most C code ends up having to be compiled for the lowest common denominator. All your x86 Linux distro binaries, for instance, will be compiled for 'generic' x86-64, i.e. something compatible with a 10-year-old Athlon 64 3000+.
Ok, well you say... we can just turn on arch-specific compiler options (-msse4.2 -march=core2, etc.), and this is true, but there's still less incentive for compiler writers to produce wonderful transforms and optimisations utilising SSE 4.2 and AVX2, or a very specific target pipeline, than there is for generic optimisations that apply across platforms or processor families. In fact, a lot of the SSE code generated by compilers is fairly rudimentary even with these options.
Part of the 'problem' really is we have this model where x86 CPUs are being engineered with additional complexity to run old code faster, even though a lot of that old code generally doesn't need that performance. You only have to look at some of the other architectures to see some interesting approaches that x86 has ignored.
Perhaps we need to be more of the mindset that recompiling code when moving from AMD64 to Core2 is as obvious as recompiling it when moving from ARM to x86.
"Hardware Lock Elision adds two new instruction prefixes XACQUIRE and XRELEASE. These two prefixes reuse the opcodes of the existing REPNE/REPE prefixes (F2H/F3H). On processors that do not support TSX, REPNE/REPE prefixes are ignored on instructions for which the XACQUIRE/XRELEASE are valid, thus enabling backward compatibility."
I understand that as follows: on the K8 both repz ret and ret imm16 were microcoded; hence, the shorter repz ret was preferred.
On the K10, ret imm16 got fastpathed. That made it the preferred code, even though it is a byte longer.
I get that x86 is immensely successful, but being popular is not synonymous with being well engineered. Or engineered.
All architectures are insane. MIPS, probably the epitome of a minimal/clean architecture, still has a vestigial 1980s pipeline artifact (the branch delay slot) baked right into the ISA such that it can never be removed.
Is the x86 ISA a big mess? Yeah. But ARM is hardly better (anyone remember Jazelle?) Basically, if you don't want to look at the assembly stick to your compiler. That's what it's there for. But don't pretend that this stuff isn't present everywhere -- real engineering involves tradeoffs.
Those are bits of hardware that are supposed to run at around GHz speeds; they might look clean enough on the surface, but dig deep enough and "hic sunt dracones".
One of the nice things about x86 is that while the encoding is a little wonky, the semantics are great. No imprecise exceptions, no branch delay slots, strong memory ordering, etc. The last one in particular is absolutely wonderful in this day and age of multicore processors (scumbag RISC: makes you use memory fences everywhere; makes memory fences take 300 cycles).