
Why do RET instructions have a REPZ prepended on x86? - luu
http://repzret.org/p/repzret/
======
rayiner
Each cycle, a modern x86:

\- Fetches 16-32 bytes

\- Finds instruction boundaries (because of variable-length encoding)

\- Branch predicts any branches in the fetch group

\- Decodes 3-4 instructions in the fetch group

\- Issues 3-4 decoded operations to the rest of the pipeline

This whole process has to happen every cycle (though it's pipelined over 3-5
cycles in the front end), and the performance of this process gates the rest
of the pipeline. As a result, many of the micro-architectural "gotchas" in
modern x86 processors tend to be in the fetch/decode phase.
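
The boundary-finding step is nontrivial precisely because x86 instructions range from 1 to 15 bytes, so the decoder can't know where the next instruction starts until it has partially decoded the current one. A small illustration (the byte values are real x86-64 encodings; the C arrays are just a way to show the lengths):

```c
/* Three real x86-64 instructions of different lengths. A decoder
   scanning this byte stream can't locate the second instruction's
   start without first working out how long the first one is. */
static const unsigned char nop1[] = {0x90};                         /* nop             (1 byte)  */
static const unsigned char nop3[] = {0x0F, 0x1F, 0x00};             /* nop dword [rax] (3 bytes) */
static const unsigned char mov5[] = {0xB8, 0x2A, 0x00, 0x00, 0x00}; /* mov eax, 42     (5 bytes) */
```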

~~~
onan_barbarian
You are overgeneralizing. Being front-end bound is true for some code but not
for others.

Our gotchas typically show up as not being able to issue instructions due to
data dependencies, filling up the RS with things that can't make progress
(informally, a big tree of "ORs" where one source feeding everything has
missed L1 and holds everyone else up).
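
That dependency-chain contrast is easy to sketch (hypothetical code, not from the thread): a single accumulator makes every add wait on the previous one, so one L1 miss stalls the whole chain, while split accumulators give the out-of-order engine independent work to overlap:

```c
#include <stddef.h>

/* One long serial dependency chain: add i+1 cannot issue until add i
   completes, so a cache miss anywhere stalls everything behind it. */
long sum_chained(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Four independent chains: while one accumulator waits on a miss,
   the OOO engine can keep the other three making progress. */
long sum_split(const long *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    for (; i < n; i++) s += a[i]; /* scalar tail */
    return s;
}
```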

Other people will be hopelessly bound on cache or memory bandwidth, or
floating point multiply, or whatever. Or disk access :-)

If you're curious whether this applies to your own code, you don't have to
speculate - just bust out the Optimization Guide
([http://www.intel.com/content/www/us/en/architecture-and-tech...](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html)) and go figure out
where your bottlenecks are by looking at counters. VTune can do some of this
stuff for you.

~~~
rayiner
I'm speaking from an architectural design standpoint. Exceeding the ability of
the OOO engine to cover cache misses isn't a "gotcha"; it's a design limit of
the hardware. Being able to only predict three branches in a 16-byte fetch
group, even though more than 3 branches can fit in a 16-byte stretch of code,
is a gotcha. Being able to execute only 3 SSE operations per cycle is a design
limit. Not being able to hit 3 SSE operations per cycle depending on which
registers you use and how many prefix bytes they require to encode is a
gotcha.
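
The prefix-byte gotcha is concrete: the same SSE operation grows by a byte when it touches xmm8-xmm15, because those registers are only reachable via a REX prefix. These are the actual encodings (the C arrays just make the lengths checkable):

```c
/* addps xmm0, xmm1 - registers xmm0-xmm7 need no REX prefix. */
static const unsigned char addps_lo[] = {0x0F, 0x58, 0xC1};
/* addps xmm8, xmm9 - REX.RB (0x45) is required to reach xmm8-xmm15,
   adding one byte and shifting every later instruction boundary. */
static const unsigned char addps_hi[] = {0x45, 0x0F, 0x58, 0xC1};
```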

------
simcop2387
Things like this are why I got out of writing ASM for anything back in the
1990s. It's gotten to the point where a tight loop might still be doable by a
person in a performant way, but anything larger is far too complicated to make
perform well across the board. Along with that you can't make
it portable on other architectures.

~~~
derleth
> Along with that you can't make it portable on other architectures.

Not only that, but it seems from other stuff I've read that you'd have a hard
time making x86 code performant on more than one microarchitecture or family
thereof, which means re-writing the same code for different manufacturers and
different generations from the same manufacturer.

It seems that if you accept the façade that says "x86 is x86" you're better
off writing in C or even Haskell, but if you peel the façade back to write
fully performant code you're doomed to either keep re-writing it or doomed to
see it languish as processor technology moves on.

I'm immediately reminded of the early RISC design concept of "moving the
microcode into the compiler": extremely simple hardware that explicitly
exposed design details like pipeline length (in the form of delay slots that
became ISA features), with the promise that compilers would handle all the
resulting complexity much like microcode handled it in the System/360 and
similar.

It turns out architectural branch delay slots were a mistake, but compilers
_have_ gotten better at turning obvious source code into performant machine
code unless you need so much performance you have to dig into architectural
details anyway.

It's unavoidably complicated. And now we're moving towards multi-core CPUs,
which bring to the fore another area which is unavoidably complicated.

~~~
onan_barbarian
This is spot on. We use a lot of intrinsics, as we want to get at some of the
underlying features of the architecture, but we almost never write asm. It
becomes obsolete almost immediately and is often far worse than the compiler's
first cut.

We had a former performance guru who once put a huge #ifdef
I_DONT_CARE_ABOUT_PERFORMANCE around the "naive" C version of a loop and had
his favourite x86 asm version on the #else branch of the ifdef. Needless to
say, much hilarity resulted when defining this macro _improved_ performance. I
suspect that the naive version may have been a bad idea at one stage (Pentium
4, perhaps) but became a better version from Core 2 onwards...

Occasionally I'll get on a wild tear and think "Oh, I can do better than the
compiler" after it generates some ridiculous-looking instruction sequence that
seems to not be using enough registers (or whatever). I then find that hand-
writing and hand-scheduling everything in accordance with what looks really good
to me - actually works _worse_ than the insane-looking code in the compiler.
Put it down to stuff like store-to-load forwarding, dynamic scheduling,
register renaming, etc.

~~~
exDM69
> Occasionally I'll get on a wild tear and think "Oh, I can do better than the
> compiler" after it generates some ridiculous-looking instruction sequence
> that seems to not be using enough registers (or whatever).

Compilers are _very good_ at some things, like instruction scheduling and
register allocation, but only poor to mediocre at others, like instruction
selection. In addition, compilers are not allowed to do some optimizations,
like change the order of function calls to other translation units or do
arithmetic optimizations with floating point numbers.
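
The floating-point restriction is easy to demonstrate: without a flag like -ffast-math, the compiler must preserve the grouping written in the source, because IEEE float addition is not associative and the two groupings below genuinely give different answers:

```c
/* IEEE float addition is not associative, so a conforming compiler
   may not rewrite one of these into the other on its own. */
float assoc_left(float a, float b, float c)  { return (a + b) + c; }
float assoc_right(float a, float b, float c) { return a + (b + c); }
```

With a = 1e8f, b = -1e8f, c = 1.0f, the left grouping yields 1.0f while the right yields 0.0f, because -1e8f + 1.0f rounds straight back to -1e8f (the float spacing near 1e8 is 8).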

So if you have a piece of performance critical code, you can still be smarter
than the compiler alone by co-operating with your compiler by writing
compiler-friendly, close-to-the-metal code. Inline assembly code is not good
for the compiler; it can't really optimize it at all. But by using CPU- and
compiler-specific intrinsics to exploit your hardware capabilities (SIMD,
special CPU instructions) you can get your code to run close to the speed of
light without compromising readability or maintainability.
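
A hypothetical sketch of that style: the hot loop uses SSE2 intrinsics where the target supports them, with a plain-C fallback, and the compiler still does register allocation and scheduling around the intrinsics instead of being locked out by an opaque asm block:

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Sum 32-bit ints, four lanes at a time with SSE2 where available;
   the compiler remains free to schedule and allocate around this. */
int sum_i32(const int *a, size_t n) {
    int s = 0;
    size_t i = 0;
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(a + i)));
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; i++) s += a[i]; /* scalar tail (or whole loop without SSE2) */
    return s;
}
```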

Yes, compilers these days are smart but thinking that you can't do better than
that is a lazy man's fallacy.

~~~
notacoward
"Inline assembly code is not good for the compiler; it can't really optimize
it at all."

Actually it can, if you have a way to specify register usage (gcc does) and
the programmer gives the right specification. Unfortunately, that almost never
seems to happen in practice.

~~~
exDM69
> Actually it can

Yeah, I found this out the hard way. Turns out it's really ridiculously
difficult to get GCC to put the rdtsc instruction where you want it :)

But you get the point: the compiler can optimize better if you use C and
intrinsics than if you use inline asm.
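
For reference, this is roughly what the constrained form looks like with GCC's extended asm (x86-only, so it's guarded here; the non-x86 fallback just returns 0). The "=a"/"=d" constraints tell the compiler which registers rdtsc writes, but without extra barriers both the compiler and the CPU remain free to move surrounding work across it - which is exactly why placement is hard:

```c
#include <stdint.h>

static inline uint64_t read_tsc(void) {
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    /* "=a"/"=d": rdtsc writes EAX/EDX. Because the constraints are
       precise, GCC can schedule around this instruction rather than
       treating it as an opaque blob - and it will. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0; /* non-x86 fallback, for this sketch only */
#endif
}
```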

~~~
notacoward
Truth. I only use asm for stuff I _can't_ do in C, like access architecture-
specific registers or use instructions for which there are no C operators.
Anything else strikes me as a bit of an exercise in machismo at the expense of
sound engineering.

------
iso-8859-1
It's fortunate, since it allowed Intel to encode the new XACQUIRE/XRELEASE
of the Transactional Synchronization Extensions' Hardware Lock Elision using
these short ancient prefixes:

"Hardware Lock Elision adds two new instruction prefixes XACQUIRE and
XRELEASE. These two prefixes reuse the opcodes of the existing REPNE/REPE
prefixes (F2H/F3H). On processors that do not support TSX, REPNE/REPE prefixes
are ignored on instructions for which the XACQUIRE/XRELEASE are valid, thus
enabling backward compatibility."

from
[http://en.wikipedia.org/wiki/Transactional_Synchronization_E...](http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions#Hardware_Lock_Elision)

~~~
kps
And REPNE RET, among others, similarly has an interpretation in _Intel® Memory
Protection Extensions_ (lawyers made me write that) to help make the new
pointer bounds checking ABI transparent & interoperable with other code.

[1] [http://software.intel.com/en-us/articles/introduction-to-int...](http://software.intel.com/en-us/articles/introduction-to-intel-memory-protection-extensions) [2] [http://download-software.intel.com/sites/default/files/31943...](http://download-software.intel.com/sites/default/files/319433-015.pdf)

------
diydsp
While we're at it, why did Ahmed Bougacha register an entire domain for this
topic?!?! (srsly, I'm wondering if he's just super-passionate about this topic
and had the resources to spare? Is this the beginning of a much larger set of
questions related to the topic, etc.? )

~~~
qikon
Both! I have plenty of drafts in the area. Right now I’m working on binary
analysis / translation / decompilation using LLVM (I’m actually gearing up for
an RFC/patch on the mailing list):
[http://dagger.repzret.org](http://dagger.repzret.org). If you’re curious, I
gave a talk at the LLVM Euro conference about the kind of problems you’d have,
and some interesting use cases.

We’ve been quiet ever since, I’ll get back to writing about the meat once we
have something out there, and more feedback from the community!

------
bookface
This was really interesting, but one thing it didn't fully explain was why the
fact that ret imm16 was fastpathed would make it a worse choice for this
optimization than repz ret. Excuse my noobiness, but can anybody explain that
to me?

~~~
Someone
_" in the K10 family, the ret imm16 instruction is now fastpathed"_

I understand that as follows: on the K8 both _repz ret_ and _ret imm16_ were
microcoded; hence, the shorter _repz ret_ was preferred.

On the K10, _ret imm16_ got fastpathed. That made it the preferred encoding,
even though it is a byte longer.
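
For concreteness, here are the three encodings under discussion (real opcode bytes; the arrays just make the sizes checkable). A plain ret is one byte, repz ret two, and ret imm16 three - so on the K8, where both alternatives were microcoded, the shorter repz ret won, while on the K10 the fastpathed ret imm16 wins despite the extra byte:

```c
/* Actual x86 encodings of the three return forms discussed above. */
static const unsigned char plain_ret[] = {0xC3};             /* ret                  */
static const unsigned char repz_ret[]  = {0xF3, 0xC3};       /* repz ret             */
static const unsigned char ret_imm0[]  = {0xC2, 0x00, 0x00}; /* ret 0 (ret imm16)    */
```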

~~~
bookface
Ah, thanks. I missed that that whole part was talking about the K10.

------
rbanffy
And there are people who say the x86 ISA is just fine, that such kludges are
absolutely normal...

I get that x86 is immensely successful, but being popular is not synonymous
with being well engineered. Or engineered.

~~~
ajross
Right, because ARM cores exhibit completely predictable behavior with, say,
assignments to IP or instruction predicate bits?

All architectures are insane. MIPS, probably the epitome of a minimal/clean
architecture, still has a vestigial 1980s pipeline artifact (the branch delay
slot) baked right into the ISA such that it can never be removed.

Is the x86 ISA a big mess? Yeah. But ARM is hardly better (anyone remember
Jazelle?). Basically, if you don't want to look at the assembly, stick to your
compiler. That's what it's there for. But don't pretend that this stuff isn't
present everywhere -- real engineering involves tradeoffs.

~~~
simias
As someone who's fought for months with the ARM Cortex-A9's cache prefetch
behaviour and the behaviour of its various memory attributes (not intuitive
and only partially documented), I wholeheartedly agree with you.

Those are bits of hardware which are supposed to run at GHz speeds; they
might look clean enough on the surface but when you dig deep enough "hic sunt
dracones".

------
ballard
Deep pipelines, single-cycle RISC, and the history of MIPS (e.g. Hennessy is a
badass)

[http://en.m.wikipedia.org/wiki/MIPS_architecture#History](http://en.m.wikipedia.org/wiki/MIPS_architecture#History)

------
elmindreda
This is why I like having compilers.

