
An optimization guide for assembly programmers and compiler makers (2018) [pdf] - muricula
https://www.agner.org/optimize/microarchitecture.pdf
======
inetsee
The page containing the link to this pdf
"[https://www.agner.org/optimize/"](https://www.agner.org/optimize/") has
links to four other pdfs describing other topics on optimization.

~~~
ignoramous
Clickable and non 404:
[https://agner.org/optimize/](https://agner.org/optimize/)

~~~
inetsee
Sorry, I had forgotten that HN doesn't like quotes around links.

------
rolph

I like this site very much, it's simple and old school.

Take a tour around the site for some other aspects of Agner's work: there is a
good treatise on the fundamentals of digital electronics, as well as a
proposed architecture standard [ForwardCom]. A few of the links on Agner's
site are stale or dead, but they still give you the general idea of what to
look for.

[https://www.agner.org/digital/digital_electronics_agner_fog.pdf](https://www.agner.org/digital/digital_electronics_agner_fog.pdf)
[PDF]

and:

[https://forwardcom.info/](https://forwardcom.info/)

------
CalChris
There is also uops.info, which has a different, more formal characterization
of latency:

[https://arxiv.org/abs/1810.04610](https://arxiv.org/abs/1810.04610)

~~~
pedagand
And llvm-exegesis:

[https://llvm.org/docs/CommandGuide/llvm-exegesis.html](https://llvm.org/docs/CommandGuide/llvm-exegesis.html)

"llvm-exegesis is a benchmarking tool that uses information available in LLVM
to measure host machine instruction characteristics like latency, throughput,
or port decomposition.

Given an LLVM opcode name and a benchmarking mode, llvm-exegesis generates a
code snippet that makes execution as serial (resp. as parallel) as possible so
that we can measure the latency (resp. inverse throughput/uop decomposition)
of the instruction. The code snippet is jitted and executed on the host
subtarget. The time taken (resp. resource usage) is measured using hardware
performance counters. The result is printed out as YAML to the standard
output.

The main goal of this tool is to automatically (in)validate the LLVM’s
TableDef scheduling models. To that end, we also provide analysis of the
results."
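
For anyone who wants to try it, an invocation looks roughly like this (flags
as I remember them from the LLVM docs, so double-check against your installed
version):

    llvm-exegesis -mode=latency -opcode-name=ADD64rr
    llvm-exegesis -mode=uops -opcode-name=ADD64rr

The first makes the generated snippet as serial as possible to estimate
latency; the second measures uop/port usage.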

------
anitil
Sometimes I think I'm a 'full stack' developer, and then I read something like
this. Sure I can dabble in simple assembly, but there's always a bigger fish.

~~~
dragontamer
Heh, the people who dabble in assembly for a living typically won't be able to
set up a big PHP + database web stack and all that. Assembly optimization is a
different skill from web development, or other parts of the stack.

We all gotta pick our specialties. For me, low-level high-performance stuff is
more of a hobby-project thing. I've got a few cool things I've learned by
myself, but it isn't really what I do for a living.

~~~
fxfan
I dabbled in assembly, for a living, at a major software firm. I can still do
it after brushing up a bit. And I have no trouble setting up a pg database
with web frameworks and writing optimized code.

Web isn't some magical beast; it's just rote programming.

Actual business logic is where the categorization starts.

~~~
amelius
It's all plumbing and bookkeeping.

If you want to be humbled, go to a conference on high energy physics.

~~~
freyir
They've spent their lives studying high energy physics. It's going to seem
impressive to a non-expert, but to them it's mostly plumbing and bookkeeping.

------
heyjudy
I still remember the olden days when

    XOR AX,AX

was faster than

    MOV AX,0

and Abrash's book was the magnum opus of the pre-P6 era.

I guess I better throw most of those assumptions out for modern
microarchitectures.

~~~
kijiki
Probably not directly faster, but "xor %rax, %rax" is still smaller (and thus
faster if you're icache-constrained) than "mov $0x0, %rax".

~~~
userbinator
It is still faster because the register renamer handles it --- ditto for "sub
reg, reg":

[https://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/](https://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/)
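
You can see the idiom in ordinary compiler output, too; a tiny C example
(exact output depends on the compiler and flags):

    /* zero.c -- compile with: cc -O2 -S zero.c and look at zero.s */
    int zero(void) {
        /* gcc and clang at -O2 typically emit "xorl %eax, %eax" here
         * rather than "movl $0, %eax": the xor form is shorter and is
         * recognized as a zeroing idiom by the renamer. */
        return 0;
    }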

------
cpeterso
I'm interested in seeing how the section about "Future branch prediction
methods" (page 17) will change in our post-Spectre world.

~~~
souprock
This just came up today. Anybody know the answer? The problem involves a JIT
(for opcodes, not scripting).

Chunks of code are generated. Sometimes a chunk will need to branch to a chunk
that isn't yet generated or even allocated. Chunks can also be cleared away
for various reasons, like eviction to make room, so branches into a chunk may
need to stop going there.

One strategy is to use indirect branches. When the destination doesn't exist,
the branch goes to code that will resolve that. It's a bit like how dynamic
linkers work.

Another strategy is to replace the code. The processor may stumble a bit over
what is essentially self-modifying code, but then it eventually runs faster.
The degree to which the processor will pause is a key consideration here.
Anybody know?

BTW, if you like digging into this stuff, we're hiring.

~~~
CoolGuySteve
On a multithreaded implementation, you can do it with 2 stub functions and an
array/hash of function pointers (hereafter referred to as the "table") that
itself acts as a spinlock array. Initialize the table with stub 1, and
atomically write stub 1 back to an entry on eviction.

stub 1) Upon entry, atomically write stub 2 into the table. Then compile, or
enqueue onto a compilation thread/process. At the end of compilation,
atomically write the newly compiled address to the table.

stub 2) Spin on the function pointer array entry until it != stub 2's address.
At the end of the spin, call the new function pointer. If you care about power
consumption or timing attacks, the spinlock can have a backoff corresponding
to the speed of your JIT compiler. This function should fit into a single
cacheline, so it should usually be hot.

inlined thunk: Given a function, read address from table and call it.

The client uses the thunk to call the function address in the table without
knowing if it's compiled or not. On Intel, you shouldn't need a read fence due
to MESI rules.

The branch/fetch predictor works by hashing the address of the branch
instruction into its own lookup table. So by forcing stub 3 to be inlined,
you're relying on the hash to prefetch the correct function address and call
it directly.

You should also be able to put the JIT into a separate process that mmaps with
the NX bit set, and have the client load without the NX bit set. So it would
be difficult to exploit the JIT compiler itself (but the produced code could
still be unsafe).

You might want to cacheline-align the function table, or pack similar chunks
into the same function table entry, to prevent false sharing whenever there's
an atomic write. Also, this table is going to add cache pressure, since the
trampoline incurs an extra cacheline load every time it's called.

Anyways, that's my back of the napkin design for how to do it.
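
To make it concrete, here's a minimal C sketch of the table/stub part (the
names, the use of C11 atomics, and the stubbed-out compile step are my
assumptions; a real JIT thread and backoff are omitted):

    /* Two-stub dispatch table sketch.  Each slot holds either compiled
     * code, stub1 ("not compiled yet"), or stub2 ("compile in flight,
     * spin until it lands").  Compile with: cc -std=c11 -O2 stubs.c */
    #include <stdatomic.h>
    #include <stdio.h>

    #define N_CHUNKS 16

    typedef void (*chunk_fn)(int slot);

    static void stub1(int slot);
    static void stub2(int slot);

    /* The table doubles as the lock: the value in a slot encodes the
     * state of that chunk. */
    static _Atomic(chunk_fn) table[N_CHUNKS];

    /* Stand-in for the real JIT: just returns a canned function. */
    static void compiled_chunk(int slot) { printf("running chunk %d\n", slot); }
    static chunk_fn jit_compile(int slot) { (void)slot; return compiled_chunk; }

    /* Stub 1: mark the slot "in progress", compile (or enqueue to a JIT
     * thread), publish the result, then run it.  A real version would
     * CAS here so two threads don't compile the same slot. */
    static void stub1(int slot) {
        atomic_store(&table[slot], stub2);
        chunk_fn fn = jit_compile(slot);
        atomic_store(&table[slot], fn);
        fn(slot);
    }

    /* Stub 2: someone else is compiling; spin until the slot changes,
     * then call whatever was published.  Back off here if you care
     * about power or timing. */
    static void stub2(int slot) {
        chunk_fn fn;
        while ((fn = atomic_load(&table[slot])) == stub2)
            ;
        fn(slot);
    }

    /* Inlined thunk: callers always go through the table and never know
     * whether the chunk is compiled yet. */
    static inline void call_chunk(int slot) { atomic_load(&table[slot])(slot); }

    /* Eviction: point the slot back at stub 1 so the next call recompiles. */
    static void evict_chunk(int slot) { atomic_store(&table[slot], stub1); }

    int main(void) {
        for (int i = 0; i < N_CHUNKS; i++)
            atomic_store(&table[i], stub1);

        call_chunk(3);   /* first call compiles, then runs */
        call_chunk(3);   /* now just an indirect call      */
        evict_chunk(3);
        call_chunk(3);   /* recompiles after eviction      */
        return 0;
    }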

~~~
souprock
Thanks.

That is certainly a different way to think about threading. My instinct would
be to keep a separate JIT cache for each thread, keeping the threads from
stealing resources from each other but also keeping them from sharing the JIT
effort.

I think your approach is pretty much how /lib/ld-linux.so.2 deals with the
PLT. I didn't see an explanation for stub 3, but I guess you mean the part of
the JIT output that does the indirect jump. I take it that you believe the
speed advantage of a direct jump is not enough to overcome the occasional hit
caused by modifying it, with all the invalidations (icache, trace cache, etc.)
that it might cause. Being similar to the dynamic linker is probably well-
aligned with what Intel is trying to optimize for in upcoming chips. OTOH, the
Spectre problem might limit what Intel does.

Putting the JIT in a separate process would have to add latency. Upon hitting
a missing chunk of code, translation can't really wait. The idea is to run
modern stuff, such as a recent desktop OS, via the JIT. We do use more than
one mapping on Linux, as required to keep SELinux happy.

~~~
CoolGuySteve
Yes, I'm assuming once code is JIT'ed, it will take a while before being
JIT'ed again. But in the meantime the caches will cover some of the trampoline
costs.

With regard to latency, if you have a dedicated JIT thread, you can read ahead
and JIT the targets of all the jumps/calls up to the next ret, in parallel
with executing the first chunk. Like a super fancy prefetch. Heuristically, I
think it's safe to assume that most code is local, but I could be wrong.

Also, if you use hwloc on x86-64, you can read your CPU topology at
initialization time. If hyperthreading is present (which is fairly common),
you can set the affinity of the JIT threads to the same cores as the emulated
CPUs. This will minimize read-concurrency overhead, since you'll be
reading/writing the physical core's L1 almost every time. (This part is
crazy, but you _might_ even be able to get away without using atomics due to
how cache associativity works, but I've never tried it and it might not be
guaranteed in the future.)
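
A rough hwloc sketch of the topology/affinity part (picking core 0 and binding
the calling thread are placeholders; real code would enumerate cores and bind
each JIT thread next to its emulated CPU):

    /* topo.c -- compile with: cc -O2 topo.c -lhwloc */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void) {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* Pick a core and count its PUs: more than one PU per core
         * means SMT/hyperthreading is available. */
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
        if (core) {
            int pus = hwloc_get_nbobjs_inside_cpuset_by_type(
                topo, core->cpuset, HWLOC_OBJ_PU);
            printf("core 0 has %d hardware thread(s)\n", pus);

            /* Bind the calling thread to that core; run the JIT thread
             * and the emulated CPU's thread inside the same core so
             * they share L1/L2. */
            hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }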

------
bogomipz
I've heard this individual's name come up quite a few times (it's very
distinctive) in discussions about the compiler/optimization/CPU architecture
space.

Does anyone know if this individual still teaches at the Technical University
of Denmark, and if there might be any lectures of theirs online?

I've been watching Onur Mutlu's CMU lectures online and this material is a
nice addition to computer architecture self-study.

