

Secrets of dispatch_once; x86 performance trickery - mattgodbolt
https://mikeash.com/pyblog/friday-qa-2014-06-06-secrets-of-dispatch_once.html

======
userbinator
What I would do is have dispatch_once NOP out the call instruction that called
it as one of its first operations. Inside, an atomic exchange + compare on the
predicate to stop any other threads that slipped past the call, and that's all
there is to it. CPUs are optimised for executing NOPs since they're so
commonly used for alignment, so the only thing this costs is fetch bandwidth.
Here's a quick x86 implementation PoC off the top of my head:

    
    
        ; assumptions:
        ; * this function is called via a 2-byte call instruction
        ; * args are passed on the stack and popped by the callee
        ; * [esp] : return addr
        ; * [esp+4] : predicate ptr
        ; * [esp+8] : ptr to func() to be called once
        dispatch_once:
            pop edx              ; edx = return address
            pop ecx              ; ecx = predicate ptr
            mov eax, 1
            xchg eax, [ecx]      ; atomically claim the predicate
            test eax, eax        ; was a different thread here before us?
            jnz dispatch_done
            mov word [edx-2], 0x9090  ; overwrite the 2-byte call with NOPs
            pop eax                   ; eax = ptr to the block
            push edx                  ; restore return addr for the block's ret
            jmp eax                   ; execute the block + return
        dispatch_done:
            pop eax              ; discard the block ptr
            jmp edx              ; return to caller
    

As the saying goes, "The fastest way to do something is to not do it at all."
In this case, the compare + function call can be completely eliminated after
the first time.

~~~
Marat_Dukhan
One day the call instruction will cross a cache-line boundary, the two cache
lines will propagate to other cores' instruction caches with a delay, some
thread will try to execute a partially modified call instruction, and some
senior programmer will get mad trying to debug it, a la "look, I swap these
two lines and it starts working" (b/c the call instruction no longer crosses
a cache line).

~~~
userbinator
I realise this approach has its assumptions and limitations; I was just trying
to show how far you can take things if you're _really_ after absolute
performance. NOP'ing out the call instruction is the subtle part, but if you
can ensure that all your calls to this function are 2-byte calls and aligned,
then everything works out great.

Direct calls on x86 are 5 bytes, so NOP'ing them out atomically is a bit more
difficult. Two ways of doing it I can think of: [1] use an 8-byte load/store,
or [2] ensure the displacement part is aligned and overwrite it with the
address of a single RET instruction, so the call immediately returns. The
first way could be slower but effectively NOPs out the whole call; the second
is simpler and faster to do but leaves the useless call/ret (while still
eliminating the compare/branch).

Incidentally, doing this is much easier on RISCs like MIPS and ARM, thanks to
fixed-length instructions (which are naturally aligned), and they tend to have
much shorter pipelines too, so the flush penalties are lower.

------
comex
Copying my comment from the post:

I don't think the cpuid is actually required, but feel free to tell me that
I'm wrong.

The comment in dispatch_once talks about systems with weakly ordered memory
models like ARM. But x86 has a strongly ordered model: among other things,
"any two stores are seen in a consistent order by processors other than those
performing the stores" [Intel]. After calling dispatch_once, I expect to see
any writes performed by the once routine, but once the store to the
dispatch_once_t flag has been observed, this is already guaranteed.

The main reordering pitfall in x86 is that reads may be reordered with earlier
writes to different locations, but I don't see how that would cause a problem
here.

~~~
nkurz
I'm tending to agree with your reasoning, but am reaching a slightly different
conclusion: there might be a problem, but if there is, I don't see how 'cpuid'
solves it.

Like any delay, it reduces the chances that the race will occur, but I don't
think it provides any guarantee. What is to prevent a thread from reading the
initial predicate==0 and then being swapped out for thousands of cycles just
before the conditional is executed? I presume the solution for this is that
the "write side" does a CAS before it actually executes the block. But since
it's doing this anyway, I don't see how the 'cpuid' is actually helping
things. I think the CAS will be doing a read-to-own whether or not it
succeeds, and thus it should never see the wrong state.

Edit: Looking at the comments in the source through the link that Marat
supplied below, I'm now realizing that the (supposed) purpose of the 'cpuid'
is not to prevent double execution of the initialization block, but to prevent
the process that performs the initialization from pre-reading a data variable
that is not yet initialized. Since this isn't necessary on x86, it seems even
less likely that the 'cpuid' is doing anything useful here.

------
nkurz
Great article! I'm still trying to wrap my head around this, but in the
meantime I'll quibble about one tiny throwaway aside:

 _DISPATCH_EXPECT is a macro that tells the compiler to emit code that tells
the CPU that the branch where predicate is ~0l is the more likely path. This
can improve the success of branch prediction, and thus improve performance._

While this is what the macro would do on processors that support such hints,
x86/x64 isn't one of those processors (Itanium was). Instead, its more
important purpose here is to hint to the compiler (not the processor) to lay
out the conditional in assembly such that the expected result is on the faster
"branch not taken" path.

That said, is the actual implementation of the "write side" viewable online?

~~~
Marat_Dukhan
There are hints for taken/non-taken branches in x86. However, the only family
to use them was Pentium 4; other processors silently ignore those hints.

As for the write side, you'll find it here:
[https://opensource.apple.com/source/libdispatch/libdispatch-...](https://opensource.apple.com/source/libdispatch/libdispatch-339.90.1/src/once.c)

~~~
gsg
Hints are useless, but arranging for cold code to be on a forward branch can
be beneficial.

------
colin_mccabe
The missing element from this explanation (or maybe I missed it?) is that
dispatch_once uses atomic instructions (cmpxchg, to be exact) to ensure that
even if multiple threads attempt to create the singleton, only one succeeds.
The fast path for reads only kicks in after everything else has been tidied
up.

Another way to do this kind of thing on Linux is to use ELF TLS to have a
thread-local variable identifying if the expensive operation has been
completed. If the TLS is not there, you can take a mutex and fall back to the
slow path.

------
rossjudson
This feels racy, and I'm not sure it will work on multi-node NUMA
architectures. It's one thing to delay and hope that your write is visible in
the local processor cache; it's another to ensure that write is visible to
other processor nodes, when there's no explicit cache coherence rule that says
it will be.

------
quotemstr
Relying on an implementation detail of the cpuid instruction seems like a very
bad idea. It'd be safer, I think, and just as fast, to initially use the
atomic version of the code, then NOP out the LOCK prefix once initialization
is complete.

~~~
userbinator
Intel specifies CPUID to be serialising; this is unlikely to change in the
future.

~~~
cfallin
It's actually part of the architectural definition, so Intel could not change
this without breaking ISA compatibility. And they take x86 compatibility very
seriously!

------
n0rm
Can someone link a mirror?

I get a timeout for some reason.

