
Java's Atomic and volatile, under the hood on x86 - lorelei
http://brooker.co.za/blog/2012/11/13/increment.html
======
jcdavis
I thought Java had switched to using a slightly more optimal lock xadd for
AtomicInteger/etc. updates? (see
<http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7023898>) That doesn't
address the main issue of memory synchronization between cores, but it should
make things a little better?

~~~
pron
This fix is very recent. Not sure whether it's in the stable builds of JDK 7.

------
haberman
Maybe I'm missing something, but I'm very surprised that the JVM would be
implementing atomic increment with a loop that does "lock cmpxchg", retrying
if it fails. The same can be accomplished much more easily (and safely, and
probably with better performance) with "lock add".

For example, take this C program which uses the GCC atomic builtin
__sync_add_and_fetch():

    void f(int *x) { __sync_add_and_fetch(x, 1); }

This compiles into:

    lock add DWORD PTR [rdi],0x1
    ret

cmpxchg is also vulnerable to the ABA problem
(<http://en.wikipedia.org/wiki/ABA_problem>), which the "lock add" approach is
not.

"inc" is also an instruction best avoided these days, since it doesn't update
all of the flags and can therefore cause an EFLAGS stall
([http://stackoverflow.com/questions/12163610/why-inc-and-add-...](http://stackoverflow.com/questions/12163610/why-inc-and-add-1-have-different-performances))
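
To make the comparison concrete, here is a minimal sketch of the two approaches using GCC's `__sync` builtins. The function names `increment_cas` and `increment_add` are made up for illustration, and this is not the JVM's actual code, just the same two shapes written in C. Both return the new value.

```c
/* Shape 1: compare-and-swap retry loop.
 * Read the current value, then try to install old + 1. If another
 * thread changed *x in between, the CAS fails and we retry. */
int increment_cas(int *x) {
    int old;
    do {
        old = *(volatile int *)x;  /* force a fresh read on every attempt */
    } while (!__sync_bool_compare_and_swap(x, old, old + 1));
    return old + 1;
}

/* Shape 2: one atomic read-modify-write. There is no retry loop;
 * the hardware performs the add atomically. */
int increment_add(int *x) {
    return __sync_add_and_fetch(x, 1);
}
```

On x86-64 I would expect GCC to emit a `lock cmpxchg` loop for the first and, because the result is used here, a `lock xadd` for the second (rather than the plain `lock add` you get when the result is discarded), but check the output of your own compiler.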

~~~
stephencanon
The partial flags update stall to which INC is vulnerable is tiny -- on the
order of 10 cycles -- and generally not a significant factor except in tiny
loops with carried flag dependencies that are executed millions of times.
Also, the stall has been largely eliminated on recent Intel µarchs
(Sandybridge and later). That said, ADD is no worse and sometimes better, so
there's really no good reason to use INC.

~~~
haberman
> Also, the stall has been largely eliminated on recent Intel µarchs
> (Sandybridge and later).

How is that accomplished? Isn't the stall required by definition, since the
partial flags update creates extra data dependencies?

~~~
stephencanon
Intel hasn't published much documentation on the precise architectural
techniques that are used; examples 3-25 and 3-26 and the surrounding text in
their optimization manual give a vague description ("In Intel
microarchitecture code name Sandy Bridge, the cost of partial flag access is
replaced by the insertion of a micro-op instead of a stall.") and include an
example of a code sequence that would incur such a stall on earlier µarchs but
is faster than a sequence that avoids the stall on Sandy Bridge.

One simple and common optimization is to perform register rename on the flags
register. More often than not, the flags updated by INC or DEC are simply
overwritten by a later arithmetic instruction without ever being used; simple
rename can eliminate the stall in these cases.

Another simple optimization would be for the front-end to perform macro-op fusion
of an INC or DEC with a following branch that is known to _only_ use the flag
bits written by the INC; then the fused macro-op can issue without waiting on
the other flag bits, and avoid stalling control flow.

However, I have no inside knowledge about what particular changes Intel made
or didn't make to remove this particular hazard; I only know that it seems to
have been nearly entirely eliminated.

------
kyrra
Just a random fact for those who use Snort[1]: this article brings up the
same reason why stream processing in Snort is single-threaded. See their
attempt to make it multi-threaded:

[http://securitysauce.blogspot.com/2009/04/snort-30-beta-3-re...](http://securitysauce.blogspot.com/2009/04/snort-30-beta-3-released.html)

Edit: to clarify, Snort can run across multiple threads, but a single stream
is handled by a single thread. When they tried to process the same data in
multiple threads at once, cache syncing killed performance.

[1] <http://snort.org/>

~~~
brooksbp
Poor cache utilization is what you get on a traditional shared-memory arch
when passing data (packets) from core to core. I wonder if, while going down
this road, they made sure key structures were cache-aligned.
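
For anyone unsure what "cache-aligned" buys you, here is a minimal C11 sketch. It assumes a 64-byte cache line, which is typical for current x86 parts but is an assumption, and the names `CACHE_LINE`, `padded_counter` and `per_core` are hypothetical. The idea is that each core's data sits on its own line, so a write by one core does not invalidate a line another core is using.

```c
#include <stddef.h>

/* Assumption: 64-byte cache lines. Verify for your target CPU. */
#define CACHE_LINE 64

/* Aligning the member forces the whole struct to CACHE_LINE alignment,
 * and the compiler pads its size up to a multiple of that alignment,
 * so consecutive array elements never share a cache line. */
struct padded_counter {
    _Alignas(CACHE_LINE) long value;
};

/* One slot per core (4 is an arbitrary example size). */
struct padded_counter per_core[4];
```

Without the alignment, four `long` counters would fit in a single line and every increment would bounce that line between cores.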

Flow affinity, locking "streams" to cores, is the way to go. Relevant Linux
kernel support:
[http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git...](http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=Documentation/networking/scaling.txt;h=579994afbe067bf9bf6d79bf50c62986dda2765d;hb=HEAD)
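
A rough illustration of flow affinity in user space: hash the flow's addresses and ports and use the result to pick a worker, so every packet of a stream lands on the same thread. The `struct flow` layout, the function name `flow_to_worker` and the mixing constant are all made up for this sketch. It is not the hash the kernel or any NIC uses.

```c
#include <stdint.h>

/* Hypothetical flow key: IPv4 addresses and ports. */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Map a flow to one of `workers` threads. XOR is commutative, so both
 * directions of a connection produce the same hash and therefore the
 * same worker. Returns 0 if workers is 0. */
unsigned flow_to_worker(const struct flow *f, unsigned workers) {
    uint32_t h = (f->src_ip ^ f->dst_ip)
               ^ (uint32_t)(f->src_port ^ f->dst_port);
    /* Cheap integer mixing so nearby addresses spread across workers. */
    h ^= h >> 16;
    h *= 0x45d9f3bU;
    h ^= h >> 16;
    return workers ? (unsigned)(h % workers) : 0;
}
```

Since a given stream is only ever touched by one thread, its state needs no locking and stays in that core's cache.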

------
wtracy
I did a double take when I saw the generated assembly reference %r8d.

Apparently this is a register that was added with AMD64. I guess that's what I
get for not keeping up with x86 assembly over the last five years.

~~~
0x0
That's not valid in standard 32bit x86 assembly, is it? So in fact this is
actually x86_64 / amd64 assembly, almost a different architecture.

Edit: It is not valid according to
[http://www.codeproject.com/Articles/45788/The-Real-Protected...](http://www.codeproject.com/Articles/45788/The-Real-Protected-Long-mode-assembly-tutorial-for) - so headlining with "x86" is misleading at
best.

~~~
technogeek00
Correct, the %r8 to %r15 registers were added by the x86_64 architecture and
are 64 bits wide, just as the other general-purpose registers are now. %r8 and
%r9 are commonly used for passing arguments to functions.

------
d2fn
"For it to end up with the right value at the end (M x N), two things need to
be true." <-- The two things (immediate visibility and atomicity) are not
strictly required. They are only required if all intermediate values are to be
observed by the running threads. Otherwise you can end up with the correct
answer (M x N) without requiring threads to coordinate each write.

~~~
mjb
Right, that's true. Perhaps I oversimplified.

What I was trying to get across is that, in the trivial implementation,
visibility and atomicity are required. There are obviously better ways for
threads to correctly count in parallel with much better performance - but not
ones that Java will automatically recognize based on the obvious
implementation of the code.
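
As a minimal sketch of one such "better way" (in C with pthreads rather than Java, and with hypothetical names `parallel_count`, `count_locally` and `MAX_THREADS`): each thread counts into its own private variable and the totals are combined once at the end. The only coordination is the join, which also makes each thread's result visible to the caller.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64  /* arbitrary cap for this sketch */

struct worker {
    pthread_t tid;
    long iterations;   /* N: how many increments this thread performs */
    long local_count;  /* written once by the thread, read after join */
};

static void *count_locally(void *arg) {
    struct worker *w = arg;
    long n = 0;
    for (long i = 0; i < w->iterations; i++)
        n++;                 /* private counter, no atomics needed */
    w->local_count = n;
    return NULL;
}

/* Run m threads that each count to n; returns m * n, or -1 on error. */
long parallel_count(int m, long n) {
    struct worker workers[MAX_THREADS];
    int started = 0;
    long total = 0;

    if (m < 0 || m > MAX_THREADS)
        return -1;
    for (; started < m; started++) {
        workers[started].iterations = n;
        workers[started].local_count = 0;
        if (pthread_create(&workers[started].tid, NULL,
                           count_locally, &workers[started]) != 0)
            break;
    }
    /* Join every thread that actually started, then sum their results. */
    for (int i = 0; i < started; i++) {
        pthread_join(workers[i].tid, NULL);
        total += workers[i].local_count;
    }
    return started == m ? total : -1;
}
```

Build with `-pthread`. No intermediate value is ever shared, which is why neither per-write visibility nor atomic increments are needed here.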

------
3825
perhaps there is a reason why they wrote Java's instead of Java is?

~~~
jbri
The reason is that it's a possessive apostrophe, not a contraction.

