

Low-Lock Singletons In D - andralex
http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

======
WalterBright
This is an elegant solution to the double checked locking bug
<http://en.wikipedia.org/wiki/Double-checked_locking>, incurring insignificant
overhead.

~~~
rbehrends
Unfortunately, the space overhead is not quite that insignificant. It requires
at least one bit of storage per thread; this does not matter much for a
singleton class, but if you want to implement more general write-once data
structures (such as arrays or hash tables), it's not negligible.

Somewhat less problematic is that fast TLS is not natively available on all
architectures (in particular, Mac OS X). Luckily, this can be worked around
relatively easily.

~~~
WalterBright
D does its own TLS implementation on OS X. But now that newer versions of OS X
have native TLS, we'll be scrapping ours and using the native one.

------
jamesaguilar
Or you could use atomic swaps on the initialized bit. That's both cheaper and
simpler.

psuedocode:

    
    
        member atomic32 init
        member mutex lock
        member obj ptr
    
    
        membarrier()
        if (atomic_load(&init)) {
          return ptr
        } else {
          locked_init()
          atomic_save(&init, 1)
          membarrier()
        }
    

Not an expert, use this at your own risk.

~~~
vy8vWJlco
If you have atomic swaps (hardware "lock-free"), the global shared instance
pointer should be enough - since it either holds a valid pointer or it
doesn't.

~~~
jamesaguilar
No, you don't want two threads trying to allocate the object at once. You
still need a locked init.

~~~
vy8vWJlco
Yes, there would be an allocation race, so if you cannot accept allocating
unnecessarily in a discardable/recoverable way (but also one that preserves
the semantics of the singleton: one and only one instance returned over its
life), you would still need a flag. Though you might instead use an
"incomplete"/place-holder reference value to simulate the flag and prevent
duplication of effort/wasting resources while the allocation happens - all in
the space of the pointer.

------
brazzy
Doesn't D have the ability to do stuff on class initialization? Or does it not
initialize classes lazily?

In Java, this is how a threadsafe Singleton looks like:

    
    
            private static MySingleton instance = new MySingleton();
    

It will be initialized when the class is first used.

The whole double-checked locking and TLS thing is IMO a huge pointless
smartassery contest in VM lawyering hinging on the eminently false premise
that having singletons in a heavily multithreaded environment initialized
sooner than they're used is a problem that anyone needs solved.

Whereas in reality, 90% of all applications don't need lazy loading, and for
90% of those that do, the above is lazy enough.

~~~
spand
I think you missed a static keyword in that line of code.

~~~
brazzy
You're right; fixed.

------
vy8vWJlco
I don't think thread-local variables are necessary.

Consider a global flag that gets instantiated and synchronized across
threads/cores to a state of 0 such that if it ever gets set to 1, it was after
publishing and synchronizing the shared instance via a global pointer. This
pointer, and the global flag, will not revert for the rest of the singleton's
life (noted in the article) - so there is nothing to synchronize past that -
and it is only for the guarding flag's 0 (unallocated) case that there is a
question if things are synchronized. That check and potential allocation are
what need a standard lock (provided by D's syncronized block), which the
example already provides.

~~~
rbehrends
To get such a global flag with the desired behavior, you need explicit memory
barriers. Otherwise, you cannot guarantee that it won't become visible to
another processor before the object's data itself becomes visible.

The problem here is that the memory accesses for the path that bypasses the
synchronization block are not guaranteed to occur in the correct order without
barriers.

~~~
vy8vWJlco
Yep - it should be enough to force a coherent view for all as part of the
initialization.

~~~
rbehrends
Yes, it's well-known that you can fix the problem with memory barriers. But
the article's point was that such memory barriers can be avoided entirely for
the common execution path if you have fast thread-local storage (because you
can make sure that every thread is forced through the full synchronization
block once before accessing anything). Given that memory barriers can be
expensive on some processors, that's a not insignificant win.

~~~
vy8vWJlco
Does D's synchronization not contain memory barriers (I don't know)?

(Edited)

~~~
WalterBright
The problem with memory barriers is that they are slow. David's solution using
TLS avoids them entirely for the usual case, hence it is much faster (as his
benchmark results show).

~~~
vy8vWJlco
Thanks for your comment. I do see that the TLS is used in a conditional to
avoid the barriers in most cases (the already-initialized case), but the
initialization still happens in a synchronized block. I may be wrong, but the
docs for D say that effectively wraps the block in a critical section which
seems likely to include rw barriers, meaning the code given has barriers at
init _and_ also uses TLS. If so, the TLS would be unnecessary, for my reasons
above - causing me to suspect that the example has just hidden the barriers
behind the _synchronized_ keyword.

~~~
WalterBright
Yes, when it is synchronizing there are memory barriers. But the point is the
synchronization and memory barriers can be skipped once the singleton is
initialized. The usual case can then run without barriers.

~~~
vy8vWJlco
Thanks again. If I am right though, the memory barriers could also have been
skipped without thread-local values since, due to the barriers on init, no
thread's view would revert to it's pre-init state anyway. As rbehrends
mentions, barriers are not the only factor... For example, global memory
access may be slower on some systems due to decreased locality (a property
provided by the TLS), but I don't know what if any value that would have in
this benchmark. There is also a comment on that page that does the same thing
as my TLS-free approach using a more reasonable static global initialization
syntax in Java in response to a question of if D's initialization would be
semantically equivalent too.

Edit: (Not waiting for reply link to appear...) When I say "the memory
barriers could also be skipped" I am referring to any other memory barriers
past those associated with initialization, not those associated with
initialization. There would be memory barriers at initialization, but that is
the only need for memory barriers and I believe that is provided here by the
synchronized keyword. The object presented in the article then looks like a
cache-friendlier version of a single global variable with the necessary
barriers limited to the initialization of the global instance pointer (which
might also be done as in the Java example given in the comments on the page).
The TLS does not eliminate the need for any barriers, only helps eliminate
non-local reads. That might allow a performance gain depending on the cache
behavior (or, for example, in the presence of NUMA), but it hasn't reduced the
need for locking IMHO.

~~~
WalterBright
If multiple threads are accessing two globals, two threads are not guaranteed
to see changes in one or the other in the same order. This is called
"sequential consistency". The only way to get sequential consistency between
threads is to insert memory barriers or use synchronization.

There is no way around this. Your statement that memory barriers could be
skipped is exactly the source of the double checked locking bug. I know this
is hard to understand, which is why it has tripped up a looong list of very
smart people over the years.

David's article solves the problem by having one of the globals not be global
at all, but thread local.

As for performance, David's article shows with benchmarks that his solution is
a huge win.

