Low-Lock Singletons In D (davesdprogramming.wordpress.com)
30 points by andralex on May 6, 2013 | 30 comments



This is an elegant solution to the double-checked locking bug (http://en.wikipedia.org/wiki/Double-checked_locking), incurring insignificant overhead.


Unfortunately, the space overhead is not quite that insignificant. It requires at least one bit of storage per thread; this does not matter much for a singleton class, but if you want to implement more general write-once data structures (such as arrays or hash tables), it's not negligible.

Somewhat less problematic is that fast TLS is not natively available on all architectures (in particular, Mac OS X). Luckily, this can be worked around relatively easily.


D does its own TLS implementation on OS X. But now that newer versions of OS X have native TLS, we'll be scrapping ours and using the native one.


Or you could use atomic swaps on the initialized bit. That's both cheaper and simpler.

pseudocode:

    member atomic32 init
    member mutex lock
    member obj ptr


    membarrier()
    if (atomic_load(&init)) {
      return ptr
    } else {
      locked_init()
      atomic_save(&init, 1)
      membarrier()
    }
Not an expert, use this at your own risk.


It's well-known that you can do double-checked locking right with memory barriers, but memory barriers aren't always cheaper. On TSO architectures (x86, most SPARCs), read barriers are no-ops, but with weaker memory models (e.g., ARM), they aren't.

Edit: Also, the memory barriers are in the wrong place in the code. You need a read barrier between the init check and returning ptr, and a write barrier between the actual initialization of the object and writing the init variable.
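
In D's core.atomic terms, the corrected shape would be roughly this (an untested sketch, with an acquire load / release store pair standing in for the explicit barriers; get, init_ and ptr_ are placeholder names):

    import core.atomic : atomicLoad, atomicStore, MemoryOrder;

    shared bool init_;       // published last, with release semantics
    __gshared Object ptr_;   // the lazily created instance

    Object get()
    {
        // Load-acquire: the read barrier sits between checking init_ and using ptr_.
        if (atomicLoad!(MemoryOrder.acq)(init_))
            return ptr_;

        synchronized   // slow path: at most one thread constructs the object
        {
            if (ptr_ is null)
                ptr_ = new Object;
            // Store-release: the write barrier sits between constructing the
            // object and publishing the init_ flag.
            atomicStore!(MemoryOrder.rel)(init_, true);
            return ptr_;
        }
    }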


Interesting. I didn't know that. Sorry for the error, I'm not an expert in this subtle stuff. Barely even conversant to be honest.

On those architectures with weaker memory models, the really simple thing to do is to make the singleton thread-unsafe for initialization.


You force a memory barrier in all cases, while the original code only passes through a barrier when the thread-local instantiated_ flag is false, which happens at most once per thread.


If you have atomic swaps (hardware "lock-free"), the global shared instance pointer should be enough - since it either holds a valid pointer or it doesn't.


No, you don't want two threads trying to allocate the object at once. You still need a locked init.


Yes, there would be an allocation race. So unless you can accept the occasional unnecessary allocation, discarded in a way that still preserves the semantics of the singleton (one and only one instance returned over its life), you would still need a flag. Though you might instead use an "incomplete"/placeholder reference value to simulate the flag and prevent duplicated effort/wasted resources while the allocation happens - all in the space of the pointer.


That's not sufficient on its own (leaving aside the problem of multiple initialization). While compare and swap will likely have the necessary write barrier built in, you need a matching read barrier when accessing the object.

I.e., reads could otherwise get re-ordered so that you see a valid pointer but not-yet-initialized contents behind it.


Right. I was still thinking in terms of my thread-local-less model posted before when I wrote the above, which took for granted those barriers once past init.


What do you swap into the pointer? A new instance?


If the pointer is 0 through a normal test, you would allocate the resource and do a compare-and-swap to record the allocation if the pointer is still 0. As noted beside this comment, you might wind up allocating several times in parallel (which may be undesirable too), but you would never clobber an existing allocation or change the instance that is dispensed.
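
In D-ish terms it might look something like this untested sketch (Resource, instance_ and get are made-up names, and the shared casts are my best guess at what core.atomic wants):

    import core.atomic : atomicLoad, cas, MemoryOrder;

    class Resource { }

    shared Resource instance_;   // stays null until some thread's CAS wins

    Resource get()
    {
        // Acquire load, so seeing the pointer also means seeing the object's
        // contents (the read barrier pointed out elsewhere in this thread).
        auto r = cast(Resource) atomicLoad!(MemoryOrder.acq)(instance_);
        if (r is null)
        {
            // Several threads may allocate in parallel; only one CAS can win.
            auto fresh = new Resource;
            if (cas(&instance_, cast(shared Resource) null, cast(shared) fresh))
                r = fresh;   // our allocation became the one instance
            else             // lost the race: ours gets dropped, use the winner's
                r = cast(Resource) atomicLoad!(MemoryOrder.acq)(instance_);
        }
        return r;
    }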


I don't think thread-local variables are necessary.

Consider a global flag, initialized and synchronized across threads/cores to a state of 0, such that if it ever becomes 1, that happened only after the shared instance was published and synchronized via a global pointer. Neither the pointer nor the flag will revert for the rest of the singleton's life (noted in the article), so there is nothing to synchronize past that point; it is only in the flag's 0 (unallocated) case that there is any question of whether things are synchronized. That check and the potential allocation are what need a standard lock (provided by D's synchronized block), which the example already provides.


To get such a global flag with the desired behavior, you need explicit memory barriers. Otherwise, you cannot guarantee that it won't become visible to another processor before the object's data itself becomes visible.

The problem here is that the memory accesses for the path that bypasses the synchronization block are not guaranteed to occur in the correct order without barriers.


Yep - it should be enough to force a coherent view for all as part of the initialization.


Yes, it's well-known that you can fix the problem with memory barriers. But the article's point was that such memory barriers can be avoided entirely for the common execution path if you have fast thread-local storage (because you can make sure that every thread is forced through the full synchronization block once before accessing anything). Given that memory barriers can be expensive on some processors, that's a not insignificant win.
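
For reference, the pattern in question looks roughly like this (my reconstruction, so treat the details as approximate; instantiated_ is the thread-local flag mentioned elsewhere in the thread):

    class MySingleton
    {
        private this() { }

        private static bool instantiated_;        // thread-local in D: one flag per thread
        private __gshared MySingleton instance_;  // the single shared instance

        static MySingleton get()
        {
            if (!instantiated_)      // cheap TLS check, no barrier on the common path
            {
                synchronized         // each thread takes the lock at most once
                {
                    if (instance_ is null)
                        instance_ = new MySingleton;
                    instantiated_ = true;
                }
            }
            return instance_;
        }
    }

The per-thread flag is also exactly the one-bit-per-thread space cost mentioned near the top of the thread.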


Does D's synchronization not contain memory barriers (I don't know)?

(Edited)


The POSIX standard requires a full memory barrier for the underlying operations. The problem is that synchronization is expensive: not only because of the memory barriers, which have to include the most expensive ones, but also because locking and unlocking operations need to access main memory directly, bypassing all caches.

The problem, in any event, is not the synchronization but the common path (the one for which double-checked locking tries to avoid synchronization), which needs at a minimum a load-load barrier. That can be done efficiently on TSO architectures (where a load-load barrier is a no-op), but is not guaranteed to be fast on other architectures.


The problem with memory barriers is that they are slow. David's solution using TLS avoids them entirely for the usual case, hence it is much faster (as his benchmark results show).


Thanks for your comment. I do see that the TLS is used in a conditional to avoid the barriers in most cases (the already-initialized case), but the initialization still happens in a synchronized block. I may be wrong, but the D docs say that effectively wraps the block in a critical section, which seems likely to include read/write barriers - meaning the code given has barriers at init and also uses TLS. If so, the TLS would be unnecessary, for my reasons above, which makes me suspect that the example has just hidden the barriers behind the synchronized keyword.


Yes, when it is synchronizing there are memory barriers. But the point is the synchronization and memory barriers can be skipped once the singleton is initialized. The usual case can then run without barriers.


Thanks again. If I am right, though, the memory barriers could also have been skipped without thread-local values, since, thanks to the barriers on init, no thread's view would revert to its pre-init state anyway. As rbehrends mentions, barriers are not the only factor... For example, global memory access may be slower on some systems due to decreased locality (a property provided by the TLS), but I don't know what value, if any, that would have in this benchmark. There is also a comment on that page that, in response to a question about whether D's initialization would be semantically equivalent, does the same thing as my TLS-free approach using Java's more reasonable static global initialization syntax.

Edit: (Not waiting for the reply link to appear...) When I say "the memory barriers could also be skipped," I mean any memory barriers beyond those associated with initialization. There would be memory barriers at initialization, but that is the only place they are needed, and I believe the synchronized keyword provides them here. The object presented in the article then looks like a cache-friendlier version of a single global variable, with the necessary barriers limited to the initialization of the global instance pointer (which might also be done as in the Java example given in the comments on the page). The TLS does not eliminate the need for any barriers; it only helps eliminate non-local reads. That might allow a performance gain depending on the cache behavior (or, for example, in the presence of NUMA), but it hasn't reduced the need for locking IMHO.


If multiple threads are accessing two globals, the threads are not guaranteed to see changes to one or the other in the same order. The guarantee that they would is called "sequential consistency". The only way to get sequential consistency between threads is to insert memory barriers or use synchronization.

There is no way around this. Your statement that memory barriers could be skipped is exactly the source of the double checked locking bug. I know this is hard to understand, which is why it has tripped up a looong list of very smart people over the years.

David's article solves the problem by having one of the globals not be global at all, but thread local.

As for performance, David's article shows with benchmarks that his solution is a huge win.


Doesn't D have the ability to do stuff on class initialization? Or does it not initialize classes lazily?

In Java, this is what a threadsafe Singleton looks like:

        private static MySingleton instance = new MySingleton();
It will be initialized when the class is first used.

The whole double-checked locking and TLS thing is IMO a huge pointless smartassery contest in VM lawyering hinging on the eminently false premise that having singletons in a heavily multithreaded environment initialized sooner than they're used is a problem that anyone needs solved.

Whereas in reality, 90% of all applications don't need lazy loading, and for 90% of those that do, the above is lazy enough.
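
Edit: if I read the D docs right, the equivalent over there would be a shared static constructor, which runs once before main() - so something like this unverified sketch:

    class MySingleton
    {
        private this() { }
        private __gshared MySingleton instance_;

        // Runs once, before main(), so every thread sees a fully
        // constructed instance without any locking.
        shared static this()
        {
            instance_ = new MySingleton;
        }

        static MySingleton get() { return instance_; }
    }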


I think you missed a static keyword in that line of code.


You're right; fixed.


That requires an acquire on every access, so it's generally slower.


No, it doesn't. The JVM guarantees class initialization to be threadsafe (the lock is implicit), so you don't need mutual exclusion to access the reference - it's initialized before any thread could attempt to access it.



