
Concurrent Memory Deallocation in the Objective-C Runtime - chmaynard
http://www.mikeash.com/pyblog/friday-qa-2015-05-29-concurrent-memory-deallocation-in-the-objective-c-runtime.html
======
pron
That's a cool GC algorithm, but it has two big drawbacks:

1. It works well only if the number of GCed data structures -- i.e. the
number of entry/exit points -- is relatively small (in this case, just the
method cache).

2. It prevents inlining of the data structure's access functions, which adds
quite a bit of overhead to each access (though not as much as two fenced
CASes).

It can, however, be very useful for hot code swapping (provided there's no
inlining or you have a JIT that can un-inline).

~~~
pcwalton
I think the way to look at this is more as a synchronization technique than
as a general-purpose GC algorithm.

I don't understand the objection about inlining--could you explain? You could
in principle not use function boundaries to delimit the critical section: a
range of PC addresses within a function would do just as well.

~~~
pron
> a range of PC addresses within a function would do just as well

Yes, but you'll need to know where those ranges are (i.e. you'll need some
feedback from the compiler), and then be very careful that no tool injects any
instructions into, or moves code around in, your executable/library.

------
pcwalton
Fascinating. I had never considered sending signals to stop all threads and
check their PCs as a synchronization technique. But it works, and it brings
the uncontended lock acquisition and release time down to zero in the case
where no synchronization needs to be performed.

Now I'm trying to think of other places to use this trick. :)

~~~
lgg
It isn't actually stopping the threads. The thing to understand is that you
don't need to ensure that all threads are outside of a critical section at the
time you free the cache; what you need is the weaker guarantee that every
thread that was inside a critical section at the time of the pointer swap has
exited it before you can call free.

You don't have to do anything heavyweight like stopping all the threads;
instead you can just pull their PCs out and check them. Sure, they may have
moved on (and even back into the critical section) by the time you call free,
but it doesn't matter, because they'll be using the new cache buffer. The
downside is that you might get false positives for the critical section, but
you can either run the dealloc in a loop or defer it and try again later.

~~~
asveikau
No, you pretty much have to stop them. If you observe the PC of other threads
at a point in time while they are running, that is no guarantee that they
aren't _about to enter_ the unsafe region immediately after you make the
check.

Your suggestion sounds akin to replacing a mutex acquire with an "is the lock
held?" [like trylock with 0 timeout, then immediate release]. Observing that
it's unheld at time _t_ makes no guarantee it won't be held by the time of
your next instruction. It therefore becomes a meaningless check, not useful at
all for real synchronization.

[PS: mentioned this on HN before, but my favorite "observe the PC as part of a
synchronization primitive" hack was this one from Linux on armv5:
[http://lwn.net/Articles/314561/](http://lwn.net/Articles/314561/)]

[PPS: How much does objc_msgSend() do inline and how much is external calls?
This PC hack seems like it could have huge holes if some of its critical work
is done in a non-inlined function.]

~~~
dilap
Hmm? No I don't think you have to stop them.

Basically you have:

    
    
        cache = <newval>
        if ok() {
            free(<oldval>)
        }
    

The problem you have is if anyone is still using <oldval> at the point you
free it. So:

    
    
        // maybe someone here gets a reference to <oldval> (1) 
        cache = <newval>
        // from this point onward no one can reference <oldval>
        if ok() {
            free(<oldval>)
        }
    

So ok() only has to check if threads have a reference to <oldval> that they
are still using.

Imagine ok() pauses all threads. It sees if any thread is in
BAD = [PC_BAD_START, PC_BAD_END]. If yes, return false; if no, return true.

Now imagine ok() doesn't pause the threads. What can happen?

A thread that was in BAD leaves BAD. That's fine.

A thread that wasn't in BAD enters BAD. But that will use <newval>, so that's
fine too.

A thread that was in BAD leaves BAD and then re-enters. That's also fine.

So there's no problem.

(re your question, I believe objc_msgSend is hand-optimized assembly that
doesn't make any function calls; if it did, you'd just have to make sure to
include those functions in the range of bad PC addresses.)

~~~
asveikau
> So ok() only has to check if threads have a reference to <oldval> that they
> are still using.

Right. The problem with doing this while other threads are running is that
ok() can return true, correctly so for its point in time, then _immediately
after_ ok() returns another thread could enter objc_msgSend() while you are
inside free(). Maybe ok() ran on thread A while thread B was right at
unrelated function foo()'s "call objc_msgSend" instruction. The check is OK at
a point in time, but perhaps by the time you enter free() thread B did an
unsafe read.

> A thread that wasn't in BAD enters BAD. But that will use <newval>, so
> that's fine too.

You can't make guarantees that it will see newval. Whether or not it does
depends on timing. For example, maybe the guy who calls ok() gets a page fault
or is on a very busy CPU with lots of preemption happening. That will alter
timing in the direction of this being unsafe. Cache coherence may also be an
issue here.

~~~
dilap
You have to be atomically reading and writing cache (which you can do). So
then after

    
    
        cache = <newval>
    

...any future reads of cache will use <newval> (or <newer-val>, but not
<oldval>).

That's what makes the trick work w/o having to pause threads.

~~~
asveikau
> cache = <newval>

As written, that is _not_ a guarantee that all cores will see <newval>
immediately. It's very CPU-specific but you may need memory fences to achieve
this.

Further, in my opinion it's kind of playing with fire.

Edit: Also, it is my impression reading the article that <oldval> is actually
a shared list of old caches to be freed (gOldCachesList). That makes it a lot
more complicated than your example snippet and leaves more potential for nasty
synchronization problems.

~~~
ridiculous_fish
Strictly speaking you're right. However it's not necessary that other cores
see the new store immediately: it's only necessary that they see it before the
memory is freed, which occurs after a substantial delay sufficient to prevent
these out-of-order issues on other CPUs. This is the technique by which memory
barriers are elided on the read side (but not the write side).

~~~
vitalyd
This is both clever and terrifying :). How do they prevent the compiler from
scheduling the load early, observing it, and using a stale cache pointer? Do
they have at least a compiler fence on the read side?

~~~
bdonlan
The read-side code in question is hand-written assembly, isn't it? And on the
hardware side, observing the PC will probably result in an actual memory
barrier happening anyway (even if no signal is sent, the kernel must still
internally suspend the thread to capture its register state, and then it needs
a barrier to communicate that result back to the asking thread).

~~~
vitalyd
Yes, perhaps that part is in the hand-written assembly portion that's
mentioned.

