

Python the GIL, unladen swallow, reference counting, and atomic operations. - illumen
http://renesd.blogspot.com/2009/12/python-gil-unladen-swallow-reference.html

======
sreque
I thought most of us agreed by now that ref-counting is slower than GC?
<http://www.idiom.com/~zilla/Computer/javaCbenchmark.html> cites a paper
empirically showing just that. I quote:

" In a well known paper [2] several widely used programs (including perl and
ghostscript) were adapted to use several different allocators including a
garbage collector masquerading as malloc (with a dummy free()). The garbage
collector was as fast as a typical malloc/free; perl was one of several
programs that ran faster when converted to use a garbage collector."

And this is just with single-threaded performance! I really hope both unladen
swallow and macruby succeed in their goals of removing the GIL from their
respective scripting languages, and I suspect that the numbers will prove that
python gets better performance with a GC as opposed to reference counting.

------
wingo
The author misunderstands the performance impact of atomic operations.

~~~
mrkurt
1-line throwaway critiques really aren't very useful. Do you have more to add
than a simple "he's wrong"?

~~~
saucetenuto
I'm not the OP, but:

Whenever you update a reference count atomically, you need a memory barrier to
ensure that the write becomes visible to other cores before any subsequent
memory operations. Those barriers are not cheap, and paying for one on every
Py_IncRef and Py_DecRef will have performance implications somewhere between
murderous and tragic.

~~~
wingo
This is what I meant, yes. Note that you incur the cost on immutable data
structures as well as mutable ones. You really thrash your caches this way.

~~~
haberman
You only pay a cost in cache if there is contention over a cache line. If
there is no cache-line contention, I do not believe there is any overhead to
atomic increment. See:

<http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/59268/reply/59046/>

~~~
haberman
And yet empirically there is a 20-40% slowdown:

<http://mail.python.org/pipermail/python-ideas/2009-November/006599.html>

The message is not too specific; is this on an SMP machine? Are these multi-
threaded tests? Given the previous message I cited, I would expect the single-
CPU, single-threaded case to be unaffected by making incref and decref atomic.

------
unwind
My favorite supporting argument:

 _Simple DirectMedia Layer also has a cross-platform atomic operations API in
SDL 1.3._

That is like using Duke Nukem Forever as a supporting reference in a discussion
of game design. :)

Note that this is very tongue-in-cheek; I'm just frustrated that SDL 1.3 seems
to never arrive.

