
100,000 tasklets: Stackless and Go - timwiseman
http://www.dalkescientific.com/writings/diary/archive/2009/09/15/100000_tasklets.html
======
dons
Comparing Go and Haskell:
[http://www.reddit.com/r/programming/comments/a4n7s/stackless...](http://www.reddit.com/r/programming/comments/a4n7s/stackless_python_outperforms_googles_go/c0ftumi)

This is all about how cheap threads are.

Go is about twice as slow on this benchmark, and doesn't scale past 400k
threads. GHC 6.12 goes to >3 million threads (limited by available RAM at that
point).
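
For reference, the kind of program being timed here is the classic goroutine daisy chain (my reconstruction; the article's exact code may differ in detail): each goroutine reads a number from its right-hand neighbour and passes it on, incremented by one.

```go
package main

import "fmt"

// link is one goroutine in the chain: receive from the right
// neighbour, add one, pass the result to the left neighbour.
func link(left, right chan int) {
	left <- 1 + <-right
}

// chain builds n linked goroutines and threads a zero through
// them, so the value that emerges at the far end is n.
func chain(n int) int {
	leftmost := make(chan int)
	left := leftmost
	var right chan int
	for i := 0; i < n; i++ {
		right = make(chan int)
		go link(left, right)
		left = right
	}
	right <- 0 // inject a zero at the far end
	return <-leftmost
}

func main() {
	fmt.Println(chain(100000)) // prints 100000
}
```

The cost being measured is almost entirely goroutine creation plus one channel send/receive per hop.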

~~~
jrockway
I have done a similar benchmark; it is actually the MVars that kill you with
respect to memory. If you can omit them, you can do way more than 3 million
threads.

(I have not had enough coffee today, so I can't say how useful this setup
would be. I have done threads-without-mvars for a TCP server, but my OS
refused to give me any more than 300,000 pipes. Disappointing.)

~~~
dons
> If you can omit them, you can do way more than 3 million threads.

Via IORefs? Or?

------
seunosewa
This problem is inherently single-threaded. Stackless Python is single-
threaded. Go distributes the goroutines over several processors using a thread
pool so the performance is naturally lower for a no-op program like this one.
Conclusion: inter-processor communication is expensive?

~~~
dons
> Go distributes the goroutines over several processors using a thread pool so
> the performance is naturally lower for a no-op program like this one

Performance is certainly worse with more cores. And you can end up with
exponential slowdowns due to thread migration. The best strategy is to pin
parts of the thread ring on each core.

Go doesn't have a good SMP scheduler, so you see that slowdown, and the GHC
team have written about the solutions. See this thread:
[http://www.reddit.com/r/programming/comments/a4n7s/stackless...](http://www.reddit.com/r/programming/comments/a4n7s/stackless_python_outperforms_googles_go/c0ftvvj)

------
decode
Don't forget to take your grain of salt with this article. The timing
comparison is between two different computers of unknown hardware, running two
different operating system versions.

~~~
decode
I decided to do the comparison for him, using a 386 build of today's tip of Go
from the hg repo and Stackless Python 2.6.4 built from source, running on
Ubuntu 9.04 with its stock 2.6.28-16-generic kernel (32-bit). Hardware is a
Core 2 Duo at 3GHz with 4GB of RAM.

    
    
      ~/development/go/dev$ time ./8.out
      100000

      real	0m0.632s
      user	0m0.288s
      sys	0m0.344s

      ~/install/stackless-2.6.4$ time ./python 100k.py
      100000

      real	0m0.350s
      user	0m0.292s
      sys	0m0.060s

~~~
bajsejohannes
It seems sys is what makes the big difference here. Does anyone have a theory
why?

~~~
barrkel
I would guess contention on sync primitives in the dispatcher. The queues look
like they need to be locked before they can be modified:

    
    
        // Scheduling helpers.  Sched must be locked.
        static void gput(G*);   // put/get on ghead/gtail
    

This is the comment on the lock:

    
    
        in the uncontended case,
         * as fast as spin locks (just a few user-level instructions),
         * but on the contention path they sleep in the kernel.
         * a zeroed Lock is unlocked (no need to initialize each lock).
    

A better approach for multicore would probably be work-stealing queues.
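
The idea can be sketched in a few lines of Go (a toy illustration, not the runtime's actual code; the names and the lock-per-queue shortcut are mine): each worker pops from its own queue and only touches another worker's queue, from the opposite end, when its own runs dry, so the common path rarely contends.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// deque is a toy lock-protected work queue. A real work-stealing
// deque (e.g. Chase-Lev) avoids locking on the owner's fast path;
// the mutex here just keeps the sketch short.
type deque struct {
	mu    sync.Mutex
	tasks []int
}

func (d *deque) push(t int) {
	d.mu.Lock()
	d.tasks = append(d.tasks, t)
	d.mu.Unlock()
}

// popBottom is the owner's side: LIFO end, for locality.
func (d *deque) popBottom() (int, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if n := len(d.tasks); n > 0 {
		t := d.tasks[n-1]
		d.tasks = d.tasks[:n-1]
		return t, true
	}
	return 0, false
}

// steal is the thief's side: FIFO end, taking the oldest work.
func (d *deque) steal() (int, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) > 0 {
		t := d.tasks[0]
		d.tasks = d.tasks[1:]
		return t, true
	}
	return 0, false
}

// run seeds worker 0 with ntasks no-op tasks; the other workers can
// only find work by stealing. Returns the number of tasks processed.
func run(workers, ntasks int) int64 {
	qs := make([]*deque, workers)
	for i := range qs {
		qs[i] = &deque{}
	}
	for t := 0; t < ntasks; t++ {
		qs[0].push(t)
	}
	var processed int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				if _, ok := qs[id].popBottom(); ok {
					atomic.AddInt64(&processed, 1)
					continue
				}
				stole := false
				for v := range qs {
					if v == id {
						continue
					}
					if _, ok := qs[v].steal(); ok {
						atomic.AddInt64(&processed, 1)
						stole = true
						break
					}
				}
				if !stole {
					return // all queues empty, and no new work arrives
				}
			}
		}(i)
	}
	wg.Wait()
	return processed
}

func main() {
	fmt.Println(run(4, 20)) // prints 20
}
```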

~~~
ori_b
From what I know of the authors, they would have picked the simpler solution
until it actually turns out to be a problem in the real world.

------
dalke
At least a grain. It would be nice if someone contributed actual timing
numbers from the same box, since I was unable to compile Go.

The point is that Go is a compiled language designed for concurrent
processing, while Stackless Python is a byte-code language which also had to
parse the input program as part of the timing. Even then, the raw numbers put
Stackless at twice the speed of Go. Even assuming my machine is 4 times faster
than Pike's, that's still not enough to justify what I interpreted as praise
for Go's concurrent processing performance.

