100,000 tasklets: Stackless and Go (dalkescientific.com)
67 points by timwiseman on Nov 15, 2009 | hide | past | favorite | 21 comments



Comparing Go and Haskell : http://www.reddit.com/r/programming/comments/a4n7s/stackless...

This is all about how cheap threads are.

Go is about twice as slow on this benchmark, and doesn't scale past 400k threads. GHC 6.12 goes to >3 million threads (limited by available RAM at that point).
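For reference, the benchmark under discussion can be approximated in a few lines of modern Go (a sketch, not the article's exact program): chain N goroutines with channels and pass a counter through the whole ring.

```go
package main

import "fmt"

// ring chains n goroutines with channels and passes a counter
// through the whole chain, mirroring the 100k-tasklet benchmark.
func ring(n int) int {
	first := make(chan int)
	in := first
	for i := 0; i < n; i++ {
		out := make(chan int)
		go func(in <-chan int, out chan<- int) {
			out <- <-in + 1 // receive, increment, forward
		}(in, out)
		in = out
	}
	first <- 0
	return <-in
}

func main() {
	fmt.Println(ring(100000)) // prints 100000
}
```

How cheap goroutine creation and channel communication are is exactly what this stresses: 100,000 stacks plus 100,001 channels.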


I have done a similar benchmark; it is actually the MVars that kill you with respect to memory. If you can omit them, you can do way more than 3 million threads.

(I have not had enough coffee today, so I can't say how useful this setup would be. I have done threads-without-mvars for a TCP server, but my OS refused to give me any more than 300,000 pipes. Disappointing.)


> If you can omit them, you can do way more than 3 million threads.

Via IORefs? Or?


Isn't this as flawed as the Stackless Python guy's tests? The Haskell program's threads can block other threads (unless you build with -threaded, which will surely change the results).


This problem is inherently single-threaded. Stackless Python is single-threaded. Go distributes the goroutines over several processors using a thread pool so the performance is naturally lower for a no-op program like this one. Conclusion: inter-processor communication is expensive?


> Go distributes the goroutines over several processors using a thread pool so the performance is naturally lower for a no-op program like this one

Performance is certainly worse with more cores. And you can end up with exponential slowdowns due to thread migration. The best strategy is to pin parts of the thread ring on each core.

Go doesn't have a good SMP scheduler, so you see that slowdown, and the GHC team have written about the solutions. See this thread: http://www.reddit.com/r/programming/comments/a4n7s/stackless...
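The cost of scheduling a communication-bound workload across cores can be sampled with a rough sketch like this (modern Go, not the 2009 runtime; `pingPong` is an illustrative helper and the numbers are machine-dependent): a two-goroutine ping-pong timed under different GOMAXPROCS settings.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// pingPong bounces a token between two goroutines n times
// and reports the elapsed wall-clock time.
func pingPong(n int) time.Duration {
	a, b := make(chan int), make(chan int)
	go func() {
		for v := range a {
			b <- v
		}
	}()
	start := time.Now()
	for i := 0; i < n; i++ {
		a <- i
		<-b
	}
	close(a)
	return time.Since(start)
}

func main() {
	runtime.GOMAXPROCS(1)
	fmt.Println("1 thread: ", pingPong(100000))
	runtime.GOMAXPROCS(runtime.NumCPU())
	fmt.Println("all cores:", pingPong(100000))
}
```

With both goroutines pinned to one OS thread, each handoff is a cheap in-process switch; spread across cores, each handoff can involve cross-core wakeups.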


Take this article with a grain of salt: the timing comparison is between two different computers of unknown hardware, running two different operating system versions.


I decided to do the comparison for him, using a 386 build of today's tip of Go from the hg repo and Stackless Python 2.6.4 built from source, on Ubuntu 9.04 with the stock 2.6.28-16-generic kernel (32-bit). Hardware is a Core 2 Duo at 3 GHz with 4 GB of RAM.

  ~/development/go/dev$ time ./8.out 
  100000
  
  real	0m0.632s
  user	0m0.288s
  sys	0m0.344s
  
  
  
  ~/install/stackless-2.6.4$ time ./python 100k.py 
  100000

  real	0m0.350s
  user	0m0.292s
  sys	0m0.060s


Thanks decode! I've updated my page with a pointer to here as well as your conclusion numbers. Let me know if that's a problem.


It seems sys is what makes the big difference here. Does anyone have a theory why?


I would guess contention on sync primitives in the dispatcher. The queues look like they need to be locked before they can be modified:

    // Scheduling helpers.  Sched must be locked.
    static void gput(G*);   // put/get on ghead/gtail
This is the comment on the lock:

    /*
     * in the uncontended case,
     * as fast as spin locks (just a few user-level instructions),
     * but on the contention path they sleep in the kernel.
     * a zeroed Lock is unlocked (no need to initialize each lock).
     */
A better approach for multicore would probably be work-stealing queues.
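The pattern the quoted comments describe can be illustrated with a minimal sketch (not Go's actual scheduler code): a single global run queue guarded by one mutex, which every worker must take for each put and get, so all cores contend on the same lock.

```go
package main

import (
	"fmt"
	"sync"
)

// globalQueue mimics a single locked run queue: every put/get
// takes one mutex, so all worker threads contend on it.
type globalQueue struct {
	mu    sync.Mutex
	tasks []func()
}

func (q *globalQueue) put(t func()) {
	q.mu.Lock()
	q.tasks = append(q.tasks, t)
	q.mu.Unlock()
}

func (q *globalQueue) get() (func(), bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.tasks) == 0 {
		return nil, false
	}
	t := q.tasks[0]
	q.tasks = q.tasks[1:]
	return t, true
}

func main() {
	q := &globalQueue{}
	sum := 0
	for i := 1; i <= 3; i++ {
		i := i
		q.put(func() { sum += i })
	}
	for t, ok := q.get(); ok; t, ok = q.get() {
		t()
	}
	fmt.Println(sum) // prints 6
}
```

A work-stealing design instead gives each worker its own deque and only touches another worker's queue (and its lock) when the local one runs dry.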


From what I know of the authors, they would have picked the simpler solution until it actually turns out to be a problem in the real world.


I would tender a guess that since Go is compiled to native code, it uses a lot more syscalls to manage its threads. Stackless probably does a lot more of the management inside the VM, and doesn't need to make as many (or any) syscalls to do so.

This is just a guess from someone who knows nothing about either (Stackless does run in a VM, right?), so take many large grains of salt.


This is silly; tasklets and goroutines are not comparable: Stackless can't even read/write on multiple channels (i.e., 'select', or in Limbo parlance 'alt'), nor can it do any blocking calls.

In short, you are comparing apples with bazookas. (Also, for a test that runs in such a short time, and is run only once, it is likely that most of the time is taken by things unrelated to what is intended to be measured.)
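For the record, the multi-channel wait being discussed looks like this in modern Go (a minimal sketch; `firstReady` is an illustrative helper name, not anything from the article):

```go
package main

import "fmt"

// firstReady returns a message from whichever channel is ready,
// using select -- Go's version of Limbo's 'alt'.
func firstReady(a, b <-chan string) string {
	select {
	case msg := <-a:
		return msg
	case msg := <-b:
		return msg
	}
}

func main() {
	a := make(chan string, 1)
	b := make(chan string, 1)
	a <- "from a" // only a has a message ready
	fmt.Println(firstReady(a, b)) // prints "from a"
}
```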


Sure it can. I mocked up an example solution for the Stackless list, here: http://www.stackless.com/pipermail/stackless/2009-November/0... with a followup using the channel itself as the dispatcher and not some text string name. It's not as nice as syntactical support, but it gets the job done.

Thing is - it doesn't seem to have been missed from Stackless, which makes me wonder when it's needed.


Fully agree; TFA is almost information-free, IMHO.


TFA (or MFA for My FA?) asks why Pike emphasized Go's concurrency performance when it was almost certainly worse than Stackless Python's, then showed why I inferred that it was worse.


On speed: Rob Pike's video demo compiled 120 kloc of RTL in 8 seconds (~15 kloc/sec). Your C compile took 7 seconds for 25 kloc (3.6 kloc/sec), or 2.75 seconds (9.1 kloc/sec).

On concurrency: you compared a parallelized CSP implementation with a single-threaded one, in a test dominated by communication costs. Single-threaded communication is much faster than potentially contended parallel communication.

On benchmark hygiene: there are a plethora of different machines, different performance profiles and different numbers flying around here. The kloc/sec numbers don't mean anything unless they're on the same machine; similarly, the performance numbers are dependent on degrees of hardware parallelism and the kind of workload (proportion of per-task vs communication vs task startup, all the different variables in cost). Several different benchmarks would need to be run to actually tease out these different variables, to really figure out which one is better.


Hmm. I think I may have gotten the compile numbers from another video about Go, and as I'm reaching the end of my stamina for this topic, I'll just assume I got that wrong.

On concurrency, I thought Go right now was configured by default for only a single kernel thread running goroutines, which would make both of them single-threaded implementations.

A few people now have reported benchmark numbers on their machines, with both Stackless and Go. In every case so far, Go has been slower than Stackless for the 100,000 tasklet/goroutine case. Of course, if Pike had a MacBook Air with an SSD then the compile times are absolutely not comparable.

Agreed also about how to tease out the different variables. Still, recall that I mostly want to know why Pike, for example, stressed that there were no tricks going on underneath the covers to make the performance fast, with the implication that people wouldn't quite believe how fast it was, when the performance does not seem exceptional compared to other similar languages.


I quote from the stackless mailinglist:

"Go is stackless where Stackless is not. Its goroutines use allocated stacks (starting at 4k in size) and can continue running on different threads where Stackless tasklets cannot. In fact, when a goroutine blocks on a system call, the other goroutines in its scheduler are migrated to another thread."

So Stackless is not really concurrent at all. Given that, it is no wonder the performance differs: the functionality provided by Stackless 'microthreads' is in no way comparable to goroutines.


At least a grain. Would be nice if someone contributed actual timing numbers from the same box, since I was unable to compile Go.

The point is that Go is a compiled language designed for concurrent processing, while Stackless Python is a byte-code language which also had to parse the input program as part of the timing. Even then the raw numbers put Stackless at twice the speed of Go. Assuming my machine is 4 times faster than Pike's, that is still not enough to justify what I interpreted as praise for Go's concurrent processing performance.



