I profiled this and found that the vast majority of the time was spent allocating stack segments. So the basic problem is that this benchmark is simply tuned for segmented stacks. An implementation that does not use segmented stacks will do worse on this benchmark. We've rejected segmented stacks because they don't perform well in the real world (and hurt many other benchmarks, including most of the shootout), but they do make this one particular program fast.
The debate on M:N versus 1:1 threading seems irrelevant to this benchmark. In 1:1 mode we will also allocate a lot of stack space.
Channel performance is mostly irrelevant to this benchmark.
It's hard to say what the more "idiomatic" Rust version of this program would be, because this program doesn't do anything. The fastest version of this program would be println((N + 1).to_str()). :) An implementation based on libdispatch/TBB-style blocks might be faster, but it would look quite a bit different from this benchmark.
> The fastest version of this program would be println((N + 1).to_str()).
Chinese Whispers basically represents a DSP system: each agent is "plugged into" the next one in the chain, and a signal has to propagate through the chain, being modified by each agent in an arbitrary fashion. The fact that they all just +1 the signal is irrelevant; each one could output whatever it likes. The only assumption is that the work each agent performs is O(1).
Oh, sorry, forgot to include: 1. Only the agent that outputs a signal knows where it should be propagated[1]. (The propagation decision is part of each agent's internal state.) 2. Each agent may be running on a different system than the next, and may propagate the signal using a network call instead of an internal message-pass-and-yield.
Chinese Whispers is inherently a test of message passing in a distributed system. An iterator-based design (e.g. a "controller" agent that pulls in the state after each call and sends it to the next agent, forcing a hub-and-spoke series of RPC calls) will
• tend to introduce highly-degraded performance, since you've doubled the number of messages being sent;
• force a synchronous wait in the controller (it can't do anything until it's talked to everyone, whereas with the asynchronous propagation, all it has to do is "kick it off");
• possibly just not work at all--if, for example, an agent propagates into a private network that only accepts messages from that agent, not from your controller.
---
[1] The Chinese Whispers problem includes a set-up phase, where the controller instantiates the agents and wires them to one-another. This is just for the sake of convenience, though; the controller should not be assumed to have knowledge of how the agents are interconnected once it has the ring established.
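The ring described above can be sketched in current Rust with OS threads and `std::sync::mpsc` channels (a single-process illustration under the stated assumptions; the original benchmark used green tasks, and network propagation is not modeled):

```rust
use std::sync::mpsc;
use std::thread;

// A minimal "Chinese whispers" ring: n agents, each wired to the next, each
// applying an O(1) transformation (+1 here) before forwarding the signal.
fn run_chain(n: u64) -> u64 {
    let (first_tx, mut prev_rx) = mpsc::channel::<u64>();
    for _ in 0..n {
        let (tx, rx) = mpsc::channel::<u64>();
        // Each agent only knows its own outgoing channel, matching [1].
        thread::spawn(move || {
            let signal = prev_rx.recv().unwrap();
            tx.send(signal + 1).unwrap();
        });
        prev_rx = rx;
    }
    first_tx.send(0).unwrap();
    prev_rx.recv().unwrap()
}

fn main() {
    println!("{}", run_chain(100)); // prints 100
}
```

The set-up loop is the controller's wiring phase; once the ring exists, the controller only "kicks it off" with the initial send and waits at the far end.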
This particular benchmark is especially poor if that's what it's trying to test, then, because it is completely gated on task/goroutine spawning performance; message passing barely even shows up in the profile in either Rust or Go.
Since you might be reading this: why does Rust do this channel/port thing instead of just channels? Is that written down somewhere? Does it have to do with the type system or the ownership semantics?
Why do people always insist on posting sub-second performance benchmarks in blogs? Is it really so hard to put a for(var i = 0;i < 1000;i++) {} around the code?
I'm not talking about making the benchmarks actually reliable, but at least make them pass the most basic of inspections.
Your computer could be doing anything in the few seconds that your application ran. Is your blog really that worthless to you that you couldn't wait 5 minutes to establish that what you're saying holds water?
edit: okay, I need to sleep a bit better, I'm coming off bitter again :P
I'm not sure I understand concurrency for concurrency's sake. It seems a non-concurrent version of this would run at least two orders of magnitude faster than either implementation.
Maybe I'm just set in my old ways, but I tend to equate adding concurrency with adding complexity (mostly to debugging and code legibility) and really only do it when I'm CPU-bound. Even then, it has to be a pretty darn good reason.
I understand there are some benefits of CSP programming, but simple message passing where every single thread wakes up during the call? Half the time you're looking at network or memory bandwidth issues which a single thread could easily saturate.
Now maybe if each of these threads were, say, an intelligent agent hitting the CPU hard, but this? This is two microbenchmarks merged together and run for an exceptionally short period of time: one is the creation of threads and the other is message passing between them.
The most salient quote from Daniel Micay (strncat):
"It doesn't demonstrate a pattern you would use in Rust. Rust tasks aren't a substitute for generators/iterators and they're not around as a control flow feature."
The solution is "don't spawn a huge number of tasks in a tight loop". Rust tasks are optimized for maximum performance once you spin them up, at the cost of some overhead when you spawn them. That was a deliberate choice, because not having segmented stacks makes many much more important things far faster, such as calling into C and performing tight sequential computations without thrashing on stack boundaries or using a garbage collector to move stacks around and rewrite all the pointers. Rust is about making costs explicit: when every function call could turn into a huge malloc(), you lose a lot of predictability in performance (and Go found this out too, from what I gather).
Calling into C is a big one: we concluded it was basically impossible to make Rust calling into C as cheap as C calling into C in a segmented stack implementation. In practical usage, we found that the ability to call into C cheaply was much more important than the benefits of small stacks, which mostly help microbenchmarks like this at the expense of real-world code.
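To make the cost concrete: with ordinary contiguous stacks, calling a C function from Rust today is just a plain function call, with no stack-switching prologue. A minimal sketch using `abs` from the C standard library:

```rust
// Declare a function from libc; with non-segmented stacks, calling it
// compiles down to an ordinary call instruction.
extern "C" {
    fn abs(x: i32) -> i32;
}

// Safe wrapper around the foreign call.
fn c_abs(x: i32) -> i32 {
    unsafe { abs(x) }
}

fn main() {
    println!("{}", c_abs(-5)); // prints 5
}
```

With segmented stacks, every such call would instead need a check (and possibly an allocation) to guarantee the C callee has enough contiguous stack, which is the overhead the comment describes.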
> In practical usage, we found that the ability to call into C cheaply was much more important than the benefits of small stacks, which mostly help microbenchmarks like this at the expense of real-world code.
For Rust's goals, I can definitely believe that the first is true. I don't think the second claim is true at all, though. Lightweight threads have been extremely successful in Haskell, Erlang, etc, and not just on microbenchmarks.
> I don't think the second claim is true at all, though. Lightweight threads have been extremely successful in Haskell, Erlang, etc, and not just on microbenchmarks.
They're mostly successful in languages that GC/heap-allocate all stack frames (paying the costs that come with it). Rust uses the machine stack so the benefits come with some very significant drawbacks, most notably unpredictability of performance that comes from stack thrashing. Large mallocs are slower than small mallocs in general, but on the typical sizes you need for machine stack segments you probably aren't going to hit the malloc implementation's free list anyway, so you might as well just allocate up front. If you GC stack frames, though, you just bump allocate in the nursery, at the cost of greatly increased GC pressure (but it makes task spawning much cheaper).
I think Simon & Simon are making exactly the point I am: lightweight threads are significantly faster than any other threading system, and work in real system, not just microbenchmarks. Rust has made some other design choices that it seems make it harder to have some nice things that most modern functional language runtimes have -- I'm sure there are good reasons for all of those choices. But I don't think your claim that the tradeoffs "mostly help microbenchmarks" is right.
I think I should have been more precise: I meant that in Rust small stacks mostly help microbenchmarks. Obviously if you're willing to pay the costs of heap allocating stack frames one by one (or have a relocating stack implementation and are willing to pay the costs associated with that) then small stacks can help a lot.
Thanks for the reply. My comment apparently came off as trite, apologies.
One of the enlightening things for me about CSP and Erlang, in particular, was using channels and messages and a bunch of lightweight tasks for flow control and state management. Reading the reddit comments, and even your reply here, seems to indicate that that type of programming isn't a good fit for rust, and that perfectly reasonable trade offs were made that led the language in a different direction. I should have phrased the comment better, but the recommendations here and in the reddit thread boil down to replying to "it hurts when I do this" with "don't do that".
Obviously no-one is trying to write programs that count to 10001, but the capability of doing it that way efficiently opens up a bunch of different options.
It will be interesting to see what happens to go's performance on things like this if they move away from segmented stacks.
It seems to me that Rust is still good at CSP; one can still start up many (logical) threads, but they are not as cheap.
This might just mean that it is more sensible to spawn a smaller number and communicate with them via ports.
It's still CSP, it's still useful, but it's not exactly like Go. My guess is that if you start to do more computation in each thread, the benchmarks will change.
The Go style is only cheap if the stack does not grow.
Oh, come on. Not only is your comment needlessly insulting but it's completely unconstructive.
What types of CSP things should you do in Rust, then? Which should you not? Why is this code not a demonstration of a slowness in Rust but a 'bad idea'? How would you implement it in a more idiomatic manner in Rust?
>Oh, come on. Not only is your comment needlessly insulting but it's completely unconstructive.
I find his comment pretty spot on. If anything, it was the "don't do CSP in Rust" comment that was unconstructive and passive-aggressively insulting. Instead of understanding what he read and the constraints it shows, the commenter preferred to just piss on the language.
As for the example, it was obviously contrived. I mean, heck, it's "Chinese whispers" with 10,000 channels; what more do you need to see that this is not a serious way of solving actual problems with CSP? It's a bloody microbenchmark, and those are rarely representative of anything.
>What types of CSP things should you do in Rust, then?
The types of CSP things that people use threads and processes for, e.g. getting lots of computations going in parallel, not incrementing a single int in each of thousands of "computations".
CSP and channels are meant for getting real (cpu intensive) work done. You don't just use threads (or greenthreads) "because concurrency".
That some Go programmers like to use channels as a control mechanism doesn't mean CSP was meant for that kind of abuse. To me, that's like using exceptions for control flow.
Isn't it obvious that even in Go the times to complete this are horrible (just an order of magnitude less horrible, because of different design trade-offs), and that doing the same job of 10,000 increments properly would complete in 1/1000th of the time?
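For reference, the "proper" sequential version of the same job is a trivial fold, with no tasks or channels at all (an illustrative sketch):

```rust
// The benchmark's entire workload, done sequentially: thread the signal
// through n "+1" steps with an ordinary fold.
fn sequential(n: u64) -> u64 {
    (0..n).fold(0u64, |signal, _| signal + 1)
}

fn main() {
    println!("{}", sequential(10_000)); // prints 10000
}
```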
> CSP and channels are meant for getting real (cpu
> intensive) work done. You don't just use threads (or
> greenthreads) "because concurrency".
The whole point of green threads is that they're orders of magnitude cheaper to create (and destroy) than real, operating system threads. They are precisely around "because concurrency" -- they allow you to nicely model concurrent problems without having to be overly concerned with the implementation detail of their creation cost.
While it's not idiomatic in Go to use a channel/goroutine combo as an iterator, for example -- that's too low-level -- it's absolutely idiomatic to use one for other types of higher-order control flow, managing state machine transitions, doing a scatter/gather, and so on.
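A scatter/gather of that shape can be sketched with threads and channels (shown here in Rust for consistency with the rest of the thread; `x * x` is a hypothetical stand-in for real work):

```rust
use std::sync::mpsc;
use std::thread;

// Scatter/gather: fan the inputs out to worker threads, then gather every
// result back on a single channel and combine them.
fn scatter_gather(inputs: Vec<u64>) -> u64 {
    let (tx, rx) = mpsc::channel();
    let n = inputs.len();
    for x in inputs {
        let tx = tx.clone();
        // Each worker does its piece of "real work" and reports back.
        thread::spawn(move || tx.send(x * x).unwrap());
    }
    drop(tx); // gatherer holds the only remaining receiver
    (0..n).map(|_| rx.recv().unwrap()).sum()
}

fn main() {
    println!("{}", scatter_gather(vec![1, 2, 3, 4])); // 1 + 4 + 9 + 16 = 30
}
```

The same pattern in Go would use one goroutine per input and a results channel; the point is that a handful of workers doing real work amortizes the spawn cost that dominates the benchmark.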
I think that's unfair. There's no absolute threshold that defines "the point of CSP"; you might equally say Go's goroutines miss the point of CSP because they're slower than the equivalent for loop.
Besides, the only substantive difference between Go's implementation and Rust's implementation is that Go uses segmented stacks, while Rust does not. That is because segmented or relocating stacks are in opposition to Rust's design goals of no GC, fast calling into C, and predictable performance.
One might say that passing raw pointers around is against the point of CSP. As I understand it, Go does exactly what the Singularity OS does: pass ownership instead of a copy or a raw pointer.
This seems more idiomatic, and more CSP-ish, to me.
(Passing around pointer/references to immutables is of course ok)
> As I understand it, Go does exactly what the Singularity OS does: pass ownership instead of a copy or a raw pointer.
I may misunderstand what you're talking about, but as far as I know Go more or less passes a raw pointer (or a copy thereof) over pointer channels. It has no concept of ownership, and the pointer (and pointee) on the sender side remains valid: http://play.golang.org/p/E1bqrVFxZ6
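For contrast, in Rust sending an owned value over a channel really does transfer ownership: the move is enforced at compile time, so the sender-side binding becomes unusable (an illustrative sketch, not code from the thread):

```rust
use std::sync::mpsc;

// Sending an owned value over a Rust channel moves it. Unlike Go, where the
// pointer (and pointee) on the sender side remain valid, the sender here
// cannot touch `data` after the send.
fn send_owned() -> i32 {
    let (tx, rx) = mpsc::channel();
    let data = Box::new(42);
    tx.send(data).unwrap();
    // println!("{}", data); // would not compile: `data` was moved by `send`
    *rx.recv().unwrap()
}

fn main() {
    println!("{}", send_owned()); // prints 42
}
```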
Rudeness of the person you're replying to aside, I think it's well explained elsewhere in the thread that the Rust people have made choices that don't necessarily give great speed on pointless microbenchmarks (making a bunch of processes that do nothing) in favour of performance on CSP tasks where your CSPs actually do things.
I must really be sheltered because that is only the third time I've ever heard the term "Chinese Whispers" used for a game that I remembered as "Broken telephone".
I'm guessing that you're American, and that the author is not. My understanding is that in most of the English-speaking world the game is called "Chinese whispers" (although that name is falling out of favor because of its racist connotations), while in America it is usually called "telephone" or some variation on that.
I'm Canadian so same thing, basically. The term is used by Rob Pike in the referred presentation and he's Canadian, but he's somewhat older than I am. I first noticed the term used by an Australian coworker about a year ago and I also heard it on a Stephen Fry talk show - so yes to your theory.
There was talk recently that Rust had made the choice of M:N tasks versus real threads configurable. Is it still using M:N tasks by default? It would be interesting to see a comparison between the two for something like this (which should, of course, heavily favor the lightweight task model).
I don't expect much if any of a performance difference. Both M:N and 1:1 threading use non-segmented stacks in Rust, and that is the source of the performance issues.
You're running out-of-memory, and hitting the `llvm.trap` intrinsic call. It causes an illegal instruction to invoke a core dump. It doesn't currently print an error message.
Rename your post, it's going to distract attention from whatever it is you are trying to communicate. I'd just as soon not read it until you call it "telephone" or something more PC.
edit
Also, you'll get flame detected soon enough once this turns into a debate about racism.