

Chinese Whispers in Rust - arizabu
http://thornydev.blogspot.com/2014/01/chinese-whispers-in-rust.html

======
pcwalton
Copying my message on Reddit here:

I profiled this and found that the vast majority of the time was spent
allocating stack segments. So the basic problem is that this benchmark is
simply tuned for segmented stacks. An implementation that does not use
segmented stacks will do worse on this benchmark. We've rejected segmented
stacks because they don't perform well in the real world (and hurt many other
benchmarks, including most of the shootout), but they do make this one
particular program fast.

The debate on M:N versus 1:1 threading seems irrelevant to this benchmark. In
1:1 mode we will also allocate a lot of stack space.

Channel performance is mostly irrelevant to this benchmark.

It's hard to say what the more "idiomatic" Rust version of this program would
be, because this program doesn't do anything. The fastest version of this
program would be println((N + 1).to_str()). :) An implementation based on
libdispatch/TBB-style blocks might be faster, but it would look quite a bit
different from this benchmark.

~~~
derefr
> The fastest version of this program would be println((N + 1).to_str()).

Chinese Whispers basically represents a DSP system: each agent is "plugged
into" the next one in the chain, and a signal has to propagate through the
chain, being modified by each agent in an arbitrary fashion. The fact that
they all just +1 the signal is irrelevant; each one _could_ output whatever it
likes. The only assumption is that the work each agent performs is O(1).

~~~
pcwalton
In that case the fastest way to do it in Rust would be to use an iterator.
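A hypothetical sketch of what that iterator version might look like: since each agent only applies a function to the incoming signal, the whole ring collapses into a fold (the name `whisper_fold` and the agent count are made up for illustration).

```rust
// Each of the n "agents" adds 1 to the signal it receives, so the
// entire whisper chain is just a fold over an iterator -- no tasks,
// no channels, no stack segments.
fn whisper_fold(n: u64) -> u64 {
    (0..n).fold(1u64, |signal, _agent| signal + 1)
}

fn main() {
    // 100_000 agents, initial signal 1 -> final signal 100_001.
    println!("{}", whisper_fold(100_000));
}
```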

~~~
derefr
Oh, sorry, forgot to include: 1. Only the agent that outputs a signal knows
where it should be propagated[1]. (The propagation decision is part of each
agent's internal state.) And 2. Each agent _may_ be running on a different
system than the next, and may propagate the signal using a network call
instead of an internal message-pass-and-yield.

Chinese Whispers is inherently a test of message passing _in a distributed
system._ An iterator-based design (e.g. a "controller" agent that pulls in the
state after each call and sends it to the next agent, forcing a hub-and-spoke
series of RPC calls) will

• tend to introduce highly-degraded performance, since you've doubled the
number of messages being sent;

• force a synchronous wait in the controller (it can't do anything until it's
talked to everyone, whereas with the asynchronous propagation, all it has to
do is "kick it off");

• possibly just not work at all--if, for example, an agent propagates into a
private network that only accepts messages from that agent, not from your
controller.

---

[1] The Chinese Whispers problem includes a set-up phase, where the controller
_instantiates_ the agents and wires them to one-another. This is just for the
sake of convenience, though; the controller should not be assumed to have
knowledge of how the agents are interconnected once it has the ring
established.

~~~
pcwalton
If that's what it's trying to test, then this particular benchmark is
especially poor, because it is completely gated on task/goroutine spawning
performance; message passing barely even shows up in the profile in either
Rust or Go.

~~~
masklinn
Yes a "fairer" evaluation of "chinese whispers" would first setup the whole
thing, then start the test and send a message into the ring.
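That separation might look like the following sketch (a hypothetical modern-Rust version using `std::sync::mpsc` and OS threads; the function name and ring size are invented). The whole ring is spawned before the clock starts, so only propagation is measured:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Instant;

// Build a ring of n one-shot "whisper" threads first, then time only
// the propagation of a single message through the ring.
fn whisper(n: u64) -> u64 {
    let (first_tx, mut rx) = mpsc::channel::<u64>();
    // Set-up phase: spawn every agent before any message is sent.
    for _ in 0..n {
        let (tx, next_rx) = mpsc::channel::<u64>();
        let cur_rx = std::mem::replace(&mut rx, next_rx);
        thread::spawn(move || {
            // Each agent increments the signal and passes it on.
            let v = cur_rx.recv().unwrap();
            tx.send(v + 1).unwrap();
        });
    }
    // Spawning is done; only now start the clock and kick off the whisper.
    let start = Instant::now();
    first_tx.send(1).unwrap();
    let result = rx.recv().unwrap();
    println!("propagated in {:?}", start.elapsed());
    result
}

fn main() {
    // 1000 OS threads is modest; the original benchmark spawned far more.
    println!("{}", whisper(1000));
}
```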

------
tinco
Why do people always insist on posting sub-second performance benchmarks in
blogs? Is it really so hard to put a for (var i = 0; i < 1000; i++) {} around
the code?

I'm not talking about making the benchmarks actually reliable, but at least
make them pass the most basic of inspections.

Your computer could be doing _anything_ in the few seconds that your
application ran. Is your blog really that worthless to you that you couldn't
wait 5 minutes to establish that what you're saying holds water?
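A repetition loop along these lines (a hypothetical Rust sketch with a stand-in `workload` function) would already smooth out one-off scheduler noise and report a per-iteration figure:

```rust
use std::time::Instant;

// Stand-in for whatever the benchmark actually measures.
fn workload() -> u64 {
    (0..1_000u64).fold(1, |acc, _| acc + 1)
}

fn main() {
    let iters: u32 = 1_000;
    let start = Instant::now();
    let mut sink = 0u64;
    for _ in 0..iters {
        // Accumulate into `sink` so the optimizer can't drop the work.
        sink = sink.wrapping_add(workload());
    }
    let elapsed = start.elapsed();
    println!("{} iterations in {:?} ({:?}/iter), sink = {}",
             iters, elapsed, elapsed / iters, sink);
}
```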

edit: okay, I need to sleep a bit better, I'm coming off bitter again :P

------
Aloisius
I'm not sure I understand concurrency for concurrency's sake. It seems a non-
concurrent version of this would run at least two orders of magnitude faster
than either implementation.

Maybe I'm just set in my old ways, but I tend to equate adding concurrency
with adding complexity (mostly to debugging and code legibility) and really
only do it when I'm CPU-bound. Even then, it has to be a pretty darn good
reason.

I understand there are some benefits of CSP programming, but simple message
passing where every single thread wakes up during the call? Half the time
you're looking at network or memory bandwidth issues which a single thread
could easily saturate.

Now maybe if each of these threads were, say, an intelligent agent hitting the
CPU hard, but this? This is two microbenchmarks, each run for an exceptionally
short period of time, merged together: one is the creation of threads and the
other is message passing between them.

------
kibwen
See the thread on the Rust subreddit for a more thorough discussion of the
caveats in the Rust implementation:

[http://www.reddit.com/r/rust/comments/1vnrp8/chinese_whisper...](http://www.reddit.com/r/rust/comments/1vnrp8/chinese_whispers_in_rust_and_go/)

The most salient quote from Daniel Micay (strncat):

 _" It doesn't demonstate a pattern you would use in Rust. Rust tasks aren't a
substitute for generators/iterators and they're not around as a control flow
feature."_

~~~
kinofcain
Sadly it seems the "solution" there is "don't do CSP in Rust".

~~~
pcwalton
The solution is "don't spawn a huge number of tasks in a tight loop". Rust
tasks are optimized for maximum performance once you spin them up, at the cost
of some overhead when you spawn them. That was a deliberate choice, because
not having segmented stacks makes many much more important things far faster,
such as calling into C and performing tight sequential computations without
thrashing on stack boundaries or using a garbage collector to move stacks
around and rewrite all the pointers. Rust is about making costs explicit: when
every function call could turn into a huge malloc(), you lose a lot of
predictability in performance (and, from what I gather, Go found this out
too).
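The "spin up once, reuse" pattern being described could be sketched like this (a hypothetical example; the `sum_via_worker` helper is invented): one long-lived worker amortizes its spawn cost over many messages, instead of paying a fresh stack allocation per task in a tight loop.

```rust
use std::sync::mpsc;
use std::thread;

// One long-lived worker thread, spawned exactly once, processes a whole
// stream of messages -- instead of spawning one task per message.
fn sum_via_worker(n: u64) -> u64 {
    let (tx, rx) = mpsc::channel::<u64>();
    let handle = thread::spawn(move || {
        // The worker drains the channel until every sender is dropped.
        rx.iter().sum::<u64>()
    });
    for i in 1..=n {
        tx.send(i).unwrap();
    }
    drop(tx); // close the channel so the worker's iterator terminates
    handle.join().unwrap()
}

fn main() {
    println!("{}", sum_via_worker(100)); // 1 + 2 + ... + 100 = 5050
}
```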

Calling into C is a big one: we concluded it was basically impossible to make
Rust calling into C as cheap as C calling into C in a segmented stack
implementation. In practical usage, we found that the ability to call into C
cheaply was much more important than the benefits of small stacks, which
mostly help microbenchmarks like this at the expense of real-world code.

~~~
samth
> In practical usage, we found that the ability to call into C cheaply was
> much more important than the benefits of small stacks, which mostly help
> microbenchmarks like this at the expense of real-world code.

For Rust's goals, I can definitely believe that the first is true. I don't
think the second claim is true at all, though. Lightweight threads have been
extremely successful in Haskell, Erlang, etc, and not just on microbenchmarks.

~~~
pcwalton
> I don't think the second claim is true at all, though. Lightweight threads
> have been extremely successful in Haskell, Erlang, etc, and not just on
> microbenchmarks.

They're mostly successful in languages that GC/heap-allocate all stack frames
(paying the costs that come with it). Rust uses the machine stack so the
benefits come with some very significant drawbacks, most notably
unpredictability of performance that comes from stack thrashing. Large mallocs
are slower than small mallocs in general, but on the typical sizes you need
for machine stack segments you probably aren't going to hit the malloc
implementation's free list anyway, so you might as well just allocate up
front. If you GC stack frames, though, you just bump allocate in the nursery,
at the cost of greatly increased GC pressure (but it makes task spawning much
cheaper).

There's some interesting discussion on this near the end of Simon Marlow's
paper "Extending the Haskell Foreign Function Interface with Concurrency":
[http://community.haskell.org/~simonmar/papers/conc-ffi.pdf](http://community.haskell.org/~simonmar/papers/conc-ffi.pdf)

~~~
samth
I think Simon & Simon are making exactly the point I am: lightweight threads
are significantly faster than any other threading system, and work in real
systems, not just microbenchmarks. Rust has made some other design choices
that, it seems, make it harder to have some nice things that most modern
functional language runtimes have -- I'm sure there are good reasons for all
of those choices. But I don't think your claim that the tradeoffs "mostly help
microbenchmarks" is right.

~~~
pcwalton
I think I should have been more precise: I meant that _in Rust_ small stacks
mostly help microbenchmarks. Obviously if you're willing to pay the costs of
heap allocating stack frames one by one (or have a relocating stack
implementation and are willing to pay the costs associated with that) then
small stacks can help a lot.

------
msie
I must really be sheltered because that is only the third time I've ever heard
the term "Chinese Whispers" used for a game that I remembered as "Broken
telephone".

~~~
mistercow
I'm guessing that you're American, and that the author is not. My
understanding is that in most of the English-speaking world, the game is
called "Chinese whispers" (although that name is falling out of favor because
of its racist connotations), while in America, it is usually called
"telephone" or some variation on that.

~~~
msie
I'm Canadian so same thing, basically. The term is used by Rob Pike in the
referenced presentation, and he's Canadian, but he's somewhat older than I am.
I first noticed the term used by an Australian coworker about a year ago, and
I also heard it on a Stephen Fry talk show - so yes to your theory.

------
kinofcain
There was talk recently that Rust had made the choice between M:N tasks and
real threads configurable. Is it still using M:N tasks by default? It would be
interesting to see a comparison between the two for something like this (which
should, of course, heavily favor the lightweight task model).

~~~
pcwalton
I don't expect much, if any, performance difference. Both M:N and 1:1
threading use non-segmented stacks in Rust, and that is the source of the
performance issues.

------
levosmetalo
After compiling this with the latest Rust built from git, I get only "Illegal
instruction" as an error message when I try to run this example.

~~~
strcat
You're running out of memory and hitting the `llvm.trap` intrinsic call, which
raises an illegal instruction to trigger a core dump. It doesn't currently
print an error message.

------
hacknat
Rename your post; it's going to distract attention from whatever it is you are
trying to communicate. I'd just as soon not read it until you call it
"telephone" or something more PC.

 _edit_

Also, you'll get flame detected soon enough once this turns into a debate
about racism.

