
Rust concurrency: the archetype of a message-passing bug - polyglotfacto
https://medium.com/@polyglot_factotum/rust-concurrency-the-archetype-of-a-message-passing-bug-817b60efd8f8
======
Animats
_It should be noted that combining shared-state with event-loops is usually a
recipe for subtle disaster._

Go has that problem, too. The claim for Go was that concurrency would be done
via message passing. But, in the examples in the original manual, and often in
practice, what's being passed on the channel is some kind of reference to
shared data. In a garbage collected language, this is OK as long as the sender
only sends new objects and never touches them after sending. Still, it's easy
to screw up, and the language does not help.

In both Go and Python, there's not much thread-local data that the language
will not let you export to another thread. That's a lack. Shared state that
isn't obviously shared state is a recipe for trouble caused by later
maintenance programming. This is one of Rust's strengths. Sharing has to be
explicit.

Shutdown is always hard in an event queue environment. Go makes it harder by
making a write to a closed channel kill the whole program. So if a receiver
decides it needs to shut down, it can't just close the receive end of the
channel and let the sender hit a send error. Negotiations are required to get
the sender shut down.
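
For contrast, here is a minimal Rust sketch of the behavior described above as missing in Go. It assumes only the standard library's `std::sync::mpsc`; the helper name `send_after_receiver_drop` is made up for illustration. Once the receiver has been dropped, `send` returns an error carrying the unsent value instead of bringing the program down:

```rust
use std::sync::mpsc;

/// Sends `value` on a channel whose receiver has already been dropped.
/// Returns Err(value) if the send failed (hypothetical helper for illustration).
fn send_after_receiver_drop(value: i32) -> Result<(), i32> {
    let (tx, rx) = mpsc::channel::<i32>();
    // The receiver "shuts down" by dropping its end of the channel.
    drop(rx);
    // No panic here: the send fails and hands the unsent value back.
    tx.send(value).map_err(|mpsc::SendError(v)| v)
}
```

Calling `send_after_receiver_drop(7)` yields `Err(7)`, so the sender can treat a vanished receiver as an ordinary error path.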

~~~
linuxftw
> Go makes it harder by making a write to a closed channel kill the whole
> program. So if a receiver decides it needs to shut down, it can't just close
> the receive end of the channel and let the sender hit a send error.
> Negotiations are required to get the sender shut down.

And thus Borg was born. Too hard to write processes that handle situations
gracefully, so let's just let the process die and spin up a new one
immediately.

Literally was looking at a patch set that was proposed recently that involved
several channels in Go and one of the codepaths was os.Exit(1) for lack of any
meaningful mechanism to sanely clean everything up. Sadly, this was in library
code, so would be totally unexpected by the caller and probably not desirable
whatsoever.

~~~
Animats
_And thus Borg was born._

Yes. Basic Dijkstra producer-consumer one-way queues are simple and elegant.
But the error cases are hell.

OK, so if the consumer wants to make the producer stop, it sets an abort flag.
That's a shared variable. (Does it need a lock? Does it need a lock or fencing
on ARM, which has weaker memory-ordering guarantees than x86?)

So, producer sees the abort flag set and closes its end of the channel, right?
Maybe not. Maybe the channel is full, and the producer is blocked on a send.
OK, so, when the consumer aborts, it has to enter a drain loop, doing receives
and discarding the result until it gets an EOF/channel close. That will
unblock the producer, and then the consumer can exit.

But what if the producer doesn't have anything to send right now? Now the
consumer is stuck waiting for the producer to send more data, and the consumer
can't exit.

Should the consumer close the producer's end of the channel after setting the
abort flag? Now there's a race condition between checking the abort flag and
sending. Add a critical section that covers both actions? Now the producer can
be blocked while in a critical section. If the consumer tries to set the abort
flag while the producer is blocked inside that critical section, both ends are
stuck and there's a deadlock.

But it all looked so simple in theory!
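
As a point of comparison, here is a rough Rust sketch of the same scenario: a bounded channel, a producer that may be blocked on a full buffer, and a consumer that decides to quit. It assumes `std::sync::mpsc::sync_channel`; the function name and the capacity of 1 are arbitrary choices for illustration. There is no abort flag and no drain loop: the consumer just drops its end, and the producer's `send` returns an error.

```rust
use std::sync::mpsc;
use std::thread;

/// The producer sends 0, 1, 2, ... until a send fails; the consumer reads
/// `take` items and then drops its end of the channel.
/// Returns (items the consumer received, number of sends that succeeded).
fn run_until_consumer_quits(take: usize) -> (Vec<u64>, u64) {
    // Capacity 1, so the producer is frequently blocked on a full channel.
    let (tx, rx) = mpsc::sync_channel::<u64>(1);

    let producer = thread::spawn(move || {
        let mut sent = 0u64;
        // `send` blocks while the buffer is full and returns Err once the
        // receiver is gone, including when it was already blocked.
        while tx.send(sent).is_ok() {
            sent += 1;
        }
        sent
    });

    let mut received = Vec::new();
    for _ in 0..take {
        received.push(rx.recv().expect("producer is still running"));
    }
    // Consumer aborts: drop the receiving end, nothing else.
    drop(rx);

    let sent = producer.join().expect("producer thread panicked");
    (received, sent)
}
```

This only covers a producer that is sending or about to send. A producer with nothing to send right now learns about the shutdown at its next `send`, not before.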

~~~
codr7
If you insist on making it as difficult as possible, I guess.

Otherwise the producer decides when the show stops, by simply closing the
channel.
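
A minimal sketch of that producer-decides shape, written in Rust with `std::sync::mpsc` (the function name is made up for illustration). The producer closes the channel by dropping its sender, and the consumer's loop ends on its own:

```rust
use std::sync::mpsc;
use std::thread;

/// The producer sends 0..n and then closes the channel by dropping `tx`;
/// the consumer collects everything until the channel is closed.
fn collect_until_closed(n: u32) -> Vec<u32> {
    let (tx, rx) = mpsc::channel::<u32>();

    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(i).expect("consumer is still listening");
        }
        // `tx` goes out of scope here, which closes the channel.
    });

    // Iteration ends once every sender is gone and the buffer is drained.
    let items: Vec<u32> = rx.iter().collect();
    producer.join().expect("producer thread panicked");
    items
}
```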

~~~
earthboundkid
Yes, exactly. The workers can send results to the manager, but the manager
does all the management. Anything else gets too hairy, too fast to reason
about. I wrote a blog post about it:
[https://blog.carlmjohnson.net/post/share-memory-by-communica...](https://blog.carlmjohnson.net/post/share-memory-by-communicating/)

~~~
Animats
Does that scale to a pipeline, a chain of producers and consumers?

What happened with Go, I suspect, is that the syntax came first. Go uses a
"<-" operator to talk to channels. There's no place for an error status with
that operator. Hence the problem.

UNIX-type pipelines have SIGPIPE to unwind things when something downstream
aborts. Abort has to propagate back up the chain. I tend to argue that the
workarounds for not having exceptions are worse than the problems of
exceptions.
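
To make the "abort has to propagate back up the chain" point concrete, here is a small Rust sketch of a three-stage pipeline in which the last stage quits early. It assumes `std::sync::mpsc::sync_channel`; the stage names, the buffer size of 2 and the doubling step are invented for illustration. Each stage stops when its `send` fails, and stopping drops its own receiver, which in turn fails the stage before it:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// source -> doubler -> sink. The sink takes `limit` items and quits.
/// Returns what the sink saw; returning at all means both upstream
/// threads noticed the abort and exited.
fn pipeline_with_early_exit(limit: usize) -> Vec<u64> {
    let (tx1, rx1) = sync_channel::<u64>(2);
    let (tx2, rx2) = sync_channel::<u64>(2);

    // Stage 1: endless source; stops when its send fails.
    let source = thread::spawn(move || {
        let mut n = 0u64;
        while tx1.send(n).is_ok() {
            n += 1;
        }
    });

    // Stage 2: doubles each value; stops when downstream is gone.
    // Leaving the loop drops `rx1`, which propagates the abort upstream.
    let doubler = thread::spawn(move || {
        for v in rx1 {
            if tx2.send(v * 2).is_err() {
                break;
            }
        }
    });

    // Stage 3: the sink takes `limit` items, then drops its receiver.
    let seen: Vec<u64> = rx2.iter().take(limit).collect();
    drop(rx2);

    source.join().expect("source thread panicked");
    doubler.join().expect("doubler thread panicked");
    seen
}
```

Here the error status lives in the return value of `send`, which is the slot the `<-` operator does not have.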

------
ivanbakel
This is a cool technical exploration of some concurrency ideas, but event
loops combined with Rust's borrow checker actually make for a re-invention of
the actor model from first principles.

Languages like Pony, and the actor model more generally, were designed to
prevent a whole class of concurrency bugs (same as Rust) - what's interesting
is that, like this post shows, they also have unexpected problems with message
causality.

It makes me wonder if there's an even better abstraction out there which will
solve _this_ kind of concurrency bug, and what will arise after these are
solved.

~~~
oconnor663
I think the fundamentally tricky part is that sometimes you _want_ your
concurrent code to have this sort of nondeterministic behavior. I think the
Rayon FAQ does a good job of summarizing the problem:
[https://github.com/rayon-rs/rayon/blob/master/FAQ.md#but-wai...](https://github.com/rayon-rs/rayon/blob/master/FAQ.md#but-wait-isnt-rust-supposed-to-free-me-from-this-kind-of-thinking)

> Consider for example when you are conducting a search in parallel, say to
> find the shortest route. To avoid fruitless search, you might want to keep a
> cell with the shortest route you've found thus far. This way, when you are
> searching down some path that's already longer than this shortest route, you
> can just stop and avoid wasted effort...Now in this case, we really WANT to
> see results from other threads interjected into our execution!
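
A rough sketch of the shared "shortest so far" cell from that quote, using only the standard library (`AtomicUsize` and scoped threads) rather than Rayon itself. The function name and the idea of handing each worker a list of precomputed route lengths are simplifications made up for illustration:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Each worker scans its own list of candidate route lengths and publishes
/// improvements to a bound shared by all workers.
fn shortest_route(candidates_per_worker: &[Vec<usize>]) -> usize {
    let best = AtomicUsize::new(usize::MAX);

    thread::scope(|s| {
        for candidates in candidates_per_worker {
            let best = &best;
            s.spawn(move || {
                for &len in candidates {
                    // Prune: another thread may already have found
                    // something at least as short.
                    if len >= best.load(Ordering::Relaxed) {
                        continue;
                    }
                    // Atomically keep the smaller of the two values.
                    best.fetch_min(len, Ordering::Relaxed);
                }
            });
        }
    });

    // All workers have been joined when the scope ends.
    best.into_inner()
}
```

Which candidates get pruned depends on thread timing, but the final result does not: the nondeterminism is wanted for speed while the answer stays deterministic.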

~~~
polyglotfacto
Yes, I would say the tricky part is accurately modelling what you want. Also,
instead of actual nondeterministic behavior, I think that you rather want
different things to be deterministic.

The article is an example of "task parallelism", where you have different
threads doing different things on different (local) data. Having those threads
communicate via message-passing is a nice way to model workflows crossing
thread boundaries, and in that setup you usually want some sort of
deterministic behavior related to the ordering of operations (for a given
workflow).

On the other hand, I think the Rayon example is a good case of "data
parallelism", where you have workers doing "the same thing" on "the same
(shared-)data".

In such a case, perhaps locks and shared-data are a better fit than message-
passing, and indeed what you're after is not necessarily some sort of
sequential ordering of worker operations. However you still probably want the
ability to make other deterministic assertions about the behavior of the
workers.

I wrote another article highlighting some of these "different determinisms" at
[https://medium.com/@polyglot_factotum/rust-concurrency-five-...](https://medium.com/@polyglot_factotum/rust-concurrency-five-easy-pieces-871f1c62906a?source=friends_link&sk=fa043036d86e078fd7a50c7f109c1163)

------
jojobas
Rust docs never claimed to prevent race conditions (which he doesn't call by
name).

[https://doc.rust-lang.org/nomicon/races.html](https://doc.rust-lang.org/nomicon/races.html)

~~~
pjmlp
True, but that is something that many miss a lot when discussing Rust virtues.

Yes, the type system is a great help when dealing with thread-based
concurrency/parallelism, but not so much for the scenarios where we are
already using type safe managed languages in distributed computing across
multiple processes, most likely not even written in the same language.

Which, by the way, given the recent security exploits, is a model much
preferable to go back to, instead of relying on threads for parallelism.

~~~
fluffything
> True, but that is something that many miss a lot when discussing Rust
> virtues.

The large majority of comments I read about Rust safety claims are very
explicit that Rust provides memory safety and data-race freedom.

You seem to be arguing that these comments are leaving out important
information by not clarifying that "This does not mean that Rust prevents all
bugs, e.g., you can still write programs with logic bugs, race conditions,
deadlocks, DDoS, ... and the list goes on and on... in Rust".

If somebody does not know what "memory safety" and "data-race freedom"
mean, they should just ask, but if they don't, and they make an incorrect
assumption, then it's kind of their fault, not the fault of whoever wrote such
a comment for not mentioning all the things that these two properties do not
guarantee.

Also, it is actually trivial to build Rust programs that prevent some race
conditions, like deadlocks (just a cargo flag away), so if anything, I'd
complain that people being "too correct" by restricting themselves to
"data-race freedom" are leaving out some important advantages of Rust.

~~~
ashtonkem
My experience is that Rust advocates often go well out of their way to specify
the limits of Rust’s protection, and will even rush in to correct someone who
overpromises what Rust can give you.

~~~
htfy96
r/rust and users.rust-lang.org are informative and sensible when it comes to
limitations. Twitter and, well, random programming discords are often full of
cults.

------
mratsim
And that's where model checking / formal verification is a tremendous help (or
would be if it was easier to use).

There are many properties of multithreaded/distributed systems that are
emergent and that current static compilers don't address.

In distributed systems it's becoming more common to use TLA+[1] to model the
system and ensure that we don't have bugs at the design level (rather than
just the system level).

In particular for message-passing, I'm quite curious about the state of the
research on Communicating Finite State Machines[2] and Petri Nets[3] and if
there are actual products that could be integrated in regular languages and
are not of academic interest only.

I expect that Ada/SPARK[4] has some support for this pattern, though Ada's
multithreading story is not there yet (?).

I'm quite excited about the future Z3 prover integration in Nim[5] and the
potential to write concurrent programs free of design bugs[6] to address both
ends of the concurrency bugs:

- bugs in the low-level concurrent data structure, via formal verification
(which is something I used in the past to prove that my design was bug-free
and that the deadlock I experienced was a condition variable bug in Glibc[7],
so I had to use futexes directly instead).

- bugs in the high-level design, implemented as an event loop / finite-state
machine reacting to events as received by channels

Are there people working on model checkers or formal verification in Rust?
Like the author, I think addressing memory bugs via the borrow checker is only a
part of the "fearless concurrency" story and preventing design bugs would be
extremely valuable.

[1]:
[https://lamport.azurewebsites.net/tla/tla.html](https://lamport.azurewebsites.net/tla/tla.html)

[2]: [https://en.wikipedia.org/wiki/Communicating_finite-state_mac...](https://en.wikipedia.org/wiki/Communicating_finite-state_machine)

[3]:
[https://en.wikipedia.org/wiki/Petri_net](https://en.wikipedia.org/wiki/Petri_net)

[4]: [https://www.adacore.com/sparkpro](https://www.adacore.com/sparkpro)

[5]: [https://nim-lang.org/docs/drnim.html](https://nim-lang.org/docs/drnim.html)

[6]: [https://github.com/nim-lang/RFCs/issues/222](https://github.com/nim-lang/RFCs/issues/222)

[7]:
[https://github.com/mratsim/weave/issues/56](https://github.com/mratsim/weave/issues/56)

And shameless plug, my talk at NimConf on debugging multithreaded
programs (data structure and design):
[https://www.youtube.com/watch?v=NOAI2wH9Cf0&list=PLxLdEZg8DR...](https://www.youtube.com/watch?v=NOAI2wH9Cf0&list=PLxLdEZg8DRwTIEzUpfaIcBqhsj09mLWHx&index=11)

Slides:
[https://docs.google.com/presentation/d/e/2PACX-1vRq2wvd4q_mo...](https://docs.google.com/presentation/d/e/2PACX-1vRq2wvd4q_mohbu2a_bYnhJwO487XY1s3cTYs52_p5_PmF0Uc3AcNnBg45Y9AtJ0WMnanip9UewRZd3/pub?start=false&loop=false&delayms=3000)

~~~
inaseer
We've used Coyote
([https://microsoft.github.io/coyote/](https://microsoft.github.io/coyote/))
which offers an actor-based programming model, as well as an async/await
programming model on top of .NET's Tasks, to express the concurrency in our
services. Coyote includes a powerful systematic testing and exploration engine
which explores the various states your programs can be in and helps you find
violations of safety and liveness properties. This gives us the same benefits
you get when using a model checker like TLA+ but on our actual production code
as opposed to a model of the code. If your services are written in .NET, I
seriously recommend taking a deeper look here.

~~~
mratsim
Wow, that looks like a killer feature for .Net

I will definitely steal some ideas from Coyote to drive Nim's formal
verification story for concurrent code forward.

On the Rust side, I feel like Loom has a lot of potential
[https://github.com/tokio-rs/loom](https://github.com/tokio-rs/loom) and I'm
keeping tabs on them.

~~~
inaseer
Loom looks quite interesting and is similar to Coyote in a lot of ways. The
exploration engine used by Coyote is based on years of investigation by
Microsoft Research so it scales nicely on real world programs. I believe the
Coyote team is looking into exposing the systematic exploration capabilities
in a language and platform agnostic way so developers can create language-
specific bindings in their language of choice and benefit from all the
research done over decades. No timelines that I'm aware of but it is on the
road map I believe.

~~~
mratsim
Yes, I'm in the middle of reading many Microsoft papers[1] to add model
checking to Nim, in particular CHESS.[2]

[1]:
[https://github.com/mratsim/weave/issues/18](https://github.com/mratsim/weave/issues/18)

[2]: Finding and Reproducing Heisenbugs in Concurrent Programs,
[http://www.cs.ucr.edu/~neamtiu/pubs/osdi08musuvathi.pdf](http://www.cs.ucr.edu/~neamtiu/pubs/osdi08musuvathi.pdf)

------
devit
Seems like the solution is to wait for the requests to complete, at least up
to the point where a "slot" is reserved in the sequence order of a single
actor.

------
Vanit
Heh, reading this is reminiscent of the shortcomings of TypeScript when
external data is mistyped.

P.S. I love TypeScript.

------
crazypython
A good theoretical understanding will never replace a compiler that yells at
you for making a mistake in that theoretical understanding. Learn the theory
instead.

