
Avoiding race conditions in Swift - ingve
https://www.swiftbysundell.com/posts/avoiding-race-conditions-in-swift
======
raphlinus
This might be contentious, but I'm going to say it. If your program is
multithreaded in a nontrivial way, Swift is not the language for you, at least
yet.

First, this article understates the danger significantly. It's not just that
your token loader might be called twice: if you call this from two different
threads, you're fully in undefined behavior. Swift is "safe-ish" in that it's
mostly safe in single-threaded contexts, but with data races that breaks down.
It's even more confusing because some things are properly atomic (reference
counts, manipulation of value types like arrays), but some aren't (access to
fields in classes).

Second, the language is missing basic features that in modern programming
would help tame this beast - the most important of which is a wrapper for
internally mutable state that is protected by a mutex. Grand Central Dispatch
can help with some of this.
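
The kind of wrapper I mean can be sketched on top of GCD today (the names here
are made up for illustration; nothing like this ships in the standard library):

```swift
import Dispatch

// A minimal sketch of internally mutable state guarded by a queue.
// "Protected" and "withValue" are hypothetical names.
final class Protected<Value> {
    private let queue = DispatchQueue(label: "protected.value")
    private var value: Value

    init(_ value: Value) {
        self.value = value
    }

    // All reads and writes go through the serial queue, so callers
    // cannot touch the underlying value without synchronization.
    func withValue<T>(_ body: (inout Value) -> T) -> T {
        return queue.sync { body(&value) }
    }
}

// Usage: a counter that is safe to increment from multiple threads.
let counter = Protected(0)
DispatchQueue.concurrentPerform(iterations: 100) { _ in
    counter.withValue { $0 += 1 }
}
```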

Third, there are unexpected performance losses. For example, the closure for
running a block on a sync "queue" is heap allocated (because the type of such
a closure is the same for sync and async, and of course the latter has to be
heap allocated). This will probably be fixed, but I think is symptomatic.

Fourth, there's no official concurrency model. Again, it's confusing because
there are some things that are obviously intended to run with concurrency. In
fairness, Rust doesn't have a formal concurrency model yet either, but it is
straightforward to apply reasoning from C++, and also reasonable to expect
guarantees if the primitives such as Mutex, channels, etc., are used properly.

Fifth (and related), there are no atomics in the language or the standard
library. Apparently OSAtomic is deprecated, and they now recommend the use of
C atomics, but then you're in the territory of having to reason about the
polyglot combination of Swift and C.

Likely all this will improve, but in the meantime it's important to realize
how immature Swift's concurrency is.

~~~
Matthias247
I agree that there are 1000 ways to create concurrency issues in Swift, as
shown in the article. However, I fail to see how it's worse than most other
multithreaded languages. C++, JVM languages, Go, Python: none of them provides
much support for writing correct thread-safe programs.

Swift/Objective-C might even have an edge compared to some of them, due to
providing dispatch queues in the standard library - which are fairly easy to
reason about, as shown in the article. I see those as better than the average
user creating lots of threads on their own and then trying to synchronize
things via mutexes. There might be a performance overhead, but I would only look at
that after the program is correct.

There might be languages that provide more help here, e.g. Erlang, Haskell or
Pony. But those are not mainstream, and might also not be good choices for
things where Swift is an alternative (client-side applications). Ada might have been
better too, with builtin Task and communication constructs, but that falls
into a similar category.

Then we have Rust, which might be the thing you are referring to as a better
alternative. It is in the sense that it can prevent most threading issues at
compile time - I will refrain from saying "all" since people will write unsafe
code, and there can still be deadlocks and logical race conditions. However,
even though Rust has lots of tools for preventing concurrency issues, it still
doesn't have built-in ways to make concurrent programs easy - e.g. for solving
the problem that is described in the article. If one were to implement the
given solution in Rust, there might be lots of fighting with the borrow
checker involved, and potentially the same amount of heap allocations
(callbacks aren't a paradigm that works well in Rust). Obviously there are
more idiomatic solutions.

~~~
WoodenChair
> However I fail to see how it's worse than most other multithreaded
> languages. C++, JVM languages, Go, Python - they all don't provide a lot of
> support for writing correct thread-safe programs. Swift/Objective-C might
> even have an edge compared to some of them, due to providing dispatch queues
> in the standard library - which are fairly easy to reason about, as shown in
> the article.

A couple corrections:

\- Dispatch Queues are not a part of the Swift standard library. They are part
of Grand Central Dispatch which is almost always present where Swift is
present but is actually a separate project.

\- JVM languages, well specifically Java at least, do have synchronization
primitives built-in. It's part of the Java language spec. You can mark any
method as synchronized for instance and it will automatically be mostly
thread-safe.

~~~
Matthias247
Thanks for adding it. I was actually aware of both things.

While Java has some builtin syntax (synchronized) for mutexes, as well as the
notify/wait methods on Objects, I don't see them as a big step forward in
avoiding concurrency issues. Programmers still need to be aware that they need
to use those constructs, which is in most cases already the number one issue.
The next step is using them correctly. The JVM ecosystem has a huge number of
libraries and APIs that aim to make concurrency easier (java.util.concurrent,
RxJava, Quasar, etc). But all those must first be found, and then used
correctly. Rust on the other hand will remind users that something is missing.
And Go at least puts the somewhat higher-level channels front and center,
instead of mutexes.

~~~
kjeetgill
While it's true that much of the hard work of avoiding data races is still in
the users' hands, it's important to remember that Java having a memory model
and a specification for locks/volatiles is a VAST improvement over the same
situation in most other languages. It expressly defines which operations do
and do not imply a causal relationship, guarantees things like no word tearing
on volatile longs, and guarantees that variables can't "logically" be
optimized out of existence in the presence of out-of-thread writes.

It's actually pretty great in practice.

------
eridius
This code has a significant and common flaw: If the token is cached, it
invokes the handler synchronously.

You should never design a function with a completion handler that sometimes
invokes it synchronously and sometimes asynchronously. That can cause all
sorts of problems, the simplest of which is a caller that expects it to be
asynchronous and ends up performing work out-of-order because the handler was
invoked synchronously.

The final form of the code in this article accidentally fixes this, by moving
the work it does onto a background queue, but this just introduces a separate
flaw: it performs the callback from its internal private queue. This isn't
necessarily a problem, except in that it will block any other work from
running on that queue unnecessarily (as it's a serial queue), but in more
complicated code, having external code running on your private queue can
indeed cause problems.
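
One way to fix that flaw, sketched with made-up names: have the service accept
a completion queue (defaulting to main) and always bounce the handler onto it,
so external code never runs on the internal queue:

```swift
import Dispatch

// Hypothetical sketch: work happens on a private serial queue, but the
// completion handler is always dispatched onto a queue the caller chooses,
// so callers never execute on the service's internal queue.
final class TokenService {
    private let internalQueue = DispatchQueue(label: "token.service")
    private var cachedToken: String?

    func retrieveToken(completionQueue: DispatchQueue = .main,
                       completion: @escaping (String) -> Void) {
        internalQueue.async {
            // Placeholder for the real cache-or-load logic.
            let token = self.cachedToken ?? "fresh-token"
            self.cachedToken = token
            // Hop off the private queue before invoking external code.
            completionQueue.async { completion(token) }
        }
    }
}
```

This also makes the callback unconditionally asynchronous, which addresses the
sync-vs-async inconsistency at the same time.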

\---

There's also another issue, which is that if the AccessTokenService gets
deinited before the loader calls its callback, the callback is dropped on the
floor. In general, you should not design a function that takes a completion
callback with conditions where the completion callback is never fired. There
are 3 reasonable behaviors that could be implemented here:

1\. Have the callback keep the AccessTokenService alive. It's pretty common
for networking operations to keep themselves alive while executing, so this is
probably the approach I'd take.

2\. Have the AccessTokenService fire all of its pending completion handlers
with an error in deinit. The error should ideally represent the notion that
the operation was cancelled.

3\. Change the AccessTokenLoader completion handler to look more like the
initial version, where it still calls the callback even if self is nil. With
the pending completion handlers approach, this requires moving the pending
completion handlers into a reference-type wrapper and strongly referencing
that from the AccessTokenLoader callback. In this case, that's pretty much the
same as approach #1, so you should just go ahead and use approach #1, but in a
more complex class this approach might have some utility.
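
A rough sketch of approach #1, using stand-ins for the article's types
(AccessTokenLoader here is a protocol I made up so the example is
self-contained; it assumes all calls arrive on one queue, as in the article):

```swift
// Stand-in for the article's loader type.
protocol AccessTokenLoader {
    func load(then handler: @escaping (Result<String, Error>) -> Void)
}

final class AccessTokenService {
    typealias Handler = (Result<String, Error>) -> Void

    private let loader: AccessTokenLoader
    private var pendingHandlers = [Handler]()

    init(loader: AccessTokenLoader) { self.loader = loader }

    func retrieveToken(then handler: @escaping Handler) {
        pendingHandlers.append(handler)
        guard pendingHandlers.count == 1 else { return }

        // Note the strong `self`: the closure retains the service, so it
        // cannot be deinited while a load is in flight, and every pending
        // handler is guaranteed to fire.
        loader.load { result in
            let handlers = self.pendingHandlers
            self.pendingHandlers = []
            handlers.forEach { $0(result) }
        }
    }
}
```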

~~~
anothergoogler
You have it backwards, it's the caller's bug if they assume order in an async
process.

~~~
SamReidHughes
No no no no no. He's right, and this is a hard-learned lesson that many
practitioners have discovered independently.

Both the caller and callee face huge problems of reentrant function calls when
you mix up the behavior.

Generally the right choice is to make every callback invoked asynchronously,
unless you _need_ to do something else.

~~~
alexashka
Can you explain what problems someone faces if a callback is executed
immediately?

~~~
SamReidHughes
In these examples, let's suppose function (or object) A passes the callback
function C to the function (or object) B. Function B is what calls the
callback C, or indirectly causes it to be called.

Scenario 1: Object B has a function body which looks like:

1\. Look at or modify B's state.

2\. Call callback C.

3\. With code that assumes the state hasn't changed, do more stuff to B.

If C calls back into B (for example, if we'd like to perform the operation a
second time, or if we'd like to perform a different operation), the assumption
on line 3 might be incorrect. Or, some other assumption in the reentrant call
might be incorrect.

(One common outcome is that the developers say, "Oh, there's a race
condition," and they add a mutex. Later, they say, "Oh, there's a deadlock,"
and make it a recursive mutex. This is why recursive mutexes are a major code
smell.)
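
Scenario 1 fits in a few lines of Swift (a toy example with invented names):

```swift
// Toy sketch of scenario 1: B mutates state, synchronously calls the
// callback, then continues as if the state hadn't changed.
final class B {
    private var items = [Int]()
    private(set) var assumptionBroken = false

    func process(callback: () -> Void) {
        items.append(1)              // 1. modify B's state
        let countBefore = items.count
        callback()                   // 2. the callback may re-enter process
        // 3. code that assumes the state hasn't changed since step 1:
        if items.count != countBefore {
            assumptionBroken = true  // the reentrant call invalidated it
        }
    }
}

let b = B()
var reentered = false
b.process {
    if !reentered {
        reentered = true
        b.process(callback: {})      // re-enter B through the callback
    }
}
// b.assumptionBroken is now true: step 3's assumption failed.
```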

Scenario 2: The developer realizes a safe callback has to be called _after_
all other side effects are complete. So the developer writes this for their
function body:

1\. Look at or modify B's state.

2\. Decide you're going to call C later.

3\. Do more stuff to B.

4\. Call callback C, literally the last thing you do in the function.

(The developer is an experienced C++ developer and knows to make sure object
destructors for variables going out of scope happen before step 4.)

The potential problem here is on A's and C's side. What if they write:

A-1. Acquire mutex M, and look at or modify A's state.

A-2. Call into B, with callback C (which affects the state of A).

A-3. With code that assumes the state hasn't changed, do more stuff to A.

A-4. Release mutex M.

In callback C (in some helper function it invokes):

C-1. Acquire mutex M.

C-2. Do stuff assuming we have exclusive access to the state of A.

C-3. Release mutex M.

If M is non-recursive, you have a deadlock. If M is a recursive mutex (or
nonexistent), the assumption on line A-3 is incorrect. Or maybe the assumption
on line C-2 is incorrect.

Scenario 3: The developer is now super-paranoid about side effects of their
callback. So they write all their callbacks like this:

    
    
    E-1: callAsynchronously(function() {
        C-1: Acquire mutex M.
        C-2, C-3: as before.
    })
    

Well now you have no performance advantage relative to just having object B
call asynchronously.

Oh, and one more.

Scenario 4: The developer's a freakin' genius. Writes out dead-perfect
hypersecure code. Can run a 5-minute mile, and smells nice afterwards. Then
enough synchronous callbacks happen recursively to produce a stack overflow.

\----

The thing about these scenarios is that if you call the callback function
synchronously _every time_, you're more likely to catch the bug. It'll show
up in tests, or you'll just get a deadlock, or some debug-mode "DebugMutex"
assertion failure.

Also, if F always invokes asynchronously, and G always invokes synchronously,
you can't call F from G. And you have to add a wrapper in your callback when
calling G from F.

If, for a particular function, you have inconsistent behavior, you'll have a
whole bunch of undertested edge cases. And they might involve any function up
the callstack -- if F1 calls F2 calls F3 calls F4, and once in a while F4
invokes synchronously, that could cause F3, F2, and F1 to have that same
property (if they don't do further async operations in the callbacks they pass
up).

~~~
alexashka
Thanks for the detailed reply.

Does scheduling work via queues (channels, they call them now, right? :))
alleviate most of these issues, while sacrificing some efficiency?

The scenarios you've described are brittle by design - I imagine better
designs are out there?

~~~
SamReidHughes
I don't know what exactly a queue or channel is as you mean it, so it could
depend, but the short answer is if you can throw in real mutexes and won't
risk deadlock then that's a big mitigation.

A good way to improve all the scenarios I've mentioned is to return a future
instead of calling a callback. Many of the problems basically go away, if you
use futures for _everything_. There is still some danger of scenario 2 at the
boundary between futures and other code. Some futures libraries in systems
that have an event loop (like JavaScript, I've been told) always schedule
futures' callbacks asynchronously partly for this reason. But others don't,
partly because they're written not to depend on an event loop. A consequence
is also better performance, but a risk of stack overflow.
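
To make that concrete, here's a toy future in Swift that always delivers its
value asynchronously, even when already resolved (a sketch with invented
names, not any real library):

```swift
import Dispatch

// Minimal hypothetical future. Observers are always notified via an
// async hop, so callers see consistent asynchronous behavior whether
// the future resolved before or after they observed it.
final class Future<Value> {
    private let queue = DispatchQueue(label: "future.state")
    private var value: Value?
    private var observers = [(Value) -> Void]()

    func resolve(with value: Value) {
        queue.async {
            guard self.value == nil else { return } // resolve only once
            self.value = value
            let pending = self.observers
            self.observers = []
            pending.forEach { observer in
                DispatchQueue.main.async { observer(value) }
            }
        }
    }

    func observe(_ observer: @escaping (Value) -> Void) {
        queue.async {
            if let value = self.value {
                // Already resolved: still deliver asynchronously,
                // trading an extra hop for predictable ordering.
                DispatchQueue.main.async { observer(value) }
            } else {
                self.observers.append(observer)
            }
        }
    }
}
```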

------
asdfasgasdgasdg
This seems craaaazy complicated. In C++, I'm only familiar with the abseil
libraries for thread safety, but the way to make sure multiple threads don't
access shared state is:

    
    
       absl::MutexLock l(&mutex_);
       // Critical section.
    

The idea that you have to write a "waiting queue" every time you need to guard
shared state in Swift is mind boggling. Surely not!

~~~
jchb
The waiting queue mechanism in the article is related to the specific
requirements of that auth network request rather than Swift in general.

The queues provided by Apple's GCD are an abstraction above threads and
mutexes. The equivalent of your critical-section example is:

    // Initialize the queue once, just as you would with the mutex;
    // the label will show up in the debugger.
    let dispatchQueue = DispatchQueue(label: "com.mycompany.myComponent")

    dispatchQueue.sync {
        // Critical section.
    }

There are several wins here. One is that gcd controls thread creation as
needed. But the bigger win is when you have a longer running critical section
and mustn't block the calling UI (main) queue. Then you can simply do an async
dispatch on the queue and callback to the UI queue on completion. GCD will
also introduce the necessary memory barriers for passing data between the
queues.
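
For the long-running case, that pattern looks roughly like this (names and the
"work" are placeholders):

```swift
import Dispatch

// Sketch: do expensive work asynchronously on a private serial queue,
// then hop back to the main queue with the result, so the UI queue is
// never blocked waiting on the critical section.
let workQueue = DispatchQueue(label: "com.mycompany.myComponent")

func loadData(completion: @escaping (String) -> Void) {
    workQueue.async {
        let result = "expensive result"   // stand-in for real work
        DispatchQueue.main.async {
            completion(result)            // deliver on the UI queue
        }
    }
}
```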

~~~
coldcode
I find the lack of language level locking a benefit in Swift, using GCD is
really easy now that the syntax is much more polished. I have yet to run into
a race condition. When I worked (years ago) on a 3D game engine in C++, we
actually removed most of the multi-threading because it was too hard to reason
about what it was doing and to avoid threading errors and race conditions.

------
jordansmithnz
A while ago I built an app data framework that used similar logic (although a
bit more abstracted) to coalesce every GET type request.

It made me wonder why network libraries don’t do this for the actual requests
by default (or perhaps they do?). I’m sure there are some cases where
simultaneous requests are intentional, but it seems like this would be
infrequent.
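
The coalescing itself isn't much code. A rough Swift sketch of the idea
(hypothetical names, error handling omitted): the first GET for a URL starts a
task, and later callers for the same URL are appended to a handler list
instead of issuing a duplicate request.

```swift
import Foundation
import Dispatch

// Hypothetical coalescing wrapper around URLSession.
final class CoalescingClient {
    private let queue = DispatchQueue(label: "coalescing.client")
    private var pending = [URL: [(Data?) -> Void]]()

    func get(_ url: URL, completion: @escaping (Data?) -> Void) {
        queue.async {
            if self.pending[url] != nil {
                // A request for this URL is already in flight: coalesce.
                self.pending[url]?.append(completion)
                return
            }
            self.pending[url] = [completion]
            URLSession.shared.dataTask(with: url) { data, _, _ in
                self.queue.async {
                    // Fire every handler that piled up while loading.
                    let handlers = self.pending.removeValue(forKey: url) ?? []
                    handlers.forEach { $0(data) }
                }
            }.resume()
        }
    }
}
```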

~~~
skohan
I guess the argument against would be that it adds complexity for a use-case
which is probably infrequent. How often does the average HTTP client request
the same resource multiple times concurrently? In what subset of those cases
does it actually matter if requests are redundant?

Why waste your user's clock-cycles on an edge case?

------
tumdum_
This page is completely unusable without JS, yet all it is is a short static
document :(

------
dep_b
Nice article. It reached a few conclusions I'd come to before: buffering token
requests in a queue once the first request has been kicked off, and using
named threads for things that need to happen off the main thread but in order.

I can follow some of the critiques here but implementing both solutions really
makes things a lot better than doing without.

