Avoiding race conditions in Swift (swiftbysundell.com)
63 points by ingve on Nov 4, 2018 | 38 comments



This might be contentious, but I'm going to say it. If your program is multithreaded in a nontrivial way, Swift is not the language for you, at least yet.

First, this article understates the danger significantly. It's not just that your token loader might be called twice; if you call this from two different threads, you're fully in undefined behavior. Swift is "safe-ish" in that it's mostly safe in single-threaded contexts, but with data races that breaks down. It's even more confusing because some things are properly atomic (reference counts, manipulation of value types like arrays), but some aren't (access to fields in classes).

Second, the language is missing basic features that in modern programming would help tame this beast - the most important of which is a wrapper for internally mutable state that is protected by a mutex. Grand Central Dispatch can help with some of this.
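
For illustration, here's a rough sketch of the kind of wrapper I mean (the `Protected` name and API are made up, not anything Swift or Foundation ships):

  import Foundation

  // Made-up `Protected` wrapper: all access to the wrapped value goes
  // through an NSLock, so callers can't touch the state unlocked.
  final class Protected<Value> {
      private let lock = NSLock()
      private var value: Value

      init(_ value: Value) {
          self.value = value
      }

      func withValue<T>(_ body: (inout Value) throws -> T) rethrows -> T {
          lock.lock()
          defer { lock.unlock() }
          return try body(&value)
      }
  }

  // Usage: the shared counter can only be mutated while the lock is held.
  let counter = Protected(0)
  counter.withValue { $0 += 1 }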

Third, there are unexpected performance losses. For example, the closure for running a block on a sync "queue" is heap allocated (because the type of such a closure is the same for sync and async, and of course the latter has to be heap allocated). This will probably be fixed, but I think it's symptomatic.

Fourth, there's no official concurrency model. Again, it's confusing because there are some things that are obviously intended to run with concurrency. In fairness, Rust doesn't have a formal concurrency model yet either, but it is straightforward to apply reasoning from C++, and also reasonable to expect guarantees if the primitives such as Mutex, channels, etc., are used properly.

Fifth (and related), there are no atomics in the language or the standard library. Apparently OSAtomic is deprecated, and they now recommend the use of C atomics, but then you're in the territory of having to reason about the polyglot combination of Swift and C.

Likely all this will improve, but in the meantime it's important to realize how immature Swift's concurrency is.


I agree. I have a heavily multithreaded program, and I had to write my own thread-safe data structures, and spent ages debugging and tuning them. Then one weekend I read about Clojure core.async, and wrote a little 50-line prototype (much simpler and easier), and got something that worked correctly every time (first try!), and was (without tuning) only about 15% slower than my production system.

I've read about how they plan to glue some better concurrency support onto Swift, and I dread it. The language is already so complex that they're having trouble with the basics. I've got workarounds in my app because switching on an enum doesn't always work right (SR-1121). Now we want to add concurrency after the fact? Oh boy.

If I had to do it again, I'd definitely write the core of my software in some other language (maybe Clojure), and just use Swift for the GUI. Swift looks like a language designed for AppKit/UIKit, so it's fine if that's all you need it to do.


> It's not just that your token loader might be called twice; if you call this from two different threads, you're fully in undefined behavior.

This is addressed in the article, and a resolution is provided.

> It's even more confusing because some things are properly atomic (reference counts, manipulation of value types like arrays)

Modifying an array is not atomic, AFAIK. You cannot safely do this concurrently.


> This is addressed in the article, and a resolution is provided.

"Undefined state" is the language used in the article, and has no precise technical definition. The article suggests to me simply that a method might be called twice, and makes no mention of your bank account being emptied or demons flying out of your nose, which is the consequence of undefined behavior you should expect.

> Modifying an array is not atomic, AFAIK. You cannot safely do this concurrently.

I'm happy to correct errors (I'm not a Swift expert), but I'm going on the basis of copy-on-write semantics and the fact that `isUniquelyReferenced` seems to be implemented atomically - this is stated fairly clearly on the official Apple blog at https://developer.apple.com/swift/blog/?id=10 , but that might be out of date.


Try compiling this with swiftc -sanitize=thread:

  import Foundation

  let q1 = DispatchQueue(label: "1")
  let q2 = DispatchQueue(label: "2")

  var a = [0]

  q1.async {
      a[0] = 1
  }

  q2.async {
      a[0] = 2
  }

  // Keep the process alive long enough for both closures to run.
  Thread.sleep(forTimeInterval: 1)
You'll get an error at runtime that there's a data race.


This is a subtle point. It's not the array assignment that's the race here, but the update to the shared (by closure capture) `a` variable. You'd get the same data race if `a` were an `Int`, and we don't say that integers are non-thread-safe. If each closure had its own reference to the array, it would be copied on write.
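
To illustrate, here's a variation of the snippet above (same idea, same queues): give each closure its own copy up front, and the mutations no longer race, because the first write to each copy triggers copy-on-write.

  import Foundation

  let q1 = DispatchQueue(label: "1")
  let q2 = DispatchQueue(label: "2")

  let shared = [0]  // immutable binding: capturing it isn't a data race

  q1.async {
      var local = shared  // first mutation copies the buffer (COW)
      local[0] = 1
      print(local)
  }

  q2.async {
      var local = shared
      local[0] = 2
      print(local)
  }

  // Keep the process alive long enough for both closures to run.
  Thread.sleep(forTimeInterval: 1)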

I believe this is pretty good evidence for my claim that understanding the thread safety is confusing.


I agree that there are 1000 ways to create concurrency issues in Swift, as shown in the article. However I fail to see how it's worse than most other multithreaded languages. C++, JVM languages, Go, Python - they all don't provide a lot of support for writing correct thread-safe programs.

Swift/Objective-C might even have an edge compared to some of them, due to providing dispatch queues in the standard library - which are fairly easy to reason about, as shown in the article. I see those as better than the average user creating lots of threads on their own, and then trying to synchronize things via mutexes. There might be a performance overhead, but I would only look at that after the program is correct.

There might be languages that provide more help here, e.g. Erlang, Haskell or Pony. But those are not mainstream, and might also not be good choices for the things where Swift is an alternative (client-side applications). Ada might have been better too, with built-in task and communication constructs, but that falls into a similar category.

Then we have Rust, which might be the thing you are referring to as a better alternative. It is in the sense that it can prevent most threading issues at compile time - I will refrain from saying "all" since people will write unsafe code, and there can still be deadlocks and logical race conditions. However, even though Rust has lots of tools for preventing concurrency issues, it still doesn't have built-in ways to make concurrent programs easy - e.g. for solving the problem that is described in the article. If one were to implement the given solution in Rust, there might be lots of fighting with the borrow checker involved, and potentially the same amount of heap allocations (callbacks aren't a paradigm that works well in Rust). Obviously there are more idiomatic solutions.


> However I fail to see how it's worse than most other multithreaded languages. C++, JVM languages, Go, Python - they all don't provide a lot of support for writing correct thread-safe programs. Swift/Objective-C might even have an edge compared to some of them, due to providing dispatch queues in the standard library - which are fairly easy to reason about, as shown in the article.

A couple corrections:

- Dispatch queues are not a part of the Swift standard library. They are part of Grand Central Dispatch, which is almost always present where Swift is present but is actually a separate project.

- JVM languages, well specifically Java at least, do have synchronization primitives built-in. It's part of the Java language spec. You can mark any method as synchronized for instance and it will automatically be mostly thread-safe.


Thanks for adding it. I was actually aware of both things.

While Java has some built-in syntax (synchronized) for mutexes, as well as the notify/wait methods on Objects, I don't see them as a big step forward in avoiding concurrency issues. Programmers still need to be aware that they need to use those constructs, which is in most cases already the No. 1 issue. The next step is using them correctly. The JVM ecosystem has a huge number of libraries and APIs that aim to make concurrency easier (java.util.concurrent, RxJava, Quasar, etc.). But all of those must first be found, and then used correctly. Rust, on the other hand, will remind users that something is missing. And Go at least puts the somewhat higher-level channels front and center, instead of mutexes.


While it's true that much of the hard work of not writing data races is still in the user's hands, it's important to remember that Java having a memory model and a specification for locks/volatiles is a VAST improvement over the same situation in most other languages. It expressly defines which operations do and do not imply a causal relationship, guarantees things like no word tearing on volatile longs, and guarantees that variables can't "logically" be optimized out of existence in the presence of out-of-thread writes.

It's actually pretty great in practice.


JVM has two additional advantages over Swift (sorry for repeating) - an actual memory model, and data races can cause logic errors but not the full undefined behavior of LLVM languages.


You can implement synchronized in a couple of lines in Swift, thanks to trailing closures.
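
A minimal sketch, using NSLock (objc_sync_enter/objc_sync_exit would also work):

  import Foundation

  // The trailing closure is the critical section.
  func synchronized<T>(_ lock: NSLock, _ body: () throws -> T) rethrows -> T {
      lock.lock()
      defer { lock.unlock() }
      return try body()
  }

  // Usage:
  let lock = NSLock()
  var balance = 0
  synchronized(lock) {
      balance += 100
  }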


> However, even though Rust has lots of tools for preventing concurrency issues, it still doesn't have built-in ways to make concurrent programs easy - e.g. for solving the problem that is described in the article.

Rust's Mutex type seems like it would be a good fit for this situation. Also this kind of callback -- which is executed immediately instead of stored and executed later -- doesn't require any heap allocation. You find callbacks like this in e.g. the scoped threading or rayon::join APIs.


Um, this is all very standard stuff - if you want thread safety you should use queues or locks. Just because a language doesn't have first-class support for multithreading doesn't mean you shouldn't use it.

You should use explicit synchronisation even when using Python, which has that terrible global lock. Relying on atomic operations is too error prone and not explicit enough.


C and C++ are sharp tools. They have extensive undefined behavior, but also a formal memory model, so it means something to say, "this code is correct." They also have a culture of people who understand how to deal with undefined behavior, and tools (sanitizers) that make it more feasible.

Rust is safe, unless you go out of your way to be unsafe. It's not just "use a lock," it's "if you try to access shared mutable data without a lock, your program won't compile."

Java has data races but not undefined behavior. It also has a formal memory model.

Swift gives you none of that. Treating threads as a library rather than something with language support is something that might have seemed like a good idea 20 years ago, but now I think we know better.


The C model is ‘there is only one core, race conditions don’t exist. Multithreading is undefined.’ How is that better?


That has been changed in C11 to match C++11, in case you are not aware.


This code has a significant and common flaw: If the token is cached, it invokes the handler synchronously.

You should never design a function with a completion handler that sometimes invokes it synchronously and sometimes asynchronously. That can cause all sorts of problems, the simplest of which is a caller that expects it to be asynchronous and ends up performing work out-of-order because the handler was invoked synchronously.

The final form of the code in this article accidentally fixes this, by moving the work it does onto a background queue, but this just introduces a separate flaw: it performs the callback from its internal private queue. This isn't necessarily a problem, except in that it will block any other work from running on that queue unnecessarily (as it's a serial queue), but in more complicated code, having external code running on your private queue can indeed cause problems.
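
To make this concrete, here's a rough sketch of the shape I'd expect (the names are mine, not the article's exact code): the handler is always invoked asynchronously, even on a cache hit, and on a queue the caller can rely on rather than the service's private one.

  import Foundation

  final class TokenService {
      private let queue = DispatchQueue(label: "TokenService.private")
      private let callbackQueue = DispatchQueue.main
      private var cachedToken: String?

      func retrieveToken(_ handler: @escaping (String) -> Void) {
          queue.async {
              if let token = self.cachedToken {
                  // Cached or not, the handler is always called
                  // asynchronously, and never on the private queue.
                  self.callbackQueue.async { handler(token) }
                  return
              }
              // ...kick off the real load here, and complete via
              // self.callbackQueue.async { handler(newToken) }...
          }
      }
  }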

---

There's also another issue, which is that if the AccessTokenService gets deinited before the loader calls its callback, the callback is dropped on the floor. In general, you should not design a function that takes a completion callback with conditions where the completion callback is never fired. There are three reasonable behaviors that could be implemented here:

1. Have the callback keep the AccessTokenService alive. It's pretty common for networking operations to keep themselves alive while executing, so this is probably the approach I'd take (see the sketch after this list).

2. Have the AccessTokenService fire all of its pending completion handlers with an error in deinit. The error should ideally represent the notion that the operation was cancelled.

3. Change the AccessTokenLoader completion handler to look more like the initial version, where it still calls the callback even if self is nil. With the pending completion handlers approach, this requires moving the pending completion handlers into a reference-type wrapper and strongly referencing that from the AccessTokenLoader callback. In this case, that's pretty much the same as approach #1, so you should just go ahead and use approach #1, but in a more complex class this approach might have some utility.
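
Here's a rough sketch of approach #1 (not the article's exact code): the loader's completion closure captures `self` strongly, so the service stays alive until every pending handler has been fired.

  import Foundation

  final class AccessTokenLoader {
      func load(_ completion: @escaping (String) -> Void) {
          // Stand-in for the real network request.
          DispatchQueue.global().asyncAfter(deadline: .now() + 0.1) {
              completion("fresh-token")
          }
      }
  }

  final class AccessTokenService {
      private let queue = DispatchQueue(label: "AccessTokenService")
      private let loader = AccessTokenLoader()
      private var pendingHandlers: [(String) -> Void] = []

      func retrieveToken(_ handler: @escaping (String) -> Void) {
          queue.async {
              self.pendingHandlers.append(handler)
              guard self.pendingHandlers.count == 1 else { return }

              // No [weak self]: this strong capture is what keeps the
              // service alive while the load is in flight.
              self.loader.load { token in
                  self.queue.async {
                      let handlers = self.pendingHandlers
                      self.pendingHandlers.removeAll()
                      handlers.forEach { $0(token) }
                  }
              }
          }
      }
  }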


Absolutely this. See https://google.github.io/guava/releases/23.0/api/docs/src-ht...

for a discussion of some of the dangers of invoking completion listeners synchronously.


You have it backwards, it's the caller's bug if they assume order in an async process.


No no no no no. He's right, and this is a hard-learned lesson that many practitioners have discovered independently.

Both the caller and callee face huge problems of reentrant function calls when you mix up the behavior.

Generally the right choice is to invoke every callback asynchronously, unless you need to do something else.


Can you explain what problems someone faces if a callback is executed immediately?


One usual issue on the callee side is that the code inside the callback might call back into the same component (here: AccessTokenLoader::retrieveToken). When that happens, state inside the component is modified before the callback returns, so when the callback returns the component is no longer in the state it was in, which might not be expected. In C++, it could even happen that the object was deleted within the callback; then accessing any local field after the callback returns leads to a segfault.

One kind of remedy is to only call callbacks at the end of the logic, and not touch anything anymore after the callback returns. That works in some cases - however, scheduling the callback asynchronously might be easier to reason about.
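
A minimal sketch of that remedy (hypothetical component): finish all internal bookkeeping first, and make the callback invocation literally the last thing the method does.

  // Hypothetical component: internal state is fully updated before any
  // outside code runs, so a reentrant call back into it sees a
  // consistent state, and nothing of our own is touched afterwards.
  final class TokenDispatcher {
      private var pendingHandlers: [(String) -> Void] = []

      func add(_ handler: @escaping (String) -> Void) {
          pendingHandlers.append(handler)
      }

      func deliver(_ token: String) {
          // 1. Finish the bookkeeping first.
          let handlers = pendingHandlers
          pendingHandlers.removeAll()

          // 2. Only then hand control to outside code; this is the last
          //    statement, so our own state is never touched afterwards.
          handlers.forEach { $0(token) }
      }
  }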

Then there are also issues on the caller side, if the caller doesn't expect that the callback is sometimes made synchronously and sometimes asynchronously. If it's assumed that the callback happens asynchronously and some state is modified in it, the code that runs after invoking the initial function might work on wrong assumptions (it isn't aware that the state could already have been modified).


In these examples, let's suppose function (or object) A passes the callback function C to function (or object) B. Function B is what calls callback C, or indirectly causes it to be called.

Scenario 1: Object B has a function body which looks like:

1. Look at or modify B's state.

2. Call callback C.

3. With code that assumes the state hasn't changed, do more stuff to B.

If C calls back into B (for example, if we'd like to perform the operation a second time, or if we'd like to perform a different operation), the assumption on line 3 might be incorrect. Or, some other assumption in the reentrant call might be incorrect.

(One common outcome is that the developers say, "Oh, there's a race condition," and they add a mutex. Later, they say, "Oh, there's a deadlock," and make it a recursive mutex. This is why recursive mutexes are a major code smell.)

Scenario 2: The developer realizes a safe callback has to be called after all other side effects are complete. So the developer writes this for their function body:

1. Look at or modify B's state.

2. Decide you're going to call C later.

3. Do more stuff to B.

4. Call callback C, literally the last thing you do in the function.

(The developer is an experienced C++ developer and knows to make sure object destructors for variables going out of scope happen before step 4.)

The potential problem here is on A's and C's side. What if they write:

A-1. Acquire mutex M, and look at or modify A's state.

A-2. Call into B, with callback C (which affects the state of A).

A-3. With code that assumes the state hasn't changed, do more stuff to A.

A-4. Release mutex M.

In callback C (in some helper function it invokes):

C-1. Acquire mutex M.

C-2. Do stuff assuming we have exclusive access to the state of A.

C-3. Release mutex M.

If M is non-recursive, you have a deadlock. If M is a recursive mutex (or nonexistent), the assumption on line A-3 is incorrect. Or maybe the assumption on line C-2 is incorrect.

Scenario 3: The developer is now super-paranoid about side effects of their callback. So they write all their callbacks like this:

    E-1: callAsynchronously(function() {
        C-1: Acquire mutex M.
        C-2, C-3: as before.
    }).
Well now you have no performance advantage relative to just having object B call asynchronously.

Oh, and one more.

Scenario 4: The developer's a freakin' genius. Writes out dead-perfect hypersecure code. Can run a 5-minute mile, and smells nice afterwards. Then enough synchronous callbacks happen recursively to produce a stack overflow.

----

The thing about these scenarios is that if you call the callback function synchronously every time, you're more likely to catch the bug. It'll show up in tests, or you'll just get a deadlock, or some debug-mode "DebugMutex" assertion failure.

Also, if F always invokes asynchronously, and G always invokes synchronously, you can't call F from G. And you have to add a wrapper in your callback when calling G from F.

If, for a particular function, you have inconsistent behavior, you'll have a whole bunch of undertested edge cases. And they might involve any function up the callstack -- If F1 calls F2 calls F3 calls F4, and once in a while F4 invokes synchronously, that could cause F3, F2, and F1 to have that same property (if they don't do further async operations in the callbacks they pass up).


Thanks for the detailed reply.

Does scheduling work via queues (channels, they call them now, right? :)) alleviate most of these issues, while sacrificing some efficiency?

The scenarios you've described are brittle by design - I imagine better designs are out there?


I don't know what exactly a queue or channel is as you mean it, so it could depend, but the short answer is that if you can throw in real mutexes and won't risk deadlock, then that's a big mitigation.

A good way to improve all the scenarios I've mentioned is to return a future instead of calling a callback. Many of the problems basically go away, if you use futures for everything. There is still some danger of scenario 2 at the boundary between futures and other code. Some futures libraries in systems that have an event loop (like JavaScript, I've been told) always schedule futures' callbacks asynchronously partly for this reason. But others don't, partly because they're written not to depend on an event loop. A consequence is also better performance, but a risk of stack overflow.


Exactly. UITableView is a good example of how _not_ to do things.


Yeah, the mixing of MVC layers is terrible, and anything beyond a hard reloadData() has many edge cases.


This seems craaaazy complicated. In C++, I'm only familiar with the Abseil libraries for thread safety, but the way to make sure multiple threads don't access shared state is:

   absl::MutexLock l(&mutex_);
   // Critical section.
The idea that you have to write a "waiting queue" every time you need to guard shared state in Swift is mind-boggling. Surely not!


The waiting queue mechanism in the article is related to the specific requirements of that auth network request rather than Swift in general.

The queues provided by Apple's GCD are an abstraction above threads and mutexes. The equivalent of your critical section example is:

  // Initialize the queue once, just as you would the mutex; the label
  // will show up in the debugger.
  let dispatchQueue = DispatchQueue(label: "com.mycompany.myComponent")

  dispatchQueue.sync {
      // Critical section.
  }

There are several wins here. One is that GCD controls thread creation as needed. But the bigger win is when you have a longer-running critical section and mustn't block the calling UI (main) queue. Then you can simply do an async dispatch on the queue and call back to the UI queue on completion. GCD will also introduce the necessary memory barriers for passing data between the queues.
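
Roughly, as a sketch (doExpensiveWork and updateUI are made-up stand-ins):

  import Foundation

  // Made-up stand-ins for real work and a real UI update.
  func doExpensiveWork() -> Int { return (1...1_000_000).reduce(0, +) }
  func updateUI(with value: Int) { print("result: \(value)") }

  let dispatchQueue = DispatchQueue(label: "com.mycompany.myComponent")

  dispatchQueue.async {
      // Long-running critical section; the main (UI) queue is not blocked.
      let result = doExpensiveWork()

      DispatchQueue.main.async {
          // Hop back to the UI queue with the result; GCD takes care of
          // the memory barriers for handing `result` across.
          updateUI(with: result)
      }
  }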


I find the lack of language-level locking a benefit in Swift; using GCD is really easy now that the syntax is much more polished. I have yet to run into a race condition. When I worked (years ago) on a 3D game engine in C++ we actually removed most of the multi-threading because it was too hard to reason about what it was doing and avoid threading errors and race conditions.


Swift has easy access to pthread mutexes, and you can do that.

serialQueue.async { /* critical section */ } is GCD's way of doing the same, with the added benefit that the calling code does not have to wait.

Problems in the parent code could be solved by using a mutex instead of GCD in this case. The code _has_ to wait for the access token to be loaded, and not blocking does not make sense here unless you also introduce some kind of promise.


This.

Also, GCD has DispatchSemaphore, which is a semaphore. Mutexes are a particular case of semaphores.
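
For example, a minimal sketch of DispatchSemaphore used as a mutex:

  import Foundation

  // A binary semaphore (initial value 1) used as a mutex.
  let semaphore = DispatchSemaphore(value: 1)
  var sharedCounter = 0

  func increment() {
      semaphore.wait()             // lock
      defer { semaphore.signal() } // unlock
      sharedCounter += 1
  }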


1. You don't, as said before.

2. Creating dispatch queues is trivial, and most likely you've already created one for other tasks.

3. You're missing the context of this article.

Try doing that on the main thread of a simple command-line utility and nothing happens.

Try locking the main thread of a GUI application, and you've made a beginner's mistake: your app will lock up.

GUI API calls must be made on the main thread, and that's where our view controllers run.


A while ago I built an app data framework that used similar logic (although a bit more abstracted) to coalesce every GET type request.

It made me wonder why network libraries don’t do this for the actual requests by default (or perhaps they do?). I’m sure there are some cases where simultaneous requests are intentional, but it seems like this would be infrequent.


I guess the argument against would be that it adds complexity for a use-case which is probably infrequent. How often does the average HTTP client request the same resource multiple times concurrently? In what subset of those cases does it actually matter if requests are redundant?

Why waste your user's clock-cycles on an edge case?


This page is completely unusable without JS, yet all it is is a short static document :(


Nice article. It comes to a few conclusions I'd arrived at before: buffering token requests in a queue once the first request has been kicked off, and using named serial queues for things that need to happen off the main thread but in order.

I can follow some of the critiques here, but implementing both solutions really makes things a lot better than doing without them.





