Hacker News
Race Condition vs. Data Race (2011) (regehr.org)
133 points by bshanks 85 days ago | 44 comments

This is something I learned in my early days of using Go (circa 2011), through the course of trying to be clever, and it's something I wish my CS courses had made explicitly clear; I think all programmers should at some point learn this difference. Even in JavaScript, race conditions are possible.

In fact, I'm pretty sure my car's "infotainment" system has several race conditions (though I'm not sure about data races specifically). After years of dealing with these kinds of bugs, you can smell them a mile away, especially in software you use a lot. Symptoms: highly sporadic, unpredictable behavior from a wonky state machine. These are hard to document and debug!

My car, for example, will sometimes exhibit these behaviors:

- Volume changes Android Auto levels even though only music is playing

- Backup camera is not on anymore, but steering alignment overlay remains when side camera is activated

- Computer+screen stays on even though car is off

- Audio is quieted for a notification but then never returned to its previous volume

- Buttons change state as if the screen was showing something else

Race conditions are really common, I think. I know my car is just one example, but you've probably experienced them in any moderately complex system you've used. I bet your TV has race conditions. And your router. And your ("smart") thermostat. I hope airplanes and spacecraft don't.

Not only that, but race conditions of this broad, "logic-related" sort are especially common in message-passing concurrency models, like the one that Go makes idiomatic. To make things worse, Go also has data races, when using plain old shared-access concurrency!

> To make things worse, Go also has data races, when using plain old shared-access concurrency!

And Go's message passing readily and silently devolves into unprotected shared-data concurrency, because you can send pointers through channels (and people commonly do, either for performance reasons or because maps and slices are reference types).

> Even in JavaScript, race conditions are possible.

I guess this was a joke/typo? Otherwise, are you implying that JS has any protection against them? This would be the last language I would think of.

I think that the "Even in JavaScript..." comment was written with the idea that race conditions might not be present/possible in single-threaded environments, such as JavaScript. I read it as "Even in a single-threaded environment, such as JavaScript..."

I don't think that most folks would hold up JS as some sort of gold standard of language design, and it doesn't seem that the author is doing so here.

I have a little rant on this.

On the scale of "understanding imperative semantics", awareness of race conditions is probably near the very high end of difficulty, while pointers are only somewhere in the middle. And that's a problem, because there are plenty of languages, JS included, that free you from understanding pointers while still making it very easy to create racy code. As such, there are a lot of programmers going around with a false understanding of whether they are writing robust concurrent code -- because they don't think it is concurrent. It doesn't say "concurrent" on it -- this is a footgun naturally achieved with any sufficiently complex loop -- and often they have made some effort to tuck away mutability in tiny functions, hindering efforts to find and fix the resulting race conditions.

Conversely, it's relatively easy to abstract away the complexity of pointers (so we do), but useful abstractions that make racy code impossible are horrendously hard to get right (so we don't).

Rust's borrow checker is a great example of the sort of complexity you _have_ to bring in to have your language protect you from data races.

You don't need Rust's complexity and borrow checking to protect you from data races. The actor model does that without the bad parts.

That's true, but the actor model achieves freedom from data races by forcing ALL messages to be synchronized. This can be useful in some situations, but IMO it isn't a panacea for concurrency issues.

> ALL messages to be synchronized

I'm assuming you mean implementation-wise. Only cross-core communications need synchronization, and since messages are asynchronous you can pay synchronization costs only once for the whole batch.

That makes more sense.

JavaScript's async model is arguably the world's main source of race conditions at this point.

I believe they were using JavaScript as an example because it is a single-threaded environment, while most people think of race conditions as being related to multithreading problems.

None of those car examples are necessarily race conditions. Most of those are likely deterministic, merely the result of wrong assumptions about what happens on which event, due to a poor understanding of the system -- i.e. your usual software development. Don't mistake that for race conditions!

Just the other day I was working on a browser extension and saw some things not getting triggered sometimes. But since it's single-threaded, event-driven code in a single content script, it can't really have race conditions; everything is deterministic, and it took only five minutes and a couple of console.log()s to figure out what the problem was.

> Are the data races on these flags harmful? Perhaps not. For example, in the evening we might shut down all transaction-processing threads and then select 10 random accounts that are flagged as having had activity for manual auditing. For this purpose, the data races are entirely harmless.

My understanding of data races like that is that they can cause UB by, for example, leading the reader to take neither side of an if/else branch. If something can cause UB depending on timing, that seems like it must be a race condition too.

The article sort of mentions this, but I thought it should be pointed out fully: in C/C++, from the language standpoint, it's impossible to have a data race without a race condition, because a data race is undefined behavior, and that means anything can happen, including race conditions.

I'd consider races of atomics with relaxed ordering data races, but those aren't UB per se.

The C++ standard definition of a data race excludes races of atomics; http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n429... section 1.10.23, page 14 near the bottom:

"The execution of a program contains a data race if it contains two ... ,at least one of which is not atomic, and ..."

I feel like, while it may be useful to distinguish the terms, a data race is really just an instance of a race condition. I'd say data races are race conditions, but not all race conditions are data races.

For example, with this line:

a = a + 1;

That's really multiple operations:

  1. Read a
  2. Add 1 to the value read
  3. Write the new value back to a's memory location.
And the data race is due to this. If this happened concurrently, for example:

  1. T1 reads a which is 10
  2. T2 reads a which is 10
  3. T1 adds 1 to its read value, getting 11
  4. T2 adds 1 to its read value, getting 11
  5. T1 puts 11 in location a
  6. T2 puts 11 in location a
And now if these had been synchronized, a would actually be 12, since you had two things increment it. But due to this data race, a is still at 11 only.

Now if you deconstruct the actual operations performed by a single line of code as I did here, you see that it is essentially the same thing as a race condition.

> I'd say data races are race conditions, but not all race conditions are data races.

In practice, I think you are almost right. There are some cases which exhibit data races while being free of race conditions, but I think the linked post overstates the importance of differentiating between data races that lead to race conditions and those that don't.

> That's really multiple operations
> [...]
> if you deconstruct the actual operations performed by a single line of code as I did here

It is not a given that the increment operation is to be divided into individual operations.

All of this is very dependent on the language (interpreted vs compiled (and how it's compiled)), how the concurrency is implemented (managed by the kernel vs software threads), as well as the ordering model of the hardware (whether the processor takes the liberty of reordering instructions).

The data races you describe lead to a race condition under the assumption that `a` or a derivative is used in some conditional branching later on. Arguably, this is always the case, otherwise, there would be no point in having this variable in the first place. But this is just one example.

Now for an example of a data race without race conditions which, IMO, is a bit more explicit than the one from the linked post. Imagine the following: we want each of our threads to write a timestamp of its execution into a shared variable, in order to know when the last thread was executed. This situation contains a data race by definition, as we do not know which thread will write its value into the shared variable. However, both values are similar enough for our monitoring needs (i.e. either is correct), and do not lead to race conditions.

Hum... okay. Maybe I'm not understanding what is meant by a data race here.

So your example would be:

  function f() {
    lastTime = Time.now();
  }
And now if we had threads T1 and T2 calling f concurrently, you say this would be a data race?

So I'm confused here. There's no bug, so a data race is not a category of defect the way race conditions are? Both your example and the article's example were data races that are not bugs. Is there an example of a data race that is not a race condition yet leads to a bug?

By the article's definition, yes, it would be a data race. With this example, I just wanted to make it more explicit to you what a data race without a race condition could look like.

Honestly, I'm still not sure why the author had to make a point and write a blog post about the conceptual difference between data races and race conditions.

> I'd say data races are race conditions, but not all race conditions are data races.

This is not the case according to the article. In the article, transfer4 is a data race but not a race condition.

Your example with T1 and T2 is a race condition and a data race.

Your example with a = a + 1 is neither a race condition nor a data race since there is no concurrency.

It's probably possible to argue that the terms should be defined differently to match your formulation instead of the author's. After all, definitions are created by people to help our thinking, and we can change them or refine them. But I suspect that would be a very difficult argument to make in a compelling way.

Maybe I didn't understand the article properly then. Their example of a data race is

account.activity = true;

Which two threads could perform concurrently.

Now from my understanding, the issue is that the hardware, OS, runtime and compiler might not guarantee that this assignment is atomic. Thus if two threads were to perform it concurrently, the result would be undefined, in that maybe the value ends up being true, or maybe it ends up corrupted, or maybe it is still false, or it will eventually be true, but for a while, if you read back the value it would still be false, etc.

Edit: Also, I don't think you understood my example. When I lay out T1 and T2, I'm talking about two threads concurrently executing a = a + 1;, which is a data race by their definition. But I'm saying that the issue with data races is that they can lead to race conditions at the assembly level. Thus a data race is really just a race condition at the assembly or even hardware level.

I'm starting to be convinced that data races don't matter if the variable being raced on fits in a machine word.

It depends on context. It's true that many CPUs provide atomicity for aligned accesses of the machine word size. However, the C/C++11 standard defines data races to be undefined behavior, so a data race in a C program still matters.

Go's race detector (-race) does an excellent job of catching data races without false positives, but it does not detect race conditions.

This was my concern with sync.Map being added to Go's standard library: I saw many Java developers use concurrent hashmaps to avoid data races while creating applications riddled with race conditions. Your critical section is rarely just the map.{get,set} but rather the operations performed with the objects you get/set.

Luckily I've seen sync.Map used very infrequently, as Go's lack of generics makes it fairly awkward.

Someone should fix en.wikipedia, which redirects Data Race to Race Condition.

> Generally speaking, some kind of external timing or ordering non-determinism is needed to produce a race condition

race conditions don't rely on non-determinism -- two overlapping CAS operations (compare-and-swap) with deterministic read-read-write-write order will race every time.

There are realistic code examples that produce this order in the wild -- the issue isn't non-determinism, it's lack of synchronization around the CAS.

Two overlapping CAS operations that deterministically get executed in a particular interleaving produce a deterministic result and as such are, by definition, not a race condition.

A race the result of which is known ahead of time is not a race.

it's an order-dependent logic error caused by lack of synchronization

what would you call it if not a race?

If the order is deterministic, there is nothing "order-dependent", other than in the sense that every calculation is order-dependent, in that any calculation might give you different results if you changed the current deterministic order of operations to a different order of operations.

For the same reason, there is also no lack of synchronization. If the result is wrong, it's simply a wrong calculation/a logic error, and if that is because the order of operations is wrong, then that is because the order of operations is wrong, not because of some race that doesn't happen.

If you have a hyperthreading CPU, the order of operations on a given code block can change from sequential to parallel depending on the availability of specific ALUs. Often the code can appear correct until something external causes more or less hyperthreading to occur, exposing dormant bugs.

... in which case the ordering of operations is not deterministic, so it's not relevant to this thread of the discussion?

in my example I'm launching two compare-and-swap operations (op1 and op2) at the same time.

The correct order (in the presence of synchronization) would be [op1_read, op1_write, op2_read, op2_write].

The incorrect order (without synchronization) is [op1_read, op2_read, op1_write, op2_write].

In the second trace, op2_write will overwrite op1_write with the data from its stale read. The correctness issue here isn't dependent on nondeterminism, just on lack of synchronization.

> In the second trace, op2_write will overwrite op1_write with the data from its stale read. The correctness issue here isn't dependent on nondeterminism, just on lack of synchronization.

No, it's simply a wrong order. When the order of operations is deterministic, that means, by definition, that they are synchronized. The fact that you can change the order by inserting or replacing instructions does not mean that the code lacked synchronization, and it is irrelevant that you could use those same instructions in a different context to synchronize operations. In this context, operations are already synchronized, so nothing you could do can possibly "synchronize them more"; you can only reorder them (or desynchronize them, potentially, by introducing non-determinism in the order).

Synchronization is what you use to restrict the ordering of operations. When there is only one possible ordering of operations (i.e., the order is deterministic), there is nothing to restrict there. When the order created by some synchronization is the wrong order for correctness of the software, that doesn't mean that the code is lacking synchronization, it simply means that it does order operations wrong.

A simple case of this is an increment operation.

The programmer has written an incorrect function that does a SQL read, parses a value, and then does an update. It doesn't use a transaction (i.e. no synchronization), so it is unsafe to use in parallel.

And someone else, without reading carefully, uses their function, spinning off two invocations and awaiting them both.

I suspect you'll agree this is incorrect in that the value will be incremented once instead of twice.

If you don't call this a race condition, what do you call it?

Your original scenario was one with deterministic execution order, this one as far as I can tell is not ... so, how is this example relevant to the discussion?

if you make some assumptions about how long IO takes and how the systems order IO, this overlapping-increment example will have a deterministic order and still be wrong every time (increment once instead of twice).

The assumptions are that op1 and op2 are started in order and that your database replies in the order received.

> if you make some assumptions about how long IO takes and how the systems order IO, this overlapping-increment example will have a deterministic order and still be wrong every time (increment once instead of twice).

So? Is every piece of code that gives the wrong result a race condition?

Isn't the non-determinism in this case the order of the compare-and-swap operations? And that non-determinism is the cause of the race condition in your example. Or maybe I didn't fully grasp your example.

This guy really knows how to insult his audience. (By repeatedly saying "This is really obvious...")

The word "obvious" appears exactly twice in the article, and in neither place is it used in a way I would consider condescending or insulting.

> Everything in this post is pretty obvious

When someone is struggling to understand something, it's insulting to tell him or her that it's obvious. It's implying that he or she is a moron.

It's only not condescending or insulting if the material actually is obvious.

And in this case, it's not.

I had to re-read one part that he called "obvious," and I used to do computer science research in concurrency. Granted, I had been reading quickly.

There is no way any of this is "obvious" to a student learning about concurrency for the first time, and Regehr is a professor and has students, so that category of people is presumably part of his audience.

There are HN posters on this thread, presumably intelligent, who did not completely follow the post, so again, he's pretty much calling these people morons, since they didn't follow something that he repeatedly insists is "obvious."

> Everything in this post is pretty obvious, but I’ve observed real confusion about the distinction between data race and race condition by people who should know better (for example because they are doing research on concurrency correctness)

So there's another class of morons that can't understand the "pretty obvious"... some concurrency researchers.
