
Race Condition vs. Data Race (2011) - bshanks
https://blog.regehr.org/archives/490
======
mholt
This is something I learned through the course of trying to be clever in my
early days of using Go (circa 2011) and it's something I wish my CS courses
made explicitly clear, and I think all programmers should at some point learn
this difference. Even in JavaScript, race conditions are possible.

In fact, I'm pretty sure my car's "infotainment" system has several race
conditions (though I'm not sure about data races specifically). After years of
dealing with these kinds of bugs, you can smell them a mile away, especially in
software you use a lot. Symptoms: highly sporadic / unpredictable behavior of
a wonky state machine. These are hard to document and debug!

My car, for example, will sometimes exhibit these behaviors:

- Volume changes Android Auto levels even though only music is playing

- Backup camera is not on anymore, but steering alignment overlay remains
when side camera is activated

- Computer+screen stays on even though car is off

- Audio is quieted for a notification but then never returned to its previous
volume

- Buttons change state as if the screen was showing something else

Race conditions are really common, I think. I know this is just my car as an
example, but you've probably experienced them on any moderately complex system
you've used. I bet your TV has race conditions. And your router. And your
("smart") thermostat. I hope airplanes and spacecraft don't.

~~~
knocte
> Even in JavaScript, race conditions are possible.

I guess this was a joke/typo? Otherwise, are you implying that JS has any
protection against them? This would be the last language I would think of.

~~~
vitalus
I think that the "Even in JavaScript..." comment was written with the idea
that race conditions might not be present/possible in single threaded
environments, such as JavaScript...I read it as "Even in a single-threaded
environment, such as JavaScript..."

I don't think that most folks would hold up JS as some sort of gold standard
of language design, and it doesn't seem that the author is doing so here.
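
A minimal sketch of how that can look in a single-threaded event loop, using Python's asyncio as a stand-in for JavaScript's (the `withdraw` function and the `balance` variable are made up for illustration). There is no data race here, since only one thread ever touches `balance`, but the check and the update are separated by a suspension point:

```python
import asyncio

balance = 100  # shared state, only ever touched from one thread


async def withdraw(amount: int) -> bool:
    """Check-then-act with an await in between (hypothetical example)."""
    global balance
    if balance >= amount:        # check
        await asyncio.sleep(0)   # yield to the event loop, e.g. while waiting on I/O
        balance -= amount        # act on a check that may now be stale
        return True
    return False


async def main() -> int:
    # Two withdrawals of 100 against a balance of 100: both pass the check
    # before either one performs the subtraction.
    await asyncio.gather(withdraw(100), withdraw(100))
    return balance


final_balance = asyncio.run(main())
print(final_balance)  # -100
```

Keeping the check and the update on the same side of the `await`, or guarding both with an `asyncio.Lock`, would close the window.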

~~~
mntmoss
I have a little rant on this.

On the scale of "understanding imperative semantics", awareness of race
conditions is probably near the very high end of difficulty, while pointers
are only somewhere in the middle. And that's a problem, because there are
plenty of languages, JS included, that free you from understanding pointers
while still making it very easy to create racy code. As such there are a lot
of programmers going around with a false understanding of whether they are
writing robust concurrent code - because they don't think it is concurrent. It
doesn't say "concurrent" on it - this is a footgun naturally achieved with any
sufficiently complex loop - and often they have made some effort to tuck away
mutability in tiny functions, hindering efforts to find and fix the resulting
race conditions.

~~~
pdpi
Inversely, it's relatively easy to abstract away the complexity of pointers
(so we do), but useful abstractions that make race-y code impossible are
horrendously hard to get right (so we don't).

Rust's borrow checker is a great example of the sort of complexity you _have_
to bring in to have your language protect you from data races.

~~~
zzzcpan
You don't need Rust's complexity and borrow checking to protect you from data
races. The actor model does that without the bad parts.

~~~
hderms
That's true, but the actor model achieves freedom from data races by forcing ALL
messages to be synchronized. This can be useful in some situations, but imo it
isn't a panacea for concurrency issues.
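
For readers unfamiliar with the idea, here is a rough sketch of an actor-style counter in Python, assuming a plain thread plus `queue.Queue` as the mailbox (the names `counter_actor`, `inbox`, and `sender` are illustrative, not from any actor library). The state `count` is owned by one thread, and the only synchronization point is the mailbox itself:

```python
import queue
import threading

inbox: queue.Queue = queue.Queue()  # the actor's mailbox
result = []                         # where the actor reports its final state


def counter_actor() -> None:
    count = 0  # private state: only this thread ever reads or writes it
    while True:
        msg = inbox.get()
        if msg == "stop":
            result.append(count)
            return
        count += msg


def sender(n: int) -> None:
    # Senders never touch the counter directly; they only post messages.
    for _ in range(n):
        inbox.put(1)


actor = threading.Thread(target=counter_actor)
actor.start()

senders = [threading.Thread(target=sender, args=(1000,)) for _ in range(4)]
for s in senders:
    s.start()
for s in senders:
    s.join()

inbox.put("stop")  # posted only after every sender has finished
actor.join()

total = result[0]
print(total)  # 4000
```

Every message passes through the queue's internal lock, which is the cost being debated here. The design rules out data races on `count`, but it does not by itself rule out higher-level race conditions between messages.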

~~~
zzzcpan
> ALL messages to be synchronized

I'm assuming you mean implementation-wise. Only cross-core communications need
synchronization and since messages are asynchronous you can pay
synchronization costs only once for the whole batch.

------
oconnor663
> Are the data races on these flags harmful? Perhaps not. For example, in the
> evening we might shut down all transaction-processing threads and then
> select 10 random accounts that are flagged as having had activity for manual
> auditing. For this purpose, the data races are entirely harmless.

My understanding of data races like that is that they can cause UB by, for
example, leading the reader to take neither side of an if/else branch. If
something can cause UB depending on timing, that seems like it must be a race
condition too.

------
Thorrez
The article sort of mentions this, but I thought it should be pointed out
fully: in C/C++ from the language standpoint it's impossible to have a data
race without a race condition, because a data race is undefined behavior and
that means anything can happen, including race conditions.

~~~
CodesInChaos
I'd consider races of atomics with relaxed ordering data races, but those
aren't UB per se.

~~~
bshanks
The C++ standard definition of a data race excludes races of atomics;
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf section
1.10.23, page 14 near the bottom:

"The execution of a program contains a data race if it contains two ... ,at
least one of which is not atomic, and ..."

------
didibus
I feel like, while it may be useful to distinguish the terms, a data race is
really just an instance of a race condition. I'd say data races are race
conditions, but not all race conditions are data races.

For example, with this line:

a = a + 1;

That's really multiple operations:

    
    
      1. Read a
      2. Add 1 to the read a
      3. Put the new value over the old a memory location.
    

And the data race is due to this. If this happened concurrently, for example:

    
    
      1. T1 reads a which is 10
      2. T2 reads a which is 10
      3. T1 adds 1 to its read value, getting 11
      4. T2 adds 1 to its read value, getting 11
      5. T1 puts 11 in location a
      6. T2 puts 11 in location a
    

And now if these had been synchronized, a would actually be 12, since you had
two things increment it. But due to this data race, a is still at 11 only.

Now if you deconstruct the actual operations performed by a single line of
code as I did here, you see that it is exactly like a race condition.
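
The six-step interleaving can be replayed by hand in Python. This is only a sketch: the `t1_reg` / `t2_reg` variables stand in for each thread's private copy of the value, and the second half assumes a plain `threading.Lock` as the synchronization.

```python
import threading

# Part 1: replay the interleaving deterministically, one step per line.
a = 10
t1_reg = a       # 1. T1 reads a (10)
t2_reg = a       # 2. T2 reads a (10)
t1_reg += 1      # 3. T1 computes 11
t2_reg += 1      # 4. T2 computes 11
a = t1_reg       # 5. T1 stores 11
a = t2_reg       # 6. T2 stores 11, overwriting T1's update
lost_update = a
print(lost_update)  # 11, not 12: one increment was lost

# Part 2: with real threads, hold a lock across the whole read-modify-write.
counter = 0
lock = threading.Lock()


def increment(times: int) -> None:
    global counter
    for _ in range(times):
        with lock:          # read, add, and store form one critical section
            counter += 1


threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 20000
```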

~~~
javert
> I'd say data races are race conditions, but not all race conditions are data
> races.

This is not the case according to the article. In the article, transfer4 is a
data race but not a race condition.

Your example with T1 and T2 is a race condition and a data race.

Your example with a = a + 1 is neither a race condition nor a data race since
there is no concurrency.

It's probably possible to argue that the terms should be defined differently
to match the formulation you have instead of the author's. After all,
definitions are created by people to help our thinking and we can change them
or refine them. But I suspect that would be a very difficult argument to make
in a compelling way.

~~~
jacinabox
I'm starting to be convinced that data races don't matter if the variable
being raced on fits in a machine word.

~~~
bshanks
It depends on context. It's true that many CPUs provide atomicity for aligned
accesses of the machine word size. However, the C11 and C++11 standards define
data races to be undefined behavior, so a data race in a C or C++ program still
matters.

------
schmichael
Go's race detector (-race) does an excellent job of catching data races
without false positives, but it does not detect race conditions.

This was my concern with sync.Map being added to Go's standard library: I saw
many Java developers use concurrent hashmaps to avoid data races while
creating applications riddled with race conditions. Your critical section is
rarely just the map.{get,set} but rather the operations performed with the
objects you get/set.

Luckily I've seen sync.Map used very infrequently as Go's lack of generics
makes it fairly awkward.
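
To illustrate the concurrent-map point in a language-neutral way, here is a small Python sketch. `SafeMap` is a hypothetical wrapper, not Go's `sync.Map` or Java's `ConcurrentHashMap`; each of its methods is atomic on its own. The two callers are interleaved by hand so the outcome is reproducible:

```python
import threading


class SafeMap:
    """Hypothetical dict wrapper whose individual methods are thread-safe."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key, default=0):
        with self._lock:
            return self._data.get(key, default)

    def set(self, key, value) -> None:
        with self._lock:
            self._data[key] = value

    def update(self, key, fn, default=0) -> None:
        # The whole read-modify-write happens inside one critical section.
        with self._lock:
            self._data[key] = fn(self._data.get(key, default))


hits = SafeMap()

# Two callers each do get-then-set. Every call is individually safe,
# so there is no data race, yet one update is still lost.
seen_by_a = hits.get("page")
seen_by_b = hits.get("page")
hits.set("page", seen_by_a + 1)
hits.set("page", seen_by_b + 1)
racy_count = hits.get("page")
print(racy_count)  # 1

# Making the critical section cover the whole operation fixes it.
hits.set("page", 0)
hits.update("page", lambda n: n + 1)
hits.update("page", lambda n: n + 1)
safe_count = hits.get("page")
print(safe_count)  # 2
```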

------
karmakaze
Someone should fix en.wikipedia, which redirects Data Race to Race Condition.

------
awinter-py
> Generally speaking, some kind of external timing or ordering non-determinism
> is needed to produce a race condition

race conditions don't rely on non-determinism -- two overlapping CAS
operations (compare-and-swap) with deterministic read-read-write-write order
will race every time.

There are realistic code examples that produce this order in the wild -- the
issue isn't non-determinism, it's lack of synchronization around the CAS.

~~~
zAy0LfpBZLC8mAC
Two overlapping CAS operations that deterministically get executed in a
particular interleaving produce a deterministic result and as such are, by
definition, not a race condition.

A race the result of which is known ahead of time is not a race.

~~~
awinter-py
it's an order-dependent logic error caused by lack of synchronization

what would you call it if not a race?

~~~
zAy0LfpBZLC8mAC
If the order is deterministic, there is nothing "order-dependent", other than
in the sense that every calculation is order-dependent, in that any
calculation might give you different results if you changed the current
deterministic order of operations to a different order of operations.

For the same reason, there is also no lack of synchronization. If the result
is wrong, it's simply a wrong calculation/a logic error, and if that is
because the order of operations is wrong, then that is because the order of
operations is wrong, not because of some race that doesn't happen.

~~~
blackflame7000
If you have a Hyperthreading CPU, the order of operations on a given code
block can change from sequential to parallel depending on the availability of
specific ALUs. Many times the code can be correct until something external
causes either more or less hyperthreading to occur, which exposes dormant bugs.

~~~
zAy0LfpBZLC8mAC
... in which case the ordering of operations is not deterministic, so it's not
relevant to this thread of the discussion?

~~~
awinter-py
in my example I'm launching two compare-and-swap operations (op1 and op2) at
the same time.

The correct order (in the presence of synchronization) would be [op1_read,
op1_write, op2_read, op2_write].

The incorrect order (without synchronization) is [op1_read, op2_read,
op1_write, op2_write].

In the second trace, op2_write will overwrite op1_write with the data from its
stale read. The correctness issue here isn't dependent on nondeterminism, just
on lack of synchronization.

~~~
zAy0LfpBZLC8mAC
> In the second trace, op2_write will overwrite op1_write with the data from
> its stale read. The correctness issue here isn't dependent on
> nondeterminism, just on lack of synchronization.

No, it's simply a wrong order. When the order of operations is deterministic,
that means, by definition, that they are synchronized. The fact that you can
change the order by inserting or replacing instructions does not mean that the
code lacked synchronization, and it is irrelevant that you could use those
same instructions in a different context to synchronize operations. In this
context, operations are already synchronized, so nothing you could do can
possibly "synchronize them more", you only can reorder them (or desynchronize
them, potentially, by introducing non-determinism in the order).

Synchronization is what you use to restrict the ordering of operations. When
there is only one possible ordering of operations (i.e., the order is
deterministic), there is nothing to restrict there. When the order created by
some synchronization is the wrong order for correctness of the software, that
doesn't mean that the code is lacking synchronization, it simply means that it
does order operations wrong.

~~~
awinter-py
simple case of this is an increment operation

the programmer has written an incorrect function that does a SQL read, parses
a value and then does an update. It doesn't use a transaction (i.e. no
synchronization) so this is unsafe to use in parallel.

and someone else uses their function without reading carefully, spinning off
two invocations and awaiting them both

I suspect you'll agree this is incorrect in that the value will be incremented
once instead of twice.
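
A sketch of that scenario, assuming an in-memory SQLite database and a made-up `counters` table (the helper names `read_value` and `write_value` are illustrative). The two "invocations" are interleaved by hand in the read, read, write, write order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('hits', 0)")


def read_value() -> int:
    row = conn.execute("SELECT value FROM counters WHERE name = 'hits'").fetchone()
    return row[0]


def write_value(value: int) -> None:
    conn.execute("UPDATE counters SET value = ? WHERE name = 'hits'", (value,))


# Two overlapping increments: both read before either one writes.
op1_seen = read_value()
op2_seen = read_value()
write_value(op1_seen + 1)
write_value(op2_seen + 1)   # overwrites op1's write using a stale read
lost_update_result = read_value()
print(lost_update_result)   # 1, although two increments ran

# Letting the database do the read-modify-write in a single statement.
write_value(0)
conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'hits'")
conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'hits'")
atomic_result = read_value()
print(atomic_result)        # 2
```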

If you don't call this a race condition, what do you call it?

~~~
zAy0LfpBZLC8mAC
Your original scenario was one with deterministic execution order; this one, as
far as I can tell, is not ... so, how is this example relevant to the
discussion?

~~~
awinter-py
if you make some assumptions about how long IO takes and how the systems order
IO, this overlapping-increment example will have a deterministic order and
still be wrong every time (increment once instead of twice).

The assumptions are that op1 and op2 are started in order and that your
database replies in the order received.

~~~
zAy0LfpBZLC8mAC
> if you make some assumptions about how long IO takes and how the systems
> order IO, this overlapping-increment example will have a deterministic order
> and still be wrong every time (increment once instead of twice).

So? Is every piece of code that gives the wrong result a race condition?

------
javert
This guy really knows how to insult his audience. (By repeatedly saying "This
is really obvious...")

~~~
wtetzner
The word "obvious" appears exactly twice in the article, and in neither place
is it used in a way I would consider condescending or insulting.

~~~
javert
> Everything in this post is pretty obvious

When someone is struggling to understand something, it's insulting to tell him
or her that it's obvious. It's implying that he or she is a moron.

It's only not condescending or insulting if the material actually is obvious.

And in this case, it's not.

I had to re-read one part that he called "obvious," and I used to do computer
science research in concurrency. Granted, I had been reading quickly.

There is no way any of this is "obvious" to a student learning about
concurrency for the first time, and Regehr is a professor and has students, so
that category of people is presumably part of his audience.

There are HN posters on this thread, presumably intelligent, who did not
completely follow the post, so again, he's pretty much calling these people
morons, since they didn't follow something that he _repeatedly_ insists is
"obvious."

> Everything in this post is pretty obvious, but I’ve observed real confusion
> about the distinction between data race and race condition by people who
> should know better (for example because they are doing research on
> concurrency correctness)

So there's another class of morons that can't understand the "pretty
obvious"... some concurrency researchers.

