
How to trigger races reliably in the Linux kernel - sohkamyung
https://people.kernel.org/metan/how-to-trigger-races-reliably
======
loeg
In the FreeBSD kernel, we've got fail(9) for this exact use (since 2009).

Check out the EXAMPLES section of the manual page. You can use the mechanism
to inject a delay on some percentage of executions. More generally, you can
inject an integer value, which can be used to simulate device failures (EIO)
or whatever.

It's very powerful for creating reliable reproducers for all kinds of
failures, including race conditions. And the KPI is not quite as verbose as
the one presented in TFA.

[https://www.freebsd.org/cgi/man.cgi?query=fail&sektion=9](https://www.freebsd.org/cgi/man.cgi?query=fail&sektion=9)
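
To make the idea concrete, here is a minimal userspace sketch in Python of a
fail point that fires on some fraction of evaluations and can inject a delay
or an integer value. It does not use or mirror the real fail(9) macros;
`FailPoint`, `configure`, `evaluate` and `read_block` are hypothetical names
chosen for illustration only.

```python
import errno
import random
import time


class FailPoint:
    """Hypothetical userspace stand-in for a kernel fail point."""

    def __init__(self, name, rng=None):
        self.name = name
        self.probability = 0.0  # fraction of evaluations that fire
        self.delay_s = 0.0      # sleep injected when the point fires
        self.value = None       # integer injected when the point fires
        self.rng = rng or random.Random()

    def configure(self, probability, delay_s=0.0, value=None):
        self.probability = probability
        self.delay_s = delay_s
        self.value = value

    def evaluate(self):
        """Return the injected value if the point fires, else None."""
        if self.rng.random() >= self.probability:
            return None
        if self.delay_s:
            time.sleep(self.delay_s)  # widens a race window
        return self.value


def read_block(fp):
    """Pretend device read; the fail point can force an error result."""
    injected = fp.evaluate()
    if injected is not None:
        return -injected  # e.g. -EIO, as if the device had failed
    return 512            # bytes "read" on the normal path
```

Configuring the point with `probability=1.0, value=errno.EIO` makes every
`read_block` call take the error path, which is what turns a rare failure into
a reliable reproducer.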

------
chrisseaton
> Race condition, in terms of a computer programming, is a bug where two
> pieces of code cause an error if executed concurrently.

This article makes the common mistake of saying that a race condition is
inherently a bug.

A race condition just means that observable program behaviour is dependent on
the interleaving of two tasks. A race condition is only a bug if you don't
want one of the possible interleavings.

You can often find great performance optimisations by working out how to work
with race conditions rather than disallowing them.
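
A small sketch of that definition, assuming two tasks that each perform a
non-atomic increment (a load followed by a store) on one shared variable. The
code enumerates every interleaving and collects the observable results; the
helper names are made up for this example.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's order."""
    n = len(a) + len(b)
    for a_slots in combinations(range(n), len(a)):
        slots = set(a_slots)
        a_iter, b_iter = iter(a), iter(b)
        yield [next(a_iter) if i in slots else next(b_iter) for i in range(n)]


def run(schedule):
    """Execute one schedule of (task, op) steps and return the shared value."""
    shared = 0
    tmp = {}  # per-task scratch register
    for task, op in schedule:
        if op == "load":
            tmp[task] = shared
        else:  # "store"
            shared = tmp[task] + 1
    return shared


task_a = [("A", "load"), ("A", "store")]
task_b = [("B", "load"), ("B", "store")]
results = {run(s) for s in interleavings(task_a, task_b)}
print(results)
```

The set of results contains both 1 and 2, so the program has a race condition
in the sense above. Whether that is a bug depends only on whether a final
value of 1 is acceptable to you.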

~~~
adtac
Is there any situation where a _data race_ might be useful or tolerable?

~~~
chrisseaton
For example, some profilers that count the number of times something is run
will update a counter without any synchronisation. This risks lost updates,
but since you really just need to know if the counter is very low or very high
a few lost updates here or there don't really matter.

But I think under some accepted definitions (such as Padua et al.), a data
race specifically is a bug, so this interleaving, if we're happy with it,
isn't a data race because it isn't a bug.
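
A rough sketch of such a counter, written in Python purely for illustration.
`HitCounter`, `is_hot` and `profile` are hypothetical names, not taken from
any real profiler, and depending on the interpreter you may or may not
actually observe lost updates (CPython's GIL can hide them).

```python
import threading


class HitCounter:
    """Profiling counter updated with no lock; concurrent updates may be lost."""

    def __init__(self):
        self.hits = 0

    def record(self):
        self.hits += 1  # read-modify-write, deliberately unsynchronised

    def is_hot(self, threshold):
        # Only the rough magnitude matters, not the exact count.
        return self.hits >= threshold


def profile(n_threads=4, calls_per_thread=50_000):
    """Hammer one counter from several threads; return it and the true total."""
    counter = HitCounter()

    def worker():
        for _ in range(calls_per_thread):
            counter.record()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, n_threads * calls_per_thread
```

The recorded count can never exceed the true total, only fall short of it.
The bet is that the shortfall is small next to the threshold you compare
against.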

~~~
robocat
> a few lost updates here or there don't really matter

jerf disagrees in the last sentence of another comment:
[https://news.ycombinator.com/item?id=23079788](https://news.ycombinator.com/item?id=23079788)

~~~
chrisseaton
Well, I don't know what to tell you except that this is how real production
compilers written by experts work in practice.

[https://github.com/oracle/graal/blob/4553ea71f15c9e4721565e9...](https://github.com/oracle/graal/blob/4553ea71f15c9e4721565e94e838214a1b1274b1/compiler/src/org.graalvm.compiler.truffle.runtime/src/org/graalvm/compiler/truffle/runtime/OptimizedCallTarget.java#L816-L818)

You can argue this is a bug and they shouldn't write like this, but they don't
consider it a bug and they do write like this.

------
klysm
If you manage to reproduce a race, is there some way to capture exactly what
the scheduler did and force it to do that again?

~~~
jfk13
For debugging purposes, for example? You might be interested in
[https://rr-project.org](https://rr-project.org).
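
The general record-then-replay idea can be shown with a toy cooperative
scheduler: log every scheduling decision on the first run, then feed the same
decisions back in. This is only an illustration of the concept under that
assumption, not a description of how rr is implemented; all names here are
hypothetical.

```python
import random


def run(tasks, choose):
    """Run generator tasks to completion; choose(n) picks who steps next."""
    runnable = [task() for task in tasks]
    trace, log = [], []
    while runnable:
        i = choose(len(runnable))
        trace.append(i)  # remember the scheduling decision
        try:
            log.append(next(runnable[i]))
        except StopIteration:
            runnable.pop(i)  # task finished
    return trace, log


def worker(name, steps):
    """Build a task that emits one observable event per step."""
    def task():
        for k in range(steps):
            yield f"{name}{k}"
    return task


def record(tasks, seed=None):
    """First run: the 'scheduler' picks at random and we keep the trace."""
    rng = random.Random(seed)
    return run(tasks, rng.randrange)


def replay(tasks, trace):
    """Later runs: force exactly the recorded decisions."""
    decisions = iter(trace)
    return run(tasks, lambda n: next(decisions))
```

Replaying a recorded trace with the same tasks yields the same event order
every time, which is the property you want once a bad interleaving has been
caught.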

~~~
throwaway2048
rr isn't going to work on a kernel.

~~~
Hello71
I'm inclined to say that it will work on User Mode Linux, and it looks like
QEMU does have record/replay functionality.

~~~
loeg
Yes, it works as long as you don't have any real devices that cannot be
captured in a reverse-debugger's state history.

------
EvgeniyZh
There is a GSoC project regarding race conditions in Linux:
[https://summerofcode.withgoogle.com/projects/#65400635565015...](https://summerofcode.withgoogle.com/projects/#6540063556501504)

