
Google Is Uncovering Hundreds of Race Conditions Within the Linux Kernel - pjmlp
https://www.phoronix.com/scan.php?page=news_item&px=Google-KCSAN-Sanitizer
======
elchupanebre
Yay Konstantin Serebryany and his team! The dude behind these sanitizers is
quite brilliant. The Universe should know its heroes.

------
ChuckMcM
Race conditions create meta-stable states. Fixing them always increases
predictability of the code. It can also result in fixing formerly "cosmic ray"
type bugs that occurred once or twice and were never seen again.

This is because one of the sources of very hard to reproduce bugs is a set of
race conditions aligning just right.

~~~
hazeii
Meta-stability [0] is a hardware issue, surely? Race conditions can create
some crazy effects (BTDT) but at least they are fixable in software.

[0]
[https://en.wikipedia.org/wiki/Metastability_%28electronics%2...](https://en.wikipedia.org/wiki/Metastability_%28electronics%29)

~~~
ChuckMcM
With all due respect to Wikipedia, the term meta-stability is applicable to
any system issue where the behavior of the system is undefined when one or
more of its inputs can be undefined.

The inputs to "software" finite state machines are typically inputs in the
form of variables. Those inputs define which branches within the routine will
be taken during processing.

You can model a subroutine as an FSM for which its "outputs" are its state,
the FSM takes inputs, processes them, and sets new outputs or a new state.

We use fuzzing to expose the state machine to all possible combinations of
inputs and this identifies all possible exit states from all possible input
states, _but assumes the input states are stable._

Multi-processing introduces the possibility of race conditions. In a race
condition, one input state is present when the state machine is entered, but
_during execution_ the race completes which changes the input value.

We use 'stable' to define an input that has the same value across the entire
duration of the state machine's state execution.

During a race condition, an input may change value one or more times across
the time interval of the state machine's state execution. Thus at any
_instant_ in time the input has a singular value, during an _interval_ in time
that value may have many different values. These inputs are meta-stable.

And yes, you can fix meta-stability in software with things like mutexes and
execution exclusion. Just like you can fix meta-stability bugs in hardware
signals by adding a clock synchronization domain that spans the widest period of
meta-stability possible for a signal.

One of the things that makes race condition bugs so hard to debug is that your
typical tracing facility _assumes_ that the inputs passed to a function are
stable and don't change. Thus you'll see a trace record of a function call,
with its parameters, and walk through the code and say "Wait, with those
parameters this code could never do what it just did." Or conversely, that
variables that have shared write status across execution domains will be the
same throughout a function.
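
A minimal sketch of that "impossible trace", in Python for brevity. The names (`shared`, `set_limit`, `classify_unstable`, `classify_stable`) are made up for illustration, and the `between_reads` hook is an assumption that stands in for another thread being scheduled mid-function, so the interleaving is reproducible:

```python
import threading

shared = {"limit": 10}     # input shared between execution contexts
lock = threading.Lock()


def set_limit(new_limit):
    """Writer side: change the shared input under the lock."""
    with lock:
        shared["limit"] = new_limit


def classify_unstable(value, between_reads=lambda: None):
    """Reads the shared input twice, so it can change mid-call."""
    if value <= shared["limit"]:                 # first read picks the branch
        between_reads()                          # another context runs here
        return f"{value} <= {shared['limit']}"   # second read may differ
    return "rejected"


def classify_stable(value, between_reads=lambda: None):
    """Snapshots the input once under the lock; the local copy cannot change."""
    with lock:
        limit = shared["limit"]
    if value <= limit:
        between_reads()
        return f"{value} <= {limit}"
    return "rejected"


# Called with value=8 while the limit is 10, yet the result claims "8 <= 5".
unstable = classify_unstable(8, between_reads=lambda: set_limit(5))

set_limit(10)
stable = classify_stable(8, between_reads=lambda: set_limit(5))
```

The snapshot is only one way to get a stable input; holding the lock for the whole call would also work, at the cost of blocking writers for longer.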

Does that make it clearer what I was talking about? Race conditions suck :-)

~~~
eyko
I think Wikipedia would benefit from your contribution!

------
HashThis
This is why open source for lower level systems is amazing!! That is why
graphic card drivers should be open sourced

~~~
fulvous
They are for 2 of the biggest GPU vendors.

~~~
CobrastanJorji
Is nVidia no longer one of the two biggest GPU vendors?

~~~
metamet
I don't think that's what was meant. "2 of the biggest GPU vendors" !== "the 2
biggest GPU vendors".

~~~
sriram_sun
Nice distinction! Thank you.

------
noego
I'm not surprised that there are vast numbers of latent low-probability race
conditions in the Linux kernel or any major software project. Having seen the
testing process in both hardware and software projects, the two are not even
comparable. The testing process in most software projects is dominated by
fully deterministic unit tests that are very simple in nature, and make large
numbers of assumptions about the behaviors and interactions with other
components. Low-probability race conditions between different components are
exactly the outcome I would expect from this testing process.
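
A toy sketch of the difference, not taken from any kernel test suite: instead of one deterministic run, enumerate every interleaving of two unsynchronized `counter += 1` operations, each split into a read step and a write step. The helper names `interleavings` and `run` are hypothetical:

```python
def interleavings(a_steps, b_steps):
    """Yield every merge of the two step lists that keeps each list's own order."""
    if not a_steps:
        yield list(b_steps)
        return
    if not b_steps:
        yield list(a_steps)
        return
    for rest in interleavings(a_steps[1:], b_steps):
        yield [a_steps[0]] + rest
    for rest in interleavings(a_steps, b_steps[1:]):
        yield [b_steps[0]] + rest


def run(schedule):
    """Execute one schedule and return the final counter value."""
    counter = 0
    local = {}                      # each thread's private copy of the counter
    for thread, op in schedule:
        if op == "read":
            local[thread] = counter
        else:                       # "write": store the stale copy plus one
            counter = local[thread] + 1
    return counter


steps_a = [("A", "read"), ("A", "write")]
steps_b = [("B", "read"), ("B", "write")]
results = [run(schedule) for schedule in interleavings(steps_a, steps_b)]
lost_updates = results.count(1)     # schedules where one increment vanished
```

A single unit test that runs A and then B only ever exercises one of these schedules, which happens to be a correct one.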

[https://software.rajivprab.com/2019/04/28/rethinking-
softwar...](https://software.rajivprab.com/2019/04/28/rethinking-software-
testing-perspectives-from-the-world-of-hardware/)

------
plorg
Maybe this will do a better job of identifying the race condition that made
the AMD card in my laptop unusable. About 5 years ago a bunch of changes
evinced a change in behavior when trying to switch hybrid graphics
controllers. I worked on a bug report for a long time after I and a few others
fingered it as a race condition (it failed maybe 1/10 switches). A cluster of
other changes meant it was difficult to bisect (it broke one way at a few
points and broke differently at other points, but in a way that made it
difficult to identify whether the bug we were triaging existed at that
point.)

------
Havoc
Solid contribution. Will hopefully improve stability even further.

Does this automatically generate fixes too, or does someone need to
invest time investigating each one by hand?

~~~
sanxiyn
It needs investigation by hand. I don't see how that can be automated.

~~~
Havoc
I would have thought so too, but the github says:

>kcsan-with-fixes: Contains KCSAN with various bugfixes for races detected;
the commit messages for those bugfixes include the KCSAN report as-is.

Which to me seems to imply some sort of automatic mitigation. Maybe I'm
reading too much into it

~~~
sanxiyn
Nope, that's just the repository with manual fixes.

------
pedrow
For those interested, here are some of the bugs which their tool has found:
[https://github.com/google/ktsan/wiki/KTSAN-Found-
Bugs](https://github.com/google/ktsan/wiki/KTSAN-Found-Bugs)

~~~
sanxiyn
No, that's KTSAN, which is different from KCSAN, although they do find similar
bugs.

------
mikorym
Are there any HN readers here that use MINIX or possibly another OS that does
not have the Linux kernel? I'm not about to argue pros/cons but would like to
see people's use cases; my own use cases have not necessitated using anything
more complicated than Ubuntu, and quite happily so.

Still, I think from my interest point of view it would be interesting to not
only understand the Linux kernel better, but also other OS design paradigms.

~~~
beefhash
I swear by OpenBSD. I use it on my home server. I use it to write C code
(their C standard library has a few very nice things, such as arc4random,
strl{cat,cpy}, explicit_bzero, timingsafe_memcmp, libtls, and other things
that I run into often enough that I don't want to think about them before I
start adding portability stuff).

Of course, it's not much of an option for anything where I require proprietary
software.

~~~
__turbobrew__
OpenBSD has opened my eyes to how deficient documentation is in Linux.

~~~
mikorym
Meaning it has good documentation?

------
yogthos
I'd be shocked if something written in C that's as complex as the Linux kernel
didn't have race conditions.

~~~
adtac
The language a program is written in has no bearing on whether or not the
program is susceptible to race conditions. You can have data races in Python,
Javascript and Rust as easily as you can in C.

~~~
loonyphoenix
You cannot have data races in safe Rust.

~~~
adtac
Race conditions aren't necessarily data races. This article and the parent
comment were about race conditions in general.

~~~
loonyphoenix
The parent comment literally says

> You can have data races in Python, Javascript and Rust as easily as you can
> in C.

Don't use "race conditions" and "data races" interchangeably if you understand
the difference...

------
ptah
I wonder if any software relies on these bugs

EDIT: Windows famously had bugs and had to add special code to preserve the
buggy behaviour to keep certain applications that relied on them working
[https://www.joelonsoftware.com/2004/06/13/how-microsoft-
lost...](https://www.joelonsoftware.com/2004/06/13/how-microsoft-lost-the-api-
war/)

~~~
asveikau
You assume every race is a bug.

Part of the job of a kernel is to resolve races. Requests come to it from
multiple threads and processes in a parallel fashion. Sometimes your timing
will be one way and get you one result, sometimes it will go another way.
That's OK. It's the nature of the beast.

~~~
pjc50
This is a misunderstanding. A situation where "either process A or process B
gets a resource first" is not a race condition. A situation where both of them
get it at the same time is.

~~~
asveikau
Ok, I was being imprecise.

Even if there is parallelism and multi-core, it is still a question of
ordering, and from there we can use language like who gets there "first"
[indeed "race" suggests this]. Or, did this read of a machine word see
somebody else's write, etc.

------
namanaggarwal
I am wondering if machine learning can be used to solve these problems. I
have heard that Google uses machine learning to automatically find bugs and raise
merge requests against them. Can someone from Google confirm?

~~~
jacquesm
You don't need machine learning, you need a better architecture. Micro kernels
can be made in such a way that they are guaranteed race condition free.

~~~
adrianN
If you're willing to throw a large part of the existing code away to move to a
microkernel architecture, you might as well rewrite it in a language that
doesn't have race conditions by design.

~~~
jacquesm
Race conditions can be constructed in almost any language. It is more of a
systemic thing than a language artifact though some languages are more race
condition prone than others.

You can even have race conditions in hardware. Multi-threading or distributed
software are excellent recipes for the introduction of race conditions,
sometimes so subtle that the code looks good and the only proof that there is
something bad going on is the once-in-a-fortnight system lockup.

~~~
lokedhs
I believe the undocumented instructions in the 6502 were caused by race
conditions in the CPU. Some of these instructions are actually usable while
others give random results.

~~~
jacquesm
I wrote a 6502 assembler and had a really hard time not giving $88 its own
mnemonic in the main table rather than to load it at the start of every file
from an include. It was just too useful :)

~~~
lokedhs
Why wouldn't you? These opcodes seem to be used by most people these days, so
there seems to be no reason not to treat them just like any other opcode?

------
adrianN
I really like the development of all the sanitizers we got in the last decade
or so. I only wish that more could be done at compile time as opposed to these
runtime checks.

~~~
vinceguidry
Is there even an approach to doing that? Intuition tells me that's a
mathematically intractable problem.

~~~
gameswithgo
Rust prevents data races at compile time. Note that data races, while a common
form of race condition, are not the only kind of race condition, and Rust won't
prevent the others.
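
A small sketch of a race condition that is not a data race, written in Python rather than Rust to keep it short. The `LockedSet` class and the seat-booking scenario are hypothetical, and the unlucky interleaving is written out sequentially by hand (an assumption, so the outcome is reproducible):

```python
import threading


class LockedSet:
    """Every individual operation holds the lock, so no access is a data race."""

    def __init__(self):
        self._items = set()
        self._lock = threading.Lock()

    def contains(self, item):
        with self._lock:
            return item in self._items

    def add(self, item):
        with self._lock:
            self._items.add(item)

    def add_if_absent(self, item):
        """Check and insert as one critical section; True if we inserted."""
        with self._lock:
            if item in self._items:
                return False
            self._items.add(item)
            return True


# Racy check-then-act: both callers check before either one has added.
booked = LockedSet()
a_sees_free = not booked.contains("12A")
b_sees_free = not booked.contains("12A")
winners = []
if a_sees_free:
    booked.add("12A")
    winners.append("A")
if b_sees_free:
    booked.add("12A")
    winners.append("B")

# Fixed: the check and the insert are a single atomic step.
fixed = LockedSet()
fixed_winners = [name for name in ("A", "B") if fixed.add_if_absent("12A")]
```

In the racy version both callers believe they own the seat even though every memory access was synchronized; that class of bug is about ordering, which a data race checker alone does not rule out.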

------
asveikau
Doesn't the kernel make use of some lock free data structures? Like RCU? I
would be surprised if there are no races there, but the point is that they are
harmless and handled.

~~~
sanxiyn
Lock free data structures still use atomics, which this tool understands. All
bugs found really are data races.
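
Roughly, the atomic building block behind many of these structures is a compare-and-exchange retry loop. The sketch below is Python, so the `AtomicInt` class is hypothetical and a lock only emulates what the hardware instruction does atomically; it is not how the kernel implements it:

```python
import threading


class AtomicInt:
    """Emulates an atomic word; the lock stands in for hardware atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def compare_exchange(self, expected, new):
        """Store `new` only if the current value equals `expected`."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(counter):
    """Lock-free style increment: retry until nobody changed the value under us."""
    while True:
        seen = counter.load()
        if counter.compare_exchange(seen, seen + 1):
            return seen + 1


counter = AtomicInt()


def worker():
    for _ in range(1000):
        increment(counter)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = counter.load()
```

Threads still race to update the word, but a loser simply retries, so no increment is lost.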

~~~
tytso
The tool doesn't understand all lockless techniques, so there will be some
false positives. For example, there are some cases where people have used the
low-level primitives barrier() and cmpxchg(), which this tool (not possessing
human-level intelligence) can't analyze.

Also, not all data races can be exploited into serious bugs (in some cases,
some stats might just be incorrect, for example).

That doesn't make the tool useless, of course! Just that one should take the
numbers with a grain of salt.

~~~
sanxiyn
I don't see how any lockless techniques can cause false positives on this
tool, since this tool ignores them. The tool instruments plain memory
accesses. For each access, with some probability, it either 1. sets up a watchpoint, delays,
and then deletes the watchpoint, or 2. checks for a watchpoint. If there is a matching
watchpoint, two threads made "simultaneous" accesses, hence data race.
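
A toy model of that scheme, assuming nothing about the tool's real internals. `ToyWatchpoints`, `begin_access`, and `end_access` are hypothetical names; the sampling decision and the delay window are explicit here so the example is deterministic, whereas the real tool relies on compiler instrumentation and an actual delay. Requiring at least one of the two accesses to be a write is the usual definition of a data race, added as a simplification:

```python
class ToyWatchpoints:
    """Toy model of watchpoint-based race detection (not the real tool's code)."""

    def __init__(self):
        self.active = {}    # address -> (thread name, is_write)
        self.reports = []   # (address, watching thread, intruding thread)

    def begin_access(self, thread, addr, is_write, sample):
        """Instrumented access; returns True if it set up a watchpoint."""
        watcher = self.active.get(addr)
        if watcher is not None:
            other_thread, other_is_write = watcher
            # Two threads touch the same address inside the delay window.
            if other_thread != thread and (is_write or other_is_write):
                self.reports.append((addr, other_thread, thread))
            return False
        if sample:
            self.active[addr] = (thread, is_write)
            return True
        return False

    def end_access(self, addr):
        """The delay is over: delete the watchpoint."""
        self.active.pop(addr, None)


COUNTER = 0x1000
wp = ToyWatchpoints()
wp.begin_access("A", COUNTER, is_write=True, sample=True)     # A watches, then "delays"
wp.begin_access("B", COUNTER, is_write=False, sample=False)   # B reads during the delay
wp.end_access(COUNTER)
wp.begin_access("B", COUNTER, is_write=False, sample=False)   # after the window: no match
reports = wp.reports
```

Nothing in the model knows about locks or lockless algorithms; it only sees whether two plain accesses overlapped in time.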

------
nialv7
1) data races are different from race conditions. and data races are what this
sanitizer detects.

2) data races are undefined behavior in C, but:

3) they don't necessarily translate to bugs in practice. for example:

[https://godbolt.org/z/Z9RvpB](https://godbolt.org/z/Z9RvpB)

this is a data race, but in practice everything works as expected.

in cases like this, fixing the data race adds overhead without giving us much
benefit.

------
slacka
Any chance of this sanitizer being generalized for use outside of the kernel?
Could we someday use this like UBSan?

~~~
threadsanitizer
I'm not sure about other platforms, but in Xcode, there is the ThreadSanitizer
option in the Scheme diagnostics. Sounds similar to the Kernel Thread
Sanitizer mentioned in the article. I assume it's part of clang, but am not
positive of the under-the-hood implementation. If so, it may be available on
other platforms. (I mention it because Xcode also has options to use Address
Sanitizer and Undefined Behavior Sanitizer, so perhaps it's the same thing as
the Kernel Thread Sanitizer but for user-space stuff?)

~~~
dang
Could you please stop creating accounts for every few comments you post? We
ban accounts that do that. This is in the site guidelines:
[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html).

HN is a community. Users needn't use their real name, but do need some
identity for others to relate to. Otherwise we may as well have no usernames
and no community, and that would be a different kind of forum.
[https://hn.algolia.com/?query=by:dang%20community%20identity...](https://hn.algolia.com/?query=by:dang%20community%20identity&sort=byDate&dateRange=all&type=comment&storyText=false&prefix&page=0)

------
elbelcho
Reading the article, it seems like these are from an automated tool. I wonder
how many of these are actual race conditions, vs false positives because of
logic that the tooling can't decipher.

------
Simulacra
This is a non-issue and really a false narrative blown up into something that
isn't really an issue. Any concern of society can be fed into an AI and out
comes a desired result. It's a computer program.

------
LeonM
Boy, there are a lot of negative comments here. Google is contributing to the
Linux kernel and released their tool as open source. Why all the hate?

Anyway, back on topic: would fixing these race conditions be only beneficial
to stability, or would this also improve performance/responsiveness?

~~~
eloff
Fixing races generally means stricter synchronization, which makes for slower
code. The world is complex and there are always exceptions, but I'd expect
performance to decrease.

Edit: it goes without saying, but it's much better to fix the race conditions,
regardless of what it means for performance. Some people seem to think I'm
advocating for not fixing them.

~~~
dekhn
"brakes make my car slower, so I'd expect my average velocity to decrease if
you fixed them"

~~~
mindcrime
_"brakes make my car slower, so I'd expect my average velocity to decrease if
you fixed them"_

Just to be pedantic... in some contexts better brakes actually allow you to go
faster. Consider racing on an oval track... a car with better braking ability
can maintain speed longer as it approaches a corner, then scrub off speed more
quickly to navigate the corner. A car with inferior brakes has to start
slowing down sooner or risk crashing by taking the corner too fast. So better
brakes can lead directly to faster lap times.

~~~
nickserv
Indeed, disc brakes were first used on race tracks before being available on
street cars.

~~~
discreteevent
And some people now argue that disc brakes on road bicycles are more dangerous
because you learn that you can brake later and more suddenly. I.e You will
more frequently approach the limit of friction between the (very narrow) tyre
and the road.

~~~
mindcrime
I had not heard that one. I always thought the main argument against disc
brakes on road bikes was that they aren't good for long descents where you're
riding the brake for long periods of time... due to heat fade or whatever.

Of course both things could be true...

~~~
magduf
>I always thought the main argument against disc brakes on road bikes was they
they aren't good for long descents where you're riding the brake for long
periods of time... due to heat fade or whatever.

No, you have it the other way around. Rubber brake pads on wheel rims fade
with heat; disc brakes are able to dissipate far more heat, and are basically
essential for long descents.

The _only_ valid argument against disc brakes on bikes is that they weigh a
little more (maybe 1 pound). I suppose you could also argue that brake fluid
is more trouble to deal with than a cable, and certainly not as easy to jerry-
rig, but hundreds of millions of cars use hydraulic brakes without any trouble
these days, and I wouldn't want to have jerry-rigged brakes anyway.

------
tmikaeld
We're getting closer to "How Google got me fired for the 532 bugs it found in
my code..."

EDIT: It was meant as a joke as to how media manages to make something that's
good into a problem.

~~~
LandR
In any decent organisation, a bug that goes out to live isn't the fault of the
developer who wrote the code.

It's a team fault, meaning that maybe the PO didn't spec it correctly, then the
developer implemented something wrong, then this was missed in the code
review, then missed in team testing, then missed in testing before going live.
Missed in the PO sign off before live...

If anything I would hope that stuff like this by Google helps to encourage
orgs to build better processes. If you succeed as a team, you fail as a team.
No scapegoats!

~~~
lmkg
> In any decent organisation

Yes, and in my physics classes I learned a lot about spherical, frictionless
cows in a perfect vacuum.

In the real world, the shape of cows is not a sphere, nor even a closed-form
equation. And real organizations care about blame, and as the saying goes, shit
rolls downhill.

~~~
jknoepfler
It's your responsibility to work on a team that handles bugs like adult
engineers rather than school children or politicians. Engineers have a duty to
be spokespersons for rational and humane engineering processes.

I've shipped more than my share of bugs live. I've never felt like I didn't
have a team behind me. Obviously I shipped a lot of things that work, as well.

Working someplace humane and rational doesn't just happen, it requires work.

I expect anyone with "Senior" in their job title to do that work, both in
terms of setting team culture and establishing post-mortem policies with
management.

------
kirkil1
Linux will be the mainstream kernel for anything eventually. And companies
will bundle it with their proprietary solutions.

------
stunt
Not to mention there are +9K open source contributors on the KCSAN project.

~~~
sanxiyn
Nope, that's because it is developed as a fork of the Linux kernel. Those +9K
people are Linux kernel contributors.

------
fbn79
I think it would be more helpful if Google invested in modeling and describing the Linux
kernel with TLA+. I googled and found only one person trying to do it
([https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/ker...](https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-
tla.git/))

------
Simulacra
Seriously...?

“Google is uncovering millions of gender preference in literature...” /sarcasm

~~~
dang
" _Eschew flamebait. Don 't introduce flamewar topics unless you have
something genuinely new to say. Avoid unrelated controversies and generic
tangents._"

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

------
gnode
I can't help but feel most of these bugs and vulnerabilities in the Linux
kernel could be avoided with a more robust foundation than C. It's nice to see
that Rust is starting to be tolerated in the Linux kernel for modules.

It would be nice to see an effort to migrate some of the kernel core, but I
can't see Rust gaining widespread acceptance in the kernel development
community any time soon.

~~~
rvz
> It would be nice to see an effort to migrate some of the kernel core, but I
> can't see Rust gaining widespread acceptance in the kernel development
> community any time soon.

I don't even think that a migration effort from C -> Rust for the Linux Kernel
is even feasible or even worth it, given the scale of the project. At this
point, you might as well start from scratch.

Google is already experimenting with Rust for OS development with their new
Fuchsia operating system [0], which has some drivers written in Rust with a
capability based security model.

[0] [https://fuchsia.dev](https://fuchsia.dev)

~~~
gnode
Developing a new kernel or operating system in Rust is a good idea, but a
massive undertaking, and it's hard to see it gaining adoption quickly. I think
a more credible way to benefit from Rust in the short term is to rewrite small
but critical components of existing kernels, or adopt Rust components from
experimental Rust OS projects.

Similarly, Firefox's Project Quantum seeks to bring Rust to Firefox from the
more experimental Servo project.

------
vkaku
Unless these linting tools go out there with source, I'm going to be skeptical
of Google's magic code checking and bug finding process.

And IMHO, such a large number of patches / fixes on Linux should always use a
decent open audit process.

~~~
LeonM
> Unless these linting tools go out there with source

The tool _is_ released as open source.

> such large number of patches / fixes on Linux should always use a decent
> open audit process

All contributions to the Linux kernel are audited by the kernel development
team. Look at kernel.org, you'll see that contributions are always signed off
by at least one other kernel dev, and often also Linus himself.

------
throwaway1967a
Google is a corporation. The linux kernel is [by implication] developed and
maintained by Linus. Please don't do this. If there are race conditions, fix
them following the process of kernel.org.

~~~
jeffreyrogers
Something like 80+% of commits to the kernel are by people paid to work on
Linux (Google employees among them). This is just part of the Linux
development model and has been for a long time.

~~~
non-entity
Off topic, but what companies, besides Red Hat or Google, will pay you to work
on Linux? I imagine that would be an enjoyable job.

