
Threads can infect each other with their low priority - Dobiasd
https://github.com/Dobiasd/articles/blob/master/threads_can_infect_each_other_with_their_low_priority.md
======
RossBencina
As far as I can establish with a cursory skim, the article describes priority
inversion, and states:

"OS schedulers might implement different techniques to lessen the severity of
this problem, but it's far from being solved once and for all."

The article fails to describe what "solved once and for all" might mean.

Given that Priority Ceiling Protocol and Priority Inheritance are two common
solutions used in real-time schedulers, I would like to understand why
priority inversion is not a solved problem.

On the other hand, I have yet to learn whether Mac OS has any synchronisation
primitives that mitigate mutex priority inversion.

~~~
Animats
_Given that Priority Ceiling Protocol and Priority Inheritance are two common
solutions used in real-time schedulers, I would like to understand why
priority inversion is not a solved problem._

Because Windows and Linux are not designed for hard real time. There's a
"rt_mutex" thing in Linux, which does priority inheritance, but it's not used
much.

QNX took this seriously; it has priority inheritance not just for mutexes, but
for interprocess messaging and I/O too. But it's a real-time OS. It passes
through not only the priority, but also the CPU quantum for non-hard-real-time
tasks. So a group of rapidly interacting processes is effectively scheduled
as if it were one process that happens to cross address spaces.

On most non-real-time OSs, if you have a microservices-type program which is
rapidly passing control back and forth between multiple processes, it will
work fine as long as there's spare CPU time. If there's no spare CPU time,
performance suddenly drops badly. That's because the processes being released
by mutex unlocks go to the end of the line for CPU time.

This is something microkernel designers take seriously, because it has a big
impact on microkernel performance. Unikernel designers, not so much.

~~~
RossBencina
I agree, there are operating systems where few guarantees are made about
protection from priority inversion, at least not without specific programmer
intervention. As a programmer you need to be aware that priority inversion can
be a thing.

However, this has no bearing on whether or not priority inversion is a "solved
problem". "Solved problem" generally means that there are well known, widely
studied technical solutions to a problem. My understanding right now is that
priority inversion _is_ a solved problem -- it is a standard topic in any
introductory textbook on real-time systems with solutions that I stated. But
maybe I'm wrong -- certainly the author seems to think so -- hence my question
stands: _I would like to understand why priority inversion is not a solved
problem._

~~~
Dobiasd
> I would like to understand why priority inversion is not a solved problem.

Maybe I was just not careful enough in choosing this phrasing when writing the
article.

Do you think it would make sense if I change "but it's far from being solved
once and for all." to "each of them with different upsides and downsides."?

~~~
RossBencina
Trade-offs, yes. I think I understand now that you meant that it is common that
application-level priority inversion is not automatically avoided by the OS --
programmers still need to worry about it. Even if theoretical solutions exist,
they may not be available on your chosen platform, or they may need to be
manually configured (e.g. PTHREAD_PRIO_INHERIT).

~~~
Dobiasd
Thanks. I committed a change
([https://github.com/Dobiasd/articles/commit/8ac8db11eb95de688...](https://github.com/Dobiasd/articles/commit/8ac8db11eb95de688abbe1e8975e0d40a428df83)).
Feel free to open a PR if you have a better phrasing in mind. :)

------
chaboud
Solution from the article: Don't fiddle with thread priorities.

Solution that works well in real life: Use lock-free (or, better still, wait-
free) data structures.

The trick is that, for those of us in the devices and media world, and despite
not always using RT schedulers, we generally need (largely) consistently
prioritized performance for critical operations. For example, I actually need
my device audio buffer callback to be timely to avoid displeasing gaps.

How have whole industries managed to deliver sellable products in these spaces
without resorting to RT kernels? In many cases, careful uses of thread
priorities and, yes, mutexes, have been involved.

This article feels like it was written by someone half-way through:

Junior engineer: Just use thread priority.

Senior engineer: No! Thread priority is a minefield.

Principal engineer: We're going to carefully walk through this minefield.

~~~
Dobiasd
> This article feels like it was written by someone half-way through [...]

You might be right. I'm not a principal engineer. :)

The first solution we came up with in the wild back then, when confronted with
this problem, actually was to use a lock-free queue. Only later did we figure
out that we did not need the prioritization after all, and our software worked
just fine with all threads running at the default priority.

~~~
chaboud
Jr./Sr./PE is just a strawman for illustrative purposes. No ranking aspersions
intended... In order I've been SDE, Sr. SDE, Senior PE, Sr. SDE, PE... It's
pretty meaningless outside of the scope of one company's system.

There are plenty of hardware/software systems out there that leverage thread
priority to keep low latency operations humming along (e.g. clearing a
hardware FIFO) or keep background operations out of the way (ish) when the
system is heavily loaded.

The interactions between threads are often sparse and directional (e.g.
producer/consumer) in a reasonably designed system. Needing to play with
priority to govern the interactions between one's own internal thread
boundaries can be a bad code smell. Using it to better satisfy external
restrictions (e.g. user input, hardware interaction, callback handling) isn't
too uncommon if you want to maximize the utility of your hardware.

------
jacquesm
> If possible, avoid headaches by just not fiddling around with thread
> priorities.

No, learn the ins and outs of the primitives of your OS. Threads are perfectly
fine to run at different priorities, but as soon as you start communicating
between them they become a chain of sorts, and if resources are exhausted (such
as a queue reaching its maximum size), or if you artificially couple the threads
using some kind of sync mechanism, then your priorities will not be what you
want them to be.

Prioritization works well for _independent_ threads, not for dependent
threads, which are somewhat closer to co-routines that happen to use the
thread scheduler rather than calling each other directly.

~~~
gdxhyrd
The priority inversion the article describes does not need an exhausted queue,
merely a used one.

~~~
jacquesm
That's because that particular queue is not implemented lock-free. If it were,
you'd have to exhaust it first.

------
MaximumYComb
This article seems like three pages of my CS Operating Systems unit summarised.
Since I started reading Hacker News I've always assumed the userbase has an
education equivalent to at least a CS degree.

Am I wrong for assuming that this article is fairly basic for the user
base here? It just seems to me that if you can read this article and
understand push, pop, thread priority, mutex, transitive, etc., then it's more
than likely that someone has already lectured you about the issues that can
arise when using mutexes for locking.

~~~
gdxhyrd
I would say your assumption is wrong. A lot of people have a CS background,
but there are many others that don't.

~~~
blablabla123
Also there are different approaches. I hardly do any low-level threading but
I'm aware of the fact that locking and context switching can be expensive. I
don't think that one always has to learn theory first, it can also go the
other way.

------
cryptonector
This is called priority inversion.

One way to deal with this is to make it so that when a low-priority thread
dequeues a message from a high-priority thread then the low-priority thread
temporarily inherits the client's higher priority, then later goes back to its
lower priority. The problem is identifying when to go back. Solaris doors was
an IPC mechanism that did this well, but at the price of being synchronous --
when you throw asynchrony in, that approach doesn't work, and you really do
want asynchrony. If you trust the low-priority threads enough, you can let them
pick a priority according to that of the client they are servicing at any
moment.

------
Danieru
I was hoping for a cool story of OS bugs corrupting threads. Instead, as
others pointed out, this is just a more verbose explanation of priority
inversion.

I would still like to read the zombie infection story.

------
based2
[https://www.reddit.com/r/programming/comments/e7sb5p/threads...](https://www.reddit.com/r/programming/comments/e7sb5p/threads_can_infect_each_other_with_their_low/)

------
devit
Note that this is only catastrophic when using _realtime_ thread priorities,
i.e. thread priority kinds that cause a runnable higher priority thread to run
instead of a lower priority one regardless of how much time they have been
running already.

On general-purpose OSes, the commonly used thread priorities are not realtime,
and you need admin/root to set realtime thread priorities.

Also, indefinite deadlock can only happen if realtime threads want to take up
more CPU than is available (which in particular requires having at least as
many realtime threads as cores), since otherwise they will eventually all be
sleeping/waiting, allowing any threads waiting on the mutex to run.

------
solids
Can this be summed up as “A consumer cannot be faster than the producer”?

~~~
snovv_crash
The infection goes the other way around here: the producer thread is slowed
down due to the low priority of the consumer.

TLDR for the article: the consumer has a lower priority and is preempted while
holding the mutex lock, but doesn't get rescheduled for a while. The producer
then has to wait for the consumer to get rescheduled and unlock the mutex
before it can enqueue something.

------
snovv_crash
Isn't this an issue on the OS side? It should only actually lock the mutex
once the scheduler returns, and not on entry to the function.

