
Basics of Futexes - ingve
https://eli.thegreenplace.net/2018/basics-of-futexes/
======
bonzini
An important part of futexes, which the article mentions only in passing, is
that the kernel checks the condition atomically with the wait operation; if
the condition isn't satisfied, the wait is not performed. This avoids a race
in which one thread unlocks the mutex right after another thread has invoked
the system call but before it has been put on the wait queue.

Effectively, inside the kernel the algorithm is

    
    
        lock wait queue
        if (*var == val)
           add myself to wait queue
           unlock wait queue
           wait until queue wakes me up
        else
           unlock wait queue

~~~
eliben
Thanks for mentioning this, it's indeed one of the interesting subtleties of
futex().

In my defense, I explicitly say in the post that it's not a replacement for
the man page and suggest the readers at least go through the descriptions of
FUTEX_WAIT and FUTEX_WAKE. The former explains this behavior in the very first
sentence.

Moreover, the post mentions this behavior twice:

1. When explaining spurious wakeups in wait_on_futex_value:

    
    
      [...] FUTEX_WAIT will immediately return if the
      value at the futex address is not equal to val. In our case this can
      happen if the child issued a wait before the parent wrote 0xA, for
      example.
    

2. In a comment inside Mutex::lock():

    
    
       // We're here when either:
       // (a) the mutex was in fact unlocked (by an intervening thread).
    

But your point on avoiding an insidious deadlock is spot on.

~~~
bonzini
Indeed it is mentioned. I wrote the comment because repeating the check in the
kernel _on a user provided value_ is what distinguishes futexes from a
condvar. Of course mutex+condvar+int is not a replacement if you're
implementing pthreads, but it's also much less efficient than futexes if
you're implementing a synchronization primitive not provided by pthreads.

(The most common use of futexes that I have seen outside glibc is to implement
binary semaphores. Binary semaphores can be a lighter-weight alternative to
condition variables when only a single thread can ever wait on an event.)

~~~
cozzyd
In my experience futexes are primarily used to confound valgrind. Damn you
libgomp (yeah I know you can recompile gcc to use pthreads in the openmp
implementation).

------
Animats
Here's the original implementation of that idea, from the early 1970s.[1]
UNIVAC EXEC 8 had this. See section 4.3.4, page 148, "Test and Set Queuing".

This OS had threads, decades before UNIX got them. They were called
"activities". These were also multiprocessor systems. So it had to have user
code synchronization. The UNIVAC mainframes had a "test and set" instruction.
This was an atomic operation which set a flag. If the flag was already set, an
interrupt occurred, so the OS could reschedule.

This could be used in a spin mode, where hitting a locked lock just
rescheduled the activity for later. Or the user could give their threads IDs
and manage them with system calls, like UNIX pthreads. The third option, test
and set queuing, was added later. There was a list of activities in user space
associated with each lock, and when a activity was blocked, it went on the
user space queue for that lock. The "clear test and set" operation was an
assembly language macro, which took the first waiting activity off the list
and told the OS to unblock it.

The normal case of taking the lock without delay did not involve the OS at
all. Even pthreads works that way in some implementations, as it should.

[1] [http://bitsavers.trailing-
edge.com/pdf/univac/1100/exec/UP-4...](http://bitsavers.trailing-
edge.com/pdf/univac/1100/exec/UP-4144.23_R36r2_Executive_System_Vol_2_Jan80.pdf)

------
Iwan-Zotow
It might be worth mentioning that MS Windows has a very similar (though somewhat
distinctive) function WaitOnAddress [https://docs.microsoft.com/en-
us/windows/desktop/api/synchap...](https://docs.microsoft.com/en-
us/windows/desktop/api/synchapi/nf-synchapi-waitonaddress), introduced in
Windows 8 and Server 2012. For details, see
[https://blogs.msdn.microsoft.com/oldnewthing/20170601-00/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20170601-00/?p=96265)

------
ww520
One important thing to remember about futex is that it is not a lock, despite
its name being similar to mutex. A futex is a conditional sleep mechanism: the
kernel wakes up the caller when the futex word's state changes. When you are
woken up and return from the futex() call, you don't hold the lock. It just
means the state of the lock has changed and it's time to check and attempt to
acquire the lock again. In fact, multiple threads sleeping on the futex word
can be awakened at the same time (depending on the FUTEX_WAKE command's
parameter), and all of them need to do a compare-and-swap to acquire the lock.

~~~
bogomipz
>"When awaken and returns from the futex() call, you don't get the lock."

Can you elaborate on this? That sentence isn't reading well for me.

>"Actually multiple threads sleeping on the futex flag can be awaken at the
same time (depending on the FUTEX_WAKE command's parameter). All of them need
to do compare-and-set to acquire the lock."

So whichever thread the kernel schedules first will get the lock? I.e. there
is no fairness or weighting? Does this potentially cause a "thundering herd"
when they're awakened at the same time?

~~~
ww520
Essentially to implement a lock in the user mode, you do:

    
    
        while !compare_and_swap(&flag,0,1) // check if flag is free (0), lock it (1)
            futex_wait(&flag,1)            // sleep while flag is still locked (1)
    

When futex_wait() returns, it just means the flag has changed (typically back
to 0) and the kernel woke me up so I can check it again. Another thread can
come in and grab the lock in the meantime; that's why I need to loop back and
check again. The atomic CAS operation is what actually acquires the lock.

In the old days, it was just a busy loop running the CAS check continually,
which is expensive.

    
    
        while !compare_and_swap(&flag, 0, 1)
            // busy loop running
    

With futex(), the lock-acquiring threads can go to sleep and wake up with the
help of the kernel.

Yes, multiple threads can be woken up; the nr_wake parameter in futex_wake
[1] dictates how many threads to wake. And yes, you can get a "thundering
herd" problem. For fairness, see priority usage in my other comment
[https://news.ycombinator.com/item?id=17526849](https://news.ycombinator.com/item?id=17526849)

[1]
[https://elixir.bootlin.com/linux/v4.18-rc4/source/kernel/fut...](https://elixir.bootlin.com/linux/v4.18-rc4/source/kernel/futex.c#L1515)

~~~
bogomipz
Thanks for clarification and the examples. This makes sense. Cheers.

------
noahdesu
Nice! I have fond memories of studying the mechanics of futexes when
PREEMPT_RT was under heavy development.

It's been a long time since I thought about the kernel futex implementation,
but IIRC, one of the most interesting aspects was that the kernel would
actually navigate a linked list that existed entirely in userspace. I believe
this was called, or related to, robust futexes [0].

[0]: [https://www.kernel.org/doc/Documentation/robust-
futexes.txt](https://www.kernel.org/doc/Documentation/robust-futexes.txt)

------
nkurz
Does anyone know how "fair" the current Linux implementation of futexes is? I
experimented with them soon after they were introduced, and at that time there
was a great deal of favoritism: under heavy contention a thread/core that had
just given up a lock was most likely to reclaim it, and other threads/cores
would be starved. This isn't always a bad thing (and can sometimes even help
performance) but didn't work well for my application.

~~~
ww520
Most of the checking to acquire the lock happens in user mode. It favors the
thread in the RUNNING state that still has quantum left. For example, suppose
a thread is running, has the lock, does its work, and releases the lock (which
wakes up other sleeping threads, but they are only in the RUNNABLE state, not
RUNNING yet). If the running thread immediately loops back to acquire the lock
again, it will succeed because the lock is free. When that running thread's
quantum runs out and it is put to sleep (still holding the lock), the other
awakened threads run, but the lock is not free, so they promptly go back to
sleep to wait for it. The old running thread wakes up and continues the cycle.

If you want fairer lock acquisition, you need to use locking priority. Just
lower the running thread's priority when it finishes its work. The other
sleeping threads with higher priority will have a chance to get the lock.

~~~
raattgift
> Just lower the running thread's priority when it finishes its work. The
> other sleeping threads with higher priority will have a chance to get the
> lock.

What happens if, immediately after the running thread lowers its priority, it
is descheduled? Is it still holding the lock at that point? Could
it remain descheduled at low priority for a relatively long time while the
other threads wake up and fail to acquire the lock (and also while any other
unrelated threads go about their unlowered-priority business)?

Why wouldn't the workless thread just give up the lock (and possibly self-
deschedule) when it has finished its work? If it is then assigned more work it
would have to be scheduled and when running again contend for the lock, which
is presumably what one wants, right?

~~~
ww520
You would release the lock first before lowering the priority.

As a separate issue, the priority inversion problem is addressed with the
priority inheritance technique, which futex supports directly. Basically the
priority of the thread holding the lock is boosted to the highest priority of
all the threads sleeping at the lock, making sure it can run to completion.
When the lock is released, the thread's priority is lowered back to its
original level. All the xx_pi() APIs are the priority inheritance versions.

There are a number of ways to ensure fairness for lock acquisition; however,
performance often suffers, so people just live with the problem and design the
workload around it.

------
bogomipz
At the end of the piece the author states:

>"The Go runtime does not use libc, in most cases ..."

Can someone say in which cases the Go runtime elects to use libc?

~~~
ecnahc515
It depends on environment variables and compile-time options. Mostly it uses
libc for some network operations like hostname resolution, and a few other
things like user/group lookups. However, it defaults to the pure Go
implementations for most things when possible.

