
Linux: Introduce restartable sequences system call - tgragnato
https://github.com/torvalds/linux/blob/d82991a8688ad128b46db1b42d5d84396487a508/kernel/rseq.c
======
muxator
Is it only me? A github link instead of a kernel.org one?

Maybe it's just an effort to reduce the HN death hug, and every git repo is an
identical clone after all, yet I can't help but think that a link to the
kernel tree should point to [0].

Github is not (and never was) the only source of truth. It (sort of) became
one because of us.

Replace this with Operating System, Social Network, default Win 95 browser,
any commonly accepted belief.

[0]
[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/kernel/rseq.c?id=d82991a8688ad128b46db1b42d5d84396487a508)

~~~
nemanjaboric
I guess this would be in line with Linus' description of GitHub:

> I think github does a stellar job at the actual hosting part. I really do.
> There is no question in my mind that github is one of the absolute best
> places to host a project. It's fast, it's efficient, it works, and it's
> available to anybody.

[https://github.com/torvalds/linux/pull/17#issuecomment-56546...](https://github.com/torvalds/linux/pull/17#issuecomment-5654674)

~~~
Fnoord
Correct source of comment is elsewhere in that thread [1]

[1]
[https://github.com/torvalds/linux/pull/17#issuecomment-56613...](https://github.com/torvalds/linux/pull/17#issuecomment-5661304)

------
wahern
That took nearly 3 years to get merged:
[https://lwn.net/Articles/650333/](https://lwn.net/Articles/650333/)

~~~
schmichael
5 years since first presented:
[https://www.youtube.com/watch?v=KXuZi9aeGTw](https://www.youtube.com/watch?v=KXuZi9aeGTw)

~~~
wahern
Are the userland scheduling patches (switchto) in the pipeline?

~~~
schmichael
If they are I can't find where...

------
voltagex_
This is too far above my level to understand directly - has anyone got an
example of an application that will be affected by this? I see some comments
about tracing below.

~~~
compudj
Some use-cases likely to be enhanced by rseq: statistics counters, memory
allocators (jemalloc, glibc malloc, and others), user-space tracing (LTTng),
user-space Read-Copy Update (liburcu), reading performance monitoring unit
counters from user-space on ARM64, and possibly user-space task scheduling.

Also, just reading the current CPU number can now be done faster by reading
the __rseq_abi->cpu_id field rather than calling the sched_getcpu vDSO.
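
For comparison, here is a minimal sketch of the existing way to ask which CPU a thread is on. It shows only the `sched_getcpu()` baseline mentioned above, not the rseq read itself; `current_cpu` and `print_current_cpu` are hypothetical helper names, not part of any library.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns the CPU the calling thread was running on
 * at the moment of the call, or -1 if sched_getcpu() fails.
 * The thread may be migrated right after the call returns, so the value
 * is only a hint unless the caller pins the thread or uses rseq.
 */
static int current_cpu(void)
{
    return sched_getcpu();
}

/* Prints the current CPU number; meant to be called from main(). */
static void print_current_cpu(void)
{
    int cpu = current_cpu();

    if (cpu < 0)
        perror("sched_getcpu");
    else
        printf("running on CPU %d\n", cpu);
}
```

With rseq registered, the same number would come from a plain memory load of the `cpu_id` field, which is where the speedup is expected to come from.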

~~~
FrankBooth
Thanks for seeing this work through, Mathieu.

> possibly user-space task scheduling

I'm very interested in this aspect. Do you have a sense of whether there's
enough in the kernel now to build this, or are there still pieces missing?

~~~
compudj
Yes, there are indeed pieces missing for this use-case. I intend to push
another system call for the next merge window (4.19): "cpu_opv" [1]. It stands
for "CPU operation vector", which is needed to take care of moving user-level
tasks around between per-cpu work-queues touched by rseq fast-paths in a way
that is safe against CPU hotplug. It's also needed to migrate free memory
between per-cpu memory pools modified by rseq fast-paths safely against CPU
hotplug. Some of it can be approximated by setting cpu affinity, but it's racy
against CPU hotplug.

cpu_opv can be used as a slow-path fallback in pretty much all scenarios where
the rseq fast-path aborts.

rseq user-level APIs are pretty much limited to only work on the current CPU,
whereas cpu_opv allows creating operations on per-cpu data structures [2]
which take the CPU number as argument. If it happens to be on the current CPU,
rseq can be used and it is fast, but if the CPU number is not the current CPU,
or is an offline CPU, then cpu_opv takes care of performing the operation
safely with respect to rseq critical sections, and other cpu_opv operations.

[1]
[https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-r...](https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-rseq.git/commit/?h=v4.18-rc1-rseq-20180619.2&id=293b0fb03e6dc9c70d2c5e6ce4065456aa8c4d5e)
[2]
[https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-r...](https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-rseq.git/commit/?h=v4.18-rc1-rseq-20180619.2&id=d380b8e2f5f08c09ad893dc46cf530b45910914d)

------
mereel
What is an example of the "per-cpu data" they're talking about here[0]?

[0]
[https://github.com/torvalds/linux/blob/d82991a8688ad128b46db...](https://github.com/torvalds/linux/blob/d82991a8688ad128b46db1b42d5d84396487a508/kernel/rseq.c#L31-L32)

~~~
fyi1183
I suspect that something like a heap implementation could use this. For
concurrency, you want different cores to use different pools to avoid atomics.
In practice, this means per-thread pools are used today, but this rseq feature
seems like it would allow using per-core pools instead. That would save memory
and probably be even better for cache locality when a core is shared by
multiple threads.

~~~
scottlamb
I use higher-level APIs built on top of restartable sequences. Here's my
understanding (could be wrong):

> I suspect that something like a heap implementation could use this.

Indeed. Let's say you want to have lots and lots and lots of threads, as
described in the video schmichael linked. [0] Per-thread malloc pools become
less attractive:

* too empty (lots of contention for the global pool) or
* too full (lots of wasted RAM, probably poor CPU cache utilization as well) or
* lots of sloshing

More generally, people sometimes do per-thread stuff to avoid lock contention.
Some types of state might be reasonable to keep per-thread when the program is
written in a thread-per-core / async style but might not be if it's written in a
thread-per-request / sync style. It might use too much RAM. If you ever have
to access _all_ the threads' state (say, if you are doing some counters for a
monitoring system: increment just the current thread's state on write; sum
them on read), that path might get ridiculous. So per-CPU might work better.

Per-CPU stuff doesn't require restartable sequences. You can just use the CPU
number to decide which shard to access then lock it or use atomics as you
would with global state. You get less lock contention and cache-line bouncing.
(Alternatively, you might get some of these benefits by picking a shard
randomly, if the rng is cheap enough. Or a counter.)

Restartable sequences let you entirely avoid atomic operations for per-cpu
stuff.

[0]
[https://www.youtube.com/watch?v=KXuZi9aeGTw](https://www.youtube.com/watch?v=KXuZi9aeGTw)
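
The "per-CPU without restartable sequences" variant described above (pick a shard by CPU number, use atomics on it, sum all shards on read) could look roughly like this. It is a sketch under assumptions: `NSHARDS`, the 64-byte cache-line size, and all type and function names are made up for illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>

#define NSHARDS 64  /* assumed upper bound, not queried from the system */

/* One counter per shard, each on its own (assumed 64-byte) cache line. */
struct shard {
    _Alignas(64) atomic_long value;
};

struct sharded_counter {
    struct shard shards[NSHARDS];
};

/* Write path: bump only the shard that belongs to the current CPU. */
static void counter_add(struct sharded_counter *c, long n)
{
    int cpu = sched_getcpu();
    unsigned idx = cpu < 0 ? 0 : (unsigned)cpu % NSHARDS;

    /* Still atomic: the thread may migrate between sched_getcpu() and here. */
    atomic_fetch_add_explicit(&c->shards[idx].value, n, memory_order_relaxed);
}

/* Read path: sum every shard. The result is not a consistent snapshot. */
static long counter_sum(struct sharded_counter *c)
{
    long total = 0;

    for (int i = 0; i < NSHARDS; i++)
        total += atomic_load_explicit(&c->shards[i].value,
                                      memory_order_relaxed);
    return total;
}
```

The atomic add in `counter_add` is the part a restartable sequence would let you drop: the kernel restarts the sequence on preemption or migration, so a plain load, add, and store on the current CPU's shard is enough.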

------
stefan_
So that's what the warning I get on 4.18 with arm64 is about..

    CALL scripts/checksyscalls.sh
    <stdin>:1335:2: warning: #warning syscall rseq not implemented [-Wcpp]

I guess it's not really _in_, then.

~~~
pm215
The nature of the thing is that there's a chunk of per-architecture work
required (maybe 20-30 lines of actual kernel code and another 700 lines of
userspace/testcase support, judging by the diffstat). What's gone in to start
with includes support for x86-64, i386, powerpc and 32-bit arm as well as the
common code. Maintainers of other architectures can then add support on top of
this at some point. Support for new syscalls not being perfectly in sync
across all kernel archs is not uncommon, I think.

The other not-yet-there part is how userspace libcs are going to expose this
usefully to applications; there's some interesting discussion on the kernel
mailing list about that. (It needs some libc support because the kernel only
supports a single registered restartable-sequence area, so if multiple
libraries try to do it they'll tread on each others' feet; and libc will want
to use this feature itself.)

~~~
webaholic
Are you aware of any effort to add ARM64 support?

~~~
compudj
Yes, Will Deacon (Linux ARM64 maintainer) is working on it right now.

------
hawski
Here is the man page for it:

Search for "Man page associated":

[https://patchwork.kernel.org/patch/10444833/](https://patchwork.kernel.org/patch/10444833/)

Also man page source:

[https://patchwork.kernel.org/patch/10468085/](https://patchwork.kernel.org/patch/10468085/)

------
nikital
A restartable sequence is defined by a start and an end offset. If I write a
restartable sequence in C I can't be absolutely sure that it's contiguous and I
can't know where the commit instruction is...

Does it mean that I have to write it in assembly to make sure that I know
where the sequence starts and where it ends?

~~~
compudj
Indeed the restartable sequence critical section needs to be written in
assembly. The idea is to keep this complexity within public headers
implementing the common operations as inline assembly for all supported
architectures. You can see such operations already implemented for x86 as part
of the rseq selftests here:
[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/rseq/rseq-x86.h?h=v4.18-rc1)
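
To make the "start and end offset" idea concrete, here is a small C sketch of the address-range check such a descriptor implies. The struct is an illustrative stand-in, not the kernel header; the field names `start_ip` and `post_commit_offset` follow my reading of the kernel's `rseq_cs` descriptor and should be treated as assumptions, as should the helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative mirror of a critical-section descriptor: a start address,
 * the distance from the start to the first address past the commit
 * instruction, and an abort target.
 */
struct cs_descriptor {
    uint64_t start_ip;
    uint64_t post_commit_offset;
    uint64_t abort_ip;
};

/* True if ip lies in [start_ip, start_ip + post_commit_offset). */
static bool in_critical_section(const struct cs_descriptor *cs, uint64_t ip)
{
    /* Unsigned subtraction wraps for ip < start_ip, so one compare suffices. */
    return ip - cs->start_ip < cs->post_commit_offset;
}

/* The abort target must not itself be inside the critical section. */
static bool descriptor_is_valid(const struct cs_descriptor *cs)
{
    return !in_critical_section(cs, cs->abort_ip);
}
```

Because the check is purely on instruction addresses, the compiler must not be free to move or split the code in that range, which is why the sequence itself ends up in inline assembly.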

------
glandium
Does something similar exist on other OSes?

~~~
davidtgoldblatt
It's implementable with a kernel driver on Solaris because of its scheduling
hooks. (This was done in the "Mostly Lock-Free Malloc" paper).

------
dooglius
>The address of jump target abort_ip must be outside the critical region

Seems like an odd requirement, isn't the whole point that we want to restart
the critical section, or at least part of it, over in the case that it's
interrupted?

~~~
jgalar
Not necessarily. Applications may choose to retry a certain number of times
and fall back to another mechanism if the critical section can't complete.

A problematic case that was brought up on LKML was the interaction with
debuggers and unexpected page faults, see this comment:
[https://lwn.net/Articles/738119/](https://lwn.net/Articles/738119/)
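
The "retry a few times, then fall back" policy can be sketched in plain C as follows. This does not use rseq at all: the fast path is passed in as a function pointer that reports whether it committed, and `always_commit` / `always_abort` are hypothetical stand-ins for a real rseq fast path. `MAX_FAST_ATTEMPTS` and the mutex-based slow path are likewise assumptions made for illustration.

```c
#include <pthread.h>
#include <stdbool.h>

#define MAX_FAST_ATTEMPTS 3  /* assumed retry budget */

struct counter {
    long value;
    pthread_mutex_t lock;
    int slow_path_hits;  /* how often the fallback ran, for illustration */
};

/* A fast path returns true if it committed, false if it was aborted. */
typedef bool (*fast_path_fn)(struct counter *c);

/* Stand-in for a fast path that always commits. */
static bool always_commit(struct counter *c)
{
    c->value++;
    return true;
}

/* Stand-in for a fast path that is always interrupted before committing. */
static bool always_abort(struct counter *c)
{
    (void)c;
    return false;
}

/* Try the fast path a bounded number of times, then take the slow path. */
static void counter_inc(struct counter *c, fast_path_fn try_fast)
{
    for (int i = 0; i < MAX_FAST_ATTEMPTS; i++)
        if (try_fast(c))
            return;

    pthread_mutex_lock(&c->lock);
    c->value++;
    c->slow_path_hits++;
    pthread_mutex_unlock(&c->lock);
}
```

A bounded retry count matters in cases like single-stepping in a debugger, where the fast path could otherwise abort forever.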

------
signa11
genuine question: why not boot the machine with 'isolcpus' and place your tasks
on said cpus? such tasks would not be subjected to the vagaries of the
scheduler etc...

~~~
compudj
It all depends on how much control you have on the system you target. The
strategy you refer to may well work for a dedicated deployment, but if you are
developing a general-purpose memory allocator targeting a wide range of
applications, you might not want to impose those constraints on your users.

~~~
signa11
ah, thanks for the information! having used 'isolcpus' on dedicated systems, i
completely overlooked this aspect.

------
rwmj
Will this finally allow us to implement the mkdir system call in terms of
mknod and link :-?

~~~
nikital
No, you can't call syscalls from restartable sequences

------
Grollicus
The speedup numbers given in the commit [0] are impressive on ARM but don't
look very good on x86. On the other hand they add a thread-local write to
every context switch and add a bunch of code. Add to that only a 1.2 speedup
on the LTTng benchmark.

To me as a non-kernel guy this doesn't sound very impressive..

[0]
[https://github.com/torvalds/linux/commit/d7822b1e24f2df5df98...](https://github.com/torvalds/linux/commit/d7822b1e24f2df5df98c76f0e94a5416349ff759)

~~~
opmac
Ah yes, the ubiquitous middlebrow dismissal
([https://news.ycombinator.com/item?id=4693920](https://news.ycombinator.com/item?id=4693920)
[http://www.byrnehobart.com/blog/why-are-middlebrow-dismissal...](http://www.byrnehobart.com/blog/why-are-middlebrow-dismissals-so-tempting/)).
"As a non-kernel guy" gives it away immediately.

~~~
cryptonector
The other responses were useful. Yours is just a put-down. Why bother? To
discourage "middlebrow" comments? But the one you're responding to elicited
useful information, so it's hardly a problem.

~~~
striking
I found the articles on the "middlebrow dismissal" insightful. And I don't
think ends usually justify means; I think the parent-most comment was not as
nice as it should have been.

The parent comment is not so much a put-down as much as it calls out a
particularly unsophisticated dismissal. The parent-most comment would have
elicited the same useful information if it were phrased as a question, but
would have been less uncouth. I think that's a perfectly fine thing to write a
comment about.

