Linux: Introduce restartable sequences system call (github.com)
197 points by tgragnato 8 months ago | 62 comments

Is it only me? A github link instead of a kernel.org one?

Maybe it's just an effort to reduce the HN death hug, and every git repo is an identical clone after all, yet I can't help but think that a link to the kernel tree should point to [0].

GitHub is not (and never was) the only source of truth. It (sort of) became one because of us.

Replace this with Operating System, Social Network, default Win 95 browser, or any commonly accepted belief.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

I guess this would be in line with Linus' description of GitHub:

> I think github does a stellar job at the actual hosting part. I really do. There is no question in my mind that github is one of the absolute best places to host a project. It's fast, it's efficient, it works, and it's available to anybody.


The correct source of the comment is elsewhere in that thread [1].

[1] https://github.com/torvalds/linux/pull/17#issuecomment-56613...

That took nearly 3 years to get merged: https://lwn.net/Articles/650333/

5 years since first presented: https://www.youtube.com/watch?v=KXuZi9aeGTw

Are the userland scheduling patches (switchto) in the pipeline?

If they are I can't find where...

I'm not entirely sure this is the same thing - but if anybody knows where the patches for this are, please share - I have tried to contact the presenter to get them but have had no success thus far.

That doesn't talk about this, but about a larger set of enhancements (new system calls) for avoiding a bunch of threading/scheduler pathologies.

Could you clarify whether you think this is good, bad, or just interesting? Your comment might be read as implying that a change like this taking three years is somehow inefficient or tsk-worthy.

I'm not the OP, but I posted the video I remember watching 5 years ago with great interest.

I have no basis on which to form an opinion on whether 5 years is a good or bad period of time to go from an idea to a syscall in the Linux kernel. I just find it interesting as someone who has switched jobs multiple times in 5 years -- much less worked on a single idea that long!

Well, to be fair to the authors, their idea became a system call in Google's Linux kernel years ago. It's only the general public that has been deprived of its benefit.

(disclaimer: I am the patch author, Mathieu Desnoyers) Just as a clarification, the idea originates from Google (I give full credits to Paul Turner and Andrew Hunter for it). However, the extra 3 years of work required to get it upstream has been done by myself at EfficiOS.

Hi Mathieu! Yep, I was going to say that the patch set has evolved rather notably. Good job.

Did you manage to push the required changes to glibc or do you maintain your own user space rseq lib?


I'm currently discussing with glibc maintainers the best approach to integrate this into the Linux userspace ecosystem. So far, discussions are heading in a direction where glibc would own the __rseq_abi TLS symbol, and register it for every thread. I can then maintain a rseq library consisting of helper header files that contain the common rseq operations for all supported architectures.

I am concerned about providing a librseq that handles rseq registration for early adopters though, because I don't want projects to eventually end up conflicting with future glibc versions. Once we have settled how glibc will expose the symbol and register it, I will try to provide a helper library which exposes this symbol and allows performing explicit rseq registration in a way that won't conflict with future glibc versions.

> I am concerned about providing a librseq that handles rseq registration for early adopters though

Sounds very reasonable.

So at this point, as far as I understand it, FB and Google carry in-house rseq kernel and user space patches. Right? Are they on board with the mainline rseq? Will FB support rseq in jemalloc any time soon?


I've been in touch with FB. They are interested in using rseq for jemalloc. They have provided prototypes of jemalloc based on rseq, along with benchmarks helping me make the case for rseq mainlining.

I don't know whether Google will ever want to swap from their in house rseq implementation to the upstream Linux rseq, use both ABIs for a transition period, or simply keep using their own in-house rseq.

Thank you for persevering! Could you elaborate a bit on what had to be done to get it upstream?

Sure, before getting it upstream, I had to:

- Gather a list of desiderata, ensuring we take into account a complete list of use-cases targeted by everyone active in the rseq discussions. This is crucially important to ensure discussions don't spin in circles going back and forth between different requirements,

- Redesign the uapi/linux/rseq.h ABI, making sure a single TLS store is needed to enter a rseq critical section, without requiring any extra registers as ABI. I have introduced the "rseq_cs" structure as critical section descriptor to do this,

- Optimize arm32 and x86 rseq critical sections for speed, by creating my own benchmark programs,

- Rewrite the kernel rseq implementation a few times so it follows the kernel coding style, ensuring it pleases everyone who cares about it,

- Present 2 talks about rseq at Linux Plumbers Conference,

- Go through various rounds of in person, email, and IRC discussions with Paul Turner, Peter Zijlstra, Andy Lutomirski, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Ben Maurer, Linus Torvalds, and many others. Those were very constructive discussions bringing up everyone's concerns with respect to this new system call,

- Extend the rseq selftests, adding new testing strategies such as delay loops between "steps" of the critical section, thus increasing the likelihood of generating preemption races,

- Figure out nasty races only happening on NUMA systems after about a full day of stress-testing,

- Provide solutions for debugger single-stepping "lack of progress" problem if rseq is used when retrying on abort. It's basically the cpu_opv system call I plan to propose for 4.19. Meanwhile, without cpu_opv, rseq can still be used in ways to guarantee forward progress, but the abort code needs to use a partitioning strategy rather than a simple retry (e.g. going to a different memory pool in case of abort for a memory allocator),

- Harden the rseq mechanism for security, by adding a "signature" word before the abort label,

- Implement prototypes of lttng-ust and liburcu which use rseq, gathering benchmarks to validate the approach,

- Write rseq and cpu_opv man pages.

And these are just the items that were "forward progress" in the rseq adventure. I'm leaving out all the attempts at making things more generic that had to be thrown away.

Thanks for this detailed, if a bit overwhelmingly long, list. When I saw the first Paul Turner tech talk on rseq (at LPC, if I remember correctly; it was called something else then), it seemed so simple, so obvious: "just read this memory address, and if there was an interruption, we have to retry".

But then of course real life is a lot more complex than slideware.

cpu_opv is new to me (no time for LWN these days), but it looks simple, elegant, and sort of obvious (again). Which makes me wonder why no one thought of it before. (But of course this is probably my ignorance speaking.)

Thanks for pushing the limits!

I just thought it was an interesting factoid and a good segue to the LWN article link.

This is too far above my level to understand directly - has anyone got an example of an application that will be affected by this? I see some comments about tracing below.

Some use-cases likely to be enhanced by rseq: statistics counters, memory allocators (jemalloc, glibc malloc, and others), user-space tracing (LTTng), user-space Read-Copy Update (liburcu), reading performance monitoring unit counters from user-space on ARM64, and possibly user-space task scheduling.

Also, just reading the current CPU number can now be done faster by reading the __rseq_abi cpu_id field rather than calling the sched_getcpu vDSO.

Thanks for seeing this work through, Mathieu.

> possibly user-space task scheduling

I'm very interested in this aspect. Do you have a sense of whether there's enough in the kernel now to build this, or are there still pieces missing?

Yes, there are indeed pieces missing for this use-case. I intend to push another system call for the next merge window (4.19): "cpu_opv" [1]. It stands for "CPU operation vector", which is needed to take care of moving user-level tasks around between per-cpu work-queues touched by rseq fast-paths in a way that is safe against CPU hotplug. It's also needed to migrate free memory between per-cpu memory pools modified by rseq fast-paths safely against CPU hotplug. Some of it can be approximated by setting cpu affinity, but it's racy against CPU hotplug.

cpu_opv can be used as a slow-path fallback in pretty much all scenarios where the rseq fast-path aborts.

rseq user-level APIs are pretty much limited to only work on the current CPU, whereas cpu_opv allows creating operations on per-cpu data structures [2] which take the CPU number as argument. If it happens to be on the current CPU, rseq can be used and it is fast, but if the CPU number is not the current CPU, or is an offline CPU, then cpu_opv takes care of performing the operation safely with respect to rseq critical sections, and other cpu_opv operations.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-r... [2] https://git.kernel.org/pub/scm/linux/kernel/git/rseq/linux-r...

What is an example of the "per-cpu data" they're talking about here[0]?

[0] https://github.com/torvalds/linux/blob/d82991a8688ad128b46db...

Mathieu's (patch author) main motivation was removing atomic operation from the LTTng userspace tracer's hot path.

In that case, the per-cpu data is the 'reserve' and 'commit' counters that must be updated when the tracer saves an event to the per-CPU buffers.

Other uses that I'm aware of include memory allocators that maintain per-CPU arenas.

I suspect that something like a heap implementation could use this. For concurrency, you want different cores to use different pools to avoid atomics. In practice, this means per-thread pools are used today, but this rseq feature seems like it would allow using per-core pools instead. That would save memory and probably be even better for cache locality when a core is shared by multiple threads.

I use higher-level APIs built on top of restartable sequences. Here's my understanding (could be wrong):

> I suspect that something like a heap implementation could use this.

Indeed. Let's say you want to have lots and lots and lots of threads, as described in the video schmichael linked. [0] Per-thread malloc pools become less attractive:

* too empty (lots of contention for the global pool), or
* too full (lots of wasted RAM, probably poor CPU cache utilization as well), or
* lots of sloshing

More generally, people sometimes do per-thread stuff to avoid lock contention. Some types of state might be reasonable to keep per-thread when the program is written in a thread-per-core / async style but might not be when it's written in a thread-per-request / sync style. It might use too much RAM. If you ever have to access _all_ the threads' state (say, if you are doing some counters for a monitoring system: increment just the current thread's state on write; sum them on read), that path might get ridiculous. So per-CPU might work better.

Per-CPU stuff doesn't require restartable sequences. You can just use the CPU number to decide which shard to access then lock it or use atomics as you would with global state. You get less lock contention and cache-line bouncing. (Alternatively, you might get some of these benefits by picking a shard randomly, if the rng is cheap enough. Or a counter.)

Restartable sequences let you entirely avoid atomic operations for per-cpu stuff.

[0] https://www.youtube.com/watch?v=KXuZi9aeGTw

If you are looking for examples of per-cpu data structures using rseq, see the selftests I implemented here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

There are examples of per-cpu counters, per-cpu spinlocks, per-cpu linked-lists, and various forms of per-cpu buffers.

AFAIK it means that it guarantees the block of code will not be interrupted by the scheduler and executed on another CPU/core.

Sort of. It looks like if some critical code is interrupted by a CPU move, then the process gets that code restarted. The idea is that since CPU moves can be rare, you aren't paying the expense of locks to guard against this; you just let the OS tell the process, "hey, I moved you while you were doing something critical, let me restart that for you." This isn't a full solution for all process locks, but for a number of locking scenarios restarting the code is all that is needed.

Kind of like optimistic concurrency? "Let's try, and we'll retry on failure."

It doesn't look like it's a full retry, at least not for critical sections.

From what I can tell, you do three things:

A) you can define a critical section which must run atomically in respect to the CPU core moves

B) if the critical section is interrupted you can define a callback or restart the section (meaning the section can be safely repeated)

C) you can define a single operation to commit (ie, updating a pointer)

You can certainly build a "retry on failure" method using this; CPU moves are rare, so it's unlikely to fail the second time.

Efficient access to thread-local data, maybe. https://github.com/golang/go/issues/8884

So that's what the warning I get on 4.18 with arm64 is about..

CALL scripts/checksyscalls.sh <stdin>:1335:2: warning: #warning syscall rseq not implemented [-Wcpp]

I guess it's not really in, then.

The nature of the thing is that there's a chunk of per-architecture work required (maybe 20-30 lines of actual kernel code and another 700 lines of userspace/testcase support, judging by the diffstat). What's gone in to start with includes support for x86-64, i386, powerpc and 32-bit arm as well as the common code. Maintainers of other architectures can then add support on top of this at some point. Support for new syscalls not being perfectly in sync across all kernel archs is not uncommon, I think.

The other not-yet-there part is how userspace libcs are going to expose this usefully to applications; there's some interesting discussion on the kernel mailing list about that. (It needs some libc support because the kernel only supports a single registered restartable-sequence area, so if multiple libraries try to do it they'll tread on each others' feet; and libc will want to use this feature itself.)

Are you aware of any effort to add ARM64 support?

Yes, Will Deacon (Linux ARM64 maintainer) is working on it right now.

Here is the man page for it:

Search for "Man page associated":


Also man page source:


A restartable sequence is defined by a start and an end offset. If I write a restartable sequence in C, I can't be absolutely sure that it's contiguous, and I can't know where the commit instruction is...

Does it mean that I have to write it in assembly to make sure that I know where the sequence starts and where it ends?

Indeed the restartable sequence critical section needs to be written in assembly. The idea is to keep this complexity within public headers implementing the common operations as inline assembly for all supported architectures. You can see such operations already implemented for x86 as part of the rseq selftests here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

I think C lost it at: abort must be possible to call from any place in the critical block. Since you can't be sure the registers didn't spill onto the stack, I don't think anything other than ASM would work.

Does something similar exist on other OSes?

It's implementable with a kernel driver on Solaris because of its scheduling hooks. (This was done in the "Mostly Lock-Free Malloc" paper).

>The address of jump target abort_ip must be outside the critical region

Seems like an odd requirement, isn't the whole point that we want to restart the critical section, or at least part of it, over in the case that it's interrupted?

Not necessarily. Applications may choose to retry a certain number of times and fallback to another mechanism if the critical section can't complete.

A problematic case that was brought up on LKML was the interaction with debuggers and unexpected page faults, see this comment: https://lwn.net/Articles/738119/

Usually you need to do some cleanup first. The abort_ip being outside the critical section prevents the kernel from restarting this cleanup code.

Yes, but you can't restart from inside the critical section, you have to like reinitialize it.

Genuine question: why not boot the machine with 'isolcpus' and place your tasks on said CPUs? Such tasks would not be subject to the vagaries of the scheduler, etc...

It all depends on how much control you have on the system you target. The strategy you refer to may well work for a dedicated deployment, but if you are developing a general-purpose memory allocator targeting a wide range of applications, you might not want to impose those constraints on your users.

Ah, thanks for the information! Having used 'isolcpus' on dedicated systems, I completely overlooked this aspect.

You could still have N threads but only want M (M<N) CPUs. For example, there are situations where the threads are mostly I/O bound, but reducing the cost of the little CPU work they do allows you to pack more threads (and e.g. serve more clients) onto the same number of CPUs.

Without restartable sequences, replacing per-thread data with per-CPU data had a trade-off: per-thread data can be accessed with less overhead, per-CPU data consumes less memory. Restartable sequences give you the best of both worlds.

Will this finally allow us to implement the mkdir system call in terms of mknod and link :-?

No, you can't call syscalls from restartable sequences.

Why do you want that?

The speedup numbers given in the commit [0] are impressive on ARM but don't look very good on x86. On the other hand, they add a thread-local write to every context switch and a bunch of code. Add to that only a 1.2x speedup on the LTTng benchmark.

To me as a non-kernel guy this doesn't sound very impressive..

[0] https://github.com/torvalds/linux/commit/d7822b1e24f2df5df98...

Full disclosure, I work on LTTng and a 20% improvement is a huge deal to our users, so I'm certainly biased here.

As far as I know, the thread-local write only occurs if a rseq region is set. The added cost to a context switch seems essentially limited to a branch: https://github.com/torvalds/linux/commit/d7822b1e24f2df5df98...

The other benchmarks that were posted in the various threads leading to this merge looked fairly impressive to me.

Off the top of my head, jemalloc benefited a lot from this patchset, and not just on the run-time front: https://lkml.org/lkml/2016/10/10/332

Ah yes, the ubiquitous middlebrow dismissal (https://news.ycombinator.com/item?id=4693920 http://www.byrnehobart.com/blog/why-are-middlebrow-dismissal...). "As a non-kernel guy" gives it away immediately.

The other responses were useful. Yours is just a put-down. Why bother? To discourage "middlebrow" comments? But the one you're responding to elicited useful information, so it's hardly a problem.

I found the articles on the "middlebrow dismissal" insightful. And I don't think ends usually justify means; I think the parent-most comment was not as nice as it should have been.

The parent comment is not so much a put-down as much as it calls out a particularly unsophisticated dismissal. The parent-most comment would have elicited the same useful information if it were phrased as a question, but would have been less uncouth. I think that's a perfectly fine thing to write a comment about.

A 1.2x speedup affecting a primitive operation is huge.
