Maybe it's just an effort to reduce the HN hug of death, and every git repo is an identical clone after all, yet I can't help but think that a link to the kernel tree should point to .
Github isn't (and never was) the only source of truth. It (sort of) became one because of us.
Replace this with Operating System, Social Network, default Win 95 browser, any commonly accepted belief.
> I think github does a stellar job at the actual hosting part. I really do. There is no question in my mind that github is one of the absolute best places to host a project. It's fast, it's efficient, it works, and it's available to anybody.
I have no basis on which to form an opinion on whether 5 years is a good or bad amount of time to go from an idea to a syscall in the Linux kernel. I just find it interesting as someone who has switched jobs multiple times in 5 years -- much less worked on a single idea that long!
Did you manage to push the required changes to glibc, or do you maintain your own user space rseq lib?
I am concerned about providing a librseq that handles rseq registration for early adopters, though, because I don't want projects to eventually end up conflicting with future glibc versions. Once we've settled how glibc will expose the symbol and register it, I will try to provide a helper library which exposes this symbol and allows performing explicit rseq registration in a way that won't conflict with future glibc versions.
Sounds very reasonable.
So at this point, as far as I understand it, FB and Google carry in-house rseq kernel and user space patches. Right? Are they on board with the mainline rseq? Will FB support rseq in jemalloc any time soon?
I don't know whether Google will ever want to swap from their in house rseq implementation to the upstream Linux rseq, use both ABIs for a transition period, or simply keep using their own in-house rseq.
- Gather a list of desiderata, ensuring we take into account a complete list of use-cases targeted by everyone active in the rseq discussions. This is crucially important to ensure discussions don't spin in circles going back and forth between different requirements,
- Redesign the uapi/linux/rseq.h ABI, making sure a single TLS store is needed to enter a rseq critical section, without requiring any extra registers as ABI. I have introduced the "rseq_cs" structure as critical section descriptor to do this,
- Optimize arm32 and x86 rseq critical sections for speed, by creating my own benchmark programs,
- Rewrite the kernel rseq implementation a few times so it follows the kernel coding style and pleases everyone caring about it,
- Present 2 talks about rseq at Linux Plumbers Conference,
- Go through various rounds of in person, email, and IRC discussions with Paul Turner, Peter Zijlstra, Andy Lutomirski, Boqun Feng, Paul E. McKenney, Thomas Gleixner, Ben Maurer, Linus Torvalds, and many others. Those were very constructive discussions bringing up everyone's concerns with respect to this new system call,
- Extend the rseq selftests, adding new testing strategies such as delay loops between "steps" of the critical section, thus increasing the likelihood of generating preemption races,
- Figure out nasty races only happening on NUMA systems after about a full day of stress-testing,
- Provide solutions for debugger single-stepping "lack of progress" problem if rseq is used when retrying on abort. It's basically the cpu_opv system call I plan to propose for 4.19. Meanwhile, without cpu_opv, rseq can still be used in ways to guarantee forward progress, but the abort code needs to use a partitioning strategy rather than a simple retry (e.g. going to a different memory pool in case of abort for a memory allocator),
- Harden the rseq mechanism for security, by adding a "signature" word before the abort label,
- Implement prototypes of lttng-ust and liburcu which use rseq, gathering benchmarks to validate the approach,
- Write rseq and cpu_opv man pages.
And these are just the items that constituted "forward progress" in the rseq adventure. I'm leaving out all the attempts at making things more generic that had to be thrown away.
But then of course real life is a lot more complex than slideware.
cpu_opv is new to me (no time for LWN these days), but it looks simple, elegant, and sort of obvious (again). Which makes me wonder why no one thought of it before. (But of course this is probably my ignorance speaking.)
Thanks for pushing the limits!
Also, just reading the current CPU number can now be done faster by reading the __rseq_abi->cpu_id field rather than calling the sched_getcpu vDSO.
> possibly user-space task scheduling
I'm very interested in this aspect. Do you have a sense of whether there's enough in the kernel now to build this, or are there still pieces missing?
cpu_opv can be used as a slow-path fallback in pretty much all scenarios where the rseq fast-path aborts.
rseq user-level APIs are pretty much limited to only work on the current CPU, whereas cpu_opv allows creating operations on per-cpu data structures  which take the CPU number as argument. If it happens to be on the current CPU, rseq can be used and it is fast, but if the CPU number is not the current CPU, or is an offline CPU, then cpu_opv takes care of performing the operation safely with respect to rseq critical sections, and other cpu_opv operations.
In that case, the per-cpu data is the 'reserve' and 'commit' counters that must be updated when the tracer saves an event to the per-CPU buffers.
Other uses that I'm aware of include memory allocators that maintain per-CPU arenas.
> I suspect that something like a heap implementation could use this.
Indeed. Let's say you want to have lots and lots and lots of threads, as described in the video schmichael linked. Per-thread malloc pools become less attractive: they end up either
* too empty (lots of contention for the global pool) or
* too full (lots of wasted RAM, probably poor CPU cache utilization as well) or
* lots of sloshing
More generally, people sometimes do per-thread stuff to avoid lock contention. Some types of state might be reasonable to keep per-thread when the program is written in a thread-per-core / async style, but might not be if it's written in a thread-per-request / sync style: it might use too much RAM. If you ever have to access _all_ the threads' state (say, if you are doing some counters for a monitoring system: increment just the current thread's state on write; sum them on read), that path might get ridiculous. So per-CPU might work better.
Per-CPU stuff doesn't require restartable sequences. You can just use the CPU number to decide which shard to access, then lock it or use atomics as you would with global state. You get less lock contention and cache-line bouncing. (Alternatively, you might get some of these benefits by picking a shard randomly, if the rng is cheap enough. Or a counter.)
Restartable sequences let you entirely avoid atomic operations for per-cpu stuff.
There are examples of per-cpu counters, per-cpu spinlocks, per-cpu linked-lists, and various forms of per-cpu buffers.
From what I can tell, you can do three things:
A) define a critical section which must run atomically with respect to migration between CPU cores
B) define a callback for when the critical section is interrupted, or restart the section (meaning the section must be safe to repeat)
C) define a single operation that commits the result (e.g., updating a pointer)
You can certainly build a "retry on failure" method using this; CPU migrations are rare, so it's unlikely to fail a second time.
<stdin>:1335:2: warning: #warning syscall rseq not implemented [-Wcpp]
I guess it's not really in, then.
The other not-yet-there part is how userspace libcs are going to expose this usefully to applications; there's some interesting discussion on the kernel mailing list about that. (It needs some libc support because the kernel only supports a single registered restartable-sequence area, so if multiple libraries try to do it they'll tread on each others' feet; and libc will want to use this feature itself.)
Search for "Man page associated":
Also man page source:
Does it mean that I have to write it in assembly to make sure that I know where the sequence starts and where it ends?
Seems like an odd requirement. Isn't the whole point that we want to restart the critical section, or at least part of it, if it's interrupted?
A problematic case that was brought up on LKML was the interaction with debuggers and unexpected page faults, see this comment:
Without restartable sequences, replacing per-thread data with per-CPU data involved a trade-off: per-thread data can be accessed with less overhead, while per-CPU data consumes less memory. Restartable sequences give you the best of both worlds.
To me, as a non-kernel guy, this doesn't sound very impressive.
As far as I know, the thread-local write only occurs if a rseq region is set. The added cost to a context switch seems essentially limited to a branch: https://github.com/torvalds/linux/commit/d7822b1e24f2df5df98...
The other benchmarks that were posted in the various threads leading to this merge looked fairly impressive to me.
Off the top of my head, jemalloc benefited a lot from this patchset, and not just on the run-time front: https://lkml.org/lkml/2016/10/10/332
The parent comment is not so much a put-down as much as it calls out a particularly unsophisticated dismissal. The parent-most comment would have elicited the same useful information if it were phrased as a question, but would have been less uncouth. I think that's a perfectly fine thing to write a comment about.