I love coroutines and have written a few implementations myself. A few comments:
Don't have separate get and set_context. They are hard to understand and use (see the volatile x) and are suboptimal; a swap_context is better.
With this change, you do not need to store the callee save registers in the context, you can save them on the top of the stack directly, including ip. The pointer to context is just a pointer to the top of stack of the suspended context (there is no actual Context object).
The perfect primitives are:
// switch to 'to'. Returns another context on return, possibly,
// but not necessarily 'from', also possibly null.
// This is an optimization, as the next function alone is enough
ctx* swap_context(ctx * to);
// same as swap_context, but execute fn in the 'to' context, before
// returning from swap_context. 'fn' takes a pointer to the newly suspended
// calling context and closure is passed through.
// The function should return a context (possibly 'from') or nullptr.
ctx* swap_context_ex(ctx* to, ctx* (*fn)(ctx*, void*), void* closure);
// create an empty context with an allocated stack. swapping to it will deallocate
// the stack and terminate the context. use swap_context_ex to actually assign a
// function to run on the context
ctx* make_empty_context();
edit: the bits in the article about improving the gdb and *san experience are great, and something I wasn't aware of!
Oh, I forgot, you also probably want to be able to pass a pointer-sized argument to the next coroutine (and get one back). In the previous prototypes, convert all uses of 'ctx *' to plain 'ctx', where ctx is:
struct ctx { void* sp; void* payload; };
If you get things just right, at least on linux, you can get ctx to always be passed and returned via registers.
If you do not have an independent context structure how do you avoid stack overflow in the event of a coroutine cycle with an unknown cycle length (e.g A -> B -> C -> ... -> A in an infinite loop)?
For an example of where this comes up in schedulers, consider a simple periodic schedule that runs A then B then C then ... then starts back over.
I do not understand, why would there be a stack overflow? The "context" is saved (pushed) at the top of the suspended coroutine stack and restored (popped) when the coroutine is resumed (each coroutine has of course its own stack), so with a simple a->b->c->...->a cycle there is no issue.
In fact given swap_context_ex, yielding to the next ready task in the scheduler (and queuing the current task at the end of the queue) is trivial [1]. To simplify the code and the boundary conditions, I maintain the invariant that the ready queue always has at least one coroutine (the 'idle' coroutine). Also note how the queue node is stack allocated directly on the suspended coroutine stack.
edit: of course if you are not careful with swap_context_ex and keep pushing new functions to execute on a context without them ever returning to the (recursively) previous context, you can run out of space, but I haven't found a case where this would happen in practice (not that I looked hard enough though).
Okay. I see. I did not realize you had separate stacks which would allow you to save on the caller stack before a switch to the callee stack. I was thinking you were nesting on a single stack like “stackless coroutines”.
So if I am understanding correctly, essentially in your design each coroutine uses the end of its stack to push its context which can be trivially popped when you switch back to it and its stack.
Yes, you understood correctly. The intuition here is that the caller-saved registers are already saved by the compiler itself on the stack; you just need to complete the work. In fact, via GCC inline ASM, you can leave the job of saving all registers to the compiler, and the swap routine doesn't have to touch memory at all [1]
You can actually do it on a single stack though: when you detect a stack overflow, you traverse the stack and garbage collect no-longer-referenced stack frames (in practice you copy referenced stack frames to a new stack). The "small" catch is that your functions can no longer return and you need to commit to pure CPS, so this technique [2] is only used by some compilers-to-C for other languages with first-class continuations.
I suppose the 'fn' executed in the 'to' context before returning is there to do some clean-up or de-allocation for the caller? I.e. to make a proper 'return' possible? Could you maybe explain this in more detail? And which closure is passed?
It is a more principled replacement for the get_context/set_context pair. And more powerful, because it is run inside the 'to' coroutine, so the coroutine that called swap is already suspended.
Things you can do from the closure:
- delete the 'from' coroutine
- return into the 'to' coroutine a different coroutine than 'from' (or even no coroutine at all). This is useful to implement blocking primitives and schedulers.
- throw an exception into the 'to' coroutine (for example for exception forwarding, or even just to force the target coroutine to exit)
And much more of course.
It is a very powerful primitive analogous to call-with-current-continuation.
Of interest: Google has recently resumed trying to upstream its SwitchTo API into Linux. This API is a 10-20x faster way to schedule threads (real threads) from userspace.
Oh, yes. I've been looking forward to this since I saw https://www.youtube.com/watch?v=KXuZi9aeGTw (User-level threads....... with threads. - Paul Turner - Google) in 2013.
The way it is implemented on top of the futex machinery is very elegant.
Unfortunately it still needs to go through the kernel, so it is probably going to be orders of magnitude slower than pure userspace task switching, especially after Spectre.
And yet it is an order of magnitude faster than using the kernel thread scheduler, and you get to keep working thread-locals, normal process ID behavior, etc. It's a compromise.
The best solution is to standardize tooling around backtraces and switch in userspace. Kernel switchto() has its place, but only between processes. It’s much, much slower than a userspace function call.
Though I'm not impressed with the thread-local storage errors argument, as it's just an argument against low quality implementations.
In a good implementation of fibers or green threads, TLS APIs should be replaced by fiber-local storage using the same APIs and fast access method, and FLS doesn't have those errors.
Perhaps there's another argument against FLS, but the TLS errors argument should have considered FLS rather than leaving it implied that there's no solution.
I think the actual issue with TLS is that compilers will happily hoist TLS address computations across function calls. If they didn't do that, TLS would be much safer (at least as safe as with stackless coroutines). TLS address computation is pretty much free in non-PIE code on x86-64 Linux, so normally GCC doesn't cache it, but in shared libraries or on other architectures you can randomly get corruptions.
I do not think that FLS (at least by default) is the right solution as it would consume a lot of memory, increase the switch overhead and add complexity for little benefit.
Wait, are you saying that compilers can cache the pointer to TLS across function calls, even when inlining is disabled? Does this mean that the OP article is wrong when it says to use a separate getter/setter function for each TLS variable (even if we manually disable inlining for those accessor functions)?
Also, regarding shared libraries, how can the compiler cache the TLS address, since the compiler version that compiled the shared library may be different from the version which is compiling the executable that links to it, and thus may have different caching behaviours?
thread_local int x;

void foo() {
    x = 10;   // x in thread 1
    bar();    // bar() suspends this coroutine; it resumes on thread 2
    x = 20;   // might erroneously write to x of thread 1
}
The issue with dynamic libraries is that TLS address computation is slightly more expensive (as the TLS offset is not statically known), so the compiler is more likely to cache the TLS address computation. So the above function might be more likely to be "miscompiled" if it is part of a .so translation unit.
edit: also yes, forcing inlining off is not guaranteed to disable cross procedure optimizations. At the very least you have to disable function cloning. In fact I see that GCC now has attribute noipa which is probably a better solution.
Yes, but what you describe only happens if you try to access the TLS variables directly. If I put all the TLS variables in a .so file, and only access them through getter/setter functions (also part of the .so), can compiler optimisations still screw things up? It seems to me that unless there is some runtime optimisation going on, this method should be safe.
> I think the actual issue with TLS is that compilers will happily hoist TLS address computations across function calls.
TLS has its own issues, but in every situation where this optimisation is fine with TLS, it's fine with FLS too.¹
If you call a function in a fibers / stackful coroutines environment, and the function has a context switch inside (say it does some blocking I/O), by the time the function returns the original context is valid again.
> I do not think that FLS (at least by default) is the right solution as it would consume a lot of memory,
Depends on what you store in FLS. As with TLS, there are situations when it's beneficial for memory, compared with storing replicated copies of data all over the place in arguments and data structures, and situations when it isn't beneficial.
In some it will consume less memory than stackless coroutines, which store copies of the same data in async structures anyway.
> increase the switch overhead
If we're still discussing the article's case that stackless coroutines are better, FLS switch overhead is usually lower than stackless/async overhead.
In the stackless/async version you have compiler-generated loads and stores to context-specific structs that need to be executed every time something context switches for an await.
FLS overhead is updating a single pointer. Even that may be free, because you're going to update some kind of context structure pointer anyway. Even in the base case, stackless will update an equivalent pointer at absolute minimum, but often more.
This is really the same argument for TLS being sometimes useful in threaded programs specifically for performance, as opposed to explicitly passing around copies of pointers and storing them in data structures all over the place. (Of course there are non-performance reasons why people choose TLS or avoid it too.)
> add complexity for little benefit
The benefit is faster programs, and a simpler programming model. You may reasonably disagree :-)
But consider, the Linux kernel is basically all fibers with FLS, and that's a top performer :-)
With regard to programming model: In Linux, it was found to be very much simpler to use fibers (called "tasks" in Linux) than to build an async I/O state machine for filesystems. This is why Linux didn't get useful async I/O in filesystems for a long time, because the state machine version of filesystem operations, though people attempted, was just far too complex in practice to implement for all corner cases, so its async properties were unreliable.
--
¹ At least no problems that are specific to FLS. If you're pointing out there are issues with TLS (without fibers) due to shared library loading/unloading or lazy memory instantiation, that's an issue but it does not support the article's implication that TLS is ok while FLS is not.
So the compilation issue is only with TLS; a hypothetical FLS would not have it.
The alternative to FLS is not storing duplicated data, but using TLS and making sure that the compiler does not miscompile your code, either because you are extremely careful or because the compiler provides an option for it. Unfortunately compiler writers do not really see a future in stackful coroutines and are not really willing to provide a 'coroutine-safe' TLS mode.
Regarding the switch overhead, the fastest switch routine I have implemented does not store or load anything from any context-specific struct, only swaps a couple of registers, so an extra load might have a cost (trivial for async io, but can be noticeable for generators). Also, the way TLS is implemented on x86-64 for example is via a segment register which would need to be changed at every coroutine switch; updating it is significantly more expensive compared to a normal register [1].
Finally, as far as I know, the kernel does not use thread-local/fiber-local storage; instead it relies heavily on per-cpu data. I advocate the same solution for coroutines, where, if you use a 1:1 mapping from OS threads to cores, the per-cpu data would correspond to the TLS of the underlying OS thread.
[1] Agner reports a reverse throughput of ~13 cycles on Nehalem. I think he stopped testing on newer architectures as it is a rarely used instruction; it needs to be expensive because it is not just a register load but also has to update shadow limit registers. It probably has second-order effects on the TLB as well.
> Finally, as far as I know, the kernel does not use thread local/fiber local storage, instead heavily relies on per-cpu data.
It does use FLS. Everywhere you see "current->" or "current_thread_info()" in the Linux kernel, it's FLS, not per-cpu.
There's a lot of those, some of them wrapped in other macros and inlines. "current" refers to the current task, which is the kernel's name for what the article calls fibers.
Per-cpu data is used in the kernel as well, and I agree with you about using the analogous thing for coroutines, which is TLS, for certain things. Indeed, an FLS implementation almost certainly uses TLS for its root-of-data pointer and for keeping track of the current thread's coroutine queue.
> Agner reports a reverse throughput of ~13 cycles on Nehalem. [...] it is not just a register load but it also has to update shadow limit registers
You're thinking of a segment register load, but these days we would use the WRFSBASE or WRGSBASE instructions, which don't update the segment descriptor (shadow limit etc) and don't use the GDT/LDT.
Windows has supported this for userspace scheduling for a long time (but Windows has good fiber support with native FLS anyway), Java coroutines want to use it for performance on Linux, and Linux kernel looks like it will enable this soon; the CPUs have supported it for years.
When those instructions aren't available, since the "current coroutine" is almost certainly a TLS variable itself that must be updated on context switch, use that as the base for FLS too. Then FLS access is one extra level of pointer indirection on access, but still hoisted and cached. For portable coroutines-and-threads code, use a portable TLS variable or pthread_getspecific() for this.
This is written with a focus on the game development audience, but I'm really excited for green threads to be coming full circle on Java with Loom. (Green threads were in Java 1.1, then were quickly removed in favor of native threads.) A lot of the work described in the article is handled once in the JVM instead of each application dealing with it separately, and debugging tools already work with it (except UI displays not used to showing 1000s of threads at once). You still have to deal with the normal concurrency issues you always have, but how you approach them in the language isn't really any different than in the past.
It's interesting to see a game use case via linked pdf. They increased concurrency by overlapping the processing for two frames (alternating gameplay and rendering code) because there were too many idle cycles when doing one frame at a time (which took 25ms instead of ~16ms for 60fps). This is after they had made all the jobs run as concurrently (and parallel on SPUs) as they could and reducing locking as much as they could per frame.
// Make 128 byte scratch space for the Red Zone. This arithmetic will not unalign
// our stack pointer because 128 is a multiple of 16. The Red Zone must also be
// 16-byte aligned.
sp -= 128;
Isn't the red zone normally implicitly below the sp? I'm not sure why you would need to explicitly subtract the size here.
EDIT - the following answer is based on an incorrect understanding of the red zone. See the reply to this for a correction. I shall leave it here as a monument to my shame.
Yes, so since the stack grows downwards, we subtract the size of the red zone from the top of the stack (represented by the stack pointer), moving the new stack pointer down.
This "grows" the stack (downwards remember) by the size of the red zone. Future allocations on the stack bump the pointer further downwards to grow the stack further. This leaves the 128 bytes of the red zone always below the stack.
EDIT: to clarify the red zone is below the stack pointer conceptually (below the stack) but higher in memory actually (since the stack grows downwards).
No. The red-zone is in the unused area beyond the current stack pointer that the stack will grow into. It moves with the current stack pointer. The author of the article is incorrect.
Oh damn, I totally misremembered, thank you for correcting me! Kinda dumb of me, because even thinking for a second about what the red zone is for would make it obvious why I'm wrong.
Pinning to hardware threads is most certainly not "always" better with fibers. I can say this with certainty after very thorough performance testing across dozens of architectures and dozens of workloads. The difference can be dramatic, and it's difficult to determine in advance which will be faster (heuristics are necessary).
> Pinning to hardware threads is most certainly not "always" better with fibers.
The article is recommending pinning the worker (N) threads to cores, not pinning fibres (M) to cores.
If it makes sense to schedule a fibre on a different core to the last one it ran on (perhaps because the data it needs is already on the other core) then you do that yourself when you pick which N to run it on, but if the N threads weren't pinned how would you keep track of which is the right one to run on to get your fibre onto a particular core? What would you do if you selected the right N for a core... but then it got moved when you went to actually run it? That's why we pin.
I was talking about pinning workers, not fibers. Empirically, it is often better for performance to allow the OS to manage the hardware threads. Sometimes it's worth pinning to a NUMA node, sometimes to a core, rarely to a CPU. Just reporting our results for many thousands of hours across a wide spectrum of test environments.
I can understand it being counterintuitive, believe me it has been a shock. But it's true.
If your scheduler knows about data dependencies between tasks, such as in a dataflow model, or maybe an actor model, then you can try to run tasks that depend on data produced by a worker on the same core that the producer ran on, to catch data already in that core's cache.
Technically, "concurrency", nowadays, refers to the apparatus and overhead involved in coordinating parallel activity. The more concurrency you are involved in, the less parallelism you get.
It is an odd choice, but it appears to have stabilized.
I use the word parallelism to talk about multiple processors simultaneously cooperating on solving a single task for the purpose of making the time to complete it shorter (i.e. its latency is reduced).
I use concurrency to describe the problem of simultaneously managing multiple mostly-independent tasks that are competing for hardware resources and performed on one or more processors. The performance we care about is throughput, not latency, and increasing concurrency helps throughput due to Little's law (https://inside.java/2020/08/07/loom-performance/).
> "concurrency", nowadays, refers to the apparatus and overhead involved in coordinating parallel activity
Are you saying "highly concurrent" would mean "lots of unnecessary overhead"? I strongly dispute that is the widely accepting meaning.
My understanding of concurrency vs parallelism (this is also the definition implicit in the article):
* Parallel: n processes running on n physical processors
* Concurrent: n processes running on m≤n (possibly m=1) physical processors due to preemptive or cooperative multitasking (the latter is common in async frameworks). (Edit: Changed m<n to m≤n. To be clear: parallelism is a type of concurrency.)
A system with a lot of concurrency might not have any parallelism but still have less overhead than a parallel system e.g. in the case the processes are IO bound rather than CPU bound and need to communicate between each other a lot.
Concurrency - computer science:
In computer science, concurrency is the ability of different parts or units of a program, algorithm, or problem to be executed out-of-order or in partial order, without affecting the final outcome. This allows for parallel execution of the concurrent units, which can significantly improve overall speed of the execution in multi-processor and multi-core systems. - Wikipedia
If you start delving into how, you quickly lose the overall meaning of the concept over to specific implementations.
That confirms my point that "concurrency" doesn't mean anything close to what the grandparent comment says, which claimed the concurrency is to do with overhead of parallelisation.
What's more, it all mostly agrees with what I was saying. Look at this bit in particular:
> This allows for parallel execution of the concurrent units
Notice how it uses "parallel" to mean that the concurrent units (to use the quoted terminology) are actually executing at the same time, exactly as my previous comment says. OK, it avoided talking about "physical processors" or "cores", but that's what this boils down to on practical digital computers. Maybe the wording in my last comment wasn't 100% perfect, and there are some slight differences on details (e.g. the quoted Wikipedia text only seems concerned with processing-bound units of execution, not IO-bound waiting) but it's clearly talking about the same two definitions.
Concurrency is a design/architecture concern. It's a property of the code: are parts of it independent of time with respect to each other so that they can execute out of order (including interleaved in time slices.)
Parallelism is a run-time concern. Are parts of the code executing literally simultaneously?
You're right. I do remember now having discussed that more than 30 years ago with someone who actually owned an Amiga then. I also remember my puzzlement when I learned that Oberon chose to support only cooperative scheduling which seemed like a step back then already.
Yes, since then we've moved away from cooperative multitasking, especially in cases where we needed low latency, such as in UIs. It is therefore a bit strange that the pattern has resurfaced lately, and has been heavily used in the implementation of UIs and webservers ...
You could argue that in small systems (compared to OSes), you can keep latency under control, but systems tend to grow, and modularity of the systems works against this (modules don't need to know what other modules do or how long they take to finish their tasks).
Cooperative multithreading has lower overhead and latency than preemptive multithreading. I don't know what you are talking about.
The problem with cooperative multithreading is that you have to yield manually which is a disaster for scheduling random applications but not a problem when used inside a single application where it is isolated from everything else.
Cooperative multithreading might have lower overhead, as it saves the trips to and from kernel space when handled by the application, which might be significant when there is a high rate of thread scheduling events. I don't see how it improves on latency, i.e. the delay between an external stimulus and the expected reaction to it. The maximum latency for cooperative multithreading is in the general case unbounded, and hence unsuitable for low-latency applications, much less for real-time applications.
I'm not sure if this is what the parent was referring to, but one way that it might have lower latency is if you have multiple strands / threads dealing with different bits of the system, with queues to pass messages between them. In a truly threaded application, handing data from one to another using a queue requires the new thread to be scheduled by the OS (unless the target thread is so busy that it pops the item off the queue immediately after handling the previous item). In a single-threaded async design, if the workload is light then, after the first strand adds an item to the queue and yields control, the second strand might pop the item off and deal with it immediately in the same time slice.
I absolutely get that (and get that not everyone does). Meanings evolve over time. Or are understood to mean different things to different people (an example I like is the almost opposing meanings of "implies" between mathematicians and mainstream English speakers). You can even use words to mean something different from the usually-accepted meaning, so long as you're clear about your own definitions up front.
But you specifically claimed that your definition of "concurrency" is the main accepted one:
> "concurrency", nowadays, refers to ...
> ... it appears to have stabilized
This is the part I disagree with. My understanding of the word "concurrency", the Wikipedia definition that someone quoted, and the meaning in the article all align, and are all different from the meaning you gave (if I haven't misunderstood it, which admittedly is very possible). So it seems very unlikely to me that we all hold the minority understanding while yours is the mainstream.
Terminology's job is to help us think, and I am distinctly underwhelmed by the track record of the "concurrency vs. parallelism" terminology. It sets up an absolute distinction for something that is almost always on a gradient, and it seems to abstract away some of the most important details to concentrate on less important details. In particular, the problem itself often tends to pick whether you're "concurrent" or "parallel", which means it's not a useful metric to analyze possible solutions with, since it doesn't change based on your potential solution, because it doesn't provide a gradient to descend.
I find it much more productive to consider "what resources do I have" vs "what resources is my code consuming". I think this focuses on the real questions, which are not "is this more about parallelism or more about concurrency?" but: do I have enough CPU cores for what I need? Am I using GPU resources? Can I use one or the other more effectively for this problem? Is this algorithm disk or RAM limited? etc. etc. In addition to simply subsuming all the considerations about "parallel vs. concurrent", this mindset raises questions like that last one that aren't even related to that question, and more directly gets at the real issues with making code run well on the real machines you have. This does provide a gradient to descend; "oh, I'm thrashing memory, I can fix it with this... I can move this CPU task to the GPU... I can trade this CPU expense to reduce memory load...", etc.
When I aim for parallelism, I seem to fare a lot worse than when I aim for concurrency. (In the parallel case, I hit Amdahl's law hard and my throughput for 1, 2, 4 cores goes something like 1, 1.8, 2.4.)
> and it’s not what you’re interested in doing if you actually want to take advantage of hardware threads.
This is not always correct. Coroutines don't necessarily have to be the only mechanism for splitting work within an application. A master process can distribute work among several worker processes that then use coroutines internally.
This still takes advantage of hardware multithreading capabilities with a low number of separate processes and then further splits the work up on several coroutines in each thread.
This article is a bit mixed, audience-wise. If you need the first section telling you what userspace M:N scheduling is, you really shouldn't be using the second part to implement a runtime for it. You can run into any number of subtle issues that will manifest as weird behaviour and be very hard to debug.
If you have to use C/C++ use a library that has done it for you. If you don't then you can use Rust, Go or Erlang/Elixir all of which have excellent implementations.
Meh I don't like the absolutism in this comment, personally. I'm well versed in coroutines and "fibers" and all of the theory surrounding it but haven't come across the formalized terminology (m:n, n:1, etc.).
I'll give you another example: ropes are string classes with more complex data storage, useful for optimizing mutations and cutting down copies. The concept is easy to understand, and personally I've implemented them a number of times, but didn't know about the formalized term until the last year or so.
The last sentence is a bit unfair and presumptuous, I think.
Perhaps it was a little absolutist. Maybe it would be better written like this:
If you're not familiar with implementing schedulers and concurrent systems then beware that implementing fibers from scratch is likely to result in some subtle and hard to debug problems.
If you have to use C/C++ you may well get better results using a library. If you aren't tied to the language then Erlang/Elixir, Go and Rust all have excellent implementations that have more language support.
If you are not familiar with implementing schedulers and concurrent systems but you are interested, do it! It is fun and a great learning experience. It doesn't mean you have to deploy it to prod; it can be a personal project.
This is an excellent point. Experimentation is always a good idea. I have just got cynical and conservative about what I choose for the basis of production systems.
Actually for full disclosure I should say that I am currently building a telecomms system in Elixir. I find the decades of battle testing in the BEAM reassuring.
I don't agree. I'm not used to implementing schedulers and concurrent systems, and I wrote my own fiber system in C a couple of months back (https://gitlab.com/dosshell/tinyfiber) after listening to the famous talk from Christian (mentioned in the article).
It's a bit messy, sure. But I don't think it has any subtle bugs, and it was not hard to debug. Performance is good too.
What I want to say is: it is easy to do, even I can :P