It amounts to using CPU prefetch instructions with C++ coroutines to simulate hyperthreading in software by scheduling instructions around cache misses (but is potentially better than hyperthreads because it's not limited to 2/core)
However, the article at hand shows with benchmarks that software prefetching can be very beneficial in common algorithms such as hash probing, binary search, Masstree and Bw-tree, even when the concurrency is implemented in a straightforward way using (stackless) coroutines.
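As a rough illustration of the idea (not the article's actual code): here is the interleaving trick for binary search, with a hand-rolled state machine standing in for the stackless coroutines. Several searches advance in lock-step, and each one issues a prefetch for its next probe before "yielding" to the others. `__builtin_prefetch` is the GCC/Clang builtin; the struct layout and function are made up for this sketch.

```cpp
#include <cstddef>
#include <vector>

// Sketch only: interleave several binary searches so that one search's
// cache miss overlaps with useful work on the others. A hand-rolled
// state machine stands in for the stackless coroutines; the
// "suspension point" is right after each prefetch.
struct Search {
    std::size_t lo, hi;   // current half-open range [lo, hi)
    int key;
    long result = -1;     // index of key, or -1 if absent
    bool done = false;
};

void interleaved_search(const std::vector<int>& sorted,
                        std::vector<Search>& group) {
    std::size_t remaining = group.size();
    for (auto& s : group)            // issue the first prefetch for every search
        if (s.lo < s.hi)
            __builtin_prefetch(&sorted[s.lo + (s.hi - s.lo) / 2]);
    while (remaining > 0) {
        for (auto& s : group) {      // round-robin over the group
            if (s.done) continue;
            if (s.lo >= s.hi) { s.done = true; --remaining; continue; }
            std::size_t mid = s.lo + (s.hi - s.lo) / 2;
            int v = sorted[mid];     // prefetched on the previous pass
            if (v == s.key) {
                s.result = (long)mid; s.done = true; --remaining; continue;
            }
            if (s.key < v) s.hi = mid; else s.lo = mid + 1;
            if (s.lo < s.hi)         // prefetch the next probe, then
                __builtin_prefetch(&sorted[s.lo + (s.hi - s.lo) / 2]);
        }                            // "suspend": move to the next search
    }
}
```

The point is that by the time the round-robin loop returns to a given search, its earlier prefetch has (hopefully) already pulled the next element into cache.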
Now, this is of course the general case. If you control the whole algorithm and its data structures during execution, a well-crafted prefetch /can/ be beneficial. Again, the general idea of the links I posted is that /generally/ the CPU has more info about the /overall/ system state to make a correct prefetching choice. I think that info/those links are useful/interesting even if they don't apply to the specific case in TFA.
Clarifying a bit more: I didn't post that to contradict the article but just to provide a bit of related info.
But also potentially worse, because hyperthreads switch only on actual memory waits, whereas this approach puts a suspension point after every prefetch whether or not the target is already in cache.
I have a recollection that some static scheduling compilers could run
Running coroutines like this seems like it would potentially create more cache pressure. Is the idea that you'll execute instructions you already have loaded, so you won't incur any cache misses for instructions?
How this would fare in a real-world scenario, not a benchmark but a system that is also doing other things at the same time, is anyone's guess though.
I don't think it matters what else the system is doing as long as a complete core is available for running this.
When C++ or Rust get coroutines, they will be more low-level: just a control-flow construct that compiles coroutines down to state machines. So it's concurrency, not parallelism. High performance for one thread, but less powerful conceptually.
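To make the "compiles down to state machines" point concrete, here's a hypothetical hand-lowering of a tiny coroutine with two suspension points. Everything here is invented for illustration: the coroutine "frame" is just a struct holding the live locals plus an explicit resume index.

```cpp
// Illustrative sketch: what a coroutine body like
//     co_yield x;
//     co_yield x * x;
// conceptually lowers to. No stack is saved; the frame is a plain struct.
struct Lowered {
    int state = 0;  // which suspension point we're parked at
    int x;          // the one live local
    explicit Lowered(int x_) : x(x_) {}
    // Runs to the next suspension point; returns false once finished.
    bool resume(int& out) {
        switch (state) {
        case 0: out = x;     state = 1; return true;
        case 1: out = x * x; state = 2; return true;
        default:             return false;
        }
    }
};
```

Many such objects can be interleaved on one thread with no stack switching, which is exactly why this is concurrency rather than parallelism.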
Rust used to have segmented stacks, but they were removed because performance was unpredictable.