
Exploiting Coroutines to Attack the “Killer Nanoseconds” [pdf] - mpweiher
http://www.vldb.org/pvldb/vol11/p1702-jonathan.pdf
======
jimbo1qaz
This title was confusing: the terms "exploiting" and "attack" made me expect
another hyperthreading-related security exploit (though the paper is actually
a performance optimization).

------
akubera
Gor Nishanov, one of the authors, talks about this in his CppCon2018 talk
[https://youtu.be/j9tlJAqMV7U](https://youtu.be/j9tlJAqMV7U) (jump to ~12:33
to skip background).

It amounts to using CPU prefetch instructions with C++ coroutines to simulate
hyperthreading in software by scheduling instructions around cache misses (but
it is potentially better than hyperthreads because it's not limited to 2
threads/core).
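To make the mechanism concrete, here is a hand-rolled sketch of the interleaving idea (my own illustration, not code from the paper or talk, and without the real C++20 coroutine machinery): several pointer-chasing probes are kept in flight at once, and each issues a prefetch for its next element before yielding its turn to the others, so the prefetches overlap the cache misses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One in-flight "coroutine", flattened into a plain struct: its locals live
// here instead of on a stack, as a real stackless coroutine would lower them.
struct Probe {
    std::size_t idx;   // current position in the table
    int steps;         // remaining pointer-chase steps
    std::uint64_t acc; // running result
};

// Follow `steps` indirections through `table` for each probe, interleaving
// the probes round-robin so each prefetch has time to land while the other
// probes execute. (__builtin_prefetch is a GCC/Clang builtin.)
std::uint64_t chase_interleaved(const std::vector<std::size_t>& table,
                                std::vector<Probe> probes) {
    std::uint64_t total = 0;
    std::size_t live = probes.size();
    while (live > 0) {
        for (auto& p : probes) {
            if (p.steps == 0) continue;
            p.acc += p.idx;
            p.idx = table[p.idx];              // pointer-chase step (the miss)
            __builtin_prefetch(&table[p.idx]); // hint for this probe's next round
            if (--p.steps == 0) {
                total += p.acc;
                --live;
            }
        }
    }
    return total;
}
```

With a single probe this degenerates into an ordinary (stalling) pointer chase; the win only appears with enough probes in flight to cover the memory latency.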

~~~
int0x80
I haven't read the article or video. But regarding prefetching triggered by
software (the programmer) instead of by the hardware, these are very
informative reads, with a lot of numbers and tests/profiles:

  https://lwn.net/Articles/444336/

  https://lwn.net/Articles/444346/
~~~
tuukkah
Your links show with benchmarks that software prefetching is not always
useful, the example being a for loop to traverse a linked list in the Linux
kernel.

However, the article at hand shows with benchmarks that software prefetching
can be very beneficial in common algorithms such as hash probe, binary search,
Masstree, and Bw-tree, even when concurrency is implemented in a
straightforward way using (stackless) coroutines.
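For the binary search case, a rough hand-rolled sketch (my own illustration, not the paper's code) of why it benefits: running a batch of searches in lockstep lets each one prefetch its next midpoint while the others make progress, instead of stalling on every level of the tree of midpoints.

```cpp
#include <cstddef>
#include <vector>

// Lower-bound binary search over `sorted` for every key in `keys`, with the
// searches interleaved. Each search advances one level per round and then
// prefetches its next midpoint before handing control to the next search.
std::vector<std::size_t> batch_lower_bound(const std::vector<int>& sorted,
                                           const std::vector<int>& keys) {
    struct State { std::size_t lo, hi; };
    std::vector<State> st(keys.size(), State{0, sorted.size()});

    bool active = true;
    while (active) {
        active = false;
        for (std::size_t k = 0; k < keys.size(); ++k) {
            auto& s = st[k];
            if (s.lo >= s.hi) continue;  // this search already finished
            std::size_t mid = s.lo + (s.hi - s.lo) / 2;
            if (sorted[mid] < keys[k]) s.lo = mid + 1;
            else s.hi = mid;
            if (s.lo < s.hi) {
                // Hint the next midpoint; the miss overlaps the other searches.
                __builtin_prefetch(&sorted[s.lo + (s.hi - s.lo) / 2]);
                active = true;
            }
        }
    }
    std::vector<std::size_t> out(keys.size());
    for (std::size_t k = 0; k < keys.size(); ++k) out[k] = st[k].lo;
    return out;
}
```

The paper's coroutine version expresses the same interleaving with `co_await` at the suspension point rather than this explicit lockstep loop.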

~~~
int0x80
The links are not only about a loop. The general conclusion is that making
well-informed decisions about whether to prefetch or not is very hard, and
that most of the time the CPU will have far more information with which to
make a good decision. They also say that unless you prove by benchmarks that
it makes sense, it is probably wrong.

Now, this is of course the general case. If you control the whole algorithm
and data structures during execution, a well-crafted prefetch /can/ be
beneficial. Again, the general idea of the links I posted is that /generally/
the CPU has more info about the /overall/ system state to make a correct
prefetching choice. I think that info/those links are useful/interesting even
if they do not apply to the specific case in TFA.

Clarifying a bit more: I didn't post that to contradict the article, just to
provide a bit of related info.

------
fulafel
Does anyone know of earlier work similar to "asynchronous memory access
chaining"? Sounds like there must have been stuff done in e.g. VLIW static
scheduling. And in compilers for highly memory-parallel architectures like the
Tera MTA. And in GPU compilers.

I have a recollection that some static scheduling compilers could run

------
brian-armstrong
This is a very cool idea, and I wonder if eventually we'll see new tooling
spring up to support this.

Running coroutines like this seems like it would potentially create more cache
pressure. Is the idea that you'll execute instructions you already have
loaded, so you won't incur any cache misses for instructions?

~~~
yxhuvud
It is more about memory pressure than instruction pressure - the idea is that
when an algorithm finds itself having to fetch something from memory, it can
issue a prefetch, then take a step back and handle other requests while
waiting for the result.

How this would fare in a real-world scenario that isn't a benchmark, but a
system that is also doing other stuff at the same time, is anyone's guess
though.

~~~
tuukkah
Their benchmark is close to what a relational database does in the real world.
They also cite previous work, "Interleaving with Coroutines: A Practical
Approach for Robust Index Joins", which implements this successfully in the
SAP HANA column store.

I don't think it matters what else the system is doing as long as a complete
core is available for running this.

------
tuukkah
Cheap-enough coroutines seem to be new in C++, but what about other languages
such as Go, Rust, or Haskell?

~~~
AlexanderDhoore
Go has goroutines, which are much more like fibers: full threads of their own
with actual stacks (which can be moved in Go).

If C++ or Rust ever get coroutines, they will be more low-level: just a
control-flow construct which compiles coroutines down to state machines. So
it's concurrency, not parallelism. High performance for one thread, but less
powerful conceptually.
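To illustrate what "compiles down to a state machine" means, here is a hand-written sketch (my own example, not actual compiler output) of a little generator coroutine lowered into a struct: its local variable lives in the object instead of on a stack, and a state tag records the resume point.

```cpp
#include <optional>

// A countdown generator that with coroutine support you could write as
// "while (n > 0) co_yield n--;". Lowered by hand into a state machine:
struct CountdownCoro {
    int n;                                  // the coroutine's only local
    enum State { Running, Done } state = Running;

    explicit CountdownCoro(int start) : n(start) {}

    // Each call runs the body until the next yield (or until completion).
    std::optional<int> resume() {
        if (state == Done) return std::nullopt;
        if (n > 0) return n--;              // "co_yield n"
        state = Done;                       // fell off the end of the body
        return std::nullopt;
    }
};
```

No heap, no separate stack: suspending is just returning from `resume()`, which is why this style of coroutine can be cheap enough to wrap a single prefetch.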

Rust used to have segmented stacks, but they were removed because the
performance was unpredictable.

~~~
tuukkah
Rust seems to have stackless coroutines as an experimental language feature:
[https://doc.rust-lang.org/nightly/unstable-book/language-features/generators.html](https://doc.rust-lang.org/nightly/unstable-book/language-features/generators.html)

------
busterb
The Intel IXP Microengine (Now Netronome?) architecture supported this
explicitly. You could schedule a memory read explicitly, do other things, then
come back. Each core was also 8-way multi-threaded, and you could yield
explicitly to the next thread.

------
fanf2
This could also be described as hyperthreading in software.

