Hacker News new | past | comments | ask | show | jobs | submit login
Exploiting Coroutines to Attack the “Killer Nanoseconds” [pdf] (vldb.org)
57 points by mpweiher 5 months ago | hide | past | web | favorite | 15 comments

This title was confusing, the terms "exploiting" and "attack" made me expect another Hyperthreading-related security exploit (though the paper is actually a performance optimization)

Gor Nishanov, one of the authors, talks about this in his CppCon2018 talk https://youtu.be/j9tlJAqMV7U (jump to ~12:33 to skip background).

It amounts to using CPU prefetch instructions with C++ coroutines to simulate hyperthreading in software by scheduling instructions around cache misses (but is potentially better than hyperthreads because it's not limited to 2/core)

I haven't read the article or video. But regarding prefetching triggerd by software (programmer) instead of by the hw, this is a very informative read, with a lot of numbers and tests/profiles:



Your links show with benchmarks that software prefetching is not always useful, the example being a for loop to traverse a linked list in the Linux kernel.

However, the article at hand shows with benchmarks that software prefetching can be very beneficial in common algorithms such as hash probe, binary search, Masstree and Bw-tree, even when concurrency is implemented in a straight-forward way using (stackless) coroutines.

The links are not only about a loop. The general conclusion is that making better informed decisions about whether to prefecth or not is very hard and that the CPU will have most of the time way more information to make a good informed decision. It also says that unless you proove by benchmarks that it makes sense, it is probably wrong.

Now. This is of course a general case. If you control the whole algo and data structures during the execution, a well crafted prefectch /can/ be beneficial. Again, the general idea of the links I posted is that /generally/ the CPU has more info of the /overall/ system state to make a correct prefetching choice. I think that info/links are usefull/interesting even if they do not apply to the specific case in TFA.

Clarifying a bit more: I didn't post that to contradict the article but just to provide a bit of related info.

> but is potentially better than hyperthreads because it's not limited

But also potentially worse because hyperthreads schedule only on actual memory waits, whereas this approach puts a suspension point after each prefetch whether or not the target is actually in cache.

Does anyone know about earlier work similar to "asynchronous memory access chaining"? Sounds like there must have been stuff done in eg. VLIW static scheduling. And in compilers for highly memory-parallel architectures like Tera MTA. And in GPU compilers.

I have a recollection that some static scheduling compilers could run

This is a very cool idea, and I wonder if eventually we'll see new tooling spring up to support this.

Running coroutines like this seems like it would potentially create more cache pressure. Is the idea that you'll execute instructions you already have loaded, so you won't incur any cache misses for instructions?

It is more about memory pressure than instruction pressure - the idea is that when an algorithm finds itself having to fetch something from memory, it can issue a prefetch, then take a step back and handle other requests while waiting for the result.

How this would fare in a real world scenario that isn't a benchmark but in a system that is also doing other stuff at the same time is anyones guess though.

Their benchmark is close to what a relational database does in the real world. They also cite previous work "Interleaving with Coroutines: A Practical Approach for Robust Index Joins", which implements this succesfully in the SAP HANA column store.

I don't think it matters what else the system is doing as long as a complete core is available for running this.

Cheap-enough coroutines seem to be new in C++ but what about other languages such as Go, Rust, Haskell?

Go has channels which are much more like fibers. Full threads of their own with actual stacks (which can be moved in Go).

If C++ or Rust ever get coroutines they will be more low-level. Just a control flow construct which compiles coroutines down to state machines. So it's concurrency, not parallelism. High performance for one thread, but less powerful conceptually.

Rust used to have segmented stacks, but they removed it because performance was unpredictable.

Rust seems to have stackless coroutines as an experimental language feature: https://doc.rust-lang.org/nightly/unstable-book/language-fea...

The Intel IXP Microengine (Now Netronome?) architecture supported this explicitly. You could schedule a memory read explicitly, do other things, then come back. Each core was also 8-way multi-threaded, and you could yield explicitly to the next thread.

This could also be described as hyperthreading in software

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact