
RAM-less Buffers - emily-c
https://emsea.github.io/2017/12/31/register-buffer/
======
Karliss
Splitting inline asm into multiple blocks as shown in the article isn't a
reliable way of using inline asm. Unless registers are listed in the clobber
list, the compiler isn't aware of which registers the inline asm uses and may
use them for other purposes. That can easily break: the compiler may overwrite
them between inline asm blocks, or the other way around, the inline asm may
overwrite values the compiler stored there.

~~~
emily-c
This is true and is precisely why it doesn't work without modification at
higher optimization levels. I was originally going to write this article
entirely in assembly, but it seems like nobody wants to read an assembly dump!
Even though it is malpractice to write it this way, the code in the article is
still very clear (I hope) even to people who don't know assembly that well,
and it keeps the concept front and center. I will call this out in the
article.

~~~
lukego
Thanks for writing this. My mind is buzzing now with ideas for how I could
reserve and misuse a vector register :).

------
DiabloD3
This is really neat, but when a thread/process state enables the MMX/SSE/AVX
registers, it increases the cost of context switching (on Linux and other
OSes), because the kernel only restores registers it knows could have been
used (i.e., if the process has not executed any MMX/SSE/AVX ops yet, the
kernel does not save/restore the XMM/YMM/ZMM registers).

There have been cases where autovectorization in compilers produced good SIMD
versions of code (still considered a hard thing to do in some cases) that were
nonetheless slower in benchmarks with high thread counts (ergo, absurd amounts
of CPU time wasted on context switching), while beating the non-vectorized
version in less loaded situations.

------
gruez
what's the point of using xmm registers when they'll get clobbered (or have to
be moved back to memory) whenever you need to do a floating point operation?
at least with ymm/zmm you can hope that you're not using AVX.

>The reason why we won't be considering other operating systems is because the
System V ABI doesn't preserve any of the XMM registers between calls and puts
the burden on the caller to save them on the stack. If you think about it,
this sort of defeats the purpose of using a register buffer if we're always
going to be pushing our bytes to memory in user space.

as opposed to windows? regardless of whether it's the caller/callee's job to
preserve registers, the result is the same.

~~~
phire
_> what's the point of using xmm registers when they'll get clobbered (or have
to be moved back to memory) whenever you need to do a floating point
operation? at least with ymm/zmm you can hope that you're not using AVX._

Nope, because VEX-encoded SSE operations (which compilers emit when targeting
AVX) are explicitly defined as zeroing the upper bits of the ymm/zmm
registers.

 _> as opposed to windows? regardless of whether it's the caller/callee's job
to preserve registers, the result is the same._

There is a huge difference. With caller-saved registers, the caller must
compulsively save every register it's using to the stack before it calls a
function. With callee-saved registers, if the callee doesn't use a register,
then it doesn't need to save it, and a bunch of extra push/pops are avoided.

It is optimal to have a mixture of caller- and callee-saved registers, so the
compiler can pick which type to use for each function/variable.

------
cybergoat
This is rather clever!

What immediately came to mind was that perhaps you want to keep a secret key
from ever entering memory. If this were done in kernel mode, you would be able
to disable interrupts/task switching, execute the "secret" stuff, and continue
on your merry way...

------
kstenerud
Seriously, why all the sourpuss comments? The guy shows a fun little way to
potentially use these registers in an innovative fashion and you jump all over
him!

You do understand that toys like this are meant to stimulate the mind towards
different ways of thinking, right? Great things are born from kernels of
innovative thought.

HN has really gone downhill...

~~~
eecc
Guy is called Emily... my guess is it’s a gal

------
cjbprime
Thanks, interesting!

------
brownmenace
benchmarks plz

~~~
jsjohnst
I think you’re missing the point of this post. This isn’t a general purpose
technique they are advocating, but rather an interesting method that might
prove useful to someone.

~~~
kindfellow92
I think you’re missing the point. This is only useful if there is actually a
measurable benefit.

~~~
emily-c
I think it's useful in that it gives you a weird trick to think about :)

~~~
Dylan16807
It's definitely good to be aware of what tricks you have. Sometimes you need
to do weird things like treat memory as hostile; see this article about
Chrome sandboxing: https://lwn.net/Articles/347547/

> So that's what we do: each untrusted thread has a trusted helper thread
> running in the same process. This certainly presents a fairly hostile
> environment for the trusted code to run in. For one, it can only trust its
> CPU registers - all memory must be assumed to be hostile. Since C code will
> spill to the stack when needed and may pass arguments on the stack, all the
> code for the trusted thread has to [be] carefully written in assembly.

> The trusted thread can receive requests to make system calls from the
> untrusted thread over a socket pair, validate the system call number and
> perform them on its behalf. We can stop the untrusted thread from breaking
> out by only using CPU registers and by refusing to let the untrusted code
> manipulate the VM in unsafe ways with mmap, mprotect etc.

(I don't know if that technique is still used)

~~~
kindfellow92
This trick doesn’t actually keep the data out of RAM. A single context switch
is enough.

~~~
Dylan16807
Context switches put it into _kernel_ memory, not _process_ memory. That's
safe.

------
emerged
As an engineer who has spent thousands of hours programming SIMD assembly --
using them as buffers which can then be operated upon in batch is literally
the entire reason they exist.

So this reads to me as a programmer who isn't familiar with SIMD discovering
one small part of why it exists and then writing an article about that as if
it was a new idea outside the scope of normal SIMD usage.

It's nice that this may give exposure into some low level details to those
unfamiliar, but it isn't an innovation.

~~~
emily-c
Innovation? That's quite a stretch. Using them together as one contiguous
buffer is not the reason they exist. Not quite sure how you read the article
the way you did, but I have added some extra clarification that this is /not/
how they were designed to be used, as well as another example.

~~~
emerged
Of course they are. You load the registers with data from RAM. You permute the
entire array and/or insert/extract individual elements. They are in
fundamental nature a contiguous array. Most often you are literally loading
them with a contiguous array from memory so you can work with it more
efficiently before writing it back to memory.

Downvote me as much as you want for saying so but it is the literal reason why
the registers and instructions exist.

~~~
joshAg
Aren't they designed specifically for SIMD instructions though? This seems to
be more just using them as another cache layer and not using them for their
intended-SIMD use.

~~~
emerged
I'm baffled by the downvotes that I'm receiving / why this is a controversial
opinion. When you program SIMD, the registers are literally a temporary cache
so that you can work with the data without touching RAM. The instructions are
designed to load/store/manipulate these cached buffers.

I didn't mean any offense to OP, if that's the cause of the bad response. I've
been using these instructions for well over a decade, and the description
given is, without exaggeration, the precise reason they exist and is what I've
always used them for.

~~~
joshAg
Ok, I'll try to explain it since you replied to me, even though I didn't
downvote you.

The purpose of XMM registers is SIMD instructions, which implies that the
registers store multiple atoms and that the same instruction is performed on
each atom in the buffer at once.

The interesting bit of this little hack is how it breaks that expected
data-level parallelism and instead provides a (relatively) high-level
interface for the new usage. It has nothing to do with shuffling data to and
from those registers.

You're getting downvoted because your initial comment ignored the bit that
used the registers in a way that wasn't intended and focused on the bits that
were the same as the intended usage.

And then you're being condescending about the points you're trying to make.
You might not have intended to be condescending, but you are being
condescending nonetheless.

~~~
emerged
You know what, at this point I have no incentive to engage further, since my
attempt to elucidate was also downvoted. Each post that I submit is a further
cost. Such a silly game.

Let those who haven't worked with SIMD reject an authentic attempt from a vet
to elucidate, err, and be happy with that click. Whatever.

If you can drain my remaining few hundred karma I'll have an excuse to stop
clicking another Bitcoin rehash article every day and will thank you.

~~~
joshAg
I suggest optimizing for understanding over hacker news karma points, but to
each their own.

You're not being downvoted for a technical issue.

~~~
emerged
I'm clearly not optimizing for karma, since I have literally invited downvotes
at this point. I'm accepting the karmic expense of making meta commentary.

