RAM-less Buffers (emsea.github.io)
123 points by emily-c 10 months ago | 38 comments

Splitting inline asm into multiple blocks as shown in the article isn't a reliable way of using inline asm. Unless registers are listed in the clobber list, the compiler isn't aware of which registers the inline asm uses and may use them for other purposes. That can easily break, either by the compiler overwriting them between inline asm blocks or the other way around, with the inline asm overwriting values that the compiler stored there.
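To make the clobber-list point concrete, here is a minimal sketch (not the article's code) that assumes GCC or Clang extended asm on x86-64. The names `roundtrip_through_xmm` and `roundtrip_demo` and the choice of XMM7 are purely illustrative. Both moves sit in a single asm statement and XMM7 is declared as clobbered, so the compiler knows not to keep anything of its own there:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: moves a 64-bit value into XMM7 and back again.
 * Both moves live in ONE asm statement, so the compiler never gets a
 * chance to reuse XMM7 in between, and the clobber list tells it that
 * XMM7 is overwritten. */
static uint64_t roundtrip_through_xmm(uint64_t value)
{
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t out;
    __asm__(
        "movq %1, %%xmm7\n\t"   /* general-purpose register -> XMM7 */
        "movq %%xmm7, %0"       /* XMM7 -> general-purpose register */
        : "=r"(out)             /* output operand */
        : "r"(value)            /* input operand */
        : "xmm7");              /* clobber: XMM7 is modified */
    return out;
#else
    /* Portable fallback so the sketch still compiles on other targets. */
    return value;
#endif
}

/* Example use: the value must survive the round trip unchanged. */
static void roundtrip_demo(void)
{
    assert(roundtrip_through_xmm(0xDEADBEEFCAFEF00DULL) == 0xDEADBEEFCAFEF00DULL);
}
```

Once the asm statement ends, the compiler is free to reuse XMM7, which is why state cannot be carried reliably from one asm block to the next this way.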

This is true and is precisely why it doesn't work without modification at higher optimization levels. I was originally going to write this article entirely in assembly, but it seems like nobody wants to read an assembly dump! Even though it is malpractice to do so, the way the code is written in the article is still readable to people who don't really know assembly that well (I hope), and the concept comes across better. I will make that clearer in the article.

Thanks for writing this. My mind is buzzing now with ideas for how I could reserve and misuse a vector register :).

This is really neat, but when a thread/process enables MMX/SSE/AVX register state, it increases the cost of context switching (on Linux and other OSes), because they only save and restore registers they know could have been used (i.e., if the thread has not executed MMX/SSE/AVX ops yet, the kernel does not save/restore the XMM/YMM/ZMM registers).

There have been cases where autovectorization in compilers has produced good SIMD versions of code (which is still considered a hard thing to do in some cases) that were nonetheless slower in certain benchmarks with high thread counts (ergo, absurd amounts of CPU time wasted on context switching), yet beat the non-vectorized code in less loaded situations.

what's the point of using xmm registers when they'll get clobbered (or have to be moved back to memory) whenever you need to do a floating point operation? at least with ymm/zmm you can hope that you're not using AVX.

>The reason why we won't be considering other operating systems is because the System V ABI doesn't preserve any of the XMM registers between calls and puts the burden on the caller to save them on the stack. If you think about it, this sort of defeats the purpose of using a register buffer if we're always going to be pushing our bytes to memory in user space.

as opposed to windows? regardless of whether it's the caller/callee's job to preserve registers, the result is the same.

> what's the point of using xmm registers when they'll get clobbered (or have to be moved back to memory) whenever you need to do a floating point operation? at least with ymm/zmm you can hope that you're not using AVX.

Nope, because SSE operations are explicitly defined as setting the upper bits of the ymm/zmm registers to zero.

> as opposed to windows? regardless of whether it's the caller/callee's job to preserve registers, the result is the same.

There is a huge difference: with caller-saved registers, the caller must compulsively save all registers it's using to the stack before it calls a function. With callee-saved registers, if the callee doesn't use the registers, then it doesn't need to save them, and a bunch of extra push/pops are avoided.

It is optimal to have a mixture of caller- and callee-saved registers, so the compiler can pick which type it uses for each function/variable.

This is rather clever!

What immediately came to mind was that perhaps you want to keep a secret key from entering memory. If this was done in kernel mode, you would be able to disable interrupts/task switching, execute the "secret" stuff, and continue on your merry way...

Seriously, why all the sourpuss comments? The guy shows a fun little way to potentially use these registers in an innovative fashion and you jump all over him!

You do understand that toys like this are meant to stimulate the mind towards different ways of thinking, right? Great things are born from kernels of innovative thought.

HN has really gone downhill...

> Seriously, why all the sourpuss comments? ... a fun little way to potentially use these registers in an innovative fashion and you jump all over ...

Not sure which "sourpuss comments" you mean. Perhaps a direct reply to one of those comments would be more helpful than this general one, which lumps together all fellow commenters.

I only saw comments from people who appreciate the effort, have lots of experience in similar areas, and share their knowledge about the pitfalls they see. All very helpful and polite, as far as I can see.

Guy is called Emily... my guess is it’s a gal

Thanks, interesting!

benchmarks plz

Please don't post unsubstantive comments to Hacker News.

I think you’re missing the point of this post. This isn’t a general purpose technique they are advocating, but rather an interesting method that might prove useful to someone.

I think you’re missing the point. This is only useful if there is actually a measurable benefit.

I think it's useful in that it gives you a weird trick to think about :)

It's definitely good to be aware of what tricks you have. Sometimes you need to do weird things like treat memory as hostile; see this article about Chrome sandboxing: https://lwn.net/Articles/347547/

> So that's what we do: each untrusted thread has a trusted helper thread running in the same process. This certainly presents a fairly hostile environment for the trusted code to run in. For one, it can only trust its CPU registers - all memory must be assumed to be hostile. Since C code will spill to the stack when needed and may pass arguments on the stack, all the code for the trusted thread has to [be] carefully written in assembly.

> The trusted thread can receive requests to make system calls from the untrusted thread over a socket pair, validate the system call number and perform them on its behalf. We can stop the untrusted thread from breaking out by only using CPU registers and by refusing to let the untrusted code manipulate the VM in unsafe ways with mmap, mprotect etc.

(I don't know if that technique is still used)

This trick doesn’t avoid the storage of this data into RAM. A single context switch is enough.

Context switches put it into kernel memory, not process memory. That's safe.

It's not even a 'weird trick'; early-stage bootloader code and embedded systems often have to execute before RAM has been configured. This is a useful way to gain a little working space.

I'm not an assembly programmer, but is it possible there are benefits to treating SSE or AVX registers as a contiguous on-core buffer that are unrelated to performance?

I don't have enough experience with XMM registers (and SSE4 in general) to know if this is actually {useful,possible}, but a hypothetical use might be creating and using cryptographic keys such that the important numbers are never stored in main memory. If e.g. a decryption key is ever present in RAM, it's probably possible to steal it with a cold-boot attack that copies all of RAM. Once you have the RAM dump, the key can probably be found very quickly:

    for (secret_key_t *p = 0; p < RAM_SIZE; p++) { /* test whether *p looks like a valid key */ }
This does require significant physical access, but it works. I seem to remember reading ~1.5 years ago about a turnkey forensics kit (bottle of refrigerant included) for doing cold boot attacks? Regardless, more ways to protect keys could be really useful.


Unless the process is temporarily stopped by the OS, which then saves all registers to RAM so that the process can be resumed afterwards.

Registers may be stored to RAM (though there are often local register files in the CPU). If you're on a modern x86 platform, you may have encrypted memory support.

For Intel, look up SGX. For AMD, look up SEV. Each of these is way more secure than reliance on registers as secure scratch memory.

SSE registers will never get stored to RAM without explicit instructions (or an interrupt) causing it, on any Intel/AMD CPU as far as I know. It can be tricky to deal with such situations, but if you have a stretch of code where you can prevent that, the registers can be used to hide data from memory.

Yeah on context switch the SSE registers are stored in memory, so this doesn’t help for security.

That's why I said it's tricky to deal with interrupts, but possible if the effort is worth it for the use case. One could run the code in a kernel module that masks interrupts, or use restartable sequences and have the kernel clear the SSE registers when in certain code sections.

In the style of this article, most probably not.

As an engineer who has spent thousands of hours programming SIMD assembly -- using them as buffers which can then be operated upon in batch is literally the entire reason they exist.

So this reads to me as a programmer who isn't familiar with SIMD discovering one small part of why it exists and then writing an article about that as if it was a new idea outside the scope of normal SIMD usage.

It's nice that this may give exposure into some low level details to those unfamiliar, but it isn't an innovation.

Innovation? That's quite a stretch. Using them together as a contiguous buffer is not the reason they exist. Not quite sure how you took the article the way you did, but I have added some extra clarification that this is /not/ how they were designed to be used, as well as another example.

Of course they are. You load the registers with data from RAM. You permute the entire array and/or insert/extract individual elements. They are in fundamental nature a contiguous array. Most often you are literally loading them with a contiguous array from memory so you can work with it more efficiently before writing it back to memory.
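As a rough illustration of the load / operate / store pattern described above, here is a minimal sketch using SSE intrinsics. The function name `add_arrays` is made up for this example, and the scalar loop doubles as a fallback on targets where `__SSE__` is not defined:

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical example: dst[i] = a[i] + b[i], four floats at a time. */
static void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats from memory */
        __m128 vb = _mm_loadu_ps(b + i);            /* load 4 more */
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); /* add all lanes, store back */
    }
#endif
    /* Scalar tail (and full fallback when SSE is unavailable). */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Here the registers hold a contiguous slice of each array only for as long as it takes to operate on it, which is the conventional usage this comment is describing.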

Downvote me as much as you want for saying so but it is the literal reason why the registers and instructions exist.

Aren't they designed specifically for SIMD instructions though? This seems to be more just using them as another cache layer and not using them for their intended-SIMD use.
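To show what the "storage rather than arithmetic" usage might look like, here is a small sketch built on SSE2 intrinsics. It is not the article's implementation: `regbuf_t`, `regbuf_zero`, `regbuf_put` and `regbuf_get` are hypothetical names, and with intrinsics the compiler remains free to spill the value to the stack, so this only illustrates the interface and does not guarantee the data stays out of RAM. The lane index of `_mm_insert_epi16` / `_mm_extract_epi16` must be a compile-time constant, hence the switch statements:

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i regbuf_t;                        /* one XMM register: 8 x 16-bit slots */
#else
typedef struct { uint16_t lane[8]; } regbuf_t;   /* portable stand-in */
#endif

static regbuf_t regbuf_zero(void)
{
#if defined(__SSE2__)
    return _mm_setzero_si128();
#else
    regbuf_t b = {{0}};
    return b;
#endif
}

/* Store a 16-bit value in slot `lane` (0..7) and return the updated buffer. */
static regbuf_t regbuf_put(regbuf_t buf, int lane, uint16_t value)
{
    assert(lane >= 0 && lane < 8);
#if defined(__SSE2__)
    switch (lane) {
    case 0: return _mm_insert_epi16(buf, value, 0);
    case 1: return _mm_insert_epi16(buf, value, 1);
    case 2: return _mm_insert_epi16(buf, value, 2);
    case 3: return _mm_insert_epi16(buf, value, 3);
    case 4: return _mm_insert_epi16(buf, value, 4);
    case 5: return _mm_insert_epi16(buf, value, 5);
    case 6: return _mm_insert_epi16(buf, value, 6);
    case 7: return _mm_insert_epi16(buf, value, 7);
    default: return buf;   /* unreachable after the assert */
    }
#else
    buf.lane[lane] = value;
    return buf;
#endif
}

/* Read back the 16-bit value held in slot `lane` (0..7). */
static uint16_t regbuf_get(regbuf_t buf, int lane)
{
    assert(lane >= 0 && lane < 8);
#if defined(__SSE2__)
    switch (lane) {
    case 0: return (uint16_t)_mm_extract_epi16(buf, 0);
    case 1: return (uint16_t)_mm_extract_epi16(buf, 1);
    case 2: return (uint16_t)_mm_extract_epi16(buf, 2);
    case 3: return (uint16_t)_mm_extract_epi16(buf, 3);
    case 4: return (uint16_t)_mm_extract_epi16(buf, 4);
    case 5: return (uint16_t)_mm_extract_epi16(buf, 5);
    case 6: return (uint16_t)_mm_extract_epi16(buf, 6);
    case 7: return (uint16_t)_mm_extract_epi16(buf, 7);
    default: return 0;     /* unreachable after the assert */
    }
#else
    return buf.lane[lane];
#endif
}
```

Nothing here applies one operation across all lanes; each slot is addressed individually, which is the difference from ordinary SIMD use that this comment is pointing at.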

I'm baffled by the downvotes that I'm receiving / why this is a controversial opinion. When you program SIMD, the registers are literally a temporary cache so that you can work with the data without touching RAM. The instructions are designed to load/store/manipulate these cached buffers.

I didn't mean any offense to OP, if that's the cause for bad response. I've been using these instructions for well over a decade and the description given is without exaggeration the precise reason they exist and is what I've always used them for.

Ok, I'll try to explain it since you replied to me, even though I didn't downvote you.

The purpose of XMM buffers is for SIMD instructions, which implies that the buffers store multiple atoms and that the same instruction is applied to each atom in the buffer at once.

The interesting bit of this little hack is how it breaks that expected data-level parallelism and instead provides a (relatively) high-level interface for the new usage. It has nothing to do with shuffling data to and from those registers.

You're getting downvoted because your initial comment ignored the bit that used the registers in a way that wasn't intended and focused on the bits that were the same as the intended usage.

And then you're being condescending about the points you're trying to make. You might not have intended to be condescending, but you are being condescending nonetheless.

You know what, at this point I have no incentive to engage further, since my attempt to elucidate was also downvoted. Each post that I submit is a further cost. Such a silly game.

Let those who haven't worked with SIMD reject an authentic attempt from a vet to elucidate err and be happy with that click. Whatever.

If you can drain my remaining few hundred karma I'll have an excuse to stop clicking another Bitcoin rehash article every day and will thank you.

I suggest optimizing for understanding over hacker news karma points, but to each their own.

You're not being downvoted for a technical issue.

I'm clearly not optimizing for karma since I have literally invited downvotes at this point. I'm accepting the karmic expense of making meta commentary.

You would be more convincing (and informative) if you were to include some examples and explanation of where this technique has been applied. An appeal to authority shouldn't be necessary for the point you're making.
