
To my understanding, the memory subsystem is fetching a byte in parallel with access permission checks. If the byte is discarded due to mis-speculation, then the result of the permission check is ignored, but the cache is still in an updated state.

I believe one solution would be to put permission checks before the memory access, which would add serialized latency to all memory accesses. Another would be to have the speculative execution system flush cache lines that were loaded but ultimately ignored, which would be complex but probably not as much of a speed hit.

(edit: yeah, a simple "flush" is insufficient, it would have to be closer to an isolated transaction with rollback of the access's effects on the cache system.)
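To make the ordering concrete, here's a toy Python sketch (nothing like real hardware, just the sequencing): the fetch updates the cache before the permission check resolves, so even a mis-speculated, discarded load leaves a footprint.

```python
# Toy model: a speculative load warms the cache even when the
# permission check later fails and the architectural result is discarded.
class ToyCache:
    def __init__(self):
        self.lines = set()          # addresses currently cached

    def load(self, addr):
        self.lines.add(addr)        # the fetch happens first...

def speculative_load(cache, addr, allowed):
    cache.load(addr)                # ...in parallel with the check
    if not allowed:
        return None                 # result discarded on mis-speculation
    return "data"

cache = ToyCache()
speculative_load(cache, 0xdead, allowed=False)
print(0xdead in cache.lines)        # True: the cache side effect survives
```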




Flushing cache lines doesn't work, at least not straightforwardly. The attacker can arrange things so that the cache line is pre-populated with something else that it has access to, with a colliding address that will be evicted by the speculative load. Flushing undoes the load, but can't easily undo the eviction.
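A toy direct-mapped cache shows the problem (purely illustrative; real caches are set-associative, but the eviction issue is the same): flushing the speculatively loaded line can't restore the line it evicted, so the attacker still learns which set was touched.

```python
# Toy direct-mapped cache: two addresses that collide on the same set.
SETS = 64
LINE = 64

class DirectMapped:
    def __init__(self):
        self.sets = {}                      # set index -> cached address

    def index(self, addr):
        return (addr // LINE) % SETS

    def load(self, addr):
        self.sets[self.index(addr)] = addr  # evicts any previous occupant

    def flush(self, addr):
        if self.sets.get(self.index(addr)) == addr:
            del self.sets[self.index(addr)]

c = DirectMapped()
victim = 0x0000                 # attacker-controlled line, primed first
secret = 0x0000 + SETS * LINE   # colliding address, loaded speculatively
c.load(victim)
c.load(secret)                  # evicts victim
c.flush(secret)                 # "undo" the speculative load...
print(victim in c.sets.values())  # False: the eviction is not undone
```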


> I believe one solution would be to put permission checks before the memory access, which would add serialized latency to all memory access.

I don't see why that would have to add latency to all (or any) memory access. The addresses generated by programs (except in real mode, when everything has access to everything anyway so we don't care about these issues then) are virtual addresses, so they have to be translated to get the actual memory address.

The permission information for a page is stored in the same place as the physical address translation information for that page. The processor fetches it at the same time it fetches the physical base address of the page.

They should also have the current permission level of the program readily available. That should be enough to let them do something about Meltdown without any performance impact. They could do something as simple as: if the page is a supervisor page and the CPU is not in supervisor mode, don't actually read the memory; just substitute fixed data.

Note that AMD is reportedly not affected by Meltdown. From what I've read that is because they in fact do the protection check before trying to access the memory, even during speculation, and they don't suffer any performance loss from that.

Note that Meltdown is only an issue when the kernel memory read is on a path that does NOT become the real path (if it did become the real path, the program would get a fault for the illegal memory access anyway). So replacing the memory access with fixed data cannot harm any legitimate program.
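A toy page-table-walk sketch of that idea (hypothetical layout, just to show why the check is essentially free): the permission bit comes back with the frame number, so the load path can substitute fixed data without a separate serialized check.

```python
# Toy page table: permission bits live alongside the physical frame,
# so checking them costs nothing extra during translation. If the page
# is supervisor-only and the CPU is in user mode, return fixed data
# instead of touching memory at all.
PAGE_TABLE = {
    0x1000: {"frame": 0xA000, "supervisor": False},  # user page
    0x2000: {"frame": 0xB000, "supervisor": True},   # kernel page
}
MEMORY = {0xA000: 7, 0xB000: 42}

def load(vaddr, in_supervisor_mode):
    pte = PAGE_TABLE[vaddr]
    if pte["supervisor"] and not in_supervisor_mode:
        return 0                     # fixed data; MEMORY is never read
    return MEMORY[pte["frame"]]

print(load(0x2000, in_supervisor_mode=False))  # 0, not 42
print(load(0x2000, in_supervisor_mode=True))   # 42
```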

Spectre is going to be the hard one for the CPU people to fix, I think. They may have to offer hardware memory protection features that can be used in user-mode code to protect parts of that code from other parts, so that things that want to run untrusted code in a sandbox in their own process can do so in a separate address space, protected similar to the way kernel space is protected from user space.

It may be more complicated than that, though, because Spectre also does some freaky things that take advantage of branch prediction information not being isolated between processes. I haven't read enough to understand the implications of that. I don't know if that can be defeated just by better memory protection enforcement.


> I don't see why that would have to add latency to all (or any) memory access. The addresses generated by programs (except in real mode, when everything has access to everything anyway so we don't care about these issues then) are virtual addresses, so they have to be translated to get the actual memory address.

L1 caches are generally virtually indexed for exactly this reason: to allow a L1 cache read to happen in parallel with the TLB lookup. (They're also usually, I believe, physically tagged, so we have to check for collisions at some point, but making sure there's no side channel information at that point is, obviously given recent events, hard.)
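The arithmetic behind why that works: if (cache size / associativity) is at most the page size, the index bits fall entirely within the page offset, which is identical in the virtual and physical address. Rough sketch with typical numbers (32 KiB, 8-way, 64-byte lines, 4 KiB pages):

```python
# If sets * line_size <= page_size, the cache index is taken from
# page-offset bits, which the TLB doesn't translate, so the L1 lookup
# can start before translation finishes (virtually indexed,
# physically tagged).
PAGE = 4096
cache_size, ways, line = 32 * 1024, 8, 64

sets = cache_size // (ways * line)                    # 64 sets
index_plus_offset_bits = (sets * line).bit_length() - 1   # 12 bits
page_offset_bits = PAGE.bit_length() - 1                  # 12 bits

print(index_plus_offset_bits <= page_offset_bits)    # True
```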


Indeed - Meltdown has an "easy" fix, and now that it's understood, it should be possible to design chips which are not vulnerable.

Spectre is, as you say, harder - but more because the line of what sort of state should be separate isn't as clear-cut as we might like it to be (i.e. it's not necessarily just "processes" as the OS sees it - e.g. JVM/JavaScript interpreter state should allow for an effective sandbox between the executing interpreter/JVM process and what the running JVM/JavaScript code can see). And worse, those are precisely the cases where one probably cares most about separation, given that's where untrusted code is typically run.

But hardware assistance could help - in simple terms, I'd imagine that allowing a swap out of more of the internal processor state (to the extent that one process "training" the branch-predictor doesn't impact how the branch predictor acts in another process) would be pretty effective. That might be expensive in terms of performance per-transistor/per-watt however (though probably not absolute performance).


If we're looking at hardware design changes, it really feels like what we actually need is to add a place to hold a nonce that the OS/hypervisor can set per-process/per-vm, and incorporate those bits in the CPU cache tags so cache lines never match across security boundaries, which would close the side channel used to exfiltrate information.
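As a toy sketch of that tagging idea (the nonce field and OS interface are hypothetical, of course): if the domain nonce is part of the tag match, a line cached by one security domain simply never hits for another.

```python
# Toy cache where the tag includes a per-domain nonce set by the
# OS/hypervisor, so lookups never match across security boundaries.
class DomainTaggedCache:
    def __init__(self):
        self.lines = set()          # (nonce, address) pairs

    def load(self, nonce, addr):
        self.lines.add((nonce, addr))

    def hit(self, nonce, addr):
        return (nonce, addr) in self.lines

c = DomainTaggedCache()
c.load(nonce=1, addr=0x40)          # victim domain fills the line
print(c.hit(nonce=2, addr=0x40))    # False: attacker's probe misses
print(c.hit(nonce=1, addr=0x40))    # True: victim still hits
```

This closes the "did that line get cached?" question for a probe from another domain, though it costs tag bits and effectively partitions cache capacity between domains.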


Would "flushing on ignore" not leave the cache side channel open for many instructions before the abort?


The first approach sounds kind of expensive to do at the CPU level. I like your second one better. Thank you!


AMD already takes the first approach to prevent Meltdown.


Actually, my preferred solution would be to eliminate the notion of distributing machine code binaries entirely, but that's a bit beyond the scope of these discussions. ;-)


so run everything in a VM?


No, creating a block of machine code bytes to execute would be a privileged operation. All code would run through a privileged CPU-specific compiler first, and there'd be no way to run raw machine code bytes otherwise.

If there are bugs that can be exposed through various machine code patterns, the compiler can centralize the restrictions of what may be executed, enforce runtime checks, or prevent certain instructions from being used at all. Security or optimization updates would affect all running programs automatically. Granted, these current speculative vulnerabilities would be much more difficult to statically detect.

But it would follow the crazy Gentoo dream of having everything better optimized for your environment, allow much better compatibility across systems, and prevent entire classes of privilege escalation issues.


> no way to run raw machine code bytes otherwise [...] restrictions of what may be executed, enforce runtime checks, or prevent certain instructions from being used at all [...] everything optimized for your environment better, allow much better compatibility across systems and prevent entire classes of privilege escalation issues.

So... basically re-inventing Java? :)

"Raw machine code bytes" aren't distributed but occur through the privileged JVM and its just-in-time compiler, the byte-code verifier enforces restrictions on what data-access patterns and where instructions can be used, the JVM for a particular OS has optimizations for that environment, and sandboxing (while imperfect) blocks some classes of privilege escalation issues.

Don't get me wrong, I'm not saying Java is perfect or that the underlying goal isn't good, I'm just happily amused by this sense of "everything old is new again."


No, reinventing the AS/400.

* https://news.ycombinator.com/item?id=16053518


Well, to me Java is still new tech. ;-) But yes, it's certainly a reasonable sampling into non-machine code distribution, and enforcement of security rules when actually running/JITting the code, as were some mainframe developments before then.

Of course, Java certainly does have some higher level weaknesses as in the introspection API kerfuffle a while back, and is too locked into its Object Obsessed design for it to be a truly general purpose object code format.


Arguably x64 assembly code is the same...

A privileged process (the microcode) enforces restrictions and converts it to micro-ops which execute on the real processor.


I've been thinking along the same lines for the last few years. If you did this, you could have a multi-user operating system in a single address space and avoid the cost of interrupts for system calls (which would just be like any other function call).


Sounds similar to some ideas explored in the experimental OS "Singularity" - https://en.wikipedia.org/wiki/Singularity_(operating_system)


We'd need a better binary representation of uncompiled code, then. Moving around lots of code as ASCII is kind of suboptimal... I wouldn't want that. By all means, show it as text to the user, but don't store it that way.


And what if I wrote a compiler that doesn't heed any of your security concerns? It would still compile to machine code and continue to be able to exploit things Spectre/Meltdown style? Or am I off here?


You'd only be able to run it on your system. At least, without other means of breaching the low level secured configuration of someone else's machine, because that's where the One True Compiler for that system lives.


If I were taking this approach I might not even tell you the instruction set of the machine, so your compiler wouldn’t be useful.


I think the idea is you just never accept foreign machine code.


Cool... I think I get it. It's like compiler/instruction based DRM: CPU-specific permission to run code. Maybe they can leverage existing TPM chips to do this...

I just don't want to see performance being decimated as a trade off for security, if at all possible.


I'm not so sure. Memory accesses are so slow (hundreds of cycles) that it probably wouldn't be much slower to issue them a few instructions later. When speculation was introduced, memory access times and cycle times were much closer together (only a few cycles apart), so it saved a huge amount of time.


Main memory accesses take on the order of a hundred cycles. An L1 data cache hit usually takes 3-4 cycles. Microarchitecture designers will make heroic efforts to shave even a single cycle here. Adding an overhead of even a couple of cycles would be a huge deal.

Having said that, AMD CPUs are the existence proof that you can be immune to meltdown with no significant overhead.

Spectre is a completely different issue though.


AMD CPUs have pretty poor single threaded performance.

Perhaps that's because they haven't taken the speed short-cuts that Intel took...?


Didn't Ryzen close the single thread performance gap quite a bit?

But yeah, protecting against it means implementing memory protection in more places in the CPU. More gates and the possibility of becoming a bottleneck.


With Ryzen, they're pretty much equal on an IPC basis.



