I guess another question I have is: can we win the performance back through fixes on the CPU, or will speculative execution always be insecure and thus need patching in software?
Try as I might, I am not a CPU person.
Speculation is not fundamentally incompatible with security. It's just that literally everyone in the industry assumed that leaking information out of a speculative context was impossible -- and so there is no hardening anywhere. Now that it has been proven possible, people are rushing to find all the ways this can be exploited. New CPUs currently being designed will fix all of these, and then eventually we will have speculation without security issues.
Except for Spectre variant 1. That will always stay with us, because there is no sensible fix for it. The only real solution to that is to accept that branches cannot be used as a security boundary. This is mostly relevant to people implementing secure sandboxes and language runtimes. Going forward, the only reasonable assumption is that if you let a third party run their code in a process, no matter how you verify accesses or otherwise try to contain that code, you should assume it has a read access to the entire process. Any real security requires you to make use of the proper OS-provided isolation.
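To make the "branches cannot be used as a security boundary" point concrete, here is the classic Spectre variant 1 (bounds-check-bypass) pattern, sketched in C. The array names and sizes are invented for illustration, and this shows only the vulnerable pattern, not a working exploit:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative Spectre v1 gadget. Architecturally this function never
 * reads out of bounds, but a CPU that mispredicts the branch may
 * speculatively read array1[x] for an attacker-chosen out-of-range x,
 * and leave a trace of the loaded value in the cache via probe_array. */

uint8_t array1[16];
size_t array1_size = 16;
uint8_t probe_array[256 * 512];

uint8_t victim(size_t x) {
    if (x < array1_size) {      /* the bounds check the branch enforces */
        /* Speculative window: even when x >= array1_size, this load
         * may execute before the branch resolves, and the dependent
         * probe_array access encodes array1[x] into cache state. */
        return probe_array[array1[x] * 512];
    }
    return 0;
}
```

Architecturally the function is correct -- the attack lives entirely in microarchitectural state, which is why no amount of software-visible access checking inside the process helps.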
If this is true, it is unbelievably bad for the future of security and computing in general. People throwing around this assertion are, in my opinion, not appreciating how bad it is. We need to try a lot harder before we give up.
1. Finer-grained isolation makes security better, because it allows us to apply the Principle of Least Authority to each component of the system, and protect components of a system from bugs in other components. To make meaningful gains in security going forward, we need to encourage more fine-grained isolation. If process-level isolation is the finest grain we'll ever have, we can't make these advances.
2. The scalability of edge computing requires finer-grained isolation than process isolation. The trend is towards pushing more and more code to run directly on devices or edge networks rather than centralized servers. That means that the places where code runs need to be able to handle orders of magnitude more tenants than before -- because everyone wants their code to run in every location. If we can't achieve high multi-tenancy securely -- by which I mean 10,000 or more independently isolated pieces of code running on the same machine -- then the only solution will be to limit these resources to big, trusted players. Small companies will be shut out from "the edge". That's bad.
Luckily, the "process isolation is the only isolation" claim is wrong. It's true that we need to evolve our approach to make sub-process isolation secure, but it's not impossible at all. In fact, it's possible to design a sandbox where you can trivially prove that code running inside it cannot observe side channels.
Here's how that works: Think about Haskell, or another purely-functional language. An attacker provides you with a pure function to execute. Because it's a pure function, for some given set of inputs, it will always produce exactly the same output, no matter what machine you run it on, or what else is going on in the background. Therefore, the output cannot possibly incorporate observations from side channels, no matter what the code does internally.
So: It is possible to run attacker-provided code without exposing secrets from the process's memory space.
We do need to remove access to timers, or find a way to make them not useful. We also need to prevent attackers from being able to time their code's execution remotely. Basically, we need to think carefully about the I/O surface while paying attention to timing. But sandboxing has always been about thinking carefully about the I/O surface, and we have a lot of control there. We just have a new aspect that we need to account for.
I think it's doable.
This is patently false. The attacker can do timing on his side and measure how long it takes your service to return a response. If you let an attacker have any sort of output channel, you give him the power to probe side channels through it, and there is nothing you can do to close it.
Yes, I mentioned that in my comment:
> We also need to prevent attackers from being able to time their code's execution remotely.
I don't think it's impossible to mitigate.
First, not all use cases actually involve the attacker being able to directly invoke and then measure the timing of their code. Imagine, for example, that I'm using attacker-provided code to apply some image manipulation to a photo on my phone, then I post the photo online. The attacker doesn't have any way to know how long their code took to execute, and I can be confident that they haven't exfiltrated side channels through the image via steganography because their code was deterministic so couldn't have incorporated those side channels into its output.
Second, I don't think the space of timer-noise techniques has been adequately explored. Timer noise is one of the main defenses browsers have deployed against Spectre, and pragmatically speaking it has been reasonably effective. Yes, in theory there are lots of statistical techniques an attacker might use to get around it, but it certainly increases the cost of an attack. We need to figure out how to push the cost of an attack out of reach, even if we can't make it completely impossible.
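The browser-style mitigation is roughly "coarsen the clock, then add bounded jitter." A minimal C sketch (the granularity and jitter range are invented for illustration; real browsers tune these values and layer on further defenses):

```c
#include <stdint.h>
#include <stdlib.h>

/* Coarsen a raw nanosecond timestamp to a fixed granularity, then add
 * bounded random jitter, so that the fine-grained timing differences a
 * cache side channel depends on disappear into the noise floor.
 * 100us granularity here is purely illustrative. */
#define GRANULARITY_NS 100000

uint64_t fuzzed_time(uint64_t raw_ns) {
    uint64_t coarse = raw_ns - (raw_ns % GRANULARITY_NS);
    uint64_t jitter = (uint64_t)rand() % GRANULARITY_NS;
    return coarse + jitter;
}
```

An attacker can average over many samples to fight the jitter, which is why this raises the cost of an attack rather than eliminating it -- exactly the trade-off described above.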
Why? Browsers are moving to one process per security domain anyway. Workers spawn their own threads too. You have to make your processes more lightweight and optimize memory-sharing, but that's exactly what's happening.
There's also some voodoo one might be able to do on Linux with the clone syscall, which lets you spawn a new process that still shares the same memory as the parent.
At a significant cost in RAM usage, and some CPU overhead too. But browsers have the advantage that they are sitting on desktops and laptops that are massively underutilized, and the number of security domains you might typically have open at once is in the tens or hundreds, not tens of thousands.
But that's not what I was talking about with "edge computing". I'm the architect of Cloudflare Workers, an edge compute platform. We have 151 locations today and are pushing towards thousands in the coming years, and every one of our customers wants their code to run in every location. As we push to more and more locations, the available resources in each location decrease, but the number of customers will only increase.
At our scale, unlike browsers, one-process-per-customer just isn't going to scale. Context switching is too expensive, RAM usage per process is too high, etc. So we need other ways to mitigate attacks.
> There's also some voodoo one might be able to do on Linux with the clone syscall, which lets you spawn a new process that still shares the same memory as the parent.
It's not really voodoo. Linux has no distinction between processes and threads. Everything is a process. But two processes can share the same memory space. Usually, developers call these "threads", and use "process" to mean "the set of threads sharing a memory space".
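A minimal sketch of that on Linux, using clone(2) with CLONE_VM so the child gets its own pid but shares the parent's memory space (the stack size and shared variable are illustrative):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Shared state: with CLONE_VM, the child writes directly into the
 * parent's memory, even though the kernel sees it as a separate
 * process with its own pid. */
static int shared_counter = 0;

static int child_fn(void *arg) {
    shared_counter = 42;        /* visible to the parent immediately */
    return 0;
}

int run_shared_child(void) {
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (!stack) return -1;
    /* The child's stack grows down, so pass the top of the buffer. */
    int pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
    if (pid < 0) { free(stack); return -1; }
    waitpid(pid, NULL, 0);
    free(stack);
    return shared_counter;      /* 42: the child's write is visible */
}
```

Pass CLONE_VM and you have what userland calls a thread; omit it and you have what userland calls a process -- which is exactly the "everything is a process" point above.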
In any case, if it's the same memory space, then it's susceptible to Spectre attacks.
kentonv is correct that giving up on intra-process security entirely is an ominous sign.
And we might already have it. Thanks to virtualization hardware, it should be possible for a process to own and handle its own private page table mapping completely from userspace.
Switching from the sandboxed code to the sandboxing code might be more expensive than a plain function call, but still way faster than IPC.
For example, libDune (https://github.com/ramonza/dune), which admittedly seems to be a dead project. It also requires OS support.
Process, container, and VM isolation all have lots of bugs too.
But process-based sandboxing does not replace language-based sandboxing. Operating systems (e.g. Linux) have bugs all the time that allow processes to escalate privileges. It's also not at all clear that process-based sandboxing is enough to defend against Spectre, either. But it's much harder to design an attack that can break both layers than to design one that breaks one or the other.
Certainly, if process-based sandboxing does not add too much overhead for your use case, then you absolutely should use it in addition to language-based sandboxing. But there are plenty of use cases where processes are too much overhead and we really need language-based sandboxing to make progress.
It's true, though. It's been known since the early 1990s, when they wrote about all the timing channels and such in VAX CPUs. The good news is there are both architectures and tooling that can do anything from eliminating these issues to merely reducing them. You just have to design the processors for them. Otherwise, you're constantly playing a cat-and-mouse game, trying to dodge the issues of running code that needs separation on a machine designed for pervasive sharing.
One thing I came up with was just going back to physical separation, with a lot of tiny, SBC-like computers with multicore chips. Kind of like what they use for energy-efficient clusters like BlueGene. One can do some separation at a physical level, with better, software-level separation from there. The stuff that truly can't mix gets the physical separation; the rest uses software mechanisms such as separation kernels with time/space partitioning. At least one of the commercial vendors reported being immune to the CPU weaknesses due to how their separation and scheduling work. Whether true or not, the stronger methods of separation kernels make more sense now, given they'll plug some of the leaks. The other method I came up with had already been invented and patented. (sighs)
"Because it's a pure function, for some given set of inputs, it will always produce exactly the same output, no matter what machine you run it on"
That's not true, btw. The code gets converted into whatever the internal representation is and run through circuits fabricated on that process node. These create analog properties that might be manipulated by attackers to bypass security. I warned people about that when I was new on HN, like I did on Schneier's blog. I learned it from a hardware guru who specialized in detecting or building that stuff, mainly for counterfeiting rather than backdoors. We're seeing numerous attacks now that use software to create hardware effects at the analog level. There are ways to mitigate stuff like that, but I have no confidence they'll work with complex, highly optimized hardware with a billion transistors' worth of attack surface to consider.
So, as Brian Snow advocated in "We Need Assurance," you have to reverse the thinking to start with a machine designed to enforce separation from ground up in its operations. Then, OS/software architecture on top of that. Good news is CompSci has lots of stuff like that. Someone with money just has to put it together. More attacks will be found but most will be blocked. We can iterate the good stuff over time as problems are found addressing as many root causes as possible.
Is it not a feasible fix for Spectre-v1 to have loaded cachelines wait in a staging area, and not actually update the cache hierarchy until the load instruction is retired?
Keep in mind that it's not just the L1 cache you potentially need to worry about, but also all the higher level caches, which are really like remote nodes in a distributed system. Even the last level cache would have to be aware of the current speculation state of all threads in the system, which seems like it would be really expensive.
What's perhaps even worse, you wouldn't be able to hit on speculatively loaded cache lines until they have "committed" (because the resulting change in external bandwidth utilization may be observable from another thread), which would probably kill a lot of the benefits of the cache in the first place.
The scarier part is managing bandwidth contention to shared structures.
Other developments have caught up too, like compiler technology. Not directly related to branch prediction, but: we are much better at polyhedral optimization than the last time people tried VLIW, for example.
Besides, the last major VLIW push (Itanium) DID have SpecEx AND branch prediction, and did NOT have delay slots.
What will actually happen, of course, is that everyone will put an ASID/PCID tag word into all the relevant caches, they'll stop being probe-able from other contexts, and we'll all keep our deep pipelines and speculation, and the cache crisis of 2018 will be just a story we tell our grandkids.
Much cheaper than a paradigm shift based on long-since tried and rejected technology.
You jumped straight from "absence of proof must be proof of the opposite" into as bald a no-true-Scotsman as I've seen recently. It's like fallacy central here.
But anyway I think they're making a fair argument. If something takes an enormous technical effort to switch to, and is similar in performance, nobody is going to put in that effort.
This is emphatically not evidence that the idea "doesn't work".
I believe if you could retroactively wipe out the last decade of improvement on x86 and Arm, and redirect all of that development effort to VLIW, the resulting chips and software would perform just fine by current standards.
Spectre is more generally due to load speculation modifying the cache while inside a branch-speculation context. Basically, it affected every high-performance CPU in some way. Note that saying there's a protected/safe branch instruction is only effective if the default branch code gen uses it, so I'm not including such mitigations when saying "not affected". I believe that to be a justifiable decision.
Essentially the only CPUs that weren't affected were those with very little, if any, out-of-order execution. None of them are considered to have competitive performance.
But Intel got the government ("legal") backdoors in. Priorities.
(Intel wasn't quite alone in this, though -- at least one ARM core also had this issue.)
The instruction-level parallelism that a CPU can extract is primarily a dynamic kind of parallelism. You can, say, have a branch that's true 1000 times, then false 1000 times, then true 1000 times, then false 1000 times, etc.--as a compiler, telling the hardware to predict true or false is going to guarantee a 50% hit rate, but the stupid simple dynamic branch predictor will get 99% on that branch.
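That 99%-versus-50% claim is easy to check with a toy simulation of a standard 2-bit saturating-counter predictor on exactly that alternating pattern (the counter encoding is the textbook one; the function name is made up):

```c
/* Simulate a 2-bit saturating-counter branch predictor on the pattern
 * described above: phase_len taken branches, then phase_len not-taken,
 * repeating. A static compiler hint gets 50% on this; the dynamic
 * predictor mispredicts only twice per phase change. */
int dynamic_hits(int phases, int phase_len) {
    int counter = 2;    /* 2-bit state: 0-1 predict not-taken, 2-3 taken */
    int hits = 0;
    for (int p = 0; p < phases; p++) {
        int actual = (p % 2 == 0);  /* alternate taken / not-taken phases */
        for (int i = 0; i < phase_len; i++) {
            int predict = (counter >= 2);
            if (predict == actual) hits++;
            /* Saturating update toward the actual outcome. */
            if (actual)  { if (counter < 3) counter++; }
            else         { if (counter > 0) counter--; }
        }
    }
    return hits;
}
```

On four phases of 1000 branches each, dynamic_hits(4, 1000) returns 3994 hits -- a 99.85% rate -- while any fixed prediction scores exactly 50% on the same stream.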
I worked on compilers too, and I think there is definitely still work to be explored here. Really, complicated OoO speculative processors are simply recovering a lot of the information the compiler already calculated! You take this code that is in no way suited for a CPU, and you do a ton of dataflow and high-level analysis on it (facts you can only know from the source). You build this graph and optimize based on these facts. And then, when you lower things, you throw the dataflow graph away. All this after painfully calculating it on the assumption of "yes, the CPU will like this code" -- only for the CPU to perform that whole analysis over again, conclude "yes, I do, in fact, like this code", and execute it efficiently without stalls anyway. There's clearly a mismatch here.
I mean, don't get me wrong -- this all seems like a perfectly fine and acceptable engineering tradeoff, but just disappointing from a computer science perspective, to me, at least, that this is not unified :) Of course, VLIW vs CISC etc is one of the classic debates...
Interestingly, Aaron Smith from Microsoft Research (and one of the original members of the TRIPS project) actually demonstrated and talked about their work on newer EDGE processor designs at MSR, and even demonstrated Windows running on an EDGE processor, this past week or so at ISCA2018 (complete with a working Visual Studio target, using an LLVM-based toolchain!) Their design is quite different from TRIPS it seems. I'm hoping the talks/work becomes public eventually, but that might just be a dream.
There's a lot less mischief you can get up to when you only execute a couple of instructions beyond a mispredicted branch rather than a hundred, but mischief can't be theoretically ruled out entirely.
VLIW machines allow you to provide hints about instruction-level parallelism without all the superscalar on-the-fly analysis of instruction-level data dependencies (the analysis that lets the hardware do all that rescheduling on the fly).
I think that if interlock-free, software-scheduled CPUs allowed us to reach 20GHz clock speeds where complex superscalar machines were stuck at 3GHz, we'd all be jumping ship -- but they're not.
The solution might be to prevent security-critical code from being optimized in this way -- say, forcing fully cache-synchronized, in-order execution for those parts of the code, with no resource sharing between cores.
Additionally, realize that clock speeds are not a relevant performance metric. Maybe, perhaps, it's possible for a software-scheduled CPU to run at 2*n GHz, but that's not interesting if it's slower in real-world workloads than today's tech at n GHz. And for broad deployment in data centers we're mostly looking at throughput per watt. CPUs with high clock speeds don't do well in that area, because of semiconductor physics.
Everyone overlooked it, or everyone had a good reason to think it was impossible? The information is there; it's a target that needs to be secured.
> but we will effectively need to get rid of the idea that you can run untrusted code in the same process as data you want to keep secure.
If we accept that, then we also need to get rid of the idea that we can branch differently in trusted code based on details of "untrusted data".
It's just that there are a lot of branches that don't depend on untrusted data. Speculatively executing past them is perfectly fine and extremely valuable for performance. That's why nobody wants to get rid of speculative execution.
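For the branches that do depend on untrusted data, one widely deployed mitigation is to stop relying on the branch at all and clamp the index arithmetically -- the idea behind the Linux kernel's array_index_nospec() and V8's index masking. A C sketch of the simplest form (table size and names are illustrative; this plain-mask version only works for power-of-two sizes):

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16           /* must be a power of two for this mask */

uint8_t table[TABLE_SIZE];

/* Even if the bounds-check branch is mispredicted and the load runs
 * speculatively, the mask forces the index into range, so speculation
 * can never touch memory outside the table. */
uint8_t safe_load(size_t x) {
    if (x >= TABLE_SIZE)
        return 0;
    return table[x & (TABLE_SIZE - 1)];
}
```

The cost is a cheap ALU operation on the hot path, which is why this is applied selectively to untrusted-index accesses rather than to every branch.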
In the larger sense, mixing various privilege levels on shared hardware is likely a bad idea, despite being fundamental to general-purpose computing. This is because it's fairly difficult to cloak intrinsic "physical" attributes of execution -- like execution timing and cache timing -- from other processes, and essentially impossible (and/or unreasonable) to cloak them from one's own process. It is both possible for a process to generate lots of side effects (e.g. I/O, cache fills), and for it or another process to figure out how the shared state changes by observing its own execution.