(He switched it to eager FP because it's faster on modern hardware.)
But a lot of people running old LTS kernels may be affected.
EDIT: Looks like Luto's change landed in kernel version 4.6. https://kernelnewbies.org/Linux_4.6#List_of_merges https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
Not that it's much solace when your proprietary blob breaks, but Red Hat does make considerable effort to give vendors a stable target to build against - unfortunately not everyone fully validates conformance which results in the same old problems cropping up from time to time.
Obviously, using the B120i was out of the question due to its crappy driver only targeting Red Hat's ancient kernel, so my only viable option was to go out and purchase a $200 real HP P420 Smart Array RAID card to connect my drives to.
This is very wrong.
It would be trivial for Red Hat to update the kernel shipped in RHEL. It's not that they "can't keep up".
What they are doing is much more difficult and takes much more effort: they are backporting all security fixes, and sometimes features from newer kernels, to older kernels. Why? Because many RHEL customers want older kernels.

Unlike Windows and Solaris, Linux doesn't have a stable kernel ABI (by design). Some RHEL customers want to use binary drivers, and they want those drivers to keep working when they update their system. Other RHEL customers don't use or need binary drivers, but they do want to avoid compatibility problems introduced by newer kernels.

And no, it doesn't matter that the kernel API is very stable. Sometimes even fixing bugs can introduce incompatibility problems in badly written software. Other times, changing some trivial, subtle thing about the scheduler can make a badly written program perform much differently than it used to. The world is full of such programs, and the people who administer such systems are very grateful that the behavior of their systems doesn't randomly change across the lifetime of a release.
The fact that RHEL freezes the versions of shipped software is its greatest feature. It's not for everyone, but it's for some people, and for those people it is very important.
What good is a fresh kernel if a binary driver you need but can't change breaks?
If you have the appropriate CI pipelines as well as the kind of developers capable of directly consuming the upstream community outputs, by all means, do that. Many companies aren't like that though and prefer talking to tech support people over diving into the source code themselves. I guess we should just be happy about the fact that they fund Linux development at Red Hat.
Like changing the scheduler preferences of which process runs first on fork(), the parent or the child, exposed a bug in bash: https://yarchive.net/comp/linux/child-runs-first.html
Well, at least that's how it was a long time ago when I messed with this stuff more, but it looks like it may be a bit different now (I just did what I proposed), at least with the downstream CentOS kernel. I stopped having to care about this mostly back around the time when Oracle forked RHEL/CentOS into Oracle Linux and RedHat wasn't happy about Oracle piggybacking on their kernel testing for a paid product (as opposed to CentOS). I think maybe RedHat ships the kernel tar mostly pre-patched in-house now. I may have some of the details wrong in that, but it sounds right to me. It's easy to blame Oracle for why we can't have nice things anymore, because usually it's true. :/
Edit: Ah, I found one like what I was talking about, it has 5697 patches included! The most recent 5 series kernel (5.11) didn't have a lot of patches, but 5.6 did. There might be some 6.x series ones like that as well, I don't recall.
The normal method now is that they provide a large, pre-patched kernel source, which is then built. In both cases, the same source code is used for the build; it's just the transformations before the build that differ, which is likely why it doesn't violate the GPL.
The GPL says the source must be shipped, but it makes no provisions for how easy it is to read. It's sort of like if you shipped a JS lib as minified to comply with the GPL, but have a separate non-minified version to develop on in-house.
In the end, it doesn't stop anyone from downloading and compiling their own RedHat kernel as CentOS does, and Oracle Linux did(does?), but it does make it harder (not impossible) for Oracle to come in to RHEL customers and say "hey, stop your RHEL contracts, and pay us instead, and we'll support your systems without you having to reinstall." which is capitalizing on RedHat's packaging and testing work.
Yes, this was in the news a couple of years ago. They do not include separate patches anymore:
https://access.redhat.com/security/cve/cve-2018-3665 shows there will be updated kernels coming.
Build a business on a fork, then compartmentalize it until it can run without modifying the core so you can get off the bugfix treadmill.
It's a fork in the sense that it's heavily diverged from the source of the old kernel, but it's all still backports and not a ton of new development.
So about that "Lazy FPU" vulnerability (CVE-2018-3665)... this probably ought to be a blog post, but the embargo just ended and I think it's important to get some details out quickly.

This affects recent Intel CPUs. It might affect non-Intel CPUs, but I have no evidence of that. It is an information leak caused by speculative execution, affecting operating systems which use "lazy FPU context switching". The impact of this bug is disclosure of the contents of FPU/MMX/SSE/AVX registers. This is very bad because AES encryption keys almost always end up in SSE registers.

You need to be able to execute code on the same CPU as the target process in order to steal cryptographic keys this way. You also need to perform a specific sequence of operations before the CPU pipeline completes, so there's a narrow window for execution. I'm not going to say that it's impossible that this could be executed via a web browser or a similarly "quasi-remote" attack, but it's much harder than Meltdown was.

I was not part of the coordinated disclosure process for this vulnerability. I became aware of this issue after attending a session organized by Theo de Raadt at @BSDCan. It took me about 5 hours to write a working exploit based on the details he announced. Theo says that he was not under NDA and was not part of the coordinated disclosure process. I believe him. However, there were details which he knew and attributed to "rumours" which very clearly came from someone who was part of the embargo.

My understanding is that the original disclosure date for this was some time in late July or early August. After I wrote an exploit for this, I contacted the embargoed people to say "look, if I can do this in five hours, other people can too; you can't wait that long".
While I have exploit code and it is being circulated among some of the relevant security teams, I'm not going to publish it yet; the purpose was to convince the relevant people that they couldn't afford to wait, and that purpose has been achieved. I know from the years that I spent as FreeBSD security officer that it takes some time to get patches out, and my goal is to make the world more secure, not less. But after everybody has had time to push their patches out, I'll release the exploit code to help future researchers.

I think that's everything I need to say about this vulnerability right now. Happy to answer questions, but I'm not part of the FreeBSD security team and don't have any inside knowledge here -- FreeBSD takes embargoes seriously and they didn't share anything with me.
One more thing, some advisories are going out giving me credit for co-discovering this. I didn't; I just reproduced it and wrote exploit code after all the important details leaked.
I guess another question I have is: can we win the performance back through fixes in the CPU, or will speculative execution always be insecure and thus need patching in software?
Try as I might, I am not a CPU person.
Speculation is not fundamentally incompatible with security. It's just that nobody in the industry ever thought that leaking information out of a speculative context was possible -- and so there is no hardening anywhere. Now that it has been proven possible, people are rushing to find all the ways this can be exploited. New CPUs currently being designed will fix all these, and eventually we will have speculation without security issues.
Except for Spectre variant 1. That will always stay with us, because there is no sensible fix for it. The only real solution to that is to accept that branches cannot be used as a security boundary. This is mostly relevant to people implementing secure sandboxes and language runtimes. Going forward, the only reasonable assumption is that if you let a third party run their code in a process, no matter how you verify accesses or otherwise try to contain that code, you should assume it has a read access to the entire process. Any real security requires you to make use of the proper OS-provided isolation.
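For concreteness, the leak being described is the classic Spectre v1 bounds-check-bypass gadget. Here's a sketch in C, together with the branchless index-masking mitigation that compilers and kernels commonly apply (array names and sizes are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
size_t  array1_size = 16;          /* power of two, so masking works below */
uint8_t array2[256 * 512];         /* probe array for the cache side channel */

/* Classic Spectre v1 gadget: the bounds check can be speculated past,
   and the dependent load leaves a cache footprint indexed by a secret. */
uint8_t victim(size_t x) {
    if (x < array1_size)
        return array2[array1[x] * 512];
    return 0;
}

/* Common software mitigation: clamp the index with a branchless mask,
   so even a mispredicted branch cannot produce an out-of-bounds read. */
uint8_t victim_masked(size_t x) {
    if (x < array1_size) {
        x &= (array1_size - 1);    /* valid because the size is a power of two */
        return array2[array1[x] * 512];
    }
    return 0;
}
```

The masked variant illustrates the point above: since the branch itself can't be trusted as a security boundary, the data path has to be made safe regardless of prediction.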
If this is true, it is unbelievably bad for the future of security and computing in general. People throwing around this assertion are, in my opinion, not appreciating how bad it is. We need to try a lot harder before we give up.
1. Finer-grained isolation makes security better, because it allows us to apply the Principle of Least Authority to each component of the system, and protect components of a system from bugs in other components. To make meaningful gains in security going forward, we need to encourage more fine-grained isolation. If process-level isolation is the finest grain we'll ever have, we can't make these advances.
2. The scalability of edge computing requires finer-grained isolation than process isolation. The trend is towards pushing more and more code to run directly on devices or edge networks rather than centralized servers. That means that the places where code runs need to be able to handle orders of magnitude more tenants than before -- because everyone wants their code to run in every location. If we can't achieve high multi-tenancy securely -- by which I mean 10,000 or more independently isolated pieces of code running on the same machine -- then the only solution will be to limit these resources to big, trusted players. Small companies will be shut out from "the edge". That's bad.
Luckily, the "process isolation is the only isolation" claim is wrong. It's true that we need to evolve our approach to make sub-process isolation secure, but it's not impossible at all. In fact, it's possible to design a sandbox where you can trivially prove that code running inside it cannot observe side channels.
Here's how that works: Think about Haskell, or another purely functional language. An attacker provides you with a pure function to execute. Because it's a pure function, for some given set of inputs, it will always produce exactly the same output, no matter what machine you run it on, or what else is going on in the background. Therefore, the output cannot possibly incorporate observations from side channels, no matter what the code does internally.
So: It is possible to run attacker-provided code without exposing secrets from the process's memory space.
We do need to remove access to timers, or find a way to make them not useful. We also need to prevent attackers from being able to time their code's execution remotely. Basically, we need to think carefully about the I/O surface while paying attention to timing. But sandboxing has always been about thinking carefully about the I/O surface, and we have a lot of control there. We just have a new aspect that we need to account for.
I think it's doable.
This is patently false. The attacker can do timing on his side and see how long it takes your service to return a response. If you let an attacker have any sort of output channel, you give him the means to probe side channels through it, and there is nothing you can do to close that off.
Yes, I mentioned that in my comment:
> We also need to prevent attackers from being able to time their code's execution remotely.
I don't think it's impossible to mitigate.
First, not all use cases actually involve the attacker being able to directly invoke and then measure the timing of their code. Imagine, for example, that I'm using attacker-provided code to apply some image manipulation to a photo on my phone, then I post the photo online. The attacker doesn't have any way to know how long their code took to execute, and I can be confident that they haven't exfiltrated side channels through the image via steganography because their code was deterministic so couldn't have incorporated those side channels into its output.
Second, I don't think the space of timer noise techniques has been adequately explored. Timer noise is one of the main defenses browsers have deployed against Spectre, and pragmatically speaking it has been reasonably effective. Yes, in theory there are lots of statistical techniques an attacker might use to get around it, but it certainly increases the cost of attack. We need to figure out how to push the cost of an attack out of reach, even if we can't make it completely impossible.
Why? Browsers are moving to one process per security domain anyway. Workers spawn their own threads too. You have to make your processes more lightweight and optimize memory-sharing, but that's exactly what's happening.
There's also some voodoo one might be able to do on Linux with the clone syscall, which lets you spawn a new process which still shares the same memory as the parent.
At a significant cost in RAM usage and some CPU overhead too. But browsers have the advantage that they are sitting on desktops and laptops that are massively underutilized, and the number of security domains you might typically have open at once is in the 10's or 100's, not 10,000's.
But that's not what I was talking about with "edge computing". I'm the architect of Cloudflare Workers, an edge compute platform. We have 151 locations today and are pushing towards thousands in the coming years, and every one of our customers wants their code to run in every location. As we push to more and more locations, the available resources in each location decrease, but the number of customers will only increase.
At our scale, unlike browsers, one-process-per-customer just isn't going to scale. Context switching is too expensive, RAM usage per process is too high, etc. So we need other ways to mitigate attacks.
> There's also some voodoo one might be able to do on linux with the clone syscall which lets you spawn a new process which still shares the same memory as the parent.
It's not really voodoo. Linux has no distinction between processes and threads. Everything is a process. But two processes can share the same memory space. Usually, developers call these "threads", and use "process" to mean "the set of threads sharing a memory space".
In any case, if it's the same memory space, then it's susceptible to Spectre attacks.
kentonv is correct that giving up on intra-process security entirely is an ominous sign.
And we might already have it. Thanks to virtualization hardware, it should be possible for a process to own and handle its own private page-table mapping entirely from userspace.
Switching from the sandboxed code to the sandboxing code might be more expensive than a plain function call, but still way faster than IPC.
For example, libDune (https://github.com/ramonza/dune), which admittedly seems to be a dead project. It also requires OS support.
Process, container, and VM isolation all have lots of bugs too.
But process-based sandboxing does not replace language-based sandboxing. Operating systems (e.g. Linux) have bugs all the time that allow processes to escalate privileges. It's also not at all clear that process-based sandboxing is enough to defend against Spectre, either. But it's much harder to design an attack that can break both layers than to design one that breaks one or the other.
Certainly, if process-based sandboxing does not add too much overhead for your use case, then you absolutely should use it in addition to language-based sandboxing. But there are plenty of use cases where processes are too much overhead and we really need language-based sandboxing to make progress.
It's true, though. It's been known since the early 1990s, when they wrote about all the timing channels and such in VAX CPUs. The good news is there are both architectures and tooling that can do anything from reducing these issues to eliminating them. You just have to design the processors for them. Otherwise, you're constantly playing a cat-and-mouse game, trying to dodge the issues of running code needing separation on a machine designed for pervasive sharing.
One thing I came up with was just going back to physical separation with a lot of tiny, SBC-like computers with multicore chips. Kind of like what they use for energy-efficient clusters like BlueGene. One can do some separation at a physical level with better, software-level separation from there. The stuff that truly can't mix gets the physical separation. The rest gets software separation using things like separation kernels with time/space partitioning. At least one of the commercial vendors reported being immune to CPU weaknesses due to how separation and scheduling work. Whether true or not, the stronger methods of separation kernels make more sense now, given they'll plug some leaks. The other method I came up with had already been invented and patented. (sighs)
"Because it's a pure function, for some given set of inputs, it will always produce exactly the same output, no matter what machine you run it on"
That's not true, btw. It gets converted into whatever the internal representation is, running through circuits fabricated on that process node. These create analog properties that might be manipulated by attackers to bypass security. I warned people about that when I was new on HN, like I did on Schneier's blog. I learned it from a hardware guru who specialized in detecting or building that stuff, mainly for counterfeiting rather than backdoors. We're seeing numerous attacks now that use software to create hardware effects at the analog level. There are ways to mitigate stuff like that, but I have no confidence they'll work with complex, highly optimized hardware with a billion transistors' worth of attack surface to consider.
So, as Brian Snow advocated in "We Need Assurance," you have to reverse the thinking to start with a machine designed to enforce separation from ground up in its operations. Then, OS/software architecture on top of that. Good news is CompSci has lots of stuff like that. Someone with money just has to put it together. More attacks will be found but most will be blocked. We can iterate the good stuff over time as problems are found addressing as many root causes as possible.
Is it not a feasible fix for Spectre-v1 to have loaded cachelines wait in a staging area, and not actually update the cache hierarchy until the load instruction is retired?
Keep in mind that it's not just the L1 cache you potentially need to worry about, but also all the higher level caches, which are really like remote nodes in a distributed system. Even the last level cache would have to be aware of the current speculation state of all threads in the system, which seems like it would be really expensive.
What's perhaps even worse, you wouldn't be able to hit on speculatively loaded cache lines until they have "committed" (because the resulting change in external bandwidth utilization may be observable from another thread), which would probably kill a lot of the benefits of the cache in the first place.
The scarier part is managing bandwidth contention to shared structures.
Other developments catch up, like compiler technology. Not directly related to branch prediction, but: We are much better at polyhedral optimization than the last time people tried VLIW, for example.
Besides, the last major VLIW push (itanium) DID have SpecEx AND Branch Prediction, and did NOT have delay slots.
What will actually happen, of course, is that everyone will put an ASID/PCID as a tag word into all the relevant caches, they'll stop being probe-able from other contexts, and we'll all keep our deep pipelines and speculation, and the cache crisis of 2018 will be just a story we tell our grandkids.
Much cheaper than a paradigm shift based on long-since tried and rejected technology.
You jumped straight from "absence of proof must be proof of the opposite" into as bald a no true scotsmanism as I've seen recently. It's like fallacy central here.
But anyway I think they're making a fair argument. If something takes an enormous technical effort to switch to, and is similar in performance, nobody is going to put in that effort.
This is emphatically not evidence that the idea "doesn't work".
I believe if you could retroactively wipe out the last decade of improvement on x86 and Arm, and redirect all of that development effort to VLIW, the resulting chips and software would perform just fine by current standards.
Spectre is more generally due to load speculation modifying the cache while inside a branch-speculation context. Basically it affected every high-performance CPU in some way. Note that having a protected/safe branch instruction is only effective if the default branch code gen uses it, so I'm not counting such mitigations when saying "not affected". I believe that to be a justifiable decision.
Essentially the only CPUs that weren't affected were those with very little, if any, out-of-order execution. None of them are considered to have competitive performance.
But Intel got the government ("legal") backdoors in. Priorities.
(Intel wasn't quite alone in this, though -- at least one ARM core also had this issue.)
The instruction-level parallelism that a CPU can extract is primarily a dynamic kind of parallelism. You can, say, have a branch that's true 1000 times, then false 1000 times, then true 1000 times, then false 1000 times, etc.--as a compiler, telling the hardware to predict true or false is going to guarantee a 50% hit rate, but the stupid simple dynamic branch predictor will get 99% on that branch.
I worked on compilers too, and I think there is definitely still work to be explored here. Really, complicated OoO speculative processors are simply recovering a lot of the information the compiler already calculated! You take this code that is in no way suited for a CPU, and you do a ton of dataflow and high-level analysis on it (that you can only know from the source). You build this graph and optimized based on these facts. And then, you throw away the dataflow graph when you lower things. All this, after painfully calculating it on the assumed basis of "Yes, the CPU will like this code" -- only for the CPU to perform that whole process over again, say "Yes, I do, in fact, like this code" -- so it can execute efficiently without stalls anyway. There's clearly a mismatch here.
I mean, don't get me wrong -- this all seems like a perfectly fine and acceptable engineering tradeoff, but just disappointing from a computer science perspective, to me, at least, that this is not unified :) Of course, VLIW vs CISC etc is one of the classic debates...
Interestingly, Aaron Smith from Microsoft Research (and one of the original members of the TRIPS project) actually demonstrated and talked about their work on newer EDGE processor designs at MSR, and even demonstrated Windows running on an EDGE processor, this past week or so at ISCA2018 (complete with a working Visual Studio target, using an LLVM-based toolchain!) Their design is quite different from TRIPS it seems. I'm hoping the talks/work becomes public eventually, but that might just be a dream.
VLIW machines allow you to provide explicit hints about instruction-level parallelism, without all the superscalar analysis of instruction-level data dependencies that the hardware otherwise performs on the fly (so it can reschedule instructions as it goes).
I think that if interlock-free, software-scheduled CPUs allowed us to reach 20GHz clock speeds where complex superscalar machines were stuck at 3GHz, we'd all be jumping ship -- but they're not.
The solution might be to actually prevent some kinds of security-critical code from being optimized in this way. Say, forcing fully cache-synchronized, in-order execution for those parts of the code, with no resource sharing between cores.
Additionally, realize that clock speeds are not a relevant performance metric. Maybe, perhaps, it's possible for a software-scheduled CPU to run at 2*n GHz, but that's not interesting if it's slower in real-world workloads than today's tech at n GHz. And for breadth-deployment in data centers we're mostly looking at throughput per watt. CPUs with high clock speeds don't do well in that area, because of semiconductor physics.
There's a lot less mischief you can get up to when you only execute a couple of instructions beyond a mispredicted branch rather than a hundred, but mischief can't be theoretically ruled out entirely.
Everyone overlooked it, or everyone had a good reason to think it was impossible? The information is there; it's a target that needs to be secured.
> but we will effectively need to get rid of the idea that you can run untrusted code in the same process as data you want to keep secure.
If we accept that, then we also need to get rid of the idea that we can branch differently in trusted code based on details of "untrusted data".
It's just that there are a lot of branches that don't depend on untrusted data. Speculatively executing past them is perfectly fine and extremely valuable for performance. That's why nobody wants to get rid of speculative execution.
In the larger sense, mixing various privilege levels on shared hardware is likely a bad idea, despite being fundamental to general-purpose computing. This is because it's fairly difficult to cloak intrinsic "physical" attributes of execution, like execution timing, cache timing, from other processes, and essentially impossible (and/or unreasonable) to cloak it from one's own process. It is both possible for a process to generate lots of side-effects (e.g. IO, cache fill), and for it or another process to try to figure out ways the shared state changes by observing its own execution.
Unless the CPU does a partial pipeline flush upon realizing that there would be a trap. I don't know -- do Intel CPUs check for exceptions in order?
AFAIK other systems aren't affected -- is lazy context switching even a thing on them? The fundamental issue here is that one process' data is still in registers when another process is running and we've been relying on getting a trap to tell us when we need to restore the correct FP state.
Some ARM ABIs end up passing floats in integer registers, but that's just for compatibility for code that doesn't assume the presence of an FPU and might be doing everything soft float.
My hunch says that chips affected by 4a could easily be fair game (4a is speculating reads into privileged regs... I wonder if 4a would work on regs that are trapped; not inconceivable).
But it's a (relatively) simple OS fix.