Intel SA-00145: Lazy FP State Restore (intel.com)
206 points by lkurusa 8 months ago | 124 comments

Andy Lutomirski noted on another thread that he unintentionally fixed this two years ago in Linux:


(He switched it to eager FP because it's faster on modern hardware.)

But a lot of people running old LTS kernels may be affected.

EDIT: Looks like Luto's change landed in kernel version 4.6. https://kernelnewbies.org/Linux_4.6#List_of_merges https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
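For readers who haven't seen the lazy/eager distinction before, here is a toy model of the two strategies in C. It is a sketch of the concept only; all names and structure are invented for illustration and it bears no resemblance to the actual Linux implementation.

```c
#include <string.h>

/* Toy model of lazy vs. eager FPU context switching. Sketch only. */
struct task {
    double fpu_regs[8];        /* stand-in for real FPU/MMX/SSE/AVX state */
};

static double cpu_fpu[8];      /* the shared "physical" registers */
static struct task *fpu_owner; /* whose state is currently loaded */

/* Eager switch: always save the old task's state and load the new one. */
static void switch_eager(struct task *prev, struct task *next)
{
    memcpy(prev->fpu_regs, cpu_fpu, sizeof cpu_fpu);
    memcpy(cpu_fpu, next->fpu_regs, sizeof cpu_fpu);
    fpu_owner = next;
}

/* Lazy switch: do nothing now; the previous task's state stays in the
 * registers until the new task first touches the FPU (the real CPU
 * raises a #NM trap for that). In the window before the trap, prev's
 * registers are still physically present -- which is exactly what the
 * speculative loads in CVE-2018-3665 can observe. */
static void switch_lazy(struct task *prev, struct task *next)
{
    (void)prev;
    (void)next;
}

/* Models the #NM trap handler: save the owner's state, load ours. */
static void fpu_use(struct task *t)
{
    if (fpu_owner != t) {
        memcpy(fpu_owner->fpu_regs, cpu_fpu, sizeof cpu_fpu);
        memcpy(cpu_fpu, t->fpu_regs, sizeof cpu_fpu);
        fpu_owner = t;
    }
}
```

After `switch_lazy(&a, &b)`, task a's values are still sitting in the shared registers until b first uses the FPU; that window is what the exploit reads from speculatively. Eager restore closes the window at the cost of an unconditional save/restore, which on modern hardware with XSAVEOPT is cheap anyway.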

Does that include Red Hat Enterprise Linux? I can't believe how old a kernel they still use. I have to use it for work, but I run Arch at home, which is on 4.16.13. I'm honestly surprised that Red Hat can't manage to keep their distros current with stable package builds.

Red Hat backports security fixes, hardware support, and certain new features to their kernel packages. Staying on a stable kernel version throughout the product lifecycle is what allows them to guarantee a consistent kernel ABI, something virtually every other distribution throws by the wayside. Personally, I prefer not having third-party kernel modules on my servers break every time I run `yum upgrade`, like they do on Fedora (oh VMware, how I hate you so).

Except they still break, because Red Hat backports bleeding-edge features without changing the version number.

I'm using an HP B120i fakeraid controller with a proprietary driver, and it broke after the 7.4 upgrade. So while they're probably doing a good job on binary compatibility (7.5 didn't break it), it's not ideal.

Unfortunately the kABI does not encompass every symbol exported by the kernel, there is a whitelist maintained in the kernel-abi-whitelists package and scripts to check conformance of a module to the whitelist. Symbols are only ever added throughout the lifecycle of a RHEL release, so anything that conforms to the 7.3 kABI will also work on 7.4 - but if the module uses a symbol NOT whitelisted in 7.3 there's no guarantee the 7.4 update won't break it.
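A rough way to eyeball that conformance from the command line (module name, paths, and whitelist layout here are illustrative, not the official Red Hat tooling):

```shell
# List the kernel symbols an out-of-tree module depends on, then flag
# any that are not on the stable-kABI whitelist. Paths are illustrative.
nm -u --format=posix mymodule.ko | awk '{print $1}' | sort -u > needed.txt
sort -u /lib/modules/kabi-current/kabi_whitelist_x86_64 > stable.txt

# Symbols the module needs that are NOT guaranteed stable across updates:
comm -23 needed.txt stable.txt
```

Any symbol that falls out of that last `comm` is one a minor-release update is allowed to break.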

Not that it's much solace when your proprietary blob breaks, but Red Hat does make considerable effort to give vendors a stable target to build against - unfortunately not everyone fully validates conformance which results in the same old problems cropping up from time to time.

That, and the Spectre fixes broke a BUNCH of third-party code that relied on non-exported code.

I really hate HP for this stupid fakeraid shit. The regular SATA AHCI controller is unusable on the HP servers I have, because it does not pass temperature info to the iLO, which causes the fans to skyrocket to 70% permanently.

Obviously, using the B120i was out of the question, since its crappy driver only targets Red Hat's ancient kernel, so my only viable solution was to go out and purchase a $200 real HP P420 Smart Array RAID card to connect my drives to.

> I'm honestly surprised that Red Hat can't manage to keep their distros current with stable package builds.

This is very wrong.

It would be trivial for Red Hat to update the kernel shipped in RHEL. It's not that they "can't keep up".

What they are doing is much more difficult and takes much more effort: they are backporting all security fixes, and sometimes other features, from newer kernels to older kernels. Why? Because many RHEL customers want older kernels. Unlike Windows and Solaris, Linux doesn't have a stable kernel ABI (by design). Some RHEL customers want to use binary drivers, and they want those drivers to keep working when they update their system. Other RHEL customers don't use or need binary drivers, but they do want to avoid the compatibility problems newer kernels can introduce.

And no, it doesn't matter that the kernel API is very stable. Sometimes even fixing bugs can introduce incompatibility problems in badly written software. Other times, changing some trivial, subtle thing about the scheduler can make a badly written program perform much differently than it used to. The world is full of such programs, and the people who administer such systems are very grateful that the behavior of their systems doesn't randomly change across the lifetime of a release.

The fact that RHEL freezes the versions of shipped software is its greatest feature. It's not for everyone, but it's for some people, and for those people it is very important.

The problem with that is that I'd argue it is "safer" to run a kernel.org stable kernel rather than RHEL, since it's nigh impossible to keep up with the volume of fixes that land in stable kernels every few weeks. And it is unclear which of the stable fixes are security relevant, even when they did not get a CVE number or the like; QAing all of them is another mammoth task that often requires expertise in several areas. I'm actually glad that (upstream) Linux doesn't have a stable kernel ABI. If you've ever looked into what gross hacks are required in the RHEL kernel just to keep that intact, well, good luck.

That's not the choice customers have. The parent gave examples of why companies often need predictability and stable ABIs. They're choosing between no kernel upgrades at all and RHEL's backports.

What good is a fresh kernel if a binary driver you need but can't change breaks?

Right... it's as if they haven't managed to automate a CI pipeline to handle these changes for them. Other distros do it just fine, and I've had far more things break in CentOS or Red Hat than in Arch.

That CI pipeline would need to be placed not at Red Hat but at their big customers and their hardware vendors. To RH's benefit, those tend to suck at that and prefer to pay big dollars over having to keep up with kernel development or getting rid of their crappy out-of-tree drivers.

If you have the appropriate CI pipelines as well as the kind of developers capable of directly consuming the upstream community outputs, by all means, do that. Many companies aren't like that though and prefer talking to tech support people over diving into the source code themselves. I guess we should just be happy about the fact that they fund Linux development at Red Hat.

> Other times something like changing some trivial subtle thing about the scheduler can make a badly written program perform much differently then it used to be.

Like changing the scheduler preferences of which process runs first on fork(), the parent or the child, exposed a bug in bash: https://yarchive.net/comp/linux/child-runs-first.html
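The underlying assumption is easy to reproduce. In this sketch (hypothetical code, not the actual bash bug), which byte lands first in the pipe depends entirely on whether the scheduler runs the parent or the child first after `fork()`:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 'p' or 'c' depending on whether the parent or the child got
 * to run first after fork() -- entirely the scheduler's choice. Code
 * that silently assumes one particular answer is the kind that broke. */
char first_to_run(void)
{
    int fd[2];
    char who = '?';

    if (pipe(fd) != 0)
        return who;

    pid_t pid = fork();
    if (pid == 0) {                    /* child announces itself */
        (void)!write(fd[1], "c", 1);
        _exit(0);
    }
    (void)!write(fd[1], "p", 1);       /* parent announces itself */
    waitpid(pid, NULL, 0);

    (void)!read(fd[0], &who, 1);       /* first writer wins */
    close(fd[0]);
    close(fd[1]);
    return who;
}
```

The only way to make the ordering reliable is to synchronize explicitly (a pipe handshake, `waitpid()`, etc.); relying on the scheduler's preference is exactly the latent bug the linked thread describes.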

RHEL kernels are heavily patched, i.e. stuff gets backported from new kernels, sometimes even from -rc if needed. The version number is therefore not that useful. Some call it a Frankenkernel. :)

The nice thing is that it's all fully documented. If you download the kernel src.rpm[1] and install it, you'll get the kernel RPM spec file in your default RPM build environment paths (or use rpm2cpio $KERNEL_SRPM | cpio -idmv). The way the RPM building works is to take a vanilla kernel and then apply a LOT of individual patch files to it, each with their own entry.
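Roughly like this, with an illustrative filename (grab a real src.rpm from vault.centos.org):

```shell
# Unpack the source RPM into the current directory.
rpm2cpio kernel-2.6.18-238.el5.src.rpm | cpio -idmv

# Every backport is its own PatchNNNN: line in the spec file, so a
# quick count shows how far the kernel has drifted from vanilla.
grep -c '^Patch' kernel-2.6.spec
```

The spec filename varies by release; the point is that each backport used to be an individually named, individually documented patch entry.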

Well, at least that's how it was a long time ago when I messed with this stuff more, but it looks like it may be a bit different now (I just did what I proposed), at least with the downstream CentOS kernel. I stopped having to care about this mostly back around the time when Oracle forked RHEL/CentOS into Oracle Linux and RedHat wasn't happy about Oracle piggybacking on their kernel testing for a paid product (as opposed to CentOS). I think maybe RedHat ships the kernel tar mostly pre-patched in-house now. I may have some of the details wrong in that, but it sounds right to me. It's easy to blame Oracle for why we can't have nice things anymore, because usually it's true. :/

Edit: Ah, I found one like what I was talking about[2], it has 5697 patches included! The most recent 5 series kernel (5.11) didn't have a lot of patches, but 5.6 did. There might be some 6.x series ones like that as well, I don't recall.

1: http://vault.centos.org/centos/7/os/Source/SPackages/kernel-...

2: http://vault.centos.org/5.6/os/SRPMS/kernel-2.6.18-238.el5.s...

Yeah, as of a few years ago Red Hat only gives access to the proper broken-out set of patches they include in their kernel to paying customers, and they make those customers sign a contract agreeing not to distribute those patches. (They're also the only major distro that doesn't cooperate with the upstream stable releases.)

How is that not a GPL violation?

The source code is released; it's just the formatting that has changed. Presumably the contract is around the extra effort Red Hat puts into making the source nice and easily manageable by providing what they used to by default: a pristine kernel and a bunch of patches to apply.

The normal method now is that they provide a large, pre-patched kernel source, which is then built. In both cases the same source code is used for building; it's just the transformations before the build that differ, which is likely how it doesn't violate the GPL.

The GPL says the source must be shipped, but it makes no provisions for how easy it is to read. It's sort of like if you shipped a JS lib as minified to comply with the GPL, but have a separate non-minified version to develop on in-house.

In the end, it doesn't stop anyone from downloading and compiling their own Red Hat kernel, as CentOS does and Oracle Linux did (does?). But it does make it harder (not impossible) for Oracle to come to RHEL customers and say "hey, drop your RHEL contracts and pay us instead, and we'll support your systems without you having to reinstall", which is capitalizing on Red Hat's packaging and testing work.

Oracle then started a project to break them back out again. Last update seems to be November 2017 though.


> I think maybe RedHat ships the kernel tar mostly pre-patched in-house now.

Yes, this was in the news a couple of years ago. They do not include separate patches anymore:



> RHEL-7 will automatically default to (safe) “eager” floating point register restore on Sandy Bridge and newer Intel processors. AMD processors are not affected.


It looks like older processors may still be vulnerable?

https://access.redhat.com/security/cve/cve-2018-3665 shows there will be updated kernels coming.

Contrast this with the Postgres addon community, which has a culture of pushing changes upstream until they no longer have to maintain a fork of the Postgres code.

Build a business on a fork, then compartmentalize it until it can run without modifying the core so you can get off the bugfix treadmill.

The changes are all upstream to begin with.

It's a fork in the sense that it's heavily diverged from the source of the old kernel, but it's all still backports and not a ton of new development.

Why does it need to be diverged if there is no new functionality in their version? Because they can?

ABI stability to support binary-only loadable modules is probably the most important reason.

Why do many distros have releases to which only bug and security fixes are applied? Same reason. The divergence is just those fixes.

I posted some details about this to Twitter: https://twitter.com/cperciva/status/1007010583244230656

Here is the text of the tweets for convenience:

So about that "Lazy FPU" vulnerability (CVE-2018-3665)... this probably ought to be a blog post, but the embargo just ended and I think it's important to get some details out quickly.

This affects recent Intel CPUs. It might affect non-Intel CPUs but I have no evidence of that. It is an information leak caused by speculative execution, affecting operating systems which use "lazy FPU context switching". The impact of this bug is disclosure of the contents of FPU/MMX/SSE/AVX registers. This is very bad because AES encryption keys almost always end up in SSE registers.

You need to be able to execute code on the same CPU as the target process in order to steal cryptographic keys this way. You also need to perform a specific sequence of operations before the CPU pipeline completes, so there's a narrow window for execution. I'm not going to say that it's impossible that this could be executed via a web browser or a similarly "quasi-remote" attack, but it's much harder than Meltdown was.

I was not part of the coordinated disclosure process for this vulnerability. I became aware of this issue after attending a session organized by Theo de Raadt at @BSDCan. It took me about 5 hours to write a working exploit based on the details he announced. Theo says that he was not under NDA and was not part of the coordinated disclosure process. I believe him. However, there were details which he knew and attributed to "rumours" which very clearly came from someone who was part of the embargo.

My understanding is that the original disclosure date for this was some time in late July or early August. After I wrote an exploit for this, I contacted the embargoed people to say "look, if I can do this in five hours, other people can too; you can't wait that long".
While I have exploit code and it is being circulated among some of the relevant security teams, I'm not going to publish it yet; the purpose was to convince the relevant people that they couldn't afford to wait, and that purpose has been achieved. I know from the years that I spent as FreeBSD security officer that it takes some time to get patches out, and my goal is to make the world more secure, not less. But after everybody has had time to push their patches out I'll release the exploit code to help future researchers.

I think that's everything I need to say about this vulnerability right now. Happy to answer questions, but I'm not part of the FreeBSD security team and don't have any inside knowledge here -- FreeBSD takes embargoes seriously and they didn't share anything with me. </thread>

One more thing, some advisories are going out giving me credit for co-discovering this. I didn't; I just reproduced it and wrote exploit code after all the important details leaked.

How do you feel about Intel giving every big customer a nice presentation about every vuln ahead of the disclosure?

Serious question: should we consider speculative CPU execution to be A Bad Idea (tm) and move on from it, since these problems keep coming up? Or is the thought that we have more or less been gaining performance on the back of incorrectly written software (which does not take these speculative execution edge cases into account), and the only way forward is patching?

Another question I have: can we win the performance back through fixes in the CPU, or will speculative execution always be insecure and thus need patching in software?

Try as I might, I am not a CPU person.

Completely dropping all forms of speculative execution means dropping overall performance to a tenth of today's. There are really hard limits on how fast any operation, especially a memory operation, can be done. The way we have made our CPUs faster is by making them do more operations in parallel, at all levels. At the lowest level, in straight-line code, this very often requires speculation.

Speculation is not fundamentally incompatible with security. It's just that literally no one in the industry ever thought that leaking information out of a speculative context was possible -- and so there is no hardening anywhere. Now that it has been proven possible, people are rushing to find all the ways it can be exploited. New CPUs currently being designed will fix all of these, and then eventually we will have speculation without security issues.

Except for Spectre variant 1. That will always stay with us, because there is no sensible fix for it. The only real solution is to accept that branches cannot be used as a security boundary. This is mostly relevant to people implementing secure sandboxes and language runtimes. Going forward, the only reasonable assumption is that if you let a third party run their code in a process, no matter how you verify accesses or otherwise try to contain that code, you should assume it has read access to the entire process. Any real security requires you to use the proper OS-provided isolation.
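The variant 1 pattern being described is the classic bounds-check-bypass gadget from the Spectre paper. The sketch below shows it next to a branch-free index mask in the style of Linux's `array_index_mask_nospec()`; the array contents and function names here are my own for illustration:

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t array1[16];
static size_t  array1_size = 16;
static uint8_t array2[256 * 512];

/* The classic Spectre v1 gadget: architecturally correct, but a
 * mistrained branch predictor can execute the body speculatively with
 * an out-of-bounds x, leaving a secret-dependent cache footprint. */
uint8_t victim(size_t x)
{
    if (x < array1_size)              /* the branch "security boundary" */
        return array2[array1[x] * 512];
    return 0;
}

/* Branch-free mask in the style of Linux's array_index_mask_nospec():
 * all-ones when index < size, zero otherwise, with no conditional
 * branch for the predictor to mistrain. Relies on arithmetic right
 * shift of a negative value, as the kernel's version does. */
static size_t index_mask_nospec(size_t index, size_t size)
{
    return (size_t)(~(long)(index | (size - 1 - index)) >>
                    (sizeof(long) * 8 - 1));
}

uint8_t victim_hardened(size_t x)
{
    if (x < array1_size) {
        x &= index_mask_nospec(x, array1_size); /* clamped even under speculation */
        return array2[array1[x] * 512];
    }
    return 0;
}
```

The mask doesn't stop speculation; it just guarantees that the speculatively executed load can never use an out-of-bounds index, which is the "stop treating the branch as the boundary" idea in code form.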

> Going forward, the only reasonable assumption is that if you let a third party run their code in a process, no matter how you verify accesses or otherwise try to contain that code, you should assume it has read access to the entire process.

If this is true, it is unbelievably bad for the future of security and computing in general. People throwing around this assertion are, in my opinion, not appreciating how bad it is. We need to try a lot harder before we give up.

Here's why:

1. Finer-grained isolation makes security better, because it allows us to apply the Principle of Least Authority to each component of the system, and protect components of a system from bugs in other components. To make meaningful gains in security going forward, we need to encourage more fine-grained isolation. If process-level isolation is the finest grain we'll ever have, we can't make these advances.

2. The scalability of edge computing requires finer-grained isolation than process isolation. The trend is towards pushing more and more code to run directly on devices or edge networks rather than centralized servers. That means that the places where code runs need to be able to handle orders of magnitude more tenants than before -- because everyone wants their code to run in every location. If we can't achieve high multi-tenancy securely -- by which I mean 10,000 or more independently isolated pieces of code running on the same machine -- then the only solution will be to limit these resources to big, trusted players. Small companies will be shut out from "the edge". That's bad.

Luckily, the "process isolation is the only isolation" claim is wrong. It's true that we need to evolve our approach to make sub-process isolation secure, but it's not impossible at all. In fact, it's possible to design a sandbox where you can trivially prove that code running inside it cannot observe side channels.

Here's how that works: Think about Haskell, or another purely-functional language. An attacker provides you with a pure function to execute. Because it's a pure function, for a given set of inputs it will always produce exactly the same output, no matter what machine you run it on or what else is going on in the background. Therefore, the output cannot possibly incorporate observations from side channels, no matter what the code does internally.

So: It is possible to run attacker-provided code without exposing secrets from the process's memory space.

The question is, of course, how do we build a useful sandbox that relies on this property. There is a lot of work to be done there. Luckily, we don't really have to use a purely-functional language, we only need to use a deterministic language. It turns out that the world's most popular language, JavaScript, is actually highly deterministic, in large part due to its single-threaded nature.

We do need to remove access to timers, or find a way to make them not useful. We also need to prevent attackers from being able to time their code's execution remotely. Basically, we need to think carefully about the I/O surface while paying attention to timing. But sandboxing has always been about thinking carefully about the I/O surface, and we have a lot of control there. We just have a new aspect that we need to account for.

I think it's doable.

> Here's how that works: Think about Haskell, or another purely-functional language. An attacker provides you with a pure function to execute. Because it's a pure function, for some given set of inputs, it will always produce exactly the same output, no matter what machine you run it on, or what else is going on in the background. Therefore, the output cannot possibly incorporate observations from side channels, no matter what the code does internally.

This is patently false. The attacker can do timing on his side and see how long it takes your service to return a response. If you let an attacker have any sort of output channel, you give him the power to use his stuff to find side channels, and there is nothing you can do to close it.

> The attacker can do timing on his side

Yes, I mentioned that in my comment:

> We also need to prevent attackers from being able to time their code's execution remotely.

I don't think it's impossible to mitigate.

First, not all use cases actually involve the attacker being able to directly invoke and then measure the timing of their code. Imagine, for example, that I'm using attacker-provided code to apply some image manipulation to a photo on my phone, then I post the photo online. The attacker doesn't have any way to know how long their code took to execute, and I can be confident that they haven't exfiltrated side channels through the image via steganography because their code was deterministic so couldn't have incorporated those side channels into its output.

Second, I don't think the space of timer-noise techniques has been adequately explored. Timer noise is one of the main defenses browsers have deployed against Spectre, and pragmatically speaking it has been reasonably effective. Yes, in theory there are lots of statistical techniques an attacker might use to get around it, but it certainly increases the cost of attack. We need to figure out how to push the cost of an attack out of reach, even if we can't make it completely impossible.

Deterministic code cannot receive a side channel. It may, as you point out, be able to transmit on a side channel. It might even be able to "reflect" one side channel into another one. But these risks seem both less severe and more easily mitigated. At least, the history of attacks of the latter types is not very impressive.

If the attacker can do timing on their side in a pure function, then by definition the time the response takes is one of the function inputs.

Haskell functions are only pure in an abstract sense that ignores micro-architectural side-effects and the resulting timing changes. All of the recent speculative side-channel attacks are the result of the abstraction layer exposed by modern processors leaking. You can't fix this by putting more abstraction layers with neat theoretical properties on top.

You can't be guaranteed to prevent the attacker from writing information to those side channels, but you can guarantee that the attacker can't read from those side channels, because reading from them requires doing timing analysis, and timing analysis requires I/O, which we've stipulated hasn't been passed to the attacking function.

> 2. The scalability of edge computing requires finer-grained isolation than process isolation.

Why? Browsers are moving to one process per security domain anyway. Workers spawn their own threads too. You have to make your processes more lightweight and optimize memory-sharing, but that's exactly what's happening.

There's also some voodoo one might be able to do on Linux with the clone syscall, which lets you spawn a new process that still shares the same memory as the parent.

> Browsers are moving to one process per security domain anyway.

At a significant cost in RAM usage, and some CPU overhead too. But browsers have the advantage that they are sitting on desktops and laptops that are massively underutilized, and the number of security domains you might typically have open at once is in the tens or hundreds, not tens of thousands.

But that's not what I was talking about with "edge computing". I'm the architect of Cloudflare Workers, an edge compute platform. We have 151 locations today and are pushing towards thousands in the coming years, and every one of our customers wants their code to run in every location. As we push to more and more locations, the available resources in each location decrease, but the number of customers will only increase.

At our scale, unlike browsers, one-process-per-customer just isn't going to scale. Context switching is too expensive, RAM usage per process is too high, etc. So we need other ways to mitigate attacks.

> There's also some voodoo one might be able to do on Linux with the clone syscall which lets you spawn a new process which still shares the same memory as the parent.

It's not really voodoo. Linux has no distinction between processes and threads. Everything is a process. But two processes can share the same memory space. Usually, developers call these "threads", and use "process" to mean "the set of threads sharing a memory space".
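Concretely, the trick the parent comment hints at looks like this -- a minimal sketch with invented names; real code would care more about stack direction and error handling:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* A child that has its own pid but shares the parent's memory space,
 * via clone(CLONE_VM) -- the "everything is a process" model in action. */
static int shared = 0;

static int child_fn(void *arg)
{
    (void)arg;
    shared = 42;              /* visible to the parent thanks to CLONE_VM */
    return 0;
}

int run_clone_demo(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (!stack)
        return -1;

    /* The stack grows down on x86, so pass the top of the allocation. */
    pid_t pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
    if (pid < 0) {
        free(stack);
        return -1;
    }
    waitpid(pid, NULL, 0);
    free(stack);
    return shared;            /* the child's write is visible here */
}
```

Without CLONE_VM, the write to `shared` would land in the child's copy-on-write page and the parent would still see 0. And because the memory space is shared, such a child stays fully inside the same Spectre blast radius, which is the point being made here.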

In any case, if it's the same memory space, then it's susceptible to Spectre attacks.

There are limits to Site Isolation. For example, due to document.domain, subdomains of a domain cannot generally be protected from one another.

kentonv is correct that giving up on intra-process security entirely is an ominous sign.

What we need is the equivalent of inter-process isolation for intra-process sandboxes, i.e. a hardware solution (hopefully this time not affected by Meltdown).

And we might already have it. Thanks to virtualization hardware, it should be possible for a process to own and handle its own private page table mapping completely from userspace [1].

Switching from the sandboxed code to the sandboxing code might be more expensive than a plain function call, but still way faster than IPC.

[1] For example, libDune (https://github.com/ramonza/dune), which admittedly seems a dead project. Also it requires OS support.

On JavaScript being a secure sub-process sandbox, as my manager told me months before Spectre, "Pwn2Own says otherwise." Much to my disappointment.

JavaScript is the most secure sandbox we have. The fact that it is subject to Pwn2Own is part of why. Other sandboxes have not received anywhere near the scrutiny, and surely have many more bugs that simply haven't been found.

Process, container, and VM isolation all have lots of bugs too.

Do you think it's likely, though, that a language-based sandbox is more secure than a process-based sandbox? Every major browser developer has now embraced process-based sandboxing in addition to the JavaScript sandbox they already had. Believe me, I want sub-process sandboxing to be viable. But now, post-Spectre, I think it's hopeless.

The combination of language-based sandboxing and process-based sandboxing is, of course, more secure than either on its own. More layers can only help. You could go a step further and run the process sandbox inside a VM. And run that VM on a dedicated machine. Each layer helps -- but each layer is more expensive, and at some point the cost is too high.

But process-based sandboxing does not replace language-based sandboxing. Operating systems (e.g. Linux) have bugs all the time that allow processes to escalate privileges. It's also not at all clear that process-based sandboxing is enough to defend against Spectre, either. But it's much harder to design an attack that can break both layers than to design one that breaks one or the other.

Certainly, if process-based sandboxing does not add too much overhead for your use case, then you absolutely should use it in addition to language-based sandboxing. But there are plenty of use cases where processes are too much overhead and we really need language-based sandboxing to make progress.

"If this is true, it is unbelievably bad for the future of security and computing in general. People throwing around this assertion are, in my opinion, not appreciating how bad it is. We need to try a lot harder before we give up."

It's true, though. It's been known since the early 1990s, when they wrote about all the timing channels and such in VAX CPUs. The good news is there are both architectures and tooling that can do anything from eliminating to reducing these issues. You just have to design the processors for them. Otherwise, you're constantly playing a cat-and-mouse game, trying to dodge the issues of running code needing separation on a machine designed for pervasive sharing.

One thing I came up with was just going back to physical separation, with a lot of tiny, SBC-like computers with multicore chips. Kind of like what they use for energy-efficient clusters like BlueGene. One can do some separation at a physical level, with better, software-level separation from there. The stuff that truly can't mix gets the physical separation; the rest uses software like separation kernels with time/space partitioning. At least one of the commercial vendors reported being immune to the CPU weaknesses due to how separation and scheduling work. Whether true or not, the stronger methods of separation kernels make more sense now, given they'll plug some leaks. The other method I came up with had already been invented and patented. (sighs)

"Because it's a pure function, for some given set of inputs, it will always produce exactly the same output, no matter what machine you run it on"

That's not true, btw. It gets converted into whatever the internal representation is and run through circuits fabbed on that process node. These create analog properties that might be manipulated by attackers to bypass security. I warned people about that when I was new on HN, like I did on Schneier's blog. I learned it from a hardware guru who specialized in detecting or building that stuff, mainly for counterfeiting rather than backdoors. We're seeing numerous attacks now that use software to create hardware effects at the analog level. There are ways to mitigate stuff like that, but I have no confidence they'll work with complex, highly-optimized hardware with a billion transistors' worth of attack surface to consider.

So, as Brian Snow advocated in "We Need Assurance," you have to reverse the thinking to start with a machine designed to enforce separation from ground up in its operations. Then, OS/software architecture on top of that. Good news is CompSci has lots of stuff like that. Someone with money just has to put it together. More attacks will be found but most will be blocked. We can iterate the good stuff over time as problems are found addressing as many root causes as possible.

> Except for Spectre variant 1. That will always stay with us, because there is no sensible fix for it.

Is it not a feasible fix for Spectre-v1 to have loaded cachelines wait in a staging area, and not actually update the cache hierarchy until the load instruction is retired?

Loading those cache lines might invalidate an exclusively held cache line on another core. When that core tries to write to that cache line, it can observe a slower write.

Is it theoretically possible? Maybe. Is it practical? Almost certainly not.

Keep in mind that it's not just the L1 cache you potentially need to worry about, but also all the higher level caches, which are really like remote nodes in a distributed system. Even the last level cache would have to be aware of the current speculation state of all threads in the system, which seems like it would be really expensive.

What's perhaps even worse, you wouldn't be able to hit on speculatively loaded cache lines until they have "committed" (because the resulting change in external bandwidth utilization may be observable from another thread), which would probably kill a lot of the benefits of the cache in the first place.

Yes, you can buffer up and kill any misspeculated updates. You have to do this for ALL shared structures, not just data caches. Annoying, but not impossible.

The scarier part is managing bandwidth contention to shared structures.

I know the Arm A-53 isn't cutting edge fast, but hardly one tenth. Seeing as the A-53 is unaffected by both Meltdown and Spectre, my assumption was that it didn't use speculative execution.

Basically every CPU with a pipeline -- even simple, in-order cores -- speculate by way of branch prediction. I think with the Cortex A53 the pipeline is short enough and the branch predictor simple enough that you can't build a useful spectre attack. It's also common in simpler in-order cores to speculate a little bit around memory accesses.

There are architectural choices for pipelined systems you can make that don't specex, but they require compiler-level optimization to take advantage of, so it's a chicken-and-egg problem to get adoption.

Been there, done that, didn't work. VLIW has been tried a dozen times and failed every time.

I think you're being too dismissive here. First of all, VLIW has been hugely successful in DSP cores. Your phone probably has a few in it. Second, Itanium wasn't great but it wasn't terrible either. And the Transmeta/Nvidia lineage has also been not terrible. There isn't any reason to choose them over a conventional OoO processor absent security concerns, but if they really do help with security concerns then these approaches bear careful investigation.

machine learning has been tried a dozen times and failed every time.

Other developments catch up, like compiler technology. Not directly related to branch prediction, but: We are much better at polyhedral optimization than the last time people tried VLIW, for example.

Besides, the last major VLIW push (itanium) DID have SpecEx AND Branch Prediction, and did NOT have delay slots.

Shrug. To paraphrase, your argument for "this can be done" is "THIS time will be different I swear!".

What will actually happen, of course, is that everyone will put an ASID/PCID as a tag word into all the relevant caches, they'll stop being probe-able from other contexts, and we'll all keep our deep pipelines and speculation and the cache crisis of 2018 will be just a story we tell our grandkids.

Much cheaper than a paradigm shift based on long-since tried and rejected technology.

you are fundamentally misunderstanding my argument: sometimes it is different. When it is different, it's usually because associated technologies have changed. For ML, it was GPUs and large amounts of data harvestable from the internet (arguably the second more than the first). The evolutionary development in compilers is a strong argument that it MIGHT happen to end SpecEx. If it doesn't, it's probably mostly due to technical debt, engineering inertia, and the difficulty of convincing end users to adopt. There are definitely cases where non-specex (VLIW or otherwise) can outperform; I have seen it with my own eyes.

> If it doesn't it's probably mostly due to technical debt and engineering inertia

You jumped straight from "absence of proof must be proof of the opposite" into as bald a no true scotsmanism as I've seen recently. It's like fallacy central here.

An argument of the form "many X failed, but a new X might not" is not a form of no true scotsman. It would only be no true scotsman if the claim was "X always wins, and Itanium wasn't X"

But anyway I think they're making a fair argument. If something takes an enormous technical effort to switch to, and is similar in performance, nobody is going to put in that effort.

This is emphatically not evidence that the idea "doesn't work".

I believe if you could retroactively wipe out the last decade of improvement on x86 and Arm, and redirect all of that development effort to VLIW, the resulting chips and software would perform just fine by current standards.

Thought it was effected by spectre - meltdown was the exploit that only effected a subset of the market. Meltdown is specifically due to load speculation happening prior to verifying access permissions.

Spectre is more generally due to load speculation modifying the cache while inside a branch speculation context. Basically it effected every high performance cpu in some way. Note that saying there’s a protected/safe branch instruction is only effective if the default branch code gen uses it, so I’m not including such mitigations when saying “not effected”. I believe that to be a justifiable decision.

Essentially the only cpus that weren’t effected were those with very little, if any, out of order execution. None of them are considered to have competitive performance.

In this context, getting "affect" and "effect" straight is important.

details :)

All CPUs speculate to some extent (branch prediction has been on everything for basically 30 years now). Intel CPUs got hit first because they're the first ones to speculate so deeply that you can discriminate cache behavior like this. The caches and pipelines just aren't big enough elsewhere, but the techniques are all the same.

Nope, Intel got hit because they did it wrong, and the others did it right. If you speculate you need two status registers. One for the good and one for the bad case. Same for meltdown: you need one for the kernel and one for user space. Intel didn't care, others do.

But Intel got the government ("legal") backdoors in. Priorities.

What are you talking about?

Meltdown was caused by Intel not checking privileges correctly in a speculated context, choosing to defer such checks as long as possible. This allowed the worst and easiest to use exploit, making it possible to read from any mapped location in memory. Most other CPUs did things right even in speculated contexts.

(Intel wasn't quite alone in this, though -- at least one ARM core also had this issue.)

It's affected by spectre, but not meltdown.

I wonder why we have to rely on the CPU to parallelize code. Surely the compiler could do a better job, and the CPU should just offer transistors without any smart logic. It's not backwards-compatible, and I'm aware of the Itanium fiasco, but I'm not convinced it's the wrong way.

As someone who works in compilers: no, it's not possible to make the compiler do a better job. There's a reason why VLIW architectures keep getting proposed and keep dying.

The instruction-level parallelism that a CPU can extract is primarily a dynamic kind of parallelism. You can, say, have a branch that's true 1000 times, then false 1000 times, then true 1000 times, then false 1000 times, etc.--as a compiler, telling the hardware to predict true or false is going to guarantee a 50% hit rate, but the stupid simple dynamic branch predictor will get 99% on that branch.

Well, there are middle grounds, such as EDGE architectures, that get you somewhere between a VLIW and a CISC machine, where everything is not so rigid. The trick isn't that the compiler must do better at scheduling things perfectly against unknown, dynamic information -- it must emit the information it already knows and currently throws away. This is how TRIPS worked: the compiler placed instructions statically like a VLIW machine (it was a grid-like architecture, so this is obviously important), but the CPU issued instructions dynamically at the basic-block level, like an OoO CISC machine. The secret is that you encode the dataflow dependencies inside basic blocks, more or less -- so the CPU does not have to rediscover them.

I worked on compilers too, and I think there is definitely still work to be explored here. Really, complicated OoO speculative processors are simply recovering a lot of the information the compiler already calculated! You take this code that is in no way suited for a CPU, and you do a ton of dataflow and high-level analysis on it (that you can only know from the source). You build this graph and optimized based on these facts. And then, you throw away the dataflow graph when you lower things. All this, after painfully calculating it on the assumed basis of "Yes, the CPU will like this code" -- only for the CPU to perform that whole process over again, say "Yes, I do, in fact, like this code" -- so it can execute efficiently without stalls anyway. There's clearly a mismatch here.

I mean, don't get me wrong -- this all seems like a perfectly fine and acceptable engineering tradeoff, but just disappointing from a computer science perspective, to me, at least, that this is not unified :) Of course, VLIW vs CISC etc is one of the classic debates...

Interestingly, Aaron Smith from Microsoft Research (and one of the original members of the TRIPS project) actually demonstrated and talked about their work on newer EDGE processor designs at MSR, and even demonstrated Windows running on an EDGE processor, this past week or so at ISCA2018 (complete with a working Visual Studio target, using an LLVM-based toolchain!) Their design is quite different from TRIPS it seems. I'm hoping the talks/work becomes public eventually, but that might just be a dream.

I should clarify that the idea that is wrong is just exposing the hardware bits to the compiler and expecting it to do better. There is certainly more scope for the compiler and the hardware to work together to make the results better, but exposing current hardware mechanisms is not the means to do that. Of course, try getting architecture and compiler people actually talking to each other. :-)

What are your thoughts on the Mill CPU team's claims re: their ISA allowing compilers feasibly to schedule operations statically?

you can do this to some degree on most CPUs -- moving loads away from where their results are used -- compilers, esp. on RISC machines with lots of registers, do this today.

VLIW machines allow you to provide hints about instruction level parallelism without all the superscalar on-the-fly analysis of instruction level data dependencies (so the hardware can do all that rescheduling on the fly).

I think that if interlock-free software scheduled CPUs allowed us to reach 20GHz clock speeds where complex superscalar machines were stuck at 3GHz we'd all be jumping ship - but they're not

Essentially the CPU is running JIT branch optimization and speculative loads, which can never be matched by static optimization. The interesting part is that any JIT compiler that is this dynamic in performance will have these problems.

The solution might be to actually prevent some kind of security important code from being optimized in this way. Say, forcing full cache sync in-order execution for parts of code with no resource sharing between cores.

hmm ... I wonder if there are JIT compilers with smallish caches that are susceptible to these attacks (but over much larger time scales)

> I think that if interlock-free software scheduled CPUs allowed us to reach 20GHz clock speeds where complex superscalar machines were stuck at 3GHz we'd all be jumping ship - but they're not

Additionally, realize that clock speeds are not a relevant performance metric. Maybe, perhaps, it's possible for a software-scheduled CPU to run at 2*n GHz, but that's not interesting if it's slower in real-world workloads than today's tech at n GHz. And for breadth deployment in data centers we're mostly looking at throughput per watt. CPUs with high clock speeds don't do well in that area because of semiconductor physics.

right, that's why we don't have really high clock speeds - that's not the point here

They execute everything in order on a fairly short pipeline but they engage in speculative execution just like everyone else. Here's their talk on their branch predictor.


There's a lot less mischief you can get up to when you only execute a couple of instructions beyond a mispredicted branch rather than a hundred but mischief can't be theoretically ruled out entirely.

VLIWs can make use of the same branch predictors that you see in out-of-order cores like an Intel Core or in-order cores like an ARM A53. The big problem that VLIW faces is starting loads early so that they've finished by the time you need their contents. Itanium tried to do that, but its mechanism really broke down in the face of possible memory exceptions and didn't speed things up much. Mill has a cleverer way to speculate with loads which might work better, but it's still entirely unable to speculatively load values across boundaries between codebases, like system calls. But of course speculatively loading across system calls is what's causing problems here. So it might be that the best thing is to accept the 80% performance solution in the name of security, keeping the hardware simple enough to understand.

A quote I once heard from a friend: "Don't turn your normal problem into a distributed systems problem."

Arguably, the reason for Spectre, etc., is that people have failed to realize that at the scale of modern intel CPUs, these problems already are distributed systems problems.

The compiler cannot compete with the CPU because the CPU has strictly more information. Any memory access can take between 4 and thousands of cycles to execute, and the compiler won't know how long any of them will take. The CPU can reorder based on exact knowledge.

Using profile guided optimization can help a lot with this in practice, though it isn't perfect.

> everyone in the industry never thought that leaking information out of speculative context was possible

Everyone overlooked it, or everyone had a good reason to think it was impossible? The information is there; it's a target that needs to be secured.

Thank you for your response!

I think Spectre is even less severe than Meltdown though.

We can't really get rid of speculative execution, but we will effectively need to get rid of the idea that you can run untrusted code in the same process as data you want to keep secure.

It's interesting that the future predicted by excellent (and entertaining) talk The Birth & Death of Javascript [1] will now never come to pass.

[1] https://www.destroyallsoftware.com/talks/the-birth-and-death...

> We can't really get rid of speculative execution

Why not?

> but we will effectively need to get rid of the idea that you can run untrusted code in the same process as data you want to keep secure.

If we accept that, then we also need to get rid of the idea that we can branch differently in trusted code based on details of "untrusted data".

Your last point is basically what Spectre v1 mitigations are all about, at least if you throw in the word "speculation" somehow. The rule is: don't speculate past branches that depend on untrusted data (though there are certain additional considerations about what the speculated code would actually do).

It's just that there are a lot of branches that don't depend on untrusted data. Speculatively executing past them is perfectly fine and extremely valuable for performance. That's why nobody wants to get rid of speculative execution.

It sounds like what we really need is a memory model that reflects this notion of trusted and untrusted data. The "evil bit", basically, but for real.

Speculative execution isn't an inherent security risk, but it typically entails a cache fill, which can, if not cleaned up or isolated, be used for a timing attack. That's the real issue here: execution results in mutating a shared datastore, and facts about the likely contents of that datastore can be inferred by performing additional operations and self-measuring aspects of one's own performance.

In the larger sense, mixing various privilege levels on shared hardware is likely a bad idea, despite being fundamental to general-purpose computing. This is because it's fairly difficult to cloak intrinsic "physical" attributes of execution, like execution timing, cache timing, from other processes, and essentially impossible (and/or unreasonable) to cloak it from one's own process. It is both possible for a process to generate lots of side-effects (e.g. IO, cache fill), and for it or another process to try to figure out ways the shared state changes by observing its own execution.

Personally I wonder if it will ever be possible to have a platform totally free of side-channel attacks, whether or not it uses speculative execution.

No, probably not, but as with all security, electronic and otherwise, the goal is to make breaking the security cost more than whatever it's protecting is worth (for some arbitrary definitions of "cost" and "worth").

From what I understand, speculative execution is “worth it” in most non-secure contexts. Furthermore, it seems like one could hypothetically implement speculative execution “correctly”, where speculation is still gated by the constraints of non-speculated execution. Could this be a problem that proof software could solve? I still have hope.

maybe only on things that have huge targets on their backs?

Just what I was looking for. Guess we'll get a new security announce.


... for debian 8 only, though. Looks like current stable dodged that bullet quite a while ago.

What is the impact of this exactly? If "FP State" just means floating point register values then those rarely contain interesting stuff.

If this also affects registers used for crypto acceleration then it could be used for stuff like leaking browser secrets from javascript, right?

"FP State" in this case includes MMX and SSE registers; they're handled via the same "lazy context switching" mechanism. Yes, you can steal keys which are being used for AESNI.

Leaking secrets via javascript -- I'll be impressed if someone pulls that off. There's a very tight timing window to exploit this before a trap fires and flushes your pipeline, and I doubt you can get the right instructions in there using javascript.

Can't you execute the trapping access in a not-taken speculatively-executed branch à la Meltdown?

Sure, but that gives you an even shorter window because the pipeline flushes as soon as the CPU realizes that it mis-speculated.

It is my understanding that you can make that window very long by having the mis-speculated branch depend on a value that has to be loaded from main memory.

Hmm, good point. Yes, that would probably make this much easier to exploit.

Unless the CPU does a partial pipeline flush upon realizing that there would be a trap. I don't know -- do Intel CPUs check for exceptions in order?

Well, they don't for #PF at least, because that's how Meltdown works - I'm not sure if it works differently for #NM though.

Rarely contain interesting stuff, like parts of AES keys.

Do we have PoC code? Has anyone tried attacking FP/SIMD state on other ISAs like Power or AArch64?

I have exploit code -- took me about 5 hours to write after Theo announced all the important details of the vulnerability. I'm not going to publish it yet, though.

AFAIK other systems aren't affected -- is lazy context switching even a thing on them? The fundamental issue here is that one process' data is still in registers when another process is running and we've been relying on getting a trap to tell us when we need to restore the correct FP state.

Lazy context switching is a thing on pretty much every architecture with an FPU.

Hmm, I thought most RISCy CPUs kept FP values in GP registers?

No, I can't think of an arch that does that. Power, SH, Mips, Sparc, Alpha, ARM, and RISC-V all have separate architectural register files for the floating point state.

Some ARM ABIs end up passing floats in integer registers, but that's just for compatibility for code that doesn't assume the presence of an FPU and might be doing everything soft float.

Hmm, ok. It's a long time since I've looked at that. Come to think of it, I think it might be 20 years since I opened CA:QA...

Yeah I’d say that modern OoO Arm implementations (A57, A72, ...) are worth trying to speculate into trapped VFP state. Lazy FPU is definitely a thing everywhere.

My hunch says that chips affected by 4a could easily be fair game (4a is speculating reads into privileged regs... I wonder if 4a would work on regs that are trapped; not inconceivable).

Is it fixable by microcode update?

Very unlikely. The key issue here is that traps aren't handled until a long way down the pipeline.

But it's a (relatively) simple OS fix.

We'll let this thread take the frontpage spot now that the announcement is official. Previous one: https://news.ycombinator.com/item?id=17304233.

What are we calling this Spectre variant?

Either "LazyFP" or "Lazy FPU".

Not numbering it? Variant 6 is as good as anything...
