
The stream of critical CPU vulnerabilities starting with Spectre/Meltdown last year is related to speculative execution in general, not just to Intel. (AMD and ARM CPUs are also vulnerable to Spectre, for example.) Intel CPUs are sometimes vulnerable to additional attacks because they speculate in more scenarios than other designs. But fundamentally, as long as multiple different trust domains are sharing one CPU that speculates at all, or has any microarchitectural state (e.g., caches), there are likely to be some side-channel attacks that are possible.

The important thing to realize is that speculation and caching and such were invented for performance reasons, and without them, modern computers would be 10x-100x slower. There's a fundamental tradeoff where the CPU could wait for all TLB/permissions checks (increased load latency!), deterministically return data with the same latency for all loads (no caching!), never execute past a branch (no branch prediction!), etc., but it historically has done all these things because the realistic possibility of side-channel attacks never occurred to most microarchitects. Everyone considered designs correct because the architectural result obeyed the restrictions (the final architectural state contained no trace of the bad speculation). Spectre/Meltdown, which leak the speculative information via the cache side-channel, completely blindsided the architecture community; it wasn't just one incompetent company.

The safest bet now for the best security is probably to stick to in-order CPUs (e.g., older ARM SoCs) -- then there's still a side-channel via cache interference, but this is less bad than all the intra-core side channels.

That's not exactly true. Broadly speaking, there have been two very different kinds of speculative execution vulnerabilities with different security implications and workarounds. Spectre and its relatives are an attack on trusted code that processes untrusted data using certain code patterns guarded by conditionals that can be speculatively executed; they're inherent to speculative execution past branches, but they require specific code patterns that can be avoided to work around the issue.
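To make the "specific code patterns" point concrete, here is a toy Python model of a bounds-checked read. It is only a sketch: the "cache" is a set, the `mispredict` flag stands in for the branch predictor guessing wrong, and every name (`PUBLIC`, `SECRET`, `victim`, `victim_masked`) is made up for illustration. The masked variant models the general index-clamping idea, not any particular compiler or kernel implementation.

```python
# Toy model only: no real hardware, timing, or cache is involved.
PUBLIC = [1, 2, 3, 4]      # data the victim may legitimately index
SECRET = [42]              # hypothetical secret stored right after it
MEMORY = PUBLIC + SECRET   # flat toy "address space"

def victim(index, cache, mispredict):
    """Bounds-checked read; returns only the architectural result."""
    in_bounds = index < len(PUBLIC)
    result = None
    if in_bounds or mispredict:      # mispredict: the body runs transiently
        value = MEMORY[index]        # may read past PUBLIC when mispredicted
        cache.add(value)             # value-dependent line gets cached
        if in_bounds:
            result = value           # only in-bounds reads ever commit
    return result

def victim_masked(index, cache, mispredict):
    """Same read, but the index is clamped by data flow, not by the branch."""
    in_bounds = index < len(PUBLIC)
    result = None
    if in_bounds or mispredict:
        safe_index = index if in_bounds else 0   # holds even transiently
        value = MEMORY[safe_index]
        cache.add(value)
        if in_bounds:
            result = value
    return result

leaky, masked = set(), set()
print(victim(4, leaky, mispredict=True), leaky)          # result is None, footprint is not empty
print(victim_masked(4, masked, mispredict=True), masked)
```

In both cases the caller gets `None` back, so the architectural result looks correct. The difference is only in what ends up in the modeled cache, which is the part an attacker would probe.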

These vulnerabilities and Meltdown allow untrusted code to speculatively access data that it shouldn't be allowed to access at all, and to use that speculative access to leak the data itself. Unlike Spectre this can be (and to some extent has to be) fixed at the hardware level, because the hardware itself is failing to protect sensitive data. This class of vulnerability seems to have been mostly Intel-exclusive so far (with the main exception being one unreleased ARM chip that was vulnerable to Meltdown). There's nothing inherent about modern high-performance CPUs that requires them to be designed this way.

Edit: This slipped my mind, but Foreshadow / Level 1 Terminal Fault was yet another similar Intel-only processor vulnerability that allowed speculative access to data the current process should not be able to access. It's definitely a pattern.

Yes, that's true, several of the vulnerabilities involve checks that are performed late (not at time of speculative access, but at some point before instruction commit). Not excusing the design choice at all, but it's conceivable that an engineer could make this choice if (i) side-channel effects of the speculation are not considered at all, and (ii) the postponement of the check allows the load latency to be reduced. Again, not justifying, and the vulnerabilities are terrible, but there does seem to be a rational-given-some-assumptions way to reach such a decision.

I’m not a CPU architect, but it seems like Intel saved a couple gates by putting garbage instead of zeroes in the pipeline.

After reading more of the (limited, publicly known) details, it looks like the data leaked isn’t, strictly speaking, total garbage. But I do wonder whether Intel got a meaningful latency improvement by putting potentially wrong data into the pipeline instead of using zeroes or stalling. Zeroes or a stall would require knowing that the data is invalid before continuing with execution, which could be a performance issue.
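The three options being weighed here (forward whatever data is there, forward zeroes, or stall until the check is done) can be compared in a toy Python model. This is a sketch under loud assumptions: `policy` is a made-up knob, `KERNEL_BYTE` is a hypothetical secret, and nothing here reflects how any real pipeline is built.

```python
# Toy model of a load whose permission check resolves late.
KERNEL_BYTE = 0x5A   # hypothetical secret value at a privileged address

def transient_load(user_mode, policy):
    """Return (architectural_result, probe_lines_touched)."""
    allowed = not user_mode               # only privileged code may read it
    if allowed:
        forwarded = KERNEL_BYTE
    elif policy == "forward_data":
        forwarded = KERNEL_BYTE           # real data reaches dependents early
    elif policy == "forward_zero":
        forwarded = 0                     # dependents only ever see zero
    elif policy == "stall":
        forwarded = None                  # nothing runs until the check is done
    else:
        raise ValueError(policy)
    # Dependent instructions touch a line selected by the forwarded value.
    touched = set() if forwarded is None else {forwarded}
    result = KERNEL_BYTE if allowed else None   # the fault squashes the result
    return result, touched

for policy in ("forward_data", "forward_zero", "stall"):
    print(policy, transient_load(user_mode=True, policy=policy))
```

All three policies give the same architectural answer to unprivileged code. Only the first leaves a footprint that depends on the secret, and only the last two need to know the data is invalid before dependents run, which is the latency cost in question.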

You're not, but the gentleman you're replying to is. Bet you can’t guess where and what he worked on?

Not sure how to read this, but if you meant it as a personal attack that's totally not ok here.


Yup, I worked for a bit at Intel, but I don't speak for them, I wasn't involved in any of the designs under discussion, and everything I'm saying here is public knowledge in the computer architecture community. I figured that the perspective from the academic comparch world might be interesting.

> There's nothing inherent about modern high-performance CPUs that requires them to be designed this way.

Assuming by "designed this way" you mean: to speculatively execute past security checks, I'd disagree.

I'd say the relevant performance measure for CPUs (as opposed to other kinds of processor) is the speed at which they can execute serial operations. As electronic performance improvements offer increasingly marginal gains, we need to resort to improved parallelism. When operations are needfully serial due to dependency, as are security checks, the only way to accelerate that beyond the limits of the electronics is to make assumptions (speculations).

It's not inherently wrong to do this, but it requires that speculations never have effects outside of their assumptive execution context.

I think there's quite a big difference between leaking information between security contexts (Meltdown) and leaking within a security context (Spectre). The latter is a problem, but it's not the same magnitude of failure as the former.

Also, no need for older SoCs. High end ARM chips have both in-order and out-of-order cores but the cheaper ones have in-order ones only. A Snapdragon 450 is pretty modern and doesn't speculate deeply enough to be vulnerable to Spectre.

> Spectre/Meltdown, which leak the speculative information via the cache side-channel, completely blindsided the architecture community; it wasn't just one incompetent company.

In the x86 space, Meltdown absolutely was down to one company apparently deciding to over-optimize for performance.

I can't find it now, but I remember reading a thread from (I think) the OpenBSD devs about how the Intel MMU documentation described fairly sane behaviours and how far the reality deviated from the documentation.

> In the x86 space, Meltdown absolutely was down to one company

Serious weasel wording. Outside of the x86 space, every other high-end architecture also had Meltdown issues: ARM, and IBM's POWER and mainframe designs.

Erm, no? Meltdown was Intel only. Spectre affects absolutely every architecture with speculative execution, but Meltdown (which allows crossing process and security boundaries) is absolutely unique to Intel.

Erm, check out Wikipedia https://en.wikipedia.org/wiki/Meltdown_(security_vulnerabili... and follow the links? Meltdown is CVE-2017-5754, see: https://developer.arm.com/support/arm-security-updates/specu... where as I recall ARM initially described the Cortex-A75 as having a "variant", but now just lumps it in with the CVE. And the IBM info is also there, POWER7+ through POWER9, and per Red Hat, mainframe/System Z.

Everyone who does speculative execution had Spectre issues, but Meltdown-style vulnerabilities have been mostly Intel-exclusive. These new ones are too.

Maybe because Intel has shipped a thousand more SKUs and millions more CPUs with Meltdown than ARM, for which the Cortex-A75 was a new design, and IBM, which doesn't ship huge numbers of either POWER or mainframe CPUs??

Why would that make a difference? We're not talking about manufacturing defects, every single unit they sell has the problem, doesn't matter if they sell 10 or 10 million.

It makes a difference because Intel is a more attractive and more consequential target for researchers. AMD's market share in servers is minuscule and even declining a bit as of 19Q1 (?), and modest but increasing nicely in notebooks and desktops https://news.ycombinator.com/item?id=19916279, while IBM's POWER and mainframe systems are expensive to very expensive to access.

ARM is actually a good target: a number of their newest designs use out-of-order speculative execution with Spectre vulnerabilities, they own the mobile space outside of notebooks, and one of the newest designs is even vulnerable to Meltdown. But the significant, headline-worthy instances tend to come in much more locked-down devices. Speed is also an issue: everything else being equal, the faster the chip, the faster data exfiltration proofs of concept will work.

Yes, the vulnerabilities are not just Intel, but they're mostly limited to Intel CPUs. Why is AMD less prone to these mistakes? Perhaps there are simply fewer researchers looking into AMD processors?

This happens only because Intel went much further than AMD and other companies in exploring these effects. If other companies used speculative execution as much as Intel, the result would be the same. It is not a flaw of implementation, it is a flaw of basic design.

They have different cache architectures; Intel uses inclusive (i.e. all levels contain keys from previous levels), AMD uses exclusive cache (each level contains unrelated entries to any other level). This might have different effects on classes of vulnerabilities they are prone to.

I know that some AMD CPUs a long time ago had exclusive caches, but for Ryzen, I'm pretty sure that both L1 and L2 are inclusive, and L3 a victim cache.

the inclusivity types don't really play a role in these types of attack (until you get to a very practical stage where this might matter), not least because there are other sidechannels that can be exploited. an inclusive llc is just convenient.

that said, some newer Intel LLCs are non-inclusive, and amd changed its cache relations as well in ryzen

The focus is on Intel partially because it's the most valuable target, so we might be looking at selection bias.

That's really not the case. Once you know of an attack of this sort it's pretty easy to test everyone's chips for it. Basically every Intel chip since Nehalem is vulnerable to Meltdown, as are IBM's POWER chips, as is one out-of-order ARM core but not any of the others. And we know that AMD chips are safe from that vulnerability.

Whether you run security checks in sequence or in parallel with a data access is pretty fundamental to the design of a core. Doing the latter has performance advantages, but it's really hard to verify that there won't be any architectural leaks as a result, let alone these micro-architectural leaks that nobody was thinking about.

I agree with everything you say, but security researchers will still prefer searching for new vulnerabilities in Intel (which is hard, I hope we agree on that) and verify they don't work on AMD, rather than the other way around.

We were talking about alternative CPU designs and a guy commented on how much of the volume of a CPU is memory these days.

I wondered aloud if it wouldn’t be better for us to embrace NUMA and make the bigger caches directly addressable as working memory instead of using them as cache.

I’ve heard somewhere that some or all versions of Intel Atom are immune to both Spectre and Meltdown attacks. The Bonnell architecture, being just a supercharged 486, has none of those fancy features that are being exploited. Not sure about Foreshadow, since HT is present on Atom processors.

No out-of-order speculative execution, no bugs of this exact sort, except maybe you'll find something in the odd corners of special features. As far as I know (don't follow them much) Atoms are based on the Pentium super-scalar architecture, which is a "supercharged 486" that allows the CPU to do more than one operation in parallel on different execution engines.

Per Wikipedia, "The Pentium has two datapaths (pipelines) that allow it to complete two instructions per clock cycle in many cases. The main pipe (U) can handle any instruction, while the other (V) can handle the most common simple instructions." (https://en.wikipedia.org/wiki/P5_(microarchitecture) )

HT uses two instruction decoders to keep a set of execution engines busier. Suppose you had the above set of two execution datapaths (which may be improved on these Atoms) and couldn't do two operations at a time from one instruction stream; the other stream could then use the unused datapath.

> The safest bet now for the best security is probably to stick to in-order CPUs

out-of-order execution != speculative execution.

It is possible to have OoO without speculative execution. On the other hand they do tend to come as a pair since they both utilise multiple execution units, for instance the Intel Pentium in 1993 was the first x86 to have OoO or branch prediction (486 and those before were scalar CPUs).

> It is possible to have OoO without speculative execution

That's technically correct (the best type of correct!), but without speculation, the CPU can't see beyond a branch, so the window in which the machine can reorder instructions is very small. (For x86 code, the usual rule of thumb is that branches occur once per 5 instructions.) So in practical terms, an OoO design (in the restricted-dataflow paradigm, i.e., all modern general-purpose OoO cores) always at least does branch prediction, to keep the machine adequately full of ops.
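A quick back-of-the-envelope version of that rule of thumb, as a Python sketch. The window sizes and the 95% prediction accuracy are assumptions picked for illustration, and treating predictions as independent is a simplification.

```python
def branches_in_flight(window_size, instrs_per_branch=5):
    """Average number of unresolved branches a full window has to look past."""
    return window_size / instrs_per_branch

def chance_window_is_all_useful(window_size, accuracy, instrs_per_branch=5):
    """Probability that every branch in the window was predicted correctly,
    assuming independent predictions (a simplification)."""
    return accuracy ** branches_in_flight(window_size, instrs_per_branch)

# 5 is roughly the window with no speculation at all; 50 and 200 are
# hypothetical reorder-window sizes, not figures for any specific core.
for window in (5, 50, 200):
    print(window,
          branches_in_flight(window),
          round(chance_window_is_all_useful(window, accuracy=0.95), 3))
```

The point is just the scaling: a window of a couple hundred ops spans dozens of branches, so a core that refuses to predict any of them is limited to reordering within about one basic block.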

An interesting question, though, given that we're going to do speculation, is what types of speculation the core can do. The practical minimum is just branch prediction, but modern cores do all sorts of other speculation in memory disambiguation, instruction scheduling, and the like, some of which has enabled real side-channel attacks (e.g., Meltdown).

Also, FWIW, Pentium was 2-way superscalar, but not OoO in the fully general dataflow-driven sense; the Pentium Pro (aka P6), in 1995, was the first OoO. (Superscalar can technically be seen as OoO, I suppose, in that the machine checks dependencies and can execute independent instructions simultaneously, but it's not usually considered as such, as the "reordering" window is just 2 instructions.)

OoO without speculation is completely pointless: the typical reorder window is orders of magnitude larger than the average number of instructions between conditional jumps. I don't think there has ever been an OoO CPU without speculation.

OTOH speculation without OoO is not only possible but in fact very common. For example, the majority of non-ancient in-order CPUs.

In fact the original Pentium, contrary to your statement, was an in-order design (the Pentium Pro was the first Intel OoO design). Also I believe that the 486 already had branch prediction. The Pentium's claim to fame was being superscalar (i.e., it could execute, in some cases, two instructions per cycle), with a better FPU and a better pipelined ALU (I think it had a fully pipelined multiply, for example).

> I don't think there have ever been an OoO CPU without speculation.

To the extent I've looked at it, without reading original documents, the original OoO design that current systems are based on, the IBM System 360/Models 91 and 95's floating point unit using Tomasulo's algorithm https://en.wikipedia.org/wiki/Tomasulo_algorithm didn't extend to speculative execution.

No doubt because gates were dear, implemented with discrete transistors, and that processor was a vanity project of Tom Watson Jr's. And memory was slow, core except for NASA's two 95's with 2 MiB of thin film memory at the bottom, and cache was still a couple of years out, introduced with the 360/Model 85. OoO becomes compelling when you have a fast and unpredictable memory hierarchy, as we started to have in a big way in the 1990s when we returned to this technique.

Interesting. It seems that the machine did have some limited form of branch prediction, but probably the expectation was that FP kernels would be optimized to be mostly branch free, and, as you say, transistors were a premium.

> Also I believe that 486 already had branch prediction

I can't find any evidence to support that [1]

I suppose it's technically possible to have branch prediction on a scalar processor, but I imagine it would not be hugely beneficial.


So it seems that the 486 had a trivial not-taken predictor; but that's still different from stalling on each conditional branch and does require checkpointing and rollback on misprediction (although with a pipeline only 5 deep that's probably also not very complex).

Edit: the Pentium did have a significantly more sophisticated predictor of course, although not without flaws.
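The difference between stalling, a static not-taken guess, and a dynamic predictor can be sketched by counting penalty cycles over a branch trace. This is a minimal Python sketch: the 2-cycle penalty and the loop trace are assumptions, and the two-bit saturating counter is the generic textbook scheme, not a claim about what the 486 or Pentium actually implemented.

```python
def penalty_cycles(outcomes, policy, penalty=2):
    """Total cycles lost on a sequence of branch outcomes (True = taken)."""
    state = 1        # two-bit counter, 0..3; predicts taken when state >= 2
    cycles = 0
    for taken in outcomes:
        if policy == "stall":
            wrong = True                     # always wait for the branch to resolve
        elif policy == "not_taken":
            wrong = taken                    # only taken branches need a flush
        elif policy == "two_bit":
            wrong = (state >= 2) != taken
            state = min(state + 1, 3) if taken else max(state - 1, 0)
        else:
            raise ValueError(policy)
        if wrong:
            cycles += penalty
    return cycles

# Hypothetical loop-closing branch: taken 9 times, then falls through once.
loop_branch = [True] * 9 + [False]
for policy in ("stall", "not_taken", "two_bit"):
    print(policy, penalty_cycles(loop_branch, policy))
```

On a backward loop branch like this, the static not-taken guess is barely better than stalling, while on mostly-fall-through branches it would be nearly free. That is roughly why even a trivial predictor counts as speculation and why later designs moved to dynamic schemes.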

Can we say that, without speculation and caching, and just throwing more and more CPUs at the work, we would have slower single-app performance but better parallel execution?

Sure -- for a good example, GPUs go partway there by not speculating (they still have cache hierarchies though). It works because GPU workloads have massive data parallelism, so while one group of threads (a "warp") is stalled waiting for data, the cores can just execute other threads. Sun/Oracle had built a number of Sparc chips along this line too, e.g. the Niagara (Sun UltraSPARC T1) tolerates memory latency by having a bunch of SMT threads (8 per core, IIRC?) rather than OoO scheduling.
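The latency-hiding argument can be put into a small formula. The Python sketch below assumes each thread alternates between a fixed burst of compute and a fixed memory stall, with free switching between threads; the 10-cycle and 70-cycle figures are made up for illustration and are not Niagara or GPU numbers.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles the core does useful work when each thread computes
    for compute_cycles, then waits stall_cycles on memory, and the core can
    switch to any ready thread at no cost (idealized)."""
    busy = threads * compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, busy)

# Hypothetical workload: 10 cycles of work per 70-cycle memory stall.
for n in (1, 2, 4, 8):
    print(n, core_utilization(n, compute_cycles=10, stall_cycles=70))
```

With these assumed numbers, one thread keeps the core busy an eighth of the time and eight threads saturate it, with no speculation needed. Each individual thread is no faster, though.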

The problem is that single-thread performance is really important for a lot of workloads, because (i) parallelization is hard, (ii) even for parallelized workloads, serial bottlenecks (critical sections, etc.) still exist, and (iii) latency is often important too (one web request on one core of a server, or compiling one straggler extra-large file in a parallel build, for example).
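Point (ii) is essentially Amdahl's law. A minimal Python sketch, where the 90% parallel fraction is just an example value:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: overall speedup when only parallel_fraction of the
    work can be spread across cores and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Example: 90% of the work parallelizes, 10% is a serial bottleneck.
for cores in (1, 10, 100, 1000):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

However many cores are added, the speedup in this example stays below 10x, so the speed of the one core running the serial part still sets the ceiling.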

Really, different trust domains cannot seriously benefit from a common cache: their datasets are by definition disjoint.

So it would be safest to execute them on separate CPUs not sharing a common cache, e.g. pinning them to different CPU sockets on a multi-socket machine, or to different physical machines altogether.
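For the pinning variant, here is a rough Python sketch. It assumes Linux, where `os.sched_setaffinity` is available, and it assumes you already have a CPU-to-socket map from somewhere (the one below is hypothetical; on a real box it would have to come from the system's topology information). The helper names are made up.

```python
import os

def cpus_by_socket(cpu_to_socket):
    """Group logical CPU ids by the socket (package) they belong to."""
    groups = {}
    for cpu, socket in sorted(cpu_to_socket.items()):
        groups.setdefault(socket, set()).add(cpu)
    return groups

def pin_to_socket(pid, socket, cpu_to_socket):
    """Restrict pid to the CPUs of one socket; returns the CPU set used."""
    cpus = cpus_by_socket(cpu_to_socket)[socket]
    if hasattr(os, "sched_setaffinity"):     # Linux-only API
        os.sched_setaffinity(pid, cpus)
    return cpus

# Hypothetical 2-socket machine with 2 logical CPUs per socket.
topology = {0: 0, 1: 0, 2: 1, 3: 1}
print(cpus_by_socket(topology))
# e.g. pin_to_socket(trusted_pid, 0, topology) and
#      pin_to_socket(untrusted_pid, 1, topology)
```

Affinity only controls where a process is scheduled. It does not by itself stop other processes from landing on the same socket, so this would need to be combined with whatever keeps the rest of the system off those CPUs.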

This may be still faster than running on old ARMs.

I wonder if dedicated cloud boxes, where all VMs you spin up on a particular physical machine belong only to your account, will become available from major cloud providers any time soon. In such a setup, you don't need all the ZombieLoad / Meltdown / Spectre mitigations if you trust (have written) all the code you're running, so you can run faster.

Scaleway is one such provider. They call it not virtualization but physicalisation.

I should add that AWS has "dedicated instances" which are exclusive to one customer. They are more expensive than standard instances but they are used by e.g. the better financial services companies for handling customer data.

> different trust domain cannot seriously benefit from a common cache: their datasets are by definition disjoint.

Datasets, yes, but what about instructions from shared libraries?

I'll make a wild guess this being seriously beneficial would be limited to uncommon server configurations....

Let's dissect.

- Same OS, server: you likely control all the processes in it anyway, no untrusted code.

- Different OSes in different VMs, server: they likely run different versions of Linux kernel, libc, and shared libraries anyway. They don't share the page caches for code they load from disk.

- Browser with multiple tabs, consumer device: likely they might share common JS library code on the browser cache level. The untrusted code must not be native anyway, and won't load native shared libraries from disk.

- Running untrusted native code on a consumer device: well, all bets are off if you run it under your own account; loading another copy of msvcrt.dll code is inconsequential compared to the threat of it being a trojan or a virus. If you fire up a VM, see above.

There's also the same-OS case with operating-system-level virtualization, like Docker; it's the less expensive default for https://www.ramnode.com/vps.php, for example. I'm not at all familiar with this technology, but from scanning Wikipedia: if you used Docker, would the libcontainer library be shared between container instances?

Likely it would be. But really Docker is more about convenience of deployment and (much) less about security. I would not run seriously untrusted code in merely a (Docker) container; I don't know much about the isolation guarantees of OpenVZ.

In any case, containers share the OS kernel, OS page cache, etc. This can be beneficial even for shared hosting, as a way to offer a wide range of preinstalled software ro-mounted into the container's file tree. Likely code pages of software started this way would also be shared.

Reading about speculative execution exploits makes me wonder whether it would be useful to have a speculation register, and an instruction prefix that claims one bit of that register, much as a mutex, while the prefixed instruction is being speculated about. Assuming a compiler intelligent enough to key a bounds check and later array access to the same speculation bit, or a language exposing low-level functionality to the programmer, Spectre could be defeated with minimal performance impact. Perhaps it could also be used to work around many other sorts of speculative execution flaws that might turn up in the future.

In-order does not mean no speculative execution

Yes, but it typically means a significantly smaller speculative window -- rather than a ROB that can fill with several hundred instructions beyond a mispredicted branch, there is just the pipeline depth from fetch to commit worth of speculative work.
