Intel vulnerabilities costing 25% CPU performance loss to a cloud provider (twitter.com/waxzce)
332 points by samber on May 17, 2019 | 178 comments

Does anyone actually have some benchmarks of the latest gen AMD vs the latest gen intel processors with all mitigations for spectre, meltdown, and the 10 other sidechannel/speculative execution vulnerabilities applied?

I'd genuinely be curious to find out what the eventual results are, because as I understand it AMD is not too far away from Intel as a standalone processor. Surely in a "real world" scenario they'd be significantly faster?

For Linux 5.0, as of March this year, the performance impact of enabling the Linux kernel mitigations for Spectre/Meltdown on various CPUs with the latest microcode (as of March) is:

Intel: -13% for the Core i9 7980XE, -17% for the 8086K.

AMD: -3% for the 2700X.

Reference: https://www.phoronix.com/scan.php?page=article&item=linux50-...

Phoronix is due to release new benchmarks tomorrow showing full impact from Spectre/Meltdown/L1TF/MDS. There are some initial benchmarks at https://www.phoronix.com/scan.php?page=news_item&px=MDS-Zomb...

Surely the effect depends on what you're running, and you can't put a single number on it. There are actually somewhat contradictory results I've seen for HPC-type applications, and no useful analysis of them with low level profiling.

For HPC-type applications you're probably not running code from multiple trust domains on one CPU, in which case you might not need the mitigations at all
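For anyone weighing that trade-off on a Linux box, here is a minimal sketch of listing what the running kernel reports for each known issue. It assumes a kernel recent enough to expose the sysfs directory `/sys/devices/system/cpu/vulnerabilities`; the function name `read_mitigation_status` is just an illustrative helper, not part of any tool mentioned in this thread.

```python
from pathlib import Path

# Standard Linux sysfs location for per-vulnerability status files.
SYSFS_VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(vuln_dir=SYSFS_VULN_DIR):
    """Return {vulnerability_name: status_line} for every file in vuln_dir."""
    status = {}
    vuln_dir = Path(vuln_dir)
    if not vuln_dir.is_dir():
        # Non-Linux system, or a kernel that does not expose the directory.
        return status
    for entry in sorted(vuln_dir.iterdir()):
        if entry.is_file():
            try:
                status[entry.name] = entry.read_text().strip()
            except OSError:
                status[entry.name] = "unreadable"
    return status


if __name__ == "__main__":
    # Prints one line per entry, e.g. the file name followed by the kernel's status text.
    for name, line in read_mitigation_status().items():
        print(f"{name:20s} {line}")
```

Entries whose status starts with "Vulnerable" have no mitigation active; whether that is acceptable depends on the trust-domain question raised above.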

What about for the newer processors like the 9980XE, 9900K, etc.? I would have assumed that Intel's latest processors have some additional engineering in place to mitigate the spectre/meltdown performance impacts.

> What about for the newer processors like the 9980XE

The 9980XE is Skylake. It's not actually a "new" processor at all. Consumer parts were released in 2015 (Core ix-6xxx) & server parts in 2015 as well (Xeon E3-v5).

In fact the 9980XE itself isn't even a new offering in the HEDT space for Intel, as it's basically a rebrand of the 7980XE. The differences are just a soldered heat-spreader instead of paste & a small clock bump to go along with that. It's +200 MHz turbo & +400 MHz base, complete with a power consumption increase to match.

EDIT: The 9900K (Coffee Lake) does have in-silicon mitigations for Meltdown & L1TF (Foreshadow), though: https://www.anandtech.com/show/13450/intels-new-core-and-xeo...

It is a moving target. A lot of "day 1" patches have absolutely tanked performance only to regain performance later via smarter mitigations. Both Linux and Windows as well as microcode updates have all seen some of the performance loss regained.

And while it would be nice to call any point "the end" and measure then, as of two weeks ago they were still finding additional vulnerabilities and patching them. So if there is an end we aren't there yet.

That all being said, I am also curious on what the results would show. Just very hard to pin down.

To be fair, at least Microsoft explicitly went with finding perf optimizations elsewhere to reduce the impact, not smarter mitigations (though they also eventually did that when they decided to go for retpoline).

On multithreaded stuff AMD is killing it. They were slightly behind on IPC, by about 5%, and the clocks are typically a bit lower.

I'd expect with both machines patched that clock for clock AMD are slightly ahead.

I've got a 2700X at home and it's a monster, and if Zen2 is on the conservative end of the leaks/rumours it's going to be a total killer.

Seems like Threadripper 3 might be delayed until 2020, so no 64c HEDT for a while...

On realistic workload (branchy C++ server code, not SPEC and the other crap that Phoronuts like to report) the AMD CPUs are 50% slower ... Intel could pile on the mitigations and still be way ahead. The reason these mitigations cost Intel so much is because they have these speculation features and AMD just didn't ever have them. At least with the Intel parts you can decide for yourself whether to enable or disable them.

Please provide a single benchmark of any kind to support that. It sure sounds like you're stuck in 2016, though, before Zen happened with its +52% IPC improvement over Excavator. Otherwise basically all clock-for-clock battles show Zen within a single-digit percentage of Intel on IPC. Combined with Intel's clock speed advantage, that does give Intel +10-20% in pure single-thread performance, but that was pre-mitigations. If this vulnerability fix really does cost 25% that'd erase that advantage entirely.

And no, Intel is not unique in speculation capability.

Xeon absolutely crushes EPYC on MySQL Sysbench. The top-of-the-line EPYC provides performance similar to the five-year-old Xeon E5 v3 (Haswell) from Intel, except at high thread counts where that obsolete Intel part also crushes the EPYC. These results are on Percona's own blog and are also backed up by Anandtech's reviews where Skylake Xeons are ~twice as fast as AMD's.

For large codes there is just no getting around the fact that the AMD parts have a weird and highly fragmented cache architecture, very slow memory access, NUMA memory latency even on the same socket, and a frankly broken branch target buffer. Xeon has none of these problems.

You mean this? https://www.percona.com/blog/2018/07/11/amd-epyc-performance...

If by crushing it you mean performed nearly identically, and if by 5 years old you mean the very latest E5 chip Intel offers[1], then yes.

Also, they were using EPYC through a cloud VM provider, which means not only is there the typical overhead of Xen, but also possibly some aggressive Spectre mitigations. Their Haswell machine was likely bare metal.

[1] Because E5 and E7 chips only migrate microarchitecture every few generations. But since Intel's fab headaches they're actually stuck nearly 4 generations behind, rather than the typical 2-3. Only desktop and E3 series chips use the latest microarchitecture.

EDIT: I forgot they changed their naming scheme. The latest 56-thread capable Xeon Platinum chips available last year used Skylake. I was confirming my understanding with https://en.wikipedia.org/wiki/Xeon but didn't see this page: https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microproces...

The E5 v3 is not the latest even under the old naming scheme. It is a chip from 2014. The E5 v4 "Broadwell" came out in 2016. Skylake SP came out in 2017 and Cascade Lake SP is the current SKU. The EPYC performance on that MySQL benchmark is 3 generations behind Intel.

> The EPYC performance on that MySQL benchmark is 3 generations behind Intel.

Intel's performance hasn't changed for 3 generations, either. They've been stagnant for years now with their 10nm roadmap being super late at this point, delaying their architecture updates at the same time.

Definitely wrong. Skylake SP is 50% faster than Broadwell on MySQL Sysbench, and Broadwell was 200% faster than Sandy Bridge. That's almost 500% improvement in 6 years. I haven't got a chance to evaluate Cascade Lake SP yet but I expect the pace of improvement to continue and they have hardware mitigations for Spectre/Meltdown/L1TF.
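As a quick sanity check on how successive percentage gains combine, here is a small sketch. It assumes "N% faster" means a multiplier of 1 + N/100 and takes the two figures quoted in the comment above at face value; `compound_speedup` is a hypothetical helper name.

```python
def compound_speedup(percent_gains):
    """Multiply successive 'N% faster' gains into one overall multiplier."""
    factor = 1.0
    for gain in percent_gains:
        factor *= 1.0 + gain / 100.0
    return factor


if __name__ == "__main__":
    # Figures from the comment: Sandy Bridge -> Broadwell (+200%),
    # then Broadwell -> Skylake SP (+50%).
    total = compound_speedup([200, 50])
    print(f"{total:.2f}x overall, i.e. {100 * (total - 1):.0f}% faster")
    # prints "4.50x overall, i.e. 350% faster"
```

Under that reading the two claims compound to 4.5x, or 350% faster, which is some way short of "almost 500%".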

Horseshit. Unless you're comparing different core counts that'd make MySQL the extreme outlier against hundreds of other tests that have shown again and again and again and again that Intel's IPC has been stagnant for years now with OK but minimal frequency improvements.

With numbers like this you really need to link some benchmarks. I can't find anything remotely similar.

The blog post and cloud provider's website both say the servers are bare metal.

I stand corrected. I didn't see it mentioned on the page but assumed because it was a "cloud server" that they were using multi-tenant infrastructure. But apparently for all servers they're bare metal, presumably provisioned with IPMI.

I still disagree with the previous poster's claims and conclusions, but I really should've done my homework better. I tried but clearly not hard enough.

What do you mean by 'large codes'? Also you should provide a link.

I mean the working set of the executable program is much larger than the processor's L1 instruction cache. Virtually all realistic cloud workloads (search, databases, email, etc etc) meet this definition of "large" whereas almost all synthetic benchmark programs are very small. Not surprisingly AMD looks great on the latter and terrible on the former because the microarchitectural resources they've committed to branch prediction and speculation are meager and their weird memory architecture exacerbates the problem.

Nonsense. Epyc's memory latency is only slightly behind Intel's, and every test other than MySQL shows neck & neck IPC across dozens of real-world programs that definitely don't fit in L1.

POV-Ray doesn't fit in L1 and Epyc 7601 trivially bests the Xeon 8176, and does so while using a ton less power. Similarly NAMD, even with AVX & compiled with Intel's ICC, runs waaaay faster on Epyc 7601 than it does on Xeon 8176.

Maybe your workload is exclusively MySQL or looks a lot like it. If so, sure, go buy Intel. But there's a ton of cases where there is nowhere close to such a gap, including large cloud workloads. You keep taking a single benchmark result and massively over-extrapolating it to mean the entire cloud hosting world. That's not how this works.

You seem to be very invested in disparaging AMD, but haven't provided any links to back up your claims, some of which are completely bizarre.

@waxzce: "FYI, as cloud provider we rawly loss around 25% of CPU performances the lasts 18 months due to different CVE and issues on CPU and mitigation limiting capacity using microcode, so we stuff more CPUs, but prices didn't go down at all... That's a kind of upselling. #IntelFail"
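To put the "so we stuff more CPUs" part in numbers, here is a rough sketch of the capacity arithmetic. The 25% loss is the figure from the tweet; the fleet size of 300 is a made-up illustration, and `cpus_needed` is a hypothetical helper.

```python
import math


def cpus_needed(baseline_cpus, perf_loss_fraction):
    """CPUs required to keep the same total throughput after a per-CPU loss."""
    remaining = 1.0 - perf_loss_fraction
    if not 0.0 < remaining <= 1.0:
        raise ValueError("perf_loss_fraction must be in [0, 1)")
    return math.ceil(baseline_cpus / remaining)


if __name__ == "__main__":
    loss = 0.25       # figure quoted in the tweet
    fleet = 300       # hypothetical fleet size, for illustration only
    extra = 1.0 / (1.0 - loss) - 1.0
    print(f"{cpus_needed(fleet, loss)} CPUs instead of {fleet} "
          f"({extra:.1%} more hardware for the same capacity)")
```

The point of the sketch is that a 25% per-CPU loss needs about a third more hardware to compensate, not a quarter more.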

Is it #IntelFail if Intel had supply issues in last year? It doesn't look like demand was affected by much.

Any way you slice it it's an Intel Fail. They've basically had to lower the performance of their existing chips, they've failed on the road map for new chips, and they don't have capacity for more of the existing chips because they thought 14nm would be winding down by now rather than peak production.

> If AMD was in a better shape, there is a real market momentum here.

What does this comment refer to? As far as I know AMD is in pretty good shape.

AMD's Epyc itself is highly competitive, but there's the very slow turn around to get that into actual server offerings from the major companies. And of course just being competitive isn't going to get everyone to start adopting a different platform.

They probably can't meet the demand for server-grade CPUs. But I'm just guessing.

Most of the current AMD offering (APU, GPU, and CPU) is still being fabbed on GF 14nm, which I believe to be very limited in capacity. Once the next generation GPU (Navi) and CPU (Zen 2), along with EPYC 2, move to TSMC 7nm they should be able to flex their muscles a lot more.

I wish they had made the move earlier, like the start of this year, but it is now all set for Q3.

AMD should probably keep production of whatever they think will sell +20% to account for whatever new Intel screwup is revealed.

Pure speculation: judging from the slight delay (1Q) in the EPYC 2 introduction, my guess is that AMD wants to launch and fulfil all the demand of the big cloud providers (AWS, Azure and Google) while still having enough chips for the rest of the market. The worst thing that could happen is AMD announcing EPYC 2 and AWS getting all the stock, with little left for other channels.

I have high expectations but I trust Dr Lisa Su and her executive team. And Intel is currently at a new low in my book.

They're a cloud provider, so cloud offerings. AMD has had some consumer success but remains rare in the data center.

Hetzner offers AMD processors in some of their dedicated servers, but I have no idea if they also run some of their VMs on AMD.

We are doing perf testing of the AMD versions of HP/Dell servers. We may tech refresh about 40k servers with epyc-2 if performance is better than Intel. It's about time for some new servers anyway.

It seems that they're on track to get EPYC into OEMs and cloud providers in quantity. They've been successfully growing the ecosystem in the server space over the last 2 years (Supermicro, Dell, HPE, Lenovo all pushing EPYC SKUs now) so they're very well poised to capitalize on it if they can keep up with supply. The power efficiency gains on 7nm will be pretty tempting to cloud providers and colocation clients alike. If you can put 256 cores (4 EPYC 2 sockets) into a 1U server and keep it under 1200W that's going to be a landslide win for density and running costs. Factor in PCIe 4.0 and things are looking pretty sexy.

I've read that wafer yields on Zen 2 are 70%+, which gives a huge advantage on cost and production efficiency to AMD. I think Intel's Skylake yield for an equivalent 28 core die is sub-40%. If anything the limiting factor is going to be TSMC's ability to give AMD capacity on its very crowded and in-demand 7nm fab.

Dell and HP already have them in quantity. We just have to perform due diligence to ensure all of our workloads will remain the same or better performance. Thus far, testing is looking good.

Quick question: Why put all of your eggs in one basket? I know you want to maximize profits, but you also want to minimize risk? For all anybody knows the next big CPU CVE could be on AMD architecture, especially as people will now be taking the processor much more seriously.

Our eggs are already all in one basket across two vendors and that did already bite us. Cost wise, it doesn't make sense to load balance across Intel and AMD. We have to pick one, or spares costs are doubled and tooling has more overhead.

In a few years, if Intel have their act together and AMD start having more vulns, then we may swing back the other direction. Only time and vendor actions will tell. This assumes we even move to AMD.

Have fun replacing all of the CPUs after a few months. There hasn't been an AMD CPU since Opteron that worked correctly on the first stepping. EPYC has a huge number of errata ranging from bit flips to voltage dropouts to hangs and reboots, most of which remain unfixed. I mean "Load Operation May Receive Stale Data From Older Store Operation" is a pretty serious erratum. And AMD has a history of this going back to Barcelona which just flat out didn't work (workarounds for broken Barcelona B3 parts cost 50% performance or more). They don't have the resources to qualify new parts properly and they rely on customers to find all the bugs for them.

What are the possible pitfalls of _not_ running Spectre et al. mitigations, if you're _not_ hosting other people's code, but on your own hardware?

Just guessing here, but suppose an attacker is able to get unprivileged code to run on your machine (by taking over some process). He is now able to extract secrets from other processes on the machine that he ordinarily wouldn't have access to. Think SSL keys etc.

But I believe you're right in that those exploits are not as dangerous in a non-shared hosting scenario.

Quick reminder that a browser with JavaScript is a way to run unprivileged code on your machine.

I'm not even sure code execution is strictly necessary for this style of attacks.

An attacker could carefully craft network packets to force the control flow of existing software to manipulate the CPU state. They could then use the timing differences in network packets to read data out.

Would be painfully slow, but theoretically possible.

A while ago there was a post on here describing exactly that. I can't find it anymore unfortunately (maybe someone else has it?), but judging from my memory of that discussion this should be doable-ish.

But if you're just running a local home system the chances of being targeted with such an unreliable attack are virtually zero.

I don't think this is correct. The timing attacks here all require extremely high resolution timers, and network + I/O latency would obscure the variance entirely.

People are able to crack poor password comparison implementations over a jittery, latency heavy network. It’s possible to get almost arbitrarily high resolution when doing timing side channel attacks, you’ll just need many more samples.
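The "many more samples" point can be illustrated with a toy simulation. This is only a sketch under stated assumptions: noise is modelled as Gaussian, the 1 µs signal and 50 µs jitter are made-up numbers, and `samples_needed` / `estimate_gap` are hypothetical helpers, not an attack tool. The standard error of a difference of two means over n samples each is sigma * sqrt(2 / n), so the required n grows with (sigma / delta) squared.

```python
import math
import random


def samples_needed(delta, sigma, z=3.0):
    """Rough per-candidate sample count so that a timing gap `delta`
    sits `z` standard errors clear of Gaussian noise with std dev `sigma`."""
    return math.ceil(2 * (z * sigma / delta) ** 2)


def estimate_gap(delta, sigma, n, seed=0):
    """Simulate n noisy timings each for a 'slow' and a 'fast' input
    and return the difference of their means."""
    rng = random.Random(seed)
    slow = sum(rng.gauss(delta, sigma) for _ in range(n)) / n
    fast = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
    return slow - fast


if __name__ == "__main__":
    delta_us, jitter_us = 1.0, 50.0  # hypothetical signal and network jitter
    print("samples per candidate:", samples_needed(delta_us, jitter_us))
    for n in (100, 10_000, 200_000):
        # The estimate should settle near delta_us as n grows.
        print(n, round(estimate_gap(delta_us, jitter_us, n), 3))
```

With these made-up numbers the estimate is dominated by noise at 100 samples and settles near the true gap in the hundreds of thousands, which is the sense in which latency makes the attack slower rather than impossible.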

> suppose an attacker is able to get unprivileged code to run on your machine (by taking over some process). He is now able to extract secrets from other processes on the machine that he ordinarily wouldn't have access to. Think SSL keys etc.

I think that's the catch: it's the set of cases where someone can get code execution without being able to directly retrieve the targeted data, which is already pretty low, and where the best mitigation technique is to harden against these attacks rather than use a dedicated service (e.g. if you're using an HSM for crypto or have a dedicated SSL-offload box, you don't have to worry about keys being compromised when someone finds a bug in the much greater exposed surface area of your application).

According to Apple [1], full mitigation requires using the Terminal app to enable an additional CPU instruction and disable hyper-threading processing technology.

So, if you have a system which you cannot expose it is a pretty big damper on performance.

[1] https://support.apple.com/en-us/HT210107

And you also benefit from herd immunity. If 99.9% of people have mitigations, then the likelihood of your difficult timing attack working is quite low... so why bother trying?

None, really.

If you're not sharing your hardware with other code you don't trust, it's not a problem.

Says the person posting from a web browser.

So what? If your machine is exclusively used to build code, or to render 3D animations, then this patch is not mandatory.

Because javascript.

He explicitly says if you don't share hardware with code you don't trust it's fine to disable it. Javascript is not code one trusts...

...with JS disabled.

The list of sites I have whitelisted JS for is extremely short.

So then we agree it's a problem.

What code do you actually trust? (to be incorruptible/unhackable, etc)

I think you're being overly pedantic and ultimately unhelpful and not contributing to the discussion.

When I install software I am implicitly trusting it, whether that software is trustworthy or not. If it is exploitable that doesn't mean I didn't trust the software. On my personal computer, the expectation is that only the software I have implicitly trusted will be running. If I install something that isn't trustworthy but trust it anyway it's game over. If it's exploited, that's a different issue to trusting it.

Shared VMs are a totally different ball game. All of a sudden I am sharing resources with a potentially bad actor. Whether or not I trust them is kind of irrelevant; I don't get to choose who I colocate with, and ultimately that's the difference.

So you've disabled JavaScript, right?

… or are running that JavaScript in Chrome, Firefox, Safari, Edge, or Internet Explorer which all shipped updates last year.

Being pedantic about this is not helping anyone actually be more secure: the odds are orders of magnitude greater that someone will be compromised by a different security problem than by the diminishing returns trying this attack against a current browser, and that the attacker will directly obtain the data they want without needing a side-channel attack.

Not an expert, but if you are running absolutely no untrusted code (which is hard depending on your definition of untrusted), then the risks are low.

However, when a vulnerability is found, then spectre etc. make it easier to abuse that vulnerability to do something useful.

Not much. We've yet to see any live Spectre exploits (that I know of), and I'd be surprised if the first one is a remote exploit.

Good point... I wonder why? The proof of concept code works so well for some variants.

Because most channels for delivering untrusted code have already been fixed. There's really only 2 channels for delivering code these days to attack consumers - app stores & web browsers. Web browsers had crude mitigations immediately (disable SharedArrayBuffer), and have improved since then (full site isolation https://v8.dev/blog/spectre )

As for app stores there's typically easier malware attack vectors if you manage to get an app installed & past the scanners in place with that typical back & forth escalation game. And those run in process-level sandboxes with clamped down syscalls. You can't exploit a spectre BPF attack if you can't use BPF at all due to selinux rules, for example.

Moving past consumer what else are you really going to attack? Sure you could rent some virtual hosting somewhere and hope you get lucky attacking a neighboring VM, but what's really your end goal there to justify the time & risk? It's not like you can reliably pick a target with that, you just have to go fishing and react to what you may or may not find.

I can see how a local Meltdown exploit would be useful but what would be the point of a local Spectre exploit?

Don't get married to Intel, if you need big beefy xen or kvm hypervisor machines, there's lots of good EPYC based motherboards.

I am planning on making a home-server. Any suggestions for EPYC boards?

I just built my first AMD machine since the Athlon days with an ASRock Rack EPYCD8-2T [0] motherboard and an AMD EPYC 7281 CPU [1].

I upgraded a VMWare ESXi server hosting FreeNAS from a consumer motherboard with an Intel Core i7 3930K and it was completely painless.

[0]: https://www.asrockrack.com/general/productdetail.asp?Model=E...

[1]: https://www.amd.com/en/products/cpu/amd-epyc-7281

I've used a lot of ASUS desktop boards and have never been disappointed. However I have heard that they stop pushing BIOS updates for older boards. Have you faced such an issue?

My current HP home-server has been with me for ~10 years.

If you are running only open source software, you may want to consider this:


Starting at $5800? For 5800 I can build one hell of a threadripper or epyc system with nvme storage.

I've been seeing people try to make POWER architecture (IBM) servers a thing for 12+ years now. It never happens, because ordinary developers can't afford them, compared to what you can put under your desk for a thousand bucks based on any Intel or AMD amd64 architecture system.

Or this? https://www.raptorcs.com/content/TL2B01/intro.html

$2700 and probably is outperformed by a $150 ryzen cpu on a $110 motherboard.

"Starting at $5800? For 5800 I can build one hell of a threadripper or epyc system with nvme storage."

Except that it won't have open firmware on a CPU hardly anyone targets. Whatever you build will probably be vulnerable. Unless you're running OpenBSD or something.

I guess something being such terrible value nobody uses it is a sort of advantage...

Less used != terrible value.


In case you are curious to see benchmarks.

Depending on what you're using it for, I would take a look at Threadripper too. There are some crazy deals right now. If you don't need IPMI and registered ECC, it's a great option for a ton of cores and PCIe lanes.

Threadripper and Ryzen both have ECC support.

'Registered ECC' is just slightly different than ECC memory, but it's definitely a feature of EPYC, and unsupported on Threadripper and Ryzen.

I cannot find anything at all that supports such a claim. Do you have any links or supporting spec sheets? All I can find are the same "Supports ECC: Yes" listing for Epyc, Threadripper, and Ryzen. The only spec differences I can find for Epyc vs. Threadripper on the memory side of things is 8-channel vs. 4-channel.

Look at specs for any X399 board and they will say something like "Supports Quad Channel DDR4 3600+(OC) & ECC UDIMM Memory" or "8 x DIMM, Max. 128GB, DDR4 3600(O.C.)... MHz ECC and non-ECC, Un-buffered Memory". Both omit buffered/registered DIMMs.

AM4 boards have similar verbiage.

Here's a document from AMD about EPYC: https://developer.amd.com/wp-content/resources/56301_1.0.pdf

They do not mention unregistered DIMMs being supported.

I'm setting up a home server and decided to just go with Ryzen. You get a bunch of server features (ECC RAM, lots of PCIE lanes, and lots of cores) for a lot less cash than the equivalent EPYC build.

Obviously EPYC has a place but for the home usecase you could use a Ryzen as a substitution for Intel Xeons because of the baseline features of Ryzen.

Ryzen only has 20 PCIe lanes, which is not that much. Basically the same as Intel, since Intel uses DMI 3.0 (basically PCIe x4) for communication to the chipset. So on Ryzen you get 16 lanes directly to the CPU + 4 to the chipset, where it multiplexes, and on Intel you get 16 lanes directly to the CPU + DMI 3.0 to the chipset, which multiplexes to PCIe.

Threadripper has 64 lanes, though, which is definitely a lot.

Ryzen has 24 PCIe lanes: 16 for graphics (can be split to 8/8), 4 for NVMe storage, and 4 for the chipset. So you get 4 more lanes compared to Intel's consumer platforms.

Be aware of the fun with numa nodes and AMD. https://www.reddit.com/r/Amd/comments/9ngwzf/amd_epyc_on_esx...

These will be made irrelevant in the immediate future due to Zen2's dedicated I/O chiplet.

Do you need Ryzen Pro for ECC?

You only need your motherboard to support ECC (not sure if they all do), that's the only prerequisite.

They don't all support ECC. ASRock seems to be the most-supported in this category.

Nope. Regular Ryzen supports unbuffered ECC modules.

What about registered (buffered) ECC modules?

Nope, Ryzen and TR don't. Epyc does.


Unbuffered ECC modules can be hard to find, though.

Intel has been proven unreliable and untrustworthy time and time again, and people have been complaining about how awful and unscalable x86 is since the 90s.

The future is more likely ARM- I know a few companies are experimenting with moving their internal servers to ARM, and AWS started offering that platform a few months ago.

But I'm not sure how long this takes, people have also been wrongly predicting the death of x86 since the 90s.

This is an argument for organisations to go back to being on-prem, running their non-public facing workloads privately.

How's that an improvement over using 'Dedicated Instances' or the 'Bare Metal' offerings which exist with cloud providers, where you're not colocated with other entities but still enjoy all the benefits of how the infrastructure is managed, flexibility in upscale/downscale (instead of acquiring capital assets which depreciate over 36-48 months).

External access to the bare metal is still possible and thus information can be leaked via Meltdown and Spectre

Is this a chance for ARM providers to move in and outcompete Intel, especially if they can provide similar tech but with security? Basically an ARMs race?

You can't stop this class of vulnerabilities without sacrificing performance because the reason for the performance is also the reason for the vulnerabilities. That's the choice you have, and no brand affiliation will save you.

You can't stop all speculative leaking without sacrificing a large amount of performance. But you can stop all leaks across security barriers with a very modest performance cost. This still leaves web browsers with a hard task in terms of isolating Javascript but it solves the issue for, e.g., web hosting services like the OP.

It's not about brand affiliation, AMD and Intel have vastly different implementations

Which didn't save AMD from Spectre bugs. And everyone else with high performance speculative execution out-of-order designs also had Meltdown bugs, ARM and IBM, both POWER and mainframe/Z: https://en.wikipedia.org/wiki/Meltdown_(security_vulnerabili...

Yet, AMD looks way better considering all vulnerabilities reported so far. Why downplay this fact?

Not only AMD. Also POWER, ARM. I don't know about RISC-V or MIPS.

We heard the very same argument about Microsoft Windows back in the 90s and 00s. "Linux [1] just wasn't well tested." It is purely a hypothesis, without any proof whatsoever.

Microsoft lost its dominance. Intel not yet, but this could lead to that. The question with losing dominance (e.g. being monopolist or market leader) is always a question of when, not if. And if it happens some people will lose a lot of money [2]. There's going to be a time where the USA is no longer #1 world power. The question isn't if, it is when. History teaches us this much.

[1] (Whatever that means.)

[2] (Or "money".)

Please edit your comment to clarify what Microsoft lost its dominance in.

Should be obvious. Browser penetration (MSIE was once the most popular web browser), and therefore they couldn't monetize the smartphone market via their desktop dominance.

Do you think a greater or lesser percentage of people use a Microsoft operating system over the time period mentioned? What about web browsers?

They certainly still have vice grips on the word processing, OS, and IDE industries. They lost their grip on the browser industry. I'm not sure which one he's referring to.

You were responding to someone who referred to “Microsoft Windows”, so you will find https://en.wikipedia.org/wiki/Usage_share_of_operating_syste... educational. Even with the attempt to broaden it, your first sentence is only true if you add a qualifier like “in Enterprise IT environments” — Office is the area where they're still most dominant but even there has seen a big shift away from tasks which would have been done in Office in the late-90s/early-2000s but are now using web / mobile apps.

In the end of 90s and start of 00s Microsoft was the dominant OS on graphical clients (GUI). Before that, it was [in no particular order]: CLI (including DOS), UNIX (X11 such as SGI IRIX or SunOS), nothing (pen & paper), and some fragmentation/diversity in SOHO (such as Amiga).

Fast-forward to the end of the 10s and the dominant client is the web browser stack, where Microsoft is barely relevant. That Microsoft Windows and Microsoft Office are still dominant is hardly important, as the market trend is that these markets themselves are less relevant. If not merely for the fact that vendor lock-in has shifted, in favor of the other 4 in the FAANG stack.

I keep wondering if...

1) AMD knew about the potential vulnerabilities that would have emerged by making the CPUs faster "Intel-style" and therefore said something like "no we won't take those risks", or...

2) if it was just by pure luck, or...

3) if for them that option wasn't technically feasible, or...

4) if maybe they did not have as "good" (relatively speaking) engineers as Intel.

Personally I don't think that it could be #1 as I think that the pressure to make the CPU faster (and therefore being able to better compete with Intel) would have been a lot higher than to say "nah, this might hit us in the future so let's opt for the safe variant".

It was #1. The potential for these side-channel attacks has been known for well over 15 years, ever since RSA was cracked with a microphone. From an engineering standpoint it was obviously risky to speculate across security domains, which is why AMD consistently abstained from doing so. Intel consistently speculates across security domains. Neither of these is a coincidence.[1]

Intel took a gamble and lost. Well, they lost from an honest engineering standpoint. Still unclear if they lost anything from a market perspective, and we may never know because they're losing even bigger in terms of fabrication, which will obscure the effect of their horrible security fumbles.

[1] You don't need to know the details of an exploit to avoid a vulnerability. You just need to know what you don't know, and for side-channel attacks it's relatively easy to know what we don't know--any calculation where a timing, power, etc differential is even indirectly visible (at any level of precision) is suspect. Sometimes you have to take the risk, but some risks (i.e. speculating across security domains) are just too great, especially in an environment as sensitive and security critical as a CPU. Intel engineers knew. They couldn't have not known; they're hardly incompetent.

AMD was also measurably slower to begin with, in terms of per-core performance. Not just by some small number; depending on the benchmark the difference can hit as high as 40% on the latest architectures. What they lack in single-core perf they make up for in a vastly superior multi-core architecture.

Also, something everyone seems to forget: All of these cloud providers are running Skylake chips, which were launched in 2015. Google Cloud in particular will even give you chips OLDER than Skylake by default if you don't specifically request Skylake. Even assuming instances like an AWS m5a are running on Zen 1, not Zen 2, that's a 2017 architecture.

So you're presented with all these graphs that say "AMD only lost 5%, Intel lost 25%, fuck Intel" but the reality is that Intel was previously far faster than AMD, and they're not even fabbing their best designs for the data center. Intel definitely had more vulnerabilities and they WERE hit harder, but it's more nuanced than just blindly wondering why more cloud providers aren't making a fleet-wide switch to AMD.

> AMD was also measurably slower to begin with, in terms of per-core performance.

Not so surprising now that we know Intel merely skipped a lot of checks to get there though. I mean a lot of the vulnerabilities impacting Intel only have been the likes of "when predicting, the processor does it even if it shouldn't to avoid the delay from checking".

Is it really fair then to say that AMD doing it the proper way was slower, or that the loss of performance of Intel can't be used as a way to congratulate AMD on being safer on that front?

It would be fair to say that if AMD had the features but implemented in a way that wasn't vulnerable. That, however, is not the case. AMD simply lacks the features.

> depending on the benchmark the difference can hit as high as 40% on the latest architectures

That's a technically true but also misleading statement. It is exclusively on heavy AVX loads that any such delta appears. If you're not using AVX, then the single-thread differences are ~15% or less. So Intel losing 25% does now potentially put them behind on most single-threaded workloads.

Does AMD do any speculative execution?

My understanding from a Google blog that talked about it indicated that Google felt like almost any speculative execution... is a risk. So while there might not be someone exploiting it now, they considered the practice potentially an issue. Accordingly future performance will possibly still degrade later as further changes are possibly needed?

As a friend and coworker liked to say, "speed kills", and Google is right that it's a risk, just one a lot of people are willing to take.

Every company with high performance single thread designs has out-of-order with speculative execution designs, ARM, AMD, IBM POWER and mainframe/Z, Intel, MIPS, and SPARC. RISC-V is working on it, with the Berkeley Out-of-Order Machine (BOOM) and perhaps other cores.

And doing research I should have done a long time ago, SPARC V9 has Spectre vulnerabilities (https://www.zdnet.com/article/meltdown-spectre-oracles-criti...), and MIPS made a statement that a couple of their designs were possibly vulnerable (https://www.mips.com/blog/mips-response-on-speculative-execu... and this followup assumes that: https://www.mips.com/forums/topic/mips-mitigations-for-side-...)

Yes. All modern processors do. That’s why they’re all vulnerable to the Spectre attack, which is a fundamental consequence of speculative execution.

The other vulnerabilities: Meltdown, Fallout, the recent MDS attacks (RIDL, ZombieLoad, Store to Leak forwarding), are Intel specific because they’re caused by the way Intel specifically chose to skip/defer security enforcement checks in parts of their implementation.

Other companies didn’t do this.

> Other companies didn’t do this.

ARM, and IBM POWER and mainframe/Z also did that (Meltdown): https://en.wikipedia.org/wiki/Meltdown_(security_vulnerabili... ARM also has a Rogue System Register Read vulnerable design (https://developer.arm.com/support/arm-security-updates/specu...).

Even more accurately, only the Cortex-A75 was vulnerable to Meltdown. It wasn’t pervasive to their architectural designs the way it was to Intel.

It's pretty clear it wasn't pervasive for ARM because they're climbing to ever higher performance, while Intel reached a conceptual peak in 1995 with the Pentium Pro, and in turn has used mostly Pentium superscalar based cores in their lower performance Atom line. It's implied by the dates of vulnerabilities cited by Google Project Zero I think that Intel's Meltdown original sin goes all the way back to the first Pentium Pro.

Because that logic is the same as that of the drunk looking for his car keys under the street light, rather than in the dark area where he lost them.

AMD's server market share is minuscule and dropped as of 19Q1, 3.2% to 2.9%, although healthier and growing smartly in desktops and notebooks, up to 17.1% and 13.1% as of last quarter (I would guess the two are related if the comments made in this discussion about their being fab capacity limited are correct). That means they're less significant targets for researchers than Intel and ARM. We might also assume AMD has less manpower to devote to finding vulnerabilities than Intel has.

These latest Intel vulnerabilities, Foreshadow/L1TF and this week's? They're all targeting Intel specific details, for example the first Foreshadow version targets the SGX enclave. See also ARM's Cortex-A72 Rogue System Register Read (RSRE), Spectre Variant 3a vulnerability: https://developer.arm.com/support/arm-security-updates/specu...

The odds that AMD specific features have vulnerabilities that simply haven't been looked for yet are very high; their Spectre vulnerabilities show that they too generally got caught with their pants down.

ARM is more a case of "less examined" than "less vulnerable". Most of their cores aren't speculative execution out-of-order, that's expensive in power and silicon real estate/cost, but they have 13 vulnerable to Spectre, including 1 known for the 2018 Rogue System Register Read which like this latest set for Intel is design specific. It would be wise to assume there are more vulnerabilities hiding, the name Spectre was chosen because we'll be haunted by it for a long time. Also one of their newest designs is vulnerable to Meltdown, see the whole list here: https://developer.arm.com/support/arm-security-updates/specu...

If they want to compete with Intel and AMD on performance, they will have to include the same speculative out-of-order architecture. These bugs are actually not bugs in the sense that the hardware does something it shouldn't do. It's 100% to spec. The problem is the spec. AMD has fewer problems, but I'm not so sure this is only because they are more secure. I think there are more eyes looking for exploits for Intel CPUs than for AMD.

I think it's more likely that AMD will pick up the slack.

Now how much is this 25% in terms of CO2 emissions?

About 25%.

It's a little funny to see the "use AMD!" comments --- since from what I understand, Intel's optimisations that lead to these side-channels are specifically for performance, so using AMD instead of Intel might mean the same amount of performance loss.

Not at all. AMD does a hardware check for memory accesses in serial with reading the data rather than in parallel. This probably adds an FO4 or so of latency to the critical path of a read which could mean a lot of things from a design standpoint. Maybe you reduce your L1 cache size or associativity or such to make back that FO4. Maybe you just say "All our stages are going to be 15 FO4s instead of 14" and you clock 5% slower but your pipeline gets shorter and you have more room for cleverness in other places. I'd be very surprised if the net performance impact was more than 1%.

By contrast having to deal with the issue in software or worse, by turning off SMT, is much more damaging and gets you the 10s of percents of performance impact that people are talking about.
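For a rough feel of the "15 FO4s instead of 14" trade-off, here is a back-of-the-envelope sketch. It assumes, as a simplification of mine, that cycle time is simply proportional to the FO4 depth of a pipeline stage; the function name is made up, and real designs have latch overhead, wire delay and so on that this ignores.

```python
def relative_clock(fo4_before: float, fo4_after: float) -> float:
    """Clock frequency after the change, relative to before.

    Simplifying assumption: cycle time scales linearly with the
    FO4 depth of a stage, so frequency scales with its inverse.
    """
    if fo4_before <= 0 or fo4_after <= 0:
        raise ValueError("FO4 depths must be positive")
    return fo4_before / fo4_after


ratio = relative_clock(14, 15)
print(f"clock changes by {(ratio - 1) * 100:.1f}%")  # prints "clock changes by -6.7%"
```

Under that naive model the clock drops by a single-digit percentage, the same ballpark as the figure quoted above, and the claim is that the shorter pipeline and the freed-up design room win most of it back.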

Current AMD offerings are generally beating Intel, and in the price-to-performance game it's a complete slaughter.

Alternate perspective: AMD not doing these risky optimizations reflects a better engineering culture, or at least less risky priorities.

Go with Intel and risk losing a huge chunk of your performance some day, or go with AMD and know what to expect. Different people and companies have different risk tolerances, so it's not an obvious choice one way or the other.

Might have been the case during AMD's Bulldozer era. There, they were still doing well by surviving with a massive fab disadvantage.

From Zen thereon, they've been more or less on par in single threaded performance, and have the performance edge on multicore performance, while being cheaper and unaffected by most of these new CPU vulnerabilities, on top of that.

AMD was always better with bang for buck. Intel just had the fastest CPUs for the rich who need the best single-core performance. If Intel doesn't have the fastest CPUs then AMD is just hands down cheaper and Intel has no advantage.

They have completely different architectures, some optimizations can be done in safer ways.

Isn't one of the "secrets" to AMD's post-486 success adopting the general approach the Pentium Pro took along with others in the 1990s, which in turn are based on IBM's 1960s Tomasulo out-of-order algorithm adding speculative execution? This general approach made all out-of-order speculative designs, including ARM, POWER and IBM Z vulnerable to Spectre, so I submit they're not "completely different".

AMD is in a better position here

Why is nobody suing Intel? They sold defective chips.

This isn't a defect in the traditional sense. The device is behaving to specifications, but a new method of attacking it has been found.

Rowhammer is arguably more aptly described as a defect; in that case the RAM is not behaving to spec.

Is there an End User License Agreement that users agree to before using these chips? If it is anything like software EULA's, then Intel is likely protected from lawsuits for defects.

Except unreasonable EULA terms are void in most (all?) of Europe.

>Is there an End User License Agreement that users agree to before using these chips?

I certainly didn't see one when booting up a new Intel computer.

I was under the impression EULAs are basically useless.

Not voluntarily?

So? If I buy a car that is promised to have 200HP but it actually just has 150HP after a software update wouldn't I be able to sue the manufacturer at least for the lost resale value?

HP is to MHz what vehicle performance is to server performance.

Intel's hardware still provides the same MHz. So the analogy would be more like due to safety concerns, a software update to the car increases the braking used by traction control.

That feels charitable to Intel, but unless they knew about the exploit, they used the tools in their arsenal to maximize performance. Now they are reducing it for safety.

Maybe it's more like how auto manufacturers are now having to increase wind resistance by designing pedestrian friendly hoods. Obviously that's not "after purchase" but I can't think of a better analogy!

Because processors get faster, speed = age: fast = new, slow = old.

In that spirit, here is a different car analogy for you.

This is like buying a new car, and then having the odometer suddenly jump by 150,000 km: you paid for new, but are stuck with old.

Or, we could go by features. Cars get new features. Suppose you pay for a car with various bells and whistles: on-board Wi-Fi, navigation, semi-autonomous driving and whatevernot. But then most of it doesn't work, so your car is like something from 2005. Forget bluetooth; just 1/8" aux jack, or CD player.

Did Intel promise you a CPU that would be impervious to all potential security vulnerabilities, currently known and unknown in perpetuity?

I would say, yes, in their documentation.

If you follow the semantic descriptions given in their architecture manuals, can you infer these insecure behaviors?

Perhaps, but what performance did Intel actually promise that you'd be able to argue against?

Intel promises you better performance from a new processor than what you can get from one in the same class that is several years old, also from Intel.

You're paying for a new one, but it performs like something that is several years old that you can obtain much more cheaply.

Another aspect of it is consumers who chose an Intel chip over a competitor's. Intel does promise competitive performance. Their pitch isn't "we are slower, but just buy us on the brand name alone".

That's not a good analogy, dude.

Intel vulnerabilities are providing additional 25% revenue to Cloud Providers.

Most cloud providers are trying to move up the value chain & provide higher margin differentiated services (managed db, queues, etc) instead of staying at the race to the bottom vm market.

These intel mitigations impact those services just like everyone else.

Does it? If customers can't run code on those servers, then these CPU level side channel attacks aren't an issue.

This assumes the servers for S3 etc are dedicated to the task.

> Does it? If customers can't run code on those servers, then these CPU level side channel attacks aren't an issue. This assumes the servers for S3 etc are dedicated to the task.

I’m not sure that follows. Unpatched, most of these Intel CVEs are almost like unpatched local privilege escalation vulnerabilities. Once you’ve compromised a low privilege process you can sniff your way to more and more powerful credentials on the machine, host, and in the network.

Unless higher abstraction cloud service providers are dumb enough to just run everything as root, in which case this doesn’t change anything, they probably can’t (competently) get away with not patching these. Defence in depth matters.

If I'm offering it as a platform as a service (instead of VM/container as a service) then this adds significant costs.

Interesting take: It's better to offer it as a "one click app" aside a basic cloud (docker/kubernetes) platform if you don't want to suck these costs.

In the immediate short term: maybe. But fundamentally, these vulnerabilities are (hyper)threatening their business model, because not sharing the infrastructure suddenly becomes far more attractive.

So to the extent that you're suggesting that cloud providers are happy, or may even have had a hand in this: nahh.

To Intel maybe. It's a loss to the cloud providers and their customers.

I pay by the hour, if I need to provision more servers AWS doesn't take a hit, I do

On EC2 and Lambda and similar, yes. On S3 and other managed services, AWS takes the hit.

Do they really need to apply all mitigation patches if they only execute their own code on machines running S3 and other managed services? (For EC2/... of course they do because they execute customer code.)

I wonder if I can skip the performance penalty for my own (hardware) server too... But I am not smart enough to answer this question myself.
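Before deciding anything, it helps to see what the kernel itself reports. On Linux, the directory `/sys/devices/system/cpu/vulnerabilities` has one file per known issue with a one-line status. A minimal sketch for dumping it follows; the `mitigation_status` helper name is made up, and it returns an empty dict on systems where the directory doesn't exist.

```python
from pathlib import Path


def mitigation_status(
    base: str = "/sys/devices/system/cpu/vulnerabilities",
) -> dict[str, str]:
    """Map each vulnerability file name to the status line the kernel reports.

    Returns an empty dict if the directory is missing (non-Linux system
    or a kernel too old to expose it).
    """
    root = Path(base)
    if not root.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(root.iterdir())
        if entry.is_file()
    }


for name, status in mitigation_status().items():
    print(f"{name}: {status}")
```

This only tells you what is currently enabled. Whether it is safe to turn any of it off on a box that runs only your own code is the judgment call being debated here.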

Do you run everything on the machines you only execute your own code on as root?

Most people don’t. It’s a defence in depth thing — if someone hits an exploit in a service you want it to be running the lowest privileged user it can get away with so that it can’t rapidly pull all the data out of other subsystems, roam out into the local network and dump DBs, etc.

With these vulnerabilities unpatched, an attacker running code as the low privilege user account can sniff credentials for other accounts on the machine, or underlying host, private keys that authenticate against this machine and therefore probably others in the network, etc.

I’d say it increases risk quite a bit.

I'm not convinced this is the case for lambda. You get billed for lambda CPU time. If the underlying hardware is unable to provide as much CPU time after mitigations are applied, and AWS hasn't raised their rates, then I'm not the one paying for it.

They could very well optimize only for their own service costs, rather than for the customer.

True, but for S3 the major cost is not CPU-bound.

Not really: https://twitter.com/waxzce/status/1129381076160454657

> The performance loss is not visible for our customers, we have to manage the loss in ourselves. It’s a kind of hidden defect.

I run a high traffic web service in the cloud. Well over a billion requests served per month from the origin. We run our CPUs at around 60%.

We've seen maybe a 2% hit from spectre & meltdown mitigations. It's hard to even tell because it's a small enough amount that it tends to get lost in the noise.

Other replies mentioned that the cost of the slowdown is absorbed by the cloud provider; let's just pretend it's fully passed down to the user and only nitpick at the math behind the joke:

Imagine a slowdown (performance penalty) of 50%. It would double the time to complete a task, thus doubling the cost (expressed in cost increase that would be a 100% increase!).

Generic formula: 1÷(1−x)−1

For x = 25% ---> 1÷(1−0.25)−1 = 33% duration (cost) increase
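The same arithmetic as a tiny Python sketch, in case anyone wants to plug in other numbers. The function name is made up, and it assumes, as above, that cost scales linearly with the time a task takes.

```python
def cost_increase(slowdown: float) -> float:
    """Fractional increase in duration (and cost) for a given throughput loss.

    slowdown is the fraction of performance lost, e.g. 0.25 for 25%.
    Implements 1 / (1 - x) - 1.
    """
    if not 0 <= slowdown < 1:
        raise ValueError("slowdown must be in [0, 1)")
    return 1 / (1 - slowdown) - 1


for x in (0.25, 0.50):
    print(f"{x:.0%} slowdown -> {cost_increase(x):.0%} cost increase")
# 25% slowdown -> 33% cost increase
# 50% slowdown -> 100% cost increase
```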

Perhaps this is the reason for Intel's recent push on their self created version of Clear Linux?

How does that help? You still take the perf hit?

Tests on Phoronix show Clear having significant wins in a number of tests. They have done so by rewriting libraries to perform well with Intel hardware. Those who complain about a 25% performance hit elsewhere can now be shown pretty good performance with Clear that isn't attainable with other distros.

Oh, gotcha; boost from optimizations cancelling out against loss from patches. I would argue that that still constitutes a performance loss since before the patches you could run with just the improvements and get better performance rather than neutral. But you're not wrong.

I bet Google isn't so giddy now about being FIRST!! [1] with Skylake in the data center a couple of years ago, or "going on all-in" with Intel on the Chromebooks (it didn't even give AMD a chance until very recently...), despite Chrome OS being one of the very few operating systems that are truly architecture agnostic.

Now it's paying dearly for that mistake, with up to 40% performance loss on Chromebooks due to the disabling of HT:


Google broke one of the most basic business rules: never rely on a single supplier. You're always worse off in the end, even if the exclusivity deals seem very tempting in the short-term.

[1] https://cloud.google.com/blog/products/gcp/compute-engine-up...

> "going on all-in" with Intel on the Chromebooks

Weren't Chromebooks on ARM before x86?

> Google broke one of the most basic business rules: never rely on a single supplier. You're always worse off in the end, even if the exclusivity deals seem very tempting in the short-term.

I don't think this is universally true, although I agree that it's probably prudent. Diverse options/suppliers are a risk mitigation, but they do have a cost.
