Intel Confronts Potential ‘PR Nightmare’ With Reported Chip Flaw (bloomberg.com)
1026 points by el_duderino on Jan 3, 2018 | 543 comments

This is a clusterf/big deal. Beyond the security implications, it means that all companies paying for computing resources will have to pay roughly 30% more overnight on cloud expenses for the same amount of CPU, assuming that they can just scale up their infrastructure.

I know that bugs happen and that there was nothing intentional on this one, but at times like this it is hard to hold at bay the temptation of claiming for a class lawsuit against Intel....

It's a good thing CPU is fairly compressible. Unless you meter it very carefully, you'll see the performance hit and it'll not impact you that much. Very few of my physical boxes are over 70% CPU utilization on a daily average.

It's, however, really bad if you sell CPU cycles for a living. You just lost between 5 and 30% of your capacity. If you have a large building, you just lost part of your parking lot to the Intel Kernel Page Problem building.

Problem is, most companies that need a lot of power only care about one thing - peak performance. And they tune it carefully in order to not overspend while guaranteeing minimal downtime. This means that they'll have to pretty much scale their infrastructure up by exactly 30%. That's a LOT for these big clients.

Honestly, I'd just make sure the server firewalls are super tight and not take in the future patches. At least for now.

Very very few highly tuned "peak performance" workloads are dominated by syscall overhead like the test that produced that 30% number was. It's best to hold off on the hyperbole.

I have a power dependent workload that scales horizontally and is currently already dominated by the cost of system calls. This will effectively, directly cause me to buy 30% more compute on a huge infrastructure. (~2,000 physical machines. Quite beefy dual socket machines with a lot of memory)

I know I’m not alone.

Then again. Think of microservices, Kubernetes for instance; Network requests are system calls.

If your workload has no code that's untrusted, you can safely skip this patch or disable it on boot. If not, at 2000+ physical machines, it may be worth moving some of that into kernel modules that would collapse a couple of syscalls into a single higher level one.

The VM host will still have the patch applied, won't it?

Yes, but if it's your metal, you don't need to.

Good idea. But it’s a windows executable and depends quite heavily on windows specifics.

(In my case anyway)

30% overhead might be incentive to revisit the assumption we can’t rewrite it for Linux.

You still can move some functionality to a device driver or something else that runs in the NT kernel space.

But then you have to release the module as GPL, no?

Only if you want to do something that would otherwise be a violation of copyright - e.g. distribute the module to other people (assuming it's sufficiently entangled with linux to be a derivative work thereof). The GPL only licenses you to do things that you otherwise couldn't do, it doesn't restrict you from doing things that were never a violation of copyright (e.g. privately modifying your own things).

No, kernel modules can be closed source; GPU drivers are a common example of a closed source kernel module.

Will you be buying Intel-based machines? Or will you be running a hybrid-architecture cluster now?

I don’t know very much about computing on that scale, but I wonder if all the people selling off Intel stock are thinking this story through.

AMD server CPUs currently outperform Intel on some multi-threaded benchmarks. This usually isn't a problem for people buying for peak-performance because you can always buy more CPUs to increase parallel programming speeds, but it's harder to make single threads faster.

It's possible that the patches applied to fix this bug will cause some single-threaded benchmarks to change from Intel being the fastest to AMD being the fastest.

So for what it is worth my company has all Intel kit. We run servers that run docker. In each docker container we do build / test for our product. That is all we use them for. 1RU with 2 blades, each blade is dual socket, 72 total cores, 512GB RAM. We will not apply this patch as none of this is public facing and we do not want the hit to build / test throughput. The one big thing that this has done is we were looking at AMD for new servers and that has now become a higher priority on the to do list. Given our environment we care about the number of containers we can run, period.

It is overwhelmingly likely that we’ll buy more Intel. Performance/Watt has always been superior and AMD has to prove itself over time before we’d buy it.

Not trying to kill expectations. This decision isn’t mine alone. You know the old saying “nobody got fired for buying Cisco” that applies to Intel too.

No, but lots of workloads are built with latency in mind. For APIs that talk to each other in long serial chains, don't be surprised if request responses take significantly longer in many, many workflows.

I wouldn’t isolate your concern to firewalls and bad actors that break in over SSH. If they manage to find a vulnerability in your app that allows remote code execution this could help them make that problem much worse. Also VM/container escapes are a big problem if you use a cloud provider.

> I'd just make sure the server firewalls are super tight and not take in the future patches. At least for now.

Good security is about layers. No one layer can be assumed to be watertight, but with enough layers you hopefully get to a good place.

If they really care about peak performance, I don't believe the PTI patch will affect them. If you can change your system in a way that the power-hungry part does not work on untrusted data, you can boot with "nopti" and ignore it. Systems which both need lots of maxed-out CPUs and traffic directly from wild internet are pretty rare. They're unlikely to run on virtualised systems either.
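For anyone weighing the "nopti" route on Linux, here is a minimal sketch of checking what a box is currently doing. It assumes a kernel recent enough to expose /sys/devices/system/cpu/vulnerabilities/meltdown (older kernels will not have the file) and that isolation was switched off with "nopti" or "pti=off" on the kernel command line. The helper names are made up for illustration.

```python
from pathlib import Path


def pti_disabled_on_cmdline(cmdline: str) -> bool:
    """True if the kernel command line asks for page-table isolation to be off."""
    args = cmdline.split()
    return "nopti" in args or "pti=off" in args


def read_text_or_none(path: str):
    """Return the stripped file contents, or None if the file is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Both paths are Linux-specific; on other systems this prints "unknown".
    status = read_text_or_none("/sys/devices/system/cpu/vulnerabilities/meltdown")
    cmdline = read_text_or_none("/proc/cmdline")
    print("kernel reports:", status or "unknown")
    if cmdline is not None:
        print("PTI disabled at boot:", pti_disabled_on_cmdline(cmdline))
```

This only reports the state of the machine it runs on; inside a VM it says nothing about whether the host is patched.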

> Systems which both need lots of maxed-out CPUs and traffic directly from wild internet are pretty rare.

That's a good description of basically every cloud environment out there, from AWS on down.

In other words they are extremely common.

There are many ways to tune such workloads and I suspect our software will get better as a result.

We'll start to get conscious about the number of syscalls we use on each operation, start using large buffers, start buffering stuff user-side...
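A rough sketch of what "buffering stuff user-side" can look like: collect small records in a user-space buffer and hand them to the kernel in large chunks, so the number of write(2) calls drops. The function names and the 64 KiB buffer size are arbitrary choices for illustration, not a recommendation.

```python
import os
import tempfile


def write_all(fd, data):
    """Write all of data to fd; return how many write(2) calls it took."""
    view = memoryview(data)
    calls = 0
    while view:
        written = os.write(fd, view)  # may be a partial write on pipes/sockets
        view = view[written:]
        calls += 1
    return calls


def write_unbatched(fd, records):
    """At least one system call per record."""
    return sum(write_all(fd, rec) for rec in records)


def write_batched(fd, records, buf_size=64 * 1024):
    """Accumulate records in user space and flush only when the buffer fills."""
    calls = 0
    buf = bytearray()
    for rec in records:
        buf += rec
        if len(buf) >= buf_size:
            calls += write_all(fd, buf)
            buf = bytearray()  # start a fresh buffer after each flush
    if buf:
        calls += write_all(fd, buf)  # flush whatever is left over
    return calls


if __name__ == "__main__":
    records = [b"x" * 100] * 10_000
    with tempfile.TemporaryFile() as a, tempfile.TemporaryFile() as b:
        print("unbatched write calls:", write_unbatched(a.fileno(), records))
        print("batched write calls:  ", write_batched(b.fileno(), records))
```

The trade-off is the usual one: data sitting in a user-space buffer is lost if the process dies before the flush, so this fits logs and bulk output better than anything that must be durable per record.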

The CPUs in cloud environments are not maxed out in general. There will be some areas like batch processing and compute-specific VMs. For other cases, there's quite a bit of overcommitting of resources. And that's before you start doing scheduling that mixes workloads on a physical host for better utilisation. Source: worked on a public cloud environment.

I agree with you on most VMs. But once you schedule mixed workloads, you want each host to be balanced so that all of its capacity is utilized evenly. Which means that if CPU use increases across the fleet, you will want new hardware with more CPU.

Either that, or you'll have to put up with some processes taking longer.

About 10 years ago I was mentored by a guy who was an utter wizard at queuing theory, and who bugfixed a whole bunch of nasty issues in cellular telecoms hardware through his understanding of how queuing theory impacted code execution.

TL;DR - queue behaviour gets nonlinear as you approach the theoretical max load. If you are running your processors at a high load, even a small change in code throughput makes a huge difference to real world behaviour.
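To put a number on the nonlinearity, here is a small sketch using the textbook M/M/1 queue (one server, random arrivals, exponential service times), where mean time in the system is S / (1 - rho) for service time S and utilisation rho. Real systems are not M/M/1, so treat this as an illustration of the shape of the curve, not a prediction. The 10% slowdown is a made-up figure.

```python
def mean_response_time(service_time, arrival_rate):
    """Mean time in system for an M/M/1 queue: S / (1 - rho), with rho = lambda * S."""
    rho = arrival_rate * service_time
    if rho >= 1.0:
        return float("inf")  # the queue grows without bound
    return service_time / (1.0 - rho)


def response_time_ratio(utilisation, slowdown):
    """How much the mean response time grows when each job takes `slowdown` times longer."""
    before = mean_response_time(1.0, utilisation)
    after = mean_response_time(1.0 * slowdown, utilisation)
    return after / before


if __name__ == "__main__":
    for rho in (0.5, 0.7, 0.9):
        ratio = response_time_ratio(rho, 1.10)  # hypothetical 10% slower service
        print(f"utilisation {rho:.0%}: response time x{ratio:.2f}")
```

With the same 10% slowdown, a box at 50% utilisation sees response times grow by roughly a fifth, while a box at 90% sees them grow by roughly an order of magnitude.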

I believe I saw a strangeloop talk about this specific issue in Clojure. The talk giver was talking about channels, not queues though.

Do you have a link?

Hmm. Since everyone that sells (Intel) CPU cycles for a living suffers the same loss of supply this boils down to pricing; the same demand chasing fewer cycles will drive up prices and the market will adapt.

30% is a big hit. I'm wondering if that isn't a bit exaggerated, or perhaps the consequence of poorly optimized workarounds that will rapidly improve. I recall seeing figures on the order of 3% only a few days ago.

30% is a worst case for a workload optimized to hit the performance bug as hard as possible.

How big it will be for your workload is a function of what your workload is. Benchmark if it is important to you.

50%, rather, is the worst case scenario. 30% is a bad case scenario, and 5% best-case scenario. Which is still a lot for large cloud providers like amazon, google, microsoft.

Any public information on how someone like Google or Facebook handle this? Do they have enough spare capacity to patch or will they need to build further capacity first? I could imagine 10% of Google's capacity (internal services, not Google Cloud) is at least a large datacentre.

I know someone who works for amazon and he said that they didn't need to do anything or buy any more servers.

>It's, however, really bad if you sell CPU cycles for a living.

Who really sells CPU cycles? Cloud providers sell instances priced per core. So the real hit is by the customers since they have to shell out for more instances for the same amount of computing power.

The hit I see is by providers of 'serverless' computing, since they charge per request and have their margins reduced.

> The hit I see is by providers of 'serverless' computing, since they charge per request and have their margins reduced.

AWS, Azure, and GCP all bill serverless with a combination of per-request fees and compute (GB-seconds), so I'd expect the entire hit to be passed on to the user since this will cause increased compute time for each request. N requests that used to average 300ms each will now be N requests that average, say, 400ms, so the per-request billing remains the same and the compute billing will increase by approximately 30%.
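A back-of-the-envelope sketch of that billing split. The two prices below are placeholders I made up, not any provider's actual rates, and the 1 GB memory size is also an assumption; only the 300ms and 400ms figures come from the comment above.

```python
def serverless_bill(requests, avg_seconds, memory_gb, per_request_fee, per_gb_second_fee):
    """Return (request_cost, compute_cost) for a simple two-part pricing model."""
    request_cost = requests * per_request_fee
    compute_cost = requests * avg_seconds * memory_gb * per_gb_second_fee
    return request_cost, compute_cost


if __name__ == "__main__":
    PER_REQUEST = 0.0000002   # hypothetical fee per request
    PER_GB_SECOND = 0.00002   # hypothetical fee per GB-second
    before = serverless_bill(1_000_000, 0.300, 1.0, PER_REQUEST, PER_GB_SECOND)
    after = serverless_bill(1_000_000, 0.400, 1.0, PER_REQUEST, PER_GB_SECOND)
    print("request cost before/after:", before[0], after[0])
    print("compute cost before/after:", before[1], after[1])
    print("compute cost ratio:", after[1] / before[1])
```

Going from 300ms to 400ms is a 4/3 ratio, so the compute part grows by about a third while the per-request part does not move. Providers that round duration up to a billing increment could shift this a little either way.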

I don't understand what exactly you're saying. All of those services have serverless services, but they also have server based instances which abstract compute to amount of cores and RAM rather than CPU cycles. And most use is out of the services which aren't serverless.

Then it's because you don't know how modern cloud works.

I now see that my misunderstanding was about who exactly the users were and who the provider was. By provider I meant only GCE, AWS, etc., while the commenter, I believe, also counted users of those services (who in turn are providers of serverless services).

A lot of companies do, including many ycombinator startups. Think of anything that's analytics, data science, data warehouse or advertising related. The costs to run their service just took a hit.

Their competitors are also affected.

Also a 30% decrease is also equivalent to setting Moore's law back 7 months. A 5% loss is only setting it back 1 month. I know that's a bit of a naive calculation. But the point is computing power has long operated in an exponential domain. So big differences in absolute numbers aren't necessarily a big deal.
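For what it's worth, the 7-month and 1-month figures fall out if you assume an 18-month doubling period and read "30% slower" as "needs 1.3x the compute". Both of those are my assumptions about how the numbers were derived, not something stated above.

```python
import math


def setback_months(overhead, doubling_months=18.0):
    """Months of exponential growth needed to make up a (1 + overhead)x compute requirement.

    Solves 2 ** (t / doubling_months) == 1 + overhead for t.
    """
    return doubling_months * math.log2(1.0 + overhead)


if __name__ == "__main__":
    for overhead in (0.05, 0.30):
        print(f"{overhead:.0%} overhead: about {setback_months(overhead):.1f} months")
```

A 24-month doubling period stretches the 30% case to roughly 9 months, so the conclusion is sensitive to which version of the "law" you pick.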

According to this patch comment, AMD x86 chips are not affected: https://lkml.org/lkml/2017/12/27/2

Sure but who is using AMD chips in place of Intel server chips? If company A competes in the widget market against company B and they both built their server infrastructure on Intel then neither company gained an advantage due to a performance degradation in Intel hardware.

> Sure but who is using AMD chips in place of Intel server chips?

Well... Everyone who bought AMD. Some people managed to see beyond the hype and go for the option that made sense.

The overwhelming majority of the cloud runs on Intel. Saying AMD is slightly better off doesn't really help if my systems are built on Intel. This is the case for most people.

What hype are you referring to? Are you suggesting the people who bought AMD knew this was a problem for Intel?

Azure got some AMD EPYC.

>Sure but who is using AMD chips in place of Intel server chips?

Maybe a lot more now?

CPU speed hasn't followed Moore's law since 2003ish. (Number of transistors is still following Moore's law, but that doesn't necessarily directly help you when your program is suddenly 3-30% slower.)

A CPU from 2017 is going to run your programs a hell of a lot faster than one from 2003. Even if they technically have the same clock speed. Look at benchmarks for instance: https://www.cpubenchmark.net/high_end_cpus.html

The claim wasn't "CPUs in 2017 are not faster than CPUs in 2003" or even "CPUs in 2017 are not much faster than CPUs in 2003"; the claim was that they haven't followed Moore's law since 2003, so applying it to CPU speed nowadays is inaccurate. Of course CPUs are faster now than they were 14 years ago, just not as fast as the case where Moore's law still applied to CPU speed.

Moore’s Law doesn’t describe CPU clock speed increases.

Dennard scaling however did deal with clock speed (indirectly via power). It has failed since about 2005.

I don't think you can reliably phrase this in terms of Moore's law. Moore's law mostly concerns raw FLOPs. It's less useful for predicting hardware performance for operations that are governed by limitations like I/O and memory latency. And this slowdown, if I understand it correctly, is largely driven by memory latency.

One of the rationales for cloud computing is it saves money by cranking up utilisation. Providers observe how much users "really" use and then provision that much.

True, sometimes you will leave boxes at low utilisation for various reasons, e.g. to deal with traffic spikes. But those reasons have not gone away. So now instead of having a predictable increase in CPU cost, you have an unpredictable increase in performance snafus.

The only good news is that the real performance hit will be less than 30% on many workloads. Especially once the providers start juggling and optimising.

>"It's a good thing CPU is fairly compressible."

What do you mean by "compressible"?

Presumably, for a certain important class of application, CPU is not used "densely", i.e. continually. Instead it's used intermittently, like a gas rather than a solid... Hence compressibility. Such applications are far from being CPU-bound, in other words.

So a cloud provider would be an example. Compressible similar to a sparse file I guess as well. Thanks this makes sense.

I think it was meant that a normal application does not utilise the CPU all the time, which can be seen by looking at the task manager CPU usage % = X. Any extra processing needed to fix this bug will have to come out of the remaining 100-X%. This is OK as long as you have enough spare %, and can afford the extra power usage for that processing.

That makes sense, thanks. This is a big deal.

Virtualization is one popular way to drive up CPU utilization. The more diverse workloads run on a given server, the more even the CPU usage tends to get. This way, if you have 100 workloads that peak at 100% but average at 1%, your CPU usage will tend to be smooth at 100%; any overallocation will smooth out over time (a job that would take 1 second may take up to 10).

No vendor can afford to do it at a loss for long. One way or another the customer will end up paying

There's also latency though :/ it seems that programs that make a lot of syscalls will be affected more than programs that are doing in-process calculations

We'll start being more syscall conscious when we write our programs. We'll batch more at the user mode side and try to use fewer syscalls to do the job.

Kernel ABIs will eventually reflect that and crop up higher level expensive calls that replace groups of currently cheap syscalls (that will become expensive after the fix).

And Intel will profit handsomely from next generation CPUs that'll get an instant up-to-30% performance boost for fixing this bug.

What about all the kernel interrupts due to network and storage traffic?

Maybe the scheduler could dedicate a core to interrupts and software that has small quanta and page tables? Can't really think of a code solution that doesn't sound stupid when I type it.

Another consideration is power usage in data centers. Server power usage is annoyingly complex, and once you get above 70% utilization power usage may go up considerably.

As a billion other people have already said, that all depends on their workloads. This isn't a 30% clockspeed reduction.

As I understand the problem, this isn't about clockspeed reduction, now it is the software's responsibility to check if the page is a kernel page/user page. So, the impact is significant. So, every time either pages are touched/accessed this check needs to be triggered, which causes it to be much slower.

> So, every time either pages are touched/accessed this check needs to be triggered, which causes it to be much slower.

Not to be mean, but that's not what is being changed.

You're right on the bug - userlevel code can now read any memory regardless of privilege level. However the fix isn't to manually check the privileges on each access - that would be extremely slow and wouldn't actually fix the problem.

The fix is to unmap the kernel entirely when userspace code is running. Because the kernel will no longer be in the page-table, the userspace code can no longer read it. The side-effect of this is that the page-table now needs to be switched every time you enter the kernel, which also flushes the TLB and means that there will be a lot more TLB misses when executing code, which slows things down a lot.

So, to be clear, it is not accessing pages that is being slowed down, it is the switch from the kernelspace to the userspace.

But doesn't the CPU enter kernelspace every time a syscall takes place? So based on what you've described, every time a syscall returns control back to userspace, the TLB will be flushed, which means slower page access times in general.

The distinction I was trying to make was the above commenters thinking that the kernel is now checking page permissions instead of the CPU doing it - i.e. doing privileged checks in software. That's not what's happening, the kernel is just unmapping itself when usercode is run so the kernel can't be seen at all. Then the privileged checks (which are now broken) don't matter because there is no kernel memory to read.

All your points are right though. Page access times will in general be slower because of all the extra TLB flushes, leading to more TLB misses when accessing memory.

Right, but how often that happens is workload dependent. Basically, how often is your code making syscalls.
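A crude way to reason about "workload dependent": multiply how many user/kernel transitions you make per second per core by the extra cost each one picks up. The 1 microsecond extra cost below is a placeholder, not a measurement, and the model ignores the knock-on TLB misses after returning to userspace, so it likely undercounts.

```python
def cpu_time_lost_fraction(transitions_per_sec, extra_seconds_per_transition):
    """Fraction of one core's time spent on the added per-transition cost (capped at 1.0)."""
    return min(1.0, transitions_per_sec * extra_seconds_per_transition)


if __name__ == "__main__":
    EXTRA_COST = 1e-6  # hypothetical: 1 microsecond added per kernel entry/exit
    # Hypothetical per-core transition rates (syscalls, interrupts and page faults).
    for label, rate in (("compute-bound job", 500), ("chatty network service", 50_000)):
        lost = cpu_time_lost_fraction(rate, EXTRA_COST)
        print(f"{label}: {rate} transitions/s -> {lost:.2%} of a core")
```

To get a real transition rate for your own process, something like `strace -c` or `perf stat` will count syscalls; measure your actual workload before and after the patch if the number matters.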

But don't all FS accesses (e.g., write to socket, read from DB) require a syscall? In that case, basically all web applications would be affected.

Or am I completely off the mark?

No, you are correct. Really, every application will be affected, they all make some syscalls. How much will vary, though.

At least one syscall happens at some point, but performance-tuned systems already use "bulk" syscalls where a single syscall can send megabytes of data, check thousands of sockets, or map a whole file into your address space to access as if it were memory.

> how often is your code making syscalls.

And how often the kernel services interrupts.

You don't understand the problem.

> claiming for a class lawsuit against Intel

If people who received written assurance from Intel that their hardware is 100% bug free can form a legal class, sure. I highly doubt there is even a single such customer.

Anyone can sue anyone else at any time. If you think Intel isn't going to be sued for this, you're wrong.

It depends on how they handle user compensation. Going by the FDIV precedent, they should typically replace all defective products for free, and they will be in the clear.

What I meant was that the presence of the bug itself is not a valid cause, for example you can't claim that due to the error you lost 1 trillion dollars via a software hack - even if it's true. If Intel can prove they acted ethically when disclosing the bug and that they replaced / compensated users up to the value of the CPU, they are in the clear.

I read that this bug goes back several years; "replacing all defective products for free" could be a massive expense, and assuming it includes current chips, there's also some lag time and engineering effort to get to the point where they could start doing so.

How do they replace defective products? Or more specifically how do you get your laptop CPU replaced if Intel offers a replacement for free?

Presumably, by visiting an agreed service center in your western, sue-happy country, or sending the computer to the nearest one at your expense in the rest of the world. If the CPU is not replaceable or no longer in service, you would get a voucher for the lost value of the CPU/computer that is now 30% slower. Something like 10-20$ for anything older than 3 years, so most people won't bother. If Apple can do it, surely Intel will manage, but it will cost them on the order of billions, a non-negligible fraction of their yearly profit.

Good call. Will be fascinating to see how this plays out!

They need some legal standing or the case can be dismissed out of hand. It may very well be a question of who has the better legal team.

> They need some legal standing or the case can be dismissed out of hand.

Yes and no. Yes, Intel would get a chance to claim that the case should be dismissed out of hand. To do that, they have to prove that, even assuming all the claimed facts are true, the people suing still don't have a valid case. That's a high bar. It can be reached - there's a reason that preliminary summary judgment is a thing in court cases - but it takes a really flawed case to be dismissed in this way.

How flawed? SCO v. IBM was not completely dismissed on preliminary summary judgment, and that was the most flawed case I've ever seen.

> It may very well be a question of who has the better legal team.

Well, Intel can afford to hire the best. A huge class-action suit can sometimes attract the best to the other side as well, though. (There's not just one "best", so there's enough for both sides of the same court case.)

IANAL, but it looks to me like there's at least the potential for a valid court case. CPUs are (approximately) priced according to their ability to handle workloads; if they can't provide the advertised performance, they didn't deserve the price they sold for.

What is bug free? CPUs work just fine. There is no bug.

The question is did anyone receive performance assurance from Intel? Probably not.

Some cloud providers or compute grids just lost a lot. Maybe they will find an angle to claim compensation.

From what I've read, this slowdown only affects syscalls, which, since they aren't usually a huge percentage of processing in the first place, should not have such an effect. You're more likely looking at a few percent at most, which is not going to be enough to make AMD outperform Intel. Let's stop the fear mongering and wait for actual metrics.

> From what I've read, this slowdown only affects syscalls

Incorrect. It also affects interrupts and (page) faults.

Any usermode to kernel and back transition.

So this is evil for virtualization hosting, which is the major enterprise application for Intel chips.

Hosting on bare metal will become more attractive. Too bad you can't long OVH and Hetzner.

>Too bad you can't long OVH and Hetzner.

What does that even mean?

Also Hetzner just introduced some AMD Epyc server.

"Long" as a verb means to purchase their stock.

As opposed to "shorting" a stock, which means making a bet that it will go down in value.

Ah that makes sense. Thanks!

For some reason I can’t reply to ‘chrisper’ but I think ‘api’ is referring to going long in the stock market.


> For some reason I can’t reply to ‘chrisper’

HN doesn't let you do this to new comments to avoid back-and-forth commenting that is typical in flamewars.

Coming up on 9 years here and I'm still finding out new things about how this site works. I've been wondering recently why some comments aren't replyable.

I think there is a time delay. Wait a few minutes or hours and you can reply. That cooling off period has helped me really think through my replies.

They become replyable after some amount of time. The amount of time varies based on how active the thread is and/or how deeply nested the comment is.

You can reply anyway, but you have to click on the timestamp ("X minutes ago") to do it.

Usually you can click on the <posting time> and go to a page that displays only that comment, which has a reply box even when there's no reply option on the main page.

That 100Hz timer tunable just got a lot more attractive...

Does that mean that you can get an instance on AWS and slow down the underlying server for all others by forcing a lot of syscalls? Or how is performance distributed between tenants?

Prelim benchmarks show a significant impact (~20%) on Postgresql benchmarks.

20% when running SELECT 1; over a loopback network interface, not in real-world workloads.

The other benchmark that has generated some consternation is running 'du' on a nonstop loop.

Both of these situations are pathological cases and don't reflect real-world performance. My guess is a 5-10% performance hit on general workloads. Still significant, but nowhere near as bad as some of the numbers that are getting thrown around.

And, databases are the worst case scenario, most real-world applications are showing 1% performance impact or less.



It's not really a worst case scenario when you consider where the majority of Intel's revenue comes from: selling their high margin server chips for use in data centers, a significant portion of which are running some kind of database.

Why should we trust your guesses over numbers being thrown around?

Your last link is all gaming benchmarks, which as the article mentions are not affected much.

Another quick postgres estimate [1] with lower impact and a reply from Linus Torvalds that these values are in the range of what they are expecting from the patch. "... Something around 5% performance impact of the isolation is what people are looking at. ..." [2]

[1] http://lkml.iu.edu/hypermail/linux/kernel/1801.0/01274.html [2] http://lkml.iu.edu/hypermail/linux/kernel/1801.0/01299.html

> syscalls, which, since they aren't usually a huge percentage of processing in the first place... Let's stop the fear mongering


(We should probably also stop overgeneralizing about the nature of computational workloads.)

Software development workflows have some of the worst syscall profiles out there. This is going to hit most of us where we live.

> I know that bugs happen

This isn’t an excuse for Intel consistently having terrible verification practices and shipping horrendous hardware bugs. From 2015: https://danluu.com/cpu-bugs/ There have been more since then.

I’ve talked to multiple people who work in intel’s testing division and think “verification” means “unit tests”. The complexity of their CPUs has far surpassed what they know how to manage.

This is typically what happens when you go for a long time without real competition. You get way too comfortable and bad habits start to pile up.

Isn't the reason this problem even exists the exact opposite? Intel was losing on the mobile market and changed internal testing to iterate faster by cutting corners.

Found a quote:

"We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times… we can’t live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our competition is moving much faster than we are".

Man, you should see the errata for some ARM-based SOCs. It's amazing that they work at all.

Vendor, in conversation: "We're pretty sure we can make the next version do cache coherency correctly."

Me (paraphrased): "Don't let the door hit you in the ass on the way out."

Management chain chooses them anyway, I spend the next year chasing down cache-related bugs. Fun.

ARM is such a shitstorm. At least the PC with UEFI is a standard. With every ARM device, you have to have a specialized kernel rom just for that device. There have been efforts made on things like PostmarketOS, but still in general, ARM isn't an architecture. It's random pins soldered to an SoC to make a single use pile of shit.

Why is it an issue to need a different kernel image for each device? I don't see a problem as long as there is a simple mechanism to specify your device to generate the right image. It's already like that with coreboot/libreboot/librecore, and it worked just fine for me.

Imagine that you are the person leading the team that's making an embedded system on an ARM SOC. It's not Linux, so you have your own boot code, drivers and so forth. It's not just a matter of "welp, get another kernel image." You're doing everything from the bare metal on up.

(I should remark that there are good reasons for this effort. Such as: It boots in under 500ms, it's crazy efficient, doesn't use much RAM, and your company won't let you use anything with a GPL license for reasons that the lawyers are adamant about).

So now you get to find all the places where the vendor documentation, sample code and so forth is wrong, or missing entirely, or telling the truth but about a different SOC. You find the race conditions, the timing problems, the magic tuning parameters that make things like the memory controller and the USB system actually work, the places where the cache system doesn't play well with various DMA controllers, the DMA engines that run wild and stomp memory at random, the I2C interfaces that randomly freeze or corrupt data . . . I could go on.

It's fun, but nothing you learn is very transferrable (with the possible exception of mistrust of people at big silicon houses who slap together SOCs).

The responsibility to document the quirks and necessary workarounds lie with the manufacturer of the hardware. If the manufacturer doesn't provide the necessary documentation, then that's exactly that: insufficient documentation to use the device.

There are hardware manufacturers that are better than others at being open and providing documentation. My minimal level of required support and documentation right now is mainline linux support.

Can you document your work publicly, or is there something I can read about it? I'm very interested in alternative kernels beside Linux.

> The responsibility to document the quirks and necessary workarounds lie with the manufacturer of the hardware.

When you buy an SOC, the /contract/ you have with the chip company determines the extent and depth of their responsibility. On the other hand, they do want to sell chips to you, hopefully lots of them, so it's not like they're going to make life difficult.

Some vendors are great at support. They ship you errata without you needing to ask, they are good at fielding questions, they have good quality sample code.

Other vendors will put even large customers on a tier-1 support by default, where your engineers have to deal with crappy filtering and answer inane questions over a period of days before getting any technical engagement. Issues can drag on for months. Sometimes you need to get VPs involved, on both sides, before you can get answers.

The real fun is when you use a vendor that is actively hiding chip bugs and won't admit to issues, even when you have excellent data that exposes them. For bonus points, there are vendors that will rev chips (fixing bugs) without revving chip version identifiers: Half of the chips you have will work, half won't, and you can't tell which are which without putting them into a test setup and running code.

Arm is a problem for all kernels, not just Linux, in how they map on-chip peripherals, etc. All the problems that UEFI solves are not solved on Arm.

Yep. I've seen scary errata and had paranoid cache flushes in my code as a precaution.

My favorite ARM experience was where memcpy() was broken in an RTOS for "some cases". "some cases" turned out to be when the size of the copy wasn't a multiple of the cache line size. Scary stuff.

Obvious hypothesis: first complacency leads to incompetence, then starting to cut corners has catastrophic consequences. The two problems are wonderfully complementary.

As other comments suggest, there might be a third stage, completely forgetting how to design and validate chips properly.

Or the system was designed poorly to begin with and now you're stuck with the design for backwards compatibility reasons.

I'd expect engineers that are aware of such serious bugs to spit on the grave of backwards compatibility. After all, the worst case impact would be smaller than the current emergency patches: rewriting small parts of operating systems with a variant for new fixed processors.

I think that could also have been the "official reason".

The same reason could have been used to give the NSA some legroom for instance, but tell everyone that's why they won't do so much verification in the future.

This implies that ARM vendors do less validation. I guess ARM is just so much simpler that good enough validation can be done faster. So essentially this is payback time for Intel for keeping compatibility with older code and simpler to program architecture (stricter cache coherence etc.). It is like one can only have 2 of cheap, reliable, easy-to-program.

I'm sure ARM vendors have their own problems... it is just that they tend to be used in application-specific products, so the bugs are worked around. Having come from a firmware background, I've worked on tons of ugly workarounds for serious bugs in validated hardware.

Furthermore, I just read an article (can't find the link) saying that certain ARM Cortex cores have the same issues as Intel.

> This implies that ARM vendors do less validation. I guess ARM is just so much simpler that good enough validation can be done faster.

More likely "good enough" is much lower because ARM users aren't finding the bugs. The workloads that find these bugs in Intel systems are: heavy compilation, heavy numeric computation, privilege escalation attackers on multi-user systems. Those use cases barely exist on ARM: who's running a compile farm on ARM, or doing scientific computation on an ARM cluster, or offering a public cloud running on ARM?

Where’s that quote from? ISTR reading it (or something very similar) as reported speech in a HN comment.

Overall it’s a depressing story of predictable market failure as well as internal misbehavior at Intel, if true. Few buyers want to pay or wait for correctness until a sufficiently bad bug is sufficiently fresh in human memory. And if you do want to, it’s not as if you’re blessed with many convenient alternatives.

The quote is from the link above (referencing an anonymous reddit comment).

That is a very interesting perspective, and as far as I know it is correct, though perhaps Intel's situation in the mobile market was exacerbated by complacency?

There are people looking to deploy ARM servers now. However I wish there had been more server competition. Many companies write their backend services in Python, JVM (Java/Scala/Groovy), Ruby, etc. Stuff that would run fine on Power, ARM or other architectures. There are very few specialized libraries that really require x86_64 (like ffmpeg and video-transcoding)

ffmpeg works great on ARM. I don't know if the PPC port is all that optimized lately.

But why do AMD chips not have similar issues? To me it looks like Intel tried to micro optimize something and screwed up.

According to LKML: https://lkml.org/lkml/2017/12/27/2

> The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault.

Out-of-order processors generally trigger exceptions when instructions are retired. Because instructions are retired in-order, that allows exceptions and interrupts to be reported in program order, which is what the programmer expects to happen. Furthermore, because memory access is a critical path, the TLB/privilege check is generally started in parallel with the cache/memory access. In such an architecture, it seems like the straightforward thing to do is to let the improper access to kernel memory execute, and then raise the page fault only when the instruction retires.
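For readers who prefer code, here is a toy sketch of the retire-time-fault design described above. It is only a Python simulation under loud assumptions: the `ToyCPU` class, the `KERNEL_MEMORY` table, the address `0xFFFF0000` and the "cache as a set" are all made up for illustration, and real hardware is far more complicated. It just shows why rolling back registers at retirement is not enough if the cache keeps a footprint of the speculative access.

```python
# Toy model only: a Python simulation of the idea, not an exploit
# and not a description of any real microarchitecture.
KERNEL_MEMORY = {0xFFFF0000: 0x2A}   # hypothetical address holding a "secret" byte
STRIDE = 4096                        # one probe slot per page, so slots never share a cache line


class ToyCPU:
    def __init__(self):
        # Microarchitectural state: which probe-array offsets are cached.
        self.cache = set()

    def user_load(self, addr, probe_base=0):
        """Simulate a user-mode load of a kernel address plus a dependent load."""
        # Speculative phase: the data access proceeds in parallel with the
        # privilege check, so the value is briefly available.
        value = KERNEL_MEMORY[addr]
        # A dependent load touches exactly one probe slot, chosen by the value.
        self.cache.add(probe_base + value * STRIDE)
        # Retirement phase: the fault is reported in program order.
        # Registers are rolled back, but this toy cache is not.
        raise PermissionError("page fault at retirement")

    def is_fast(self, offset):
        """Stand-in for timing a load: cached slots count as 'fast'."""
        return offset in self.cache


def recover_byte(cpu, addr):
    try:
        cpu.user_load(addr)
    except PermissionError:
        pass  # the attacker catches or suppresses the fault
    hits = [b for b in range(256) if cpu.is_fast(b * STRIDE)]
    return hits[0] if len(hits) == 1 else None


print(hex(recover_byte(ToyCPU(), 0xFFFF0000)))  # → 0x2a
```

In this model the fix quoted from LKML corresponds to never executing the two "speculative phase" lines when the privilege check would fail.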

Maybe the answer lies in Intel’s feted IPC advantage over AMD? Or is it the case that AMD has simply been relatively lucky so far?

Sounds like Facebook and Youtube, too.

It depends on whether it's an attack against HVM hypervisors or not.

If it, like it seems, is just an attack on OS kernels and PV hypervisors, you can simply turn off the mitigation, since nowadays kernel security is mostly useless (and Linux is likely full of exploitable bugs anyway, so memory protection doesn't really do that much other than protecting against accidental crashes, which isn't changed by this).

Even if it's an attack against hypervisors any large deployment can simply use reserved machines and it won't have a significant cost.

More than the lawsuit, it attacks one of the core aspects of Intel's brand: performance. Intel chips are supposed to be faster. Now they are suddenly 30% slower because they carelessly implemented performance features over security ones.

> all companies paying for computing resources will have to pay roughly 30% more overnight on cloud expenses

Well, if I rent a VPS with x performance, I still expect x performance after this flaw is patched. The company providing the virtual machine will perhaps have to pay 30% more to provide me with the same product I've been getting.

Since most VPS offerings arbitrage shared resources, this will not increase costs of providing VPSes by the full performance penalty.

But all you are ever getting with vps offerings is a description of the number of cpus and amount of ram and suchlike. I haven't seen vps offerings that say "x" chips yield "y" performance. Granted it is sort of implied that the hardware meets certain expectations but there isn't any guarantee. I was just now reading the TOS for AWS just to check and so far as I can tell they aren't guaranteeing any kind of specific performance.

Well if you use m5.xlarge instances from AWS you were getting 4 vCPUs for your money. I don't expect you'll now get 5...

No, but the underlying hardware that previously hosted two m5.xlarge instances may instead host one m5.xlarge and one m5.medium, so that performance is not degraded.

Yeah but AWS doesn't guarantee that "2x m5.xlarge" will meet any kind of performance requirements, particularly your own application's, do they?

So you may suddenly find that your own performance requirements, that were previously satisfied by "2x m5.xlarge" are no longer being met by that configuration, and I doubt AWS will just provide you with more resources at no additional charge.

> Well, if I rent a VPS with x performance, I still expect x performance after this flaw is patched.

Are there any providers that state you will get x performance? Most that I've seen say you will get m processors, n memory, and p storage but don't make any guarantees about how well those things will perform.

Last I checked, Amazon AWS has a virtual processor metric, not an actual hardware metric. This is most noticeable in their lowest-power instances, which don't get a full modern CPU core.

If the virtual metric is tied to real performance then it could mean a drop in performance while maintaining the same power rating... It will be interesting to see if vendors directly address this.

Cloud services may not need to worry about the issue, depending on which OS the customer chooses to use: the patched or the non-patched version.

For the cloud providers it's the security of the hypervisor that's at stake.

Why would the OS of the customer matter? The patch would be applied to the kernel of the hypervisor / host OS.

Forgive me my ignorance, but I fail to see how this is such a big deal. Even 50% performance hit/cost increase would be... bearable, computations are rather cheap today. ML and other intensive calculations aren't done on CPU anyway. It's not like technical progress of our civilization is slowed down by 30% or something...

On the other hand, shrinking Intel's market share due to bad PR and thus adding some competition into the industry could actually foster that progress.

If you run things efficiently you're eking every ounce of performance out of this hardware. A 30% performance hit means a 30% cost increase.

The bigger issue is for things that don't scale easily. That sql server that was at 90% capacity is suddenly unable to handle the load. Sure that could've happened organically, but now it happens (perhaps literally) overnight for everyone all at once.

Expect a bunch of outages in the next few weeks as companies scramble to fix this.

"A 30% performance hit means a 30% cost increase."

Just wanna point out that a 30% performance hit means a 43% cost increase.

Yes. This is so often forgotten when talking about stock prices (which is why those 2x or 3x daily derivatives are so dangerous).

For those confused: the math here is a 30% decrease puts you at 70%. To go from 70% back to 100%, 30% only gets you to 91% (0.70*1.3). 1/0.7 = 1.43 means you need 43% to recover.
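The same arithmetic as a small sketch, in case anyone wants to plug in other numbers. The function name `extra_capacity_needed` is made up here, and the model assumes throughput scales linearly with machine count, which real workloads may not do.

```python
def extra_capacity_needed(perf_hit: float) -> float:
    """Fraction of additional capacity needed to restore the original
    throughput after each machine loses `perf_hit` of its performance
    (0.30 means a 30% hit). Assumes perfectly linear scaling."""
    if not 0 <= perf_hit < 1:
        raise ValueError("perf_hit must be in [0, 1)")
    # Each machine now delivers (1 - perf_hit) of its old throughput,
    # so you need 1 / (1 - perf_hit) times as many machines.
    return 1 / (1 - perf_hit) - 1


for hit in (0.05, 0.10, 0.30, 0.50):
    print(f"{hit:.0%} slower -> {extra_capacity_needed(hit):.1%} more machines")
```

For the 30% case this gives roughly 43%, matching the 1/0.7 figure above; at the low end of the quoted range (5%) the gap between "hit" and "extra capacity" is small.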

Should individual companies hurry to patch it? There is no news of an exploit as such.

There is now!

Intel CEO added - "But when you take a look at the difficulty it is to actually go and execute this exploit — you have to get access to the systems, and then access to the memory and operating system — we're fairly confident, given the checks we've done, that we haven't been able to identify an exploit yet."

It seems you need root or physical access to the system as a prerequisite for the attack.

You don't need root, and you don't need physical access. For Meltdown, you only need the ability to run your own code on the target machine.

Where that gets tricky is when everyone's using cloud hosting solutions where the physical machines are abstracted away, and a given physical server may be running multiple virtual servers for different customers.

Think of it like this:

* Somewhere in a data center at a cloud provider is a physical server, wired up in a rack.

* That server runs virtualization software, allowing it to host Virtual Server 1, Virtual Server 2, and Virtual Server 3.

* Virtual Server 1 belongs to Customer A. Virtual Servers 2 and 3 belong to Customer B.

* Normally, Virtual Server 1 can't access any memory allocated to Virtual Servers 2 and 3.

* BUT: Customer A can now use Meltdown to read the entire memory of the physical server. Which includes all the memory space of Virtual Servers 2 and 3, exposing Customer B's data to Customer A.

That's the threat here.

Have you worked in a company where you've hit CPU performance limits? At my last job, we'd have some services run in 25 containers in parallel and we'd have to optimize as much as we could for performance bottlenecks. We'd literally get thousands of assets per minute some mornings, and had a ton of microservices to properly index, tag, thumbnail and transcode them.

Our ElasticSearch nodes all had 32GB of ram and we had 10 of them and they were all being pushed to the max.

Something like this would be a massive hit, requiring a lot more work into identifying new bottlenecks and scaling up appropriately.

I think you're vastly underestimating the potential impact to cloud providers. Azure/AWS/GCP all definitely have extra capacity, but they have forecasting down to a science. Requiring even 10% more capacity is quite a large undertaking alone.

Even the non-provider side of Google will see some impact, and even a 5% datacenter increase won’t happen overnight.

Best summary I've found for the somewhat technical but not hardware-or-low-level-hacker reader is arstechnica. https://arstechnica.com/gadgets/2018/01/whats-behind-the-int...

My head is still spinning. Writing an OS is a BIG DEAL!!!!

Can someone help me understand why this is such a big deal? This doesn’t seem to be a flaw in the sense of the Pentium FDIV bug where the processor returned incorrect data. It doesn’t even seem to be a bug at all, but a side channel attack that would be almost expected in a processor with speculative execution unless special measures were taken to prevent it. And it doesn’t seem like it can be used for privilege escalation, only reading secret data out of kernel memory. It seems pretty drastic to impose a double-digit percentage performance hit on every Intel processor to mitigate this.

There is this thing called "return oriented programming". You write your program as a series of addresses that are smashed onto the stack through some other type of vulnerability. When the current function returns, it returns to an address of your choosing. That address points to the tail end of some known existing function, such as in the C library and other libraries. When the tail end of that function returns, it executes your next "instruction" which is merely the next return address on the stack.

The first "instruction" of your program is the last address on the stack, in the list of addresses you pushed to the stack.

You are executing code, but you did not inject any executable code, you did not need to modify any existing code pages (which are probably read only), you did not need to attempt to execute code out of a data page (which is probably marked non executable).
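A toy sketch of the idea above, in Python rather than on a real stack. Everything in it is hypothetical: the "addresses" are dictionary keys, and the "gadgets" are ordinary functions standing in for the tail ends of existing library routines. The point is only that the attacker's "program" is a list of return addresses, not new code.

```python
# Toy model only: no real stack, no real machine code.
def gadget_load_7(state):
    state["acc"] = 7


def gadget_double(state):
    state["acc"] *= 2


def gadget_print(state):
    state["out"].append(state["acc"])


# Existing, read-only "code". The attacker adds nothing to this table.
CODE = {0x1000: gadget_load_7, 0x2000: gadget_double, 0x3000: gadget_print}


def run(stack):
    """Repeatedly 'return': pop the next address and run whatever lives there."""
    state = {"acc": 0, "out": []}
    while stack:
        return_address = stack.pop()   # top of the stack is the end of the list
        CODE[return_address](state)
    return state["out"]


# The attacker's "program" is pure data: a list of return addresses.
# The first gadget to run is the last item in the list.
chain = [0x3000, 0x2000, 0x2000, 0x1000]
print(run(chain))  # → [28]
```

Note that `run` only works because the attacker knows which address each gadget lives at, which is exactly the knowledge that randomizing the layout is meant to take away.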

Address Space Layout Randomization is a way to prevent the "return oriented programming" attack. When a process is launched, the address space is randomly laid out so that the attacker cannot know which address in memory the std C lib printf function will be located at -- in this process.

Now let's think about the kernel. If you could know all of the addresses of important kernel routines, you could potentially execute a "return oriented programming" attack against the kernel with kernel privileges. Without modifying or injecting any kernel level code. These hardware vulnerabilities allow user space code to deduce information about kernel space addresses.

Now that's a lot of hoops to jump through in order to execute an attack. But there are people prepared to expend this and even more effort in order to do so. Well funded and well staffed adversaries who would stop at nothing in order to access more and better pr0n collections.

Thanks for the explanation. But I don't understand this part:

> If you could know all of the addresses of important kernel routines, you could potentially execute a "return oriented programming" attack against the kernel with kernel privileges. Without modifying or injecting any kernel level code.

The user <-> kernel transition is mediated (on x86-64) with the SYSCALL instruction, which jumps to a location specified by a non-user writable MSR. How does return-oriented programming work in that case?

Basically, let's say there's a syscall that takes a user buffer and size and copies it into kernel stack for processing. (This is common.) If you overflow that buffer, you can overwrite the return address in the kernel stack, which you can then launch into ROP.

> If you overflow that buffer, you can overwrite the return address in the kernel stack, which you can then launch into ROP.

The crucial point here being that there must already be an existing overflow vulnerability in the kernel. Knowing all the addresses is no use if you can't force execution to go to them.

The hypothesis I've seen, and why people seem to be rushing to patch it without explaining, is that you might be able to not only leak addresses, but actual data, from any ring, into unprivileged code, at which point, your security model is burned to the ground.

AIUI, the present circumstances are:

- there exists a public PoC from some researchers of side-channel leaking kernel address information into userland via JavaScript which may be unrelated

- there exists a Xen security embargo that expires Thursday that might be unrelated

- AWS and Azure have scheduled reboots of many things for maintenance in the next week, which seems unlikely to be unrelated to the Xen embargo

- a feature that appears to be geared toward preventing a side-channel technique of unknown power has been rushed into Linux for Intel-only (both x86_64 and ARM from Intel)

- a similar class of prevention technique has been landed in Windows since November for both Intel and AMD x86_64 chips (no idea about ARM)

- the rush surrounding this, and people being amazingly willing to land fixes that imply a 5-30% performance impact, strongly suggest that unlike almost every major CPU bug in the last decade, you can't fix or even work around this with a microcode update for the affected CPUs, which is _huge_. The AMD TLB bug, the AMD tight loop bug that DFBSD found, even the Intel SGX flaws that made them repeatedly disable SGX on some platforms - all of them could be worked around with BIOS or microcode updates. This, apparently, cannot. (Either that or they're rushing out fixes because there's live exploit code somewhere and they haven't had time to write a microcode fix yet, but O(months) seems like they probably concluded they outright can't, rather than haven't yet.)

Addendum for anyone still reading:

- Intel issued a press release saying they planned to announce this next week after more vendors had patched their shit, which lends me more cause to believe that the Xen bug might be the same one [1]

- Intel claims in the same PR that "many types of computing devices — with many different vendors’ processors" are affected, so I'll be curious to see whether non-Intel platforms fall into the umbrella soon

- macOS implemented partial mitigations in 10.13.2 and apparently has some novel ones coming up in 10.13.3 [2]

- someone reasonably respected claims to have a private PoC of this bug leaking kernel memory [3]

- ARM64 has KPTI patches that aren't in Linus's tree yet [4] [6] ([6] is just a link showing the patches from 4 aren't in Linus's tree as of this writing)

- all the other free operating systems appear to have been left out of the embargoed party (until recently, in FBSD's case), so who knows when they'll have mitigations ready [5]

- So far, Microsoft appears to have only patched Windows 10, so it's unknown whether they intend to backport fixes to 7 or possibly attempt to use this as another crowbar to get people off of XP 2.0

- Update: Microsoft is pushing an OOB update later today that will auto-apply to Win10 but not be forced to auto-apply on 7 and 8 until Tuesday, so that's nice [7]

[1] - https://newsroom.intel.com/news/intel-responds-to-security-r...

[2] - https://twitter.com/aionescu/status/948609809540046849

[3] - https://twitter.com/brainsmoke/status/948561799875502080

[4] - https://patchwork.kernel.org/patch/10095827/

[5] - https://lists.freebsd.org/pipermail/freebsd-security/2018-Ja...

[6] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

[7] - https://www.theverge.com/2018/1/3/16846784/microsoft-process...

https://security.googleblog.com/2018/01/todays-cpu-vulnerabi... https://googleprojectzero.blogspot.com/2018/01/reading-privi...

Seems that Google/Project Zero felt the need to go ahead and break embargo. Worth adding to the above list of news sources.

No, that's not accurate.

If you read the article you quoted:

> We are posting before an originally coordinated disclosure date of January 9, 2018 because of existing public reports and growing speculation in the press and security research community about the issue, which raises the risk of exploitation. The full Project Zero report is forthcoming (update: this has been published; see above).

Just from public Googling, I believe it may have been The Register who tried to get in on the scoop, and broke the embargo:


No one necessarily broke the embargo. A blogger noticed unusual activity around a certain Linux patchset and put two and two together, and The Register mostly sourced from his article ( http://pythonsweetness.tumblr.com/post/169166980422/the-myst... )

Also from the P0 blog post:

- Variant 1: bounds check bypass (CVE-2017-5753)

- Variant 2: branch target injection (CVE-2017-5715)

- Variant 3: rogue data cache load (CVE-2017-5754)

My checking doesn't show any of those three explicitly listed in Apple's security updates up through 10.13.2/2017-002 Sierra.


Thanks for summarizing. Does anyone have time to link to more on the "side-channel leaking kernel address information into userland via JavaScript" ?

This isn't exactly that, but here[1] is a talk linked in the post from the other day which shows a PoC breaking ASLR in Linux from JavaScript running in the browser, via a timing attack on the MMU. There's a demo a half hour in.

EDIT: This post[2] discusses the specific speculative execution cache attack and claims there is a JavaScript PoC (but doesn't cite a source for that claim)

[1] https://www.youtube.com/watch?v=ewe3-mUku94

[2] https://plus.google.com/+KristianK%C3%B6hntopp/posts/Ep26AoA...

[1] was what I was referencing, thank you.

Also, RUH-ROH. https://twitter.com/brainsmoke/status/948561799875502080

More importantly, it also switches stacks so user-mode code cannot modify the return addresses on the kernel's stack.

Everything you've said is right, but I'll expand a little more because ROP is fun.

ASLR, PIC (position independent code: chunks of the binary move around between executions), and RELRO (changing the order and permissions of an ELF binary's headers: a common ROP pattern is to set up a fake stack frame and call a libc function in the ELF's global offset table) are all mitigations against ROP, but none solve the underlying problem.

The reason ROP exists is that x86-64 uses a von Neumann architecture, which means that the stack necessarily mixes code (return addresses) and data. The only true solution is an architecture that keeps these stacks separate, such as Harvard architecture chips.

As for bypassing the aforementioned mitigations...

ASLR: Only guarantees that the base address changes. Relative offsets are the same. So to be able to call any libc function in a ROP chain, all you need is a copy of the binary (to find the offsets) and to leak any libc function address at runtime. There are a million ways for this data to be leaked, and they are often overlooked in QA. Once you have any libc address, you can use your regular offsets to calculate new addresses.
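The offset arithmetic described above is short enough to sketch. The offsets and the leaked address below are invented for illustration (they are not taken from any particular libc build), and `rebase` is a hypothetical helper name.

```python
# Hypothetical offsets, as if read from a local copy of the same libc build.
LIBC_OFFSETS = {"puts": 0x0809C0, "system": 0x04F440, "read": 0x110070}


def rebase(leaked_name, leaked_address, offsets=LIBC_OFFSETS):
    """Given one leaked runtime address, recover the randomized base and
    return the runtime address of every known symbol."""
    base = leaked_address - offsets[leaked_name]
    # Sanity check: a library is mapped at a page-aligned base.
    if base % 0x1000 != 0:
        raise ValueError("base is not page-aligned; wrong libc or wrong leak?")
    return {name: base + off for name, off in offsets.items()}


# Pretend some bug leaked the runtime address of puts().
leaked_puts = 0x7F3A5C2809C0
addresses = rebase("puts", leaked_puts)
print(hex(addresses["system"]))  # → 0x7f3a5c24f440
```

The page-alignment check is a cheap way to notice that the local copy of the library does not match the target.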

PIC: haven't yet dealt with it myself, but you can use the above technique to get addresses in any relocated chunk of code, but I think you'll need to leak two addresses to account for ASLR and PIC.

RELRO: This makes the function lookup table in the binary read only, which doesn't stop you from calling any function already called in the binary. Without RELRO, you can call anything in libc.so I think, but with RELRO you can only call functions that have been explicitly invoked. This is still super useful because the libc syscall wrappers like read() and write() are extremely powerful anyway. Full RELRO (as opposed to partial RELRO) makes the procedure linkage table read only as well, which makes things harder still.

If this is the kinda thing that interests you, I heartily recommend ropemporium.com which has a number of ROP challenge binaries of varying difficulty to solve. If you're not sure where to start, I also wrote a write-up for one of the simpler challenges [1] that is extremely detailed, and should be more than enough to get you started (even if you have no experience reversing or exploiting binaries)

Disclaimer: I'm just some dipshit that thinks this stuff is fun, if I've made a mistake in the above please let me know. I also haven't done any ROP since I wrote the linked article, so I'm probably forgetting stuff.

[1] https://medium.com/@iseethieves/intro-to-rop-rop-emporium-sp...

> If you could know all of the addresses of important kernel routines

Are those kernel logical addresses?

My BTC wallet feels more vulnerable than ever

If you can read kernel (and hypervisor) memory then it seems like a very small step from that to a local root vulnerability - e.g. forge some kind of security token by copying it. There's an embargoed Xen vulnerability that may be related to or combine with this one to mean that anyone running in a VM can break out and access other VMs on the same physical host. That would be a huge issue for cloud providers.

> If you can read kernel (and hypervisor) memory then it seems like a very small step from that to a local root vulnerability - e.g. forge some kind of security token by copying it.

This seems very wrong. I'm not aware of any privilege isolation in Windows relying on the secrecy of any value. Security tokens have opaque handles for which "guessing" makes no sense. Are you aware of anything?

I can think of a few ways to get privilege escalation if you already have RCE as an unprivileged user:

1. Read the root ssh private key from the OpenSSH daemon's kernel pages maintaining the crypto context and ssh into the system

2. Read a sudo auth key generated for someone using sudo and then use that to run code as a root user

3. Read the user's password whenever a session manager asks the user to reauth

4. If running in AWS/GCP inside a container/VM meant to run untrusted code, read the cloud provider private keys and get control of the account

5. RCE to ROP powered privilege escalation exploit seems reasonable...

6. Rowhammer a known kernel address (since you can now read kernel memory) to flip some bits to give you root

Also remember running JS is basically RCE if you can read outside the browser sandbox, ads just became much more dangerous...

Thanks! I see. So it seems like the program basically has to capture sensitive data while it is in I/O transfer (and hence in kernel memory) just at the right time, right? Which is annoying and might need a bit of luck, but still possible.

Incidentally, this seems to indicate that zero-copy I/O is actually a security improvement as well, not just a performance improvement?

4, 5, and 6 don't need to time the attack.

I am not really sure whether zero copy would help with this problem.

If this bug only allows reading kernel pages, zero copy may actually help, since the unprivileged user can't read your pages. But from the small amount of description available, it looks like it can read any page; kernel pages are just more interesting because that's a ring lower, which is why all the focus is on them.

I am fairly certain there is more protection against being able to read memory owned by a process on a lower ring level, so zero copy may be a bad idea for security-critical data.

And based on the disclosure that Google published, it looks like any memory can be read.

If “reading secret data out of kernel memory” translates into “read the page cache from a stranger’s VM that happens to be on the same cloud server” then this could be worse than Heartbleed.

Or maybe random javascript in the browser can stroll upon your ssh private key in the kernel's file cache... and so on.

Excellent point, I didn't think about the implications for stuff like JavaScript.

The privilege escalation is being fixed in software. The problem is that mitigation involves patching the kernel, and that patch results in around a 30% slowdown for some applications like databases or anything that does a lot of IO (disk and network). That's the big deal. Imagine you are running at close to full capacity: after the security-fix reboot, your service might tip over. It could mean a direct impact on cost and so on.

Oh good, I put my SaaS (running mostly on Linode) up yesterday, then this happens. Can't wait for Linode to apply this patch to their infrastructure :(

I'm cursed when it comes to timing. It's like when I bought that house in 2007, held onto it waiting for the market to recover, then tried to sell it only to find out my tenants had been using it to operate a rabbit-breeding business for years and completely trashed the place (thank you, useless property manager), forcing me to sell it at a loss anyway (6 months ago).

Also, I hate rabbits now. And I veered off topic, sorry.

You might try luck to sell this as comedy/drama movie script. :)

> Also, I hate rabbits now. And I veered off topic, sorry.

Well I guess you're not the right person to talk to about a great ninja-rockstar position at our new RaaS startup.

/one has to joke sometimes to avoid crying over taking a 30% hit in costs... over a stupid CPU bug

I would love to see some SQL Server benchmarks on this patch

SQL Server license disallows publishing of the results of benchmarking (much like Oracle does)

Wait, really? That's kind of messed up.

Remarkable that no throwaway HN accounts considered that a challenge.

Likely very similar to the postgres benchmarks. Fundamentally a RDBMS needs to sync each transaction commit to the log file on disk and that sync is always a syscall. If your DB is doing thousands tx/sec to low latency flash and you rely on that low latency, you're going to get hit.

Note that the postgres benchmark numbers passed around (most based on my benchmarks) are readonly. For write based workloads the overhead is likely going to be much smaller in nearly all cases, there's just more independent workload. The overhead in the readonly load profiles comes near entirely from the synchronous client<->server communication, if you can avoid that for workloads (using pipelining, other batching techniques) the overhead's going to be smaller.
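To make the "batching means fewer syscalls" point concrete, here is a minimal sketch applied to the log-sync case mentioned above. It is not how any real database implements commits; `write_records` is a made-up helper that appends records to a temporary file and counts how many times it calls `os.fsync`.

```python
import os
import tempfile


def write_records(records, batch_size):
    """Append records to a temporary log, syncing once per batch.
    Returns the number of fsync calls made."""
    syncs = 0
    with tempfile.TemporaryFile() as log:
        for i, record in enumerate(records, start=1):
            log.write(record)          # buffered in user space until flush()
            if i % batch_size == 0 or i == len(records):
                log.flush()            # hand the buffered bytes to the kernel
                os.fsync(log.fileno()) # ask the kernel to push them to disk
                syncs += 1
    return syncs


records = [b"commit\n"] * 100
print(write_records(records, batch_size=1))   # → 100
print(write_records(records, batch_size=10))  # → 10
```

If each kernel entry gets more expensive, the first call pays that extra cost 100 times and the second pays it 10 times, at the price of a record waiting longer before it is durable.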

Reading secret data out of kernel memory is very bad on cloud environments. Keep in mind that the kernel deals with a lot of cryptography.

Sounds like reading HTTPS cert/key details from other people's VMs on cloud providers wouldn't be too much of a stretch. Especially with the memory dumping demo. Combine that with something that looks for the HTTPS private key flag string and it's sounding pretty feasible. :/

Is there anything this bug can give you that you can't get with

    sudo cat /dev/mem

I'm having a hard time understanding why this is worse than any other local root escalation bug except for the consequences of the necessary patch.

EDIT: I see that /dev/mem is no longer a window on all of physical RAM in a default secure configuration. Is it true that there's no way for root to read kernel memory in a typical Linux instance? If so, the severity of this issue makes more sense to me.

You don't need to be root.

> I'm having a hard time understanding why this is worse than any other local root escalation bug except for the consequences of the necessary patch.

It's not, as far as I'm aware. The fact that the patch has perf consequences is why it's such a big deal.

I think the idea is that it is worse because the bug is in the hardware. The OS patches are just a workaround to make the hardware bug unexploitable, and they can lead to a significant performance penalty.

We don't know what the actual bug is yet, or how easy it would be to exploit it. People are speculating that either:

a) It would allow any non-root process to read full memory, including the kernel and other processes, or

b) It would allow one cloud VM to read full memory of other cloud VMs on the same physical machine, or

c) With enough cleverness, it would allow even sandboxed Javascript on a web page to read full memory of the computer that it is running on.

`/dev/mem` is not available in a container, so I cannot use `/dev/mem` to read other tenants' memory on my VPS.

>And it doesn’t seem like it can be used for privilege escalation

Based on all the hoopla around the Linux kernel patches, the thinking is: yes it can. Or VM escape. Or both.

It's a bug, even if it's a side-channel attack only. Notice that AMD chips aren't vulnerable to this attack.

I have no idea how or if this is a big deal but:

>>attack that would be almost expected in a processor with speculative execution unless special measures were taken to prevent it.

If you're going to put in features with expected attacks, you should definitely be putting in features to prevent them, and if it is an expected attack, it shouldn't take special measures; it should just be an inherent part of introducing the feature.

When speculative execution (and caches) were invented and put into widespread use, no one thought about timing attacks, nor was the practice of running untrusted code on one's own machine common.

UNIX has been multi-user for a very long time and the intended use case is that those users not be able to compromise each other or get root.

> nor was the practice of running untrusted code on one's own machine common

Doesn't multi-user timesharing and virtualization predate every modern CPU and OS though?

Yes, but it went out of style for a while.

At first, computers were very expensive, and so were shared between many users. Mainframes, UNIX, dumb terminals, etc.

Then computers became cheap. Users could each have their own computer, and simply communicate with a server. Each business could have their own servers co-located in a datacenter.

Then virtualization got really good, and suddenly cloud servers became viable. You didn't have to pay for a whole server all the time, and if demand rapidly increased you didn't need to buy lots of new hardware. And if demand decreased you didn't get stuck with tons of useless hardware.

The second stage (dedicated servers) was the case when speculative execution was implemented. We're currently in the third stage, but Intel haven't changed their designs.

Old multi-user time sharing generally had agents that were 'somewhat' trusted. Most systems that had 'secret' data didn't allow time sharing with less privileged users. Also, outside of the timeshared server, users attempting to exploit this wouldn't likely have had the processing capability to deduce the contents of said cache.

It doesn’t even seem to be a bug at all, but a side channel attack that would be almost expected in a processor with speculative execution unless special measures were taken to prevent it.

Indeed, this reminds me of cache-timing attacks, which probably can be done on every CPU with any cache at all --- and they've never seemed to be much of a big deal either.

I don't think we even know what the bug is yet, just lots of informed speculation...

Ironically, "lots of informed speculation" seems to be exactly what the bug is about. ;-)

The thing is, AMD probably very narrowly just missed this one --- if they did more aggressive speculative execution, they would be the same.

I’m not (yet) clear on if/how this impacts aarch64 (ARM architecture) chips, but the distinction between how it affects Intel vs. doesn’t affect AMD reminds us of a fundamental lesson we seem to have conveniently forgotten: monocultures of anything are bad. We need diversity and diversification in order to have reasonable amount of robustness in the face of unknowable, unpredictable risks.

I’m wondering whether ARM chips are affected and, if they are, whether they are uniformly affected or whether it depends on vendor implementation choices.

A patch is in the works for ARM chips as well (http://lists.infradead.org/pipermail/linux-arm-kernel/2017-N...), but I am not clear on whether it's enabled by default. It seems like a good idea to have this, independent of current ARM vulnerability.

Yep, I'm still wondering how this affects ARM and if it can be corrected in microcode on that platform.

I'm also wondering if/hoping for a fix that involves increased memory usage instead of the speed.

I’m not a very proficient programmer/developer, so please bear with me. I’m intrigued by your reference to trading off greater memory footprint in exchange for diminishing performance by less. I'm trying to understand how this would work in practice: do you envision ‘padding’ the critical data structures with more empty or randomised buffer zones? Wouldn't that incur an additional penalty for the data transfer (Von Neumann bottleneck)? Would blank data be sufficient or would there be some additional kind of memory effect in the DRAM/SRAM that would demand using randomised data overwrites? How would you generate that random data?

(I apologise if this is blindingly obvious for somebody well versed in low-level programming.)

Oh I have no idea actually; it's just pretty normal to see a speed-memory trade in a lot of problems. I'm definitely not low-level enough either.

ARM posted a good overview and affected product list today:


>We need diversity and diversification in order to have reasonable amount of robustness

Ironically, in human populations it produces the opposite effect.

Please don't take HN threads on ideological tangents. The point you're making has by now become well-known ideological flamebait. We ban users who derail HN threads in such directions, so please don't.


Edit: since https://news.ycombinator.com/item?id=16063749 makes it clear that you're using HN for that purpose, which is not allowed here, I've banned this account. Would you please not create accounts to break the site guidelines with?


More diverse population—> lower trust —> less social will to support each other.

So we should be tribal societies, then? (I.e., endogamous. As in cousin marriage.)

Here are some numbers quantifying the problem. Big caveats apply as they are very preliminary, but the hit due to the software patches looks extremely significant:


Superficially, it seems like the performance hit mostly scales with IOPS or transactions per second, which might have some pretty serious implications for performance/dollar in the kinds of intensive back-end applications where Intel currently dominates and AMD is trying to make inroads with EPYC.

It has very little to do with what kind of syscalls (I/O or other kinds) and all to do with how many syscalls a given application makes per given time period. Compute bound applications are already avoiding syscalls in their hotter parts. This will mostly be a blow to databases, caching servers and other such I/O limited applications.
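A back-of-the-envelope way to see why the syscall rate is what matters. This is a rough model, and the 200 ns of added cost per syscall is a made-up placeholder, not a measured figure; `added_cpu_fraction` is a hypothetical helper name.

```python
def added_cpu_fraction(syscalls_per_sec, extra_ns_per_syscall):
    """Extra CPU time per second of original runtime, as a fraction.

    Simple linear model: every syscall pays the same fixed extra cost.
    """
    return syscalls_per_sec * extra_ns_per_syscall / 1e9

EXTRA_NS = 200  # hypothetical added cost per syscall, for illustration only

compute_bound = added_cpu_fraction(10_000, EXTRA_NS)     # 0.002, i.e. 0.2%
io_heavy = added_cpu_fraction(1_000_000, EXTRA_NS)       # 0.2, i.e. 20%
```

Same per-syscall penalty, a hundred times the syscall rate, a hundred times the hit. The type of syscall doesn't appear in the model at all.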

In other words, don't worry - pretty much all the key performance bottlenecks most of us deal with at work will be getting tighter, but at least our video games will still run OK.

Well, sounds like that time when there is no water left but there are still beers in the fridge.

At least we can play our sorrows away.

Right now I am just hoping that it won't add significant overhead to OpenGL. My application already has a bottleneck on changing OpenGL states and issuing render commands, and I have no idea how much of that time is spent making syscalls.
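One rough way to get a first answer to "how much time is spent in the kernel" is to compare user and system CPU time around the code in question. A minimal sketch for Unix-like systems, assuming the standard `resource` module is available; `system_time_fraction` is a hypothetical helper, and the result only says how much CPU time was spent in kernel mode, not how much of it the patch would affect.

```python
import os
import resource

def system_time_fraction(workload):
    """Run workload() and return the share of its CPU time spent in kernel mode."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    workload()
    after = resource.getrusage(resource.RUSAGE_SELF)
    user = after.ru_utime - before.ru_utime      # seconds in user mode
    system = after.ru_stime - before.ru_stime    # seconds in kernel mode
    total = user + system
    return system / total if total > 0 else 0.0

# Example workload: many small stat() calls on the current directory.
fraction = system_time_fraction(lambda: [os.stat(".") for _ in range(20000)])
```

For a native application, tools such as `strace -c` or `perf` give a per-syscall breakdown; the snippet above is only meant to show the idea.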

OpenGL implementations shouldn't be affected by syscall overhead. Historically it's been DirectX that had a syscall per draw call, but I believe both now just write the command to GPU RAM directly.

That's at least somewhat encouraging. Nevertheless, it sounds rather like the old, "Yes, but will it run Crysis?" question will perhaps have renewed relevance.

That’s terrible - these are precisely the sort of customers who will suddenly get hit with a performance hit that will negatively impact their operations!

This sounds like it’s positively evil for outfits that rely heavily on virtualisation also.

Would it be fair to say that this might cause an acceleration in the shift from on-prem to the public cloud, where there are performance guarantees?

There aren't really performance guarantees for CPU and the ones that are there won't help here. When this patch is released big providers will have the same hardware as before and sell it in the same way - but the OS and userspace will just be slower for some use patterns.

There isn't a guarantee that will compensate for that any more than if you updated some piece of your software infrastructure to a new version that just got slower.

As I mentioned in the other thread yesterday, databases and database-like applications are going to be hit particularly hard. Even more so on fast flash storage. Double whammy compared to apps just doing network IO.

And while databases try to minimize the number of syscalls they still end up doing a lot of them for read, writeout, flush.

How would you trade this knowledge?

Intel has already dropped and AMD is up. Maybe there's more to move, but first-order effects are at least partially priced in already.

But what about second-order effects? Seems like virtualization should be vulnerable (VMWare and Citrix), but maybe they actually benefit as customers add more capacity.

Software-defined networking and cloud databases should also suffer though it's unclear how to trade these.

AWS, Google Cloud and Azure might benefit as customers add capacity but there's no way to trade the business units. So what about cloud customers where compute costs are already a large percentage of total revenue?

Netflix should be OK but Snap and Twilio could get squeezed hard. Akamai and Cloudflare might have higher costs they can't pass through to customers.

And where's the upside? Who benefits? If the performance hit causes companies to add capacity, maybe semiconductor and DRAM suppliers like Micron would benefit.

First and foremost, don't buy Intel stock. This doesn't help, but I have long considered Intel a company that doesn't know what it is doing. Here is why:

I owned Intel back in 2010 when they bought McAfee for 7.8 billion. They said the future of CPUs and chip tech was embedding security on the chip. The real answer was mobile and GPUs.

Not only did I immediately know this was a horrendous deal, it clearly showed that the CEO and management had no clue on their own market's desires and direction. At the time, I was hoping they were going to buy Nvidia, it would have been a larger target to digest at 10 bil, but doable by Intel at the time.

The McAfee purchase turned out to be one of the worst large corporate purchases in history. Had they invested the $7.8 billion blindly into an S&P 500 index fund, their investment would be worth ~19-20 billion.

Arguably, the only sane trade here would be to buy Intel and short AMD if you think the size of the move is greater than it should have been. However, there are many reasons not to do this until there is more information, and as that information comes out, it will likely be incorporated into the continual corrections in price. As to second order effects, don't count on this mattering. Unless you are planning on trading huge amounts of money, the risk/reward is probably not great. If you have to ask...

I think the model for cloud vendors would be quite complicated. Not every version of the CPU and not every application is impacted as much (new intel processors with PCID will suffer less).

Add on top of that the fact that a lot of cloud customers over-provision (there are good scientific papers on how much spare CPU capacity there is). Cloud service providers that sell things on a per request / real CPU usage model (vs reserved capacity) prob benefit more.

Also, you can't just separate trading in AWS or GCE from the rest of the core business.

Potentially business units of DELL, HP, IBM, ... should do better as people use this as a justification to upgrade overdue hardware, since they have to cover 5% to 10% lower performance (needing more units to cover that).

Agree on that last paragraph. The only reasonable thing people can do is buy more hardware to cover the performance loss and/or buy more hardware that's not needed using the bug as a pretext to get the budget approved now.

SDN might actually be OK. On the high-end they bend over backwards to not enter the kernel at all anyway, using stuff like DPDK. They stopped even using interrupts years ago.

Possibility: a cloud vendor who mostly-uses-AMD, versus one that mostly-uses-Intel, just got handed a massive price/performance relative advantage.

Automated ic-layout and software proofing benefits from this.

Not sure why you are being downvoted. It's an interesting question. I'm in on AMD for the time, just to see how it flows.

Don't flash devices use NVMe (e.g. userspace queues) now and avoid the kernel altogether for read and write operations? Shouldn't they have no impact?

NVMe means each drive can have multiple queues, but they're still managed by a kernel driver. You may be thinking of SPDK, which includes a usermode NVMe driver but requires you to rewrite your application. And many systems are still using SAS or SATA SSDs.

Not by default they don't. That only works if you're willing to dedicate that entire physical drive to a single application anyway.

In the future this is possible on Linux with the filesystems that support DAX. Currently this is all pretty experimental, with lots of work being done in this space in the last two years.

But this will require you to have the right kind of flash storage, right kind of fs, right kind of mount options, and probably a different code path in userspace for DAX vs traditional storage.

So we're a little ways away from this.

DAX doesn't appear related here at all. That is about bypassing the page cache for block devices that don't need one.

That doesn't move anything from kernel land into userspace, certainly not in the app's process in userspace anyway.

If you bypass the page cache, you do not have to use read()/write(); with mmap you avoid the syscall overhead. This matters a lot for high-IOPS devices. Also, these newfangled devices claim to support word- or cache-line-granularity sync using normal CPU flush instructions, which also avoids the fsync syscall.

One does not follow the other. Where are any references to how this will let you bypass read & write? User-space applications are still interacting with a filesystem, which they access via read/write and not a block device.

There's no talk in the DAX information about how this results in a zero-syscall filesystem API, and I'm not seeing how that would ever work given there would then be zero protections on anything. You need a handle, and that handle needs security. All of that is done today via syscalls, and DAX isn't changing that interface at all. So where is the API to userspace changing?

Please re-read my above comment. There is no new API. The DAX userspace API is mmap.

This work is experimental but you can mmap a single file on a filesystem on this device using new DAX capabilities. Most access will no longer require a syscall.

This comes with all the usual semantics and trappings of mmap plus some additional caveats as to how the filesystem / DAX / hardware is implemented. Most reads/writes will not require a trip to the kernel using the normal read()/write() syscalls. Additionally, there is no RAM page cache backing this mmap; instead the device is mapped directly at a virtual address (like DMA).
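For readers who haven't used mmap, a minimal sketch of the access pattern being described. This is an ordinary page-cache-backed mapping of a temp file, not DAX, so it only illustrates the API shape: after the mapping is set up, loads and stores are plain memory accesses rather than read()/write() calls (page faults can still enter the kernel). `mmap_roundtrip` is a hypothetical helper name.

```python
import mmap
import os
import tempfile

def mmap_roundtrip(payload):
    """Write payload to a file through a memory mapping and read it back.

    payload must be non-empty, since a zero-length file cannot be mapped.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, len(payload))            # size the file (a syscall)
        with mmap.mmap(fd, len(payload)) as m:    # mmap() itself is a syscall
            m[:] = payload                        # memory stores, no write() call
            data = bytes(m[:])                    # memory loads, no read() call
            m.flush()                             # explicit flush of the mapping (msync on Unix)
        return data
    finally:
        os.close(fd)
        os.unlink(path)

result = mmap_roundtrip(b"hello")
```

The setup and the flush still go through the kernel here; the claim in the thread is that DAX-capable devices could eventually do the flush step with CPU instructions as well.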

Finally, flush for these kinds of devices is at the block level implemented using normal instructions and not fsync. Flush is going to be done using the CLWB instruction. See: https://software.intel.com/en-us/blogs/2016/09/12/deprecate-...

LWN.net has lots of articles and links in their archives from 2016/2017. It's a really good read. Sadly I do not have time to dig more of them up for you. Do a search for site:lwn.net and search for DAX or MAP_DIRECT.

Please re-read mine. How is the number of syscalls (which is the only thing that matters in this context) changing if there's no API change to apps? mmap already exists and already avoids the syscall. DAX "just" makes the implementation faster, but it doesn't appear to have any impact on number of syscalls

As in, if you call read/write instead of using mmap you're still getting a syscall regardless of if DAX is supported or not. Not everything can use mmap. mmap is not a direct replacement for read/write in all scenarios.

Do we have a performance estimate? I can eat 20 or 30%, but I can't eat 90%.

This comment further down thread mentions it's 20% in Postgres. https://news.ycombinator.com/item?id=16061926

...when running SELECT 1 over a loopback socket.

The reply to that comment is accurate: that's a pathological case. Probably an order of magnitude off.

We're still learning, but it looks like pgbench is 7% to 15% off:


I've seen that message. It acknowledges the same problems: do-nothing problems over a local unix socket.

Real-world use cases introduce much more latency from other sources in the first place.

I'm sticking with an expectation in the 2%-5% range.

Yep, this is getting blown way out of proportion by all of these tiny scripts that just sit around connecting to themselves. Even pgbench is theoretical and intended for tuning; you're not going to hit your max tps in your Real Code that is doing Real Work.

In the real world, where code is doing real things besides just entering/exiting itself all day, I think it's going to be a stretch to see even a 5% performance impact, let alone 10%.

I think 5% is a reasonable guess for a database. Even a well-designed database does have to do a lot of IO, both network and disk. It's just not a "fixable" thing.

But overall, yeah.

The claim is that it's 2% to 5% in most general uses on systems that have PCID support. If that's the case then I'm willing to bet that databases on fast flash storage are a lot more impacted than this and pure CPU bound tasks (such as encoding video) are less impacted.

The reality is that OLTP database execution time is not dominated by CPU computation but instead by IO time. Most transactions in OLTP systems fetch a handful of tuples. Most time is dedicated to fetching the tuples (and maybe indices) from disk and then sending them over the network.

New disk devices lowered the latency significantly while syscall time has barely gotten better.

So in OLTP databases I expect the impact to be closer to 10% to 15%. So up to 3x over the base case.

> I've seen that message. It acknowledges the same problems: do-nothing problems over a local unix socket.

The first set of numbers isn't actually unrealistic. Doing lots of primary key lookups over low latency links is fairly common.

The "SELECT 1" benchmark obviously was just to show something close to the worst case.

> The first set of numbers isn't actually unrealistic. Doing lots of primary key lookups over low latency links is fairly common.

Latency through loopback on my machine takes 0.07ms. Latency to the machine sitting next to me is 5ms.

We're actually (and to think, today I trotted out that joke about what you call a group of nerds--a well, actually) talking multiple orders of magnitude through which kernel traps are being amplified.
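To put rough numbers on that, taking the two latencies quoted above at face value and assuming a made-up fixed cost of 0.007 ms of extra kernel time per round trip (a placeholder, not a measurement):

```python
def relative_overhead(round_trip_ms, added_kernel_ms):
    """Added kernel time as a fraction of the pre-patch round-trip time."""
    return added_kernel_ms / round_trip_ms

ADDED_MS = 0.007  # hypothetical extra kernel time per round trip

loopback = relative_overhead(0.07, ADDED_MS)   # about 0.10, i.e. ~10%
lan = relative_overhead(5.0, ADDED_MS)         # about 0.0014, i.e. ~0.14%
ratio = loopback / lan                         # about 71x
```

The same absolute cost looks roughly 70 times larger on loopback than on a 5 ms link, so the result depends heavily on which latency figure is realistic for a given deployment.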

> Latency through loopback on my machine takes 0.07ms. Latency to the machine sitting next to me is 5ms.

Uh, latency in local gigabit net is a LOT lower than 5ms.

> We're actually (and to think, today I trotted out that joke about what you call a group of nerds--a well, actually) talking multiple orders of magnitude through which kernel traps are being amplified.

I've measured it through network as well, and the impact is smaller, but still large if you just increase the number of connections a bit.

If so, this definitely moves the needle on the EPYC vs Xeon price/performance ratio.

All the Oracle DBAs out there are in for some suffering. Forget the cost of 30% extra compute, what about the 30% increase to Oracle licensing?

"SPARC user: not affected!"

--Oracle's marketing tomorrow, probably

(to their credit, SPARC does fully isolate kernel and user memory pages, so they were ahead of the curve here... for all 10 of their users who run anything other than Oracle DB on their systems.)

Phoronix strikes again! I admire Michael's consistency and dedication and their benchmarks have certainly gotten better over the years as PTS has matured, but everything on Phoronix still needs to be taken with a generous helping of salt. New readers generally learn this after a few months; it applies not only to their benchmarks, but also their "news".

The most obvious issue with this benchmark is that Phoronix is testing the latest rcs, with all of their changes, against the last stable version [EDIT: I misread or this changed overnight, see below] that doesn't have PTI integrated, instead of just isolating the PTI patchset. The right way to do this would be to use the same kernel version and either cherry-pick the specific patches or trust that the `nopti` boot parameter sufficiently disables the feature. That alone makes the test worthless.
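For anyone trying an A/B run on a single kernel version, a small sketch for checking which boot setting a given run actually used. It assumes a Linux system exposing `/proc/cmdline` and looks for `nopti` or the equivalent `pti=off`; the helper names are made up. It only reports what was requested at boot, not whether the running kernel contains the PTI patchset at all.

```python
def pti_disabled_by_cmdline(cmdline):
    """True if the kernel command line asks for page table isolation to be off."""
    params = cmdline.split()
    return "nopti" in params or "pti=off" in params

def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read()

# Example with literal strings, so it runs anywhere:
with_flag = pti_disabled_by_cmdline("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro nopti")
without_flag = pti_disabled_by_cmdline("BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet")
```

Recording this alongside each benchmark result makes it harder to accidentally compare two runs that differ in more than the one setting.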

There is no way this causes a universal 30% perf deduction, especially not for workloads that are IO-bound (i.e., most real-world workloads). This is a significant hit for Intel, but it's not going to reduce global compute capacity by 30% overnight.

EDIT: Looking at the Phoronix page, the benchmark actually appears to use 4.15-rc5 as "pre" and 4.15-some-unspecified-git-pull-from-Dec-31-that-isn't-called-rc6 as "post". I thought I had read 4.14.8 there last night, but may not have. Regardless, the point stands -- these are different versions of the kernel and the tests do not reflect the impact of the PTI patchset.

So you’re saying that the latest RCS, without the patch, was supposed to be slower than stable by at least 10%? How often do companies release performance downgrades of that scale? That’s also very unlikely.

>So you’re saying that the latest RCS, without the patch, was supposed to be slower than stable by at least 10%?

I'm saying that it's not a reliable measurement of the impact of the PTI patchset. There was a PgSQL performance anecdote [0] (actually tested with the real boot parameters instead of entirely different versions of the kernel) that showed 9% performance decrease posted to LKML, which Linus described as "pretty much in line with expectations". [1]

Quoting further from that mail:

> Something around 5% performance impact of the isolation is what people are looking at.

> Obviously it depends on just exactly what you do. Some loads will hardly be affected at all, if they just spend all their time in user space. And if you do a lot of small system calls, you might see double-digit slowdowns.

So in general, the hit should be around 5%, and "[y]ou might see double-digit slowdowns" seems like the hit on a worst-case workload is hovering closer to the 10% range than 30%. That's also what the anecdote from LKML shows, unlike Phoronix which shows 25%-30% or worse.

This is more of an attrition thing than a staggering loss. With people saying MS patched this in November, it would be interesting to see if people saw a similar 5-10% degradation in Windows benchmarks since that time.

>How often do companies release performance downgrades of that scale?

I don't know which "company" you're referring to here, but substantial changes in kernel performance characteristics are pretty common during the Linux development/RC process, and yes, definitely some workloads will often see changes +/- 10% between the roughly bi-monthly stable kernel releases.

If you're surprised that Linux development is so "lively", you're not alone. That's one of the selling points of other OSes like FreeBSD.

[0] https://lkml.org/lkml/2018/1/2/678

[1] https://lkml.org/lkml/2018/1/2/703

I wonder if we'll see some performance return as subsequent patches are produced. I can't tell from the coverage so far if this is possible.

A lot of people have noticed that High Sierra is slower than Sierra, specifically for filesystem operations with APFS. I wonder if Apple knew about this ahead of time and this explains the overhead?

Probably not. APFS just does a lot more than HFS, so there is a huge performance impact on disk related issues before this change goes in.

This is an all-hands-on-deck kind of situation. Apple doesn't usually do well with security fire drills like this.

If work in NT and Linux kernels started in November, they must know. Intel must've told them; the alternative of Apple learning about this from a third party and grilling them over it would be too scary.

I just wonder whether Apple is threatening to move all their Macs to AMD, or to ARM?


Unless they live in a bubble, I’m sure most people at Apple are already aware of this.

It was a joke referencing the login vulnerability that was disclosed on twitter a couple of months ago.
