Intel Confronts Potential ‘PR Nightmare’ With Reported Chip Flaw (bloomberg.com)
1026 points by el_duderino 76 days ago | hide | past | web | favorite | 543 comments

This is a clusterf***/big deal. Beyond the security implications, it means that all companies paying for computing resources will have to pay roughly 30% more overnight on cloud expenses for the same amount of CPU, assuming they can just scale up their infrastructure.

I know that bugs happen and that there was nothing intentional in this one, but at times like this it's hard to hold at bay the temptation of claiming for a class lawsuit against Intel...

It's a good thing CPU is fairly compressible. Unless you meter it very carefully, you'll barely notice the performance hit, and it won't impact you that much. Very few of my physical boxes are over 70% CPU utilization on a daily average.
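As a back-of-envelope sketch of that headroom argument (all numbers hypothetical): a fixed per-work overhead inflates average utilization proportionally, so a box with slack absorbs even the worst-case hit, while a tightly tuned one saturates.

```python
def utilization_after(util, overhead):
    # If every unit of work now costs (1 + overhead) CPU time,
    # average utilization inflates by the same factor.
    return util * (1.0 + overhead)

# A box averaging 70% absorbs a 30% hit and stays (barely) under 100%...
assert utilization_after(0.70, 0.30) < 1.0
# ...but a box tuned to 90% utilization saturates.
assert utilization_after(0.90, 0.30) > 1.0
```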

It's, however, really bad if you sell CPU cycles for a living. You just lost between 5 and 30% of your capacity. If you have a large building, you just lost part of your parking lot to the Intel Kernel Page Problem building.

Problem is, most companies that need a lot of power only care about one thing - peak performance. And they tune it carefully in order to not overspend while guaranteeing minimal downtime. This means that they'll have to pretty much scale their infrastructure up by exactly 30%. That's a LOT for these big clients.

Honestly, I'd just make sure the server firewalls are super tight and not take in the future patches. At least for now.

Very very few highly tuned "peak performance" workloads are dominated by syscall overhead like the test that produced that 30% number was. It's best to hold off on the hyperbole.

I have a compute-bound workload that scales horizontally and is already dominated by the cost of system calls. This will effectively, directly cause me to buy 30% more compute on a huge infrastructure (~2,000 physical machines; quite beefy dual-socket machines with a lot of memory).

I know I’m not alone.

Then again, think of microservices, Kubernetes for instance: network requests are system calls.

If your workload has no code that's untrusted, you can safely skip this patch or disable it on boot. If not, at 2,000+ physical machines, it may be worth moving some of that into kernel modules that would collapse a couple of syscalls into a single higher-level one.

The VM host will still have the patch applied, won't it?

Yes, but if it's your metal, you don't need to.

Good idea. But it’s a windows executable and depends quite heavily on windows specifics.

(In my case anyway)

30% overhead might be incentive to revisit the assumption that we can’t rewrite it for Linux.

You still can move some functionality to a device driver or something else that runs in the NT kernel space.

But then you have to release the module as GPL, no?

Only if you want to do something that would otherwise be a violation of copyright, e.g. distribute the module to other people (assuming it's sufficiently entangled with Linux to be a derivative work thereof). The GPL only licenses you to do things that you otherwise couldn't do; it doesn't restrict you from doing things that were never a violation of copyright (e.g. privately modifying your own things).

No, kernel modules can be closed source; GPU drivers are a common example of a closed-source kernel module.

Will you be buying Intel-based machines? Or will you be running a hybrid-architecture cluster now?

I don’t know very much about computing on that scale, but I wonder if all the people selling off Intel stock are thinking this story through.

AMD server CPUs currently outperform Intel on some multi-threaded benchmarks. This usually isn't a problem for people buying for peak performance, because you can always buy more CPUs to increase parallel throughput, but it's harder to make single threads faster.

It's possible that the patches applied to fix this bug will cause some single-threaded benchmarks to change from Intel being the fastest to AMD being the fastest.

So for what it is worth my company has all Intel kit. We run servers that run docker. In each docker container we do build / test for our product. That is all we use them for. 1RU with 2 blades, each blade is dual socket, 72 total cores, 512GB RAM. We will not apply this patch as none of this is public facing and we do not want the hit to build / test throughput. The one big thing that this has done is we were looking at AMD for new servers and that has now become a higher priority on the to do list. Given our environment we care about the number of containers we can run, period.

It is overwhelmingly likely that we’ll buy more Intel. Performance per watt has always been superior, and AMD has to prove itself over time before we’d buy it.

Not trying to kill expectations; this decision isn’t mine alone. You know the old saying, “nobody ever got fired for buying Cisco”; that applies to Intel too.

No, but lots of workloads are built with latency in mind. For APIs that talk to each other in long serial chains, don't be surprised if request responses take significantly longer in many, many workflows.

I wouldn’t isolate your concern to firewalls and bad actors that break in over SSH. If they manage to find a vulnerability in your app that allows remote code execution this could help them make that problem much worse. Also VM/container escapes are a big problem if you use a cloud provider.

> I'd just make sure the server firewalls are super tight and not take in the future patches. At least for now.

Good security is about layers. No one layer can be assumed to be watertight, but with enough layers you hopefully get to a good place.

If they really care about peak performance, I don't believe the PTI patch will affect them. If you can change your system in a way that the power-hungry part does not work on untrusted data, you can boot with "nopti" and ignore it. Systems which both need lots of maxed-out CPUs and traffic directly from wild internet are pretty rare. They're unlikely to run on virtualised systems either.

> Systems which both need lots of maxed-out CPUs and traffic directly from wild internet are pretty rare.

That's a good description of basically every cloud environment out there, from AWS on down.

In other words they are extremely common.

There are many ways to tune such workloads and I suspect our software will get better as a result.

We'll start to get conscious about the number of syscalls we use on each operation, start using large buffers, start buffering stuff user-side...

The CPUs in cloud environments are not maxed out in general. There are some areas, like batch processing and compute-specific VMs. For other cases, there's quite a bit of overcommitting of resources. And that's before you start doing scheduling that mixes workloads on a physical host for better utilisation. Source: worked on a public cloud environment.

I agree with you on most VMs. But once you schedule mixed workloads, you want each host to be balanced so that all of its capacity is utilized evenly. Which means that if CPU use increases across the fleet, you will want new hardware with more CPU.

Either that, or you'll have to put up with some processes taking longer.

About 10 years ago I was mentored by a guy who was an utter wizard at queuing theory, and who bugfixed a whole bunch of nasty issues in cellular telecoms hardware through his understanding of how queuing theory impacted code execution.

TL;DR: queue behaviour gets nonlinear as you approach the theoretical max load. If you are running your processors at high load, even a small change in code throughput makes a huge difference to real-world behaviour.
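The nonlinearity shows up even in the simplest textbook model. A minimal sketch (M/M/1 queue, made-up rates): mean time in system is W = 1/(μ − λ), so a small capacity loss near saturation blows up latency.

```python
def mm1_time_in_system(service_rate, arrival_rate):
    # M/M/1 queue: mean time in system W = 1 / (mu - lambda).
    # Only valid while the queue is stable (lambda < mu).
    assert arrival_rate < service_rate
    return 1.0 / (service_rate - arrival_rate)

# Arrivals fixed at 90 req/s. A mere 5% capacity loss (100 -> 95 req/s)
# doubles mean latency, because we were already near saturation.
w_before = mm1_time_in_system(100.0, 90.0)  # 0.1 s
w_after = mm1_time_in_system(95.0, 90.0)    # 0.2 s
assert abs(w_after / w_before - 2.0) < 1e-9
```

The same 5% capacity loss at low load (say, arrivals of 10 req/s) barely moves the needle, which is exactly the nonlinearity described above.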

I believe I saw a Strange Loop talk about this specific issue in Clojure. The speaker was talking about channels, not queues, though.

Do you have a link?

Hmm. Since everyone that sells (Intel) CPU cycles for a living suffers the same loss of supply this boils down to pricing; the same demand chasing fewer cycles will drive up prices and the market will adapt.

30% is a big hit. I'm wondering if that isn't a bit exaggerated, or perhaps the consequence of a poorly optimized workaround that will rapidly improve. I recall seeing figures on the order of 3% only a few days ago.

30% is a worst case for a workload optimized to hit the performance bug as hard as possible.

How big it will be for your workload is a function of what your workload is. Benchmark if it is important to you.

50%, rather, is the worst-case scenario; 30% is a bad case, and 5% the best case. Which is still a lot for large cloud providers like Amazon, Google, and Microsoft.

Any public information on how someone like Google or Facebook handle this? Do they have enough spare capacity to patch or will they need to build further capacity first? I could imagine 10% of Google's capacity (internal services, not Google Cloud) is at least a large datacentre.

I know someone who works for amazon and he said that they didn't need to do anything or buy any more servers.

>It's, however, really bad if you sell CPU cycles for a living.

Who really sells CPU cycles? Cloud providers sell instances priced per core. So the real hit is taken by the customers, since they have to shell out for more instances for the same amount of computing power.

The hit I see is by providers of 'serverless' computing, since they charge per request and have their margins reduced.

> The hit I see is by providers of 'serverless' computing, since they charge per request and have their margins reduced.

AWS, Azure, and GCP all bill serverless with a combination of per-request fees and compute (GB-seconds), so I'd expect the entire hit to be passed on to the user since this will cause increased compute time for each request. N requests that used to average 300ms each will now be N requests that average, say, 400ms, so the per-request billing remains the same and the compute billing will increase by approximately 30%.
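A rough sketch of that billing arithmetic (volumes and prices are made up for illustration, not quotes of any provider's actual rates):

```python
def serverless_bill(requests, avg_seconds, memory_gb,
                    price_per_request, price_per_gb_second):
    # Typical serverless pricing: a flat per-request fee plus
    # compute billed in GB-seconds (memory x duration).
    request_cost = requests * price_per_request
    compute_cost = requests * avg_seconds * memory_gb * price_per_gb_second
    return request_cost + compute_cost

# Hypothetical numbers: 10M requests/month at 512 MB.
before = serverless_bill(10_000_000, 0.300, 0.5, 2e-7, 1.7e-5)
after = serverless_bill(10_000_000, 0.400, 0.5, 2e-7, 1.7e-5)

# The per-request part is unchanged; the compute part grows by
# exactly the duration ratio (0.400 / 0.300, i.e. +33%).
assert after > before
```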

I don't understand what exactly you're saying. All of those services have serverless services, but they also have server based instances which abstract compute to amount of cores and RAM rather than CPU cycles. And most use is out of the services which aren't serverless.

Then it's because you don't know how modern cloud works.

I now see that my misunderstanding was about who exactly the users were and who the provider was. I thought of providers as only GCE, AWS, etc., while the commenter, I believe, also counted users of those services (who were in turn providers of serverless services).

A lot of companies do, including many ycombinator startups. Think of anything that's analytics, data science, data warehouse or advertising related. The costs to run their service just took a hit.

Their competitors are also affected.

Also, a 30% decrease is equivalent to setting Moore's law back 7 months. A 5% loss is only setting it back 1 month. I know that's a bit of a naive calculation, but the point is that computing power has long operated in an exponential domain, so big differences in absolute numbers aren't necessarily a big deal.
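That back-of-envelope can be made exact. Assuming (hypothetically) a 24-month doubling period, the time to make up a fractional slowdown f solves 2^(t/T) = 1/(1 − f); the exact answer lands somewhat higher than the figures above, but the same order of magnitude.

```python
import math

def months_to_recover(slowdown, doubling_months=24.0):
    # Solve 2 ** (t / T) = 1 / (1 - f) for t: the time it takes
    # exponential growth to make up a fractional slowdown f.
    return doubling_months * math.log2(1.0 / (1.0 - slowdown))

m30 = months_to_recover(0.30)  # ~12.4 months
m05 = months_to_recover(0.05)  # ~1.8 months
assert 12.0 < m30 < 13.0
assert 1.0 < m05 < 2.0
```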

According to this patch comment, AMD x86 chips are not affected: https://lkml.org/lkml/2017/12/27/2

Sure but who is using AMD chips in place of Intel server chips? If company A competes in the widget market against company B and they both built their server infrastructure on Intel then neither company gained an advantage due to a performance degradation in Intel hardware.

> Sure but who is using AMD chips in place of Intel server chips?

Well... everyone who bought AMD. Some people managed to see beyond the hype and go for the option that made sense.

The overwhelming majority of the cloud runs on Intel. Saying AMD is slightly better off doesn't really help if my systems are built on Intel. This is the case for most people.

What hype are you referring to? Are you suggesting the people who bought AMD knew this was a problem for Intel?

Azure got some AMD EPYC.

>Sure but who is using AMD chips in place of Intel server chips?

Maybe a lot more now?

CPU speed hasn't followed Moore's law since 2003ish. (Number of transistors is still following Moore's law, but that doesn't necessarily directly help you when your program is suddenly 3-30% slower.)

A CPU from 2017 is going to run your programs a hell of a lot faster than one from 2003. Even if they technically have the same clock speed. Look at benchmarks for instance: https://www.cpubenchmark.net/high_end_cpus.html

The claim wasn't "CPUs in 2017 are not faster than CPUs in 2003" or even "CPUs in 2017 are not much faster than CPUs in 2003"; the claim was that they haven't followed Moore's law since 2003, so applying it to CPU speed nowadays is inaccurate. Of course CPUs are faster now than they were 14 years ago, just not as fast as the case where Moore's law still applied to CPU speed.

Moore’s Law doesn’t describe CPU clock speed increases.

Dennard scaling however did deal with clock speed (indirectly via power). It has failed since about 2005.

I don't think you can reliably phrase this in terms of Moore's law. Moore's law mostly concerns raw FLOPs. It's less useful for predicting hardware performance for operations that are governed by limitations like I/O and memory latency. And this slowdown, if I understand it correctly, is largely driven by memory latency.

One of the rationales for cloud computing is that it saves money by cranking up utilisation: providers observe how much users "really" use and then provision that much.

True, sometimes you will leave boxes at low utilisation for various reasons, e.g. to deal with traffic spikes. But those reasons have not gone away. So now, instead of having a predictable increase in CPU cost, you have an unpredictable increase in performance snafus.

The only good news is that the real performance hit will be less than 30% on many workloads. Especially once the providers start juggling and optimising.

>"It's a good thing CPU is fairly compressible."

What do you mean by "compressible"?

Presumably, for a certain important class of application, CPU is not used "densely", i.e. continually. Instead it's used intermittently, like a gas rather than a solid... Hence compressibility. Such applications are far from being CPU-bound, in other words.

So a cloud provider would be an example. Compressible similar to a sparse file I guess as well. Thanks this makes sense.

I think it was meant that a normal application does not utilise the CPU all the time, which can be seen by looking at the task manager CPU usage % = X. Any extra processing needed to fix this bug will have to come out of the remaining 100-X%. This is OK as long as you have enough spare %, and can afford the extra power usage for that processing.

That makes sense, thanks. This is a big deal.

Virtualization is one popular way to drive up CPU utilization. The more diverse the workloads running on a given server, the more even the CPU usage tends to get. This way, if you have 100 workloads that each peak at 100% but average 1%, your total CPU usage will tend to be smooth at around 100% of one core, and any overallocation will smooth out over time (a job that would take 1 second may take up to 10).
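A quick simulation of that smoothing effect (all parameters made up): each workload is idle 99% of the time but occasionally needs a full core; aggregating many of them makes total demand far more predictable relative to its mean.

```python
import random
import statistics

random.seed(1)

def demand_trace(n_workloads, ticks, p_busy=0.01):
    # Each tick, each workload independently needs a full core
    # with probability p_busy (so it averages ~1% utilization).
    return [sum(random.random() < p_busy for _ in range(n_workloads))
            for _ in range(ticks)]

def coefficient_of_variation(trace):
    # Relative variability: stddev divided by mean.
    return statistics.pstdev(trace) / statistics.mean(trace)

solo = demand_trace(1, 20_000)      # one bursty workload: all-or-nothing
packed = demand_trace(100, 20_000)  # 100 multiplexed workloads

# Aggregation smooths demand: relative variability drops roughly 10x
# (stddev of a sum of independent loads grows like sqrt(N), mean like N).
assert coefficient_of_variation(packed) < coefficient_of_variation(solo)
```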

No vendor can afford to do it at a loss for long. One way or another the customer will end up paying

There's also latency though :/ it seems that programs that make a lot of syscalls will be affected more than programs that are doing in-process calculations

We'll start being more syscall-conscious when we write our programs. We'll batch more on the user-mode side and try to use fewer syscalls to do the job.

Kernel ABIs will eventually reflect that and grow higher-level expensive calls that replace groups of currently cheap syscalls (which will become expensive after the fix).
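A minimal sketch of the user-side batching idea (illustrative only): coalescing many small writes in userspace turns a thousand kernel crossings into one, producing the same bytes on disk.

```python
import os
import tempfile

def write_chatty(fd, chunks):
    # One os.write() -- one syscall, one user/kernel transition -- per chunk.
    for chunk in chunks:
        os.write(fd, chunk)

def write_batched(fd, chunks):
    # Coalesce in userspace first: a single syscall for all chunks.
    os.write(fd, b"".join(chunks))

chunks = [b"x" * 16] * 1000
with tempfile.TemporaryFile() as f1, tempfile.TemporaryFile() as f2:
    write_chatty(f1.fileno(), chunks)
    write_batched(f2.fileno(), chunks)
    f1.seek(0)
    f2.seek(0)
    chatty_bytes = f1.read()
    batched_bytes = f2.read()

# Same result, roughly 1000x fewer kernel entries for the batched path.
assert chatty_bytes == batched_bytes
```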

And Intel will profit handsomely from next generation CPUs that'll get an instant up-to-30% performance boost for fixing this bug.

What about all the kernel interrupts due to network and storage traffic?

Maybe the scheduler could dedicate a core to interrupts and software that has small quanta and page tables? Can't really think of a code solution that doesn't sound stupid when I type it.

Another consideration is power usage in data centers. Server power usage is annoyingly complex, and once you get above 70% utilization power usage may go up considerably.

As a billion other people have already said, that all depends on their workloads. This isn't a 30% clock-speed reduction.

As I understand the problem, this isn't about clock-speed reduction; now it is the software's responsibility to check if the page is a kernel page/user page, so the impact is significant. So, every time either pages are touched/accessed this check needs to be triggered, which causes it to be much slower.

> So, every time either pages are touched/accessed this check needs to be triggered, which causes it to be much slower.

Not to be mean, but that's not what is being changed.

You're right on the bug - userlevel code can now read any memory regardless of privilege level. However the fix isn't to manually check the privileges on each access - that would be extremely slow and wouldn't actually fix the problem.

The fix is to unmap the kernel entirely while userspace code is running. Because the kernel will no longer be in the page table, the userspace code can no longer read it. The side effect of this is that the page table now needs to be switched every time you enter the kernel, which also flushes the TLB and means that there will be a lot more TLB misses when executing code, which slows things down a lot.

So, to be clear, it is not accessing pages that is being slowed down, it is the switch from the kernelspace to the userspace.

But doesn't the CPU enter kernelspace every time a syscall takes place? So based on what you've described, every time a syscall returns control back to userspace, the TLB will be flushed, which means slower page access times in general.

The distinction I was trying to make was that the above commenters think the kernel is now checking page permissions instead of the CPU doing it, i.e. doing privilege checks in software. That's not what's happening; the kernel is just unmapping itself when user code runs, so the kernel can't be seen at all. Then the privilege checks (which are now broken) don't matter, because there is no kernel memory to read.

All your points are right though. Page access times will in general be slower because of all the extra TLB flushes, leading to more TLB misses when accessing memory.

Right, but how often that happens is workload dependent. Basically, how often is your code making syscalls.

But don't all I/O accesses (e.g., write to socket, read from DB) require a syscall? In that case, basically all web applications would be affected.

Or am I completely off the mark?

No, you are correct. Really, every application will be affected, they all make some syscalls. How much will vary, though.

At least one syscall happens at some point, but performance-tuned systems already use "bulk" syscalls where a single syscall can send megabytes of data, check thousands of sockets, or map a whole file into your address space to access as if it were memory.
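The memory-mapping route mentioned above can be sketched in a few lines (a toy example; the file contents are made up): one mmap() call up front, after which reads are plain memory accesses with no further kernel transitions beyond the initial page faults.

```python
import mmap
import os
import tempfile

# Write a scratch file, then map it read-only.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"no syscall needed to read me again")
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
        # Slicing the map is a memory access, not a read() syscall.
        assert m[:10] == b"no syscall"
finally:
    os.close(fd)
    os.unlink(path)
```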

> how often is your code making syscalls.

And how often the kernel services interrupts.

You don't understand the problem.

> claiming for a class lawsuit against Intel

If people who received written assurance from Intel that their hardware is 100% bug free can form a legal class, sure. I highly doubt there is even a single one such customer.

Anyone can sue anyone else at any time. If you think Intel isn't going to be sued for this, you're wrong.

It depends on how they handle user compensation. Going by the FDIV precedent, they should typically replace all defective products for free, and they will be in the clear.

What I meant was that the presence of the bug itself is not a valid cause, for example you can't claim that due to the error you lost 1 trillion dollars via a software hack - even if it's true. If Intel can prove they acted ethically when disclosing the bug and that they replaced / compensated users up to the value of the CPU, they are in the clear.

I read that this bug goes back several years; "replacing all defective products for free" could be a massive expense, and assuming it includes current chips, there's also some lag time and engineering effort to get to the point where they could start doing so.

How do they replace defective products? Or more specifically how do you get your laptop CPU replaced if Intel offers a replacement for free?

Presumably, by visiting an approved service center in your western, sue-happy country, or sending the computer to the nearest one at your expense in the rest of the world. If the CPU is not replaceable or no longer in service, you would get a voucher for the lost value of the CPU/computer that is now 30% slower. Something like $10-20 for anything older than 3 years, so most people won't bother. If Apple can do it, surely Intel will manage, but it will cost them on the order of billions, a non-negligible fraction of their yearly profit.

Good call. Will be fascinating to see how this plays out!

They need some legal standing or the case can be dismissed out of hand. It may very well be a question of who has the better legal team.

> They need some legal standing or the case can be dismissed out of hand.

Yes and no. Yes, Intel would get a chance to claim that the case should be dismissed out of hand. To do that, they have to prove that, even assuming all the claimed facts are true, the people suing still don't have a valid case. That's a high bar. It can be reached - there's a reason that preliminary summary judgment is a thing in court cases - but it takes a really flawed case to be dismissed in this way.

How flawed? SCO v. IBM was not completely dismissed on preliminary summary judgment, and that was the most flawed case I've ever seen.

> It may very well be a question of who has the better legal team.

Well, Intel can afford to hire the best. A huge class-action suit can sometimes attract the best to the other side as well, though. (There's not just one "best", so there's enough for both sides of the same court case.)

IANAL, but it looks to me like there's at least the potential for a valid court case. CPUs are (approximately) priced according to their ability to handle workloads; if they can't provide the advertised performance, they didn't deserve the price they sold for.

What is bug free? CPUs work just fine. There is no bug.

The question is did anyone receive performance assurance from Intel? Probably not.

Some cloud providers or compute grids just lost a lot. Maybe they will find an angle to claim compensation.

From what I've read, this slowdown only affects syscalls, which, since they aren't usually a huge percentage of processing in the first place, should not have such an effect. You're more likely looking at a few percent at most, which is not going to be enough to make AMD outperform Intel. Let's stop the fear mongering and wait for actual metrics.

> From what I've read, this slowdown only affects syscalls

Incorrect. It also affects interrupts and (page) faults.

Any usermode to kernel and back transition.

So this is evil for virtualization hosting, which is the major enterprise application for Intel chips.

Hosting on bare metal will become more attractive. Too bad you can't long OVH and Hetzner.

>Too bad you can't long OVH and Hetzner.

What does that even mean?

Also Hetzner just introduced some AMD Epyc server.

"Long" as a verb means to purchase their stock.

As opposed to "shorting" a stock, which means making a bet that it will go down in value.

Ah that makes sense. Thanks!

For some reason I can’t reply to ‘chrisper’ but I think ‘api’ is referring to going long in the stock market.


> For some reason I can’t reply to ‘chrisper’

HN doesn't let you reply to brand-new comments, to discourage the rapid back-and-forth commenting that is typical of flamewars.

Coming up on 9 years here and I'm still finding out new things about how this site works. I've been wondering recently why some comments aren't replyable.

I think there is a time delay. Wait a few minutes or hours and you can reply. That cooling off period has helped me really think through my replies.

They become replyable after some amount of time. The amount of time varies based on how active the thread is and/or how deeply nested the comment is.

You can reply anyway, but you have to click on the timestamp ("X minutes ago") to do it.

Usually you can click on the <posting time> and go to a page that displays only that comment, which has a reply box even when there's no reply option on the main page.

That 100Hz timer tunable just got a lot more attractive...

Does that mean that you can get an instance on AWS and slowdown the underlying server for all others by forcing a lot of syscalls? Or how is performance distributed between tenants?

Preliminary benchmarks show a significant impact (~20%) on PostgreSQL benchmarks.

20% when running SELECT 1; over a loopback network interface, not in real-world workloads.

The other benchmark that has generated some consternation is running 'du' on a nonstop loop.

Both of these situations are pathological cases and don't reflect real-world performance. My guess is a 5-10% performance hit on general workloads. Still significant, but nowhere near as bad as some of the numbers that are getting thrown around.

And, databases are the worst case scenario, most real-world applications are showing 1% performance impact or less.



It's not really a worst case scenario when you consider where the majority of Intel's revenue comes from: selling their high margin server chips for use in data centers, a significant portion of which are running some kind of database.

Why should we trust your guesses over numbers being thrown around?

Your last link is all gaming benchmarks, which as the article mentions are not affected much.

Another quick Postgres estimate [1] with lower impact, and a reply from Linus Torvalds saying that these values are in the range of what they are expecting from the patch: "... Something around 5% performance impact of the isolation is what people are looking at. ..." [2]

[1] http://lkml.iu.edu/hypermail/linux/kernel/1801.0/01274.html [2] http://lkml.iu.edu/hypermail/linux/kernel/1801.0/01299.html

> syscalls, which, since they aren't usually a huge percentage of processing in the first place... Let's stop the fear mongering


(We should probably also stop overgeneralizing about the nature of computational workloads.)

Software development workflows have some of the worst syscall profiles out there. This is going to hit most of us where we live.

> I know that bugs happen

This isn’t an excuse for Intel consistently having terrible verification practices and shipping horrendous hardware bugs. From 2015: https://danluu.com/cpu-bugs/ There have been more since then.

I’ve talked to multiple people who work in Intel’s testing division and think “verification” means “unit tests”. The complexity of their CPUs has far surpassed what they know how to manage.

This is typically what happens when you go for a long time without real competition. You get way too comfortable and bad habits start to pile up.

Isn't the reason this problem even exists the exact opposite? Intel was losing in the mobile market and changed internal testing to iterate faster by cutting corners.

Found a quote:

"We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times… we can’t live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our competition is moving much faster than we are".

Man, you should see the errata for some ARM-based SOCs. It's amazing that they work at all.

Vendor, in conversation: "We're pretty sure we can make the next version do cache coherency correctly."

Me (paraphrased): "Don't let the door hit you in the ass on the way out."

Management chain chooses them anyway, I spend the next year chasing down cache-related bugs. Fun.

ARM is such a shitstorm. At least the PC with UEFI is a standard. With every ARM device, you have to have a specialized kernel image just for that device. There have been efforts like postmarketOS, but still, in general, ARM isn't an architecture; it's random pins soldered to an SoC to make a single-use pile of shit.

Why is it an issue to need a different kernel image for each device? I don't see a problem as long as there is a simple mechanism to specify your device to generate the right image. It's already like that with coreboot/libreboot/librecore, and it worked just fine for me.

Imagine that you are the person leading the team that's making an embedded system on an ARM SOC. It's not Linux, so you have your own boot code, drivers and so forth. It's not just a matter of "welp, get another kernel image." You're doing everything from the bare metal on up.

(I should remark that there are good reasons for this effort. Such as: It boots in under 500ms, it's crazy efficient, doesn't use much RAM, and your company won't let you use anything with a GPL license for reasons that the lawyers are adamant about).

So now you get to find all the places where the vendor documentation, sample code and so forth is wrong, or missing entirely, or telling the truth but about a different SOC. You find the race conditions, the timing problems, the magic tuning parameters that make things like the memory controller and the USB system actually work, the places where the cache system doesn't play well with various DMA controllers, the DMA engines that run wild and stomp memory at random, the I2C interfaces that randomly freeze or corrupt data . . . I could go on.

It's fun, but nothing you learn is very transferable (with the possible exception of mistrust of people at big silicon houses who slap together SOCs).

The responsibility to document the quirks and necessary workarounds lies with the manufacturer of the hardware. If the manufacturer doesn't provide the necessary documentation, then that's exactly that: insufficient documentation to use the device.

There are hardware manufacturers that are better than others at being open and providing documentation. My minimal level of required support and documentation right now is mainline linux support.

Can you document your work publicly, or is there something I can read about it? I'm very interested in alternative kernels beside Linux.

> The responsibility to document the quirks and necessary workarounds lie with the manufacturer of the hardware.

When you buy an SOC, the /contract/ you have with the chip company determines the extent and depth of their responsibility. On the other hand, they do want to sell chips to you, hopefully lots of them, so it's not like they're going to make life difficult.

Some vendors are great at support. They ship you errata without you needing to ask, they are good at fielding questions, they have good quality sample code.

Other vendors will put even large customers on tier-1 support by default, where your engineers have to deal with crappy filtering and answer inane questions over a period of days before getting any technical engagement. Issues can drag on for months. Sometimes you need to get VPs involved, on both sides, before you can get answers.

The real fun is when you use a vendor that is actively hiding chip bugs and won't admit to issues, even when you have excellent data that exposes them. For bonus points, there are vendors that will rev chips (fixing bugs) without revving chip version identifiers: Half of the chips you have will work, half won't, and you can't tell which are which without putting them into a test setup and running code.

Arm is a problem for all kernels not just Linux in how they map on chip peripherals, etc. All the problems that UEFI solve, are not solved on Arm.

Yep. I've seen scary errata and had paranoid cache flushes in my code as a precaution.

My favorite ARM experience was where memcpy() was broken in an RTOS for "some cases". "some cases" turned out to be when the size of the copy wasn't a multiple of the cache line size. Scary stuff.
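A size-sweep test is the cheapest way to catch that class of bug. Below is a toy Python model (the buggy copy routine and the 64-byte line size are invented for illustration, not the actual RTOS code) of a copy that only handles whole cache lines, plus the sweep that flushes it out:

```python
# Toy model of the RTOS memcpy bug: a "fast" copy that only handles whole
# 64-byte cache lines and silently zeroes the tail.
LINE = 64

def buggy_copy(src: bytes) -> bytes:
    whole = (len(src) // LINE) * LINE  # rounds down: tail bytes are lost
    return src[:whole] + b"\x00" * (len(src) - whole)

def find_bad_sizes(copy, max_size=130):
    """Sweep copy sizes, deliberately including non-multiples of LINE."""
    bad = []
    for n in range(1, max_size + 1):
        src = bytes((i + 1) % 256 for i in range(n))  # non-zero pattern
        if copy(src) != src:
            bad.append(n)
    return bad

bad = find_bad_sizes(buggy_copy)
print(len(bad))  # 128: every size except 64 and 128 fails
```

A test suite that only ever copied nice round power-of-two sizes would have passed this routine with flying colors, which is presumably how the real bug shipped.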

Obvious hypothesis: first complacency leads to incompetence, then starting to cut corners has catastrophic consequences. The two problems are wonderfully complementary.

As other comments suggest, there might be a third stage, completely forgetting how to design and validate chips properly.

Or the system was designed poorly to begin with and now you're stuck with the design for backwards compatibility reasons.

I'd expect engineers that are aware of such serious bugs to spit on the grave of backwards compatibility. After all, the worst case impact would be smaller than the current emergency patches: rewriting small parts of operating systems with a variant for new fixed processors.

I think that could also have been the "official reason".

The same reason could have been used to give the NSA some legroom for instance, but tell everyone that's why they won't do so much verification in the future.

This implies that ARM vendors do less validation. I guess ARM is just so much simpler that good-enough validation can be done faster. So essentially this is payback time for Intel for keeping compatibility with older code and a simpler-to-program architecture (stricter cache coherence, etc.). It's as if you can only pick two of cheap, reliable, and easy to program.

I'm sure ARM vendors have their own problems... it's just that ARM chips tend to be used in application-specific products, so the bugs are worked around. Coming from a firmware background, I've written tons of ugly workarounds for serious bugs in validated hardware.

Furthermore, I just read an article (can't find the link) saying that certain ARM Cortex cores have the same issue as Intel.

> This implies that ARM vendors do less validation. I guess ARM is just so much simpler that good enough validation can be done faster.

More likely "good enough" is much lower because ARM users aren't finding the bugs. The workloads that find these bugs in Intel systems are: heavy compilation, heavy numeric computation, privilege escalation attackers on multi-user systems. Those use cases barely exist on ARM: who's running a compile farm on ARM, or doing scientific computation on an ARM cluster, or offering a public cloud running on ARM?

Where’s that quote from? ISTR reading it (or something very similar) as reported speech in a HN comment.

Overall it’s a depressing story of predictable market failure as well as internal misbehavior at Intel, if true. Few buyers want to pay or wait for correctness until a sufficiently bad bug is sufficiently fresh in human memory. And if you do want to, it’s not as if you’re blessed with many convenient alternatives.

The quote is from the link above (referencing an anonymous reddit comment).

That is a very interesting perspective, and as far as I know it is correct, though perhaps Intel's situation in the mobile market was exacerbated by complacency?

There are people looking to deploy ARM servers now, though I wish there were more server competition. Many companies write their backend services in Python, the JVM (Java/Scala/Groovy), Ruby, etc.: stuff that would run fine on Power, ARM, or other architectures. There are very few specialized libraries that really require x86_64 (like ffmpeg and video transcoding).

ffmpeg works great on ARM. I don't know if the PPC port is all that optimized lately.

But why do AMD chips not have similar issues? To me it looks like Intel tried to micro optimize something and screwed up.

According to LKML: https://lkml.org/lkml/2017/12/27/2

> The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault.

Out-of-order processors generally trigger exceptions when instructions are retired. Because instructions are retired in-order, that allows exceptions and interrupts to be reported in program order, which is what the programmer expects to happen. Furthermore, because memory access is a critical path, the TLB/privilege check is generally started in parallel with the cache/memory access. In such an architecture, it seems like the straightforward thing to do is to let the improper access to kernel memory execute, and then raise the page fault only when the instruction retires.
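To make that mechanism concrete, here is a toy Python simulation of the idea (illustrative only, not exploit code): the "secret" byte, the 256-entry probe space, and the `touched` set standing in for real cache-timing measurements are all modeled, since the actual attack needs rdtsc-level timing of cache hits.

```python
SECRET = 0x2A  # stand-in for a privileged kernel byte

def speculative_read(touched):
    """Model the CPU transiently loading a privileged byte and using it
    as an index into a probe array before the fault is delivered."""
    line = SECRET
    touched.add(line)        # footprint left in the cache...
    raise PermissionError    # ...but the fault is raised only at retirement

def attacker():
    touched = set()          # stand-in for per-line access timings
    try:
        speculative_read(touched)
    except PermissionError:
        pass                 # architecturally, the read "never happened"
    # Probe all 256 candidate lines; the fast (touched) one is the secret.
    for byte in range(256):
        if byte in touched:
            return byte

print(hex(attacker()))  # recovers 0x2a without an architectural read
```

The point of the sketch: the page fault does fire, but by the time it does, the privileged value has already left a microarchitectural trace the attacker can measure.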

Maybe the answer lies in Intel's vaunted IPC advantage over AMD? Or has AMD simply been relatively lucky so far?

Sounds like Facebook and Youtube, too.

It depends on whether it's an attack against HVM hypervisors or not.

If it, as it seems, is just an attack on OS kernels and PV hypervisors, you can simply turn off the mitigation, since nowadays kernel security is mostly useless (and Linux is likely full of exploitable bugs anyway, so memory protection doesn't really do much other than protect against accidental crashes, which isn't changed by this).

Even if it's an attack against hypervisors any large deployment can simply use reserved machines and it won't have a significant cost.

More than the lawsuit, this attacks one of the core aspects of Intel's brand: performance. Intel chips are supposed to be faster. Now they are suddenly 30% slower because Intel carelessly prioritized performance features over security.

> all companies paying for computing resources will have to pay roughly 30% more overnight on cloud expenses

Well, if I rent a VPS with x performance, I still expect x performance after this flaw is patched. The company providing the virtual machine will perhaps have to pay 30% more to provide me with the same product I've been getting.

Since most VPS offerings arbitrage shared resources, this will not increase costs of providing VPSes by the full performance penalty.

But all you ever get with VPS offerings is a description of the number of CPUs, the amount of RAM, and suchlike. I haven't seen VPS offerings that say "x" chips yield "y" performance. Granted, it's sort of implied that the hardware meets certain expectations, but there isn't any guarantee. I was just reading the AWS TOS to check, and as far as I can tell they don't guarantee any specific level of performance.

Well if you use m5.xlarge instances from AWS you were getting 4 vCPUs for your money. I don't expect you'll now get 5...

No, but the underlying hardware that previously hosted two m5.xlarge instances may instead host one m5.xlarge and one m5.medium, so that performance is not degraded.

Yeah but AWS doesn't guarantee that "2x m5.xlarge" will meet any kind of performance requirements, particularly your own application's, do they?

So you may suddenly find that your own performance requirements, that were previously satisfied by "2x m5.xlarge" are no longer being met by that configuration, and I doubt AWS will just provide you with more resources at no additional charge.

> Well, if I rent a VPS with x performance, I still expect x performance after this flaw is patched.

Are there any providers that state you will get x performance? Most that I've seen say you will m processors, n memory, and p storage but don't make any guarantees about how well those things will perform.

Last I checked Amazon AWS has a virtual processor metric not actual hardware metric. This is most noticeable in their lowest power instances which don't get a full modern CPU core.

If the virtual metric is tied to real performance then it could mean a drop in performance while maintaining the same power rating... It will be interesting to see if vendors directly address this.

Cloud services may not need to worry about the issue depending on the OS the customer choose to use, the patched or non patched version.

For the cloud providers it's the security of the hypervisor that's at stake.

Why would the OS of the customer matter? The patch would be applied to the kernel of the hypervisor / host OS.

Forgive me my ignorance, but I fail to see how this is such a big deal. Even 50% performance hit/cost increase would be... bearable, computations are rather cheap today. ML and other intensive calculations aren't done on CPU anyway. It's not like technical progress of our civilization is slowed down by 30% or something...

On the other hand, shrinking Intel's market share due to bad PR and thus adding some competition into the industry could actually foster that progress.

If you run things efficiently, you're eking every ounce of performance out of the hardware. A 30% performance hit means a 30% cost increase.

The bigger issue is for things that don't scale easily. That SQL server that was at 90% capacity is suddenly unable to handle the load. Sure, that could've happened organically, but now it happens (perhaps literally) overnight for everyone all at once.

Expect a bunch of outages in the next few weeks as companies scramble to fix this.

"A 30% performance hit means a 30% cost increase."

Just wanna point out that a 30% performance hit means a 43% cost increase.

Yes. This is so often forgotten when talking about stock prices (which those 2x or 3x daily derivatives are so dangerous).

For those confused: a 30% decrease puts you at 70%. Adding 30% back to that 70% only gets you to 91% (0.70 * 1.3). Since 1/0.7 = 1.43, you need a 43% increase to fully recover.
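In code, the recovery arithmetic looks like this:

```python
baseline = 100.0               # capacity before the patch
after_patch = baseline * 0.70  # a 30% performance hit

# Adding 30% back only reaches 91% of the original capacity:
assert round(after_patch * 1.30) == 91

# Full recovery needs 1/0.7, roughly 1.43x the hardware:
increase_pct = (1 / 0.7 - 1) * 100
print(f"{increase_pct:.0f}% more capacity needed")  # 43% more capacity needed
```

The same asymmetry is why a fund that drops 50% needs a 100% gain to get back to even.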

Should individual companies hurry to patch it? There's no news of an exploit as such.

There is now !

Intel's CEO added: "But when you take a look at the difficulty it is to actually go and execute this exploit — you have to get access to the systems, and then access to the memory and operating system — we're fairly confident, given the checks we've done, that we haven't been able to identify an exploit yet."

It seems you need root or physical access to the system as a prerequisite for the attack.

You don't need root, and you don't need physical access. For Meltdown, you only need the ability to run your own code on the target machine.

Where that gets tricky is when everyone's using cloud hosting solutions where the physical machines are abstracted away, and a given physical server may be running multiple virtual servers for different customers.

Think of it like this:

* Somewhere in a data center at a cloud provider is a physical server, wired up in a rack..

* That server runs virtualization software, allowing it to host Virtual Server 1, Virtual Server 2, and Virtual Server 3.

* Virtual Server 1 belongs to Customer A. Virtual Servers 2 and 3 belong to Customer B.

* Normally, Virtual Server 1 can't access any memory allocated to Virtual Servers 2 and 3.

* BUT: Customer A can now use Meltdown to read the entire memory of the physical server. Which includes all the memory space of Virtual Servers 2 and 3, exposing Customer B's data to Customer A.

That's the threat here.

Have you worked at a company where you've hit CPU performance limits? At my last job, we'd have some services running in 25 containers in parallel, and we'd have to optimize as much as we could for performance bottlenecks. We'd literally get thousands of assets per minute some mornings, and had a ton of microservices to properly index, tag, thumbnail, and transcode them.

Our ElasticSearch nodes all had 32GB of ram and we had 10 of them and they were all being pushed to the max.

Something like this would be a massive hit, requiring a lot more work into identifying new bottlenecks and scaling up appropriately.

I think you're vastly underestimating the potential impact to cloud providers. Azure/AWS/GCP all definitely have extra capacity, but they have forecasting down to a science. Requiring even 10% more capacity is quite a large undertaking alone.

Even the non-provider side of Google will see some impact, and even a 5% datacenter capacity increase won't happen overnight.

Best summary I've found for the somewhat technical but not hardware-or-low-level-hacker reader is arstechnica. https://arstechnica.com/gadgets/2018/01/whats-behind-the-int...

My head is still spinning. Writing an OS is a BIG DEAL!

Can someone help me understand why this is such a big deal? This doesn’t seem to be a flaw in the sense of the Pentium FDIV bug where the processor returned incorrect data. It doesn’t even seem to be a bug at all, but a side channel attack that would be almost expected in a processor with speculative execution unless special measures were taken to prevent it. And it doesn’t seem like it can be used for privilege escalation, only reading secret data out of kernel memory. It seems pretty drastic to impose a double-digit percentage performance hit on every Intel processor to mitigate this.

There is this thing called "return oriented programming". You write your program as a series of addresses that are smashed onto the stack through some other type of vulnerability. When the current function returns, it returns to an address of your choosing. That address points to the tail end of some known existing function, such as in the C library and other libraries. When the tail end of that function returns, it executes your next "instruction" which is merely the next return address on the stack.

The first "instruction" of your program is the last address on the stack, in the list of addresses you pushed to the stack.

You are executing code, but you did not inject any executable code, you did not need to modify any existing code pages (which are probably read only), you did not need to attempt to execute code out of a data page (which is probably marked non executable).

Address Space Layout Randomization is a way to prevent the "return oriented programming" attack. When a process is launched, the address space is randomly laid out so that the attacker cannot know which address in memory the std C lib printf function will be located at -- in this process.

Now let's think about the kernel. If you could know all of the addresses of important kernel routines, you could potentially execute a "return oriented programming" attack against the kernel with kernel privileges. Without modifying or injecting any kernel level code. These hardware vulnerabilities allow user space code to deduce information about kernel space addresses.

Now that's a lot of hoops to jump through in order to execute an attack. But there are people prepared to expend this and even more effort in order to do so. Well funded and well staffed adversaries who would stop at nothing in order to access more and better pr0n collections.
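For readers who've never seen one, a ROP payload really is just a sequence of packed addresses. Here's a toy Python sketch of building one; every address and the 72-byte padding are hypothetical values made up for illustration, not offsets from any real binary:

```python
import struct

# Hypothetical addresses, invented for illustration; in a real attack
# they come from disassembling the target binary and its libraries.
POP_RDI_RET = 0x00401234  # gadget: pop rdi ; ret
BIN_SH_STR  = 0x00601060  # address of a "/bin/sh" string
SYSTEM_PLT  = 0x00401050  # system() entry in the PLT

PADDING = b"A" * 72  # filler up to the saved return address (hypothetical)

# The "program" is just addresses: each ret pops and jumps to the next one.
# This chain models system("/bin/sh") under the x86-64 calling convention
# (first argument in rdi).
chain = [POP_RDI_RET, BIN_SH_STR, SYSTEM_PLT]
payload = PADDING + b"".join(struct.pack("<Q", a) for a in chain)

print(len(payload))  # 72 + 3 * 8 = 96 bytes
```

Note there is no machine code anywhere in the payload: only data that happens to steer existing code, which is exactly why non-executable stacks don't stop it.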

Thanks for the explanation. But I don't understand this part:

> If you could know all of the addresses of important kernel routines, you could potentially execute a "return oriented programming" attack against the kernel with kernel privileges. Without modifying or injecting any kernel level code.

The user <-> kernel transition is mediated (on x86-64) with the SYSCALL instruction, which jumps to a location specified by a non-user writable MSR. How does return-oriented programming work in that case?

Basically, let's say there's a syscall that takes a user buffer and size and copies it into kernel stack for processing. (This is common.) If you overflow that buffer, you can overwrite the return address in the kernel stack, which you can then launch into ROP.

If you overflow that buffer, you can overwrite the return address in the kernel stack, which you can then launch into ROP.

The crucial point here being that there must already be an existing overflow vulnerability in the kernel. Knowing all the addresses is no use if you can't force execution to go to them.

The hypothesis I've seen, and why people seem to be rushing to patch it without explaining, is that you might be able to not only leak addresses, but actual data, from any ring, into unprivileged code, at which point, your security model is burned to the ground.

AIUI, the present circumstances are:

- there exists a public PoC from some researchers of side-channel leaking kernel address information into userland via JavaScript which may be unrelated

- there exists a Xen security embargo that expires Thursday that might be unrelated

- AWS and Azure have scheduled reboots of many things for maintenance in the next week, which seems unlikely to be unrelated to the Xen embargo

- a feature that appears to be geared toward preventing a side-channel technique of unknown power has been rushed into Linux for Intel-only (both x86_64 and ARM from Intel)

- a similar class of prevention technique has been landed in Windows since November for both Intel and AMD x86_64 chips (no idea about ARM)

- the rush surrounding this, and people being amazingly willing to land fixes that imply a 5-30% performance impact, strongly suggest that unlike almost every major CPU bug in the last decade, you can't fix or even work around this with a microcode update for the affected CPUs, which is _huge_. The AMD TLB bug, the AMD tight loop bug that DFBSD found, even the Intel SGX flaws that made them repeatedly disable SGX on some platforms - all of them could be worked around with BIOS or microcode updates. This, apparently, cannot. (Either that or they're rushing out fixes because there's live exploit code somewhere and they haven't had time to write a microcode fix yet, but O(months) seems like they probably concluded they outright can't, rather than haven't yet.)

Addendum for anyone still reading:

- Intel issued a press release saying they planned to announce this next week after more vendors had patched their shit, which lends me more cause to believe that the Xen bug might be the same one [1]

- Intel claims in the same PR that "many types of computing devices — with many different vendors’ processors" are affected, so I'll be curious to see whether non-Intel platforms fall into the umbrella soon

- macOS implemented partial mitigations in 10.13.2 and apparently has some novel ones coming up in 10.13.3 [2]

- someone reasonably respected claims to have a private PoC of this bug leaking kernel memory [3]

- ARM64 has KPTI patches that aren't in Linus's tree yet [4] [6] ([6] is just a link showing the patches from 4 aren't in Linus's tree as of this writing)

- all the other free operating systems appear to have been left out of the embargoed party (until recently, in FBSD's case), so who knows when they'll have mitigations ready [5]

- So far, Microsoft appears to have only patched Windows 10, so it's unknown whether they intend to backport fixes to 7 or possibly attempt to use this as another crowbar to get people off of XP 2.0

- Update: Microsoft is pushing an OOB update later today that will auto-apply to Win10 but not be forced to auto-apply on 7 and 8 until Tuesday, so that's nice [7]

[1] - https://newsroom.intel.com/news/intel-responds-to-security-r...

[2] - https://twitter.com/aionescu/status/948609809540046849

[3] - https://twitter.com/brainsmoke/status/948561799875502080

[4] - https://patchwork.kernel.org/patch/10095827/

[5] - https://lists.freebsd.org/pipermail/freebsd-security/2018-Ja...

[6] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

[7] - https://www.theverge.com/2018/1/3/16846784/microsoft-process...

https://security.googleblog.com/2018/01/todays-cpu-vulnerabi... https://googleprojectzero.blogspot.com/2018/01/reading-privi...

Seems that Google/Project Zero felt the need to go ahead and break embargo. Worth adding to the above list of news sources.

No, that's not accurate.

If you read the article you quoted:

> We are posting before an originally coordinated disclosure date of January 9, 2018 because of existing public reports and growing speculation in the press and security research community about the issue, which raises the risk of exploitation. The full Project Zero report is forthcoming (update: this has been published; see above).

Just from public Googling, I believe it may have been The Register who tried to get in on the scoop and broke the embargo:


No one necessarily broke the embargo. A blogger noticed unusual activity around a certain Linux patchset and put two and two together, and The Register mostly sourced from his article ( http://pythonsweetness.tumblr.com/post/169166980422/the-myst... )

Also from the P0 blog post:

Variant 1: bounds check bypass (CVE-2017-5753)

Variant 2: branch target injection (CVE-2017-5715)

Variant 3: rogue data cache load (CVE-2017-5754)

My checking doesn't show any of those three explicitly listed in Apple's security updates up through 10.13.2/2017-002 Sierra.


Thanks for summarizing. Does anyone have time to link to more on the "side-channel leaking kernel address information into userland via JavaScript" ?

This isn't exactly that, but here[1] is a talk linked in the post from the other day which shows a PoC breaking ASLR in Linux from JavaScript running in the browser, via a timing attack on the MMU. There's a demo a half hour in.

EDIT: This post[2] discusses the specific speculative execution cache attack and claims there is a JavaScript PoC (but doesn't cite a source for that claim)

[1] https://www.youtube.com/watch?v=ewe3-mUku94

[2] https://plus.google.com/+KristianK%C3%B6hntopp/posts/Ep26AoA...

[1] was what I was referencing, thank you.

Also, RUH-ROH. https://twitter.com/brainsmoke/status/948561799875502080

More importantly, it also switches stacks so user-mode code cannot modify the return addresses on the kernel's stack.

Everything you've said is right, but I'll expand a little more because ROP is fun.

ASLR, PIC (position-independent code: chunks of the binary move around between executions), and RELRO (changing the order and permissions of an ELF binary's headers; a common ROP pattern is to set up a fake stack frame and call a libc function through the ELF's global offset table) are all mitigations against ROP, but none solve the underlying problem.

The reason ROP exists is that x86-64 uses a von Neumann architecture, which means the stack necessarily mixes code (return addresses) and data. The only true solution is an architecture that keeps these separate, such as a Harvard-architecture chip.

As for bypassing the aforementioned mitigations...

ASLR: Only guarantees that the base address changes. Relative offsets are the same. So to be able to call any libc function in a ROP chain, all you need is a copy of the binary (to find the offsets) and to leak any libc function address at runtime. There are a million ways for this data to be leaked, and they are often overlooked in QA. Once you have any libc address, you can use your regular offsets to calculate new addresses.
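The rebasing arithmetic is one subtraction and one addition. A sketch with made-up numbers (real offsets would come from something like `readelf -s` against the target's libc, and the leak from the vulnerable program itself):

```python
# Offsets inside libc are fixed per build; ASLR only shifts the base.
# All values below are hypothetical, invented for illustration.
OFFSET_PRINTF = 0x064E80  # printf's offset inside libc
OFFSET_SYSTEM = 0x0453A0  # system's offset inside libc

leaked_printf = 0x7F1234564E80  # runtime address leaked from the target

# One leak pins the randomized base; every other symbol follows from it.
libc_base = leaked_printf - OFFSET_PRINTF
system_addr = libc_base + OFFSET_SYSTEM

assert libc_base % 0x1000 == 0  # sanity check: bases are page-aligned
print(hex(libc_base), hex(system_addr))
```

The page-alignment assert is a common sanity check in practice: if the computed base isn't page-aligned, you've leaked the wrong symbol or used offsets from the wrong libc build.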

PIC: I haven't dealt with it myself yet, but you can use the above technique to get addresses in any relocated chunk of code; I think you'll need to leak two addresses to account for both ASLR and PIC.

RELRO: This makes the function lookup table in the binary read-only, which doesn't stop you from calling any function the binary already calls. Without RELRO, you can call anything in libc.so, I think, but with RELRO you can only call functions that have been explicitly invoked. This is still super useful because the libc syscall wrappers like read() and write() are extremely powerful anyway. Full RELRO (as opposed to partial RELRO) makes the procedure linkage table read-only as well, which makes things harder still.

If this is the kind of thing that interests you, I heartily recommend ropemporium.com, which has a number of ROP challenge binaries of varying difficulty. If you're not sure where to start, I also wrote a write-up for one of the simpler challenges [1] that is extremely detailed and should be more than enough to get you started (even if you have no experience reversing or exploiting binaries).

Disclaimer: I'm just some dipshit who thinks this stuff is fun; if I've made a mistake in the above, please let me know. I also haven't done any ROP since I wrote the linked article, so I'm probably forgetting stuff.

[1] https://medium.com/@iseethieves/intro-to-rop-rop-emporium-sp...

> If you could know all of the addresses of important kernel routines

Are those kernel logical addresses?

My BTC wallet feels more vulnerable than ever

If you can read kernel (and hypervisor) memory then it seems like a very small step from that to a local root vulnerability - e.g. forge some kind of security token by copying it. There's an embargoed Xen vulnerability that may be related to or combine with this one to mean that anyone running in a VM can break out and access other VMs on the same physical host. That would be a huge issue for cloud providers.

> If you can read kernel (and hypervisor) memory then it seems like a very small step from that to a local root vulnerability - e.g. forge some kind of security token by copying it.

This seems very wrong. I'm not aware of any privilege isolation in Windows relying on the secrecy of any value. Security tokens have opaque handles for which "guessing" makes no sense. Are you aware of anything?

I can think of a few ways to get privilege escalation if you already have rce as unprivileged user:

1. Read the root SSH private key from the OpenSSH daemon's kernel pages maintaining the crypto context, then SSH into the system

2. Read a sudo auth key generated for someone using sudo and then use that to run code as a root user

3. Read users' passwords whenever a session manager asks them to reauthenticate

4. If running in AWS/GCP inside a container/VM meant to run untrusted code, read the cloud provider's private keys and take control of the account

5. RCE to ROP powered privilege escalation exploit seems reasonable...

6. Rowhammer a known kernel address (since you can now read kernel memory) to flip some bits to give you root

Also remember that running JS is basically RCE if you can read outside the browser sandbox; ads just became much more dangerous...

Thanks! I see. So it seems like the program basically has to capture sensitive data while it is in I/O transfer (and hence in kernel memory) just at the right time, right? Which is annoying and might need a bit of luck, but still possible.

Incidentally, this seems to indicate that zero-copy I/O is actually a security improvement as well, not just a performance improvement?

4,5 and 6 don't need to time the attack.

I am not really sure how, or whether, zero-copy solves this problem.

If this bug only allows reading kernel pages, zero-copy may actually help, provided the unprivileged user can't read your pages. But from the small amount of available description, it looks like it can read any page; kernel pages are just more interesting because that's a ring lower, which is why all the focus is on them.

I am fairly certain there is more protection against reading memory owned by a process at a lower ring level, so zero-copy may be a bad idea for security-critical data.

And based on the disclosure Google published, it looks like any memory can be read.

If “reading secret data out of kernel memory” translates into “read the page cache from a stranger’s VM that happens to be on the same cloud server” then this could be worse than Heartbleed.

Or maybe random JavaScript in the browser can stumble upon your SSH private key in the kernel's file cache... and so on.

Excellent point, I didn't think about the implications for stuff like JavaScript.

The privilege escalation is being fixed in software. The problem is that the mitigation involves patching the kernel, and that patch results in around a 30% slowdown for some applications, like databases or anything that does a lot of IO (disk and network). That's the big deal. Imagine you're running at close to full capacity: after the security-fix reboot, your service might tip over. It could mean a direct impact on cost, and so on.

Oh good, I put my SaaS (running mostly on Linode) up yesterday, then this happens. Can't wait for Linode to apply this patch to their infrastructure :(

I'm cursed when it comes to timing. It's like when I bought that house in 2007, held onto it waiting for the market to recover, then tried to sell it only to find out my tenants had been using it to operate a rabbit-breeding business for years and completely trashed the place (thank you, useless property manager), forcing me to sell it at a loss anyway (6 months ago).

Also, I hate rabbits now. And I veered off topic, sorry.

You might try luck to sell this as comedy/drama movie script. :)

> Also, I hate rabbits now. And I veered off topic, sorry.

Well, I guess you're not the right person to talk to about a great ninja-rockstar position at our new RaaS startup.

/one has to joke sometimes to avoid crying over taking a 30% hit in costs... over a stupid CPU bug

I would love to see some SQL Server benchmarks on this patch

SQL Server license disallows publishing of the results of benchmarking (much like Oracle does)

Wait, really? That's kind of messed up.

Remarkable that no throwaway HN accounts considered that a challenge.

Likely very similar to the Postgres benchmarks. Fundamentally, an RDBMS needs to sync each transaction commit to the log file on disk, and that sync is always a syscall. If your DB is doing thousands of tx/sec to low-latency flash and you rely on that low latency, you're going to get hit.

Note that the Postgres benchmark numbers being passed around (mostly based on my benchmarks) are read-only. For write-based workloads the overhead is likely to be much smaller in nearly all cases; there's just more independent work per syscall. The overhead in the read-only profiles comes almost entirely from the synchronous client<->server communication; if you can avoid that (using pipelining or other batching techniques), the overhead will be smaller.
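A back-of-the-envelope model of why batching helps. The syscall costs and transaction rate below are assumed, purely for illustration; the point is the shape of the arithmetic, not the specific numbers:

```python
# Assumed, illustrative numbers: the mitigation roughly triples the cost
# of crossing the user/kernel boundary, and each commit does one fsync().

def overhead_ns_per_sec(tx_per_sec, syscall_ns, batch=1):
    """Per-second syscall overhead when `batch` commits share one fsync."""
    return (tx_per_sec / batch) * syscall_ns

tx = 50_000
before  = overhead_ns_per_sec(tx, 100)            # pre-patch cost per syscall
after   = overhead_ns_per_sec(tx, 300)            # post-patch cost per syscall
grouped = overhead_ns_per_sec(tx, 300, batch=10)  # group commit, 10 tx/fsync

print(after / before, grouped / before)  # 3.0 0.3
```

Under these assumptions, batching ten commits per fsync more than cancels the patch overhead, which is why write-heavy workloads with group commit are expected to fare far better than syscall-per-request read loops.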

Reading secret data out of kernel memory is very bad on cloud environments. Keep in mind that the kernel deals with a lot of cryptography.

Sounds like reading HTTPS cert/key details from other-peoples-VM's on cloud providers wouldn't be too much of a stretch. Especially with the memory dumping demo. Combine that with something that looks for the HTTPS private key flag string and it's sounding pretty feasible. :/

Is there anything this bug can give you that you can't get with

    sudo cat /dev/mem

I'm having a hard time understanding why this is worse than any other local root escalation bug except for the consequences of the necessary patch.

EDIT: I see that /dev/mem is no longer a window on all of physical RAM in a default secure configuration. Is it true that there's no way for root to read kernel memory in a typical Linux instance? If so, the severity of this issue makes more sense to me.

You don't need to be root.

> I'm having a hard time understanding why this is worse than any other local root escalation bug except for the consequences of the necessary patch.

It's not, as far as I'm aware. The fact that the patch has perf consequences is why it's such a big deal.

I think the idea is that it is worse because the bug is in the hardware. The OS patches are just a workaround to make the hardware bug unexploitable, and they can lead to a significant performance penalty.

We don't know what the actual bug is yet, or how easy it would be to exploit it. People are speculating that either:

a) It would allow any non-root process to read full memory, including the kernel and other processes, or

b) It would allow one cloud VM to read full memory of other cloud VMs on the same physical machine, or

c) With enough cleverness, it would allow even sandboxed Javascript on a web page to read full memory of the computer that it is running on.

`/dev/mem` is not available in a container, so I cannot use `/dev/mem` to read other tenants' memory on my VPS.

>And it doesn’t seem like it can be used for privilege escalation

Based on all the hoopla around the Linux kernel patches, the thinking is: yes, it can. Or VM escape. Or both.

It's a bug, even if it's a side-channel attack only. Notice that AMD chips aren't vulnerable to this attack.

I have no idea how or if this is a big deal but:

>>attack that would be almost expected in a processor with speculative execution unless special measures were taken to prevent it.

If you're going to put in features with expected attacks, you should definitely be putting in measures to prevent them. And if an attack is expected, the countermeasure shouldn't be a "special measure"; it should be an inherent part of introducing the feature.

When speculative execution (and caches) were invented and put into widespread use, no one thought about timing attacks, nor was the practice of running untrusted code on one's own machine common.

UNIX has been multi-user for a very long time and the intended use case is that those users not be able to compromise each other or get root.

> nor was the practice of running untrusted code on one's own machine common

Doesn't multi-user timesharing and virtualization predate every modern CPU and OS though?

Yes, but it went out of style for a while.

At first, computers were very expensive, and so were shared between many users. Mainframes, UNIX, dumb terminals, etc.

Then computers became cheap. Users could each have their own computer, and simply communicate with a server. Each business could have their own servers co-located in a datacenter.

Then virtualization got really good, and suddenly cloud servers became viable. You didn't have to pay for a whole server all the time, and if demand rapidly increased you didn't need to buy lots of new hardware. And if demand decreased you didn't get stuck with tons of useless hardware.

The second stage (dedicated servers) was the case when speculative execution was implemented. We're currently in the third stage, but Intel haven't changed their designs.

Old multi-user time sharing generally had agents that were 'somewhat' trusted. Most systems holding 'secret' data didn't allow time sharing between users of different privilege levels. Also, outside of the timeshared server itself, users attempting to exploit this wouldn't likely have had the processing capability to deduce the contents of said cache.

It doesn’t even seem to be a bug at all, but a side channel attack that would be almost expected in a processor with speculative execution unless special measures were taken to prevent it.

Indeed, this reminds me of cache-timing attacks, which probably can be done on every CPU with any cache at all --- and they've never seemed to be much of a big deal either.
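The principle behind cache-timing attacks is easy to show with a simulation (everything here is made up for illustration; real attacks measure actual cache latencies with high-resolution timers): the attacker never reads the victim's data, only how fast its own accesses complete.

```python
# Simulated cache: a hit is cheap, a miss is expensive, and accessing a
# line pulls it into the cache. Latencies and addresses are illustrative.
CACHE_HIT_NS, CACHE_MISS_NS = 1, 100

class ToyCache:
    def __init__(self):
        self.lines = set()

    def access(self, addr):
        """Return the simulated access time; accessing a line caches it."""
        t = CACHE_HIT_NS if addr in self.lines else CACHE_MISS_NS
        self.lines.add(addr)
        return t

    def flush(self):
        self.lines.clear()

def victim(cache, secret):
    # The victim touches an address that depends on its secret value.
    cache.access(0x1000 + secret)

def attacker_recover(cache, candidates):
    # Probe every candidate address; the one the victim touched is the
    # only fast hit, so its timing gives the secret away.
    timings = {c: cache.access(0x1000 + c) for c in candidates}
    return min(timings, key=timings.get)
```

The attacker flushes, lets the victim run, then probes: the secret leaks purely through timing, which is why these attacks are so hard to rule out architecturally.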

I don't think we even know what the bug is yet, just lots of informed speculation...

Ironically, "lots of informed speculation" seems to be exactly what the bug is about. ;-)

The thing is, AMD probably just very narrowly missed this one. If they did more aggressive speculative execution, they would be in the same position.

I’m not (yet) clear on if/how this impacts aarch64 (ARM architecture) chips, but the distinction between Intel being affected and AMD not being affected reminds us of a fundamental lesson we seem to have conveniently forgotten: monocultures of anything are bad. We need diversity and diversification in order to have a reasonable amount of robustness in the face of unknowable, unpredictable risks.

I’m wondering whether ARM chips are affected and, if they are, whether they are uniformly affected or whether it depends on vendor implementation choices.

A patch is in the works for ARM chips as well (http://lists.infradead.org/pipermail/linux-arm-kernel/2017-N...), but I am not clear on whether it's enabled by default. It seems like a good idea to have this, independent of current ARM vulnerability.

Yep, I'm still wondering how this affects ARM and if it can be corrected in microcode on that platform.

I'm also wondering if/hoping for a fix that involves increased memory usage instead of the speed.

I’m not a very proficient programmer/developer, so please bear with me. I’m intrigued by your reference to trading off greater memory footprint in exchange for diminishing performance by less. I'm trying to understand how this would work in practice: do you envision ‘padding’ the critical data structures with more empty or randomised buffer zones? Wouldn't that incur an additional penalty for the data transfer (Von Neumann bottleneck)? Would blank data be sufficient or would there be some additional kind of memory effect in the DRAM/SRAM that would demand using randomised data overwrites? How would you generate that random data?

(I apologise if this is blindingly obvious for somebody well versed in low-level programming.)

Oh I have no idea actually; it's just pretty normal to see a speed-memory trade in a lot of problems. I'm definitely not low-level enough either.

ARM posted a good overview and affected product list today:


>We need diversity and diversification in order to have reasonable amount of robustness

Ironically, in human populations it produces the opposite effect.

Please don't take HN threads on ideological tangents. The point you're making has by now become well-known ideological flamebait. We ban users who derail HN threads in such directions, so please don't.


Edit: since https://news.ycombinator.com/item?id=16063749 makes it clear that you're using HN for that purpose, which is not allowed here, I've banned this account. Would you please not create accounts to break the site guidelines with?


More diverse population—> lower trust —> less social will to support each other.

So we should be tribal societies, then? (I.e., endogamous. As in cousin marriage.)

Here are some numbers quantifying the problem. Big caveats apply as they are very preliminary, but the hit due to the software patches looks extremely significant:


Superficially, it seems like the performance hit mostly scales with IOPS or transactions per second, which might have some pretty serious implications for performance/dollar in the kinds of intensive back-end applications where Intel currently dominates and AMD is trying to make inroads with EPYC.

It has very little to do with what kind of syscalls (I/O or other kinds) and everything to do with how many syscalls a given application makes per given time period. Compute-bound applications are already avoiding syscalls in their hotter parts. This will mostly be a blow to databases, caching servers, and other I/O-limited applications.
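The "how many syscalls" point is easy to see concretely. A small sketch (the function is mine) counting kernel entries for the same total amount of I/O: smaller writes mean more entries into the kernel, and thus more exposure to any fixed per-syscall cost such as the KPTI entry/exit overhead.

```python
import os

def write_in_chunks(data, chunk_size):
    """Write `data` to /dev/null chunk by chunk, counting write(2) calls.
    The total bytes written are identical either way; only the number of
    kernel entries changes."""
    fd = os.open(os.devnull, os.O_WRONLY)
    calls = 0
    try:
        for i in range(0, len(data), chunk_size):
            os.write(fd, data[i:i + chunk_size])  # one syscall per chunk
            calls += 1
    finally:
        os.close(fd)
    return calls
```

Writing 1 MiB in 4 KiB chunks costs 256 syscalls; one buffered write costs one. Applications already doing the latter will barely notice the patch.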

In other words, don't worry - pretty much all the key performance bottlenecks most of us deal with at work will be getting tighter, but at least our video games will still run OK.

Well, sounds like that time when there is no water left but there are still beers in the fridge.

At least we can play our sorrows away.

Right now I am just hoping that it won't add significant overhead to OpenGL. My application already has a bottleneck on changing OpenGL states and issuing render commands, and I have no idea how much of that time is spent making syscalls.

OpenGL implementations shouldn't be affected by syscall overhead. Historically it's been DirectX that had a syscall per draw call, but I believe both now just write commands to GPU RAM directly.

That's at least somewhat encouraging. Nevertheless, it sounds rather like the old, "Yes, but will it run Crysis?" question will perhaps have renewed relevance.

That’s terrible - these are precisely the sort of customers who will suddenly get hit with a performance hit that will negatively impact their operations!

This sounds like it’s positively evil for outfits that rely heavily on virtualisation also.

Would it be fair to say that this might cause an acceleration of the shift from on-prem to the public cloud, where there are performance guarantees?

There aren't really performance guarantees for CPU and the ones that are there won't help here. When this patch is released big providers will have the same hardware as before and sell it in the same way - but the OS and userspace will just be slower for some use patterns.

There isn't a guarantee that will compensate for that any more than if you updated some piece of your software infrastructure to a new version that just got slower.

As I mentioned in the other thread yesterday, databases and database-like applications are going to be hit particularly hard. Even more so on fast flash storage: a double whammy compared to apps just doing network IO.

And while databases try to minimize the number of syscalls they still end up doing a lot of them for read, writeout, flush.

How would you trade this knowledge?

Intel has already dropped and AMD is up. Maybe there's more to move, but first-order effects are at least partially priced in already.

But what about second-order effects? Seems like virtualization should be vulnerable (VMWare and Citrix), but maybe they actually benefit as customers add more capacity.

Software-defined networking and cloud databases should also suffer though it's unclear how to trade these.

AWS, Google Cloud and Azure might benefit as customers add capacity but there's no way to trade the business units. So what about cloud customers where compute costs are already a large percentage of total revenue?

Netflix should be OK but Snap and Twilio could get squeezed hard. Akamai and Cloudflare might have higher costs they can't pass through to customers.

And where's the upside? Who benefits? If the performance hit causes companies to add capacity, maybe semiconductor and DRAM suppliers like Micron would benefit.

First and foremost, don't buy Intel stock. This doesn't help here, but I have long considered Intel a company that doesn't know what it's doing. Here is why:

I owned Intel back in 2010 when they bought McAfee for $7.8 billion. They said the future of CPUs and chip tech was embedding security on the chip. The real answer was mobile and GPUs.

Not only did I immediately know this was a horrendous deal, it clearly showed that the CEO and management had no clue on their own market's desires and direction. At the time, I was hoping they were going to buy Nvidia, it would have been a larger target to digest at 10 bil, but doable by Intel at the time.

The McAfee purchase turned out to be one of the worst large corporate purchases in history. Had they invested the $7.8 billion blindly into an S&P 500 index fund, their investment would be worth ~$19-20 billion.

Arguably, the only sane trade here would be to buy Intel and short AMD if you think the size of the move is greater than it should have been. However, there are many reasons not to do this until there is more information, and as that information comes out, it will likely be incorporated into the continual corrections in price. As to second order effects, don't count on this mattering. Unless you are planning on trading huge amounts of money, the risk/reward is probably not great. If you have to ask...

I think the model for cloud vendors would be quite complicated. Not every version of the CPU and not every application is impacted as much (newer Intel processors with PCID will suffer less).

Add on top of that the fact that a lot of cloud customers over-provision (there are good scientific papers on how much spare CPU capacity there is). Cloud service providers that sell things on a per-request / real-CPU-usage model (vs. reserved capacity) probably benefit more.

Also, you can't just separate trading in AWS or GCE from the rest of the core business.

Potentially, the server business units of Dell, HP, IBM, ... should do better as people use this as a justification to upgrade overdue hardware and to cover the 5% to 10% performance loss (needing more units to cover it).

Agree on that last paragraph. The only reasonable thing people can do is buy more hardware to cover the performance loss and/or buy more hardware that's not needed using the bug as a pretext to get the budget approved now.

SDN might actually be OK. On the high-end they bend over backwards to not enter the kernel at all anyway, using stuff like DPDK. They stopped even using interrupts years ago.

Possibility: a cloud vendor who mostly-uses-AMD, versus one that mostly-uses-Intel, just got handed a massive price/performance relative advantage.

Automated IC layout and software proofing benefit from this.

Not sure why you are being downvoted. It's an interesting question. I'm in on AMD for the time, just to see how to flows.

Don't flash devices use NVMe (e.g., userspace queues) now and avoid the kernel altogether for read and write operations? Shouldn't they see no impact?

NVMe means each drive can have multiple queues, but they're still managed by a kernel driver. You may be thinking of SPDK, which includes a usermode NVMe driver but requires you to rewrite your application. And many systems are still using SAS or SATA SSDs.

Not by default they don't. That only works if you're willing to dedicate that entire physical drive to a single application anyway.

In the future this will be possible on Linux with the filesystems that support DAX. Currently this is all pretty experimental, with lots of work being done in this space in the last two years.

But this will require you to have the right kind of flash storage, the right kind of fs, the right kind of mount options, and probably a different code path in userspace for DAX vs. traditional storage.

So we're a little ways away from this.

DAX doesn't appear related here at all. That is about bypassing the page cache for block devices that don't need one.

That doesn't move anything from kernel land into userspace, certainly not in the app's process in userspace anyway.

If you bypass the page cache and use mmap instead of read()/write(), you avoid the syscall overhead. This matters a lot for high-IOPS devices. Also, these newfangled devices claim to support cache-line-granularity sync using normal CPU flush instructions, avoiding the fsync syscall as well.

One does not follow the other. Where are any references to how this will let you bypass read & write? User-space applications are still interacting with a filesystem, which they access via read/write and not a block device.

There's no talk in the DAX information about how this results in a zero-syscall filesystem API, and I'm not seeing how that would ever work given there would then be zero protections on anything. You need a handle, and that handle needs security. All of that is done today via syscalls, and DAX isn't changing that interface at all. So where is the API to userspace changing?

Please re-read my above comment. There is no new API. The DAX userspace API is mmap.

This work is experimental, but you can mmap a single file on a filesystem on such a device using the new DAX capabilities. Most accesses will no longer require a syscall.

This comes with all the usual semantics and trappings of mmap plus some additional caveats depending on how the filesystem / DAX / hardware is implemented. Most reads/writes will not require a trip to the kernel via the normal read()/write() syscalls. Additionally, there is no RAM page cache backing this mmap; instead the device is mapped directly at a virtual address (like DMA).

Finally, flush for these kinds of devices is at the block level implemented using normal instructions and not fsync. Flush is going to be done using the CLWB instruction. See: https://software.intel.com/en-us/blogs/2016/09/12/deprecate-...

LWN.net has lots of articles and links in their archives from 2016/2017. It's a really good read. Sadly I do not have time to dig more of them up for you. Do a search for site:lwn.net and search for DAX or MAP_DIRECT.
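The access pattern being described can be sketched with ordinary page-cache-backed mmap (DAX itself needs special hardware and filesystems, but the userspace API is the same): after the one mmap() call, loads and stores are plain memory operations with no syscall per access.

```python
import mmap
import os
import tempfile

# One syscall to establish the mapping, then no read(2)/write(2) per access.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello persistent world")
    with open(path, "r+b") as f:
        m = mmap.mmap(f.fileno(), 0)   # mmap(2): one syscall
        first = bytes(m[0:5])          # plain memory read, no read(2)
        m[0:5] = b"HELLO"              # plain memory write, no write(2)
        m.flush()                      # msync(2): one syscall to persist
        m.close()
    with open(path, "rb") as f:
        contents = f.read()
finally:
    os.close(fd)
    os.remove(path)
```

With DAX the same pattern maps the device directly instead of going through a RAM page cache, and on suitable hardware the flush step becomes a CPU cache-line flush rather than a syscall.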

Please re-read mine. How is the number of syscalls (which is the only thing that matters in this context) changing if there's no API change to apps? mmap already exists and already avoids the syscall. DAX "just" makes the implementation faster, but it doesn't appear to have any impact on number of syscalls

As in, if you call read/write instead of using mmap you're still getting a syscall regardless of if DAX is supported or not. Not everything can use mmap. mmap is not a direct replacement for read/write in all scenarios.

Do we have a performance estimate? I can eat 20 or 30%, but I can't eat 90%.

This comment further down thread mentions it's 20% in Postgres. https://news.ycombinator.com/item?id=16061926

...when running SELECT 1 over a loopback socket.

The reply to that comment is accurate: that's a pathological case. Probably an order of magnitude off.

We're still learning, but it looks like pgbench is 7% to 15% off:


I've seen that message. It acknowledges the same problems: do-nothing queries over a local Unix socket.

Real-world use cases introduce much more latency from other sources in the first place.

I'm sticking with an expectation in the 2%-5% range.

Yep, this is getting blown way out of proportion by all of these tiny scripts that just sit around connecting to themselves. Even pgbench is theoretical and intended for tuning; you're not going to hit your max tps in your Real Code that is doing Real Work.

In the real world, where code is doing real things besides just entering/exiting itself all day, I think it's going to be a stretch to see even a 5% performance impact, let alone 10%.

I think 5% is a reasonable guess for a database. Even a well-designed database does have to do a lot of IO, both network and disk. It's just not a "fixable" thing.

But overall, yeah.

The claim is that it's 2% to 5% in most general uses on systems that have PCID support. If that's the case, then I'm willing to bet that databases on fast flash storage are a lot more impacted than this, and pure CPU-bound tasks (such as encoding video) are less impacted.

The reality is that OLTP database execution time is not dominated by CPU computation but by IO time. Most transactions in OLTP systems fetch a handful of tuples. Most time is spent fetching the tuples (and maybe indices) from disk and then sending them over the network.

New disk devices lowered the latency significantly while syscall time has barely gotten better.

So in OLTP databases I expect the impact to be closer to 10% to 15%, up to 3x the base case.
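This reasoning amounts to a simple first-order model; a sketch with made-up illustrative numbers (both inputs are guesses for any real workload):

```python
def expected_slowdown(syscall_time_fraction, per_syscall_penalty):
    """Back-of-envelope: overall slowdown if `syscall_time_fraction` of
    wall time is spent in syscall-heavy paths and those paths become
    `per_syscall_penalty` (e.g. 0.5 = 50%) slower."""
    return syscall_time_fraction * per_syscall_penalty

# Illustrative only:
#   CPU-bound video encode: ~2% of time in syscalls, paths 50% slower -> ~1%
#   OLTP on fast flash:    ~30% of time in syscalls, paths 50% slower -> ~15%
```

Which is why the same patch can be nearly invisible on one box and a double-digit regression on another.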

> I've seen that message. It acknowledges the same problems: do-nothing problems over a local unix socket.

The first set of numbers isn't actually unrealistic. Doing lots of primary key lookups over low latency links is fairly common.

The "SELECT 1" benchmark obviously was just to show something close to the worst case.

> The first set of numbers isn't actually unrealistic. Doing lots of primary key lookups over low latency links is fairly common.

Latency through loopback on my machine takes 0.07ms. Latency to the machine sitting next to me is 5ms.

We're actually (and to think, today I trotted out that joke about what you call a group of nerds--a well, actually) talking multiple orders of magnitude through which kernel traps are being amplified.
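Loopback latency like the numbers above is easy to measure directly; a quick sketch (absolute timings will vary wildly by machine and kernel, and with/without the patch):

```python
import socket
import time

def mean_round_trip(n=1000):
    """Measure the mean request/reply round-trip time over a local
    socketpair. Each iteration costs four syscalls (two sends, two
    recvs), so any per-syscall overhead shows up directly here."""
    a, b = socket.socketpair()
    try:
        t0 = time.perf_counter()
        for _ in range(n):
            a.sendall(b"x")
            b.recv(1)
            b.sendall(b"y")
            a.recv(1)
        return (time.perf_counter() - t0) / n
    finally:
        a.close()
        b.close()
```

Running it before and after enabling `pti` is a more honest comparison than quoting ping times between machines.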

> Latency through loopback on my machine takes 0.07ms. Latency to the machine sitting next to me is 5ms.

Uh, latency in local gigabit net is a LOT lower than 5ms.

> We're actually (and to think, today I trotted out that joke about what you call a group of nerds--a well, actually) talking multiple orders of magnitude through which kernel traps are being amplified.

I've measured it through network as well, and the impact is smaller, but still large if you just increase the number of connections a bit.

If so, this definitely moves the needle on the EPYC vs. Xeon price/performance ratio.

All the Oracle DBAs out there are in for some suffering. Forget the cost of 30% extra compute, what about the 30% increase to Oracle licensing?

"SPARC user: not affected!"

--Oracle's marketing tomorrow, probably

(to their credit, SPARC does fully isolate kernel and user memory pages, so they were ahead of the curve here... for all 10 of their users who run anything other than Oracle DB on their systems.)

Phoronix strikes again! I admire Michael's consistency and dedication and their benchmarks have certainly gotten better over the years as PTS has matured, but everything on Phoronix still needs to be taken with a generous helping of salt. New readers generally learn this after a few months; it applies not only to their benchmarks, but also their "news".

The most obvious issue with this benchmark is that Phoronix is testing the latest rcs, with all of their changes, against the last stable version [EDIT: I misread or this changed overnight, see below] that doesn't have PTI integrated, instead of just isolating the PTI patchset. The right way to do this would be to use the same kernel version and either cherry-pick the specific patches or trust that the `nopti` boot parameter sufficiently disables the feature. That alone makes the test worthless.

There is no way this causes a universal 30% perf reduction, especially not for workloads that are IO-bound (i.e., most real-world workloads). This is a significant hit for Intel, but it's not going to reduce global compute capacity by 30% overnight.

EDIT: Looking at the Phoronix page, the benchmark actually appears to use 4.15-rc5 as "pre" and 4.15-some-unspecified-git-pull-from-Dec-31-that-isn't-called-rc6 as "post". I thought I had read 4.14.8 there last night, but may not have. Regardless, the point stands -- these are different versions of the kernel and the tests do not reflect the impact of the PTI patchset.

So you’re saying that the latest RCs, without the patch, were supposed to be slower than stable by at least 10%? How often do companies release performance downgrades of that scale? That’s also very unlikely.

>So you’re saying that the latest RCs, without the patch, were supposed to be slower than stable by at least 10%?

I'm saying that it's not a reliable measurement of the impact of the PTI patchset. There was a PgSQL performance anecdote [0] (actually tested with the real boot parameters instead of entirely different versions of the kernel) that showed 9% performance decrease posted to LKML, which Linus described as "pretty much in line with expectations". [1]

Quoting further from that mail:

> Something around 5% performance impact of the isolation is what people are looking at.

> Obviously it depends on just exactly what you do. Some loads will hardly be affected at all, if they just spend all their time in user space. And if you do a lot of small system calls, you might see double-digit slowdowns.

So in general, the hit should be around 5%, and "[y]ou might see double-digit slowdowns" seems like the hit on a worst-case workload is hovering closer to the 10% range than 30%. That's also what the anecdote from LKML shows, unlike Phoronix which shows 25%-30% or worse.

This is more of an attrition thing than a staggering loss. With people saying MS patched this in November, it would be interesting to see if people saw a similar 5-10% degradation in Windows benchmarks since that time.

>How often do companies release performance downgrades of that scale?

I don't know which "company" you're referring to here, but substantial changes in kernel performance characteristics are pretty common during the Linux development/RC process, and yes, definitely some workloads will often see changes +/- 10% between the roughly bi-monthly stable kernel releases.

If you're surprised that Linux development is so "lively", you're not alone. That's one of the selling points of other OSes like FreeBSD.

[0] https://lkml.org/lkml/2018/1/2/678

[1] https://lkml.org/lkml/2018/1/2/703

I wonder if we'll see some performance return as subsequent patches are produced. I can't tell from the coverage so far if this is possible.

A lot of people have noticed that High Sierra is slower than Sierra, specifically for filesystem operations with APFS. I wonder if Apple knew about this ahead of time and this explains the overhead?

Probably not. APFS just does a lot more than HFS+, so there was a huge performance impact on disk-related operations before this change even goes in.

This is an all-hands-on-deck kind of situation. Apple doesn't usually do well with security fire drills like this.

If work on the NT and Linux kernels started in November, they must know. Intel must have told them; the alternative, Apple learning about this from a third party and grilling them over it, would be too scary.

I just wonder whether Apple is threatening to move all their Macs to AMD, or to ARM?


Unless they live in a bubble, I’m sure most people at Apple are already aware of this.

It was a joke referencing the login vulnerability that was disclosed on twitter a couple of months ago.

AMD must be pretty happy about this patch: https://lkml.org/lkml/2017/12/27/2

Not quite -- without his patch, the performance penalty will hit them too. Tom is proposing to be excluded from the proposed solution as it would hit AMD with collateral damage from the `X86_BUG_CPU_INSECURE` fix.

I'm sure there are frantic emails claiming that AMD shouldn't be punished for Intel's mistake.

EDIT: actually the fix will go out with 4.14.12 and 4.15rc7, both `X86_BUG_CPU_INSECURE` and AMD's addendum to be protected from the collateral damage.

Maybe it's just me but that name is just too general. Is there no other conceivable way an x86 CPU can ever be "insecure"? Why'd they use something so vague? Is this part of the redaction?

Not so sure about that. I am reading the merge commit, and comments are pretty interesting:

  --- a/arch/x86/include/asm/processor.h  
  +++ b/arch/x86/include/asm/processor.h
  + * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
  + * address, then that syscall will enter the kernel with a
  + * non-canonical return address, and SYSRET will explode dangerously.
  + * We avoid this particular problem by preventing anything executable  
  + * from being mapped at the maximum canonical address.
  + *
  + * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
  + * CPUs malfunction if they execute code from the highest canonical page.
  + * They'll speculate right off the end of the canonical space, and
  + * bad things happen.  This is worked around in the same way as the
  + * Intel problem.

I wrote that text. It's just documenting a bug that never affected Linux at all.

I'm having trouble finding a good reference right now.

That's an old issue, fixed many months ago, not related to the 'new' Intel bug. The comment was updated in this patch series, that's it.

It's a completely unrelated bug. This DragonFlyBSD commit message does a pretty good job of explaining it: http://lists.dragonflybsd.org/pipermail/commits/2017-August/...

As long as the patch is all anyone notices, it looks great for AMD marketing. But I'm fairly confident that comments call out AMD CPUs doing things that should be considered bugs (a quick browse through the patches in the latest RC confirms at least one instance of this), so there's probably not going to be a lot of PR people telling the masses to read the source code. It's not exactly flattering to anyone.

One thing I can definitely respect about the kernel developers is that they don't seem to make any effort to be nice about the fact that they need to deal with undocumented nonsense from vendors all round.

Tom Lendacky works at AMD, so yes... :)

"Intel has a bug that lets some software gain access to parts of a computer’s memory that are set aside to protect things like passwords."

Seems like very little got through to the media about the details regarding this flaws effects and costly workaround.

Very little by way of detail has been made public yet, not even to the technical press. Even relevant comments in the Linux source are redacted at the moment. Hopefully, further details will be released in good time (in the next month?) once people have had time to install the patches that are going out RealSoonNow (i.e. the huge wave of updates on Azure's VM hosts).

> Even relevant comments in the Linux source are redacted at the moment.

People keep repeating this claim because it sounds dramatic, but I'm not sure it's a fair description. The original source appears to be a single snide tweet from @grsecurity [1] referencing this comment [2].

It's far from obvious that the comment was even "redacted" at all. It seems more likely that "stay tuned" is either a reference to the more detailed comments elsewhere in the patch (in arch/x86/kernel/ldt.c), or a reflection of the fact (which is clearly spelled out in the commit message) that future patches are likely to change the location of the LDT mapping.

I've skimmed through the commit messages and comments from the latest patchset [3] and couldn't find anything else that even hinted at redaction, nor could I find any mention of redactions on the linux-kernel mailing list.

Furthermore, it's worth bearing in mind that @grsecurity has been involved in numerous public feuds with the Linux security folks. So in the absence of concrete evidence, I'm not particularly inclined to assume his tweet was made in good faith.

Bloomberg is not going to focus on technical detail too much (or at all) given their readership. Follow the link to The Register for more detail.

This reddit thread has more information on the bug https://www.reddit.com/r/sysadmin/comments/7nl8r0/intel_bug_...

That Reddit thread is repeating back information from previous hacker news threads.

Spreading of information is a good thing.

It's not information. It's speculation.

Er, no.. what? There's a lot of non-speculative information in that reddit post..

in this particular case, will the generalization help their readership?

I actually think it will - it would be easy to give more accurate details that cause many readers to glaze over.

That was my reaction. How to ELI5 the risk in multitenant VM environment? It could steal passwords

> How to ELI5 the risk in multitenant VM environment? Some other guy paying EC2 steals your customers passwords by sheer force of will

> How to ELI5 the risk in desktop PC? A piece of JavaScript in some 0x0 pixel iframe in a tab you're not even looking at stealing your passwords and SSH keys

(Although nothing is proven in the latter regard, I wouldn't be per se surprised to see something in that direction once the exact nature of the issue is more widely known)

This isn't really too far out of proportion with normal daily fluctuations of AMD stock this year.

Quiet, you're spoiling the narrative we're imposing on semi-random, chaotic events.

+8.79% AMD & -4.44% INTC does not look like a semi-random event anymore. Seems like wider audience is waking up to the news.

To add another point: NVDA is up 5% and doesn't have the same direct competitive narrative. If money has to be spent rectifying the issue with new or more CPUs due to performance loss, that's less money available for spending on Nvidia GPUs. It might tip a few marginal applications in favor of eating the development costs to migrate to GPGPU, but that effect isn't likely nearly as high, especially if applications with lots of system calls are affected the most.

Chipotle is also up 5%. All these engineers are going to need lots of burritos to eat.

You're saying that gas in the cleanroom caused the chips to behave incorrectly?

Excuse me while I go and invest in a company making rubber underwear.

Someone's gotta sell Burger King sesame seeds...

That plus approaching ER

"Chip design flaws are exceedingly rare."

Writes someone who's never seen an errata.

What started as speculation about recent kernel developments has really turned into a shitshow for Intel. I can't imagine they thought that these massive changes would get through without anyone finding out. However, there really isn't any other better option for Intel, so it seems they are in a lose lose situation with no way out. Or at least a way out that doesn't involve them going bankrupt trying to repair the damage.

> going bankrupt trying to repair the damage.

I think you took it a bit too far. Maybe I am missing something. Is it really that bad?

There are at least tens of thousands, possibly hundreds of thousands, of Intel CPUs in datacenters alone around the globe, most of them controlled by companies that make a lot of money and paid a lot of money to Intel for their CPUs. I doubt they will just take this kind of hit to their performance sitting down. And that's just datacenters, not to mention all the personal computers (mine included) that will suffer. At this point, if this is real, it's not a question of if it will cost them, but how much. Considering the sheer number of products affected, I can't imagine it will be cheap. I am not saying that I think they are going to go bankrupt, and I would be surprised if they did, but a 30% performance hit is multiple generations of regression, and given the importance of computing today (and the number of different entities affected by this), I find it hard to imagine it will just be shrugged off.

Bankrupt might be a wild guess, but it is one of the largest blunders I can recall: twelve years of product lineup affected, a fix with a large performance penalty, and the flaw hardwired. Intel is gonna enjoy a healthier diet for a while.

Interestingly enough, Intel's and AMD's market values are affected by this but the cloud providers' aren't, which I find surprising.

Do you pay for computing cycles or for a computing instance?

If your cloud provider lowers its performance/efficiency by 30%, isn't it up to its customers to switch? But where to switch to? There's no choice among the big cloud providers; they all use Intel and AMD in varying amounts. As a cloud user you are stuck with Intel or AMD chips. Demand doesn't change; supply is lowered.

Looking at it like this, it is more likely that demand for cloud resources will rise, and as such revenue of the cloud providers will as well.

My guess is that it will help cloud providers. In most cases they don’t guarantee x operations per second but rather the type of cpu you get. If that type now gets 30% less performance you’re most likely going to have to pay 30% more. Again the guarantee is not on operations per second but either exact processor, class of processor, or processor units.

Hm. Do cloud providers need to provide up to 30% more capacity for the same price, or do customers need to purchase up to 30% more capacity to get the same speed?
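Worth pinning down the arithmetic here, since "30% more" gets quoted both ways. If each machine loses a fraction of its throughput, restoring the fleet's original capacity takes proportionally more than that fraction in new machines, because the replacements are degraded too. A quick sketch:

```c
// If each machine keeps only (1 - loss) of its former throughput, restoring
// the original total capacity takes loss / (1 - loss) extra machines:
// a 30% per-box hit needs ~43% more hardware, not 30%.
double extra_capacity_needed(double loss) {
    return loss / (1.0 - loss);
}
```

So customers compensating for a true 30% slowdown would actually need roughly 43% more capacity at the same per-unit price.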

Unless demand is fully inelastic, if you up the price fewer people will buy. As such, this should hit the cloud provider profits.

> Unless demand is fully inelastic

It's mostly inelastic. People don't buy resources unless they need to do work. Alternatives, such as on-prem hardware, are affected as well.

The only work that would be affected would be the work that becomes unprofitable at a 10-30% hardware cost increase and this is probably marginal enough to be ignored.

> It's mostly inelastic. People don't buy resources unless they need to do work.

I work on a system (at Google) with significant hardware cost. Fundamentally, the work my system is doing needs to be done. But the time we spend improving efficiency is hugely elastic. I look at a CPU profile or request flow, have an idea to improve it, look at how much machine resources we're spending, guesstimate how long it will take us to implement and maintain, and use a chart to see if my idea is worth my team's time or not.[1]

If the resources get more expensive or simply impossible to acquire,[2] we'll optimize more.

[1] There are other considerations (will it make the code more complex, thus increasing risk of a security/privacy flaw? what about opportunity cost given that it's hard to hire more engineers and scale a team?) but that's a reasonable mental model.

[2] There's a global RAM and SSD manufacturing crunch already, and if lots of CPUs are replaced due to these vulnerabilities or simply are no longer enough, that's gonna be a big crunch as well. If you're one of the few biggest cloud providers, I think you can't just replace / add tons more hardware than planned without dramatically increasing the price per unit for everyone.

> But the time we spend improving efficiency is hugely elastic.

In the context of the above posts, "elastic" is an economic term used to refer to the demand curve. There is no supply/demand curve when it comes to internal technical decisions for optimization =)

I would say your work is highly correlated to the price of hardware per some unit of performance AND the amount of work that you need to complete AND the amount of hardware already available AND the cost of labor needed for optimization.

I imagine in your case, since these systems are tightly controlled, you can probably run unpatched without taking a substantial risk.

I'd say most of the elasticity comes from people deciding not to migrate to the cloud at all. Rather than people deciding to stop using the cloud.

*fully elastic

No, if demand is slightly elastic, an increase in price will still cause a decrease in demand.


This bug doesn’t magically chop all CPU performance down by almost a third.

What does it do, then? That is approximately what the mitigation patch does, according to early averages of perf changes.

Performance is chopped 5-30%, with newer CPUs that have PCID affected less significantly. [0] BoorishBears is possibly pointing out that 30% is a worst case.

[0]: https://www.phoronix.com/scan.php?page=article&item=linux-41...

Combined with what the other comment mentions, it's a range from 5% to 30%, with 30% being a worst case the average user does not encounter.

This is an issue, but laypeople are overblowing the effect on their everyday computing.

What about gaming performance? Once everything is patched[0], that is. I know no one knows, but it'll be interesting to see Intel's last remaining advantage (clock frequency) mitigated somewhat by all of this.


Syscall-heavy workloads are affected to a much greater degree than CPU-heavy workloads. (This is a direct consequence of the fact that the workaround involves a per-syscall overhead.)

Yes. Which is why I'm saying "up to 30%"

Are there any major cloud providers whose stocks depend on their cloud earnings?

Cloud is probably a fraction of Google’s, IBM’s, Oracle’s and MS’s profits. It’s probably a significant part of Amazon’s profits but Amazon doesn’t trade on earnings.

This is not a PR nightmare, it's a technical competitive nightmare. For many workloads, last-gen AMD server chips are now competitive with current-gen Intel server chips; and the current generation already had a lot in its favour. There's no amount of PR fluff that can save Intel from this, they can only release a new design with this fixed and hope they don't lose too much ground in the mean time.

As a shareholder that's great. Concerned that this is a very fragile win though. I don't think there will be much real world fall out.

If all Intel CPUs get 30% slower overnight, then there will be some fallout in the real world.

They might be 30% slower in virtualization and should be about 5-7% slower in real-world usage. As a heavy user of the cloud I'm worried about my infra more than about my laptop: if my cloud setup gets even 7% slower overnight it won't be good, to say the least, especially with lower-clock 2 GHz Skylake GCP CPUs.

That 5-7% will make me feel like I'm powered by an Atom, and it will be even worse for single-core-bound workloads in the cloud.

>They might be 30% slower in virtualization and should be about 5-7% slower in real world usage.

Virtualization IS real-world-usage. This is going to damage Intel where it will hurt the most, the datacenter (which is largely virtualized using VMware, or Hyper-V). The Xeon CPU's have some of the best profit margins for Intel. If they erode away to Epyc (which is finally becoming available) this could be pretty good for AMD's, espically since AMD has said the the past few years there strategy is to go after the datacenter market.

There's other real-world usage (DAWs, for example, where there is a ton of I/O for sample-based virtual instruments) which will likely be as significantly impacted as virtualization...

How you hold this stock is a mystery to me. With it bouncing between $10 and $15 constantly last year it just seems far too volatile. I'd be panic buying/selling every day to adjust, which seems stressful...

Maybe that's why it's a favorite of /r/wallstreetbets

You're doing it wrong if you're day trading stocks... dollar cost averaging is your friend when it comes to volatility, and if you're investing because of a long-term belief in the company, it should be easy to hold and not worry about it.

"dollar cost averaging is your friend"

Research suggests otherwise:


Diversification is a more effective strategy for dealing with volatility.

I only read page one of that link (thanks for sharing), but it appears to limit its scope to a specific scenario: given a lump sum of money (a windfall), is it better to invest all at once or dollar cost average over a period of time? It is not recommending against dollar cost averaging in general; it is recommending against it for windfalls. Diversification is also not mentioned in the article.

I read that Vanguard study when it came out, yes it's correct statistically but I disagree with using that data to do lump sum investments. For one, the CEO in their 2018 outlook webcast said to DCA. Even if you keep a short timespan on your DCA (I do 26 weeks), it's a good idea for your mental health if you happen to dump it all right before a collapse.

Sometimes it's wiser to take into account human psychology, even if academics are out there with data to convince us otherwise.

I'm personally with the CEO of Vanguard on this one. A short entrance into the market, especially in today's situation is probably the way to go. People can do whatever they want, and if they put their money where their mouth is and go in with a lump sum, then I'll respect it.

Otherwise, as someone who is bringing a lot of money myself onto the market now, I'm DCAing.

I bought a ton at 9 before the Ryzen announcement. The oscillation in the stock drove me up the wall, and I came to the conclusion that it’s just stupid to track single stocks, even when you know they have a great product, because the market is so irrational.


Guilty as charged.

>I'd be panic buying / selling every day to adjust which seems stressful...

Just treat it as numbers on a screen rather than money. Keeps the emotions at bay.

My current position's aggregate cost basis is around 2.35/share, so that's why I'm holding.

>I'd be panic buying / selling every day to adjust which seems stressful...

If you can train yourself to do the opposite you could make a lot off this stock. Buy fear, sell greed. Never panic.

Losses are never realized if you don't sell; just wait it out until it's back near 15 again.

"I'd be panic buying / selling every day to adjust which seems stressful..."

Wrong way to invest. Don't buy an individual company's stock unless you are ready to (1) hold for the long-term and (2) ignore (short-term) unrealized losses. Especially in the first year after investing in an individual stock it is typical to see an unrealized loss, because gains require time to accumulate.

The stock market is not a slot machine. Investing requires patience and discipline.

There's no particular reason any random company's stock should have gains over time, aside from inflation. Especially if it doesn't give dividends, you may never gain anything.

Nothing is certain, but you can sort companies into "likely to gain" and "unlikely to gain." The P/E and P/CF ratios are commonly used for this, along with other valuation metrics. The basic theory of value investing is to find companies that are likely to give you a return.

Also, historically the stock market as a whole has outpaced inflation, so even just investing in randomly chosen companies should have gains over time (there is even evidence that a portfolio of randomly chosen companies performs about the same as a portfolio of companies chosen by analysts and advisers).

try trading crypto

I'll stick to blue chips and paper trading. The NHS can only hand out so much heart medication...

Laughs in Bitcoin

I am not so sure. Cloud providers will probably want to get this fixed and AMD would be a convenient alternative in the short term. I think the bigger risk is that rather than going with AMD, cloud providers will pursue a different CPU architecture altogether, especially in the long-term. There were plenty of things to dislike about x86 virtualization etc. before this mess.

Will there be any way to disable or block the upcoming patches and keep the performance for those of us who really just don't have any reason to care about inter-process information leakage on our personal computers?

Edit: I'm (also) wondering about Windows, in case anyone knows yet.

Yes, the work-around can be disabled via a boot-time argument.

Thank you! Do you know if this will be true for Windows as well, or just Linux?

From the merge commit:

  +#ifdef CONFIG_PAGE_TABLE_ISOLATION
  +# define DISABLE_PTI		0
  +#else
  +# define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
  +#endif
PS - MSFT has not published relnotes, so we do not know yet. We'll find out soon enough.

I meant disable at run time, not disabling via recompiling your own kernel.

That is also in there. You can specify either "pti=off" or "nopti" as a boot parameter.

  +void __init pti_check_boottime_disable(void)
  +	ret = cmdline_find_option(boot_command_line, "pti", arg, sizeof(arg));
  +	if (ret > 0)  {
  +		if (ret == 3 && !strncmp(arg, "off", 3)) {
  +			pti_print_if_insecure("disabled on command line.");
  +			return;
  +		}
  +		if (ret == 2 && !strncmp(arg, "on", 2)) {
  +			pti_print_if_secure("force enabled on command line.");
  +			goto enable;
  +		}
  +		if (ret == 4 && !strncmp(arg, "auto", 4))
  +			goto autosel;
  +	}
  +	if (cmdline_find_option_bool(boot_command_line, "nopti")) {
  +		pti_print_if_insecure("disabled on command line.");
  +		return;
  +	}
  +	if (!boot_cpu_has_bug(X86_BUG_CPU_INSECURE))
  +		return;
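As a side note on verifying what you actually got: once patched kernels ship, the state should be visible from userspace. A hedged sketch (the sysfs path is an assumption about how post-KPTI kernels will report this; on older kernels the file simply won't exist, while checking /proc/cmdline for "nopti" works regardless):

```c
// Sketch: peek at what the kernel reports about the mitigation.
// On kernels with the fix, /sys/devices/system/cpu/vulnerabilities/meltdown
// is expected to read e.g. "Mitigation: PTI" or "Vulnerable" (if booted
// with nopti), and /proc/cmdline shows whether "nopti"/"pti=off" was passed.
#include <stdio.h>

// Read the first line of a file into buf; returns 1 on success,
// 0 if the file is missing or unreadable.
int read_first_line(const char *path, char *buf, size_t len) {
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;
    int ok = fgets(buf, (int)len, f) != NULL;
    fclose(f);
    return ok;
}
```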

Just don't do this if you have PII, HIPAA, or PCI data on your computer.

Even if they don't care about their own safety, they should patch for the rest of us. Who knows how long until their unpatched system gets pwned through some other vulnerability and ends up in some botnet, spreading the pain.

Please just don't. It isn't worth the pain and risk just to have a little faster system. Maybe you don't care about this patch, but you will need others that are dependent on it.

Just patch.

> Please just don't. It isn't worth the pain

No, it very much is.

> and risk

No, there is no risk. I already run everything as admin.

> just to have a little faster system.

5-30% is not "a little".

> Maybe you don't care about this patch

Indeed. And I expect many other power users also don't (but regardless, this is irrelevant).

> but you will need others that are dependent on it.

Well when that actually becomes a problem I will act accordingly. If more patches like this pop up I obviously won't install any of them. If there's a patch for a drive-by browser exploit depending on this, that will obviously be a different story.

> Just patch.

Hell no. My patching this makes my computer slower while providing exactly zero benefit to anyone.

>Well when that actually becomes a problem I will act accordingly.

I don't mean to be rude, but if you have to ask how to disable automatic updates then you probably aren't someone who keeps up to date with all the latest issues. When it becomes a problem, you just won't know.

Let your OS vendor do all this for you. They are good at it.

> I don't mean to be rude but if you are having to ask how to disable automatic updates then you probably aren't someone who keeps up to date with all the latest issues. When it becomes problem, you just won't know.

...Wow. First of all, that's not what I asked. I asked how to disable or block this patch. Blocking "automatic updates" is neither equivalent to disabling this patch (post-install) nor to blocking it (pre-install). Second of all, I'm running Windows 8.1, on which I can actually block updates easily. I don't know if I can be picky about which patches I block on 10 because I have barely used it, but I will have to start using it soon and I really don't want to waste time installing the update only to find out I can no longer uninstall it.

And third of all, you're really spewing nonsense. I've done security work in the past which I don't care to post details about here anonymously. I still keep up with security news regularly and I actually look into the update details before installing them (which should be obvious if you read my previous comment on how I said what I do depends on the actual updates). None of which you need to believe (and I really don't care if you don't), except for the minor caveat that if you're trying to be convincing, this holier-than-thou attitude moves you well in the opposite direction.

It seems like it is all thanks to this patch: https://lkml.org/lkml/2017/12/27/2

If I understand correctly, there may be a serious bug in some x86 CPUs, but nothing is known publicly. Presumably all current Intel CPUs are affected and none from AMD, but we can't really be sure; it is still a secret.

It is impressive how a simple, yet to be justified patch has so much influence. It opens up new ways of manipulating the market...

But it's not just a random patch posted to LKML, it has also been reviewed and merged to tip. That means it has been justified, just not to you.

If you look here: https://www.computerbase.de/2018-01/intel-cpu-pti-sicherheit... (Sorry it is German)

But if you scroll down to "Windows-Benchmarks: Anwendungen" you can see that most applications do not have any performance hit with the Windows patch.

Only M.2 SSDs seem to be affected.

All their tests are CPU-bound, not IO-bound, namely archiving, rendering, and encoding video. Performance degradation happens with transactional, IO-bound tasks (think NoSQL databases, ad serving, trading).

It is possible Microsoft has mitigated the issue in a way that has a much smaller performance impact. Maybe they already had a highly tuned feature to enable kernel page separation coded but disabled. I won't be surprised if even the Linux implementation is tuned to the absolute limit in the coming months.

> I won't be surprised if even the Linux implementation is tuned to the absolute limit in the coming months.

To me, this sounds like unnecessary work if Intel is coming up with a microcode patch within a few months.

I think so too. It also seems the Linux version right now is in a "get it to work, optimize later" state.

If you know how to materially optimize it, I'm all ears.

This reminds me of a storyline from The West Wing, where a company that was Intel in everything but the name found a bug in their chip.

As it threatened the company's existence, the President refused to consider guaranteeing a loan because the company had been a contributor, and any help could potentially appear as corruption.

How quaint such considerations seem today...

I'm worried about the performance impact on low-end Intel chips like the Atom/Celeron found in Chromebooks. A 30% hit will make computing on those platforms miserable.

Speaking of Chrome OS, is there any speculation about the impact of the bug? Do its hardened sandboxing techniques put it in a better position even if KASLR is compromised?

The bug isn't really about KASLR; it's about reading kernel memory from userspace through side-channel attacks on speculative execution.

KASLR is/was the cover for the kernel patches, to avoid disclosing the real bug.

Indeed, a kernel memory read seems more likely than an address-only leak; the memory probably needs to be cached (in L1, maybe?). The attack is a timing attack, as can be seen in this very interesting tweet - https://twitter.com/brainsmoke/status/948561799875502080
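For the curious, the timing primitive these attacks rely on is simple to sketch. This is only an illustration of cached-vs-flushed load latency (x86-specific, uses rdtscp and clflush), not an exploit:

```c
// Toy illustration of the timing side channel (not an exploit): a load
// that hits in cache is measurably faster than one whose line was flushed.
// The speculative-execution attacks use exactly this latency difference
// to leak data, one cache line at a time.
#include <stdint.h>
#include <x86intrin.h>

// Cycles for a single one-byte load, bracketed by rdtscp (which waits for
// prior instructions to retire before sampling the timestamp counter).
uint64_t access_cycles(volatile uint8_t *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

// Minimum over n trials of either a warm (cached) or freshly flushed
// access; taking the minimum filters out interrupts and other noise.
uint64_t min_access_cycles(volatile uint8_t *p, int flush, int n) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < n; i++) {
        if (flush)
            _mm_clflush((void *)p);   // evict the line from every cache level
        else
            (void)*p;                 // or make sure it is cached
        _mm_mfence();                 // order the flush/warm before the probe
        uint64_t c = access_cycles(p);
        if (c < best)
            best = c;
    }
    return best;
}
```

The reported attacks combine this with speculation: the CPU transiently reads a privileged byte, uses it to index an array, and the attacker then times which cache line got warmed.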

The comment on the Linux kernel mailing list does pinpoint attacks using 'speculative execution', but isn't access to kernel memory space by any means a potential hazard to any security that depends on KASLR?

Honestly, that serves Google right. I've been saying for years how stupid Google is for encouraging 99% of Chromebooks to be powered by Intel chips (even more so than Windows machines), when Chrome OS itself is architecture-agnostic (for the most part).

It was just stupid through and through. Even today we see ARM coming to full versions of Windows, but Google is still kicking it with Intel CPUs.

> AMD shares surged as much as 7.2 percent to $11.77 Wednesday. Intel fell as much as 3.8 percent, the most since April, to $45.05. An Intel spokesman declined to comment.

I have no real context for this, but is 7.2% considered a "soar"? And a 3.8% decrease seems like kind of not a lot, considering what's fucking happening here.

A 7.2% rise intraday is pretty big, especially for traders holding leveraged positions. If you're comparing to crypto markets then no, but generally speaking yes.


The stock price is the present value of the future profits. Since this shouldn't have a huge effect beyond 4 years, that's saying profits for Intel in the short term would be down by 20 percent.
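To make the parent's back-of-envelope explicit (all numbers assumed for illustration: the discount rate and horizon are not from the article), here's the implied calculation:

```c
// If the stock price is the discounted sum of future profits and the
// damage only lasts the first `hit_years`, an observed price drop implies
// a proportionally larger profit drop over that window.

// Fraction of total discounted value contributed by years 1..hit_years
// out of `horizon` years, at discount rate r.
double value_share(int hit_years, int horizon, double r) {
    double df = 1.0, num = 0.0, den = 0.0;
    for (int t = 1; t <= horizon; t++) {
        df /= (1.0 + r);            // discount factor for year t
        den += df;
        if (t <= hit_years)
            num += df;
    }
    return num / den;
}

// Implied fractional profit cut over the hit window for a given price drop.
double implied_profit_cut(double price_drop, int hit_years, int horizon,
                          double r) {
    return price_drop / value_share(hit_years, horizon, r);
}
```

With a 3.8% price drop, a 4-year hit window, a 20-year horizon, and an 8% discount rate this gives roughly an 11% implied profit cut; the parent's ~20% figure corresponds to assuming a longer horizon or lower discount rate, so that less of the value sits in the first four years.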

Literally just read here yesterday how the CEO dumped the majority of his shares. So shiesty. https://news.ycombinator.com/item?id=16055851

Don't be so quick to blame. You might have read it yesterday, but he reported it on Nov 29, which means the transaction happened either within the two days prior or about a month earlier, depending on whether you believe the SEC's website or NASDAQ's. But the question isn't when he sold his shares; the question is when he put in the order to sell them. Those guys put their buy/sell orders in months in advance because of exactly this problem. Or maybe he had a limit order in place: the beginning of November was a 2-year high, so you could easily imagine a limit order executing about then.

Evidence shows this bug was known in November. It only blew up this week.

Probably even earlier.

Don't spread conspiracy theories.

It's absolutely impossible for the most visible executive of one of the largest firms to engage in insider trading in such an obvious fashion and get away with it.

At this level, there is always a paper trail of who knew what when. There are internal and external audits if any suspicions arise.

Plus there are severe penalties, both civil (in the employment contract) and criminal. Intel's CEO is without a doubt in the 9- or 10-digit range of personal wealth. Risking time in jail to avoid a 10% loss on his stock holdings would be a terrible decision even if he considered the chance of being caught low.

I think we're going to find in the coming years that there is a lot of white collar crime going on at any time. It seems like it would be foolhardy to attempt this stuff, but greed compels people to do stupid things all the time. Wealthy C-suite people are not totally immune to that.

You'll be happy to know that:

[in 2016] the SEC charged 78 parties in cases involving trading on the basis of inside information.

A number of these cases involved complex insider trading rings which were cracked by Enforcement’s innovative uses of data and analytics to spot suspicious trading.

For example, the SEC brought insider trading cases against:

- two hedge fund managers and their source, who was a former employee of the U.S. Food and Drug Administration

- a former Goldman Sachs employee

- a former senior employee at Puma Biotechnology Inc.


I am confused. As far as I know, two different vulnerabilities have been discovered, Meltdown and Spectre; while the first affects only Intel CPUs, the second affects AMD and ARM as well. So how come I'm not seeing much talk about the latter? Is it because it is harder to exploit?

I didn't have a chance to read the 2 papers so I would appreciate a TL;DR.

I am a SW developer working with high level languages; security and OS development are not my specific fields so while I don't need an ELI5 I would appreciate a sufficiently "layman's terms" explanation.

Replying to myself: most of the info can be found here

* https://googleprojectzero.blogspot.com.au/2018/01/reading-pr...

* https://www.theregister.co.uk/2018/01/02/intel_cpu_design_fl...

* https://meltdownattack.com/

* https://www.amd.com/en/corporate/speculative-execution

* https://newsroom.intel.com/news/intel-responds-to-security-r...

TL;DR is that mitigation for Spectre on an OS level is not very expensive in terms of performance while Meltdown mitigation, which affects only Intel, will have a performance penalty between 5% and 30%

The bottom of https://danluu.com/cpu-bugs/ suggests that AMD isn't any better, so this is likely just short term.

They don't have to be better to benefit from purchasers wanting to diversify their plant across two CPU vendors rather than one.

This specific issue is very relevant to cloud providers, who are the guys that buy thousands of CPUs, and doesn't affect AMD's current generation. I don't know what Intel's profit breakdown is, but I suspect it is heavily weighted to the high-end server chips where they have had no real competition for many years.

Since that article came out, AMD built a whole new CPU architecture, with the help of the guy that designed the K8, too.

Didn’t the Intel CEO just sell half of his shares/options? If he knew about these issues isn’t that illegal?

You can be sure there's a young hotshot with sparkling eyes at the SEC who is already typing a letter to Mr. Krzanich politely asking about the circumstances of that sale.

At the same time, no one is doing eight-figure transactions which require reporting to the SEC without talking to a lawyer. Right? Right? They really dislike insider trading, it's one of the few things where even rich people can get imprisoned -- and the typical jail sentence has been steadily climbing up for decades now.

> They really dislike insider trading, it's one of the few things where even rich people can get imprisoned -- and the typical jail sentence has been steadily climbing up for decades now.

Insider trading, like many white-collar crimes, exists primarily for its value as a weapon. There is nothing actually illegal about the act of selling a stock; it's all about casting aspirations as to intent and who-knew-what-when.

In other situations, intent is usually an aggravating factor, enhancement, or affirmative defense. It is not the thing that qualifies an otherwise 100% legitimate act as a bad thing.

My anecdotal, unsubstantiated perspective is that insider trading is unlikely to be an issue for anyone who hasn't made enemies, and that it may suddenly become an issue for anyone naive enough to make enemies recklessly. cf. Martin Shkreli, who couldn't be linked to a specific "bad trade" and so was brought up on generic "securities fraud" charges instead.

Not playing ball with the people wielding these powers seems to be the dangerous thing.

Intent is a key component of most crimes, isn't it? https://en.wikipedia.org/wiki/Mens_rea

Disclaimer: I'm not a lawyer, there are definitely people who can explain this better, and I'm probably using these terms incorrectly.

Yes, mens rea is an important consideration, but it's nuanced. Check the Model Penal Code [0], which identifies four differing types of mens rea, including negligence and recklessness; that is, a "guilty mind" (mens rea), for criminal purposes, does not necessarily require what would be conventionally considered bona fide malice or intent to harm.

What I meant when I said an "enhancement" or "aggravating factor" is that usually you have an objectively asocial actus reus, like theft, and should mens rea come into play, it's generally a defensive thing seeking to exculpate the accused, as in "I didn't know it belonged to someone else" (the affirmative defense), not to deny the act.

But with insider trading and other instances of nuanced malum prohibitum [1], the usual relationship between actus reus and mens rea is inverted. To make the crime, one starts with the mens rea, the bad intent, and must identify (or, if necessary, manufacture) an apparently-normal act to register as the external offensive conduct that harmed society and warrants legal action.

That is a scarier proposition because if your daily business involves technical acts that can be converted into actus reus, there's obviously going to be ample opportunity for people to assign and rationalize their preferred ideas about your thought process there and convince themselves that you're a criminal based on their personal level of dislike or offense. If this gets brought in court, your defense will amount to convincing the jury to believe you instead of the prosecutor, which is a straight-up likability and performance contest.

Whereas, with better-defined crimes, there is a physical, independent actus reus that people recognize as objectively bad and probably intentional. If you didn't steal the thing, if they can't show that you stole the thing, that's now the ground that you're fighting over, and that's much better for the defendant because it's much less fickle.

Essentially it makes every defense necessarily affirmative because the conduct is not otherwise unlawful. The government must dislike you enough to assume bad faith first.

[0] https://en.wikipedia.org/wiki/Model_Penal_Code#Mens_rea_or_c... [1] https://www.law.cornell.edu/wex/malum_prohibitum

The expression you're looking for is "casting aspersions."

Heh, you're right. I promise I knew that. ;) It's too late to fix the typo now, but I appreciate the correction.

Some people think it was due to the tax code changes.

There is a 0% chance you could prove he sold for that reason. Also, if you didn't notice, the stock is only a few percentage points from its ATH, and almost trading at the exact price he sold it at.

I have no idea what happens when the official security disclosure happens and the bugfixes get released.

And what the SEC can prove and what they can't, I do not claim to be an authority of.

Intel and AMD were informed of the vulnerability in June. The stock is up significantly since then, if the CEO was selling for that reason you'd assume it would be much sooner after. In all likelihood this is for tax purposes. The SEC would likely never be able to prove otherwise. I also doubt the SEC even looks at it. There was heavy options volume leading into it though, which they might.

The timeline of events is... interesting: Intel filed its quarterly report on October 26, and the trades were initiated on October 30 and executed November 30. There's no doubt he knew. The big question is what lies ahead. When the patches hit, when people other than, how to say, geeks realize what's up, then comes the question: does Intel stock drop? If it does, then even the company might come under fire for not disclosing a significant risk.

Not sure how this hurts Intel much; they essentially have a monopoly on the CPU space, and AMD is only really competing because of their GPU unit. AMD can't scale to meet demand if there were a big shift from cloud providers, and even there Nvidia is miles ahead of everyone (GPU). If anything, this could help Intel sales, because the only way to truly stop this is to get new chips... which Intel will provide, and vendors will have no choice for many larger clients. ARM chips have the same exploits (both variants). AMD can't scale its business for that kind of demand, and its chips are slower... so what are you going to do?

I doubt the CEO is exactly shaking in his boots or sold for that reason. I'd even be willing to bet Intel will provide a fix on the hardware and continue to use the same sockets so cloud providers can just change them out without further hardware changes.

Intel's stock will probably trade down to 40, fill the gap and continue its uptrend, largely because the market is very bullish on the chip space right now.

It's a good thing.

What is a good thing...?

Martha Stewart did jail time for insider trading. Turns out the SEC isn't messing around, and they don't hand out community service, probation and fines for this stuff.

That's a common misconception. She did not do jail time for insider trading; she almost certainly would have gotten away with that. The problem is she lied to a federal agent while trying to hide her insider trading. That's what they charged and convicted her of.

Ahhhh, solid reference. At first I thought you were saying insider trading was a good thing and I was thoroughly confused.

Martha has been off of people's radars for a while now. I may need new material.

That is Martha Stewart's catchphrase.

Looks like Krzanich sold $39.4M of stock on 11/28. Hard to say when the top brass knew about this flaw.


I don't understand this attitude that "top brass might not know about it". He's the CEO; isn't his entire job to know what's going on in the company? I know we get this idea that rich people just sit back and take in the money, but isn't the reason they're paid so much in the first place that they bear, in theory, major responsibility for what happens in the company?

You're putting words in my mouth. I didn't say he didn't know about it. I said we don't know WHEN he first knew about it, which is important in the context of insider trading.

According to Google, they found out about it in June, and I think it's fair to assume this was shared with Intel shortly after. From the moment it's shared with Intel, it's fair game. It's up to him to set up the company in such a way that important information like this gets to him quickly.

Even if he knew about it, if it was more than 6 months in advance and he follows all the legally-mandated protocols for setting up a regulated stock trade, isn't he safe?

The marvel is that somehow the news was kept from hitting the media until he had finished his trade, and then was revealed immediately after the transaction was complete. I wonder how that coincidence happened.

The SEC would need to be able to show that Krzanich initiated that rather large selloff based on this inside, privileged info. Given the short document retention policies in place at most companies, this would presumably prove pretty difficult.

The article that started most of the discussion listed the first publicly visible patches, from Microsoft, in mid-November, and one would assume that a bug that could cause major performance impacts like this made its way upstairs at Intel pretty quickly.


Intention to sell has to be registered with the SEC six months in advance, right?

Seems debatable. From what I recall, at least some news about this issue was already public before he sold his shares. And what would be illegal would be trading based on information that isn't public. OTOH, since details are still dribbling out, you could possibly argue that the Intel CEO had more complete information than the public.

The sell could also have been scheduled in advance, which would - AFAIK - not violate any insider trading regulations.

Unless the advance scheduling is completely binding, I don't see why that should sidestep insider trading. What's to stop these guys from always having a cascading series of sells and buys 6 mo. in advance, and just cancelling them?

SEC rules also cover cancellation of orders.


What is to "stop these guys" is that the legal parties and officials involved have brains that they can use. The engineer idea of "mwa ha ha, I found a bug, I can walk" is rarely true in practice (n.b.: having quite a lot of money is a different kind of exception).

Basically this: https://xkcd.com/1494/

A good rule of thumb is that if you think you've discovered a loophole that would allow insider trading, and it only took you a few moments of thinking, the SEC thought of it already.

What, they are never allowed to sell shares then? There are constantly bugs being found in major software and hardware.

Surely this is a material design defect that anyone would expect to hurt Intel's business in a very significant way.

>Unless the advance scheduling is completely binding

It gets filed with the SEC, so it should be suitably permanent.

The orders registered with the SEC and all these cancellations would attract a lot of their attention. I have no doubt they would see through it pretty easily once they started investigating.

I saw that rumor spreading last night, but have yet to see any trustworthy reporting on any such action.

Edit: Here's trustworthy reporting on that action that also predates this announcement: https://www.fool.com/investing/2017/12/19/intels-ceo-just-so...

I don't think there needs to be much "reporting".

Intel has a rule where c-suite and up execs who have been at the company 5+ years must hold a minimum amount of stock. For the CEO, that number is 250,000. The CEO recently exercised options and immediately sold them, holding onto the absolute minimum required (250k).

This is all available on the SEC's website, trades like this are reported to the SEC and are considered public information.

Small correction, but generally someone like that exercises their options and sells those for profit, keeping the shares they've held all along. In this case, he exercised the options, sold those shares, and then also lowered his holdings beyond that to the absolute minimum (250k).

> that also predates this announcement

Apparently, Microsoft has been releasing NT patches since November, so the sale doesn't exactly predate the bug.

It depends on the transaction. Recent sells are not updated, but all Intel insider trades can be found here:


Most of the trades are "automatic sells", which should mean the trades were set up days in advance. And the selling was only reported against the backdrop of the Intel issue.

Yes, it would be, but in practice it's hard to nail people on this. He can argue that he'd already scheduled the sell when he learned about this and it will be difficult to prove otherwise.

"it will be difficult to prove otherwise"

The SEC could look at internal emails he got about this flaw, and if he got them (and especially if he replied to them) then it's pretty clear he knew about it.

They doubtlessly have other investigative tools they can use for this as well.

>>If he knew about these issues

It's impossible for him to not know about something this serious.

If he actually did know about this, then this is proof of some serious incompetence.

I wonder what implications this will have on those who run Intel in their gaming rigs. I'm due to refresh, and _was_ gonna invest in Intel as my CPU. But this seems pretty damning for that.

I assume the system calls to interact with the GPU, or to do any sort of I/O, are going to incur the performance overhead. So rendering frames, reading/writing from the network, and loading assets from the disk could all cause issues.

And this is just for gaming. Anyone using a cloud provider that's running on Intel needs to worry about similar things.

What a nightmare...

Well, that's for Linux. We don't exactly know what effects such patches would have on other platforms yet. (do we?)

To make things worse there is an increasing number of games running DRMs like Denuvo and VMProtect that cause a significant performance hit. I think these will be more heavily affected by the patch.

I don't get why more developers don't take notes from CD Projekt.

Sadly I just bought an 8700k last week. Great CPU until next Tuesday? (Windows Patch Day)

Though even if I lose some percentage, it will still outrun AMD on single core applications (and probably also multi core ones).

I wonder how Intel will deal with this.

If I were Intel, I'd offer free replacements for at-par performance, and potentially tiny cash payment for upgraded performance. Assuming the marginal cost to produce chips, especially older/slower ones, is very low, the only real cost to them is losing out on potential upgrade sales which would have happened organically, for a while.

However, doing this keeps Qualcomm/ARM and AMD from making massive inroads into the market. As well, it would be a great way for Intel to accelerate adoption of their newer technologies, causing even greater lock-in (you could assume a much higher percentage of users will have a feature)

This all works great for socketed CPUs (still common for servers/cloud). For embedded (where the CPU is likely non-replaceable even if socketed), peak CPU performance probably doesn't matter as much -- maybe do a discount coupon? Or work with equipment vendors to subsidize upgrades.

Laptops and non-technical end users (who couldn't swap their own CPU) probably don't care as much but also don't have the ability to upgrade. A rebate/upgrade program would work, or a substantial cash payment. Doing it as e.g. $50 cash or $250 toward your next Intel-CPU laptop would be interesting.

Intel is rich, the market leader, incumbent across multiple segments, etc., so they really should go overboard on their response.

rofl, you don't simply upgrade chips installed in 1B+ servers, laptops, network devices, and embedded systems. In fact, statistically few CPU-containing devices are designed to ever have their CPUs replaced. At best, you replace the motherboard, which requires coordination with the vendor. Consider cars - good luck coordinating that recall!

Devices containing multiple security levels/distinct users, and where performance matters the most, are generally socketed-CPU servers. For other devices, if they're security critical and can't be addressed through software, you can upgrade them at the board or device level early. This has already happened in embedded devices due to the C2000 timer problem.

Assuming there is a software mitigation which has a performance impact, sophisticated users would be capable of adding more capacity (if it's a horizontal scale type workload), upgrading early (if they had extra capacity for futureproofing), or spending money, potentially subsidized by Intel, to upgrade immediately. If there's no mitigation, upgrade early, or rearchitect application (moving away from shared security domains on single boxes, etc.)

Just built a new Linux workstation the day after Christmas with a 4.2 GHz i7. In hindsight I should have bought an AMD Ryzen! Glad the computer I built my dad was an AMD.

The following story might comfort you somewhat. Someone had bad luck with Ryzen:


> The security updates may slow down older machinery by as much as 30 percent, according to The Register.

I will eat my hat if that 30% number holds up.

Honest question, does the performance hit from this patch actually hit Intel's best processors enough to make them perform worse than AMD's best?

I don't keep up on these kinds of metrics, but I'm under the impression that Intel still dominated CPU benchmarks before this issue, so if the answer to this question is no, I doubt it will affect Intel very much.

Intel hasn't been dominating CPU benchmarks since AMD's new Zen architecture came out. They still have a ~10% lead for gaming, but there are as many use cases where AMD is the best choice as there are where Intel is. So if this bug slows down Intel servers but not AMD servers by 10%, then suddenly choosing AMD is a no-brainer.

It depends on what your program is doing. If your program uses a lot of syscalls, then it'll be slower. If it hardly uses them at all, then it won't be as affected.
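To make the dependence concrete, here's a toy sketch (not a real benchmark — the absolute numbers depend on kernel, hardware, and whether the KPTI patch is applied) contrasting a syscall-heavy loop with a pure user-space loop. Only the former pays the extra kernel-entry cost the patch introduces:

```python
import os
import time

def time_it(fn, n=100_000):
    """Return seconds taken to run fn() n times."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return time.perf_counter() - start

# Syscall-heavy work: every os.stat() is a kernel round trip,
# so KPTI's extra page-table switch is paid on each call.
syscall_time = time_it(lambda: os.stat("."))

# Pure user-space work: no kernel transition, so the patch
# barely affects it (a trivial LCG step as stand-in compute).
state = [0]
def compute():
    state[0] = (state[0] * 1103515245 + 12345) % 2**31
compute_time = time_it(compute)

print(f"syscall loop: {syscall_time:.3f}s, compute loop: {compute_time:.3f}s")
```

Running it before and after installing the patch would show the syscall loop slowing down while the compute loop stays roughly flat; a workload's real-world hit sits somewhere between those two extremes depending on its syscall mix.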

Yes, but that wasn't the question.

It's too soon to say. There are still Meltdown/Spectre variants that aren't fixed yet. The other commenter was correct that it probably depends on your workload, but in a couple months, after all the dust has settled and we have a comprehensive review, my guess is that yes, AMD is going to take a commanding lead.

Spectre affects every modern CPU, and its performance impact will be marginal. Meltdown affects Intel only, and the performance impact will be significant. Here are Fortnite's preliminary server results[0].

Once everything[1] is patched up, it's just going to get worse for Intel. Ryzen was already fast, and it had superior SMT (Hyperthreading) efficiency compared to Intel prior to this bug (~10%). This is going to take a serious toll on Intel's single-threaded performance advantage (~10% from IPC, 25% from clock frequency).

My advice is to stick with Ryzen/Threadripper/Epyc.



AMD stock price was higher in July and October. It increased a tiny bit and does not constitute “soaring”.

It's currently +10%, which absolutely is "soaring".

What is more interesting to me is the news that came before this: the Intel CEO selling his stock. I did not read the article, but I got a notification through one of the many news apps on my phone. For a moment, I wondered why he sold off his stock.

After the fix, will device drivers still have the user process mapped into their address space?

If not, then the fix may expose many driver bugs (where the driver accessed user space directly instead of through copy_from_user()), making the whole thing even more painful.

I don't know much about CPUs, so what prevents a fix via microcode update?

No one outside of Intel knows enough about Intel's microcode to answer that question, especially since we don't know the full details of the problem.

We don't know if anything does, but it could be that an effective microcode update (e.g. turning out-of-order execution off) would slow down the CPU more than the software fix.

The bug seems to revolve around speculative execution. That seems like a silicon thing, not a microcode thing.

It's likely they could disable speculative execution in its entirety, either in ucode or UEFI, but that would harm many compute-intensive workloads by 5-10% or more, so my guess is that somebody decided to push this page table mitigation instead, taking the hit on the syscall/hypercall side as that's perceived as "less bad overall".

The bug seems to be about the processor leaving speculatively read privileged data in some of the caches, even if execution failed [1].

If so, clearing all caches upon a failed privilege check sounds like something within the capabilities of microcode and without unreasonable performance penalties.

Unfortunately, that would not explain the complex in-kernel fix...

[1] https://plus.google.com/+KristianK%C3%B6hntopp/posts/Ep26AoA...

EDIT: what remains in cache is not "speculatively read privileged data" but more "unprivileged data whose address is correlated to speculatively read privileged data". Retrieving later such address allows one to infer what the privileged data was. Still, the point about clearing all caches as countermeasure holds...
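That inference step can be illustrated with a toy model (pure simulation — it mimics the logic of the attack, not actual cache-line timing; the names and constants here are illustrative, not from any real exploit code):

```python
# Toy model of the side channel described above: the privileged
# byte is never returned directly; it only influences WHICH
# probe-array line ends up cached, and the attacker recovers it
# by observing which access is "fast". Real attacks time actual
# cache lines with rdtsc; this just models that logic.

CACHE_LINE = 64                     # bytes per line (typical x86)
probe = [0] * (256 * CACHE_LINE)    # one line per possible byte value
cached_lines = set()                # stands in for CPU cache state

def speculative_read(secret_byte):
    # Models the transient window: the privileged byte is used as
    # an index before the privilege fault "unwinds" the access...
    _ = probe[secret_byte * CACHE_LINE]
    cached_lines.add(secret_byte)   # ...but the cache side effect survives

def measure(i):
    # Models timing the access: "cached" lines read fast.
    return 1 if i in cached_lines else 100

secret = 0x42
speculative_read(secret)
recovered = min(range(256), key=measure)  # fastest line reveals the byte
```

In this model, clearing `cached_lines` on the failed privilege check (the microcode countermeasure suggested above) would indeed leave the attacker with nothing to measure; the open question is whether the real hardware can do the equivalent cheaply.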

This piece is (IMNSHO) trash. It's trying to stare into some crystal ball guessing at the cause of market fluctuations; but there's no real evidence they're right.

Not to mention it's pretty misleading about technical stuff, saying e.g. "Chip design errors are exceedingly rare. More than 20 years ago, a college professor discovered a problem with how early versions of Intel’s Pentium chip calculated numbers."

So, they imply that chip design errors are a once-in-a-few-decades kind of thing; that's just complete nonsense. Chip design errors are common; most just aren't this bad.

Furthermore, they imply this stock change is anything more than temporary volatility; given the tiny, tiny changes so far that's just premature. Perhaps the stocks will adjust more in the future - we'll see. But as of now, this piece is financially and technically pretty misleading.

Major chip design errors, of this magnitude, are exceedingly rare. Especially in comparison to the software industry where bugs are released to production a couple orders of magnitude more frequently. How many other Intel hardware bugs can you point to that are comparable to the ones below?

"All computers with Intel chips from the past 10 years appear to be affected... The security updates may slow down older machinery by as much as 30 percent, according to the report."

"discovered a problem with how early versions of Intel’s Pentium chip calculated numbers... Intel had to recall some chips and took a charge of more than $400 million."

That's what you're saying (and indeed what I said too!), but that's not what Bloomberg published. What you're saying makes sense! But they didn't qualify this with "major", and that qualification is important to the heart of the story, since the way they phrased it suggests that chip design is largely reliable and thus that this is an exceptional error, when in fact errors are common and only the impact is exceptional. In effect, they're overstating the signal-to-noise ratio this event implies.

Worse, not only do they understate the noisiness of the chip-error "signal", they also understate the noisiness of the stock-value signal. And hey presto: if you conveniently present only the data you want, all of a sudden a correlation looks really meaningful!

Furthermore, even the qualification "major" isn't really deserved to the extent that Bloomberg implies; specifically, there have been worse chip errors much more recently than 20 years ago (e.g. AMD's somewhat similar Phenom issues). What makes this one worse isn't just the bug, it's how chips are used: cloud computing makes it much more relevant than similar bugs from not too many years ago.

Now you know how to weight the truth of mass media publications.

> How many other Intel hardware bugs can you point to that are comparable to the ones below?

Disabling a whole feature (TSX) across an entire generation of processor and early steppings of the next? That's pretty major, given it's a marketed feature.

> It's trying to stare into some crystal ball guessing at the cause of market fluctuations; but there's no real evidence they're right.

Welcome to the world of business reporting.

I think it's pretty clear in this case. Plenty of people here in hn-land were talking about Intel/AMD short/long positions starting in the afternoon yesterday and word has gotten around.

Talk is cheap; investing perhaps less so. The asserted "soaring" stock price simply hasn't happened. The 7% rise they name as "soaring" is not a soar for a stock this volatile: https://www.bloomberg.com/quote/AMD:US - just look at the graph over the past year; value changes in excess of 50% happened several times.

To be clear; I'm not saying this increase won't stick, just that bloomberg is pretending they're reading this from the numbers, not merely expecting it to occur. It'd be fine to say you might expect the stock to trend higher. But pretending you can look at that noisy line and say the current 7% increase is statistically significantly higher in a humanly relevant way is just hogwash.

Obviously you might expect the stock to soar. It may well happen. It just isn't visible in the data they present; the article is simply click bait (or worse, market manipulation).

ok. I was more thinking about the "INTC decline" than the "AMD rise". you make a good point with previous volatility.

I didn't have options activated till this morning, because I hadn't gotten around to it. Had I been able to activate them yesterday afternoon, I'd be in on INTC and AMD, both sides. Anyway, I have options in play on the Intel side but not the AMD side, so while "talk is cheap" - I'm in.

Yes, I'm sure all those Wall Street finance folks are all scouring HN comment threads just waiting for the next great piece of investment advice that they can move on.

> "So, they imply that chip design errors are once in a few decades kind of things; that's just complete nonsense. Chip design errors are common most just aren't this bad."

Many chip design errors are patched with microcode updates. There is speculation that this one cannot.

Perhaps a bit off topic but I have an (Asus) laptop with a recent Intel chip running Linux (Solus) and I have no idea how I am to deal with all these CPU bugs... Any pointers?

The patch isn't out yet. Once it's merged, your distribution will release updates you can install in the normal way.

Ah, ok, so Intel Management engine patches are also distributed in the Linux Kernel then?

Just sit and wait, it's all too fresh to make solid statements on this. Maybe aside from, think twice before buying Intel again.

Definitely. Sadly, the new AMD Ryzens weren't in laptops at the moment I needed one. To be honest, a high-end ARM laptop would also have been nice, but as it is, Intel is king for machines with >8 hrs of battery life and decent performance.

Considering the recent Intel problems, Apple is going to be even more tempted to design its own CPUs/GPUs for the Mac.

What do you think, is this realistic?

Would the perf hits introduced by the Linux/Windows patches also be paid by kernels of guest VMs? or just the hosts/hypervisors?

When pondering whether I should be grumpy about this situation or accept it as "just how it goes", I wonder if Intel would be OK with me giving them up to 30% less money after having agreed to 100%, before walking out of the Intel shop without prejudice.

I don't know if there will be a recall/class-action lawsuit/whatever. But clearly there is a difference between making a mistake in what must be one of the most complicated consumer products on the one hand, and intentionally violating the terms of an agreed-upon contract on the other.

TL;DR: Intent matters.

Good point and agreed. This will cost them, and clearly was a mistake.

Can anyone with better knowledge ELI5 this for me -- to fix this bug, do cloud providers have to patch only the hypervisor machines, or guest boxes as well? And if the fix is applied to the hypervisor, will performance degrade on guest boxes as the benchmarks suggest? Thanks!

Guest OS shouldn't be an issue, from what I understand. As long as the hypervisor memory is separated, you're good.

Yes, it will affect your VM, as it's the host CPU that's affected.

Contracts with fixed guaranteed cloud performance rates just became very valuable.

