Serious Intel CPU bugs (2016) (danluu.com)
562 points by derek on Jan 3, 2018 | 106 comments



As a former Intel employee, this aligns closely with my experience. I didn't work in validation (I actually joined as part of Altera), but velocity is an absolute buzzword and senior management's approach to complex challenges is sheer panic. Slips in schedules are not tolerated at all, so problems in validation are an existential threat: your project can easily just be canned. Also, because of the size of the company, the ways in which quality and completeness are 'achieved' are hugely bureaucratic and rarely reflect true engineering fundamentals. Intel's biggest challenge is simply that it's not 'winning big' at the moment, and rather than showing strong leadership and focus, the company just jumps from fad to fad, failing at each (VR is dead, long live automotive).


I thought that the secret of Intel's success is paranoia. Andy Grove's motto was "Only the paranoid survive" and he wrote a book with the same name.


Andy Grove, sadly, retired from Intel in 2005 and died in 2016.

I'm afraid the level of paranoia at Intel has decreased since then.


Not quite true regarding the focus - you don't build a $50B+ company without focus; it's just that their core competency isn't fashionable right now. They can't grow in the markets they're in because they either own them completely or were driven out, so they're trying different ones (pivoting, if that's applicable to mature companies).


They had a presence in the currently faster-growing market (mobile), but they decided to sell it off a while ago because it wasn't trendy and the margins were smaller.

Now they are pivoting everywhere, but their own market is the only one with sufficient margins. And the prospect is that those margins will shrink because of competition and software emulation (which they are keeping under control by patent trolling).


Intel also liquidated all its RAM fabs in the eighties, a very costly process. Less than a year after the last fab closed, RAM prices skyrocketed, letting the Japanese manufacturers grow strong.


Intel had the problem of growing their market after gaining a monopoly on 386-compatible processors in the late 80s, which AMD only caught up with some four years later with the Am386. Besides their shenanigans of rebating PC manufacturers for not using AMD processors, they did create IAL (Intel Architecture Labs), which created and gave away royalty-free technologies like PCI, AGP, USB, and PCI Express to move the industry forward so that it could grow. That was pretty innovative, at a time when getting a sound card seemed more practical than choosing a faster processor.


Quote from article "We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times… we can’t live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our competition is moving much faster than we are".

Competitive pressure can make a company's new product worse than (in this case, less stable than) its previous products, e.g. the Samsung phone explosions. I still remember the story being that Samsung wanted to release their phone ahead of the iPhone, and I would imagine the testing went through a similarly stressful time as at Intel.

Of course, not all cases of taking such risks lead to disasters - just imagine Intel rushing to release new chips ahead of the competition, and 99 out of 100 times they end up performing well. But a unique characteristic of Intel's case is that these bugs, unlike a faulty battery design, are cumulative and carry forward into future product development, which means a few small wins in catching up with your competitor can also lead to massive failures in some later major battle.

Now imagine Intel's competitors going through the exact same scenario. One possible outcome is that both Intel's and its competitors' products become less stable and more buggy over time, and until everyone's stuff seems to be broken they probably never have time to fix it.


> "we can’t live forever in the shadow of the early 90’s FDIV bug"

There is a valid point there though - if you are testing for testing's sake and not finding anything extra through the extra effort, then you are wasting time and, potentially worse, lulling yourself into a false sense of security. Testing should be done for utility, not just in response to fear - you need to test intelligently, not just test lots. Like TDD in software: good testing processes make life much easier and quality much higher, while bad testing processes can be worse than useless.

Processor bugs are always a thing and always have been - look at the list of bugs the Linux kernel scans for and works around, many of which pre-date the FDIV debacle.

What made FDIV special isn't that it was a bad bug; it was the recent change in marketing. Before then, processors were sold to manufacturers, who might tell the customer what was used; unless you were a hobbyist, you didn't much care about the specifics. But the Pentium line was the first time a processor had been marketed directly at the end user. It had started with the 486 lines a couple of years earlier, when "Intel Inside" was first a thing, but there was a huge push in that direction with the release of the first Pentium lines. Suddenly Joe Public was more aware of that detail, but was blissfully unaware that CPUs are complex beasts and generally not 100% perfect.

It didn't help that the bug was very easy to demonstrate in common applications like Excel, so Joseph & Josephine Public could see and understand the problem where they couldn't for, say, the F00F bug, and it was easy to joke about ("We are Pentium of Borg. Division is futile. You will be approximated."), which fanned the rapid spread of the news. The fact that the bug only significantly affected fairly rare combinations of operands was lost in the mass discussion about how such a bug could happen at all.
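
For the curious, the check that circulated at the time boiled down to a single division with two now-famous operands. A minimal sketch in Python (the values 4195835 and 3145727 are the widely publicized test pair; everything else is just illustration):

    # Classic FDIV sanity check using the widely circulated test values.
    # On a correct FPU the residual is ~0; a flawed Pentium famously
    # returned 256, because x / y was wrong around the 5th significant digit.
    x, y = 4195835.0, 3145727.0
    q = x / y
    print("x / y     =", q)          # ~1.333820449136241 when correct
    print("residual  =", x - q * y)  # ~0.0 when correct; 256.0 on a flawed chip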


Testing is not for finding bugs, testing is for preventing bugs. But otherwise you are right: it's hard to calibrate and further develop the testing procedure if you rarely find bugs. And while you might be wide awake with one eye, you might be blind in seven others. It's usually the things you don't expect that kill you. So, some need to be paranoid - very paranoid.


The current chip bug does look like a doozy...

See https://www.theregister.co.uk/2018/01/02/intel_cpu_design_fl... if you've not already picked up on the news.


In other words, sometimes competition is a race to the bottom. But then a bug like this tends to have a "reset" effect on everybody in the market.

I look at a statement like "Our competition is moving much faster than we are" as craven and lacking vision. At that point a wizened old Zen-master type figure should've stepped forward.

Competition isn't about imitating the competitor anyway, is it? It's about differentiation, right? Maybe not. But it's not like you can't easily market literally any reasonable decision you make. Paul Masson wineries bragged about selling "no wine before its time" and turned their lack of "velocity" into marketing cachet. (Even though they weren't even unique in that regard.) There's no theoretical reason why Intel couldn't market itself as "the accurate chipmaker," keep on validating "lavishly"(1) and let AMD rush headlong into this kind of bug.

(1) Obviously not... but unfortunately you never know it's not enough validation until it's not enough validation.


The FDIV bug was the original Pentium floating point bug, right? Could this be the biggest blow to Intel since that one?


> Competition pressure could make a company's new product worse than (in this case, less stable than) their previous products

Well, that depends on the specific attributes on which there is competitive pressure. When it's on time to market, yes, quality will suffer. When it's on quality, products will be slower/more expensive, etc. Kind of similar to the often repeated quality triangle in Software dev.


> Kind of similar to the often repeated quality triangle in Software dev

Which was disproved in practice.


Can you back this up? I googled and didn’t find anything relevant.


Yes I can.

https://www.amazon.fr/Economics-Software-Quality-Capers-Jone...

Not all good content is offered for free on the Internet.


Your terse and borderline patronizing answers aren’t helping to advance the conversation, though you do link to an important book.

Capers Jones’ work does not “disprove” the triangle, and doesn’t even mention it. If anything, he implies that quality is a more complicated subject than the simple triangle implies, in that lower quality will have non-linear negative impacts on the project cost, time, and scope.

This is the triangle: https://en.m.wikipedia.org/wiki/Project_management_triangle

The original triangle assumes that as a PM, if you don't recognize the trade-off among scope, time, and cost, quality will suffer. This is true even with Jones' data, but it is trite. The practices that help to ensure quality are outside the triangle's intent as a guideline: recognizing these trade-offs is necessary, but not sufficient, to ensure on-time, on-budget delivery with sufficient quality and scope. One could manage the trade-offs among the time/cost/scope variables and still screw up their product's quality due to poor methods and practices. Which is obvious, when you think about it.

Jones' book also has a number of flaws - it's hard to put his recommendations into practice (it's more a survey than a "how to"), and he lacks data on a number of effective newer practices that address quality, improved velocity, and requirements gathering or product/market fit. The result is that he tends towards promoting older practices that help quality but don't have much impact on whether you're building the right thing in the first place. To be fair, he admits this throughout the book, but being a data guy, he shrugs and moves on to discuss what works with the data he has. A classic case of "looking for your car keys under the street lamp". Nevertheless, it illustrates why quality pays for itself, which makes it important.


But there is no one single such triangle.

Many people have in mind this version of the "quality triangle", which is what we were discussing (not the project management triangle, which I agree is more on point).

You get that a lot if you search for "quality triangle": https://www.google.fr/search?tbm=isch&q=quality+triangle

Why is it commonly assumed that quality and cost and time are opposed to each other?

The Capers Jones books have one idea throughout, that is backed by data: for Software, quality is positively correlated with reduced time-to-market, low defects, and reduced costs.

> The practices that help to ensure quality are outside the triangle's intent as a guideline

There is a chapter about that in EoSQ, where each and every practice is measured and classified by efficiency, like TDD and code reviews. It's very interesting.

> The result is that he tends towards promoting older practices that help quality but don’t have much impact on whether you’re building the right thing in the first place.

Yes. I find he puts way too much faith in "off the shelf" software as the last bullet we may have.

tl;dr Capers Jones disproves the "quality triangle" big time; it's a product of intuitive reasoning without any basis and doesn't apply to software. Just look at Intel at this very moment.


Denverton is much more complex than a "simple" Atom (performance of a C3958 is up to about half of an i5-7500 in single-thread, twice the total multi-thread performance). Avoton is really no slouch either. It's really not surprising that the incidence of bugs is increasing on those uarchs as the complexity grows.

The Skylake/Kaby hyperthread bug has been fixed in microcode and is no longer applicable. It's perfectly safe to run HT on these processors now.

The AMD Ryzen segfault remains unmitigated at this point in time. Phoronix rushed to declare the bug fixed because they got a binned RMA replacement, but there are plenty of reports of it occurring in current-production processors to at least a moderate degree, roughly proportional to ASIC/litho quality. It's unclear what the scope is w/r/t Epyc, since Epyc is on a different stepping but also hasn't really ramped yet. The early Epyc processors were essentially engineering samples (on the order of hundreds to single-digit thousands of units) with no real (public) visibility into any binning that might be taking place.

The Ryzen high-address bug is no big deal, that's the kind of thing that gets patched all the time (like the Skylake HT bug). That's one thing Dan is glossing over here - there are tons of these bugs all the time and as long as there is an effective mitigation available it's no big deal.

The PTI patch can be viewed as making syscalls take somewhat longer (about double, IIRC). Gamers and compute-oriented workloads will hardly be hurt at all. The average mixed workload sees a 5% performance loss - not ideal, but not critical either. Losing 30% is real bad though, and that's what you will get on IO-heavy workloads that context-switch into the kernel a lot.
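
If you want a feel for that raw kernel-entry cost on your own box, a quick and dirty micro-benchmark like the sketch below will do (Linux-only and purely illustrative - it just times read() syscalls against /dev/zero, which is nothing like a real workload):

    import os
    import time

    # Quick-and-dirty probe of raw syscall round-trip cost (Linux).
    # Each iteration is one kernel entry/exit, so the per-call time roughly
    # tracks the overhead PTI adds. Not a workload benchmark: real programs
    # interleave syscalls with actual work, which is why the average hit
    # quoted above is ~5% rather than ~2x.
    N = 1_000_000
    fd = os.open("/dev/zero", os.O_RDONLY)

    start = time.perf_counter()
    for _ in range(N):
        os.read(fd, 1)              # one syscall per iteration
    elapsed = time.perf_counter() - start

    os.close(fd)
    print(f"{elapsed / N * 1e9:.0f} ns per read() syscall")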

The only real mitigation right now appears to be to give up hyperconvergence for now and harden up those DB/NAS servers that are going to be pushing a lot of IO, so that you know there won't be hostile code running on them. That will allow you to safely disable PTI and sidestep the performance hit.

Of course, Epyc was not that good at running databases in the first place, so you still might be better off sucking it up and running Intel even with the PTI patch. It will probably depend on your actual workload and the relative amount of IO vs processing.


> The Skylake/Kaby hyperthread bug has been fixed in microcode and is no longer applicable. It's perfectly safe to run HT on these processors now.

Only if you can actually get the fix. My main home PC has this bug and the motherboard manufacturer (ASUS) has yet to ship a BIOS update with the fix.


Your OS can deliver a microcode fixup that's installed at OS startup; Windows should do it automatically, and on Linux you just need to install the intel-microcode package.
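
On the Linux side, here's a rough sketch of how to confirm what you actually ended up with (it assumes the usual x86 /proc/cpuinfo layout with a "microcode" field per logical CPU):

    # Print the microcode revision(s) the running CPUs report (Linux, x86).
    # After installing intel-microcode and rebooting, this revision should
    # change if an update was applied at early boot.
    with open("/proc/cpuinfo") as f:
        revisions = {line.split(":", 1)[1].strip()
                     for line in f if line.startswith("microcode")}
    print("microcode revision(s):", ", ".join(sorted(revisions)) or "not reported")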


> Gamers and compute-oriented workloads will hardly be hurt at all.

Actually, KPTI doesn't only affect syscalls but also interrupts. It makes interrupts slower, which affects every workload.


Gamers like to use PS/2 peripherals because they're interrupt based and thus more responsive than USB peripherals.

Does this mean they could take a hit due to this bug?


Interrupts from ps/2 are pretty insignificant in comparison to all other interrupt traffic on any PC.

Edit: also, "interrupt-based" in this case has nothing to do with interrupts seen by the CPU. The difference is that with PS/2 the device can send data to the controller (which then generates an interrupt) at any time, while with USB the controller periodically polls devices for data (and possibly generates an interrupt if there is some). In the early days of USB and UHCI host controllers this polling was done in software, but since USB 2.0 it is done in hardware and generates real CPU interrupts when a USB device requests an interrupt (although with somewhat unpredictable but bounded latency).


Thanks, that's really useful information. I wasn't aware that USB 2.0 switched to hardware based polling.


Out of curiosity, if you notice a CPU bug in a computer under warranty, is there anything the vendor is usually obligated to do, or are they under no obligation to do anything about a CPU bug? Is that considered a defect they have to handle?

(Edit: I'm assuming the USA, and I'm assuming bugs that were not known to the vendor at the time of the sale.)


Under Australian law, it depends on if the defect is material, and if it would have reasonably changed your buying decision.

A 30% performance reduction (like the page table isolation fixes) probably would be considered material.


> Under Australian law, it depends on if the defect is material, and if it would have reasonably changed your buying decision.

Interesting, so if you need that particular product (say it has something specific you need, e.g. a program that only runs well on Intel) and there is no competitor with that particular feature (e.g. AMD CPUs run the program poorly), then they can sell you as defective a product as they want and you cannot recover damages?

Or to put it another way, there is no notion of "I would have still bought it because I needed it but knowledge of the defect would have lowered its market value"?


Interesting point. I don't know if this has ever been tested in court, and I am not a lawyer, especially a consumer law lawyer, but I reckon the court system would probably handle it in some way. The ACL is written to be quite equitable, and courts have generally interpreted it as such.


Apple recently had to replace iPhone batteries because their devices became sluggish. So there is precedent.


Apple didn't have to replace batteries, they chose to do it to quiet down the bad PR they were receiving.

When Intel issues a microcode update to slow down aging Skylake processors so that everyone goes out to buy Cannonlake, you might be able to draw a comparison.


Sorry, maybe I used the wrong word, "precedent". But there is a similarity, and Apple is still being sued. And calling it purely a PR move is also misleading.


Under EU law (this is not legal advice), any bug that wasn't known to you at the time of sale, but was present at the time of sale, has to be fixed by the vendor. Usually for 2 years after sale; some countries have longer (afaik Norway, for example, has it set at 5 years; as an EEA member, they also follow the EU laws on warranty).


> any bug that wasn't known to you at time of sale, but was present at time of sale, has to be fixed by the vendor

Unfortunately, this is incorrect on many levels.

First, under EU warranty laws, it's not any bug that is covered, but defects that have been assured or are expected to not be present. I'd expect disclaimers to allow for certain errata, for example. EDIT: user ta_wh posted an example for such a disclaimer in a sibling comment.

Second, the vendor is usually not the manufacturer, and therefore seldom in the position to fix the defect themselves.

Third, depending on the nature of the defect, the vendor might have other options besides fixing it/getting it fixed, eg: discount, or returns.


Good luck explaining to a random vendor what the problem is, though. They also need to verify that the problem exists, and most of them won't have a clue what you're talking about; they'll just turn the computer on and notice that it's working.

If you're a company buying from more qualified vendors then it might be a different story, however at that point consumer law does not apply to you.


Well, depending on how this turns out, there will be benchmarks that show the chips performing slower than before. I've read that many chips will get 17% slower in the best case and up to 30% slower in the worst case. That could be enough to warrant a return of the product.


Yes, however most computers are marketed using GHz numbers, not experienced performance (and where they are, it's relative, like "x times faster than before"). It's hard to show a 17% slowdown to a store clerk.

Don't get me wrong, I'm all for being able to return a product that's 17% slower today than yesterday, I'm just saying it might be difficult since it's an issue that requires a degree of technical knowledge to understand which most store clerks don't have.


I agree with most of what you say, but why do you take it for granted that it's up to the customer to demonstrate that the product is defective? Furthermore, where does the idea come from that the customer must be able to communicate this to an uninformed clerk?


> Under EU law (this is not legal advice), any bug that wasn't known to you at time of sale, but was present at time of sale, has to be fixed by the vendor.

Is this true for software as well?


The law applies to any sale between a consumer and a company, even one for free.

It does also apply to software, but software vendors will rather refund you than fix it.


> The law applies to any sale between a consumer and a company, even one for free.

There is no such thing as a free sale, and no, when you are gifted something, then you don't get the EU warranty protection.


I think it's more referring to offers like these:

* Buy One Get One Free

* Free USB-C to A hub with your new MacBook Pro

I would expect the free item to be covered in these cases?


Ahh, I was referring to the US, and bugs that weren't known at the time of sale (I'll edit this in), but thank you, this is still useful information.


See the three-year warranty here for boxed CPUs: https://www.intel.com/content/dam/support/us/en/documents/pr...

Note the first bullet under "WHAT THIS LIMITED WARRANTY DOES NOT COVER":

"design defects or errors in the Product (Errata). Contact Intel for information on characterized errata."

Guess we're not covered on this one.

EDIT: That being said, given the potential scope of this issue (years of affected CPUs, a massive PR hit), I'm hoping that Intel will at least offer some remedy to recent buyers. According to the article from The Register [1], OS vendors have been working on the fix since November. The blog post over on pythonsweetness [2] posits the bug may have been identified in October. It'd be interesting to know how long Intel has been selling Coffee Lake CPUs that were known to be vulnerable.

[1] https://www.theregister.co.uk/2018/01/02/intel_cpu_design_fl...

[2] http://pythonsweetness.tumblr.com/post/169166980422/the-myst...


Thanks for posting this!

It seems self-contradictory to me. How can Intel warrant that

> the Product will substantially conform to Intel’s publicly available specifications

while simultaneously disclaiming warranty for

> design defects or errors in the Product (Errata)

?

If an instruction does something different than what their specs say on occasion, do they take that to mean it's substantially conforming to their specs?


Easy: errata are actually specification updates. (And not just in name, since most errata are never fixed but rather declared this-is-how-it-works-now.)

In some abstract, philosophical sense it means that the specs are actually elected by majority of the produced processors.


I'm no lawyer, but my instincts tell me someone would have to prove that the chips today do not conform to Intel's specs, and that this difference is such that the CPU no longer "substantially conforms" to the spec.

> If an instruction does something different than what their specs say on occasion, do they take that to mean it's substantially conforming to their specs?

We're on the same page. What do you think Intel will argue?

Keep in mind Intel did initiate a substantial recall of Pentium CPUs in the late 90s: https://en.wikipedia.org/wiki/Pentium_FDIV_bug


I mean they would likely argue it's still substantially conforming even if it has bugs that come up, but I'm trying to figure out what kind of a case they could actually win.


Why try to win a case if all you need is a settlement? Going into an actual trial is far riskier than negotiating a settlement, for both sides.


What they're saying is that they'll replace your chip if it behaves differently from all the other chips of the same model, not if ALL the chips are broken by design.


Where does their sentence about the specification come into play in your interpretation?


"the Product will substantially conform to Intel’s publicly available specifications"

They're saying that it should work as specified for the most part. And apparently the CPUs do, since they've been in continuous use for many years.

Note that they also have this exception:

"… THIS LIMITED WARRANTY DOES NOT COVER: … that the Product will protect against all possible security threats, including intentional misconduct by third parties;"

Which is likely designed to handle issues just like this one.


So I recently bought an 8700k. I was wondering if I should rather return it and get AMD instead? Not sure how much the recent bugs will impact me performance wise.


It looks like you'll see a significant slowdown; PostgreSQL with pgbench sees a 7% to 16% slowdown.

https://www.postgresql.org/message-id/20180102222354.qikjmf7...

Of course, this depends on workload — gaming will see different results than computationally heavy tasks.

It is likely that games using Vulkan, DX12 or OpenGL's AZDO functions will see a much lower performance impact (because they usually only do a handful of syscalls per frame) than games using older APIs, or even OpenGL's immediate mode (which does one syscall per emitted vertex, in the worst case).


> or even OpenGL's immediate mode (which does one syscall per emitted vertex, in the worst case)

Perhaps with drivers written in the 90s for hardware from the 90s. Any OpenGL implementation worth its salt will buffer those requests on the client side until they need to be observed. Indeed this was a big feature in the heyday of DirectX 9 where D3D programmers had to count the drawcalls whereas with OpenGL you have way more leeway with calls since the driver tends to be smarter and caches that stuff.

In theory with a modern driver using OpenGL's immediate mode API shouldn't need any more syscalls than building the vertex buffers in your program, setting up the necessary state and issuing a buffer draw command.

The only time where you'd need a syscall per emitted vertex would be if the GPU had OpenGL-like commands and your OpenGL implementation was a thin wrapper over that. I think one of ATI's very early GPUs worked like that (although the commands were per primitive, not per vertex).


That's why I said in absolute worst case scenario. I know that in the past I actually wrote some program in high school that did use that.

Nowadays, thanks to vulkan and AZDO with glMultiDrawElementsIndirect, you're right, of course — you might even use a single syscall per frame.

That's why I said, in absolute worst case.


No, I meant explicitly the part I quoted: the immediate mode. Stuff like glBegin, glVertex3f, glEnd, etc. Those will not get you a syscall per vertex; they will be buffered by the OpenGL implementation. Modern OpenGL implementations, at least those by Nvidia and AMD (and I suspect Mesa too), do a lot of optimizations on the client side.


I am also quite annoyed because I paid premium (Intel) prices with the expectation to get premium speed. Now if I can just get the same performance cheaper with AMD, maybe I should just return the whole thing.


Which chip to get depends on your usecase.

If you absolutely need the best single core performance you can, Intel is the way forward.

If multicore performance is important (lots of multitasking, lots of heavy processes running) then one of the 8 core Ryzen 7's will be better, for cheaper.


Thanks. I guess I'll stick to my 8700k then.


Just curious, what do you do that requires a fast single core? I always find it strange when people value that, as most computers nowadays generally run more than 1 process at a time.


A lot of games only run single core. Though finally, more modern games do take advantage of more cores!


Computing things serially, obviously!


Ruby on Rails.


if I had an 8700K (I have a 7700K, btw) and my use case was gaming (that's this cpu's target market) I'd keep it. Otherwise I'd get a Ryzen.


Yes, I am mostly gaming on that PC. Thanks.


But note that those benchmarks are pretty much the worst case for PTI. Each of the queries is either near-trivial (a single pkey lookup) or utterly trivial (SELECT 1), so the send/recv syscalls to/from the clients take most of the time. If you instead have queries that do a bit more actual work, this would look very different.


The very article mentions at the bottom that AMD has had its fair share of nasty bugs recently, too. If anything, I would expect AMD to spend even less effort on validation (because they are not flush with cash).


Or, because they're not flush with cash, they go for a simpler to verify design.

Remember, a lot of the Zen arch was developed by Jim Keller, who is the brains behind the Athlon 64.


I loved my Athlon 64. IIRC, it ran hot as hell, but the price/performance was amazing.


I remember running two of the AthlonXPs on a Tyan SMP board back in the day. That was before we had multiple cores on a single processor.


What bugs?


The article spends 4 paragraphs on them. Just one example:

    Although AMD’s response in the forum was that these were isolated issues, phoronix was able to
    reproduce crashes by running a stress test that consists of compiling a number of open source
    programs. They report they were able to get 53 segfaults with one hour of attempted compilation.


This is only true with Ryzen CPUs that were manufactured before week 25. ThreadRipper wasn't affected, nor was EPYC.


8700k is an "end of the line" CPU in terms of motherboard support. Also AMD Ryzen CPUs give you much better bang for buck.


I can't tell if end of the line is a good or a bad thing here?


Most likely bad. With Intel you tend to need a new motherboard anytime you upgrade your processor since they change sockets frequently. This also makes it harder to find replacement motherboards for old processors.

AMD has committed to the AM4 socket they're using until at least 2020, so if you buy a Ryzen processor and motherboard today, you should be able to use that motherboard with Ryzen class processors until 2020.


Wow, that's... kind of the opposite of a few years ago. For Intel chips, the socket remained consistent from the Pentium D all the way up to the Core 2 Duo. IIRC they had an off-chip memory controller, so the chips weren't tied directly to a motherboard's RAM type. AMD, on the other hand, had integrated memory controllers, so each new chip required a totally different board and the processors themselves were tied to the RAM type (DDR2 vs DDR3, etc.).

Intel switched to the AMD model starting with the i3/i5/i7 series, moving to an integrated memory controller themselves.


Ah okay. Thanks. I have been an Intel guy for all the years now, so I was used to getting new motherboards.

Good to know AMD is doing that! Getting new motherboards is annoying.


It annoys me when online content (or an edit to it) doesn't have a date explicitly stated.


Is this article from 2015 or 2017?

It would be great if the page displayed the date that the article was posted/updated. It is not in the URL nor the sources. The only way to see the dates is in the RSS feed and even that is only for new articles.


Jan 2016, according to https://danluu.com/ and ctrl+f


Last update is aug 2017 or later


Last update is _today_. It links to a comment on this thread. I think (2016) should be removed.


As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.

Why?

Let me set the scene: It’s late in 2013. Intel is frantic about losing the mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho in charge of Validation says something to the effect of: “We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times… we can’t live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our competition is moving much faster than we are” - I’m paraphrasing.

Many of the engineers in the room could remember the FDIV bug and the ensuing problems caused for Intel 20 years prior. Many of us were aghast that someone highly placed would suggest we needed to cut corners in validation - that wasn’t explicitly said, of course, but that was the implicit message. That meeting there in late 2013 signaled a sea change at Intel to many of us who were there. And it didn’t seem like it was going to be a good kind of sea change. Some of us chose to get out while the getting was good.

So this is why Krzanich sold his stock. He knows the bug is his fault. Whoops. I think someone may "quit for personal reasons" soon.

https://www.fool.com/investing/2017/12/19/intels-ceo-just-so...


I can only expect that the future will be worse. It may be that VM providers will find it unprofitable to offer a VM capable of running generic native code. Another thing is that security products for desktops like Qubes OS, which rely on hardware isolation to run untrusted code, may need to reconsider their business model.


What’s the connection between the performance hit of this bug and Qubes’ business model?


This bug can be worked around. But the next one may not be, making hardware-based virtualization unworkable as a secure way to run unprivileged code with max native performance. I.e. longer term if one wants to run untrusted code, it cannot be native one so any bug can be fixed without replacing hardware.


>if one wants to run untrusted code, it cannot be native one so any bug can be fixed without replacing hardware

It's a bit hard to parse that sentence, could you rephrase?

Are you saying that untrusted code should only be run on systems which do not use hardware virtualization, because there's a risk of hardware bugs that require hardware replacement? The problem is that there is no single-system equivalent, users would have to use multiple laptops/desktops and air gaps to achieve separation (e.g. between network drivers and userspace apps). May not be practical.

Yes there's a risk of a catastrophic hardware bug with no workaround, but that risk applies to every feature in the CPU, not only virtualization or page tables or speculative execution. Statistically it's only happened once with the single Intel CPU recall, which are better odds than other risks.


My point is that to run untrusted code, it should be delivered in some form of bytecode, not native code for the CPU. This way one can always work around CPU issues by changing the compiler or the interpreter, even for catastrophic bugs in any part of the CPU. Moreover, as a hardware VM can execute many more instructions than unprivileged user processes, the probability that something unfixable will happen to it is higher than for ordinary processes.

As for statistics, there are strong indications that modern efforts at CPU verification do not keep up with increasing CPU complexity. So the number and severity of bugs will grow.


I suggest reading this discussion about reinventing the AS/400: https://news.ycombinator.com/item?id=16053518


Which bytecode do you recommend? Are you assuming the bytecode interpreter to be bug-free? There have been JVM escapes.


I do not assume that the bytecode interpreter or compiler is bug-free. But I assume that the interpreter can be trivially updated, while a bad CPU bug may require hardware replacement or taking a terrible performance hit.

As for a particular bytecode format, I have no idea. WebAssembly is a possibility, but it is still slower by a factor of 2 compared with native code. Perhaps a CPU-specific symbolic assembler would be a better choice, as long as one can reliably alter it to work around CPU bugs.


I have seen a lot of talk of AMD benefitting from this but what about ARM - how are their server offerings shaping up?


ARM is interesting because of the customisability of ARM chips. Some companies, such as Apple, license the ISA and spin their own silicon. Others license the whole CPU design from ARM holdings, and ARM produce several lines of designs with different CPU capabilities.

Edit: Looks like ARM64 was affected, but it has an architectural feature that makes the mitigation much easier: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-N...


Can we put [2016] in the title? Thanks!


At least part of the article (the 'More updates' section) had to have been written in 2017 because it references Ryzen chips which were released in March 2017. It's not entirely clear when the original article was posted but it seems like it could have been either late 2015 or early 2016.


Last update is _today_. It links to a comment on this thread.


All the CPUs in my house are 4th and 5th generation Intel CPUs except for one PC laptop that has a Skylake processor.

I guess I'm glad now that Apple put a 2 year old CPU in the early 2015 Macbook Pro! Besides my 2012 Mac Pro, that is the most expensive machine in the house!


The latest big bug evidently affects all Intel x64 processors back to 2007 or so.


I thought it was 1995?


I wonder if this is related to the recently discovered design flaw in Intel CPUs:

https://news.ycombinator.com/item?id=16055395


Considering Intel's latest problems, Apple is going to be even more tempted to design its own CPUs/GPUs for the Mac.

What do you think, is this realistic?


I'm not an insider, but it seems like Apple's motto is "long-term profit above almost anything else." As a result they are certainly considering designing their own x86-64 or ARM CPUs in hopes of reducing costs down the line by vertically integrating their business. What may stop them is the fact that their PC sales don't enjoy nearly as much volume as their mobile devices.


Before we all jump on Intel for being buggy, what's the list of serious AMD bugs like for the past five years? If AMD has a similar number of bugs, we should jump on them too. If not, then we will actually know that Intel deserves to be jumped on to the exclusion of AMD.





