I'm afraid the level of paranoia at Intel has decreased since then.
Now they are pivoting in every direction, but theirs is the only market with sufficient margins. And the prospect is that those margins will shrink because of competition and software emulation (which they are keeping in check by patent trolling).
Competitive pressure can make a company's new product worse than (in this case, less stable than) its previous products, e.g. the Samsung phone battery fires. As I recall, the story was that Samsung wanted to release its phone ahead of the iPhone, and I imagine its testing went through a similarly stressful crunch as Intel's.
Of course, not all such risk-taking leads to disaster - just imagine Intel rushing new chips out ahead of the competition, and 99 times out of 100 they end up performing well. But what is unique in Intel's case is that these bugs, unlike a faulty battery design, are cumulative and carry forward into future product development, which means a few small wins in catching up with your competitor can still lead to massive failure in some later major battle.
Now imagine Intel's competitors going through the exact same scenario. One possible outcome is that both Intel's and its competitors' products become less stable and more buggy over time, and until everyone's stuff seems broken, nobody ever has time to fix it.
There is a valid point there though - if you are testing for testing's sake and not finding anything extra through the extra effort, then you are wasting time and, potentially worse, lulling yourself into a false sense of security. Testing should be done for utility, not just in response to fear - you need to test intelligently, not just test a lot. Like TDD in software: good testing processes make life much easier and quality much higher; bad testing processes can be worse than useless.
Processor bugs are always a thing and always have been - look at the list of bugs the Linux kernel scans for and works around, many of which pre-date the FDIV debacle.
What made FDIV special isn't that it was a bad bug; it was the recent change in marketing. Before then, processors were sold to manufacturers, who might tell the customer what was used; unless you were a hobbyist, you didn't much care about the specifics. But the Pentium line was the first time a processor had been marketed directly at the end user. It had started with the 486 lines a couple of years earlier, when "Intel Inside" was first a thing, but there was a huge push in that direction with the release of the first Pentium lines. Suddenly Joe Public was more aware of that detail, but was blissfully unaware that CPUs are complex beasts and generally not 100% perfect.
It didn't help that the bug was very easy to demonstrate in common applications like Excel, so Joseph & Josephine Public could see and understand the problem in a way they couldn't with, for example, the F00F bug, and it was easy to joke about ("We are Pentium of Borg. Division is futile. You will be approximated."), which fanned the rapid spread of the news. The fact that the bug only significantly affected fairly rare operand combinations was lost in the mass discussion about how such a bug could happen at all.
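The demonstration really was that simple: the most widely circulated trigger was the division 4195835 / 3145727, which anyone could type into a spreadsheet cell. A quick sketch of the comparison (the flawed Pentium result below is the commonly cited published value, not something this code computes):

```python
# Classic FDIV demonstration: 4195835 / 3145727.
# A correct FPU returns ~1.333820449136241;
# a flawed Pentium famously returned ~1.333739068902037589
# (widely cited value, hard-coded here for comparison).

numerator = 4195835
denominator = 3145727

correct = numerator / denominator
pentium_fdiv = 1.333739068902037589  # documented flawed result

# The error appears in the 5th decimal place -- visible at a glance
# in any spreadsheet, which is why the bug spread so fast.
error = abs(correct - pentium_fdiv)
print(f"correct     = {correct:.15f}")
print(f"flawed FDIV = {pentium_fdiv:.15f}")
print(f"abs error   = {error:.2e}")
```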
See https://www.theregister.co.uk/2018/01/02/intel_cpu_design_fl... if you've not already picked up on the news.
I look at a statement like "Our competition is moving much faster than we are" as craven and lacking vision. At that point a wizened old Zen master type figure should've stepped forward.
Competition isn't about imitating the competitor anyway, is it? It's about differentiation, right? Maybe not. But it's not like you can't easily market literally any reasonable decision you make. Paul Masson wineries bragged about selling "no wine before its time" and turned their lack of "velocity" into marketing cachet. (Even though they weren't even unique in that regard.) There's no theoretical reason why Intel couldn't market itself as "the accurate chipmaker," keep on validating "lavishly"(1) and let AMD rush headlong into this kind of bug.
(1) Obviously not... but unfortunately you never know it's not enough validation until it's not enough validation.
Well, that depends on the specific attributes on which there is competitive pressure. When it's on time to market, yes, quality will suffer. When it's on quality, products will be slower/more expensive, etc. Kind of similar to the often-repeated quality triangle in software dev.
Which was disproved in practice.
Not all good content is offered for free on the Internet.
Capers Jones’ work does not “disprove” the triangle, and doesn’t even mention it. If anything, he implies that quality is a more complicated subject than the simple triangle implies, in that lower quality will have non-linear negative impacts on the project cost, time, and scope.
This is the triangle:
The original triangle assumes that as a PM, if you don't recognize the trade-off among scope, time, and cost, quality will suffer. This is true even with Jones' data, but it is trite. The practices that help to ensure quality are outside the triangle's intent as a guideline: recognizing these trade-offs is necessary, but not sufficient, to ensure on-time, on-budget delivery with sufficient quality and scope. One could manage the trade-offs among the time/cost/scope variables and still screw up a product's quality through poor methods and practices. Which is obvious, when you think about it.
Jones’ book also has a number of flaws - it’s hard to put his recommendations into practice (it’s more a survey than a “how to”), and he lacks data on a number of effective newer practices that address quality, improved velocity, and requirements gathering or product/market fit. The result is that he tends towards promoting older practices that help quality but don’t have much impact on whether you’re building the right thing in the first place. To be fair, he admits this throughout the book, but being a data guy, he shrugs and moves on to discuss what works with the data he has. A classic case of “looking for your car keys under the street lamp”. Nevertheless, it illustrates why quality pays for itself, which makes it important.
Many people have in mind this version of the "quality triangle" which we were discussing (not a Project Management triangle which I agree is more on point).
You get that a lot if you search for "quality triangle": https://www.google.fr/search?tbm=isch&q=quality+triangle
Why is it commonly assumed that quality and cost and time are opposed to each other?
The Capers Jones books have one idea throughout, backed by data: for software, quality is positively correlated with reduced time-to-market, low defect rates, and reduced costs.
> The practices that help to ensure quality that are outside the triangle’s intent as a guideline
There is a chapter about that in EoSQ, where each and every practice are measured and classified by efficiency, like TDD and code reviews. It's very interesting.
> The result is that he tends towards promoting older practices that help quality but don’t have much impact on whether you’re building the right thing in the first place.
Yes. I find he puts way too much faith in "off the shelf" software as the last bullet we may have.
tl;dr Capers Jones disproves the "quality triangle" big time: it is a product of intuitive reasoning without any basis, and it doesn't apply to software. Just look at Intel at this very moment.
The Skylake/Kaby hyperthread bug has been fixed in microcode and is no longer applicable. It's perfectly safe to run HT on these processors now.
The AMD Ryzen segfault remains unmitigated at this point in time. Phoronix rushed to declare the bug fixed because they got a binned RMA replacement but there are plenty of reports of it occurring in current-production processors to at least a moderate degree, roughly proportionate with ASIC/litho quality. It's unclear what the scope is w/r/t Epyc since Epyc is on a different stepping but also hasn't really ramped yet either. The early Epyc processors were essentially engineering samples (on the order of hundreds to single-digit thousands of samples) with no real (public) visibility into any binning that might be taking place.
The Ryzen high-address bug is no big deal, that's the kind of thing that gets patched all the time (like the Skylake HT bug). That's one thing Dan is glossing over here - there are tons of these bugs all the time and as long as there is an effective mitigation available it's no big deal.
The PTI patch can be viewed as making syscalls take somewhat longer (about double, IIRC). Gamers and compute-oriented workloads will hardly be hurt at all. The average mixed workload sees a 5% performance loss - not ideal, but not critical either. Losing 30% is real bad though, and that's what you will get on IO-heavy workloads that context-switch into the kernel a lot.
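A rough way to see why the impact splits along these lines is to time a syscall-bound loop against a pure-compute loop. This is only a sketch (the exact per-call cost and the size of the PTI penalty vary by kernel, microcode, and hardware); it just shows which kind of loop pays the kernel entry/exit tax:

```python
import os
import time

N = 100_000

# Syscall-bound loop: each os.fstat() crosses into the kernel,
# paying the (PTI-inflated) entry/exit cost every iteration.
fd = os.open("/dev/null", os.O_RDONLY)
start = time.perf_counter()
for _ in range(N):
    os.fstat(fd)
syscall_time = time.perf_counter() - start
os.close(fd)

# Compute-bound loop: stays entirely in user space, so PTI
# barely touches it -- this is the "gamer/compute" case.
start = time.perf_counter()
acc = 0
for i in range(N):
    acc += i * i
compute_time = time.perf_counter() - start

print(f"syscall loop: {syscall_time * 1e9 / N:.0f} ns/call")
print(f"compute loop: {compute_time * 1e9 / N:.0f} ns/iter")
```

On a PTI-patched kernel the first number roughly doubles while the second barely moves, which is the whole story behind the "IO-heavy workloads lose 30%" figure.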
The only real mitigation right now appears to be to give up hyperconvergence and harden up those DB/NAS servers that are going to be pushing a lot of IO, so that you know there won't be hostile code running on them. That will allow you to safely disable PTI and sidestep the performance hit.
Of course, Epyc was not that good at running databases in the first place, so you still might be better off sucking it up and running Intel even with the PTI patch. It will probably depend on your actual workload and the relative amount of IO vs processing.
Only if you can actually get the fix. My main home PC has this bug and the motherboard manufacturer (ASUS) has yet to ship a BIOS update with the fix.
Actually, KPTI doesn't only affect syscalls but also interrupts. It makes interrupts slower, which affects every workload.
Does this mean they could take a hit due to this bug?
Edit: also, "interrupt-based" in this case has nothing to do with interrupts as seen by the CPU. The difference is that with PS/2 the device can send data to the controller (which then generates an interrupt) at any time, while with USB the controller periodically polls devices for data (and possibly generates an interrupt if there is some). In the early days of USB and UHCI host controllers this polling was done in software, but since USB 2.0 it is done in hardware and generates real CPU interrupts when a USB device requests an interrupt (although with somewhat unpredictable but bounded latency).
(Edit: I'm assuming the USA, and I'm assuming bugs that were not known to the vendor at the time of the sale.)
A 30% performance reduction (like the page table isolation fixes) probably would be considered material.
Interesting. So if you need that particular product (say it has something specific you need, e.g. a program that only runs well on Intel) and there is no competitor with that particular feature (e.g. an AMD CPU runs the program poorly), then they can sell you as defective a product as they want and you cannot recover damages?
Or to put it another way, there is no notion of "I would have still bought it because I needed it but knowledge of the defect would have lowered its market value"?
When Intel issues a microcode update to slow down aging Skylake processors so that everyone goes out to buy Cannonlake, you might be able to draw a comparison.
Unfortunately, this is incorrect on many levels.
First, under EU warranty laws, it's not any bug that is covered, but defects that have been assured or are expected to not be present. I'd expect disclaimers to allow for certain errata, for example. EDIT: user ta_wh posted an example for such a disclaimer in a sibling comment.
Second, the vendor is usually not the manufacturer, and therefore seldom in the position to fix the defect themselves.
Third, depending on the nature of the defect, the vendor might have other options besides fixing it/getting it fixed, eg: discount, or returns.
If you're a company buying from more qualified vendors then it might be a different story, however at that point consumer law does not apply to you.
Don't get me wrong, I'm all for being able to return a product that's 17% slower today than yesterday, I'm just saying it might be difficult since it's an issue that requires a degree of technical knowledge to understand which most store clerks don't have.
Is this true for software as well?
It does also apply to software, but software vendors would rather refund you than fix it.
There is no such thing as a free sale, and no, when you are gifted something, then you don't get the EU warranty protection.
* Buy One Get One Free
* Free USB-C to A hub with your new Macbook Pro
I would expect the free item to be covered in these cases?
Note the first bullet under "WHAT THIS LIMITED WARRANTY DOES NOT COVER":
"design defects or errors in the Product (Errata). Contact Intel for information on characterized errata."
Guess we're not covered on this one.
EDIT: That being said, given the potential scope of this issue (years of affected CPUs, massive PR hit) I'm hoping that Intel will at least offer some remedy to recent buyers. According to the article from The Register, OS vendors have been working on the fix since November. The blog posted over on pythonsweetness posits the bug may have been identified in October. It'd be interesting to know for how long Intel has been selling Coffee Lake CPUs that are known to be vulnerable.
It seems self-contradictory to me. How can Intel warrant that
> the Product will substantially conform to Intel’s publicly available specifications
while simultaneously disclaiming warranty for
> design defects or errors in the Product (Errata)
If an instruction does something different than what their specs say on occasion, do they take that to mean it's substantially conforming to their specs?
In some abstract, philosophical sense, it means that the specs are actually elected by the majority of the produced processors.
> If an instruction does something different than what their specs say on occasion, do they take that to mean it's substantially conforming to their specs?
We're on the same page. What do you think Intel will argue?
Keep in mind Intel did initiate a substantial recall of Pentium CPUs in the late 90s: https://en.wikipedia.org/wiki/Pentium_FDIV_bug
They're saying that it should work as specified for the most part. And apparently the CPUs do, since they've been in continuous use for many years.
Note that they also have this exception:
"… THIS LIMITED WARRANTY DOES NOT COVER: … that the Product will protect against all possible security threats, including intentional misconduct by third parties;"
Which is likely designed to handle issues just like this one.
Of course, this depends on workload — gaming will see different results than computationally heavy tasks.
It is likely that games using Vulkan, DX12 or OpenGL's AZDO functions will see a much lower performance impact (because they usually only do a handful of syscalls per frame) than games using older APIs, or even OpenGL's immediate mode (which, in the worst case, does one syscall per emitted vertex).
Perhaps with drivers written in the 90s for hardware from the 90s. Any OpenGL implementation worth its salt will buffer those requests on the client side until they need to be observed. Indeed, this was a big feature in the heyday of DirectX 9, where D3D programmers had to count their draw calls, whereas with OpenGL you had way more leeway since the driver tends to be smarter and caches that stuff.
In theory with a modern driver using OpenGL's immediate mode API shouldn't need any more syscalls than building the vertex buffers in your program, setting up the necessary state and issuing a buffer draw command.
The only time where you'd need a syscall per emitted vertex would be if the GPU had OpenGL-like commands and your OpenGL implementation was a thin wrapper over that. I think one of ATI's very early GPUs worked like that (although the commands were per primitive, not per vertex).
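The client-side buffering argument can be sketched abstractly (this is a toy model, not a real driver; all names here are hypothetical): per-vertex calls are cheap library calls into a client buffer, and only the flush represents a kernel submission, so the syscall count is decoupled from the vertex count:

```python
class BufferedDriver:
    """Toy model of client-side batching in a GL-style driver:
    per-vertex calls only touch a user-space buffer; flush() stands
    in for the one kernel submission per batch."""

    def __init__(self):
        self.vertices = []
        self.kernel_submissions = 0  # stand-in for syscalls

    def vertex3f(self, x, y, z):
        # Immediate-mode-style call: appends to the client buffer,
        # no kernel transition happens here.
        self.vertices.append((x, y, z))

    def flush(self):
        # One submission for the whole batch, regardless of how
        # many vertices were emitted since the last flush.
        self.kernel_submissions += 1
        self.vertices.clear()


drv = BufferedDriver()
for i in range(10_000):
    drv.vertex3f(i, i, i)
drv.flush()
print(drv.kernel_submissions)  # 1 submission for 10,000 vertices
```

With batching like this, PTI's per-syscall penalty is paid once per flush rather than once per vertex, which is why a thin per-vertex wrapper would be the pathological case.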
Nowadays, thanks to vulkan and AZDO with glMultiDrawElementsIndirect, you're right, of course — you might even use a single syscall per frame.
That's why I said "in the absolute worst case".
If you absolutely need the best single core performance you can, Intel is the way forward.
If multicore performance is important (lots of multitasking, lots of heavy processes running) then one of the 8 core Ryzen 7's will be better, for cheaper.
Remember, a lot of the Zen arch was developed by Jim Keller, who is the brains behind the Athlon 64.
Although AMD’s response in the forum was that these were isolated issues, Phoronix was able to reproduce crashes by running a stress test that consists of compiling a number of open source programs. They report they were able to get 53 segfaults in one hour of attempted compilation.
AMD has committed to the AM4 socket they're using until at least 2020, so if you buy a Ryzen processor and motherboard today, you should be able to use that motherboard with Ryzen class processors until 2020.
Intel switched to the AMD model starting with the i3/i5/i7 series, moving to an integrated memory controller themselves.
Good to know AMD is doing that! Getting new motherboards is annoying.
It would be great if the page displayed the date that the article was posted/updated. It is not in the URL nor the sources. The only way to see the dates is in the RSS feed and even that is only for new articles.
Let me set the scene: It’s late in 2013. Intel is frantic about losing the mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho in charge of Validation says something to the effect of: “We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times… we can’t live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our competition is moving much faster than we are” - I’m paraphrasing.
Many of the engineers in the room could remember the FDIV bug and the ensuing problems caused for Intel 20 years prior. Many of us were aghast that someone highly placed would suggest we needed to cut corners in validation - that wasn’t explicitly said, of course, but that was the implicit message. That meeting there in late 2013 signaled a sea change at Intel to many of us who were there. And it didn’t seem like it was going to be a good kind of sea change. Some of us chose to get out while the getting was good. As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.
So this is why Krzanich sold his stock. He knows the bug is his fault. Whoops. I think someone may "quit for personal reasons" soon.
It's a bit hard to parse that sentence, could you rephrase?
Are you saying that untrusted code should only be run on systems which do not use hardware virtualization, because there's a risk of hardware bugs that require hardware replacement? The problem is that there is no single-system equivalent, users would have to use multiple laptops/desktops and air gaps to achieve separation (e.g. between network drivers and userspace apps). May not be practical.
Yes, there's a risk of a catastrophic hardware bug with no workaround, but that risk applies to every feature in the CPU, not only virtualization or page tables or speculative execution. Statistically it's only happened once, with the single Intel CPU recall, which is better odds than other risks.
As for statistics, there are strong indications that modern efforts for CPU verification do not keep up with increasing CPU complexity. So number and severity of bugs will grow.
As for a particular bytecode format, I have no idea. WebAssembly is a possibility, but it is still slower by a factor of 2 compared with native code. Perhaps a CPU-specific symbolic assembler would be a better choice, as long as one can reliably alter it to work around CPU bugs.
Looks like ARM64 was affected, but it has an architectural feature that makes the mitigation much easier: http://lists.infradead.org/pipermail/linux-arm-kernel/2017-N...
I guess I'm glad now that Apple put a 2 year old CPU in the early 2015 Macbook Pro! Besides my 2012 Mac Pro, that is the most expensive machine in the house!
What do you think, is this realistic?