Intel CPU Bugs of 2015 and Implications for the Future (danluu.com)
204 points by tolien 473 days ago | 38 comments



Awesome article! From footnote 1:

In the time that it takes a sophisticated attacker to find a hole in Azure that will cause an hour of disruption across 1% of VMs, that same attacker could probably completely take down ten unicorns for a much longer period of time. And yet, these attackers are hyper focused on the most hardened targets. Why is that?

I can't figure it out, any fellow posters have ideas?

Well funded unicorns with iffy infrastructure seem like good candidates to pay out DDoS ransoms, while Azure, AWS, etc will never pay out a ransom. If attackers want credit cards and email addresses, softer targets seem like a better choice as well. It doesn't seem like an attacker can extract any value from hardened targets unless they have state-sponsored or corporate espionage level resources and skill.


Some people just really like hard problems.


Might the attacker have the mindset that "Azure's business depends on its reputation as a reliable and safe host, so they are more likely to give in to my demands"?

The unicorns, while they still have much to lose by having their service denied, don't necessarily live and breathe on the image of stability and uptime.


Whenever I've glanced through the errata for processors, it makes me think it's a miracle that the CPUs run as well as they do...

https://www-ssl.intel.com/content/dam/www/public/us/en/docum...


Take a look at the challenges at 28nm:

http://electronicdesign.com/digital-ics/understanding-28-nm-...

You're right that it's damn-near a miracle that they work. The miracle is the investment of literally billions in tooling and projects to figure out how to get it to work. Then, it's still so expensive that cutting-edge nodes at 28nm and below where Intel often works are exclusive to only the elite engineers and tools if it's anything of significant size.


I dunno, Adapteva taped out at 28nm with 5 people. Do you mean that full custom design is particularly hard at 28nm or that it's hard for EDA vendors? 28nm is also a bulk node, so I think it ought to have been easier for everyone to understand relative to FinFET nodes.


I've given them props before as they're a special case. They wrote a whole blog article on how they kept things cheap. In comments in past 48 hours, they mentioned something like $2-3 million in 3 years to get their relatively simple design done. I also remember they deal with HPC crowd that pays good money for even low-volume.

So, they're just unusual across the board. Doesn't tell us much about a large design except to say physics will fight us every step of the way. Various approaches like asynchronous and self-correcting circuits might help us a bit. Look at Achronix FPGA's to see the power of the former.


Uh, tape out doesn't mean you've verified anything properly.

Looking at the various Intel bugs found, I can't help sniggering a little. Verification mistakes happen all the time though. It's just a matter of when the bugs are found: if found later, it has much greater impact.

On the 28nm issue, getting your design to synthesize to 28 nm depends on the architecture and your constraints. Trickier data paths make it harder to get it meeting timing. Where your IOs are placed, where your RAMs are placed and how much you have, each thing can make it harder and harder to synthesize. Adapteva doesn't have much RAM I believe.


Are you joking? Intel uses (even invented) so much verification tech it's crazy. I'm talking more than any small vendor could hope to use as it would slow them down or take too much brains/training. Any bugs they're having are more likely due to size of their project or how custom the optimizations are. Here's some Intel techniques they use for verification:

https://www7.in.tum.de/um/25/pdf/LimorFix-2.pdf

https://www.cl.cam.ac.uk/~jrh13/slides/nasa-14apr10/slides.p...

IBM's papers on their verification system for POWER processors mention all kinds of optimizations for things like pipelines that make the logic ridiculous. Then they jump through hoops in verification. Yet, they don't hit 3+GHz on good pipeline without that. Undoubtedly, Intel's using similar tricks with similar issues.

Regular ASIC verification doesn't cut it at their level. What they're doing is on another level. It's hard to say what exactly we should expect in terms of errata level given their operating constraints (esp marketing). Only thing I expect is to know clearly the circumstances errata appears so I can avoid it. They let me down...


"Are you joking? Intel uses (even invented) so much verification tech it's crazy."

That's kinda why I am sniggering. Intel has a rep for letting bugs go to silicon despite all that stuff. In terms of verification, they dropped the ball probably (unless they found these bugs in verification and decided to tape out anyway).

One area where they fell on their face seems to be AVX. From TFA, "Certain Combinations of AVX Instructions May Cause Unpredictable System Behavior". That's a huge bug. Remember how we discussed the torture testing of floating point on RISC-V? Kinda similar issue. A customer wouldn't be happy with a huge bug there.

The big 3 of EDA tools (Cadence, Synopsys, Mentor Graphics) provide their own tools for verifying this stuff called UVM. Like anything, it still relies on the person using the tool. It takes a lot of effort and planning to use this stuff.

Whenever the verification engineer has to create Verification IP to test the IP, there's a chance they create bugs of their own. It's like a golden rule.

That's why I am not a fan of formal methods. Nothing is proven until you have it working in silicon.


"That's why I am not a fan of formal methods. Nothing is proven until you have it working in silicon."

The ones that did use formal methods all the way did what they were supposed to, and usually on the first pass. They were a mix of academic and defense-related stuff most people can't buy. What I normally see when I look up formal verification in industry is equivalence checking, with custom shops also doing protocol verification and certain correctness angles. We've seen lots of what Intel does in their docs. So, that narrows the question down to "Why are these errata in there anyway?"

"The big 3 of EDA tools (Cadence, Synopsis, Mentor Graphics) provide their own tools for verifying this stuff called UVM. Like anything, it still relies on the person using the tool. It takes a lot of effort and planning to use this stuff."

Didn't know about that one. Thanks. The briefs I just Googled sound weaker than Intel's stuff and especially IBM's where presentations cover an incredible amount of specific verifications. Like you said, what one puts in determines what one gets out of it. So, are Intel just being lax on verification or is their stuff just too complex + optimized to catch all the corner cases?

"Intel has a rep for letting bugs go to silicon despite all that stuff. "

If it's intentional and avoidable, then I think it might be wise in another light. (Or not, but worth considering.) The other light is the Lipner essay on why shipping is more important than highest quality:

https://blogs.microsoft.com/cybertrust/2007/08/23/the-ethics...

That comes from a background where he and Karger did high assurance systems that aimed for perfection and got as close as they could in that period. They kept slipping behind the competition in terms of features/speed/price, and that would affect market share. So, his prior employer canceled that product, with his next one following his recommendation to hit acceptable quality levels, ship, and continuously improve the product. Wonder if Intel is doing that to keep market dominance?

"Nothing is proven until you have it working in silicon."

This we agree on. The formally verified stuff usually works first try but that's thanks to billions in R&D in tooling & fabs they used. I like knowing a batch of chips performed exactly according to spec when probed during operation. Funny I can't remember what you HW people call that activity.

Anyway, I'd love to email or chat with you sometime to see an insider's view on this topic and fill in some blanks. Reason being experienced ASIC people talk very little vs software people. I'm collecting what tidbits of reality I can for a variety of reasons. Two important ones are giving a head start to people aiming for HW design and boosting high assurance design by determining where the weak points are currently. Really busy right now but maybe later on, eh?


"So, are Intel just being lax on verification or is their stuff just too complex + optimized to catch all the corner cases?"

Lax! Of course it's complicated but verification is about finding the corner cases. Intel is driving all these extensions to the ISA. They have a pretty captive CPU market so they are slack. Qualcomm was the same with Wifi SoCs when I was there. Freescale was better but that may be because of the particular projects.

"I like knowing a batch of chips performed exactly according to spec when probed during operation. Funny I can't remember what you HW people call that activity."

ATE (automatic test equipment)?

The thing is that Intel does have the toughest job. They are 28nm with a complicated design, lots of RAM, power is a big issue so clock gating probably everywhere etc. You can't really compare that with a military or an academic chip. The design constraints are much tougher for Intel.

Still, Intel supposedly has all the geniuses and the money. They should have no excuses.

On the formal stuff, I have yet to be convinced. I never just trust the tools, remember?

My email is now in my user profile if you want to discuss further.


"ATE (automatic test equipment)?"

Yeah. It's one of the only processes I have little data on. Must be straight-forward if I haven't stumbled on many academic papers on the subject. If not, there's a siloing effect happening on publishing side & the term will be helpful.

"The thing is that Intel does have the toughest job. They are 28nm with a complicated design, lots of RAM, power is a big issue so clock gating probably everywhere etc. You can't really compare that with a military or an academic chip. The design constraints are much tougher for Intel."

That's part of my point. Hitting perfection took making the problem a lot simpler than what Intel faced. Same happened in high assurance security where everything TCB was verified down to every trace and state. Took lots of geniuses...

"Still, Intel supposedly has all the geniuses and the money. They should have no excuses."

...but still couldn't solve all the problems, keep up in feature parity, meet profit requirements, etc. So, I'm not so harsh on Intel for now given the complexity & business model. I might change my mind later. For now, we'll just disagree. :)

"On the formal stuff, I have yet to be convinced. I never just trust the tools, remember?"

Now, that I don't get. I've seen, in synthesis/verification work, one 9-transistor analog circuit take (IIRC) 55,000+ equations to represent all its behaviors. Digital ones easier but with tons of multi-layer cells wired up. For custom, they often behave differently. DRC's on modern nodes I read are in 1,000-2,500 range. I'm ignoring OPC because you're handicapped enough at this point. If you don't trust the tools, how are you getting anything done in ASIC land?

You must write really fast plus have a discount card at Office Depot to do it all on pencil and paper. :P

I think you trust tools more than you're letting on. You probably just cross-check tools with tools in various ways like I did with high assurance SW to catch tool-specific issues. That implies a lot of, but not total, trust in the tools. If I'm wrong, I'll be surprised and probably learn something in the process.

There's another method HW people might already use that comes from theorem provers. They know the proving process is complex. It also breaks down into a series of primitive actions in logic. So, they split the activity between a complex, untrusted prover and a simple, easy-to-verify checker. I know state-machine equivalence & even many physical phenomena can be modeled well in software. I've seen as much with FEC systems. Trick for HW might be turning all the tool outputs into a series of steps like in an audit log that such tools can verify. That might take a hell of a long time, though, but also should be easy to parallelize onto clusters, GPU's, FPGA's, etc. Do it on a macro-cell at a time composing the results like in proof abstraction or abstract interpretation for software.
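The prover/checker split described above can be sketched in a few lines. This is an illustrative toy using Boolean satisfiability, not something taken from any hardware tool: `untrusted_solve` is a hypothetical stand-in for a complex tool (here just a brute-force search), and `check_certificate` is the small piece one would actually need to trust.

```python
from itertools import product

# A clause is a list of non-zero ints: 3 means "x3 is true", -3 means "x3 is false".


def untrusted_solve(clauses, num_vars):
    """Stand-in for a complex, untrusted tool: search for a satisfying assignment."""
    for bits in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            return assignment
    return None


def check_certificate(clauses, assignment):
    """Small trusted checker: every clause must contain at least one satisfied literal."""
    for clause in clauses:
        satisfied = False
        for literal in clause:
            value = assignment.get(abs(literal))
            if value is not None and value == (literal > 0):
                satisfied = True
                break
        if not satisfied:
            return False
    return True


# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
clauses = [[1, 2], [-1, 3], [-2, -3]]
certificate = untrusted_solve(clauses, num_vars=3)
print(check_certificate(clauses, certificate))
```

Only the "yes" answer is certified here. Certifying a "no" answer needs a step-by-step proof log for the checker to replay, which is the slow part alluded to above.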

What do you think?


"I think you trust tools more than you're letting on. You probably just cross-check tools with tools in various ways like I did with high assurance SW to catch tool-specific issues. That implies a lot of, but not total, trust in the tools. If I'm wrong, I'll be surprised and probably learn something in the process."

While it is true that we use tools to cross check each other, what I mean is that we regularly are manually looking through waveforms. At every stage of the flow, we are checking that our verification infrastructure is actually doing what it should to find bugs. Because a lot of the time, either we've stuffed up using the tool or the tool itself is stuffed.

So much tooling is provided for you. Bus functional models, protocol checkers, etc. You are just cramming it all together and writing your own stuff over the top. There is always a mistake in there somewhere.

"Trick for HW might be turning all the tool outputs into a series of steps like in an audit log that such tools can verify."

This is what happens with the UVM. A checker is written with a SystemVerilog interface by a third party or ourselves. It uses the UVM standard so you can integrate it with other UVM stuff to make even more abstract checkers. If I am writing the prover, I know I probably threw a few bugs in there!

If only it were all parallelised because it is slow as hell.


"what I mean is that we regularly are manually looking through waveforms. At every stage of the flow, we are checking that our verification infrastructure is actually doing what it should to find bugs. Because a lot of the time, either we've stuffed up using the tool or the tool itself is stuffed."

Waveform-based verification is something I know nothing about. I haven't seen it in any paper I've looked at. Is that what people do in logic analyzers and such? Do you have a link to a free reference discussing what people do with that stuff and how it's used to verify digital designs? I really should have this info in mind and on hand if you all rely on it more than verification tools.

"This is what happens with the UVM. A checker is written with a SystemVerilog interface by a third party or ourselves. "

That makes sense.

"If I am writing the prover, I know I probably threw a few bugs in there!"

I like that you're realistic. It's how I used to look at code on a complex project. Even with Correct-by-Construction, I didn't get to feel safe with my code: only wonder how obscure or unnecessarily simple a problem was left in it. I need to make a Philosoraptor meme along the lines of: "Do we create coding schemes or does the code scheme against us?" Haha.

"If only it were all parallelised because it is slow as hell."

There's the opportunity. Whether it can be acted on who knows. I do know so far that hardware is many blocks strung together with all kinds of tests that should be parallelizable. Now you've reinforced this potential in my mind. I have a trick for this but I'm holding off publishing it for now. Let's just say it's easier to parallelize stuff if one doesn't force their implementation to be inherently sequential or even tied to CPU's. And there's a little-known, albeit alpha-quality, way of doing both at once. :)


"Waveform-based verification is something I know nothing about. I haven't seen it in any paper I've looked at. Is that what people do in logic analyzers and such? Do you have a link to a free reference discussing what people do with that stuff and how it's used to verify digital designs? I really should have this info in mind and on hand if you all rely on it more than verification tools."

Yeah, waveforms from a logic analyzer are mimicked by simulator tools.

Not sure about free references. Just googling around I found this about using logic analyzers: http://www.eetimes.com/document.asp?doc_id=1274572

For example, page 3 shows a RAM timing diagram. Like any good spec, the interface from one module to another is defined via a timing diagram. We build our UVM checkers and monitors to detect these memory transactions based on the sequences specified. When a transaction occurs it triggers a UVM event which in turn can be observed by other monitors/checkers or it can create other events or record the event to a log file etc.

We build our verification infrastructure to automatically check transactions behave as specified. However knowing I can't trust my own work, I manually check the waveforms to see whether the infrastructure is performing correctly.
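As a rough software analogy for the monitor/checker arrangement described above (plain Python, not SystemVerilog and not the real UVM API), the sketch below scans sampled signals cycle by cycle, recognises a write transaction, and publishes an event to subscribers. The signal names `cs`, `we`, `addr`, `data` and the one-cycle write protocol are assumptions made for illustration, not taken from the linked timing diagram.

```python
class WriteMonitor:
    """Watches sampled signals and emits an event for each write it recognises."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def sample(self, cycle, signals):
        # Assumed protocol: a write happens on any cycle with cs=1 and we=1.
        if signals["cs"] == 1 and signals["we"] == 1:
            event = {"cycle": cycle, "addr": signals["addr"], "data": signals["data"]}
            for callback in self.subscribers:
                callback(event)


class RangeChecker:
    """Records writes that fall outside the legal address range."""

    def __init__(self, max_addr):
        self.max_addr = max_addr
        self.errors = []

    def on_write(self, event):
        if event["addr"] > self.max_addr:
            self.errors.append(event)


# Hand-written "waveform": one dict of signal values per clock cycle.
waveform = [
    {"cs": 0, "we": 0, "addr": 0x00, "data": 0x00},  # idle
    {"cs": 1, "we": 1, "addr": 0x10, "data": 0xAB},  # legal write
    {"cs": 1, "we": 0, "addr": 0x10, "data": 0x00},  # read, ignored by this monitor
    {"cs": 1, "we": 1, "addr": 0x80, "data": 0xCD},  # write beyond max_addr
]

monitor = WriteMonitor()
checker = RangeChecker(max_addr=0x7F)
log = []
monitor.subscribe(checker.on_write)  # a checker reacting to the event
monitor.subscribe(log.append)        # a plain log of every event

for cycle, signals in enumerate(waveform):
    monitor.sample(cycle, signals)

print(len(log), "writes seen;", len(checker.errors), "flagged")
```

The manual step described above corresponds to reading `waveform` by eye and confirming that `log` and `checker.errors` really match it.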

"Let's just say it's easier to parallelize stuff if one doesn't force their implementation to be inherently sequential or even tied to CPU's."

Sounds interesting. I don't know much about how it's all implemented in the simulator.


Well, yeah, I'm just saying it wasn't a picnic at 40nm or 90nm, either. It didn't go from a walk in the park to a nightmare at 28nm.

Also I believe all the bugs in TFA are logical bugs where the circuit would have misbehaved in a logical simulation, not the kind where the circuit "flips zeros to ones" or vice versa because of physical implementation issues. In this sense advanced nodes make things worse only indirectly by enabling larger, more complex designs.


Agreed. The bugs seem logical. It doesn't reflect well on the Intel verification effort if they were logical though. It means they didn't verify properly.

One possible guess/excuse-for-stuffing-up is that tools don't always simulate correctly. There can be weird gotchas where simulation models and reality don't match.


I guess that's why Xeon is a bit behind, so that they can fix those bugs.


My thoughts as well. I guess that is why things are taking a lot longer for Xeon E5.


Xeon E5 and E7 are always longer because getting SMP working reliably can reveal all sorts of bugs that never arise when you have One True Processor Die.


And more PCIE lanes, more cores, more cache, etc...


It's also amazing given that there are over 5 billion transistors (on a high-end E5). It's not the transistors that cause the errors but the CPU design.


Old news, says Kris Kaspersky:

http://cs.dartmouth.edu/~sergey/cs258/2010/D2T1%20-%20Kris%2...

Will likely continue to get worse in full-custom designs like Intel's since complexity keeps going up but ease of modeling and verification doesn't.

On the other end, look up VAMP from Verisoft, SSP from Sandia, or AAMP7G from Rockwell-Collins if you want to see what high-assurance processors look like. They ran error-free during testing IIRC with a ton of validation. Sandia SSP was first-pass. In any case, they're all kind of simple compared to Intel's stuff. That's on purpose given there's an upper limit to how much complexity you can squeeze into a chip without significant errata. One can expand such methods to larger SOC's but that's not what big vendors are doing [out of necessity]. And that has security implications that aren't going away.


I'm curious: what would a fix for these look like? Does it mean a new revision to be bought, a recall, a software patch?


It depends on the severity of the bug and the level of foresight of the hardware designer. In the worst case (e.g. the FDIV bug on the Pentium) you have to recall the CPU. Obviously, this is very bad. That's why modern CPUs have what are called "chicken bits" or "kill bits", which the BIOS or OS can set to disable specific features. The most recent use of kill bits I can recall is Intel disabling the TSX instructions on the Haswell line of CPUs. Finally, the least invasive option is to issue a microcode update, which alters the way that x86 instructions get decoded to avoid the problematic behavior. Microcode updates are issued as software patches, and they're actually in the Linux kernel source tree.
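To see which microcode revision is currently loaded, x86 Linux kernels report a `microcode` field per logical CPU in `/proc/cpuinfo`. A minimal sketch for reading it follows; the function name `microcode_revisions` is made up for this example, and the sample text is invented rather than captured from a real machine.

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            # The field is printed in hex with a 0x prefix.
            revisions.add(int(value.strip(), 16))
    return revisions


# Invented sample input in the format x86 Linux uses.
sample = "processor\t: 0\nmicrocode\t: 0x1c\nprocessor\t: 1\nmicrocode\t: 0x1c\n"
print(sorted(hex(r) for r in microcode_revisions(sample)))

# On a real x86 Linux machine:
# with open("/proc/cpuinfo") as f:
#     print(microcode_revisions(f.read()))
```

More than one value in the returned set would mean the logical CPUs are not all running the same revision.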

If you want to learn more about all three of these fix techniques, in addition to learning a bit about the steps that Intel and AMD take to prevent such bugs in the first place, I highly recommend this CCC talk: https://media.ccc.de/v/32c3-7171-when_hardware_must_just_wor....


Don't forget the Segfault bug on the Intel Quark (which doesn't have killbits - oops) [1], or the stack pointer overflow bug that Matt Dillon found compiling DragonflyBSD [2], or the Barcelona "you weren't really using your TLB, were you?" cache coherency bug [3].

Microcode is also in multiple places - there's a flavor baked into your chip, flavors baked into your BIOS to boot on your CPU if its rev is lower, and flavors pushed into {Windows,Linux,...} that also flash if their revs are newer.
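The "newer revision wins" rule described above can be modelled in a few lines. This is a toy model of the description in this comment, not of any vendor's actual loader; the stage names and revision numbers are invented.

```python
def boot_microcode(stages):
    """Walk the stages in boot order, applying a stage's microcode only if it is newer."""
    running = None
    applied = []
    for name, revision in stages:
        if running is None or revision > running:
            running = revision
            applied.append(name)
    return running, applied


# Hypothetical machine: the BIOS carries a newer revision than the silicon,
# and the OS package is older than what the BIOS already loaded.
stages = [("silicon", 0x12), ("bios", 0x1A), ("os", 0x19)]
running, applied = boot_microcode(stages)
print(hex(running), applied)
```

In this model the OS copy is skipped because the BIOS already loaded something newer, which is why updating either the BIOS or the OS package can be enough to pick up a fix.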

[1] - https://en.wikipedia.org/wiki/Intel_Quark#Segfault_bug

[2] - http://wiki.osdev.org/CPU_Bugs#DragonFly_BSD_Heavy_Load_Cras...

[3] - http://www.anandtech.com/show/2477/2


Usually just a microcode update. Before microcode though, a bug would usually just be worked around. IIRC, Windows would actually trap the FDIV instruction on buggy Pentiums and compute the true result with software. With enough pressure, a recall might be issued, but it took a while for Intel to finally recall the buggy Pentiums.
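The check people used to spot an affected Pentium is easy to reproduce. The operands below are the widely circulated test pair: on a correct FPU the residual is zero, while flawed chips were reported to return 256. A sketch, with `fdiv_residual` and `division_looks_flawed` as made-up names:

```python
def fdiv_residual(x=4195835.0, y=3145727.0):
    """Return x - (x / y) * y, which should be zero if the division is accurate."""
    return x - (x / y) * y


def division_looks_flawed(tolerance=1.0):
    # A residual far from zero means the quotient lost precision.
    return abs(fdiv_residual()) > tolerance


print(division_looks_flawed())
```

A workaround of the kind described above could run a check like this at startup to decide whether to take the slower software path.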


For the FDIV bug, IBM pulled and replaced all shipped bad chips. Not every vendor did that.


The FDIV bug was an oddity in part because of when it happened. It was the first significant bug not fixable by microcode update or similar in the first Intel chip to be marketed directly at the end user.

Marketing/PR was a key factor: public awareness was high, public understanding was low, and the press had a field day showing examples of the bug affecting popular spreadsheet applications. Prior to the Pentium line, CPUs were something only techie users like us were more than casually aware of - to everyone else they were just a part in an appliance and bugs like this would have been blamed on the software (for which there would be an update to work around the bug which would cement the opinion the software was the source of the problem).

Traditionally kernels and compilers would be reissued with a work-around (and software affected by a compiler change patched) and people who were significantly affected (number crunchers for whom the workaround massively affected performance in this case) might get a replacement if they bought direct from Intel. Home and office users would be pretty much unaware of this happening. There were a number of bugs in x86 chips prior to this that were handled that way (see the list of bugs the Linux kernel checks for on startup like "F00F" and so forth). Obviously this isn't just an Intel thing, I expect all significant processor lines have had bugs. I remember some machine code I'd badly crafted being affected by a bug in some variants of the 6502 so it would work on some BBCs/compatibles but not all (I forget the exact details but it was something about jumps from locations at the last byte of a page; the workaround was easy: pad the location, or use the mallet instead of a more precise tool and 16-bit align everything).
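The 6502 behaviour half-remembered above sounds like the well-known indirect-jump quirk on NMOS parts: when the pointer operand of `JMP (addr)` sits at the last byte of a page, the high byte of the target is fetched from the start of the same page rather than from the next page. That this is the bug the commenter hit is an assumption; the sketch below only models that quirk, with made-up memory contents.

```python
def jmp_indirect_target(memory, pointer, nmos_bug=True):
    """Return the address that JMP (pointer) transfers control to."""
    low = memory[pointer]
    if nmos_bug and (pointer & 0xFF) == 0xFF:
        # Buggy parts wrap within the page instead of crossing into the next one.
        high = memory[pointer & 0xFF00]
    else:
        high = memory[(pointer + 1) & 0xFFFF]
    return (high << 8) | low


memory = bytearray(0x10000)
memory[0x30FF] = 0x80  # low byte of the intended target
memory[0x3100] = 0x50  # high byte of the intended target
memory[0x3000] = 0x40  # byte the buggy part fetches instead

print(hex(jmp_indirect_target(memory, 0x30FF, nmos_bug=True)))
print(hex(jmp_indirect_target(memory, 0x30FF, nmos_bug=False)))
```

Keeping the pointer off the last byte of a page, as in the padding workaround mentioned above, makes both variants agree.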


Only after the bug was described on CNN.


You don't fix them, the tools and systems (gcc or the operating systems) have to work around them (i.e. not allow the sequences of instructions that generate the bad behaviour).


That's one way, and sometimes the only way on certain platforms (hello Intel Quark), but:

- those kinds of workarounds can cause problems on other platforms in turn
- they don't stop malicious people from making use of such exploits, if they don't require such privileges as to make them pointless once you can use them

So you generally prefer to be able to update the CPU microcode or flip a killbit to disable the relevant silicon path, to having to recompile everything on your architecture to include a workaround detection.


presumably just a microcode update


There was a talk at 32C3 about CPU bugs and the insane work that goes into preventing them: https://www.youtube.com/watch?v=eDmv0sDB1Ak


That doesn't even include any CPU bugs deliberately installed as backdoors.[1]

[1] http://www.eteknix.com/expert-says-nsa-have-backdoors-built-...


It's called AMT/vPro. It's in the brochure. People tell me it's even on when the system is off. All that circuitry is probably in most of the family just to reduce NRE costs. Couldn't be more ideal.

And to think some people here mock people for worrying about their random number function while ignoring their official backdoor and its implications. (sighs)


On the bright side: There are worse things than a processor lockup (which is easy to spot when it happens). And the other bug was in the newest architecture (Skylake) and did not affect Xeons.

Arguably the memory Row Hammer exploit was far worse, and is a sign of how bad things can get outside the CPU.



