Linux page table isolation is not needed on AMD processors

caiobegotti · on Jan 2, 2018

Did he know it would blow up in a few weeks? https://www.fool.com/investing/2017/12/19/intels-ceo-just-so...

calt · on Jan 2, 2018

What's the connection?

amckenna · on Jan 3, 2018

Here is some more context - http://pythonsweetness.tumblr.com/post/169166980422/the-myst...

The connection between the linked article in this comment and the linked page for this post is that there is a potentially huge bug that will be made public soon and it just affects Intel processors, not AMD - hence the large sale of stock by the Intel CEO.

yeukhon · on Jan 3, 2018

Not sure it makes sense, or is even logical, to compare this with the market before and after Intel's floating point bug was uncovered over two decades ago. My bet is this current bug won't shake Intel's stock price much.

Certhas · on Jan 3, 2018

If the workaround being deployed now causes a 30% performance hit in real world usage, even just for some cases, it could hit Intel way harder than fdiv.

A lot of people on Intel will suddenly lose a noticeable amount of performance. And if your Intel-based VMs lose 25% performance, you are now booting up and paying for about 33% more VMs for the same load.
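
As a rough check on the VM arithmetic (a sketch, not a benchmark): if each instance keeps only a fraction 1 - p of its throughput, you need 1 / (1 - p) times as many instances for the same load. So a 25% loss means about 33% more instances, and a 20% loss means 25% more. The function name below is made up for illustration.

```python
def extra_instances_needed(perf_loss: float) -> float:
    """Fraction of additional instances needed to keep total throughput
    constant when each instance loses `perf_loss` of its performance."""
    if not 0.0 <= perf_loss < 1.0:
        raise ValueError("perf_loss must be in [0, 1)")
    return 1.0 / (1.0 - perf_loss) - 1.0


for loss in (0.05, 0.20, 0.25, 0.30):
    extra = extra_instances_needed(loss)
    print(f"{loss:.0%} slower -> {extra:.1%} more instances")
```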

lvs · on Jan 3, 2018

If your allegation is true, that would seem to be very illegal.

cma · on Jan 3, 2018

I've heard you can schedule big sales all the time and then regularly cancel them unless something goes wrong. Apparently there is no rule against insider canceling.

aslkdjaslkdj · on Jan 3, 2018

That's not true. Changing a stock sale plan in any way is considered insider trading. The window has nothing to do with whether it's legal or not. It's only used as risk mitigation and is up to company policy.

https://corpgov.law.harvard.edu/2013/02/05/rule-10b5-1-plans...

adrianratnapala · on Jan 3, 2018

I think the point cma was making is that this trick doesn't involve making changes to (formal) stock-sale plans. The formal plan is to sell regularly, and that remains unchanged; you just cancel it by hand habitually, except when you don't.

I am no lawyer, so I don't know if this is really allowed. My gut instinct is (a) no, it is not allowed and (b) there will always be some more subtle version of the tactic that is allowed.

Natanael_L · on Jan 3, 2018

Just hire a DDoS attack anonymously so your attempt to sell fails, pretend you've got nothing to do with it

cma · on Jan 3, 2018

It is a lot less clear cut than that: https://en.wikipedia.org/wiki/SEC_Rule_10b5-1#A_possible_loo...

cjcole · on Jan 3, 2018

Speculative execution bug causes speculative stock sale non-cancellation?

rphlx · on Jan 3, 2018

That's correct as I understand it, though there is a risk that some high-profile event will cause the media to whip up enough (understandable) public outrage to cause the SEC to actually, reluctantly fine someone for that.

CyberDildonics · on Jan 3, 2018

Michael Milken, the junk bond king of the 80s, realized that corporate bonds traded similarly to stocks, yet were not covered by insider trading laws.

michaelermer · on Jan 3, 2018

Full Story https://www.theregister.co.uk/2018/01/02/intel_cpu_design_fl...

twotwotwo · on Jan 2, 2018

Was the connection with speculative execution already being discussed openly? I know about https://cyber.wtf/2017/07/28/negative-result-reading-kernel-..., but not about anything between that and 28 Dec suggesting someone made it work and that's the reason for KPTI.

If it wasn't in the open, seems...not ideal embargo-wise for AMD to leak it there. Though no one's in that thread complaining about the disclosure, so maybe they either think that part is already known to anyone looking closely, or just don't think it's a very big piece of the exploit puzzle (like, finding the way to get info out a side channel was the hard part).

daenney · on Jan 2, 2018

It wasn't publicly acknowledged but people figured it out already. Take a look at https://news.ycombinator.com/item?id=16046636 (both the article and the comments) for example. This wasn't going to stay secret much longer.

twotwotwo · on Jan 2, 2018

That post is a couple days after the 28 Dec AMD commit, though. Curious if it was _already_ discussed since that would mean no way what AMD said is how people figured it out.

my123 does point out that the author of the speculative execution blog post is first in the KAISER paper's acknowledgments, and looks like the paper was presented at a July conference, so that's an earlier clue out in public, for what it's worth.

my123 · on Jan 2, 2018

The author of https://cyber.wtf/2017/07/28/negative-result-reading-kernel-... is listed in the Acknowledgements part as the first name in the Kaiser whitepaper.

twotwotwo · on Jan 2, 2018

Ah, yes, that does look like an earlier public clue. Thanks.

AnssiH · on Jan 2, 2018

> Though no one's in that thread complaining about the disclosure, [...]

I imagine if someone had complaints they would make them in private so as to not make the situation even less ideal embargo-wise.

benmmurphy · on Jan 3, 2018

https://twitter.com/dougallj has released source code (https://t.co/vaaMyajriH) which partially reproduces the problem. You need a little bit of tweaking to read kernel memory and to read the actual values. From his Twitter and from what I've observed, sometimes the speculative code will see 0 and sometimes it will see the correct value. He speculates that it might work if the value is already in the cache.

emusan · on Jan 2, 2018

I think the original patchset is from December 4th: https://lkml.org/lkml/2017/12/4/709 though I could be mistaken.

twotwotwo · on Jan 3, 2018

That's "a major overhaul of the KAISER patches" as the commit message says. It doesn't mention the connection to speculative execution, though; that was the bit I was interested in.

tedunangst · on Jan 2, 2018

A leaking embargo? I'm shocked!

electic · on Jan 3, 2018

This is going to have a dramatic effect on the cloud computing market. It might make sense to make sure any VMs you run are on AMD processors, or it can really hurt your performance and basically cost you more to do the same workload.

It also seems, from early benchmarks, this can slaughter performance with databases.

caf · on Jan 3, 2018

Here's a Postgres benchmark on Skylake (supports PCID) showing a ~6% reduction in TPS.

http://lkml.iu.edu/hypermail/linux/kernel/1801.0/01274.html

bhouston · on Jan 3, 2018

I wonder if cloud providers will ask Intel for partial refunds when their CPUs get 5% to 30% slower than promised?

dx034 · on Jan 4, 2018

Why? They pass it on to customers. More interesting is whether Google or Facebook will react. They could need up to a data centre each to compensate for that (I assume both have very syscall-heavy applications). Maybe not by suing Intel but by pouring more money into the development of competing chips.

yeukhon · on Jan 3, 2018

Why are people insisting this affects the cloud computing market? I am not sure this bug is limited to cloud instances at all.

zlynx · on Jan 3, 2018

The bug affects transitions into kernel mode. Virtual machines have one extra transition. A read() call in the guest calls the guest OS which calls the host OS.

yeukhon · on Jan 3, 2018

You are referring to the slow down and hence the extra slow down for calling an extra syscall?

If so, then isn't it technically correct that the fix hurts performance whether or not you are virtualized, just with a heavier penalty for VMs?

_xnmw · on Jan 3, 2018

Because this bug allegedly affects hypervisors/VMs.

yeukhon · on Jan 3, 2018

But shouldn’t the fix affect kernel performance regardless of whether the OS is a guest or not?

hawski · on Jan 3, 2018

Because for example most consumers don't care.

vasili111 · on Jan 3, 2018

Don't worry. I don't think that there will be two separate kernels for Intel and AMD. I think the performance drop will hit both CPUs whether or not they have the bug.

kbart · on Jan 3, 2018

No. A check is made for which CPU is underlying before the fix is applied.

dingo_bat · on Jan 3, 2018

Not yet. That patch hasn't been merged as of now.

consp · on Jan 3, 2018

Probably because the method of checking is flawed and should be done by make/model (I'm no kernel expert but this is the general opinion I've seen so far).

I've seen nothing except the "err on the side of caution, let's do it for all" approach, and no indication of other problems on the other hand.

russdill · on Jan 3, 2018

You can turn it off and on via a commandline flag.
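
For reference, the boot parameters I'm aware of for this in mainline are `nopti` and `pti=on|off|auto`; treat those spellings as an assumption and check the kernel-parameters documentation for your version. A small sketch that inspects a kernel command line string (the helper name is hypothetical, and it only reports what was requested at boot, not what the kernel actually decided):

```python
def pti_setting_from_cmdline(cmdline: str) -> str:
    """Return 'off', 'on' or 'auto' based on PTI-related boot parameters.

    Assumes the mainline spellings `nopti` and `pti=<value>`; if several
    are present, this sketch simply lets the last one win.
    """
    setting = "auto"  # assumed default when nothing is specified
    for token in cmdline.split():
        if token == "nopti":
            setting = "off"
        elif token.startswith("pti="):
            value = token[len("pti="):]
            if value in ("on", "off", "auto"):
                setting = value
    return setting


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(pti_setting_from_cmdline(f.read()))
    except OSError:
        print("no readable /proc/cmdline on this system")
```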

_fq4v · on Jan 3, 2018

You could just compile the kernel without it on?

anonacct37 · on Jan 2, 2018

This feels like a big FU to Intel. I've heard this patch can slow down programs like du by 50%. Does that mean AMD is going to find itself running twice as fast as competitors?

jandrese · on Jan 2, 2018

I think the du case was an outlier. Normal workloads shouldn't be so heavily affected. I am expecting a few percent loss on most programs though. It's basically a larger penalty for making a syscall, which was already a fairly slow operation so performance minded people avoid them in tight loops. It will be bad for people who need to do lots of fast I/O I suspect.

dannyw · on Jan 3, 2018

Postgres and Redis are looking at a 20-25% performance hit.

https://www.phoronix.com/scan.php?page=article&item=linux-41...

tankenmate · on Jan 2, 2018

Some portable / embedded databases come to mind. Also "normal" databases doing replication initiation, and replication re-sync. And lastly backup, restore, tar, etc with small files. For files under a handful of pages long mmap() isn't a big gain.

Another syscall I think that might cause issues is gettimeofday(), that particular call has been optimised to the nth degree, and lots of user programs spam the crap out of it (mostly necessarily), especially networking and streaming programs. It would be interesting to see how much of an overhead percentagewise page table isolation will cost, and its effects on low end media devices, et al.

pm215 · on Jan 2, 2018

Linux gettimeofday these days is implemented in the 'vdso', which is code provided by the kernel that runs solely in userspace. So it's not a syscall in the 'privilege level switch by executing insn that takes an exception' sense and shouldn't be affected by the syscall-entry/exit path becoming more expensive.
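
A rough way to see the difference on your own box: time a call that normally goes through the vDSO against one that always enters the kernel. This is a sketch under assumptions: that `time.clock_gettime` ends up in the vDSO via libc on your platform, and that `os.getppid` is a real syscall on every call. Python's interpreter overhead will blur both numbers, so only the gap between them is interesting.

```python
import os
import time
import timeit


def ns_per_call(fn, number=200_000):
    """Average wall-clock nanoseconds per call of fn()."""
    seconds = timeit.timeit(fn, number=number)
    return seconds / number * 1e9


def vdso_clock():
    # Usually served from the vDSO on Linux x86-64 (assumption).
    return time.clock_gettime(time.CLOCK_MONOTONIC)


def real_syscall():
    # getppid() should enter the kernel every time (assumption).
    return os.getppid()


if __name__ == "__main__":
    print(f"clock_gettime: {ns_per_call(vdso_clock):.0f} ns/call")
    print(f"getppid:       {ns_per_call(real_syscall):.0f} ns/call")
```

Running it before and after booting with the mitigation enabled should show the second number move much more than the first, if the vDSO assumption holds.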

tankenmate · on Jan 2, 2018

Does this also apply to vsyscalls being emulated? Will this mean that older static binaries will no longer run? Or will they just suffer a penalty as well?

wolf550e · on Jan 2, 2018

The Linux kernel has a compatibility guarantee for the userspace-visible API, so static binaries will continue to run. If the static binary is so old it does not know about the vDSO and uses a regular syscall to query time, it will be slowed down.

amluto · on Jan 3, 2018

Emulated vsyscalls were already very slow and will be even slower on a patched kernel. They'll still work, though.

tankenmate · on Jan 2, 2018

This topic got me to look at the vDSO for my own machine so I wrote a short Perl program to dump your kernel's vDSO.

https://pastebin.com/UnQX5U1f

tankenmate · on Jan 3, 2018

I added a quick repo to github if you want to clone or send pull requests for architectures apart from 64bit LE.

https://github.com/mattkeenan/dump-vdso

amluto · on Jan 2, 2018

vDSO-based timing is unaffected.

dataflow · on Jan 2, 2018

Isn't every (edit: contended) mutex/etc. wait operation a syscall? That's gotta hurt for any program that waits for frequent events that don't take too long to process.

mortehu · on Jan 2, 2018

They are only system calls when you need to wait or wake up processes.

https://en.m.wikipedia.org/wiki/Futex

dataflow · on Jan 2, 2018

I was assuming contention but I guess I wasn't clear, sorry. I updated the post. But saying this "only" occurs when there is contention is very misleading since it makes it seem like the scenario of lock contention is a negligible concern. It's not.

kllrnohj · on Jan 3, 2018

Thread-suspending contended mutexes are already extremely slow. If you have a heavily-contended mutex you already have a major performance bug. If this is the kick in the pants you need to go fix it that's arguably a good thing ;)

Note that mutex contention does not itself mean immediately falling back to futex - commonly you'll spinloop first and hope that resolves your contention (fast), then fall back to futex (slow)
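
To make the spin-then-block shape concrete, here is a toy model. It is only a sketch: Python's `threading.Lock` stands in for both the cmpxchg fast path (non-blocking acquire) and the futex wait (blocking acquire), and the class, counters and `before_block` hook are all made up for illustration.

```python
import threading


class SpinThenBlockLock:
    """Toy lock: try a bounded number of non-blocking acquires (the
    'spin' phase), then fall back to a blocking acquire (standing in
    for the futex wait). Counters record which path was taken."""

    def __init__(self, spin_limit=100):
        self._lock = threading.Lock()
        self.spin_limit = spin_limit
        self.fast_path = 0  # acquired without blocking
        self.slow_path = 0  # had to block
        self.before_block = threading.Event()  # hypothetical hook for the demo

    def acquire(self):
        for _ in range(self.spin_limit):
            if self._lock.acquire(blocking=False):
                self.fast_path += 1
                return
        self.before_block.set()
        self._lock.acquire()  # the expensive path
        self.slow_path += 1

    def release(self):
        self._lock.release()


def demo():
    lock = SpinThenBlockLock()
    lock.acquire()  # uncontended: fast path
    waiter = threading.Thread(target=lock.acquire)
    waiter.start()  # contended: spins, gives up, then blocks
    lock.before_block.wait()  # hold the lock until the waiter is about to block
    lock.release()
    waiter.join()
    return lock.fast_path, lock.slow_path


if __name__ == "__main__":
    print(demo())
```

Only acquisitions that land in the slow-path counter would correspond to a kernel entry in a real futex-based mutex, which is the part that gets more expensive with the patch.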

dataflow · on Jan 3, 2018

> If you have a heavily-contended mutex you already have a major performance bug.

I can't really devote time to countering the unfounded assertion that every contended mutex must be a bug. It certainly isn't consistent with my experience, but if every problem you've solved could have been parallelized infinitely without increasing lock contention, more power to you.

kllrnohj · on Jan 3, 2018

> I can't really devote time to countering the unfounded assertion that every contended mutex must be a bug.

Good, because that's not what I said. If you're heavily hitting futex contention you do have a performance bug, though. You might be confused with general contention that's being resolved with a spinlock rather than futex wait, though.

dataflow · on Jan 3, 2018

>> I can't really devote time to countering the unfounded assertion that every contended mutex must be a bug.

> Good, because that's not what I said.

It is literally what you said:

>>> If you have a heavily-contended mutex you already have a major performance bug. If this is the kick in the pants you need to go fix it that's arguably a good thing ;)

> You might be confused with general contention that's being resolved with a spinlock rather than futex wait, though.

I'm not confusing them at all; I'm literally reading exactly what you wrote. You literally said contended mutexes are necessarily bugs (right here^) and that you considered mutexes to include the initial spinlocks ("note that mutex contention does not itself mean immediately falling back to futex - commonly you'll spinloop first"). But maybe you meant to say something else?

netheril96 · on Jan 3, 2018

He said "heavily contended", and then you dropped the "heavily" prefix and claimed that was literally what he said. That adverb is material to the discussion and your dropping it completely changes the meaning.

I concur with his opinion. Infrequent contention is not a bug; otherwise no mutex is needed. Frequent contention (or heavy contention in his words) is a performance bug.

dataflow · on Jan 3, 2018

> He said "heavily contended", and then you dropped the "heavily" prefix and claimed that was literally what he said. That adverb is material to the discussion and your dropping it completely changes the meaning.

"Heavily" was not dropped intentionally at all. Add it back to my comments. It changes nothing whatsoever. The incredible opinion that every problem can be necessarily parallelized without eventually resulting in contention (and I license you to freely modify this term with 'light', 'heavy', 'medium-rare', 'salted', 'peppered', or 'grilled at 450F' to your taste) is so fantastically absurd that I cannot believe you are debating it. I definitely don't know how you can justify such an unfounded claim with no evidence and I certainly have no interest in wasting time debating it. As I said earlier: if you never encounter problems that exhibit eventual scalability limits, more power to you.

kllrnohj · on Jan 3, 2018

I literally said spin loop resolving is fast. Maybe read more than the single phrase you're pulling out of context to go on a rant about?

asveikau · on Jan 3, 2018

Let's put it this way. If every contended mutex were a bug, why not remove the mutex and let the code run as-is? No, you wouldn't, so no, not a bug.

dataflow · on Jan 3, 2018

> Let's put it this way. If every contended mutex were a bug, why not remove the mutex and let the code run as-is? No, you wouldn't, so no, not a bug.

I mean, the parent's argument is wrong, but it isn't that naive. Presumably the argument is that a bad (yet still correct) solution would result in lock contention while a better solution would e.g. use a different algorithm that is more parallelizable.

revelation · on Jan 2, 2018

Ideally a mutex is just a cmpxchg. It gets more expensive when it is contended. See the Drepper paper on futexes:

http://www.akkadia.org/drepper/futex.pdf

dataflow · on Jan 2, 2018

Thanks, yeah, someone already mentioned this and I already edited in "contended" to clarify. I was actually already aware of futexes (thanks for the link though, I've never actually read the paper), but I was assuming contention -- the "every" referred to every type of operation, not every instance. See my reply to the sibling comment regarding lock contention.

contrarian_ · on Jan 2, 2018

Sounds like servers handling lots of small UDP packets would be hit pretty hard.

revelation · on Jan 2, 2018

Applications like this where the syscall overhead (and latency) starts to be a significant factor in processing time and latency have moved to userland drivers anyway:

DPDK for 10-100 Gbps networking: https://dpdk.org/

SPDK for NVMe storage: http://www.spdk.io/

The queuing and balancing stuff the kernel does makes sense for spinning rust harddisks and residential networking, but when the underlying hardware is so fast that nothing is ever queued, really what are you doing. At 100 Gbps line speed, a 1518 byte packet takes all of ~ 120ns to transmit, or about 360 clock cycles for a 3 GHz processor.
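
The arithmetic behind those figures, as a quick sketch (it ignores preamble and inter-frame gap; the function names are just for illustration):

```python
def wire_time_ns(frame_bytes: int, link_gbps: float) -> float:
    """Serialization time of one frame in nanoseconds."""
    bits = frame_bytes * 8
    return bits / link_gbps  # bits / (Gbit/s) comes out in nanoseconds


def cpu_cycles(ns: float, cpu_ghz: float) -> float:
    """Clock cycles that elapse in `ns` nanoseconds at `cpu_ghz` GHz."""
    return ns * cpu_ghz


t = wire_time_ns(1518, 100)
print(f"{t:.1f} ns, {cpu_cycles(t, 3.0):.0f} cycles")  # prints "121.4 ns, 364 cycles"
```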

emj · on Jan 3, 2018

Taking control over network cards in user space seems doable nowadays. There was a talk about doing such drivers with IOMMU/DMA at CCC: https://media.ccc.de/v/34c3-9159-demystifying_network_cards

arghwhat · on Jan 6, 2018

User-space drivers have been doable for a while, and dpdk[1] is definitely worth a check. There are also some manufacturers[2] that only do user-space drivers for their high-performance cards (e.g. 4x10Gb/s, 2x40Gb/s, 2x100Gb/s cards). Being designed with this in mind helps performance a lot.

1: https://dpdk.org/ 2: http://www.napatech.com/

DSMan195276 · on Jan 3, 2018

> Applications like this where the syscall overhead (and latency) starts to be a significant factor in processing time and latency have moved to userland drivers anyway:

I would personally think that is worse, though please correct me if I'm wrong. The userland driver will run with an isolated PT like any other userland process, won't it? If so, it will suffer the same slowdown that every other process now has every time it has to communicate with the kernel, which I would think would be a lot for a driver.

revelation · on Jan 4, 2018

It's counterintuitive at first, but the key to understanding how this works is that while you can use an MMU to assign chunks of physical memory to a process, you can of course also just use the MMU to assign the memory-mapped IO registers of, say, a PCI Express peripheral to a process.

That is in a nutshell what a "userland driver" is. It's not too far removed from poking the parallel port at 0x378 on your DOS computer :)

arghwhat · on Jan 6, 2018

A user-space driver doesn't communicate with the kernel. It is assigned DMA buffers, and communicates with the NIC solely through reading and writing to shared memory buffers.

Even before this fix, the benefits were massive, as sending a buffer was just writing to some memory, rather than syscalls and copies galore.

zmatt · on Jan 3, 2018

Reduce number of syscalls by using sendmmsg()/recvmmsg() to batch together multiple packets per syscall?

_hyn3 · on Jan 4, 2018

Anyone run any DNS server benchmarks, esp BIND and PowerDNS?

brazzledazzle · on Jan 2, 2018

This may end up doing some significant hurt to some build and testing pipelines.

mtanski · on Jan 2, 2018

It sounds like databases on very fast storage will be affected. Tons of syscalls are made for disk IO and network IO.

lmm · on Jan 2, 2018

Databases are already written to minimise syscalls, like any other heavily performance-tuned system.

amadvance · on Jan 3, 2018

Applications that already put an effort to minimize syscalls are the ones that most likely depend on syscalls performance.

lmm · on Jan 3, 2018

Unlikely. It's relatively easy to get to the point where syscalls aren't the bottleneck by a long margin, so the only apps where syscalls are the bottleneck will be those that haven't put in any optimisation effort.

ambrop7 · on Jan 2, 2018

I think syscalls are not as slow as many people imagine they are, especially with modern CPUs and kernels (there are special instructions for syscalls that are faster than the old "interrupt" approach). See here: http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemil... ("Off-topic: A system call"). But they will be slow with this mitigation.

blattimwind · on Jan 2, 2018

"The overhead was measured to be 0.28% according to KAISER's original authors,[2] but roughly 5% for most workloads by a Linux developer.[1]" [1] = https://lwn.net/Articles/738975/

Though the patches evolved since then. So I guess we'll see.

tscs37 · on Jan 2, 2018

I believe the 0.28% are only for CPUs that support PCID. Earlier CPUs (which is a lot still) will get a much harder hit since you'll have to flush the entire TLB.

my123 · on Jan 2, 2018

Even with PCID, the hit is 29% in a tight syscall loop, but is a complete disaster above 50% with PCID off...

BeeOnRope · on Jan 2, 2018

In fact, the tight syscall loop isn't even necessarily the worst case: the primary cost of this change isn't a direct cost in the syscall, but the CR3 switch which invalidates the TLB and incurs an ongoing cost for some time following the syscall.

The worst case would be something like a frequent syscall followed by code that touches a number of distinct cache lines, which all now require a TLB reload and page-walk (even here the cost is tricky to evaluate since there are various levels where the paging structures can be cached beyond the TLB, so the cost of a page-walk varies a lot depending on locality of the paging structures used).

bonzini · on Jan 2, 2018

PCIDs avoid the TLB invalidation.

BeeOnRope · on Jan 2, 2018

Yup, I think so.

In that case (PCID hardware on a PCID-enabled kernel), the performance effect should be more limited to the syscall itself. That said, why is the hit still so big with PCID? Surely just the CR3-swap by itself shouldn't be so slow?

bonzini · on Jan 3, 2018

MOV-to-CR3 is pretty slow, yes. In the ballpark of a hundred clock cycles, and you have to do 2 of them. The cost of a system call was about 1000 cycles, maybe less on newer processors---both OSes and processors optimize the hell out of SYSCALL/SYSRET.
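
Back-of-the-envelope with those numbers (both are ballpark figures from this comment, not measurements, and the function is just an illustration):

```python
def pti_syscall_overhead(base_cycles=1000, cr3_write_cycles=100, cr3_writes=2):
    """Relative extra cost per syscall from the added CR3 writes alone.

    Ignores any TLB refill cost that follows the switch.
    """
    return cr3_writes * cr3_write_cycles / base_cycles


print(f"{pti_syscall_overhead():.0%}")                 # prints "20%"
print(f"{pti_syscall_overhead(base_cycles=700):.0%}")  # prints "29%"
```

If the baseline syscall is cheaper than 1000 cycles, the same 200 cycles is a larger share, which is why the relative hit looks worst in a tight loop of trivial syscalls.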

walterbell · on Jan 2, 2018

Which Intel CPUs support PCID? Based on this Linus message, it was introduced in 2015, so Broadwell and later?

http://lkml.iu.edu/hypermail/linux/kernel/1504.3/02961.html

Freaky · on Jan 2, 2018

PCID's been around for a long time - even my old Westmere Xeons have it. INVPCID is more recent.
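
On Linux you can check what your own CPU reports by looking for the `pcid` and `invpcid` flags in /proc/cpuinfo. A minimal sketch (helper names are hypothetical; inside a VM the hypervisor may hide flags the hardware has):

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def pcid_support(cpuinfo_text: str) -> dict:
    """Report whether the pcid and invpcid flags are present."""
    flags = cpu_flags(cpuinfo_text)
    return {"pcid": "pcid" in flags, "invpcid": "invpcid" in flags}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(pcid_support(f.read()))
    except OSError:
        print("no readable /proc/cpuinfo on this system")
```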

SolarNet · on Jan 3, 2018

Yes. AMD didn't take shortcuts, and implemented the spec correctly. Intel took shortcuts, introduced bugs, and now to compensate for that the OS has to work around it in software, it's going to be slow. For years Intel has reaped the benefits of shortcuts for performance, while AMD has been implementing things correctly; now there is a correction.

That's how the market works.

amluto · on Jan 3, 2018

AMD doesn't exactly do an amazing job of avoiding gotchas in their CPUs. They have a bizarre idea of what writing zero to a segment register should do (resulting in info leaks that were only recently fixed on Linux), their demented leaky IRET is even more demented than Intel's, and their SYSRET's handling of SS is downright nutty.

OTOH, Intel's SYSRET is actively dangerous and has resulted in severe security holes, and Intel doesn't appear to acknowledge that their design is a mistake or that it should be fixed.

tfcata · on Jan 3, 2018

Can you post a few links maybe to the SYSRET issue mentioned? Just curious.

amluto · on Jan 3, 2018

SYSRET on Intel will fault with #GP if the kernel tries to go to a noncanonical user RIP. The #GP comes from kernel mode but with the user RSP. Before SMAP, this was an easy root if it happened. With SMAP, it's still pretty bad. AMD CPUs instead allow SYSRET to succeed and send #PF afterwards, which is very safe.
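
For anyone unsure what "noncanonical" means here: with 48-bit virtual addresses, bits 63 through 47 must all be equal, i.e. the address is a sign extension of its low 48 bits. A small sketch of that check (the 48-bit default is an assumption about the paging mode, and the function name is mine):

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    if not 0 <= addr < (1 << 64):
        raise ValueError("expected a 64-bit unsigned value")
    top = addr >> (va_bits - 1)               # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # those same bits, all set
    return top == 0 or top == all_ones


print(is_canonical(0x00007FFFFFFFFFFF))  # highest user address: True
print(is_canonical(0x0000800000000000))  # one past it: False
print(is_canonical(0xFFFF800000000000))  # lowest kernel-half address: True
```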

AMD CPUs are differently dumb. If SYSRET is issued while SS=0, then the SS register ends up in a bogus state in which it appears to contain the correct value but 32-bit stack access fails. Search the Linux kernel for "SYSRET_SS_ATTRS" for the workaround.

cesarb · on Jan 3, 2018

I believe it's this one: https://blog.xenproject.org/2012/06/13/the-intel-sysret-priv...

kzrdude · on Jan 2, 2018

The 50% figure is from a benchmark that didn't run on an Intel CPU!

kpcyrd · on Jan 2, 2018

    - /* Assume for now that ALL x86 CPUs are insecure */

Before this patch the mitigation was active for all vendors.

tankenmate · on Jan 2, 2018

The text as written only seeks to defend AMD's product. Whether the sub text goes further is open to non objective speculation. Having said that I'm sure AMD are feeling pretty happy with their statement. Schadenfreude may be too long a bow...

rurban · on Jan 2, 2018

What are you talking about? Every other single CPU vendor with an MMU (arm, s390, Sparc, ...) either has a separate page translation table register for kernel and user space or, like AMD, memory page capabilities; only Intel does not. It's very clear who is at fault here.

amluto · on Jan 2, 2018

Not ARM. You're confusing kernel mode with kernel addresses.

rurban · on Jan 2, 2018

Nope, read the paper, read the patches. Only Intel is affected. Arm has two such registers, TTBR0 and TTBR1. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

amluto · on Jan 2, 2018

I think you're still misunderstanding. The CPU picks TTBR0 or TTBR1 based on the top significant bit of the VA, irrespective of whether the access was initiated by user or kernel code. This is in contrast to s390, which has separate page tables for user mode and kernel mode. I personally much prefer s390's model.

And yes, I've read quite a few papers, and I wrote a good fraction of the patches.

rurban · on Jan 2, 2018

Oh, thanks. My fault then. I haven't read the arm patches, only the summary.

kev009 · on Jan 3, 2018

I vaguely remember some threads from last decade where Linus trashed PowerPC and s390 TLBs. I wish I could find them and reread with this in mind.

stefanfisk · on Jan 5, 2018

are you thinking of this? http://yarchive.net/comp/powerpc_page_tables.html

tankenmate · on Jan 2, 2018

My response was about whether AMD was partaking in schadenfreude, in reference to the OP's statement that this was "a big FU to Intel". I wasn't making a statement on who else was affected and/or who was at fault.

thinkMOAR · on Jan 2, 2018

Hmm, I was wondering about the performance hit, and I kind of miss performance details in the report, as I consider them significant enough to report.

bitwind · on Jan 2, 2018

All Intel CPUs are affected, the mitigation increases syscall overhead by 50%, and no AMD CPUs are affected? I would say this could be an indicator to short INTC and long AMD...

IgorPartola · on Jan 2, 2018

Short INTC maybe but I am not sure this means that AMD will increase in value over the long run as a result of this one incident.

dboreham · on Jan 2, 2018

I think it will because it shows the downside of a monoculture. Hence big purchasers of CPUs will want to diversify. Also good for ARM vendors I suppose. Disclosure: bought AMD this morning before headlines saying "Buy AMD, short INTC" appeared.

IgorPartola · on Jan 2, 2018

Has there ever been a precedent for this? When there were major bugs in Intel CPUs (or drives, or RAM, or motherboards) did the likes of Amazon and Google invest in diversification? And has it affected stock prices meaningfully? My guess is that they'll see this as just another one-off issue that can be fixed with software, then move on. For a large enterprise, monoculture that works is actually better than diversification.

When you think about your own workstation, it's not a big deal to build an Intel or AMD system. But when you buy 100k motherboards and spend the time adjusting your tooling to those, from packaging to power, to cooling, to support, to OS code, etc. and then you on a whim decide to get another 100k motherboards of a different architecture, you spend a non-trivial amount of time and money to support those as well. Again, if AMD provides better hardware, it's absolutely worth it. But I personally wouldn't do it based on this bug.

I don't own shares of either AMD or Intel.

revelation · on Jan 2, 2018

I checked the stock of Intel during the FDIV bug (1994/1995) where they had to go as far as recalling the affected processors at a cost of $500M in January 1995 and there was basically zero effect. By the end of 1995 the stock had actually pretty much doubled in value..

sitkack · on Jan 3, 2018

I personally think FDIV made Intel money, it told the world how important Intel was. It wasn't just the calculator sitting on some trader's desk. It ran the stock market and the stock market responded.

zrm · on Jan 2, 2018

> For a large enterprise, monoculture that works is actually better than diversification.

But that's the problem. Nothing works 100% of the time, which is why monoculture is bad. When there is a bug that affects 20% of your systems, you can continue operating at 80% capacity, which at a reasonable level of reserve/redundancy means you're still entirely up. With a monoculture the bug affects everything and you're entirely down.

> But when you buy 100k motherboards and spend the time adjusting your tooling to those, from packaging to power, to cooling, to support, to OS code, etc. and then you on a whim decide to get another 100k motherboards of a different architecture, you spend a non-trivial amount of time and money to support those as well.

This is why hardware abstraction is a thing.

It's almost always less expensive to support diverse hardware from the beginning than to wait until after the market shifts.

Eventually the day comes to switch from 68K to PowerPC, or PowerPC to Intel, or Intel to ARM, or ARM to whatever else. Because eventually you save/gain a zillion dollars by switching and it "only" costs three quarters of a zillion to switch.

But it would have cost a tenth that much to have supported diverse hardware from the start, and then the transition is only a matter of using more of the now-superior hardware rather than being stuck on the now-inferior hardware for potentially years while everything is rearchitected from scratch.

IgorPartola · on Jan 3, 2018

> This is why hardware abstraction is a thing.

This is the mistake in your argument. Motherboards, CPUs, RAM chips, GPUs are analog and physical objects. For AWS to switch a DC from one mobo to another just to find out that this one draws 5% more power and their standard backup generator can't handle it, which now starts a chain reaction of upgrades is going to incur real world costs. Costs that can't be amortized by writing some code to make the motherboards look the same.

This is basically the A/B testing/one-armed bandit problem. How much time do you spend exploring alternatives vs how much time do you reap the benefits of the fact that all your hardware is exactly the same and best of breed, as based on your testing?

> When there is a bug that affects 20% of your systems, you can continue operating at 80% capacity, which at a reasonable level of reserve/redundancy means you're still entirely up. With a monoculture the bug affects everything and you're entirely down.

These situations simply don't lose enough money to make up for the gains of a monoculture.

Think about it this way. You probably run or at least know someone who runs SaaS products. Do you/they use five different cloud providers in equal measure to make sure you have diversity if one has an issue? Do you/they use five different software stacks in case there is a remote exploit for RoR and PHP holds things up? Do you/they buy groceries at five different grocery stores in case one of them has an E. coli outbreak that the other four don't? The answer to all that is no, because no matter how you try to abstract these things, there is a meaningful difference between PHP vs Django vs RoR vs Express vs .NET and between AWS and GC and Azure, so it would cost you a lot more, not just in billing but in engineering effort, to support.

Another example: chances are you've at some point built a RAID array. Did you put different size and performance drives from different manufacturers into it or did you buy N of the same drive type to ensure even performance? If so, why?

Put another way, how much more are you willing to pay on your AWS bill to ensure they are running a mix of ARM, AMD, PowerPC, and Intel chips? Because my guess is that it won't be in the range of 1-2%.

zrm · on Jan 3, 2018

> Motherboards, CPUs, RAM chips, GPUs are analog and physical objects. For AWS to switch a DC from one mobo to another just to find out that this one draws 5% more power and their standard backup generator can't handle it, which now starts a chain reaction of upgrades is going to incur real world costs. Costs that can't be amortized by writing some code to make the motherboards look the same.

Being physical isn't different. The data center is designed to allow systems that consume up to, for example, 500W. When one consumes 400W and another consumes 420W, they're still fungible. A system that consumes 525W can't be used, but you know that so you don't use those.
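
A rough sketch of that fungibility check, assuming a flat per-system power budget. The model names are invented for illustration; only the 500W limit and the 400W/420W/525W figures come from the example above.

```python
# Hypothetical inventory: the model names are made up for illustration.
SYSTEM_POWER_BUDGET_W = 500  # per-system design limit from the example above

candidate_models = {
    "vendor_a_board": 400,
    "vendor_b_board": 420,
    "vendor_c_board": 525,
}

def usable_models(models, budget_w):
    """Return the models whose power draw fits within the per-system budget."""
    return sorted(name for name, watts in models.items() if watts <= budget_w)

# Anything in this list is interchangeable as far as the power design goes.
usable = usable_models(candidate_models, SYSTEM_POWER_BUDGET_W)
print(usable)
```

Real data center planning also has to account for cooling, rack density and peak vs sustained draw, which this sketch ignores.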

> This is basically the A/B testing/one-armed bandit problem. How much time do you spend exploring alternatives vs how much time do you spend reaping the benefits of the fact that all your hardware is exactly the same and best of breed, as based on your testing?

That isn't the relevant problem. Even if you choose monoculture, you still have to pay the cost of weighing your alternatives to decide which single model to use.

The cost of diversity is that the second best model on some metric is 20% worse than the best. But that is also the advantage, because on some other metric it's 20% better. You can use each model for its strength. And since you can't perfectly predict the future, when something unexpected happens you're better able to handle it, because for any given thing that only some systems can do, you will have some systems that can do it.

> These situations simply don't lose enough money to make up for the gains of a monoculture.

Beware survivorship bias. It's easier to find an active monoculture company that has never had a major problem than one that has, because having a major problem in a monoculture often results in bankruptcy.

> Do you/they use five different cloud providers in equal measure to make sure you have diversity if one has an issue?

For services with high availability requirements, people absolutely do that.

> Do you/they use five different software stacks in case there is a remote exploit for RoR and PHP holds things up?

That wouldn't reduce attack surface. The relevant thing people do is to use two factor authentication.

> Do you/they buy groceries at five different grocery stores in case one of them has an e. coli outbreak that the other four don't?

Having multiple local grocery stores is a thing people want. And people do actually use them, because different stores have the best price or quality for different products.

> chances are you've at some point built a RAID array. Did you put different size and performance drives from different manufacturers into it or did you buy N of the same drive type to ensure even performance? If so, why?

These are spec differences, not supplier differences. There is no issue with using drives of the same size and speed from different manufacturers.

Also compare ZFS, which allows you to efficiently use unmatched drives for the same filesystem.

uhhhhhhh · on Jan 2, 2018

When you're building out your DC, it's a function of cost relative to performance/power use. I could see new setups looking at AMD over Intel, especially if they're running workloads impacted by the software fix, at least for the current generation of CPUs, maybe even the next 1-2 that are in the pipeline.

When you're scaling up/maintaining your DC, you're much more likely to be looking for single-SKU, like-for-like products that allow similar tooling, knowledge base, experience etc... Like you said, monoculture has its benefits in some situations.

Personally even with this bug I'd be very hesitant to switch. Our previous tests between them for very specific workloads showed our best cost/performance was with Intel over successive generations, and the scaling/tip over points were different. We have a combination of experience and knowledge around the existing arch and how our applications and workloads interact with it that involved a number of pain points that I'm not sure it's worth re-experiencing with another arch.

On the other hand, those running on non-bare metal, cloud based, auto-scaling/automated solutions that have a wider tolerance for individual app performance, are probably in a situation where they care less about this, but at the same time have little to no say in the arch they run on, that decision is left to the cloud providers they use.

just my 2 cents anyways.

shaklee3 · on Jan 3, 2018

Google has stated that they would move to the POWER9 if the performance claims live up to the hype. That has yet to be seen.

kev009 · on Jan 3, 2018

POWER has been better performing than Intel since basically 1990, though Intel's tick/tock cadence and trading blows in fab tech have kept things interesting. That shouldn't be surprising since POWER is ultra focused on the high end and Intel is fending off attacks from the low end and never had good long term thinking on the high end.

The reason every server isn't POWER is: ecosystem. For any random company, switching archs for anything less than a multiple factor gain is a daunting multi-generation proposition. For a hyperscaler like Google the bar is a lot lower but you need a compliant vendor that will do a lot of the long haul platform work. IBM's been trying to establish that for many years and is just about to pull it off. Supply chain is also important, hyperscalers have come to expect buying and building systems a certain way and IBM will now just sell chips or even the IP for you to fab yourself. And of course the total cost calculus: capex, and opex in the form of TDP, support burden.

Google _will_ be using P9 for GPU servers internally. The inflection point for them was I/O and memory bandwidth. So, paradigm shift was what was needed to turn a juggernaut.. and that is what adding a bunch of accelerators to your platform is. Intel has no good solution there.

shaklee3 · on Jan 3, 2018

Right. I think HPC will be the first to take on POWER9 since it has some huge advantages with CAPI and PCIe v4. Outside of that, it will take some run time to convince the larger cloud providers it's useful.

I believe POWER9 has the ability to be either big or little endian as well, so that helps for compatibility issues, and it's just a matter of whether your application can compile.

freeone3000 · on Jan 2, 2018

Why would this cause you to diversify? Long-term negative effects of a monoculture are not evenly distributed to purchasers. In fact, if you ran both AMD and Intel CPUs, you'd see application performance differences solely based on processor architecture. This makes application deployment planning way harder. At any given time, there's one CPU that should be purchased, and artificially introducing two "so they don't fail the same way" is bad, specifically because they won't fail the same way.

myrandomcomment · on Jan 3, 2018

It depends. There was a reason back in the day that if you were a telco you had phone switches from 2 providers, i.e. a DMSxxx and an ESSxxx. Another example would be how the big providers got screwed by the inability of Cisco to get their GSR working right without a few forklift upgrades (really, they were moved with a forklift). This opened the path for Juniper. For a long time the telcos moved to have one router from each so a nasty bug in one would not take them down. In a properly tooled setup you should be able to account for the load characteristics between AMD and Intel. Having 2 is safer than one.

Google is pushing both PowerCPU development as well as ARM. They seem to be able to sort for this just fine. You can write tools to sort the differences. You cannot write tools to fix a major HW issue.

Anyway, my 2 cents based on experience and history, for whatever the comments of a random person on the intertubes are worth.

phamilton · on Jan 3, 2018

Imagine you are AWS. You have a range of instance types with various performance characteristics. Having customers move from c5s to c4s is much better than customers moving from AWS to GCE.

JackFaker · on Jan 2, 2018

This is probably completely unrelated but apparently Intel's CEO sold a large amount of stock late last year. https://www.fool.com/investing/2017/12/19/intels-ceo-just-so...

amckinlay · on Jan 3, 2018

AMD is doing a lot of things right recently and they have a bright future. And after seeing this, apparently they have been doing things right longer than I thought.

SteveNuts · on Jan 2, 2018

What about this combined with the Intel ME stuff? I could see large cloud providers starting to at least think about switching to AMD

IgorPartola · on Jan 2, 2018

Large cloud providers don't make decisions emotionally. They'll take a "let's mitigate the ME stuff and buy best support + performance per dollar hardware possible" approach. They don't care much about the opinion of the outraged hackers.

fulafel · on Jan 2, 2018

Security track record is taken seriously by many big vendors.

rdtsc · on Jan 2, 2018

Usually mitigated by special incentives like "15% extra discount for next 2 years if you stay with us". Intel has enough cash and market presence to be able to do those deals.

At the same time AMD also has a golden opportunity for some PR and marketing.

IgorPartola · on Jan 2, 2018

Is it? That seems like a big broad claim. Again, after things like RowHammer, did anyone actually do anything differently in a way that affected stock prices?

fulafel · on Jan 3, 2018

Like everyone else, I sure would want to know what kinds of conversations have been going on around RowHammer and customers most affected by it, and system/DRAM vendors.

I'm not a believer in stock price as good indicator of anything, sorry to skip that part.

ComputerGuru · on Jan 2, 2018

AMD has their own ME equivalent in their recent CPUs.

IgorPartola · on Jan 2, 2018

ME is actually a (gasp) useful feature. The problem is with Intel's implementation of it: it's not open source, it can't be disabled, and it's buggy. Fix all three of those, and Intel's stock will go up.

zanny · on Jan 2, 2018

Do you really think enough people care about the ME / control of hardware in general / hardware that spies on you or is out of your control to influence the stock price of a company the size of Intel?

IgorPartola · on Jan 2, 2018

No. That's exactly what I'm saying. Most people don't care. Enterprise users do care because ME is useful for them. It's a feature, not a nefarious backdoor that the NSA made Intel include under the cover of darkness. They'll see this as a small problem that should be fixed and will ask Intel to do so. Intel will fix it, most everyone will move on. I don't think ME will take down Intel stock, and neither will this page isolation bug.

Intel's value is 99% engineering + manufacturing ability + customer relations. It would be a poor CEO indeed who'd direct their IT to start buying AMD because of this alone.

technion · on Jan 3, 2018

That's the narrative, but I've consulted to a lot of enterprises, and I've never once seen ME in use. Servers have hardware like HPE iLO, and desktops will use OS based agents. And failing that they'll use PXE boot and get rebuilt. The only discussion I've ever seen an Enterprise have about ME was the debate about how you deal with HPE's latest laptop security update.

myrandomcomment · on Jan 3, 2018

And the sad part is the current LOM stuff is not better (even somewhat worse) than the stuff on Sun gear from the late 90s. Oh well.

_m7bj · on Jan 2, 2018

>It's a feature, not a nefarious backdoor that the NSA made Intel include under the cover of darkness.

Let's be clear: It's both.

IgorPartola · on Jan 2, 2018

Is it? Again, that's a big claim to make in five words and drop the mic. Can you cite anything to back it up?

_m7bj · on Jan 2, 2018

If Intel weren't under pressure to keep a negative-ring network enabled snoopstack open by an external entity, they would by now definitely have released an update that allowed people to disable the networking aspect of IME.

Major system vendors are now offering to apply bootleg removal situations at the factory on customer request[1]. That request is not free. People are willing to /pay extra/ for no-IME laptops.

Either Intel's marketing and public relations departments are asleep at the wheel, or they've gone to the top to request a friendly switch to disable this and been told by the legal department that they can't have one.

[1]https://liliputing.com/2017/12/dell-also-sells-laptops-intel...

IgorPartola · on Jan 2, 2018

OK, but that's (a) 100% speculation and (b) fails Hanlon's razor.

I don't like the fact that you can't disable ME, that it's not open source, and that it's vulnerable any more than anyone else. But this does seem like hyperbole much more than fact.

_m7bj · on Jan 2, 2018

>OK, but that's (a) 100% speculation

95% speculation. The last 5% comes from exercising basic pattern recognition.

I remind you that we're probably talking about interference from the organization that arranged this:

https://arstechnica.com/tech-policy/2014/05/photos-of-an-nsa...

The existence of that program was pure speculation, until it turned out to be totally real.

>(b) fails Hanlon's razor.

This is completely irrelevant to any argument made between two informed participants. It's worse than speculation, it's a plea to glib colloquialisms. Any chance you've got evidence or even reasoned speculation supporting the theory that the world's most successful CPU manufacturer has an incompetent marketing department?

IgorPartola · on Jan 3, 2018

I think they have an incompetent management department that decided that not open sourcing ME is a good idea. Marketing may also be incompetent at picking up the pieces after the bugs were discovered.

> 95% speculation. The last 5% comes from exercising basic pattern recognition.

No, it's all speculation because pattern recognition is not evidence, as applied here. Like, is it possible that I am an NSA agent trying to persuade you that you are safe and shouldn't worry about ME? Of course it's possible. But do you have any evidence of that? No.

"Well, in the past the NSA has asked big companies for backdoors into their products" is a true statement with evidence. "That implies that in this case there is a 5% chance that is exactly what's happening" is 100% speculation because again there is no evidence. If you can find any, I am all ears because honestly I am not a fan of Intel, Intel ME, the NSA, government spying, big corporations taking advantage of consumers, or a number of other things I imagine you and I agree on. But I think I am being rational when I say that chances are this is a stupid bug or number of bugs, plus bad old school thinking on the part of the management team, and not a deliberate NSA feature.

Here is my bit of speculation: if the NSA asked Intel to include a backdoor, wouldn't they both have done a better job of creating it? Why introduce a bug when you can include whatever code you want in a closed source firmware? You can literally add any kind of C&C mechanism you want because nobody can see what you are doing and nobody would ever know. Is the NSA so stupid as to ask for a bug that can be found and exploited? Is Intel not able to offer a better technical solution? Wouldn't it be to both of their benefits to do this right from the start? Also, why only approach Intel and not AMD? AMD is not as popular but surely has enough market share to warrant spying on.

pdimitar · on Jan 3, 2018

You say "do you have proof?". But nobody can have proof beforehand. That's how these things always go -- something is done under cover and later (usually much later) somebody uncovers it and shows it to the world. Why do you ask for a proof that can't possibly be in the spotlight right now? Many historical facts have been denied and met with skepticism and mockery until they have been proven to be indeed facts. Why is this case different in your eyes?

Why aren't you viewing the possibility of intelligence agencies ordering the Intel ME as one of these future historical facts? If the proof for that became known today, both the agency and Intel would scramble to introduce a better backdoor in the next generation CPUs / MBs and devise a marketing campaign to make it sound good -- and to bash their former selves for "making a mistake" while simply thinking "OK, we're gonna cover it up much better this time and we're gonna twist it in such a way that people would flock to buy it". It's what marketing and spies do; they twist facts. Why is that so non-legit for you?

Furthermore, you're asking why didn't they do a better job if it was a conspiracy. People in closed circles aren't exposed to public criticism and their thinking is affected in the process. They usually think "meh, good enough, nobody will ever find it anyway". They are humans like you and I and are susceptible to bad days or negligence due to being tired. Furthermore, it's very likely they were under pressure to make it work quickly so they took shortcuts. What makes you think the programmers of the intelligence agencies have godlike powers over their (very likely) military superiors? Answer is, they don't. Programmers have no executive powers and their counsel is usually met with skepticism if it doesn't fit the management's agenda.

When talking about intelligence, our best bet is to do educated guesses. If we had hard facts we would be targets. As mentioned in another reply of mine directed at you -- it's their job to hide the facts. So you requesting proof of these matters is basically refuting all possibility of intelligence agency commission of the Intel ME on the grounds of "hey, you are not the next Edward Snowden so your arguments are invalid".

Meh. You come across as a guy who basically says "my speculation is better than yours". Not constructive.

IgorPartola · on Jan 3, 2018

Ok you lost me at “future historical fact”. Again that is a fancy way of saying pure speculation. No I don’t know for a fact that the NSA didn’t order Intel to build a buggy ME into all its processors. I can’t prove that it didn’t happen. And maybe your speculation will turn out to be right. I am arguing that my speculation that this was incompetence is significantly more likely to be correct than your speculation of conspiracy.

Your theory in the above comment is that the NSA or equivalent ordered Intel to build a C&C mechanism into their processors. Intel then did a perfect job covering up this request, but did a piss poor job of implementing it due to incompetence and has not managed to correct it for 10 years. There is no indication that this might be the case but because of other unsavory activities by the NSA or equivalent it can be assumed that at some point evidence will be uncovered that you are right and therefore we should accept it as fact. Do I have that right?

pdimitar · on Jan 3, 2018

Not exactly but almost. I am saying this is the most likely outcome.

Judging by other activities of the intelligence agencies and working with pure speculation -- not hiding from these words, you are correct in calling it that -- I still think it's much more likely they commissioned the Intel ME.

You mention critical thinking in another comment. Critical thinking, the way I apply it, also requires a historical context to be applied to the situation one is analyzing. Agencies have been doing pretty shady stuff and some of it has been uncovered for the entire world to see.

Critical thinking, the way I apply it, says that the odds are there is foul play. I merely wish you to recognize that this is a more likely scenario than a bunch of coincidences and/or people supposedly making the ME to serve data center sysadmins -- btw many of those sysadmins, including on several threads here in HN, said they never used the ME and named a plethora of other tools.

Obviously I am not trying to change the way you think in general. I believe we can both agree that none of us knows for sure. The human brain's strength is to work with many variables and be able to impose some order in the chaos by pattern recognition and using historical info. I am not gonna deny this can lead to people drawing awfully misguided conclusions sometimes -- and I've been guilty of that as well! -- but it's the best we have, especially having in mind what tiny imperfect brains we have to work with.

Everything I can name is circumstantial evidence. I accept that. It's the nature of the area. Intelligence data isn't easy to come by.

IgorPartola · on Jan 4, 2018

OK. And with that you are saying that you are basing this on 95% speculation and 5% pattern recognition with no direct evidence, and yet it's the most likely outcome.

And I am saying that the confidence interval on that calculation is just orders of magnitude not tight enough. I am not denying that you could be right. It's just that I am giving that possibility something like a 1% chance of being true, while something like 85% chance of this being pure incompetence by Intel management and engineers (the rest being some other explanation that's neither malice nor direct incompetence). I don't think you and I can find a common ground on this estimation.
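
To make those numbers concrete, here is a minimal sketch of how such an allocation would shift if real evidence turned up. Only the 1% / 85% / remainder split comes from the estimate above; the likelihood values are invented purely for illustration and stand for "how probable would this hypothetical piece of evidence be under each explanation".

```python
def bayes_update(priors, likelihoods):
    """Apply Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}

# Priors taken from the estimate above; "other" is simply the remainder.
priors = {"conspiracy": 0.01, "incompetence": 0.85, "other": 0.14}

# Hypothetical: evidence that is 50x more likely if there was an order
# than under either of the other explanations. These values are made up.
likelihoods = {"conspiracy": 0.50, "incompetence": 0.01, "other": 0.01}

posterior = bayes_update(priors, likelihoods)
for hypothesis, probability in posterior.items():
    print(f"{hypothesis}: {probability:.3f}")
```

With these made-up likelihoods, the 1% prior moves to roughly a third. Without evidence of that kind there is nothing to multiply in, and the allocation stays where it started.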

Again though, ME is a bad thing because it's not open source, it can't be turned off, and it's buggy. Regardless of who ordered its creation, it sucks.

_m7bj · on Jan 4, 2018

>And I am saying that the confidence interval on that calculation is just orders of magnitude not tight enough.

You're also saying, implicitly, that therefore we must default to assuming it is incompetence.

That link isn't a given. Stating that it is incompetence is also speculation, not some kind of universal backup truth.

However, when it comes to that last 5%, I assert that the historical data does not back a claim that Intel's marketing department is incompetent.

johnny22 · on Jan 3, 2018

you are the mvp of this post. thanks for keeping things rational.

kllrnohj · on Jan 3, 2018

Adding that option costs them money (engineering time, QA time, support issues resulting from it, etc...)

Until it's financially worth their while, why would they spend money on it?

pdimitar · on Jan 3, 2018

How can you request a citation about things relating to possible intelligence agencies efforts with a straight face? It's literally their job to make sure such material doesn't exist, or never sees the light of day if it does. It's not exactly publicly-funded science now, is it?

You request a proof that's impossible to procure. Are you now gonna claim the lack of this proof supports your thesis?

IgorPartola · on Jan 3, 2018

Yes? Because acting on pure unvarnished unburdened by critical thinking speculation is not a good idea?

Karunamon · on Jan 3, 2018

Critical thinking would demand recognition of the fact that intelligence agencies compromising security isn't a hypothetical anymore, it's a fact, and it would further demand intense skepticism of unauditable and hostile (resists attempts to disable it) code running below ring 0.

IgorPartola · on Jan 4, 2018

I never said they don't. Simply that in this case there is no evidence, direct or circumstantial, pointing to Intel ME being born out of an order by an intelligence agency. Could it be? Sure. But critical thinking demands facts, not speculation. Facts are:

1. Intelligence agencies have been known to force companies to give them access to their products.

2. Companies have been known to comply, if reluctantly, at least until a whistleblower exposes the program.

3. Intel ME was developed as an on-chip version of an external card that is actually useful.

4. Intel has made poorly engineered products before.

5. Intel isn't in a habit of open sourcing firmware.

6. From a technical standpoint, Intel is fully capable of creating a system that doesn't allow C&C through a bug and an exploit.

7. AMD, the second largest computer chip maker, does not have a matching system that can't be disabled and that has similar bugs.

Based on this, I'd say it's possible that the NSA (or equivalent) asked Intel to develop ME and add a bug to allow C&C, but very unlikely.

It's also possible that the NSA (or equivalent) asked Intel to develop ME and add C&C and Intel did it through a deliberate bug, but very unlikely.

It's also possible that Intel tried to develop a feature the market might want, and screwed up the implementation. This seems to me to be very likely. It's the simplest explanation (Occam's razor) and it requires only incompetence, not malice (Hanlon's razor), so it's sort of by default most likely.

If someone can produce an iota of evidence to the contrary I will change my allocation of probabilities appropriately, but so far the evidence is "it could have been done" and "they've been known to spy on people in the past". In my book that's not a strong enough argument.

boomboomsubban · on Jan 3, 2018

It creates a huge attack vector on most computers that the user has almost no control over. Even if Intel are completely uninvolved, some intelligence agency will try to exploit it.

JdeBP · on Jan 3, 2018

The claim at hand, however, was that the NSA made Intel include it.

boomboomsubban · on Jan 3, 2018

No, the claim being made is that ME is being added as a feature, with a hyperbolic version of the other argument tacked on. Whether they were forced to include it doesn't matter, the way they included it benefits the intelligence agencies.

uhhhhhhh · on Jan 2, 2018

If my recollection serves, when Intel had what was the largest/most expensive recall in the world in the 90's (at that time anyways), their stock still nearly doubled that year.

kuschku · on Jan 2, 2018

Yes, but on recent motherboards with recent AMD CPUs, a bios option to turn the PSP off has appeared, and support says that’s exactly what it does.

rdtsc · on Jan 2, 2018

I think there is a critical opportunity for AMD here to take this to the public and the media. Basically kick Intel while it is down. Intel will probably recover fine, but AMD shouldn't miss its chance either. Investors and such might pay attention to that and start selling INTC and buying AMD.

0x00000000 · on Jan 3, 2018

If the hit is as bad as they say (30% performance), cloud providers will be almost forced to upgrade when the new hardware comes out that fixes it. Are they really ready to adopt AMD? Go long on INTC?
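
For a sense of scale, a back-of-the-envelope sketch of what a per-machine slowdown means for fleet size. It assumes load spreads evenly and that the slowdown applies uniformly to every workload, which is almost certainly too pessimistic for real fleets; the 100-machine baseline is an arbitrary illustrative number.

```python
import math

def machines_needed(baseline_machines, throughput_loss):
    """Machines required to serve the same load after a per-machine slowdown.

    Each machine is assumed to keep (1 - throughput_loss) of its original
    throughput, so the fleet has to grow by a factor of 1 / (1 - loss).
    """
    if not 0 <= throughput_loss < 1:
        raise ValueError("throughput_loss must be in [0, 1)")
    return math.ceil(baseline_machines / (1 - throughput_loss))

# Arbitrary baseline of 100 machines with the 30% figure from above.
fleet_after_hit = machines_needed(100, 0.30)
print(fleet_after_hit)
```

Note that the growth factor is 1 / (1 - loss), not 1 + loss, so a 30% hit needs about 43% more machines rather than 30% more.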

asgioiobuio · on Jan 3, 2018

They could get AMD hardware that works today. We don't know when Intel will have working hardware. It will be at least months and possibly years. Processor design is a long process.

matwood · on Jan 3, 2018

I'm not sure AMD could even handle the volume...

dx034 · on Jan 4, 2018

I'm sure they'll be happy to produce 24/7 or raise some prices to take care of that.

rdtsc · on Jan 2, 2018

> I would say this could be an indicator to short INTC and long AMD...

I would say that too if I'd be waiting for everyone to sell so then I could buy INTC :-)

cloakandswagger · on Jan 3, 2018

Options are much better for playing these short-term, news related swings. This has the potential of being a good one, as INTC is at the peak of a bull run and this news doesn't seem to have hit mainstream sources yet.

cjbprime · on Jan 2, 2018

Are you sure that all Intel CPUs are affected? Might just be older ones.

dboreham · on Jan 2, 2018

The kernel code changes target all Intel.

simcop2387 · on Jan 2, 2018

Yep, and until the embargo about this is over, we won't know anything with any certainty. This has been one hell of a fun thing to watch from the outside. I run a small test-your-code service (for all versions of Perl) that could be affected by this so I'm really curious what the whole thing is.

dmitrygr · on Jan 3, 2018

All recent (< 6 years old) ones

dx034 · on Jan 4, 2018

I thought all since 1995?

artellectual · on Jan 3, 2018

Essentially looks like Intel compromised (whether intentional or not is a different point) the design to get the speed boost that gave them the lead over AMD for the past decade. Will be interesting to see how all this plays out.

jchw · on Jan 3, 2018

Other than leaking timing information though, is there any reason why this kind of speculative execution can't be secure? Apparently we're going to find out more in the coming weeks, but it feels strongly like Intel has made a number of mistakes leading up to this.

rootlocus · on Jan 3, 2018

> Essentially looks like Intel compromised (whether intentional or not is a different point)

If it wasn't intentional, then it wasn't a compromise. So it's not a different point.

CRConrad · on Jan 3, 2018

"To compromise" means "to weaken" or "to endanger", not "to make _a_ compromise". To make a compromise is an intentional act, but you can compromise (e.g. the security of) something by sloppiness. So yes, it is a different point.

(Yeah, I know, don't blame me. English _is_ weird.)

bhouston · on Jan 3, 2018

What chip exactly introduced this feature?

Core 2 architecture? Nehalem?

artellectual · on Jan 3, 2018

> It is understood the bug is present in modern Intel processors produced in the past decade.

Source: https://www.theregister.co.uk/2018/01/02/intel_cpu_design_fl...

dx034 · on Jan 4, 2018

Arstechnica says it could be any processor since 1995.

https://arstechnica.com/gadgets/2018/01/whats-behind-the-int...

mindcrash · on Jan 2, 2018

So first they bring a DLC concept ("unlock features by spending money") to their enthusiast platform, and now this?

Having a hunch Threadripper will sell extremely well amongst PC enthusiasts this year...

rrdharan · on Jan 2, 2018

I'm curious what you are referring to re: the DLC concept? Did you mean this thing?

https://en.wikipedia.org/wiki/Intel_Upgrade_Service

Seems like that was discontinued a long time ago (2011) so was wondering if there was something more recent that happened?

brians · on Jan 2, 2018

IIRC, this fall’s i9 chips and the motherboards supporting them have software-unlockable features. You can literally buy more PCI lanes.

Which is another way of saying you had those lanes, and Intel wanted more money before letting you use what you’d already bought.

unethical_ban · on Jan 2, 2018

AMD's triple core processors were quads with disabled cores. Oftentimes processors within a line are the same processors with manually set lower clock multipliers or disabled cache.

Sounds like Intel has just made it unlockable instead of permanent. It just brings to the fore what was already being done, and makes us question again the ethics of pricing models.

yarg · on Jan 3, 2018

This is generally the result of segregating defective parts of the chip in order to create a stable (albeit less powerful) chip.

Some of the chips will be fully capable of running with all parts enabled, but in a higher power envelope (this is a guess - but I believe that the fully capable chips most likely to be sacrificed are those that have trouble fitting in the ideal power envelope).

I would also imagine that (under some circumstances) chips that are fully functional within the expected power envelope will be artificially limited in order to control levels of stock.

The vast majority of chips that are limited in this way will be out of spec, unstable or inoperable when unlocked.

jhasse · on Jan 2, 2018

You can't unlock all triple cores to quad cores though. It's called binning and all chip manufacturers do it.

The chipset unlock thing is different as there's no technical reason to lock it in the first place.

masklinn · on Jan 3, 2018

> AMD's triple core processors were quads with disabled cores.

That's binning & price discrimination; Intel did the same (with quads v dual IIRC): if you have a defective core, you gate it and sell a 2/3 core instead of a quad. Of course the issue is when the low bin becomes too popular and you have to start low-binning "perfect" parts to keep supplies acceptable (used to be very common for Intel starting ~mid-cycle; they'd literally run out of defects, which is why their low-end CPUs had such good performance & were ridiculously overclockable).
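
A minimal Python sketch of that binning logic, under stated assumptions: the SKU labels, the 4-core die, and the `bin_batch` function are all made up for illustration and do not describe any vendor's actual process.

```python
def bin_batch(working_core_counts, low_bin_demand):
    """Assign a SKU label to each die of a hypothetical 4-core design.

    working_core_counts: number of cores that passed testing on each die.
    low_bin_demand: how many triple-core parts need to ship.
    """
    skus = []
    for cores in working_core_counts:
        if cores == 4:
            skus.append("quad")
        elif cores == 3:
            # One defective core: gate it and sell the die as a triple.
            skus.append("triple")
        else:
            skus.append("scrap")

    # If defects alone cannot cover demand for the low bin,
    # gate a core on fully working dies to make up the shortfall.
    shortfall = low_bin_demand - skus.count("triple")
    for i, sku in enumerate(skus):
        if shortfall <= 0:
            break
        if sku == "quad":
            skus[i] = "triple-downbinned"
            shortfall -= 1
    return skus


print(bin_batch([4, 4, 3, 2], low_bin_demand=2))
```

Only the "triple-downbinned" parts in this toy model would unlock cleanly, which is the case jhasse is pointing at above.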

manmal · on Jan 2, 2018

Nvidia’s GeForce cards could be converted to Quadro cards by opening a chip and adding some lines with a pencil. Don’t think that this still works, but a colleague of my father did it for his home PC.

eropple · on Jan 3, 2018

One line, AFAIK. It was just a trace.

However, that (and the later software modification) could both hamper performance in games and could exhibit correctness problems in accuracy-focused use cases, so it was rarely a great idea.

exDM69 · on Jan 3, 2018

No, that was just spoofing the PCI VID:PID to the kernel. It did not enable hardware features, just fooled the driver into thinking it was another device. You could do the same with a patched kernel if you don't want to solder.
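
A rough Python sketch of why an ID spoof changes driver behaviour but not the hardware. Everything here is an assumption for illustration: the device IDs, the profile names, and the `Card` type are invented (only the vendor ID 0x10DE is Nvidia's real PCI vendor ID), and this is not how any real driver is structured.

```python
from dataclasses import dataclass, replace

# Hypothetical driver table: (vendor ID, device ID) -> feature profile.
DRIVER_PROFILES = {
    (0x10DE, 0x0001): "consumer",     # made-up device ID
    (0x10DE, 0x0002): "workstation",  # made-up device ID
}


@dataclass(frozen=True)
class Card:
    vendor_id: int
    device_id: int
    silicon_features: frozenset  # what the hardware can actually do


def select_profile(card):
    """The driver picks a profile purely from the reported IDs."""
    return DRIVER_PROFILES.get((card.vendor_id, card.device_id), "unsupported")


def spoof_device_id(card, new_device_id):
    """Change the reported device ID; the silicon is left untouched."""
    return replace(card, device_id=new_device_id)


card = Card(0x10DE, 0x0001, frozenset({"3d"}))
spoofed = spoof_device_id(card, 0x0002)
print(select_profile(card), select_profile(spoofed))
print(spoofed.silicon_features == card.silicon_features)
```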

floatboth · on Jan 2, 2018

Probably the paid hardware RAID unlock key.

amluto · on Jan 2, 2018

Has Intel ever had hardware RAID? They have firmware RAID, but that's quite different.

keltor · on Jan 2, 2018

They actually DID have hardware RAID controllers, but not like you're talking about.

blattimwind · on Jan 2, 2018

Though that's not news to anyone who ever laid their hand on an Intel server board.

api · on Jan 2, 2018

At the meta level this is just a special case of "complexity is evil" in security. CPUs have been getting more and more complex, and the relationship between complexity and bugs (of all types) is exponential. Each new CPU feature exponentially increases the likelihood of errata.

A major underlying cause is that we're doing things in hardware that ought to be done in software. We really need to stop shipping software as native blobs and start shipping it as pseudocode, allowing the OS to manage native execution. This would allow the kernel and OS to do tons and tons of stuff the CPU currently does: process isolation, virtualization, much or perhaps even all address remapping, handling virtual memory, etc. CPUs could just present a flat 64-bit address space and run code in it.

These chips would be faster, simpler, cheaper, and more power efficient. It would also make CPU architectures easier to change. Going from x64 to ARM or RISC-V would be a matter of porting the kernel and core OS only.

Unfortunately nobody's ever really gone there. The major problem with Java and .NET is that they try to do way too much at once and solve too many problems in one layer. They're also too far abstracted from the hardware, imposing an "impedance mismatch" performance penalty. (Though this penalty is minimal for most apps.)

What we need is a binary format with a thin (not overly abstracted) pseudocode that closely models the processor. OSes could lazily compile these binaries and cache them, eliminating JIT program launch overhead except on first launch or code change. If the pseudocode contained rich vectorization instructions, etc., then there would not be much if any performance cost. In fact performance might be better since the lazy AOT compiler could apply CPU model specific optimizations and always use the latest CPU features for all programs.
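
A minimal Python sketch of the compile-once-and-cache idea, assuming a hypothetical `compile_fn` that turns portable code into native code for a given CPU model. The class name, the cache key, and the in-memory dict are illustrative choices; a real OS would presumably persist the cache on disk.

```python
import hashlib


class CompileCache:
    """Cache native code keyed by (hash of portable code, CPU model)."""

    def __init__(self, cpu_model, compile_fn):
        self.cpu_model = cpu_model
        self.compile_fn = compile_fn  # hypothetical AOT compiler
        self._cache = {}
        self.compile_count = 0

    def get_native(self, portable_code):
        digest = hashlib.sha256(portable_code).hexdigest()
        key = (digest, self.cpu_model)
        if key not in self._cache:
            # First launch, or the code changed: pay the compile cost once.
            self._cache[key] = self.compile_fn(portable_code, self.cpu_model)
            self.compile_count += 1
        return self._cache[key]


def fake_compile(portable_code, cpu_model):
    """Stand-in for a real compiler; just tags the input."""
    return f"native[{cpu_model}]:".encode() + portable_code


cache = CompileCache("cpu-model-a", fake_compile)
cache.get_native(b"program v1")  # compiles
cache.get_native(b"program v1")  # served from cache
cache.get_native(b"program v2")  # code changed, compiles again
print(cache.compile_count)
```

Keying on the CPU model as well as the code hash is what would let the compiler target model-specific features without serving stale binaries after a hardware change.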

Instead we've bloated the processor to keep supporting 1970s operating systems and program delivery paradigms.

It's such an obvious thing I'm really surprised nobody's done it. Maybe there's a perverse hardware platform lock-in incentive at work.

titzer · on Jan 2, 2018

A lot of these ideas were in the back of our heads in designing WebAssembly, but to keep expectations low, we don't make too much noise about them. However I personally believe that we are on the right track with WASM and am very excited about the future!

kps · on Jan 2, 2018

> It's such an obvious thing I'm really surprised nobody's done it.

IBM AS/400 for about 30 years now.

cr0sh · on Jan 2, 2018

It also made me think of PICK (and PICK cpu hardware implementations); though I never learned enough about the internals of PICK when I last used it 20+ years ago (so I could be wildly off-base).

gecko · on Jan 2, 2018

Tao/Intent/Elate (which I think is defunct nowadays) would also qualify, and I'd argue .NET on Windows with the GAC would, too (although there'll be a legitimate argument about whether that's "simple and closely models the processor").

pm215 · on Jan 2, 2018

Tao is long defunct, yes (went under a decade ago). It turns out that people don't really want a runtime-portable OS/apps (IIRC the biggest takeup it got was as a Java runtime for mobile, because the competition at that time was all interpreted). There was no security model in VP, though -- single flat address space and bytecode could turn any integer into a pointer and dereference it (loads just got translated into host cpu load instructions), so there was no isolation between processes or between processes and the os.

puzzle · on Jan 3, 2018

AS/400 and descendants have a security model, but they rely at least partially on a trusted runtime code generator (and, transitively, trusted boot). The systems have HW assist to tag real pointers, but that's mainly for performance reasons. Pointer validity checks are performed in software (or they were until ten years ago), automatically inserted by the bytecode translator. If you subverted the code generator, your malicious code could get a bit further by forging pointers.
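
A toy Python model of that trust relationship, not a description of the real AS/400 internals: the `Pointer` type stands in for a tagged pointer, `MEMORY` for the flat address space, and `translate_load` for the code generator. All of these names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    """Stand-in for a tagged pointer that only the runtime hands out."""
    address: int


MEMORY = {0x10: 42}  # toy flat address space


def translate_load(trusted=True):
    """Return the load routine a (trusted or subverted) translator emits."""

    def checked_load(value):
        # The validity check is inserted in software by the translator.
        if not isinstance(value, Pointer):
            raise PermissionError("not a valid pointer")
        return MEMORY[value.address]

    def unchecked_load(value):
        # A subverted translator omits the check, so any integer
        # can be treated as an address and dereferenced.
        address = value.address if isinstance(value, Pointer) else value
        return MEMORY[address]

    return checked_load if trusted else unchecked_load


load = translate_load(trusted=True)
print(load(Pointer(0x10)))
forged = translate_load(trusted=False)
print(forged(0x10))
```

The unchecked path is also roughly the situation pm215 describes for VP, where any integer could be turned into a pointer.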

sedachv · on Jan 2, 2018

> We really need to stop shipping software as native blobs and start shipping it as pseudocode, allowing the OS to manage native execution.

What we really need to do is to start shipping all software as source code. This is exactly what JavaScript does, and why it is the most successful method of software distribution ever. WebAssembly is a huge step backward.