
This feels like a big FU to Intel. I've heard this patch can slow down programs like du by 50%. Does that mean AMD is going to find itself running twice as fast as competitors?



I think the du case was an outlier. Normal workloads shouldn't be so heavily affected, though I am expecting a few percent loss on most programs. It's basically a larger penalty for making a syscall, which was already a fairly slow operation, so performance-minded people avoid them in tight loops. I suspect it will be bad for people who need to do lots of fast I/O.


Postgres and Redis are looking at a 20-25% performance hit.

https://www.phoronix.com/scan.php?page=article&item=linux-41...


Some portable/embedded databases come to mind. Also "normal" databases doing replication initiation and re-sync. And lastly backup, restore, tar, etc. with small files. For files only a handful of pages long, mmap() isn't a big gain.

Another syscall that might cause issues is gettimeofday(). That particular call has been optimised to the nth degree, and lots of user programs spam the crap out of it (mostly out of necessity), especially networking and streaming programs. It would be interesting to see how much overhead, percentage-wise, page table isolation will cost, and its effect on low-end media devices, et al.


Linux gettimeofday these days is implemented in the 'vdso', which is code provided by the kernel that runs solely in userspace. So it's not a syscall in the 'privilege level switch by executing an insn that takes an exception' sense, and it shouldn't be affected by the syscall-entry/exit path becoming more expensive.
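A quick way to see the distinction on your own box - a rough sketch, assuming glibc on x86-64, where gettimeofday() is routed through the vDSO while syscall() forces a real kernel entry:

    /* gettimeofday() resolves entirely in userspace via the vDSO; the
       syscall() variant bypasses it and pays the full kernel entry/exit
       cost, which is what page table isolation makes more expensive. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);               /* vDSO fast path, no trap */
        printf("vdso:    %ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);

        syscall(SYS_gettimeofday, &tv, NULL);  /* forced kernel entry */
        printf("syscall: %ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
        return 0;
    }

Time each variant in a loop and the gap should widen noticeably on a patched kernel.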


Does this also apply to vsyscalls being emulated? Will this mean that older static binaries will no longer run, or just suffer a penalty as well?


The Linux kernel has a compatibility guarantee for the user-space-visible API, so static binaries will continue to run. If the static binary is so old that it does not know about the vDSO and uses a regular syscall to query the time, it will be slowed down.


Emulated vsyscalls were already very slow and will be even slower on a patched kernel. They'll still work, though.


This topic got me to look at the vDSO on my own machine, so I wrote a short Perl program to dump your kernel's vDSO.

https://pastebin.com/UnQX5U1f
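For anyone who'd rather do it in C: the kernel hands every process the vDSO's load address via the ELF auxiliary vector, so a minimal dumper looks roughly like the sketch below. The 2-page size is an assumption that happens to hold on typical x86-64 kernels; check /proc/self/maps (or parse the ELF headers) for the real extent.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/auxv.h>

    int main(void)
    {
        /* AT_SYSINFO_EHDR points at the vDSO's in-memory ELF header. */
        unsigned char *vdso = (unsigned char *)getauxval(AT_SYSINFO_EHDR);
        if (!vdso) {
            fprintf(stderr, "no vDSO?\n");
            return 1;
        }

        FILE *out = fopen("vdso.so", "wb");
        if (!out)
            return 1;
        fwrite(vdso, 1, 2 * 4096, out);  /* assumed size, see above */
        fclose(out);
        return 0;
    }

The result disassembles like any other shared object.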


I added a quick repo on GitHub if you want to clone it or send pull requests for architectures apart from 64-bit LE.

https://github.com/mattkeenan/dump-vdso


vDSO-based timing is unaffected.


Isn't every (edit: contended) mutex/etc. wait operation a syscall? That's gotta hurt for any program that waits for frequent events that don't take too long to process.


They are only system calls when you need to wait or wake up processes.

https://en.m.wikipedia.org/wiki/Futex


I was assuming contention but I guess I wasn't clear, sorry. I updated the post. But saying this "only" occurs when there is contention is very misleading since it makes it seem like the scenario of lock contention is a negligible concern. It's not.


Thread-suspending contended mutexes are already extremely slow. If you have a heavily-contended mutex you already have a major performance bug. If this is the kick in the pants you need to go fix it that's arguably a good thing ;)

Note that mutex contention does not itself mean immediately falling back to futex - commonly you'll spin-loop first and hope that resolves the contention (fast), then fall back to futex (slow).
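A minimal sketch of that spin-then-futex pattern (not a production mutex - a real one, e.g. per Drepper's futex paper, tracks a waiter state so uncontended unlocks can skip the wake syscall; SPIN_LIMIT and the 0/1 protocol here are illustrative choices):

    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_LIMIT 100          /* arbitrary illustrative bound */

    static atomic_int lock_word;    /* 0 = unlocked, 1 = locked */

    static void lock(void)
    {
        for (;;) {
            /* Fast path: pure userspace, no kernel entry. */
            for (int i = 0; i < SPIN_LIMIT; i++) {
                int expected = 0;
                if (atomic_compare_exchange_weak(&lock_word, &expected, 1))
                    return;
            }
            /* Slow path: sleep in the kernel until the word changes. */
            syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
        }
    }

    static void unlock(void)
    {
        atomic_store(&lock_word, 0);
        /* Simplification: always wakes; see the note about waiter state. */
        syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

Only the slow path pays the (now larger) kernel entry cost.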


> If you have a heavily-contended mutex you already have a major performance bug.

I can't really devote time to countering the unfounded assertion that every contended mutex must be a bug. It certainly isn't consistent with my experience, but if every problem you've solved could have been parallelized infinitely without increasing lock contention, more power to you.


> I can't really devote time to countering the unfounded assertion that every contended mutex must be a bug.

Good, because that's not what I said. If you're heavily hitting futex contention you do have a performance bug, though. You might be confusing that with general contention that's being resolved with a spinlock rather than a futex wait.


>> I can't really devote time to countering the unfounded assertion that every contended mutex must be a bug.

> Good, because that's not what I said.

It is literally what you said:

>>> If you have a heavily-contended mutex you already have a major performance bug. If this is the kick in the pants you need to go fix it that's arguably a good thing ;)

> You might be confused with general contention that's being resolved with a spinlock rather than futex wait, though.

I'm not confusing them at all; I'm literally reading exactly what you wrote. You literally said contended mutexes are necessarily bugs (right here^) and that you considered mutexes to include the initial spinlocks ("note that mutex contention does not itself mean immediately falling back to futex - commonly you'll spinloop first"). But maybe you meant to say something else?


He said "heavily contended", and then you dropped the "heavily" prefix and claimed that was literally what he said. That adverb is material to the discussion and your dropping it completely changes the meaning.

I concur with his opinion. Infrequent contention is not a bug; otherwise no mutex is needed. Frequent contention (or heavy contention in his words) is a performance bug.


> He said "heavily contended", and then you dropped the "heavily" prefix and claimed that was literally what he said. That adverb is material to the discussion and your dropping it completely changes the meaning.

"Heavily" was not dropped intentionally at all. Add it back to my comments. It changes nothing whatsoever. The incredible opinion that every problem can be necessarily parallelized without eventually resulting in contention (and I license you to freely modify this term with 'light', 'heavy', 'medium-rare', 'salted', 'peppered', or 'grilled at 450F' to your taste) is so fantastically absurd that I cannot believe you are debating it. I definitely don't know how you can justify such an unfounded claim with no evidence and I certainly have no interest in wasting time debating it. As I said earlier: if you never encounter problems that exhibit eventual scalability limits, more power to you.


I literally said spin loop resolving is fast. Maybe read more than the single phrase you're pulling out of context to go on a rant about?


Let's put it this way. If every contended mutex were a bug, why not remove the mutex and let the code run as-is? No, you wouldn't, so no, not a bug.


> Let's put it this way. If every contended mutex were a bug, why not remove the mutex and let the code run as-is? No, you wouldn't, so no, not a bug.

I mean, the parent's argument is wrong, but it isn't that naive. Presumably the argument is that a bad (yet still correct) solution would result in lock contention, while a better solution would e.g. use a different algorithm that is more parallelizable.


Ideally a mutex is just a cmpxchg. It gets more expensive when it is contended. See the Drepper paper on futexes:

http://www.akkadia.org/drepper/futex.pdf


Thanks, yeah, someone already mentioned this and I already edited in "contended" to clarify. I was actually already aware of futexes (thanks for the link though, I've never actually read the paper), but I was assuming contention -- the "every" referred to every type of operation, not every instance. See my reply to the sibling comment regarding lock contention.


Sounds like servers handling lots of small UDP packets would be hit pretty hard.


Applications like this, where syscall overhead starts to be a significant factor in processing time and latency, have moved to userland drivers anyway:

DPDK for 10-100 Gbps networking: https://dpdk.org/

SPDK for NVMe storage: http://www.spdk.io/

The queuing and balancing stuff the kernel does makes sense for spinning-rust hard disks and residential networking, but when the underlying hardware is so fast that nothing is ever queued, really, what are you doing? At 100 Gbps line speed, a 1518-byte packet takes all of ~120 ns to transmit, or about 360 clock cycles for a 3 GHz processor.


Taking control over network cards in user space seems doable nowadays. There was a talk about doing such drivers with IOMMU/DMA at CCC: https://media.ccc.de/v/34c3-9159-demystifying_network_cards


User-space drivers have been doable for a while, and dpdk[1] is definitely worth a check. There are also some manufacturers[2] that only do user-space drivers for their high-performance cards (e.g. 4x10Gb/s, 2x40Gb/s, 2x100Gb/s cards). Being designed with this in mind helps performance a lot.

1: https://dpdk.org/ 2: http://www.napatech.com/


> Applications like this where the syscall overhead (and latency) starts to be a significant factor in processing time and latency have moved to userland drivers anyway:

I would personally think that is worse, though please correct me if I'm wrong. The userland driver will run with an isolated page table like any other userland process, won't it? If so, it will suffer the same slowdown that every other process now has every time it has to communicate with the kernel, which I would think would be a lot for a driver.


It's counter-intuitive at first, but the key to understanding how this works is that while you can use an MMU to assign chunks of physical memory to a process, you can of course also just use the MMU to assign the memory-mapped IO registers of, say, a PCI Express peripheral to a process.

That is in a nutshell what a "userland driver" is. It's not too far removed from poking the parallel port at 0x378 on your DOS computer :)
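To make that concrete, here's roughly what it looks like on Linux - a sketch only: the device address is a placeholder (find yours with lspci), 4096 is an assumed BAR size, you'd normally unbind the kernel driver first, and real frameworks like DPDK/VFIO handle IOMMU and interrupt setup for you:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical device path; sysfs exposes each PCI BAR as a
           resourceN file that can be mapped. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* After this mmap, device registers read and write like plain
           memory: no syscall, hence no PTI penalty, per access. */
        volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        printf("register 0: %#x\n", (unsigned)regs[0]);
        munmap((void *)regs, 4096);
        close(fd);
        return 0;
    }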


A user-space driver doesn't communicate with the kernel. It is assigned DMA buffers, and communicates with the NIC solely through reading and writing to shared memory buffers.

Even before this fix, the benefits were massive, as sending a buffer was just writing to some memory, rather than syscalls and copies galore.


Reduce number of syscalls by using sendmmsg()/recvmmsg() to batch together multiple packets per syscall?
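Concretely, something like this sketch - it assumes an already-bound UDP socket, and the batch/buffer sizes are arbitrary:

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH   64
    #define PKT_MAX 1500

    /* Drain up to BATCH datagrams with a single kernel entry instead of
       one recvfrom() per packet. Returns the number received, or -1. */
    int drain(int sock)
    {
        static char bufs[BATCH][PKT_MAX];
        struct mmsghdr msgs[BATCH];
        struct iovec iovs[BATCH];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iovs[i].iov_base = bufs[i];
            iovs[i].iov_len  = PKT_MAX;
            msgs[i].msg_hdr.msg_iov    = &iovs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        return recvmmsg(sock, msgs, BATCH, MSG_DONTWAIT, NULL);
    }

This amortizes the (now pricier) entry/exit cost across the whole batch.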


Anyone run any DNS server benchmarks, esp BIND and PowerDNS?


This may end up doing some significant hurt to some build and testing pipelines.


It sounds like databases on very fast storage will be hit. Tons of syscalls are made for disk I/O and network I/O.


Databases are already written to minimise syscalls, like any other heavily performance-tuned system.


Applications that already put in an effort to minimize syscalls are the ones that most likely depend on syscall performance.


Unlikely. It's relatively easy to get to the point where syscalls aren't the bottleneck by a long margin, so the only apps where syscalls are the bottleneck will be those that haven't put in any optimisation effort.


I think syscalls are not as slow as many people imagine, especially with modern CPUs and kernels (there are special instructions for syscalls that are faster than the old "interrupt" approach). See here: http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemil... ("Off-topic: A system call"). But they will be slow with this mitigation.
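If you want a ballpark number on your own machine, a crude before/after measurement looks like this sketch (glibc historically cached getpid(), which is why it goes through syscall() directly):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #define N 1000000

    int main(void)
    {
        struct timespec a, b;

        clock_gettime(CLOCK_MONOTONIC, &a);   /* vDSO, cheap */
        for (int i = 0; i < N; i++)
            syscall(SYS_getpid);              /* near-minimal real syscall */
        clock_gettime(CLOCK_MONOTONIC, &b);

        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("%.1f ns per syscall round trip\n", ns / N);
        return 0;
    }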


"The overhead was measured to be 0.28% according to KAISER's original authors,[2] but roughly 5% for most workloads by a Linux developer.[1]" [1] = https://lwn.net/Articles/738975/

Though the patches have evolved since then, so I guess we'll see.


I believe the 0.28% is only for CPUs that support PCID. Earlier CPUs (still a lot of them) will take a much harder hit, since you'll have to flush the entire TLB.


Even with PCID, the hit is 29% in a tight syscall loop, but it's a complete disaster - above 50% - with PCID off...


In fact, the tight syscall loop isn't even necessarily the worst case: the primary cost of this change isn't a direct cost in the syscall, but the CR3 switch, which invalidates the TLB and incurs an ongoing cost for some time following the syscall.

The worst case would be something like a frequent syscall followed by code that touches a number of distinct pages, which all now require a TLB reload and page walk (even here the cost is tricky to evaluate, since there are various levels where the paging structures can be cached beyond the TLB, so the cost of a page walk varies a lot depending on the locality of the paging structures used).


PCIDs avoid the TLB invalidation.


Yup, I think so.

In that case (PCID hardware on a PCID-enabled kernel), the performance effect should be more limited to the syscall itself. That said, why is the hit still so big with PCID? Surely just the CR3-swap by itself shouldn't be so slow?


MOV-to-CR3 is pretty slow, yes. In the ballpark of a hundred clock cycles, and you have to do 2 of them. The cost of a system call was about 1000 cycles, maybe less on newer processors---both OSes and processors optimize the hell out of SYSCALL/SYSRET.


Which Intel CPUs support PCID? Based on this Linus message, it was introduced in 2015, so Broadwell and later?

http://lkml.iu.edu/hypermail/linux/kernel/1504.3/02961.html


PCID's been around for a long time - even my old Westmere Xeons have it. INVPCID is more recent.


Yes. AMD didn't take shortcuts, and implemented the spec correctly. Intel took shortcuts and introduced bugs, and now that the OS has to work around them in software, it's going to be slow. For years Intel has reaped the benefits of shortcuts for performance while AMD has been implementing things correctly; now there is a correction.

That's how the market works.


AMD doesn't exactly do an amazing job of avoiding gotchas in their CPUs. They have a bizarre idea of what writing zero to a segment register should do (resulting in info leaks that were only recently fixed on Linux), their demented leaky IRET is even more demented than Intel's, and their SYSRET's handling of SS is downright nutty.

OTOH, Intel's SYSRET is actively dangerous and has resulted in severe security holes, and Intel doesn't appear to acknowledge that their design is a mistake or that it should be fixed.


Can you post a few links maybe to the SYSRET issue mentioned? Just curious.


SYSRET on Intel will fault with #GP if the kernel tries to go to a noncanonical user RIP. The #GP comes from kernel mode but with the user RSP. Before SMAP, this was an easy root if it happened. With SMAP, it's still pretty bad. AMD CPUs instead allow SYSRET to succeed and send #PF afterwards, which is very safe.

AMD CPUs are differently dumb. If SYSRET is issued while SS=0, then the SS register ends up in a bogus state in which it appears to contain the correct value but 32-bit stack access fails. Search the Linux kernel for "SYSRET_SS_ATTRS" for the workaround.



The 50% figure is from a benchmark that didn't run on an Intel CPU!


    - /* Assume for now that ALL x86 CPUs are insecure */
Before this patch the mitigation was active for all vendors.
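If I remember the patch correctly (so treat the exact form as approximate), the change just gates that line on the vendor check:

    - /* Assume for now that ALL x86 CPUs are insecure */
    - setup_force_cpu_bug(X86_BUG_CPU_INSECURE);
    + if (c->x86_vendor != X86_VENDOR_AMD)
    +     setup_force_cpu_bug(X86_BUG_CPU_INSECURE);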


The text as written only seeks to defend AMD's product. Whether the subtext goes further is open to non-objective speculation. Having said that, I'm sure AMD is feeling pretty happy with their statement. Schadenfreude may be too long a bow...


What are you talking about? Every single other CPU vendor with an MMU - ARM, s390, SPARC, ... - either has separate page translation table registers for kernel and user space, or, like AMD, properly checks memory page permissions; only Intel doesn't. It's very clear who is at fault here.


Not ARM. You're confusing kernel mode with kernel addresses.


Nope - read the paper, read the patches. Only Intel is affected. ARM has two such registers, TTBR0 and TTBR1. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....


I think you're still misunderstanding. The CPU picks TTBR0 or TTBR1 based on the most significant bit of the VA, irrespective of whether the access was initiated by user or kernel code. This is in contrast to s390, which has separate page tables for user mode and kernel mode. I personally much prefer s390's model.

And yes, I've read quite a few papers, and I wrote a good fraction of the patches.


Oh, thanks. My fault then. I haven't read the arm patches, only the summary.


I vaguely remember some threads from last decade where Linus trashed PowerPC and s390 TLBs. I wish I could find them and reread with this in mind.



My response was about whether AMD was partaking in schadenfreude, in reference to the OP's statement that this was "a big FU to Intel". I wasn't making a statement about who else was affected and/or who was at fault.


Hmm, I was wondering about the performance hit, and I kind of miss performance details in the report, as I consider them significant enough to report.




