Hacker News
Intel disables Hardware Lock Elision on all current CPUs (kernel.org)
324 points by my123 on Nov 14, 2019 | 102 comments



The title "Intel disables hardware lock elision on all current CPUs" seems too broad. Intel is disabling the backward-ISA-compatible implicit HLE capability. I can't remember exactly how it worked without looking it up, but IIRC it was a hack that leveraged existing cache-coherency and ISA semantics to permit optimized spin-lock implementations that still worked correctly on older chips, without needing feature detection.

Explicit TSX instructions are now merely opt-out, where previously they couldn't be disabled by the kernel. The posted patch doesn't seem to disable it but simply adds some of the bits needed to do so.

EDIT: Okay, I guess it is disabling Hardware Lock Elision (HLE), proper. But "hardware lock elision", lower case, implies all TSX-based lock elision code will stop being optimized. AFAIU RTM (explicit TSX) is usually only used for lock elision anyhow; it's more of a microcode and speculation hack than a generic transactional memory feature, so the dividends quickly become elusive for anything beyond lock elision.


> The other TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally disabled by the new microcode but still enumerated as present by CPUID(EAX=7).EBX{bit4}, unless disabled by IA32_TSX_CTRL_MSR[1] - TSX_CTRL_CPUID_CLEAR.

Disabling HLE means the code still works, because of the backward ISA compatibility... but you still lose the performance benefits of HLE.

Also, disabling it but still keeping it shown in CPUID...


So if I'm understanding this correctly, xacquire/xrelease now unconditionally fall back to being interpreted as the (useless) repne/repe prefixes, even on chips that "support" HLE?


Yes...


Is anything known about how much performance will be lost?



Why would Array.Sort have locking in need of elision?

[I'm reminded of the old Java "Vector" or "StringBuffer" that put a lock around every operation "because thread safety is good" which had high cost for single threaded usage. But this is likely not what this is about.]


It’s a different bug that requires extra padding for branching instructions.


I was just quoting the source :)


Well, judging from the Phoronix benchmarks, performance is worse than just outright disabling TSX in its entirety: https://www.phoronix.com/scan.php?page=article&item=zombielo... (Though they show basically no performance benefit from TSX, so...)


I concur. I wondered whether their patch hits AMD CPUs as well. Unlikely, but not entirely impossible, at least to me, after hearing the news about Linus "commenting" on their practice of creating problems for everyone.


As far as I know, HLE is not implemented on AMD processors.

However, part of the fix for JCC (https://www.intel.com/content/dam/support/us/en/documents/pr... ) to recover the performance loss will make other CPUs slower if separate executables aren't used for unaffected CPUs... which they won't.

> The MCU prevents jump instructions from being cached in the Decoded ICache when the jump instructions cross a 32-byte boundary or when they end on a 32-byte boundary. In this context, Jump Instructions include all jump types: conditional jump (Jcc), macro-fused op-Jcc (where op is one of cmp, test, add, sub, and, inc, or dec), direct unconditional jump, indirect jump, direct/indirect call, and return

The workaround to lower the perf loss of having jumps uncached when they land on a 32-byte boundary involves adding quite a few NOPs... (or padding with meaningless prefixes)


Looks like the impact is pretty bad in certain cases: https://twitter.com/damageboy/status/1194751035136450560


Yeah, edited the title to uppercase Hardware Lock Elision now.


Good to avoid any misunderstanding, even if I don't think there should have been any: hardware lock elision is hardware lock elision, and TSX is not limited to hardware lock elision. So Intel really did disable hardware lock elision, and the locks are no longer elided...

Now maybe some people also like to elide some locks using explicit RTM, but that's up to them, and that's not even completely hardware in this case (you have to write your own algo to do that and provide an explicit software fallback), and of course Intel is not going to analyse algorithms to determine whether there is logically some lock elision in a transactional section or not, and only break out of speculation on the blocks of code presenting some...


Shame. When TSX was announced in 2012, I started looking forward to a day, years down the line, when it would be ubiquitous enough that I could write code that depended on it. (At least on Macs, which don't run AMD processors.) The first processors supporting it shipped in 2013... but then in 2014 Intel abruptly disabled it in all then-shipping processors via microcode update, due to an erratum. Eventually they started shipping processors without the bug, and 4-5 years passed, a decent chunk of the time needed for ubiquity. But now the clock's reset to zero once again.

Though, apparently Intel always disabled TSX on i3 processors for product segmentation, so maybe universal support was never in the cards...

I wish more CPUs would support at least a limited version of this. Basically, I want the equivalent of atomic<shared_ptr<T>>, but lock-free. That requires reading a pointer value from memory and then incrementing a reference count stored at that pointer, as a single atomic operation. I'm pretty sure that's doable with TSX, although I haven't actually tried it.


FWIW, my understanding is that XBEGIN/XEND/XABORT are still available on the affected CPUs. Only XACQUIRE/XRELEASE are disabled. So the clock isn't really reset.

I, too, wish TSX was more ubiquitous. I'm working on a kernel, and was hoping to use TSX to greatly simplify the logic of safely reading user memory from kernelspace - catching various exception cases without going through the real exception handler.

Turns out, none of the machines I have at my disposal have TSX - not my Desktop PC nor my server machine. So, RIP that plan, I guess.

What you want is absolutely doable with TSX. The overhead might be significant though. I wouldn't be surprised if locking a mutex was faster.


You can’t simplify the logic because TSX transactions may abort for any reason and make no promises of forward progress. You must implement a fallback codepath.

If you meant having an optimized codepath, then that is doable. But given you're writing a kernel, there may be microarchitectural hazards that trigger excessive aborts.


Forget microarchitectural issues. If you have user memory paged out (or not present for accounting reasons, etc), you can try to access it in a transaction until the cows come home and it will never work — to make progress, you need to update the page tables either via a page fault or manually.


Not an issue for me, my kernel has no swap or other such thing - by design. Of course I realize not everyone is in my shoes - such an approach would absolutely fail on Linux!


Yeah, at first I wanted to use TSX exclusively, but I came to the same conclusion you did while digging into it more deeply.

I still think having a fast path using TSX could be useful, but since I never had a CPU with it, I never had a chance to benchmark it.


In some cases it can have a huge speedup. But it's rather tricky to get right. So far for almost all of my use cases the transaction sizes have been too large, and it almost always aborted.

The RPCS3 PS3 emulator saw a 40% perf boost, so some amazing gains are possible.

https://www.phoronix.com/scan.php?page=news_item&px=RPCS3-In...


> I still think having a fast path using TSX could be useful

That’s what TSX is intended to be: a path that runs faster when possible.


Ditto - and it seems like every time TSX was "re-released", it ended up being disabled later on for some random microcode issue.

Transactional memory is hard!


> I want the equivalent of atomic<shared_ptr<T>>, but lock-free.

You can do that without transactions with a lock cmpxchg16b instruction, or with a more complicated algorithm like this:

https://github.com/facebook/folly/blob/3de8f357eecaff89e3847...


Is there a reason a standard-compliant implementation of atomic<shared_ptr<T>> couldn't use that approach?

Or is it just a matter of someone doing it?


Sadly TSX was just a bad idea from the get-go.


Is https://github.com/speed47/spectre-meltdown-checker the best tool? Impressive chunk of sh.


lscpu contains a pretty good summary of vulnerabilities and seems to be installed on all of the Linux systems I have. Not quite as extensive as the linked script, though.
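For a quick look without any extra tools, the kernel also exposes the same per-vulnerability verdicts under sysfs (a minimal sketch; the directory is standard on recent Linux kernels, and the fallback message is mine):

```shell
#!/bin/sh
# Print the kernel's verdict for each known CPU vulnerability, one per line,
# e.g. "tsx_async_abort: Mitigation: TSX disabled".
VULN_DIR=/sys/devices/system/cpu/vulnerabilities
if [ -d "$VULN_DIR" ]; then
    for f in "$VULN_DIR"/*; do
        printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
    done
else
    echo "kernel does not expose vulnerability status here"
fi
```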


I’m also impressed. For a 5000+ line shell script, it’s pretty well-structured & quite readable.


You have to go out of your way to make a shell script that isn't easy to read. They're generally more readable at scale than "proper" languages.


As soon as you deal with env variables, need to support zsh and various shells, it quickly devolves into a hot mess.


Ignoring POSIX-compliance does that, yes.


POSIX compliance in no way makes environment variables in shell scripting more usable or readable; it's usually the opposite.

I say this as somebody who has written shell scripts for almost 20 years.

Constructs like

    if [ "X${str1}" = "X${str2}" ]
to avoid things exploding because a zero length variable might exist is a stupid and ugly hack.

Not to mention the hell that is trying to use a user-supplied filename safely in a shell script. (For the uninitiated: a unix filename can contain any character and any sequence of bytes besides / and 0x00, including newlines, backspaces, shell metacharacters and so many more exciting things.)
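For illustration, a defensive pattern that survives such filenames (a sketch; the hostile name is made up, and the key points are quoting every expansion, passing `--` to stop option parsing, and letting find hand names to children as raw argv entries):

```shell
#!/bin/sh
# Create a deliberately hostile filename: leading dash, spaces, and a newline.
tmp=$(mktemp -d)
name="$tmp/-rf spaced
newline"
: > "$name"

# Quote every expansion, and use -- so the leading dash isn't read as an option.
ls -- "$name" > /dev/null && echo "handled safely"

# find -exec passes each name as a single argv entry, with no word splitting
# or globbing, so even embedded newlines are safe to delete this way.
find "$tmp" -type f -exec rm -- {} +
rmdir "$tmp"
```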


> to avoid things exploding because a zero length variable might exist is a stupid and ugly hack.

That has less to do with POSIX compliance than with working around quirks of less-compliant historical implementations. Also, I think you'd have to dig pretty deep to find a test that doesn't support zero-length arguments; you'd be safe as long as you quote your variables.

The actual problem of ambiguity arises from variables that are also syntax, like '(', or '!', when using compound expressions (which you should avoid anyway).

  # this is bad (also obsolete)
  [ "$var" != foo -a "$var" != bar ]
  # this is good (and fully POSIX compliant)
  [ "$var" != foo ] && [ "$var" != bar ]
  # you probably don't need to care for systems where this was necessary
  [ "X$var" != Xfoo ] && [ "X$var" != Xbar ]


Sounds like your POSIX-fu is more than 20 years out of date. Even 1003.2-1992 refers to systems that break with the correct form as historical.

https://i.imgur.com/p20s8Pi.png


What do you have against that script? It's a long script but it's pretty clean and readable


Oh, nothing! I think it's beyond excellent.

I ran shellcheck to see if it was bash or sh and it didn't make a peep :). My Q is just asking if there is another similar-quality tool. I see how you could interpret it the other way, thanks for asking.


Oh I see now, I misinterpreted the sense of the question :)


He meant "impressive chunk of bash" (.sh) not "of shit"


It's not bash, it's a shell script.

bash is not POSIX-compliant. This script is.


A script for bash is a shell script. I've never understood this tic some people have about writing shell scripts executable by strictly minimal POSIX-compliant shells: nobody needs that level of strictness. Everyone has bash. Everyone can easily get zsh too. Why should I spend one minute of my time worry about whether ksh93 or dash or something can run my script when I can just put bash in the shebang line and get access to a much richer feature set? It makes no sense. Portability for portability's sake is a waste of time.


It's pretty common to run into environments without bash, or without the bash your script needs. Three examples would be any Docker container based on Alpine (just ash), OpenBSD which ships with (I think) a Korn shell, and macOS which is stuck at Bash 3.x.

Similarly, not every Linux distribution puts bash in the same place, e.g. NixOS. So blindly putting `#!/bin/bash` as your shebang will result in a broken script on those systems.

You can just install bash or move it around to suit your script, but maybe you're shipping this script for other people to use and can't control their environments. It's easier to make a handful of changes to the script instead. This is why some people care about portability.
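One middle-ground pattern for that situation (a sketch, not something from the thread): start under `#!/bin/sh`, which exists everywhere, and re-exec into bash only when it's actually present:

```shell
#!/bin/sh
# Launcher that runs everywhere: start under POSIX sh, upgrade to bash if found.
if [ -z "${BASH_VERSION:-}" ]; then
    if command -v bash >/dev/null 2>&1; then
        exec bash "$0" "$@"
    fi
    echo "bash not found; staying on plain sh" >&2
fi
# Past this point, bash features may be used (guarded by the check above).
echo "running under: ${BASH_VERSION:-POSIX sh}"
```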


There used to be a lot more POSIX OSes; most of them, other than Linux and the BSDs, don’t have Bash, but instead only some other default shell. (Remember the https://en.wikipedia.org/wiki/Microsoft_POSIX_subsystem ? Xenix? Solaris?)

But the modern reason is https://en.wikipedia.org/wiki/Almquist_shell (ash) and its descendant, dash. Busybox environments ship with ash as the only available shell, unless you manually build another shell into the busybox chroot. As well, many initrd/initramfs environments have only one shell, usually dash. When scripting either of these, you have to write pure POSIX scripts.

And—mostly due to inertia—this tends to apply to all such environments, no matter their size. My Synology NAS (because it’s technically a busybox env) doesn’t have bash, despite having huge honking disks! Neither does CoreOS (because it’s technically a PXE initramfs env), despite not being intended as a “link-bandwidth-constrained” PXE environment at all. They’re just descended from these “lineages” of OSes that were used on more constrained devices in more constrained times, so their maintainers have kept to the old embedded-system ways.

On top of this, as the ash wiki article says, Debian made a choice to make dash their /bin/sh, and then forced all system scripts to use it, thus forcing them to be rewritten to be POSIX-compatible. (I think this was done with goal of lower memory-usage for constrained Debian environments that might still need to install some arbitrary subset of Debian system services.)

There was also another kind of pressure, 10-20 years ago, with Linux single-floppy or low-link-bandwidth PXE boot environments wanting to be as slim as possible, and therefore not shipping with bash.


I keep my dotfiles bootstrapping script POSIX-compliant; it’s a little bit of a pain but it lets me use it in really stripped down environments.


Bash is POSIX compliant, but it has a number of extensions.


No, it isn't. You have to explicitly enable a POSIX-compliant mode:

https://www.gnu.org/software/bash/manual/html_node/Bash-POSI...


That still makes it POSIX compliant…


If you have to set a mode to make it so, it isn't.


When executed as /bin/sh, Bash enables POSIX mode automatically. /bin/bash isn't POSIX compliant by default, but /bin/sh symlinked to /bin/bash is, so you can be sure that #!/bin/sh will be POSIX-like even when implemented by Bash.
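This is easy to verify empirically (a sketch, assuming bash is installed; `shopt -qo posix` exits 0 when the posix option is on):

```shell
#!/bin/sh
# Same binary, two names: bash turns on POSIX mode when invoked as "sh".
bash -c 'shopt -qo posix && echo "as bash: posix on" || echo "as bash: posix off"'

tmp=$(mktemp -d)
ln -s "$(command -v bash)" "$tmp/sh"
"$tmp/sh" -c 'shopt -qo posix && echo "as sh: posix on" || echo "as sh: posix off"'
rm -r "$tmp"
```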


No, it is.

An x86-64 CPU boots in 16-bit real mode. An explicit option has to be used to switch it to 64-bit long mode.

That doesn’t mean that it’s not a 64-bit CPU.


Not entirely.

$ echo I am posix compliant &>/dev/null
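To spell out why that line is a good litmus test (a sketch; the dash behavior is left as a comment rather than run, since dash may not be installed everywhere):

```shell
#!/bin/sh
# In bash, '&>/dev/null' redirects both stdout and stderr:
bash -c 'echo visible &>/dev/null'   # prints nothing

# A strictly POSIX parse sees two commands instead:
#   echo visible &     (backgrounded)
#   >/dev/null         (empty command with a redirect)
# so under e.g. dash, the word "visible" still appears:
#   dash -c 'echo visible &>/dev/null'
echo "demo done"
```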


Is there a place where I can find how much performance my Haswell CPU has lost due to all of these 'fixes'?


For example from March 2019:

https://www.phoronix.com/scan.php?page=article&item=spec-mel...

I think there are some further in-depth benchmarks with multiple generations of Intel and AMD, but can't find the link.


It depends on the kind of applications/loads you are running, so it's really not possible to put a single number on it. Best is just to run some benchmarks yourself to see how it impacts you. Some people report losses of around 20%, others 1 or 2%...


The Haswell TSX was found to be unsound and disabled not long after it came out.


Haswell doesn't have this feature


I believe Haswell does have Hardware Lock Elision, but it was disabled by a microcode update.


It was disabled because it didn’t work. So I guess it’s in the eye of the beholder.


Uh, so there's absolutely no way to avoid disabling HLE (unless you block microcode updates altogether)? Can people who paid extra for a computer whose CPU has HLE get a refund?


You can avoid it. They didn't disable it outright, but implemented an MSR which disables it. Don't configure this MSR to disable HLE and you're good.


But it says "Hardware Lock Elision (HLE) is unconditionally disabled by the new microcode"?


https://ieeexplore.ieee.org/document/6877452 describes a speedup of 41% when using TSX under a specific HPC workload.

https://www.phoronix.com/scan.php?page=news_item&px=RPCS3-In... describes a 40% speedup for an emulation workload with TSX.

Do we know what the slowdown will be for current gen CPUs for various workloads?


As far as I know the only software that really wants it is SAP HANA. The slowdown should be small, that 40% speedup on emulation is specifically on load-locked/store-conditional instructions, and the linked blog post says "non-TSX CPUs such as Ryzen [had] a noticeable improvement in performance, although not to the same extent".


How do microcode changes actually work? My mental model of a CPU is hard baked logic paths.


Modern CPUs are more like VMs. The actual architecture is totally different from x86 and just pretending to be x86 to the outside world. X86 instructions get translated to the native instruction set of the CPU which is more or less "secret". This makes it very easy to patch issues with the CPU through microcode, as seen in this case.


I would go so far as to say that modern CPUs are emulators, not even VMs. Something like 20% of the die area is devoted to instruction decoding.

It's the only reason I'm interested in RISC-V or advanced ARM stuff. Even though there's a metric fuckton more effort going into the latest Intel x86_64 chip, there's a lot of silicon left on the table.


>Think of microcode like a firmware for your computer’s CPU. Microcode translates the instructions the CPU receives into the physical, circuit-level operations that happen inside the CPU. In other words, an updated microcode can send different instructions to the circuits inside the CPU. This can prevent certain Spectre attacks by changing the way the CPU functions. Microcode updates can also fix bugs and other errors, without requiring complete replacement of CPU hardware.


I recently learned how microcode works from this video, actually: https://youtu.be/Zg1NdPKoosU. It might not be the state of the art anymore but is probably a good basis.


My personal "Aha!" moment about this was in front of an old Zuse Z5 in a museum with a block diagram of the CPU hanging next to it. That diagram showed which bits toggled which functional unit within the CPU. The control panel on the operator desk had the same bit labels on the data input buttons. This early machine had no modern instruction decoder at all. Each bit in an instruction word was used directly.

This becomes wasteful when the number of functional units grows and not all bit combinations are reasonable. So the next reasonable thing to do is to introduce a lookup table from shorter coded instructions to the actual control bit patterns.

The CPU in this video adds an internal step counter that allows splitting an instruction over multiple cycles with a different control word for each cycle.

Modern CPUs are probably conceptually similar, but much more complex. Optimizations like pipelining and multiple ALUs make the whole decoding and checking process more complex and more dynamic.


How do the microcode upgrades get delivered? Do people have to manually install them, or do Intel have some way to force a microcode update over the Internet?


It's distribution specific, but generally it gets loaded at every boot. It's not permanent. https://wiki.gentoo.org/wiki/Microcode


On Windows it's handled in Windows Update, and maybe there's a way to disable loading the new microcode. On Linux, it's explicitly provided to the kernel by whatever userspace you have. On Arch, for example, it's in a separate package called intel-ucode.

Some board firmware loads updated microcode when it's updated. It has to be loaded at each boot by software in order to change.


On windows you can disable / enable some of the mitigation via the registry. https://support.microsoft.com/en-us/help/4072698/windows-ser...


On many newer CPUs, the memory for the update is volatile, so it gets rewritten at every power-on. That could be done by your motherboard or the kernel of the OS you are using.



Multiple ways:

- through BIOS updates (so the motherboard OEM has to release a BIOS update for that model that includes newer microcode and you'd have to download/apply the update)

- through the OS: both Linux and Windows can and do load ucode at boot, depending on the version/configuration/etc
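On Linux you can also check which revision actually got loaded (a sketch; the /proc/cpuinfo field exists on x86, and the fallback message is mine):

```shell
#!/bin/sh
# Show the currently running microcode revision (a per-CPU field; print one).
if grep -iq '^microcode' /proc/cpuinfo 2>/dev/null; then
    grep -im1 '^microcode' /proc/cpuinfo
else
    echo "no microcode revision exposed on this system"
fi
# The kernel log also records early/late loads:
#   dmesg | grep -i microcode
```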


Read the whole thing and thought it was an article, then reached the Signed-off-by at the end and realized it was a commit message.

This is what a good commit message should look like! It tells what, why and how something was fixed, instead of just "fixed X".


Is this related to: https://news.ycombinator.com/item?id=21534232

Or do we have two bugs being fixed at the same time?


It’s a somewhat unrelated bug that also had a workaround released for it recently.


Does this have any effect on AMD systems?


While the link is to a Linux commit message, the described change (disabling HLE) is implemented in an Intel microcode update, which of course will not have any effect on an AMD processor.


AMD is not affected by the latest wave of speculative execution bugs.


AMD does not support any similar features at all.


Is there a way to NOT get that "fix"? I don't care for security, I prefer performance. A magic kernel switch to not load new microcode? Or a switch in my distro (Debian)? Something similar to `mitigations=off` in current kernels.
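For what it's worth, a hedged sketch of the knobs that exist on the Linux side (`tsx=` and `tsx_async_abort=` arrived with this change in kernel 5.4, and `dis_ucode_ldr` skips the kernel's microcode loader entirely; note HLE itself stays off once the new microcode is running, whatever you pass):

```shell
# /etc/default/grub (then run update-grub) -- illustrative values only
#   tsx=on               keep RTM usable despite the new microcode defaults
#   tsx_async_abort=off  opt out of the TAA mitigation, like mitigations=off
#   dis_ucode_ldr        don't load microcode from the kernel at all
GRUB_CMDLINE_LINUX_DEFAULT="quiet tsx=on tsx_async_abort=off"
```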


Each time I’ve seen this story today as I scrolled down the list my first though has been “Why and how did intel disable Larry Ellison on CPUs? And why didn’t they do it years ago?”

Maybe it’s just been a long week...


They do not disable it. They only implement an MSR which allows you to disable it. If you want to stay with updated microcode and with HLE, just don't configure it as disabled.


Yes they have. The only conditionally disabled feature is xbegin/xend/xabort.

From the article:

> The other TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally disabled by the new microcode...


What percentage of functionality has Intel already disabled on their CPUs? 25%? Soon they'll have to disable the entire CPU. It's fucking hilarious.


... in the Linux kernel.


Nope. The link is to a Linux kernel commit message, but it simply describes what the latest microcode update does.


... in the microcode updates. Which any of your BIOS, UEFI, or OS may and likely will apply.


If you're on Linux, you may need to install a package to get µcode updates. The Arch Wiki has a particularly informative article on the subject:

https://wiki.archlinux.org/index.php/Microcode


It is installed by default in most distributions, e.g. Debian.


Ok you can all stop downvoting me now.


If you haven't already, would you mind reading about HN's approach to comments (https://news.ycombinator.com/newswelcome.html) and site guidelines (https://news.ycombinator.com/newsguidelines.html)?


I think speculative execution is in principle incompatible with untrusted code execution. Even if CPU makers will place memory protection in front of speculative execution, and not behind as it is now, any untrusted code/bytecode can still pwn the process running it, and there is no way to work that around as such.


The problem is not the speculative execution. It's the observable side effects.


With current CPUs, yes. Adding either a speculation barrier instruction, or even better, more fine grain ways than a process to describe memory protections (like segmentation) would go a long way.


It's more like how having JavaScript enabled in a browser is "in principle incompatible with untrusted execution". That is to say, yes, you are technically right, but at the same time the benefits are great so we are still pursuing that approach while at the same time fighting an arms race of finding and closing any security issues this design introduces.


If the only thing in the process is the untrusted code (+ interpreter) then there's nothing to pwn.


Indeed, and that's the reason why a lot of current "sandboxing" efforts are rather misguided.

There is no reason to filter syscalls from some kind of bytecode run by a fully privileged interpreter if you can simply run all of it unprivileged, with every syscall already hardened and ACLed.

But as long as there is the remotest possibility of a process being able to get around the MMU, there is no reason to do that either.



