This is another TSX (transactional memory) issue, and you can disable TSX without much of a problem.
The attacker basically needs to be running a binary on the machine (not JS in a browser or anything).
The leakage is extremely slow, about 225 bytes/minute for ASCII text (a 4k page in 18 minutes). I'm not sure if that was an exact recovery either, just a probabilistic one. With noise reduction to enhance recovery, they said it took twice as long - so about 113 bytes/minute.
It seems the attack can only control the bottom 12 bits of the address to recover (but I didn't fully get why), and it requires the victim process to be actively reading or writing the data to get it into the L1 cache somehow, so data just sitting in RAM isn't good enough.
The attacker still needs to figure out an address (even with ASLR there is still a lot of guessing), and if you have really sensitive data, just move it around every second until you wipe it.
Interesting, but kind of a non-issue.
It's only really hard and messy for the first guy who implements it, after that it is much easier, albeit still fairly messy. I'm not saying we need to panic, but it's more than a "non-issue".
How do you know where the key is, and how are you guaranteed to be able to read enough of it before the "shifting sands" that is timing unpredictability and general noise in the system make you read something else?
That's what really irritates me about all these side channels that have been found ever since the first Spectre/Meltdown --- they all demonstrate something flashy like "we read a key/password/secret/something important from memory in a few minutes" while conveniently neglecting to mention the countless hours spent aligning everything just right so they could show off the one "magic trick". In a lot of cases there's barely even any control over where it reads.
Every time something like this comes up, I feel compelled to post some bytes from somewhere random in the memory of a random process on my machine to show just how much I care; here you go:
E8 7F 00 00 00 A1 64 30 40 00 89 45 D8 8D 45 D8
C4 20 00 00-CE 20 00 00-D8 20 00 00-E2 20 00 00
F8 20 00 00-00 21 00 00-0E 21 00 00-16 21 00 00
26 21 00 00-36 21 00 00-42 21 00 00-56 21 00 00
66 21 00 00-76 21 00 00-84 21 00 00-96 21 00 00
AA 21 00 00-00 00 00 00-FF FF FF FF-B4 16 40 00
C8 16 40 00-7C 20 00 00-00 00 00 00-00 00 00 00
EC 20 00 00-00 20 00 00-00 00 00 00-00 00 00 00
I don't mean to understate the difficulty of the task, but I've also seen probably tens of repros by now that use intimate knowledge of, e.g., the kernel page allocator, or the known post-boot state of a firmware image flashed on millions of devices, to drastically cut down the search space.
If a university can accomplish this task with their funding and no urgent incentive, what is the nation state actor doing with their enormous budget?
Badly enough as in they can ever do it a few times a year against a few high-value targets, or badly enough that they can do it to "only" a few tens of millions a year?
Even end-to-end encrypted communications can be bypassed if the government wants them "badly enough". But the point is that with E2E encryption, they can no longer tap into and data mine the conversations of billions in real-time to fish for crimes.
Just like when fighting malware creators, the point of security is to keep raising the standards and the difficulty for the malicious actors.
It's really not that hard. E.g. people working on browser exploits have been working on exactly this for years and years, back when just having an out-of-bounds read due to a regular browser bug was the mechanism. Turns out that programs are absolutely chock full of pointers and it doesn't take long to run across one that points to what you are looking for. Especially because programs tend to have lots of data structures that end up pointing to more and more important data structures, funneling you into the guts of the program.
Sure, reverse engineering takes work, but blindly hunting through memory with no clue is definitely not what this is.
Side-channel attacks are basically a persistent out-of-bounds read mechanism. That is a very bad thing (TM).
There are literally thousands of people who work on this day in and day out, and millions in bug bounty programs out there.
Perhaps package servers and package management software should round up the package size to hide the identity of the package? Use oblivious transfer to hide the identity of the package from the package server itself?
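Even something as crude as power-of-two size buckets would help with the first half of that; a toy sketch (the bucketing scheme here is made up purely for illustration):

    #include <stddef.h>

    /* Round a package's reported/transferred size up to the next
       power-of-two bucket, so the size leaks less about which
       package was fetched. */
    size_t padded_size(size_t n) {
        size_t bucket = 4096;           /* smallest bucket: 4 KiB */
        while (bucket < n)
            bucket <<= 1;
        return bucket;
    }

A 1.3MB package and a 1.9MB one would both travel as 2MiB; the coarser the buckets, the less the size says.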
This makes no sense. If you have the privileges to install a rootkit, there is no need to use any speculative execution exploit.
You also don't get to extract constantly. You only get a shot when the data is in the LFB, so the program needs to be actively reading or writing it to keep it moving back and forth between L1 and L2, at least that is the way I read the paper.
Luckily there are other vendors who seem to know what they're doing: we just received our first batch of non-Intel servers at work and have no intention of returning to Intel for the foreseeable future. What they offer is simply not worth the risk. I expect more companies to follow suit.
There are cache architectures (such as pseudo random replacement) that are much more difficult to perform side channel attacks on, but have slightly lower performance.
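The gist of random replacement is tiny; a toy model, not real cache RTL (the LFSR here stands in for whatever PRNG the hardware would actually use):

    #include <stdint.h>

    #define WAYS 8

    static uint16_t lfsr = 0xACE1;     /* toy 16-bit Galois LFSR */

    static uint16_t rand_bits(void) {
        lfsr = (lfsr >> 1) ^ (-(lfsr & 1u) & 0xB400u);
        return lfsr;
    }

    /* Unlike LRU, no priming pattern lets an attacker predict
       which way gets evicted next. */
    int choose_victim_way(void) {
        return rand_bits() % WAYS;
    }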
Intel's problem has been its relentless pursuit of per clock performance, and that has led to some of these side channel attacks. Cache coherency on multi-core is hard.
My guess is that if the market share of Intel and AMD were swapped, you would see similar cache attacks on AMD.
It's worth noting, too, that Intel has diversified-- the processor business is only one very important business to Intel.
Unless you are running untrusted VMs from others, you're not making a rational, knowledgeable decision.
Most people are just using this to falsely justify their Intel hate, similar to Microsoft in the 2000s.
2^12 = 4096, or the x86 page size in bytes.
(As an aside, this also limits the size of the L1 cache, which is why it hasn't grown much despite the L2 and L3 cache growing a lot; an 8-way set-associative VIPT cache with a 4KiB page size is limited to 32KiB, absent tricks like page coloring. Perhaps this will change if 64-bit ARM servers become popular, since they can only address the largest amount of memory when using a 64KiB page size, and this would make enterprise distributions default to that page size.)
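Spelling out the arithmetic behind that 32KiB limit (assuming the usual 64-byte cache lines): the index and offset bits have to fit inside the 12 untranslated page-offset bits, so

    sets     <= page size / line size = 4096 / 64 = 64
    max size  = sets x ways x line size = 64 x 8 x 64 B = 32 KiB

which is the same as saying max size = associativity x page size.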
> To prevent the confusing situation where the same physical address has more than one index in the cache
Ah yes, of course. Is this something that actually happens (e.g. because of processes sharing memory, OS hands out different virtual addresses pointing to the same physical memory) or theoretical in nature?
For example, a common trick to implement an efficient circular list for objects of varying size is to map the same physical addresses twice in adjacent locations. Then you can just use a pointer to the start of each object and have no special handling for the case where an object straddles the end of the allocated area.
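A minimal Linux sketch of that trick, using memfd_create plus an anonymous reservation overlaid with two MAP_FIXED mappings (error handling mostly omitted; ring_create is a made-up name, not a standard API):

    #define _GNU_SOURCE           /* memfd_create */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map `size` bytes of backing memory twice, back to back, so
       accesses past the end wrap around automatically. */
    static void *ring_create(size_t size) {   /* size: page multiple */
        int fd = memfd_create("ring", 0);
        if (fd < 0 || ftruncate(fd, size) < 0)
            return NULL;

        /* Reserve 2*size of address space, then overlay the same
           file pages at both halves. */
        uint8_t *base = mmap(NULL, 2 * size, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        mmap(base,        size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(base + size, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
        close(fd);
        return base;
    }

    int main(void) {
        size_t size = 4096;
        uint8_t *buf = ring_create(size);

        /* An object straddling the end needs no special case: the
           bytes past buf+size alias buf[0] via the second mapping. */
        memcpy(buf + size - 3, "hello", 5);
        printf("%c%c\n", buf[0], buf[1]);     /* prints "lo" */
        return 0;
    }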
A huge L1 would be slower to access. You might end up slowing down all memory accesses by a cycle (or more) just for accessing the large cache. You also have to find some place to put the thing in your floorplan, and route everything else around it. This may result in timing challenges in other parts of the chip, which might require additional delay to resolve.
A huge L1 would take up more space. This would increase area, reduce yield, and increase cost. Since in a multi-core chip each core has its own L1, you will have to pay this cost multiple times. Also, L2 caches are typically inclusive, so you would potentially need a much larger L2 to be able to accommodate all the extra information in these L1s.
These tradeoffs have to be studied with simulated experiments to make the right call. For programs with a huge working set, maybe the added latency pays for itself because you have fewer cache misses. For programs with good locality, maybe you end up losing performance. Maybe you save power by reducing misses, or maybe you waste more because the larger cache draws more power. Maybe it's an insignificant area increase, or maybe it completely blows up your budget.
That looks conspicuously like the load/store port address aliasing size (look up "4K Aliasing"), which can be used to stall data availability while the conflict is resolved. I'll read up on this particular one, but there's a growing family of vulnerabilities with 12-bit address aliasing in their toolbox.
It is on you to make sure something interesting is in the cache, but if you can make your target execute, 12 bits is fairly good selectivity.
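For anyone who hasn't run into 4K aliasing before, nothing exotic is needed to trigger the false match; a sketch (whether and how long the load stalls depends on the microarchitecture):

    /* A store and a later load whose addresses differ only above
       bit 11. The load port initially compares only the low 12
       bits, falsely matches the in-flight store, and may stall
       until the full addresses are disambiguated. */
    static char buf[2 * 4096];

    int demo(void) {
        buf[0] = 1;            /* store to offset 0               */
        return buf[4096];      /* load: same low 12 bits, 4K away */
    }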
That feels like a big issue.
Glad it's possible to do this in virt now: it used to be impossible to disable TSX (you could avoid advertising it via cpuid, but you couldn't actually prevent usage of the instructions).
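For reference, with the TAA microcode update this is now a single MSR write from the host. A hedged sketch against Linux's /dev/cpu/*/msr interface, using the IA32_TSX_CTRL layout as I understand it from Intel's documentation (run as root, per CPU; needs the msr module loaded):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MSR_IA32_TSX_CTRL    0x122
    #define TSX_CTRL_RTM_DISABLE (1ULL << 0)  /* XBEGIN always aborts  */
    #define TSX_CTRL_CPUID_CLEAR (1ULL << 1)  /* hide RTM/HLE in CPUID */

    int main(void) {
        /* One CPU shown for brevity; repeat for each /dev/cpu/N/msr. */
        int fd = open("/dev/cpu/0/msr", O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t val = TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR;
        if (pwrite(fd, &val, sizeof val, MSR_IA32_TSX_CTRL)
                != (ssize_t)sizeof val)
            perror("pwrite");
        close(fd);
        return 0;
    }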
>More specifically, modern operating systems employ Kernel Address Space Layout Randomization (KASLR) and stack canaries. KASLR randomizes the location of the data structures and code used by the operating system, such that the location is unknown to an attacker. Stack canaries put secret values on the stack to detect whether an attacker has tampered with the stack. CacheOut extracts this information from the operating system, essentially enabling full exploitation via other software attacks, such as buffer overflow attacks.
Can anyone explain this scenario? Is this really a realistic scenario? Do they mean if you have code execution on a system, and want to escalate privileges, you would find another network/socket service that is running on the same system, find an exploit in this service, and then leak the stack canary to allow corrupting the stack? There are often easier ways to defeat the stack canary.
>Leaking KASLR/stack canary just mean you get 90s level triviality of attacking any stack-overflowing API you can find.
It does not seem trivial with this exploit, but maybe I'm just not getting it. With the low accuracy and transfer rate, it seems like a lot of stars need to line up with regard to how the service you attack functions.
(The one exception is also interesting. AMD processors allow speculative reads past the end of x86 segments and past BOUND instructions, which of course no-one uses these days. This suggests there may have been a deliberate decision to block them in the more important cases.)
Somebody messed up big time. Or, from a business point of view, did they? Intel's current problems are manufacturing, and the continuously shrinking performance of their "legacy" processors (except that, due to the manufacturing problems, this "legacy" is still mostly the current generation) means people just buy more of them. Of course there is AMD back in the game, but market demand is large enough; plus AMD would have been there anyway, and in the fiction where Intel had taken a good part of the performance hit upfront instead of shipping the security vulns, it would be about as competitive as it is in the current situation.
The people most annoyed are the users. Intel got away with it by pretending these were not really defects in their product but just new SW tricks that they would help defend against, and their clients let them say that without much complaint (well, I guess the big ones got some rebates...), but security researchers and/or processor designers know very well this is bullshit (see the vuln papers and FAQs) and that they simply fucked up big time on Meltdown, MDS, etc. I don't care that a few other vendors made some of the same mistakes: they are still mistakes and design flaws, and not even something new.
Pretty much the only shiny new thing in this stream of vulns was Spectre and the few variants that appeared quite early on (but NOT Meltdown & co). The rest are design flaws that come from the "oh, it's no big deal to leak that potentially privileged data, we will drop everything and trap before any derivative can get out anyway" mentality. Yeah, no, I'm sorry, but the founding papers on speculation already said not to do that :/ Either they did not do their homework, or they voluntarily chose to violate the rule.
... if you value security over raw performance. Clearly Intel has decided at some point that it was worth playing with fire in order to get ahead in benchmarks. In their defence it seems to have worked reasonably well for them for quite a while.
>The people most annoyed are the users.
I wish, but I wonder how much of that is true. Are most users even aware of these problems? They get patched automatically by OS vendors and then most of the time they won't hear about them anymore. I think the "nobody gets fired for choosing Intel" will probably still prevail for quite some time.
They built a bunch of tech debt into their processors to boost their numbers, and now the chickens are coming home to roost.
What I'm wondering is how many changes this will make to their product roadmap, and to what degree it will make next generation chips look lackluster compared to what people (think they) have now.
It seems to be down to the notoriously buggy TSX (hardware transactional memory) in Intel CPUs.
They had an additional mode that would transparently convert many spinlocks into transactions without code changes - that is now gone.
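That transparent mode was HLE. The explicit RTM intrinsics are the other half of TSX, and hand-rolled lock elision with them looks roughly like this; a sketch assuming a TSX-capable CPU and gcc/clang with -mrtm (the lock_* helpers are my own, not a library API):

    #include <immintrin.h>   /* _xbegin/_xend/_xabort */
    #include <stdatomic.h>

    static atomic_int lck;      /* 0 = free, 1 = held */
    static long counter;

    static void lock_acquire(void) {
        while (atomic_exchange_explicit(&lck, 1, memory_order_acquire))
            while (atomic_load_explicit(&lck, memory_order_relaxed))
                ;               /* spin until it looks free, retry */
    }

    static void lock_release(void) {
        atomic_store_explicit(&lck, 0, memory_order_release);
    }

    void increment(void) {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Read the lock word inside the transaction: if another
               thread takes the real lock, its write aborts us. */
            if (atomic_load_explicit(&lck, memory_order_relaxed))
                _xabort(0xff);
            counter++;          /* no line ping-pong if uncontended */
            _xend();
        } else {                /* aborted: fall back to the real lock */
            lock_acquire();
            counter++;
            lock_release();
        }
    }

The read of the lock word inside the transaction is the load-bearing part: a thread taking the real lock writes to it, which aborts every in-flight transaction and forces them down the fallback path.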
As core counts increase spinlocks and other synchronization primitives simply become too expensive. We'll need transactional hardware support eventually.
Scaling workloads does not require transactional memory and certainly doesn't require a vulnerable implementation of it. HTM might be the easiest way to scale a relatively naive algorithm, but the most scalable synchronization is none at all (or vanishingly infrequent) — and that works just fine with conventional locks and atomics (both locked instructions and memory model "atomics" such as release/acquire/seq_cst semantics).
TSX brings hardware supported optimistic locking and breaks the latency imposed by MESI and related protocols in use today. Of course it's great if you can get away with no synchronization at all - but then you might as well just use a GPU. TSX helps with those non-trivially parallelized problems that are still best performed on a CPU.
Obviously many workloads require some coordination, but often something as trivial as allocating one of a given resource per CPU is sufficient to avoid most contention even on 100s of CPU core machines. Profile; improve. The same is required with HTM.
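As a concrete sketch of that pattern (slot indexing is left to the caller here; a real kernel would use its per-CPU machinery):

    #include <stdatomic.h>
    #include <stdalign.h>

    #define NCPU 128
    #define LINE 64

    /* One counter per CPU, each on its own cache line, so the hot
       write path never bounces a line between cores. */
    struct slot { alignas(LINE) atomic_long n; };
    static struct slot counters[NCPU];

    void add(unsigned cpu, long v) {    /* cpu = caller's CPU index */
        atomic_fetch_add_explicit(&counters[cpu].n, v,
                                  memory_order_relaxed);
    }

    long total(void) {                  /* rare reader pays the cost */
        long sum = 0;
        for (unsigned i = 0; i < NCPU; i++)
            sum += atomic_load_explicit(&counters[i].n,
                                        memory_order_relaxed);
        return sum;
    }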
Regardless of your thoughts on HTM and scaling technology, TSX is broken from a security standpoint, which is the primary subject of the fine article. HTM != TSX.
I’m going to nitpick, though.
> I explicitly mentioned memory model atomics in addition to locked instructions in an attempt to prevent getting hung up on locked atomics. I guess that didn't work.
By “memory model atomics” do you mean atomic loads and stores rather than, say, compare-and-swap or atomic-increment? Because C++11 compare-and-swap and atomic-increment operations take a memory ordering parameter, yet they still generate the same `lock cmpxchg` and `lock inc` instructions.
But the issue isn’t locks; none of those instructions actually lock the entire memory bus like in the old days (...unless you pass an unaligned address!). They lock the cache line, which causes cache line thrashing if many processors do it at the same time, and that’s the biggest source of overhead. But plain stores also lock the cache line. A compare-and-swap is more expensive than a plain store, but not that much more.
Yet another Intel-exclusive side-channel vulnerability might help change that. This side-channel stuff is terrible for cloud operators. Every time they have to adopt another layer of mitigation, some fraction of their capacity disappears in a puff of shame and excuses.
The big migration will happen in about 2-4 years. Typical enterprise schedule is 3-5 year cycles, zen2 launch was 2019.
But the CacheOut paper describes how to use it in practice and why Intel's fix is not sufficient.
What is an operating system?
An operating system (OS) is system software responsible for managing your computer hardware by abstracting it through a common interface, which can be used by the software running on top of it. Furthermore, the operating system decides how this hardware is shared by your software, and as such has access to all the data stored in your computer's memory. Thus, it is essential to isolate the operating system from the other programs running on the machine.
So uh... that's a reference to the IME right? Not the user's installed OS.
See also Intel's advisory: https://software.intel.com/security-software-guidance/softwa... (cite 23 in the CacheOut PDF).
Is it for TeX or a common syntax for some publishing platforms to pick up?
My memset_s does a full memory barrier, but no others do. Esp. the so-called "secure" libs.
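Something along these lines (a sketch with GCC/Clang builtins; the volatile stores keep the compiler from eliding the zeroing, the fence is the part the others skip):

    #include <stddef.h>

    void secure_memzero(void *p, size_t n) {
        volatile unsigned char *vp = p;
        while (n--)
            *vp++ = 0;   /* volatile stores: can't be optimized out */
        /* full barrier, in the spirit of the memset_s described
           above (GCC/Clang builtin) */
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
    }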
I'm guessing you're referencing this. libsodium's position that your suggestion wasn't enough seems reasonable, as does their response:
> `sodium_memzero` should be considered as best-effort, with reasonable tradeoffs.
The documentation for memzero doesn't read like it works 100% of the time, no matter what. It reads like there are tradeoffs involved.
> My memset_s does a full memory barrier, but no others do. Esp. the so-called "secure" libs.
Your memset_s also doesn't work on big endian, whilst "the so-called 'secure' libs" often do.
I'm fine with self-promotion, but don't disparage efforts that have a much larger surface that they deal with, simply because you disagree with the tradeoff they find acceptable. A one-liner isn't enough for that conversation.
I don't see anything that suggests you did have an idea about "upcoming cache exploits", so I'm afraid I can't say that you were.
I saw you mentioning something that would help against Spectre but was nowhere near enough to deal with it, and Spectre was then dealt with by microcode, kernels and every other part of the software stack, making attacking libsodium with Spectre something that might not even be possible.
> The tradeoff they are talking about is leaving secrets in the cache, because mb(lb) is slow. This is not acceptable for security, only for performance.
Not entirely. The main reason is this one:
> SPECTRE-like attacks can be conducted during all the time the secret is present, prior to zeroing. The required preconditions and the time window during which a full memory barrier could help, seem to be negligible compared to the actual lifetime of the secret.
Untrusted code execution must go, that's the only way.
The genie is out of the bottle now. There will be more and more practical instruction level attacks with each year now.
This is how you get corporate rule over what can and can't run on machines you own.
Safer architectures can be developed, without handing over control to a third party.
It will take many billions for the industry to do a U-turn, switch back to dedicated servers as the gold standard, and put a leash on the ambitions of browser makers.
Disagree. You do have to put a lot of work into the process: a formal spec as well as formal methods/simulation. There are certainly a lot of fun things to consider in that, but I don't think it's completely infeasible.
As an aside, one of the things that intrigues me about RISC-V is things such as a formal instruction set spec being openly published. You could apply all sorts of tooling to that before actually creating silicon.
Focusing so much on JITting JS was a bad idea then, just like it is now. The entire world of the web is artificially propped up and will probably die the way it should have died when it started: with people frustrated by constant security vulnerabilities enabled by a group of ad companies who don't care about anything but money.
> It will take many billions for the industry to do a U-turn, and switch back to dedicated servers as a golden standard, and putting a leash on ambitions of browser makers.
WRT dedicated servers, if anything the industry needs that push now anyway. I've yet to see a 'real' cost projection on a SaaS 'rewrite'; i.e., if the current thing has been in use 5 years longer than it should have been, your cost models should go out 5 years longer than you intend your new solution to exist.
Which would be ironic; my pain is in the beancounting side (usually what an org really cares about) yet security will be the more likely scenario.
I think the cat is out of the bag for the browser market though. Users have on the whole gotten 're-enclosed' thanks to modern laptops tending towards small eMMC or SSD sizes.
But FFS, this sort of thing is the reason WASM scares the daylights out of me.
It's actually much easier to render deleted data unrecoverable on a flash based SSD than a hard drive, if you're willing to use only SSDs that eschew a few very common performance optimizations. So that fits right in with the theme.
He isn't necessarily talking about the "untrusted by corporates" situation, maybe just "untrusted by the user".
(Personal pet peeve about "trusted" and "untrusted" --- never neglecting to mention the by who!?)
I can't get behind trusted computing until there is a strong movement and culture behind allowing user freedom on trusted computing platforms. As it stands, that isn't the case at all. Just look at the iPhone and Notarization on the Mac.
However, the data is all in the document source; the JS just plays with the visibility. I.e., there are no dynamic tricks to populate it.
What you see here is yet another attack on a relatively unused Intel-only x86 extension.
The bigger problem is that Unixes and Windows are pretty bad at sandboxing syscalls by default.
People in the hardware engineering community knew of their existence long ago. It was just that they weren't practically exploitable, which kept them out of the headlines.
But now we have a multibillion-buck virtualised hosting industry, and JS with a JIT in every browser: a huge incentive for black hats to poke into architectural vulnerabilities.
Google wasn't nearly as big a deal back in the 90's (hell, they were only really around the last couple years of the decade) so that's a weird association to make.
No, unfortunately. Side channels will work even across NUMA domains and cores with their own memory.
Side-channel-free hardware is extremely hard to do even in the simplest ISAs specially made for that purpose. Look at how the credit card industry keeps struggling to make safe smartcards.
And we should be able to eliminate anything weaker.