Things you must know before you can understand Meltdown as a developer (razorpay.com)
235 points by captn3m0 on Jan 12, 2018 | 94 comments

As others have already said, if you're a programmer, please just read the original papers:

https://meltdownattack.com/meltdown.pdf (start with this one)


They are extremely well written, clear and to the point. Understanding them will take you less time than trying to get rid of all the tortured analogies and unnecessary simplifications people have been trying to make up over the past week. It's bad enough that we face the daunting task of explaining this stuff to people who don't care about computers, there's no need to perpetuate misunderstanding among those who deal with computers for a living. Just read the real thing.

And on the subject of explaining this to others, it might surprise you how far you can get if you try to honestly explain how the attacks work. I refuse to use the silly train station metaphors, so I tried to describe the basic idea of how speculative execution works in out-of-order CPUs to my parents (who can browse the Internet, with some effort, and were patient enough to listen to me for 10 minutes or so). I don't think I got the notion of return-oriented programming across very well, but the basic idea of Meltdown and side channel timing attacks in general is actually very easy to convey on the basis of a reasonably simplified picture of a CPU - you need to explain the basic role of cache memory, virtual vs physical addressing, the TLB and the basic notion of branch prediction. That's all you need to understand the principle of how the attacks work, if not the details of the implementation.

Amen -- and thank you. Of these, the Meltdown paper is especially well-written. One of my laments of this whole fracas is that Spectre has drowned out Meltdown to a degree (and not without good reason). But Meltdown is in fact the much more egregious of the two: it represents an indisputable flaw in Intel's architecture; the system software work to workaround it is gnarly (and, depending on implementation choices made, potentially rife with serious performance side-effects); and exploits are readily available. This is not to minimize Spectre which is serious and difficult to mitigate, but rather to counter some of the implicit minimization of Meltdown.

Point is: keldaris is exactly right; please do read these papers -- and read them in the order of Meltdown before Spectre. If you completely understand the Meltdown paper, the Spectre paper will be much more accessible.

I agree with the point that it's advisable to start with the Meltdown paper, and have amended my post accordingly. That flaw is much more concise in its essence and much of the context is shared, so it's an easier read.

>" It's bad enough that we face the daunting task of explaining this stuff to people who don't care about computers, there's no need to perpetuate misunderstanding among those who deal with computers for a living. Just read the real thing."

How did the author of the blog post perpetuate a misunderstanding? What is wrong with reading multiple sources and discussions on a subject? Isn't that what we are doing here to some extent?

Because a definitive white paper was published this should be our only allowed source of understanding?

It's quite possible that someone could read this and then the lengthier white paper and find them complementary. Not every programmer starts with the same level of hardware and microarchitectural knowledge.

>"And on the subject of explaining this to others, it might surprise you how far you can get if you try to honestly explain how the attacks work."

Isn't this exactly what the author of this article did? In seeking to understand the problem, they formulated their understanding into a blog post.

It seems to me that's exactly what they did, and there were no "tortured analogies" in this individual's blog post.

> How did the author of the blog post perpetuate a misunderstanding?

By misunderstanding the fundamentals of how Meltdown works - specifically, by confusing out-of-order execution with speculative execution, and then basing their explanation on that misunderstanding. My further explanation elsewhere in this thread: https://news.ycombinator.com/item?id=16133668

I used to like reading papers, but it's been quite some time since I gave myself the dedicated time to do this. I only allow myself 5 minutes on any topic before I'm off to something else. Someone posted the links to the papers, and after looking at the intro I realized I had everything I needed to understand the Meltdown paper (and it seems, Spectre) from my grad architecture class (shoutout to Tomasulo!). Sitting down in the evening and reading through the Meltdown paper was such a nice change of pace from my normal routine. I took notes and wrote down questions about parts that didn't make sense, and when I got to the end I found it so much more rewarding than reading some pop tech synopsis. I don't mean this in a superior way, or that I'm special because I understood the paper; it was more about stretching my CS legs in a way I don't on a daily basis as a developer.

I might have less of an understanding about the real world implications until I look at the impacts, but people around me have been talking about it (at work, etc) and I realized they all had a pretty limited understanding of the vulnerability and how it worked (not because they aren't smart, they just didn't have time to read the actual paper and are more focused on how this affects us at work).

I want to make this a part of my normal routine, every week or so to pick an interesting, well written paper on a meaningful CS topic and have a leisurely, scholarly read like I used to do in school (well, in school there wasn't too much leisure to it).

Don't be too quick to dismiss metaphor based explanations. You'd probably be surprised how many developers need the metaphor.

Similarly, many folks explaining this are likely doing so as much to increase their understanding as anything else. If there are problems in their explanation, help make it clearer.

None of this is to say that folks should skip the originals. But it is also not helpful to constantly send folks back to them.

If I see a programmer saying he/she tried to read the original papers, but couldn't understand this and that, I'll do my best to help them. I don't have a problem with people using metaphors to explain specific points, but I do think people who replace trying to work through the originals with repeating a lazy metaphor or two from the popular press are doing themselves and others a huge disservice.

This isn't something unattainably complicated that you need years of education (on top of a typical CS education or its self-taught equivalent) to understand. I think it's worth pointing out to people that they shouldn't be afraid of reading the original material. And frankly, I've found many of the popular metaphors harder to understand than the papers.

As I am trying to say in the sibling post, though: metaphors aren't necessarily lazy. It is quite literally how many people think through these problems. Simply put, your best way to help them may be to get common footing using one of these "lazy metaphors."

And again, I'm just cautioning if you are offhandedly dismissing metaphor. Tone can be lost in forums, obviously, so don't think I am "correcting" you. I do not intend it that way.

As a physicist, I'm quite used to having to resort to metaphors to convey some ideas to a non-technical audience. If a non-physicist asks me to explain AdS/CFT correspondence, I'm not going to be able to explain precisely why the correspondence requires the particular geometry of an anti-de Sitter space to work in a few minutes because the mathematical background required is just too deep. So I don't think metaphor-based explanations are necessarily bad, but I do think they are overused.

It's one thing to use a metaphor to convey some essential property of a complex idea to non-experts. It's quite another to see domain experts argue about the (often non-trivial) details of strained metaphors far beyond their area of applicability. Without demeaning the usefulness of metaphors, I'm essentially arguing that: 1) domain experts should be far more cognizant of the limitations of the metaphors they employ, and 2) domain experts should at the very least try to read the original material before perpetuating solely metaphor-based explanations.

I don't really disagree with your two points. It is the rhetoric of "resort to metaphors" that rings a bell to me.

I agree the shared metaphors of notation and common jargon should be preferred as a starting point. I just don't have faith that they are the clear way out of misunderstanding. More, I think it is unrealistic to think this is something that should only be done with "outsiders". Rather, I think we are all using some common language, and shaking things up with new language can shake out misunderstandings that were simply not being voiced.

Off topic, but can you recommend any resources for getting into the subject of AdS/CFT correspondence? I’m loosely familiar with it, but I’d love more depth. Thanks for any links or hints!

Since you asked for more depth, I'll focus on the (comparatively easier) task of giving some links to technical literature. First of all, if you already have a traditional QFT background, there's the two extremes of either just taking a simple toy example to play with [1] or a fairly rigorous well written lecture course [2]. For a more general physics/math background I'd recommend a slightly more pedagogical treatment [3], but it's still accurate to the subject matter and therefore contains graduate level mathematics. There's a lot more out there, but these are just some of the descriptions I've personally skimmed and found them to be well written and useful.

On the other end of the scale, there's a huge amount of popular documentaries out there that purport to say something about AdS/CFT. Most of them either contain nonsense or don't say very much at all. I don't really know how useful the popular analogies really are, but in the spirit of making at least one recommendation that doesn't contain any math, Susskind's public lectures are easy to watch (here's one [4]), and Susskind is careful not to talk nonsense.

There isn't really much on AdS/CFT that's in between graduate level mathematics and math-less popular stuff. The reason is that it's a fairly irreducible concept in the sense that you really need to understand what an anti-de Sitter space is and how conformal field theories work to see why there's a mathematical correspondence between them, and this isn't truly analogous to anything simpler. That's why it's very hard to explain beyond the level of vague metaphors without being rigorous about it.

[1] https://arxiv.org/pdf/hep-th/0403110.pdf

[2] https://arxiv.org/pdf/hep-th/0201253.pdf

[3] https://arxiv.org/pdf/1310.4319.pdf

[4] https://www.youtube.com/watch?v=2DIl3Hfh9tY

This is precisely what I was hoping for, and I just cannot thank you enough for taking the time to do this. I really appreciate you giving me a full range of resources too.

Glad I could help! AdS/CFT is a very elegant idea in modern physics, I hope you enjoy studying it.

The problem is that some of the metaphor based explanations are wrong, and are decreasing understanding. When the originals are clearer than the attempted simplifications, then I do think it's helpful to send people to them.

They are clearer to you. They are obviously not clear to everyone. As evidence I give you people that read the originals and form incorrect metaphors. :)

That is, these folks are building these metaphors as their attempt to understand the issues. If they simply silently kept them to themselves, you wouldn't even know they needed them corrected. Be clear on this point, people think in these metaphors whether they are read to them or not. That you did not is a trait of you. Not a universal.

So, please correct them. And keep encouraging folks to go to the originals. But expect that not everyone can read them as clearly as you can. And use the evidence of the poor metaphors to confirm that. :)

That is fair, but what we're left with are many more wrong explanations, and many of these wrong explanations are not more complicated than the correct one. And my complaint is not just the wrong metaphors, but the wrong technical explanations (such as, unfortunately, the submitted one, which I explain in another comment on this thread).

Agreed that that is problematic. I'm asserting that it is just now visibly problematic. That is, without people making these posts, it is likely many people were still reaching these viewpoints.

Having the update at the post that says to read the original is vital. Getting more of the "peer review" mentality in all posts is also key, provided that people don't treat their posts as immutable and actually correct things that are in need of correcting.

I understand your points, I'm just skeptical it means we end up with less total confusion. Particularly since I don't think most of the confused texts will be corrected, lying in wait to confuse others.

Yeah, a cornerstone of my assertion has to be that references to original works are present and that documents are updated on errors found. Two things that I grant are not always the case.

I don't feel this is much different from any other pedagogical practice, though. How many of us come at "imaginary" numbers with a given metaphor that severely fails us in some understandings?

My primary concern is not actually bad metaphors, but out-and-out wrong explanations (such as this one).

I have a question: the attacks were apparently found by the authors of these papers and project zero very close to each other. Why is it that a bug that has existed so long was suddenly found by both groups so close to each other? Are there other papers or ideas that this is based on, or was there some other reason they both tried to look for these kinds of bugs?

This happens all the time. In this particular case, speculative execution timing attacks were "in the air" for the last year. A little while ago the C.W. was that Anders Fogh had kicked this off with his blog post from last summer, but Fogh actually posted a chronology taking this work all the way back to December '16:


Once there's blood in the water, people race to find exploitable flaws (that's the goal of the game), and so it's not surprising that you'd get multiple teams disclosing, especially with something this egregious. Also: there's a Nyquist Frequency thing happening here: remember that we're dealing with months-long embargoes. So there's a lot of time for people to have found these bugs "separately", and all we're really seeing is a colliding disclosure.

But having said all that: straight-up collisions happen a lot. We all have favorite stories. My favorite is when Vitaly McLain (then at Matasano, now one of my partners at Latacora) found an nginx bug that was identical to Heartbleed, 2 years before Heartbleed was disclosed. A fantastic bug. We were on a client engagement, so we had to coordinate with the client before reporting it upstream, and in the one hour it took to do that, someone else reported the same bug.

Side channel timing attacks have been an active research topic for many years. I'm not a member of that community and I'm sure there are many people on HN who are far better placed to explain the context than I am, but from what I've seen the momentum around practical side channel exploits has been building up for over a year now, there's been a number of public talks on the subject at conferences and consequently several research groups were motivated to pursue similar lines of research at roughly the same time. It's also a fairly tight knit community and I'm sure they exchange information informally as well.

I know you asked for specific citations, but I'd rather let someone who actually works in this area respond. I've read a few of the earlier articles on these subjects, but I'm not qualified to detail the full context surrounding the earlier research.

It could be the case that one group found out about the other's work and decided to pursue a similar research direction with the hope of being first to publish.

Putting on my tinfoil hat for a minute, I suspect that it was well known and in use by various 3-letter agencies for years, and somehow they got wind of a foreign nation either exploiting this as well or about to expose it publicly. Their only move was to let a respectable entity like Google "discover" it.

Things you must know before you can understand Meltdown:

* The memory hierarchy (registers, cache, memory); really all programmers always need to know the memory hierarchy and Meltdown just sort of reinforces that.

* The basics of kernel memory management (kernel memory is mapped into userland processes and protected by page table permissions checks).

* Very basic assembly language (basically what a variable assignment and an "if" statement compile down to).

* The idea of pipelined CPUs, the idea that on modern CPUs the registers you see in assembly instructions are actually renamed from a larger invisible register file, and the distinction between instruction execution and retirement.

If you've got this I think you can just read the paper: https://meltdownattack.com/meltdown.pdf. It's really well written. In particular: I don't think you need to understand much about timing attacks. The Flush+Reload paper (you can just Google it, it'll be the first result) is also really well written, but you'll be fine in the Meltdown paper without having read it.
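To make the covert-channel part concrete, here is a toy Python simulation of the idea (all names here are invented for illustration; there is no real transient execution or cache involved - the "cache" is just a set, and the real attack times loads instead of checking membership):

```python
# Toy simulation of Meltdown's cache covert channel. In the real attack,
# a transiently executed read grabs a secret byte it has no permission to
# read, then touches probe_array[secret * 4096]; the faulting read is
# rolled back architecturally, but the cache footprint survives.

PAGE = 4096  # one probe slot per page so slots don't share a cache line

def transient_leak(secret: int, cache: set) -> None:
    # Models the transient access: the only lasting effect is which
    # probe slot ends up cached.
    cache.add(secret * PAGE)

def recover_byte(cache: set) -> int:
    # Flush+Reload phase: probe all 256 slots and see which one is "fast".
    for byte in range(256):
        if byte * PAGE in cache:  # a real attacker times the load instead
            return byte
    raise RuntimeError("no cached slot found")

cache = set()              # starts flushed (clflush in the real attack)
transient_leak(0x42, cache)
print(hex(recover_byte(cache)))  # → 0x42
```

The simulation recovers one byte per round; the paper's attack simply repeats this over the address range it wants to dump.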

I tried reading the Flush+Reload paper several times, and watched a couple of the author's talks on it as well. I still haven't come out with a halfway decent understanding of how the timing attack works, which seems to be the most difficult and interesting part to me. It seems like it's well understood in the security community, so it gets glossed over when referenced. How they actually manage to read data out of an evicted cache line remains a mystery to me.

They're not reading data out of the cache line. Often the contents of the cache are public anyway.

What they're detecting is whether a piece of memory is in the cache or not. This lets them infer the contents of some other piece of memory.

For example, an if-statement might check whether or not a secret bit is set, and that might lead the process to call function A or function B. By detecting whether it's A or B that lands in the instruction cache, you can infer the value of the secret bit.
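That inference can be sketched in a few lines (again a pure simulation with made-up names; a real Flush+Reload attacker times instruction fetches rather than checking a set):

```python
# The victim branches on a secret bit and calls function A or B; the
# attacker flushes both from the (simulated) instruction cache, lets the
# victim run, then checks which one got pulled back in.

def victim(secret_bit: int, icache: set) -> None:
    if secret_bit:
        icache.add("function_A")  # calling A loads its code into cache
    else:
        icache.add("function_B")

def infer_bit(icache: set) -> int:
    # Reload phase: whichever function is now cached reveals which
    # branch the victim took - without ever reading the secret directly.
    return 1 if "function_A" in icache else 0

icache = set()   # Flush phase: both functions evicted
victim(1, icache)
print(infer_bit(icache))  # → 1
```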

Is it the timing mechanism you have trouble with, or the timing target? Flush+Reload is (to me) an unusually clear paper (it's an engineering paper, which is probably why it wound up at Usenix). But even in the paper, the actual target (not just understanding square-and-multiply but also how that gets translated into cache hits) is tricky.

The nice thing about Meltdown and Spectre is that the cache hits are less tricky to understand; they're engineered specifically to make the exploit work.

[I had to go back and reread it a couple of times...naturally :)]

I guess part of what bothered me is what makes it well written: there is so much of the discussion spent on background, which felt like stating the obvious to me. It wasn't clear to me how specific the conditions needed to be for the attack. They use GnuPG as an example, and ostensibly rely on knowing beforehand the algorithms that the decryption and encryption functions use. With knowledge of the implementation, they're able to trace execution and subsequently infer each bit of the victim data that they want to probe. They also need to know the victim's cache characteristics: hierarchy and timing.

It's a far cry from arbitrarily reading memory on an arbitrary victim.

I don't understand one part. If you read from an arbitrary memory location (during speculative execution, I get all that) how does that read pull data from a different process? Aren't all addresses virtual until they go through the MMU and get translated to a physical address depending on the process?

Or does this work only because the kernel exists in the same virtual address space, hence KPTI as a mitigation?

Yes, Meltdown only works if the kernel is in the same address space. Also, the kernel maps the whole of physical memory as part of its address space somewhere (the kernel physical map), so if you can read kernel memory, you can read all physical memory.

And you are right, KPTI is a full fix for Meltdown (but not Spectre).

It is typical for 64 bit machines to have a direct mapping of physical memory at some virtual offset. I.e., virtual 0xfffff...700000 corresponds to 0x0 physical. This simplifies some things, since you can allocate and access any physical page without creating a new mapping.

32 bit machines do not typically have sufficient kernel address space to do the same.
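In sketch form, the direct map is just a fixed offset between physical and kernel-virtual addresses (the base address below is hypothetical; the real value depends on the kernel's memory layout and KASLR):

```python
# Toy model of a kernel direct map: kernel VA = PA + fixed offset.
# DIRECT_MAP_BASE is an invented example value, not the real Linux one.
DIRECT_MAP_BASE = 0xFFFF_8800_0000_0000

def phys_to_virt(pa: int) -> int:
    # Any physical page is reachable through the direct map with no
    # page-table manipulation: just add the offset.
    return DIRECT_MAP_BASE + pa

def virt_to_phys(va: int) -> int:
    return va - DIRECT_MAP_BASE

pa = 0x1000
assert virt_to_phys(phys_to_virt(pa)) == pa
```

This is also why Meltdown plus the direct map is so damaging: an arbitrary kernel-memory read becomes an arbitrary physical-memory read.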

(Oh, linux still uses direct map on 32 bit machines even today, but only maps some memory? I thought that was abandoned, but wouldn't really know. Anyway, a much better explanation of all things direct map is https://www.sceen.net/mapping-physical-memory-directly/)

>"This simplifies some things since you can allocate and access any physical page without creating new mapping."

I am curious: what types of things does this simplify for the kernel? When is a physical page allocation ever done that doesn't need to be entered into a page table entry?

ptrace of another process, for instance.

Can you elaborate? If I strace a process which makes a ptrace system call that is just another userland process and my userland process has a page table entry just like any other userland process.

You call ptrace and ask to read another process's memory. The kernel has to turn the requested VA in that address space into a physical page (walking those page tables, not yours) then copy the memory (from some VA in kernel space). This is quicker if you can turn a PA into a VA by simply adding an offset.

I see. I misinterpreted your previous comment. Cheers.

Tried to write a meltdown explanation for "everyday" developers. There are some loose analogies and inexact writing. (Please point out mistakes, they're mine)

Here’s the thing. I read the Meltdown and Spectre papers after getting frustrated with the hand-wavy descriptions people were writing about the vulnerabilities.

I wish I’d just read them in the first place. They aren’t that long and they aren’t that hard to follow. So I recommend developers read the original papers.

That said: after reading the papers trying to explain the issues to someone else will test your own understanding, so I don’t mean to dismiss what you’ve written.

Aside: I find “so you don’t have to” in the title to be a little off putting.

I agree, I sat around at work (@Google) for an hour having my brain muddled by 3 or 4 extremely more-intelligent-than-me engineers giving their explanations, and then sat down and read the very clear paper (complete with example C code) and understood it myself in 10 minutes.

Would you please link to the original paper? Been looking for it myself

Great stuff! Thanks

Here's the PDF: https://meltdownattack.com/meltdown.pdf (I found it extremely readable as well, adding a note on the post)

I might be wrong about any of these points. Here goes:

* I don't think Meltdown depends on eager (both-branch) speculative execution.

* Memory access during speculative execution almost has to work (and does on AMD and ARM), so that's not the problem. The problem is that permission checks on memory pages are done asynchronously on Intel, and may not abort execution until after footprints have been left in cache.

* You can't use memory writes to store the locations of transient memory reads because the instructions are transient, will never be retired, and so can't affect the architectural state (things overtly visible to programs). It's not that the CPU designers realized and specifically prevented that line of attack from working.

* It's not that cache state doesn't seem to be rolled back; it's that you can't roll it back. Modern computers are themselves small distributed systems. Changes on shared caches have to be coordinated. You'd be trading one race condition for another.

* Exception suppression isn't necessary for the leak itself; it just keeps your process from crashing, and crashing is fine: you just run more processes. It might not be necessary to suppress exceptions at all, except that exception handling adds overhead and thus noise to your measurements. Also: the suppression technique you describe here is an oversimplification of Spectre; in reality, Meltdown deals with this with signal handlers or (probably better still) TSX.

Finally: I'd recommend not telling people to avoid the actual paper. Your summary is about as technical as Meltdown's is anyways. It's a great paper; more people should read it.

I don't think Meltdown depends on eager (both-branch) speculative execution.

I think you are being overly gentle responding to this flaw in the explanation. Not only does the bug not depend on both-branch speculative execution, I don't think there even exist any x64 processors that do both-branch speculative execution.

I'm almost certain that all modern Intel x64 processors do branch-prediction and speculatively execute only the branch that they predict as most likely. If that guess is later proved wrong, they throw away the executed but not-yet-retired instructions and execute the correct path.

Am I wrong about this?

I think the Spectre paper stated that the CPU doesn't run both branches at the same time. Instead it picks one based on speculation and runs that branch. If it picked wrong, it rolls back, and only then does it start running the other branch.

The article states differently. If I'm right, that would also require some Spectre-like attack to make sure the desired branch is actually executed speculatively. Alternatively, maybe you can catch the exception and use that to indicate the desired branch was run.

You're correct that's how speculative execution works: it picks one branch, it does not pick both branches. But that's the Spectre vulnerability, while this article is trying to explain Meltdown, and Meltdown does not involve speculative execution. (See my other comment in this thread for a full explanation.)

Thanks, your comment was really clear :)

Nice write up. One of our guys wrote a similar piece. https://www.carvesystems.com/blog/meltdown-spectre.html

One of the biggest challenges in describing low level systems vulnerabilities like these is that you have to actually learn at least a little about the CPU internals, and a high level explanation only gets you so far. I think adding a "why this matters" to your post is helpful.

This is helpful, thanks. I'll update this post with your recommendation.

Planning to read the Spectre paper next week and do another blog post.

Edit: Just read your blog post, definitely need to add a "what's this about", and a "what you should do" section.

The first section on cache timing seems off.

"The one which is being read is cached, and the exception will be raised much faster as a result."

I only read the project zero post, not the paper, but I don't recall it having anything to do with timing of exceptions. I don't see the point or relevance of this section.

Might be missing something.

Edit: mixed up Meltdown with Spectre, I'll take a closer look at this later

The "flush+reload" mentioned in the Project Zero blog[0] is a variant of cache-timing attacks. The original Flush+Reload paper has more details[1], but I intentionally didn't delve into the specifics of Flush+Reload. The description I've used is actually closer to Evict+Time. If you're interested, I recommend this presentation[2] for more details on this category of attacks.

[0]: https://googleprojectzero.blogspot.in/2018/01/reading-privil...

[1]: https://eprint.iacr.org/2013/448.pdf

[2]: https://conference.hitb.org/hitbsecconf2016ams/materials/D2T... (Youtube: https://youtu.be/UH6dFbiX_hM)

The attack relies on side channel attacks, specifically Flush+Reload, which uses timing to find accessed memory locations in another program, as stated in both the paper and this summary. It is definitely relevant.

How easy or likely is a Meltdown attack to succeed against a moderately protected PC, or say a VMware cluster? These kinds of things seem hard to pin down. And if all one can do is READ, exactly what is there to gain? It seems that the machine would already have to have been compromised in another way to get the memory that has been read off and out of the computer system.

You browse the web on your moderately protected PC? You likely run JavaScript. Nice logins and passwords to all your sites and banking stuff you have there...

You connect to a wifi hotspot at the cafe with your moderately protected laptop or phone? It likely runs the JavaScript on the connect page, and all the browsing you do afterwards too. Nice passwords and logins, and I see you use this laptop for private banking too... Thanks!

You run your moderately protected VM on a cloud provider? So do I. In fact, mine runs on the same hardware as yours ... Nice private key you had there...

The problem is that it bypasses sandboxes and isolation features... normally, JavaScript running in a VM in a sandbox in your browser cannot read all of your memory. With meltdown, that could be possible. Although for that scenario, you need to combine Meltdown with Spectre variant 1, allowing you to read arbitrary kernel memory from JavaScript in a browser.

There are known PoCs for Meltdown using JS, which is part of what made this so scary. Heartbleed was far worse in comparison, since it was remotely exploitable, but the JS vectors still make Meltdown frightening.

Where? We've been told it is possible, but I have yet to see a JavaScript exploit that wasn't basically a canned demo.

I doubt it is possible to do in JavaScript. Timing cache access is a challenging task for such a high-level language. The key to the attack is figuring out the latency of memory access.

A JavaScript app that is dealing with 100 layers of intermediate code before it actually gets to the physical memory could not see a difference between reading from actual memory and reading from cache. It is too slow to notice any change. It would have to be pure assembler code to reliably measure the effect of caching.

If you are already running untrusted binaries, there are bigger issues. Without a JS exploit, I'm not sure this is a big problem.

And we haven't seen a real world binary version either. The versions I've seen all take running starts so to speak.

Some CPUs can fetch data from memory they are not currently entitled to fetch. The permissions checking is done in parallel with the fetch, so a fetch and even some use of the data can take place. The result can't propagate back as a result to the program, if the retirement unit is doing its job, but can affect cache loads.

So how early does that chain of events have to be stopped? If it's stopped before the unwanted fetch, security is sound - the CPU never pulls in the data it shouldn't see. Future CPU designs are probably going to have to do that, even at some cost in performance (but look for complicated explanations from Intel as to why this isn't really necessary). That may require more permission info in the various tables and caches of the memory system.

Even if the memory interface looks at page permissions earlier, there's the possibility of using this attack to peek at data in the same address space, data protected only by checks in the code. This may allow snooping around within application programs such as browsers.

It used to be that you only worried about timing issues for speculative execution in crypto code. It's important that strong encryption code take constant time regardless of the data. Otherwise, timing measurements of known-plaintext attacks may yield info about the key. Now it's a broader problem.
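For illustration, the standard defense in crypto code is to make comparisons take time that depends only on the input length, never on the data. A minimal constant-time comparison sketch (`ct_compare` is a made-up name, not from any particular library):

```c
#include <stddef.h>
#include <stdint.h>

/* Compare two buffers in time that depends only on len, never on
 * where (or whether) they differ: no early exit, no data-dependent
 * branches. Returns 0 iff the buffers are equal. */
int ct_compare(const uint8_t *a, const uint8_t *b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= a[i] ^ b[i];   /* accumulate differences branch-free */
    return diff != 0;
}
```

A naive `memcmp`-style loop that returns at the first mismatch leaks, via timing, how long the matching prefix is -- the classic known-plaintext timing leak the comment above describes.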

Bleah. Fortunately, my CPU designer friends are all retired now and don't have to deal with this.

Great analogy, explanation, and illustrations. Thanks for sharing.

How many bits per second (or kbps, mbps, etc) of memory reading is possible with Meltdown when run from JS vs running natively?

Somewhat related, is it possible to neuter the JS engines in Firefox or Chrome so that they don't JIT JS and would doing so have any real world impact on mitigating this attack? If it relies on speedy execution to be possible maybe a solution would be to have a NeuterScript extension that deliberately slows things down.

To do it fast enough pegs the browser at 100% CPU for that thread. I saw a demo on Twitter where they were not able to reliably extract information from within the browser. Native code (i.e., not running inside a browser) has been shown to extract information reliably. My guess is it's the extreme amount of overhead required to run JS in an isolated-ish way.

I think the point is that a JS-based attack, while possible, wouldn't be much use outside a proof of concept.

Related to your second question: both Chrome and Firefox are deliberately reducing the resolution of the performance.now API to make precision timing attacks harder.

Unfortunately, this gets some big things wrong. Meltdown is not about speculative execution. (Spectre is.) Meltdown is about out-of-order execution - no branches required. The authors are clear about this in the paper. From Section 2.1:

"In practice, CPUs supporting out-of-order execution support running operations speculatively to the extent that the processor’s out-of-order logic processes instructions before the CPU is certain whether the instruction will be needed and committed. In this paper, we refer to speculative execution in a more restricted meaning, where it refers to an instruction sequence following a branch, and use the term out-of-order execution to refer to any way of getting an operation executed before the processor has committed the results of all prior instructions."

In this explanation, the author starts by showing two different code branches, which is misleading. Meltdown does not require code branches - which is what makes it so surprising. This is the C code example from the paper:

  raise_exception();
  // the line below is never reached
  access(probe_array[data * 4096]);
No branches: you have an exception, and then in the code following that exception, you have some memory access. Despite the exception, the access happens because of out-of-order execution. The actual exploit is, in assembly:

  ; rcx = kernel address
  ; rbx = probe array
  retry:
  mov al, byte [rcx]
  shl rax, 0xc
  jz retry
  mov rbx, qword [rbx + rax]
The exception is raised by the first mov instruction, as it loads from a kernel address. This exception will eventually cause the processor to abandon all of the code it is currently executing, and the program will terminate with a segmentation fault. But there is a race condition: after the memory has been accessed, but before the processor deals with the exception, the second mov executes, using the data from the faulting load. This shouldn't matter, since execution is abandoned, but a cache line is brought in at an address derived from that value, and using a side-channel attack we can figure out what the value was. From the paper:
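The side-channel step that recovers the value is a Flush+Reload probe: time a read of each of the 256 probe-array pages, and the one that loads fast reveals the secret byte. A sketch in C; the x86 intrinsics, the cycle threshold, and the function names are all assumptions for illustration, not code from the paper:

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_clflush */

#define PAGE 4096
#define HIT_THRESHOLD 120  /* cycles; must be tuned per machine */

/* Time a single load of *p in CPU cycles. */
static uint64_t probe_time(volatile uint8_t *p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - t0;
}

/* Flush all 256 probe pages so none start out cached. */
void flush_probe(uint8_t *probe_array) {
    for (int i = 0; i < 256; i++)
        _mm_clflush(&probe_array[i * PAGE]);
}

/* After the transient access has run, at most one page loads fast:
 * its index is the secret byte. Returns -1 if nothing was cached. */
int recover_byte(uint8_t *probe_array) {
    for (int guess = 0; guess < 256; guess++)
        if (probe_time(&probe_array[guess * PAGE]) < HIT_THRESHOLD)
            return guess;
    return -1;
}
```

In the real exploit, the transient `mov rbx, qword [rbx + rax]` in the listing above is what pulls `probe_array[data * 4096]` into the cache; a loop like `recover_byte` then reads the value back out through the timing side channel.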

"To load data from the main memory into a register, the data in the main memory is referenced using a virtual address. In parallel to translating a virtual address into a physical address, the CPU also checks the permission bits of the virtual address, i.e., whether this virtual address is user accessible or only accessible by the kernel. As already discussed in Section 2.2, this hardware-based isolation through a permission bit is considered secure and recommended by the hardware vendors. Hence, modern operating systems always map the entire kernel into the virtual address space of every user process.

As a consequence, all kernel addresses lead to a valid physical address when translating them, and the CPU can access the content of such addresses. The only difference to accessing a user space address is that the CPU raises an exception as the current permission level does not allow to access such an address. Hence, the user space cannot simply read the contents of such an address. However, Meltdown exploits the out-of-order execution of modern CPUs, which still executes instructions in the small time window between the illegal memory access and the raising of the exception."

I find the paper to be very readable. They give a good overview of modern computer architecture, and then walk through all of the steps of their attack. I highly recommend reading it: https://meltdownattack.com/meltdown.pdf

Thanks for pointing this out. I'll be updating the post shortly with changes. I've added a note about going to the source and reading the paper alongside.

For others on this thread: +1 on the above recommendation for reading the paper itself. It is very well written and accessible. If you've read the blog post, you know pretty much everything you need to understand the paper.

Note that this includes your explanation with the check_function() - that is not part of the exploit. The branches in their assembly are only about dealing with the zeros.

And, to reiterate: any explanation of Meltdown that depends on branches is incorrect. It's not enough to just use the phrase "out-of-order". All of your examples with if-statements need to change.

What you are describing is literally speculation: the CPU executes the 2nd mov because it speculates that it will occur. The choice to throw an exception or not is a branch.

"Speculative execution" is a term of art with a specific meaning. I can serve you a warmed-up Hershey's bar, and it is literally "hot chocolate", but it is not what anyone would expect from that description. When discussing technical matters, it is important to use the right technical terms - which is why the Meltdown and Spectre authors were so careful.

But more important than the term is that the submitted description explains the Meltdown attack in terms of branch instructions, which is not how it works. A reader of the submitted description will come away with an incorrect understanding of what actually happens.

Not really - as I said, any instruction which can theoretically raise an exception is a branch. It isn’t the proper terminology but it gives the right idea.

In the domain we’re in, “branch” means a branch instruction. If we used your definition, then any load or store instruction would count as a branch, which is not how people use the term. It gives the wrong idea, as exemplified by the fact that the submitted article gets it wrong.

If you aren’t a cpu designer, then any split between paths of execution is a branch. It gives the right idea, even if the terminology isn’t precise.

I'm not a "cpu designer," but I am a computer scientist, and when I say "branch," my colleagues expect that I'm talking about an actual branch instruction.

This is not a great hill to die on. There are "exception branch tables" and "interrupt branch tables" to configure the implicit branch taken by (say) an interrupt, which acts like a sort of invisible branch; also, on some architectures, rather than "call" you have, literally, "branch and link". You made the point well above when you pointed out that the code paths don't depend on conditional branches that are predicted and explicitly speculated on. I don't think the rest of the semantics are worth litigating.

It is an actual branch - the fact that there's a special instruction for it is an implementation detail. In computer science there is no difference between throwing an error and skipping the rest of the instructions in the stream.

If you wrote a vm with a cpu simulation you would implement it as a branch.
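As a sketch of that claim (a toy interpreter, not any real VM's design -- the opcodes, handler address, and privilege boundary below are all made up): in the simulation, a faulting load doesn't "throw" anything special, it simply redirects the program counter, exactly like a taken branch.

```c
#include <stdint.h>

enum { OP_LOAD, OP_HALT };
#define KERNEL_BASE 0x8000u  /* addresses at or above this fault */
#define EXC_HANDLER 100      /* pc of the exception handler */

typedef struct { uint8_t op; uint32_t addr; } Insn;

/* Run until HALT, returning the final pc. A LOAD that touches a
 * privileged address just sets pc to the handler address -- i.e.,
 * the "exception" is implemented as a branch. */
uint32_t run(const Insn *prog) {
    uint32_t pc = 0;
    for (;;) {
        Insn in = prog[pc];
        switch (in.op) {
        case OP_LOAD:
            if (in.addr >= KERNEL_BASE) { pc = EXC_HANDLER; break; }
            pc++;
            break;
        case OP_HALT:
            return pc;
        }
    }
}
```

Whether that makes exceptions "branches" in the architectural sense is the point being disputed above; in the simulator, at least, they look the same.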

Thanks for this, and for recommending going straight to the paper. It's vastly more readable than most of the simplified explanations of the attack.

Here is the same in 10 lines of pseudo-code


> var BASE_ADR=zzzz; // Make BASE point to a random 256 bit currently uncached memory block that the application has access to;

That should be 256 bytes.

Yep, corrected. In the end it is a bit more complex than that, as the cache is populated in 32-128 byte increments, but I left that out for simplicity.

Suggestion - this logo:


Should probably say "Intel" somewhere ;)

That doesn't seem completely fair. The ARM Cortex-A75 is affected. Furthermore, the A72 and A57 also suffer from a different variant of Meltdown.

I have a couple of questions about Meltdown and the Intel chips. My understanding is that a key part of this is that, upon speculative execution, the page table permission checks only happen when the "transient" instruction is retired.

Was this simply a performance-engineering trade-off made by Intel? Would checking the PTE permissions during speculative execution give up the performance gained by speculating in the first place?

My naive expectation would have been that the CPU maintains some kind of process level isolation.

My new understanding is now that the concept of a process and isolation of processes is handled by the kernel.

This is probably a silly question, but maybe we could handle process isolation in the CPU somehow?

I doubt you’d want to implement the process scheduler in hardware, it would be very inflexible.

I am surprised this hasn’t been explained in terms of a vulnerability chain, i.e., broken up into parts. As soon as you have an oracle providing cache-timing info, you have a vulnerability.

Basic Bayesian analysis suggests that there is more fruit to fall off the tree.

I agree: as long as side channel attacks are possible, there's always the possibility that someone else will find some other vulnerability that can be combined to create an exploit. The paper (https://meltdownattack.com/meltdown.pdf) does present it in that way: they show you the parts, then show how they fit together for a workable exploit.

Cache line is 64 bytes on x86-64 if I am not mistaken, not 4096 :)

Nice read anyway.
