Surprising new feature in AMD Ryzen 3000 (agner.org)
411 points by diffuse_l 56 days ago | 148 comments

There have been rumors that Zen could do memory renaming [1], this pretty much confirms it.

[1] Basically the same as register renaming, but instead of using the register file to rename architectural registers, it can rename memory instead.

Well, they weren't just rumors; I measured this back in February and mentioned it in passing here [1].

Even at that point something similar was described on WikiChip (which I quote at the above link).

Long before that, going back to Zen 1, I saw talk of the memfile, although I never saw a microbenchmark measuring the effect on Zen 1: it seems much harder or impossible to trigger there (on Zen 2 it is very obvious: any basic store-then-forward loop using the same addressing expression will show a latency of 1).


[1] https://gist.github.com/travisdowns/bc9af3a0aaca5399824cf182...

Thanks; I either missed the discussion or forgot about it.

Interesting that the forwarding is claimed to be done under speculation. I wonder if it is really necessary. Also, is it value speculation or does it check the state of the cacheline?

Edit: thinking about it more, this is probably just the existing alias predictor that only cares about addresses.

I think it has to be speculative?

The optimization has to happen early, in the front-end (around allocation) at which point you'll not know if the load actually aliases the earlier store [1]. To get the zero latency, you take a guess and correct yourself if you are wrong.

The non-speculative approach is basically the existing pre-memory-renaming approach: you probe the store buffer looking for earlier stores to the same address. This has also gotten much faster (3 cycles total on current Intel), but that's still a lot slower than 0 cycles (or 1 cycle total on Zen 2).

> Also, is it value speculation or does it check the state of the cacheline?

I assume memory ordering is handled in the usual way, with an ordering nuke if a concurrency-related misordering is detected. As for how it's verified, I think it could either verify that the load actually came from the corresponding store (e.g., do the load and check that the SB entry used was the same as the predicted one), or it could verify that the value was the same: i.e., do the load and check that the value was the one used earlier based on the prediction.

The latter would succeed in some additional cases. It should be easy to test the two cases (essentially, Agner's add test with a different register, but adding 0 instead of 1).


[1] As a special case, you could prove that some loads alias some earlier stores when the addressing expression is the same (or can be proven equivalent), and there is no intermediate store, but evidently the AMD approach goes a ways beyond this since it allows for intermediate stores.

>The optimization has to happen early, in the front-end (around allocation) at which point you'll not know if the load actually aliases the earlier store

Yes, this is what I meant by alias predictor. It is probably a variation of the existing aliasing predictor that allows reads to be speculatively reordered before previous stores of unknown address, except that in this case you would roll back on address mismatch instead of match.

Well, the existing alias predictor is in quite a different part of the pipeline, around the scheduler and execution engine, and it gets feedback from the actual aliasing behavior.

That said, I also thought they might feed this info back to the front-end in order to implement this.

However, Agner's testing seems to show otherwise: it seems that "does it alias" is determined basically by a symbolic comparison of the addressing expressions. That is, [rax] matches [rax]. OTOH, [rax] doesn't match [rcx] even if those registers hold the same value. Also [rax + rcx * 2] matches [rax + rcx * 4] even though the scale factor is different! Perhaps it was too much work to compare the scale factors.

Example 20.3, where he incremented the value between the load and the store using a different addressing expression (but the same address) and it consistently took a misprediction, seems to indicate that they are not using an aliasing predictor in the usual sense, since this always-fails case should be easy to predict.

Overall, quite interesting and some surprises there.

Wonder what side channel attacks this will result in.

If the memory renaming cache isn't cleared, I can foresee Spectre-like attacks against that micro-architectural buffer. However, for a new feature like this hopefully they've figured out how to not leave that open…

The memory renaming cache is almost surely just the register file.

Figured I might as well mention it in case AMD was doing something totally wild, but I would agree that using the register file sounds likely :) I wonder how they manage the pressure on it between renaming and memory mirroring…

In principle, if you are using the GP PRF, it is possible to implement it so that there is little additional pressure: the store instruction's data input is already a register which has been renamed; now you just need to organize it so that the subsequent load that targets the same location is a no-op: simply alias the arch reg targeted by the load onto the existing physical register used by the store (much like mov elimination)

So except for a small window in the middle you have the same pressure on the PRF.

I don't know if that is how it is actually implemented, of course!

A problem for every solution.

It only (well, “only”) works for the stack, though?

No, I tried it with [rcx] as well and it works fine, i.e., no particular dependency on [rsp], at least in Zen2.

Although I guess it depends what you mean by "stack". My test shows that it isn't tied specifically to [rsp], the register used to address the stack, but I suppose it could still be tied to the region of memory the CPU heuristically determines is part of "the stack", regardless of what register is used to access it.

It's interesting that the author uses a forum as a personal blog

I've seen people mocked for it before. I think the people doing the mocking were small-minded idiots who couldn't understand that a forum with one top-level poster is isomorphic to a blog with comments.

But I must admit, the initial impression of a 'forum' where all the topics are from one person looks a little bit like it might be a crazy person debating with themselves. (In this particular case, it looks like people other than Agner also create topics). Ideally you'd maybe re-skin the forum?

I tried to start up a Slack for a hobby decidedly lacking in technophiles. A few people joined, and I put some seed comments into different channels to try to drum up some conversation. Did not work, so I stopped posting so I didn't become that crazy guy talking to himself.

KISS at its finest.

You can inline images and code easily. Easy to self-host, so you can own all the data, and you get solid support for comments so you can avoid data-peddling companies like Disqus.

> Easy to self-host

I wouldn't go that far. Forum software is notoriously annoying to configure securely and to maintain.

> you get solid support for comments

... at the price of people having to create Yet Another User/Password Set, and likely bother you at some point for user-admin tasks.

You definitely keep control of the data though, and some people really like forums.

> Forum software is notoriously annoying to configure securely and to maintain.

I'm not familiar with setting up forum servers specifically; are they difficult to set up for just one user? I can totally see how having proper authentication, email, caching, etc. could get complicated. I'm just curious if it's any easier for personal use. Obviously you lose the ability to receive comments though.

A lot of forum software is "old school" (i.e. some php, perl, or even vb), which means it's a missed patch away from being compromised. Because they are "standard" engines, they are discovered and attacked often and automatically by tools, like Wordpress - they are a prime spam target for linkfarming. They typically ship a lot of dangerous features like file uploads, and overall security practices are often subpar (integration passwords saved on disk, etc).

Setting a forum up for one user seems like a big waste of time when there are plenty of perfectly usable blog engines out there that are simpler and more secure to run. Unless, of course, one is already an expert in a particular forum engine.

> KISS at its finest.

It’s certainly an interesting way to do it. I’m not sure I’d set up a forum just to write notes though; there are far simpler methods. I think in this case there’s already a community on this forum and it’s easier to just write there than anywhere else.

For similar reasons, I'm probably gonna set up a small wiki instance as my personal blog. Nobody else can edit (unless I feel like extending an invitation), but it's a CMS I already know, markup I already know, has robust history tracking which I value, and makes categorization and linking trivial.

Whatever works? My personal favourite example of "hmm, I guess you can do that" is David Tolnay's publishing a blog as a Rust crate; all crates have their docs parsed and uploaded to docs.rs, so, well, why not¹:


1. lots of reasons, probably.

Ya, that is the first time I have seen that. Interesting idea.

The author also has a standard HTML page for content that doesn't require comments: https://www.agner.org/optimize/

Seems like a reasonable approach to allow and manage comments...and build and manage a community. It's a nice hack.

It works the best if you already run a forum and you want to write some related articles. Just open a subforum, and you get a blog with community traffic.

I did this ages ago for a website for my engineering class (now long gone). The focus of the website was a punbb instance, but it also had a front page with a weather widget, an announcements blog, etc. The announcements were just a date-ordered list of first-posts from threads on one of the forums. It was a simple enough thing, and it was easy to click through to the actual forum thread for the "comments" on any given story.

And as an added bonus, it isn't hosted on that awful medium site.

I did something very similar with Drupal in the past. Everything was in the underlying forum, but it was possible to promote some topics to homepage with full text and create RSS feeds for those. The site since has been shut down, but I think it was a neat solution.

The popular news site https://macrumors.com, for example, also stores its entries in an underlying forum.

The mega-popular webcomic Homestuck also used a phpbb forum to store all of its pages and images.


He also has an old version of the blog which is also a forum: https://www.agner.org/optimize/blog/

Interesting how a comment on style takes precedence over other comments on content

Trying to understand this.

Using latencies from Zen 1 instruction table (see https://www.agner.org/optimize/instruction_tables.pdf):

    mov dword [rsi], eax ; MOV m,r latency is 4
    add dword [rsi], 5 ; ADD m,i latency is 6
    mov ebx, dword [rsi] ; MOV r,m latency is 4
Total = 14

Each instruction depends on the result of the previous, so we need to sum all the latency figures to get the total cycle count. Is this right? How does Agner make it add up to 15?

Then for Zen 2:

    mov dword [rsi], eax ; MOV m,r latency is 0 (rather than 4,
                         ; because it is mirrored)
    add dword [rsi], 5 ; ADD m,i cannot find an entry for this.
                       ; Looks like there's a typo in the doc.
                       ; I guess the latency is 1.
    mov ebx, dword [rsi] ; MOV r,m latency is 0
Total = 1

Again, how does Agner make it add up to 2?

And for Intel Skylake:

    mov dword [rsi], eax ; MOV m,r latency is 2
    add dword [rsi], 5 ; ADD m,i - latency is 5
    mov ebx, dword [rsi] ; MOV r,m latency is 2
Total = 9

I think Zen 2 reasons about the operations and operands, converts them into something like these micro-ops, and then schedules them for a functional unit and the renamer:

  functional unit       renamer
  add tmp_reg, eax, 5
  st [rsi], tmp_reg     mov ebx, tmp_reg
Consequently, I don't think you can use the latency tables directly to get "2". I think Agner probably got those 15 and 2 aggregate numbers from measuring them directly rather than calculating them from the latency tables.

BTW, I'm hand waving on the micro-ops and FUs. There's probably some address generation, ... going on that I'm leaving out. I don't even think the renamer requires a translated micro-op. You could replace tmp_reg with renamed(ebx).

Now the downside. What happens at an exception? You have to back all of this optimization out.

> Now the downside. What happens at an exception? You have to back all of this optimization out.

Just what always happens at an exception. You just replay everything up until the exception occurred.

The author wrote in another thread that

"If anybody has access to the new Chinese Zhaoxin processor, I would very much like to test it."

Will be very interesting to see how much actual change Zhaoxin made to the VIA cores. I'd expect it to be minimal.

Another great thread from the same author: https://www.agner.org/forum/viewtopic.php?f=1&t=6

Interesting. L1 caches are fast, and even if compilers do register allocation they kind of rely on it being not-too-shitty when spilling (and many compilers for higher-level languages don't always invest much time in register allocation, since they might need to de-opt soon).

I'm curious if this change is an effect of more transistors (more space for a bigger register file) or if they're taking advantage with the microcode translation of the fact that most code doesn't use the SIMD vector registers and re-use unused parts of the register file for these memory aliases.

I'd bet dollars to peanuts that the temporaries are stored in the gpr prf. It's big, and it's sized so it's not usually a constraint, so they can just push memory operands in there when there is free space and when it would be useful. And most importantly, unlike the vector registers (or any other structure already in the CPU), it can directly feed the ALUs without any new datapaths.

gpr prf? General purpose register something register file? Private?

Probably physical register file as referenced by https://en.wikipedia.org/wiki/Register_file. PRF in this context refers to the actual memory that contains all of the register data, which can be much larger than the number of architectural registers to provide for register renaming, which can help performance (basically, the mapping from architectural register names to actual physical registers continually changes so you relieve the bottleneck of waiting for a particular physical register to become available).

Ah, probably. I thought "private" might also make sense, in that it's an implementation detail hidden behind the architectural register set.

PRF = Physical Register File. In modern CPUs there are no separate architectural registers; all register state is held in a single very large register file (for example, 180 entries in Intel Skylake) that contains all register data, and the architectural register names are pointers into this file.

This is different from previous approaches where in-flight register data was held in the ROB and later written back into architectural registers.

I highly doubt the chips are using vacant portions of the SIMD register file to hold scalar values, especially memory addresses. The SIMD registers probably have no easy connection to ports where load/store units ingest addresses.

I love this kind of technical analysis. Is there any repository containing more analyses of this kind?

I wonder how many little wins like this contribute to AMD's immense efficiency over Intel's current chips?

I think the area where Intel is currently lagging is not so much the internal architecture of the chips, but the fabrication process, where they have fallen behind AMD (who is using 3rd party fabs): https://www.forbes.com/sites/linleygwennap/2020/07/28/intel-...

Bigger (or rather, smaller) than the fab process was the clever use of chiplets to improve yields.

Looks like the single greatest win to me. This explains a lot now.

As a Zen2 owner I'm very disappointed in VPGATHERDD throughput, that's so 2013. On the other hand I like the loop and call instruction performance a lot.

That gather needs to issue 8 independent loads. It's never going to be fast, and I think there's a strong argument that you don't even want to spend the transistors on all the extra load/store units required. The goal of scatter/gather instructions is that they should be demonstrably faster than assembling the values in scalar code, and beyond that... meh.

If you're doing random access to memory like that, you're probably out of the realm of what is appropriate in vector code and should be looking at other hardware (c.f. a GPU's texture units) to manage your memory access.

Don't GPUs optimize gathers within the same cache line to effectively be a single fetch from memory and then shuffle? I would assume that's the purpose of VPGATHERDD: not so much for a vector of addresses like (0x1000, 0x2000, 0x3000, 0x4000), where there's no alternative other than to issue 4 loads, but rather for a vector of nearby addresses like (0x2010, 0x2004, 0x2008, 0x2000), where the CPU can coalesce the fetches into one (like PSHUFD with a memory operand does). Gather instructions are especially good when your addresses are usually in the same cache line, but don't have to be—stuff like mipmapped texture lookups in fragment shaders.

I don't think that's the case on most CPUs; VPGATHERDD on Skylake, Icelake, etc, all issue the same 4/8/16 port2,3 uops regardless of what the addresses are.

It's theoretically possible to run gather as fast as one cache line per cycle instead of one SIMD lane per cycle. I don't think anyone has thrown that much permute hardware at the problem, though. It's only profitable if you believe that scatter and gather do have cache locality even when they don't have regularity.

I believe the Xeon Phi series implemented gather this way.

Isn't AVX512 basically cacheline-instructions?

That's the way normal SIMD loads work, yeah.

But the scatter/gather instructions do random-access memory operations. You have one SIMD register with 8 (or whatever the width is) indexes to be applied to a base address in a scalar register, and the hardware then goes and does 8 separate memory operations on your behalf, packing the results into a SIMD register at the end.

That has to hit the cache 8 times in the general case. It's extremely expensive as a single instruction, though faster than running scalar code to do the same thing.

I looked at Agner's tables, and was curious how Intel fared. All numbers are reciprocal throughput, i.e. how many cycles per instruction in throughput terms. Zen 2 has its gather variants mostly at 9 and 6 cycles, with one variant at 16. Broadwell has only 6, 7 and 5 cycles. Skylake has mostly 4 and 2, and one variant at 5.

Now I was surprised by Agner's figures for Zen 2 LOOP and CALL, which both have a reciprocal throughput of 2, equal to doing the same with normal jump instructions.

Skylake on the other hand has 5 or 6 for LOOP, and two CALL variants with 3 and one variant with 2.

VPGATHER has always been pretty trashy on AMD. Do they microcode it?

Edit: The answer from LLVM's source code is yeah, gather loads are microcoded on Zen.

Surprising... and a little scary. This is not something I would've expected to be done in the current world of multiple cores. I wonder if things like volatile and lock-free algorithms would behave any differently or even break.

> I wonder if things like volatile and lock-free algorithms would behave any differently or even break.

I don't understand how you think it could change or break a volatile or lock-free algorithm.

It's a transparent optimisation - it doesn't change observable behaviour. Just like how register renaming, top-of-stack-caching, and so on and so on don't change observable behaviour.

The only difference it would make is timing.

So there is logic to determine if a different thread has modified the value between those instructions?

Before you ask if there is logic to determine whether another thread modified those values, first ask yourself if you can write a program that could observe the difference between this logic existing or not.

I can't write such a program. Can you?

How would you ensure that another thread was scheduled between these instructions in order to observe some intermediate state? If I just told you the OS would never schedule between these instructions again is that breaking any contract or something you can observe somehow? How would you tell?

If we can't write such a program, then the logic isn't needed.

If you have multiple cores then the OS’s scheduler doesn’t matter much.

In one thread, do a loop of:

  loop:
  mov qword [rsi], 10
  add qword [rsi], 5
  cmp qword [rsi], 15
  je loop
(Or however you write that with correct assembler).

In another thread with the same value in rsi, do:

  loop:
  mov qword [rsi], 100
  jmp loop
Ensure these threads run on different cores. You would expect the loop in the first thread to eventually terminate. The probability distribution of the time it takes to terminate would probably be informative.

Would it terminate on this AMD chip? How would the distribution be different?

If you are relying on that behaviour then your code is already broken on most modern x86 processors due to their memory ordering implementations.

The store to the address at rsi will actually go in to the processor's store buffer and the subsequent add is allowed to retrieve the value from the store buffer _before_ it's drained to main memory.

Under TSO, AMD's memory mirroring implementation here is totally fine.

It could potentially change observable behaviour if you could turn it on or off, but this is irrelevant given you will see changes in observable behaviour for that same snippet of code depending on which microarchitecture of AMD or Intel processor you choose to run it on.

It's worth having a read on the details of different memory models and how they relate to implementations. This seems like a good guide: https://www.cs.utexas.edu/~bornholt/post/memory-models.html

> You would expect the loop in the first thread to eventually terminate.

Why would you expect it to eventually terminate? There is nothing in the architecture that should lead you to expect that.

The architecture does not guarantee that instructions from the first loop will be interleaved with the second in any particular way, no matter what cores they're running on.

> The probability distribution of the time it takes to terminate would probably be informative.

But again... the architecture does not guarantee the set of, or probability distribution of, interleavings. You cannot write a sound program using this information.

You need some form of barrier (lock/mfence/cpuid) prior to compare instruction to make the snippet correct.

It's easy to write that program. For example, a Peterson lock, which has a busy-wait loop:

    while (flag1 == true && turn == 1) { /* wait */ }
the loop terminates by another thread writing to either variable.

If the variables were mirrored to registers, and that mirroring was not invalidated by another thread's write, then the loop would never terminate.

Of course Zen2 did not break Peterson locks, so the GP's question is cogent and has the answer "yes, there is such logic."

In order for memory mirroring to be enabled, the memory must probably be read first since its value must be used. The processor must then ask the cache controller to acquire the cache-line in a "no writer / multiple reader" mode ("S" state). If another processor wants to write to this cache-line it must first take the cache-line in "single writer / no reader" mode ("M" state) which invalidates the cache-line in all other caches. In the processor performing the memory mirroring, if the cache-line goes out of the "S" state then the processor cannot read the cache-line anymore and memory mirroring has to stop since it reads this memory. In order to continue it must ask for the cache-line again, read it again, and then get an updated value.

It's just a guess, but that's probably what happens in order for the memory mirroring not to break the cache-coherency. So "yes, there is such logic", but it already existed in the processor to avoid the same problem when data is in the cache. And so, there would be no additional logic to handle this.

But the read of turn will not be eternally mirrored to the original write; at some point the renaming eventually moves on, when the original store (an observable side effect) retires, just like with register renaming.

Notice that your code has a backward jump between the write and the read and the example in the article (and the code of the other person I replied to) does not.

There is no logic needed to interrupt the renaming such as a signal from another core holding the cache, the rename just naturally retires beyond a certain point.

Read Morris Marden's 2014 patent on it for more details.


Hmm, what's the advantage of doing memory renaming for unretired stores? Won't the value just be forwarded from the store buffer?

The patent is a great resource, thanks for that. But I think it supports my point when it suggests that memory renaming may persist indefinitely:

Alternatively, reclamation of the original physical register could be delayed to facilitate later memory renaming.

and later:

...the MRT may always retain physical registers of N past store operations...the physical register is reclaimed to the free list when it exits the MRT, for example, when it is no longer one of the N most recent store operations.

So it's possible. I don't think we know from Agner's work whether AMD went this route or not.

Store forwarding is slower; it is usually as fast as reading from L1 directly in the best case (4-5 cycles).

This optimization has a latency of 0-1 cycles.

That's a very useful way to approach the question. Thanks for reminding me of that technique!

No - there doesn't have to be. As long as there are no barriers or other instructions which restrict memory ordering relative to particular instructions in the dynamic instruction stream, it doesn't matter what the actual interleaving is between different cores as long as there is a given execution order which is valid for the observable side effects.

x64 has stronger guarantees than that without barriers, doesn't it?

x86 guarantees that memory writes to different locations are seen in order by other cores.

If you write memory first to address A and then to B, another core can never see the B change without also seeing the A change, at any moment in time.

But in this case there is only a single memory location. So there is no ordering that could be violated in the first place.

#StoreLoad still requires a barrier even on x86.

Yes, quite a bit stronger.

It guarantees the order of visibility of writes to shared locations. It does not guarantee the granularity of updates to the shared locations, which is the essential difference here. How do you observe the difference between the three instructions just naturally happening to all run before your next instruction completes, and them eliding the memory operation as is happening in this optimisation?

It's like the ABA problem - how do you differentiate between just never happening to see a B out of chance, and B being elided instead? You can't.

The cache-coherency must already ensure that. Memory mirroring conditions should include the permission to read (or write, depending on usage) the data cache-line.

This was my thought initially. The decode logic needed to get this right in all cases seems really hairy. Beyond that though it seems like a good optimization to me. Fog seems to think it takes a lot of transistors to implement, but it doesn't strike me as the case. It's effectively a renaming operation where the 2nd and later references to the shared operand get rewritten to some other register.

As far as volatile: this isn't changing anything to do with memory operations. It's just caching a computed address in a special place so the CPU doesn't have to decode and recompute it for the very next instruction.

Volatile, at least in C and C++, is already broken when used for thread coordination. Lock-free algorithms need to use atomic instructions, which do work.

volatile was never meant for thread coordination.

And, yet, people still try to use it for that. So it's worth pointing out that it doesn't work.

I think volatile in general is pretty useless, and it'd be better to replace it with atomic style functions:

    x = volatile_read(&v);
    y = use(x);
Which is both clearer about what's going on, and doesn't cause optimization barriers to happen everywhere, just around the specific loads and stores you care about.

Volatile is for memory-mapped IO. It's not particularly useful anywhere else.

Thank you for repeating me again. Though it's not even what I want then. I think the only time I've wanted volatile as is currently specified was around setjmp/longjmp -- and I think I'd prefer that setjmp/longjmp were specified in a less error prone way.

If you're paying attention to the standards community, what I'm wishing for was mostly sketched out in ISO/IEC TR 18037, section 6. (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1169.pdf)

Which is obvious when you consider that C didn't get any sort of thread-awareness until C11.

I don't see your point. In C and C++, volatile is just a compiler hint to disable some classes of optimizations. It was never intended to be used to sync threads because it offers no such assurances.

Congratulations, you saw my point.

People try to use it for thread coordination. It's always been wrong.

I know a FreeBSD core committer who is convinced that volatile is good for thread coordination.

Yes, he is wrong.

> volatile is good for thread coordination.

He has a point, though.


> Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order). Any attempt to refer to a volatile object through a non-volatile glvalue (e.g. through a reference or pointer to non-volatile type) results in undefined behavior.

Which bit of "This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution" is unclear?

In fact, Standards notwithstanding, real compilers routinely optimize away volatile operations if they deduce nothing can see the object -- e.g., it is on the stack, and its address has not been taken. And they will happily re-order reads from and stores to other objects around a volatile operation. Furthermore, caches don't know squat about volatile, so they will eagerly re-order the volatile ops, besides.

So, no, he did not have a point. He was just wrong, wrong, wrong. But it never caused him any personal distress.

volatile is required (but not sufficient) for thread coordination, at least in C++03 and C99. That's why boost atomics uses volatile extensively.

You can use volatile as part of a suite of compiler-specific tricks to implement higher-level load acquire/store release operations on x86 (and load consume everywhere else), but portably you can't do much.

Using volatile as if it simply means atomic is wrong.
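For illustration, a minimal C11 sketch (my example, not from anyone in this thread) of the portable flag-passing that volatile cannot guarantee — a release store paired with an acquire load:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Cross-thread publication done right.  A volatile flag would not
 * order the payload write with respect to the flag write; the
 * release/acquire pair does. */
static atomic_int ready;   /* zero-initialized flag */
static int payload;        /* plain, non-atomic data */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                            /* write the data... */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* ...then publish it */
    return NULL;
}

static int consume(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                  /* spin until the release store is visible */
    return payload;        /* acquire pairs with release: we must see 42 */
}
```

On x86 the generated loads and stores look almost identical to the volatile version, which is exactly why the volatile version *appears* to work there; the atomics also constrain the compiler, which volatile alone does not.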

But that is what he insisted.

The x86 memory model already allows (and all current implementations already do) forwarding from the store buffer (which is core private). As far as I can tell, this won't break any code which isn't already broken.
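For illustration, a minimal C sketch (mine, not the actual benchmark from the gist) of the kind of store-then-forward loop described: every iteration's load depends on the previous iteration's store to the same address, so loop throughput is bounded by the store-to-load forwarding latency (~1 cycle/iteration on Zen 2 with memory renaming, several cycles without):

```c
#include <stdint.h>

/* A store-then-load dependency chain through a single memory slot.
 * volatile is used here only for its legitimate single-threaded
 * purpose: stopping the compiler from keeping `slot` in a register
 * and collapsing the loop. */
static uint64_t store_forward_chain(uint64_t iters) {
    volatile uint64_t slot = 1;          /* same address, same addressing expression */
    for (uint64_t i = 0; i < iters; i++)
        slot = slot + 1;                 /* load [slot], add, store [slot] */
    return slot;
}
```

Wrap the call in rdtsc or clock_gettime and divide by the iteration count to see the per-iteration latency on your own machine.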

I sense this question is pretty elementary, but maybe someone can point me in the right direction for reading:

"When the CPU recognizes that the address [rsi] is the same in all three instructions..."

Is there another abstraction layer like some CPU code that runs that would do the "recognition" or is this "recognition" happening as a result of logic gates connected in a certain static way?

To put more broadly: I'm really interested in understanding where the rubber meets the road. What "code" or "language" is being run directly on the hardware logic encoded as connections of transistors?

CPUs run Tomasulo's algorithm ( https://en.m.wikipedia.org/wiki/Tomasulo_algorithm ). The outputs of Tomasulo's algorithm then allow for out-of-order execution across multiple execution units.

Note that Intel chips have many (at least 10) execution units, and AMD also has 10. You need to split instructions across the different pipelines for efficient execution. I forget exactly how many... but far more than one pipeline in any case.

If pipeline 0 is busy doing a divide, run additions in pipeline 5. Tomasulo's algorithm converts typical assembly language into this multi-pipeline form. It also keeps track of which pipelines are busy, and finally puts things back together in the right order when done. (Division can take 80 clock cycles, while addition takes 1. You gotta stall the results of the addition instruction until after the division is complete, if the division came first in the machine code.)
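A toy model (illustrative names, not real hardware) of the renaming step at the heart of Tomasulo's algorithm: every architectural write allocates a fresh physical register, so write-after-write hazards between in-flight instructions simply vanish:

```c
#include <assert.h>

#define NUM_ARCH 16   /* architectural registers visible to the ISA */
#define NUM_PHYS 64   /* physical registers hidden inside the core */

static int rename_map[NUM_ARCH];  /* arch reg -> current physical reg */
static int next_free;

static void rename_init(void) {
    for (int i = 0; i < NUM_ARCH; i++)
        rename_map[i] = i;        /* identity mapping at reset */
    next_free = NUM_ARCH;
}

/* A source operand reads whichever physical register currently
 * holds the architectural register's value. */
static int rename_src(int arch) { return rename_map[arch]; }

/* A destination operand gets a fresh physical register, breaking the
 * dependency on any older instruction still using the previous one. */
static int rename_dst(int arch) {
    assert(next_free < NUM_PHYS); /* a real core would stall instead */
    rename_map[arch] = next_free++;
    return rename_map[arch];
}
```

Two back-to-back writes to the same architectural register land in different physical registers, so both producing instructions can be in flight at once; memory renaming (the subject of this article) applies the same trick to stack slots instead of registers.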

Thanks for this. I had no idea what to search for but this is a good start.

I guess it's unsurprising to find that even at the single CPU core level, there's another layer of parallelism. And that will of course complicate things further.

Note: Parallelism comes in many forms on CPUs. Pipelines, superscalar, out-of-order, and speculative. Of these concepts, Tomasulo's algorithm covers pipelines, superscalar, and out-of-order. So it's the "most complex example", but that should get you up to speed the quickest. (Branch prediction / speculative execution is its own thing, but I've always found the concept simple to grasp: you're executing instructions in parallel, so you need to guess the result of a branch before the branch is even calculated.)

When it comes to actual execution of the final micro-op itself, it's surprisingly simple. You put the bits in the right location, then a circuit magically calculates the result.

XOR and Add are simple enough (https://en.wikipedia.org/wiki/Carry-save_adder). A multiplier is harder to think about, but yes, it's just a bunch of wires hooked up to NAND gates: https://www.slideshare.net/suryakrishna3785/wallace-tree-mul...

Division and Modulo are traditionally executed with micro-ops in a multi-step process. Addition becomes subtraction through the magic of two's complement. AND, OR, XOR are trivial logic gates. Memory lookup is complicated, but mostly due to caching. The core concept of memory lookup (put a value onto the address pins, wait for memory to respond with the stored value) is simple enough.
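The multi-step flavor of divide can be sketched as a classic restoring-division loop, one quotient bit per step (and two's-complement negation really is just ~x + 1) — a rough model of the principle, not any particular CPU's microcode:

```c
#include <stdint.h>

/* Restoring division: shift one numerator bit into the remainder per
 * step, subtract the divisor whenever it fits.  32 sequential steps
 * is why a hardware divide costs so many more cycles than an add. */
static void udivmod32(uint32_t n, uint32_t d,
                      uint32_t *q, uint32_t *r) {
    uint64_t rem = 0;
    uint32_t quot = 0;
    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((n >> i) & 1);
        if (rem >= d) {
            rem -= d;
            quot |= 1u << i;
        }
    }
    *q = quot;
    *r = (uint32_t)rem;
}

/* Two's-complement negation: invert all bits, add one. */
static int32_t negate(int32_t x) { return (int32_t)(~(uint32_t)x + 1u); }
```

Real dividers use radix-4 or higher SRT variants to retire several quotient bits per cycle, but the bit-at-a-time loop is the idea.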

Memory is traditionally on a "bus". The idea of a "bus" (multiple different CPU components talking on the same set of wires) is hard to grasp. But when you learn about tri-state logic (https://en.wikipedia.org/wiki/Three-state_logic), the 5V state, 0V state, and "High Impedance" state, it becomes simple.

"High Impedance" means that someone connected to the wires simply won't read, or write, on those wires. The CURRENT (amps) going in, or out, is set to zero through the magic of transistors. If you have 5 different components (C0, C1, C2, C3, and C4) on a set of wires, if C1/C2/C3 are in "High Impedance", then C0 and C4 can talk without any issues (because C1 through C3 won't use up any electricity from the wires).

The key is coordinating who can talk and at what times. But that's where "link level protocols" start to come in. Tri-state logic (in particular: high impedance) is what makes it possible.
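The tri-state idea above fits in a few lines of C (a toy model of mine; real buses also worry about analog levels and timing): a wire takes the value of its single active driver, and everyone in high impedance simply drops out:

```c
typedef enum { LOW = 0, HIGH = 1, HI_Z = 2 } Line;

/* Resolve one shared bus wire.  Components in HI_Z neither source nor
 * sink current, so the lone active driver wins.  Returns HI_Z if
 * nobody is driving (floating bus) and -1 on contention, i.e. two
 * components driving opposite values. */
static int bus_resolve(const Line *drivers, int n) {
    int value = HI_Z;
    for (int i = 0; i < n; i++) {
        if (drivers[i] == HI_Z)
            continue;          /* not talking on the bus right now */
        if (value != HI_Z && value != (int)drivers[i])
            return -1;         /* bus fight: 5V and 0V on the same wire */
        value = (int)drivers[i];
    }
    return value;
}
```

The "link level protocol" is then just the rule deciding which component is allowed to leave HI_Z at any given moment.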

And there ya go. A fully working CPU. That's all it is.

Nice description! Even a simple ALU has a lot more complexity than you're giving it credit for, though. It's not "just" a bunch of gates, but is usually highly optimized with dynamic logic and pass-transistor logic. Addition in particular is not "simple", but uses various tricks to avoid carry propagation delay. And the various ALU operations are smashed together in clever ways to avoid separate circuits for XOR, add, AND, OR, etc. Even XOR is not a "trivial logic gate"; building XOR from e.g. NAND gates is surprisingly inefficient, so processors use a variety of special tricks for XOR. My point is that actual ALUs haven't matched the simple description since about 1972.

> Addition in particular is not "simple"

Well... optimized addition isn't simple. But any beginner can implement naive carry-save adders on a breadboard (or Verilog/FPGAs) and get a functional Adder (maybe even ALU) in a weekend.

But you're right. To optimize addition (in particular: the propagation of carries), you need to use a Kogge-Stone Adder, which is pretty sophisticated (https://en.wikipedia.org/wiki/Kogge%E2%80%93Stone_adder). And that's from my Bachelor's degree: I'm sure there are newer algorithms that are more sophisticated than Kogge-Stone.
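Both ends of that spectrum fit in a few lines of C (bit-twiddling models, not gate-accurate circuits): a 1-bit full adder built purely from NAND, chained into a naive ripple-carry adder, next to Kogge-Stone's trick of computing all carries in log2(32) = 5 parallel-prefix steps:

```c
#include <stdint.h>

/* NAND is universal: XOR, AND-ish terms, and the full adder below are
 * all built from it alone. */
static uint32_t nand(uint32_t a, uint32_t b) { return ~(a & b) & 1u; }

static void full_adder(uint32_t a, uint32_t b, uint32_t cin,
                       uint32_t *sum, uint32_t *cout) {
    uint32_t t  = nand(a, b);
    uint32_t ax = nand(nand(a, t), nand(b, t));      /* a XOR b */
    uint32_t u  = nand(ax, cin);
    *sum  = nand(nand(ax, u), nand(cin, u));         /* (a^b) XOR cin */
    *cout = nand(t, u);                              /* majority(a,b,cin) */
}

/* Naive ripple-carry: the carry crawls through 32 stages in series. */
static uint32_t ripple_add(uint32_t a, uint32_t b) {
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t s;
        full_adder((a >> i) & 1u, (b >> i) & 1u, carry, &s, &carry);
        sum |= s << i;
    }
    return sum;
}

/* Kogge-Stone: generate/propagate vectors, 5 prefix steps, done. */
static uint32_t kogge_stone_add(uint32_t a, uint32_t b) {
    uint32_t g = a & b;   /* generate: this bit produces a carry */
    uint32_t p = a ^ b;   /* propagate: this bit passes a carry along */
    for (int shift = 1; shift < 32; shift <<= 1) {
        g |= p & (g << shift);
        p &= p << shift;
    }
    return (a ^ b) ^ (g << 1);  /* sum = propagate XOR carry-in */
}
```

The ripple version is the weekend-breadboard design; the Kogge-Stone version trades a lot of extra wiring for carry delay that grows logarithmically instead of linearly in the word width.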

There are a lot of issues at play here at the electrical level, e.g. fan-out. All components can only carry so much current (amps). For example, a transistor may only be able to feed 5 to 10 other transistors, so building "buffers" to boost the current so that you can actually turn on all the downstream transistors is a thing.

Each CMOS transistor has capacitance, so you need a certain number of electrons before the transistor turns on. Given a level of current (amps), this means you need to wait for enough electrons to accumulate on the input before the transistor responds.

Doing all of this with the minimal energy usage and maximum speed (minimum latency) is surely complicated. But the "naive" version suitable for beginner play is pretty simple. Like raytracers: you can write a raytracer in just 100 lines of code (it won't be as fast as a professional raytracer, nor have as many features, but you'd get the gist in a single weekend project: https://github.com/matt77hias/smallpt)


In any case, once you get to the final decoded instruction, it's "just" a circuit. Sure, Kogge-Stone adders and Wallace tree multipliers need a bit of study before you understand them. But they're simple black boxes. Stick the bits onto the correct wires and a few hundred picoseconds later, you get the result coming out of another wire.

I typed up and then lost a big response, and I don't have the time to replicate it. But thank you both for such thoughtful discussion. This is giving me a really good broad sense of the topic. I've been implementing a Game Boy emulator for fun and it's raised so many questions.

I found that getting timing correct was the hardest part (at university, where everything was done by hand — later I worked in a fab where we could make that easier for people using tools).

The original problem that Tomasulo's algorithm (register renaming) solved was that the IBM 360 Model 91 floating point unit only had 4 FP registers. Register renaming is also important on x86 (starting with Pentium Pro) which only has 8 GPRs and x86_64 which only has 16 GPRs.

These (8 and 16) are still small numbers. I wonder how important renaming is on ARMv8 or RISC-V with 32 GPRs. I really wonder if compiler register allocators help/hurt/know about renaming, which would in fact require microarchitectural knowledge.

The example that Agner found is a form of register renaming. So is this AMD trying to make its legacy code base go faster (AMD+Intel do a TON of that) or is it something a compiler can actually target? I think the former.

Behind the 16 architectural GPRs on Ice Lake / Sunny Cove (Intel's next-generation desktop core), there are 352 renaming registers available.

Behind the 32 architectural GPRs on ARM's Cortex-A78 application processor, there are 160 renaming registers.

> I really wonder if compiler register allocators help/hurt/know about renaming which in fact would require microarchitectural knowledge.

The benefit of register renaming is that idioms like "xor eax, eax" automatically scale to however many rename registers the chip has.

"xor eax, eax" REALLY means "'malloc' a register and call it EAX". Because the reorder buffer changes between architectures and even within an architecture (smaller chips may have smaller reorder buffers), it's best to "cut all dependencies" at the compiler level, and then simply emit code where the CPU allocates registers in whatever way is optimal.

The compiler doesn't care if you have 160-renaming registers (ARM A78), 224-renaming registers (Intel Skylake), or 300+ registers (Intel Icelake). The "xor eax, eax" idiom on every dependency cut emits the optimal code for all CPUs.


You should have a large enough architectural register set to perform the calculations you need... in practice, 16 to 32 registers seem to be enough.

With the "dependency cutting" paradigm (aka: "xor eax, eax" really means malloc-register), your compiler's code will scale to all future processors, no matter the size of the reorder buffer of the particular CPU it ends up running on.

EDIT: It should be noted that on Intel Skylake / AMD Zen, "xor eax, eax" is so well optimized it's not even a micro-op: literally zero execution uops for that instruction.
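The same dependency-cutting idea shows up at the C level too. A sketch (my example, not from this thread): splitting a reduction into independent accumulators removes the loop-carried dependency, so the out-of-order core can keep several adds in flight regardless of how many rename registers it has:

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add waits on the previous add, a serial
 * dependency chain.  Four accumulators: four independent chains the
 * scheduler can interleave across execution units. */
static uint64_t sum4(const uint64_t *v, size_t n) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += v[i];
        b += v[i + 1];
        c += v[i + 2];
        d += v[i + 3];
    }
    for (; i < n; i++)
        a += v[i];          /* leftover tail elements */
    return a + b + c + d;
}
```

Compilers do this transformation themselves under -O3/-ffast-math for floats, but for integer reductions writing it out by hand is a common, portable trick.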

I will add a reference to one of the highest rated stack overflow posts about branch prediction (speculative execution):


This is THE textbook on the subject: https://www.amazon.com/Computer-Architecture-Quantitative-Jo...

There are older editions available free online that cover the same concepts, they just don't have the very latest info.

On most x86 chips since the Pentium Pro (and the K6 for AMD, IIRC) the machine code instructions which are visible to the programmer are turned into "micro-operations" by the CPU's instruction decoder. See here for an introduction: https://en.wikipedia.org/wiki/Micro-operation

Thanks for the link! So on older chips, like maybe an 8080, I could expect to see literal hardware implementations of instructions?

Ken Shirriff has an excellent series of blog posts on early Intel chips (running from 4004 to 8086 I think). Don't believe he's done one on the 8080 but the post on the 8008 is great [1] and I'd expect that the 8080 (which followed it and was designed by the same team) is very similar.

In short no microcode but there is a PLA (Programmable Logic Array) which helps to decode the instructions.

[1] http://www.righto.com/2016/12/die-photos-and-analysis-of_24....

Thanks for the nice comment! I haven't looked into the 8080 because other people have examined it in detail. See https://news.ycombinator.com/item?id=24101956 for an exact Verilog representation of the 8080.

Modern CPUs don’t have RISC cores, complex instructions are all the way down.

On current-gen CPUs, only rarely used instructions are decoded into multiple micro-ops, the majority of them decode into a single micro-op, even when they do many things at once like memory load + compute. More often the opposite happens, multiple AMD64 instructions are merged into a single micro-op, this is called “micro-ops fusion”.

> Modern CPUs don’t have RISC cores, complex instructions are all the way down.

Proof: Intel's LEA instruction is executed as one uop. That's a super complicated instruction that you'd expect to be split into other uops, but... lo and behold, complex all the way down to the end silicon.

Even a "RISC" machine: ARM has macro-op fusion, with the AESE / AESMC instructions. (AESE / AESMC macro-op fuse into a single AES macro-op. A full AES round includes both steps, so it's very rare to ever need only one of them.)

As such, I've considered "RISC" to be a misnomer. Even RISC-V and ARM are now doing macro-op fusion (and micro-ops). So what exactly does RISC mean anyway?

You can't do an operation on something in memory directly?

That's just "Load/Store Architecture" though. RISC machines are usually load/store, but so are a bunch of other things, like GPUs and such. Hardly unique to RISC.

> multiple AMD64 instructions are merged into a single micro-op, this is called “micro-ops fusion”.

very minor nit: it is actually called 'macro-op fusion'; 'micro-op fusion' is when ops from a single instruction are put back together in part of the pipeline.

Otherwise, spot on.

The K5 was actually a RISC CPU with an x86 decoder on top. So we can count that one, too?!

I wonder if it’s worth the effort to add functionality to take advantage of this in GCC, clang for just these CPUs?

I think the point is that most ordinary C-code should already conform to this access pattern. The side comment about aliasing is a great big hint in that direction.

(This article is grist to the mill of my "every CPU eventually evolves to become an interpreter for its dominant programming language" thesis.)

Yes, AMD probably added the optimization because it was a common pattern emitted by compilers.

I wonder if this causes a performance penalty on position independent code, which seems to be used a lot on non-Windows platforms [*].

[*] It always struck me as one of those propeller-head features that GCC & co. love but which the MS and Intel compilers avoid 'just to be on the safe side' :]

He mentions this doesn't work with RIP-relative addressing, but I don't see how it would cause a penalty otherwise.

re your thesis, can't wait for a WASM ISA on a commercial CPU and canvas instructions for its graphics coprocessor

you'll need to wait for WASM to become "the dominant language" on a commercial CPU. definitely not next year.

AFAIK there are existing microarchitectural optimizations in gcc and clang already, so it's reasonable to assume that AMD would add just such a feature.

But most people don't enable these features when building because they often make the resulting executable/library unusable on general CPUs. However, businesses/individuals with more specialized needs and better control over their targets often do capitalize on these.

this is the point of the gcc -mtune flag, which does have an actual use beyond appearing in the utterly redundant -march=native -mtune=native.

Back in the days of compiling MySQL myself, it was worth doing. But at some point the hardware cluster was no longer identical, so I scripted the builds and each machine built its own.

    -march=arch -mtune=arch
has existed for decades, and the performance improvement is measurable. But mainstream distributions often cannot take advantage, since they need to support all CPUs, and optimization for one is often a deoptimization for another. It's also a reason why Gentoo exists.

Interestingly, Intel's Clear Linux - a optimization-oriented distribution - uses

    -march=westmere -mtune=haswell

Westmere is pretty old, but significantly it is the first generation to support virtual machines with little overhead. If Clear Linux required something newer, it would rule itself out for a lot of existing machines (such as the 20+ I have running HPC jobs with surprising reliability).

Yeah, it's a reasonable tradeoff - be compatible with 1st Gen (basically all post-2010 modern x86_64 Intel processors), but try fine-tuning it a little bit for Haswell.

probably yes. It means that spills are cheaper.

Waiting for the next side-channel attack...

Could you explain the connection between this feature and a possible side channel attack?

It might also depend on how AMD clears this temporary register across contexts, and on how it gets updated.

The first thing I could think of (in the 5 minutes since reading this) is that it could potentially break ASLR. As Agner says, it's highly useful for stack operations, but that can also help an adversary derive where the stack is.

Second, I could see an attack where the CPU predicts a sequence will use this register (since the address in the register could be speculative) and prefetches (since it kind of looks like it's prefetching), but ends up being wrong without killing/clearing the register, or without preventing its speculative use.

But it all depends heavily on lots of things. I feel like it's pretty similar to spec v4 in that exploitation would depend on the memory disambiguator, but we'll see.

Agner Fog mentions that the CPU assumes pointers in different registers have different values, and the calculation has to be rerun if it turns out a write through one register invalidated a cached read through another. That means you can measure this penalty and infer that the registers held the same or overlapping addresses.

This is no different than the existing memory aliasing predictor that has existed for a long time already though, so it doesn't add any new speculation opportunity as far as I can tell.

A thread in the same process could observe the timing change and infer certain things about the memory address. This is mitigated by the fact that inter-thread and host-guest communication is already a side-channel nightmare, and everyone now only considers process boundaries to be defensible. It's sort of like worrying about someone being able to cut keys from a picture you posted on Facebook when all your house locks can be raked open with a $5 tool from Amazon.

Ok, so most of my asm coding and knowledge of exactly what the CPU was doing ended sometime in the Z-80/68K/8086 timeframe. Are there any good books/resources on all the modern trickery that CPUs now utilize?

Well, there is the Hennessy & Patterson book, "Computer Architecture: A Quantitative Approach". The newer editions include modern features and follow the industry's developments.

This will of course not be able to tell you all the black magic that is happening inside AMD's and Intel's newest designs. The software optimization manuals for these processors do include some more interesting insights, but especially the Intel ones are not the most entertaining read...

The best resource for details of any specific x86 cpu is Agner Fog's 3rd manual at: https://www.agner.org/optimize/#manuals

This hidden copy feature, do you think someone can exploit it in a similar way to recent exploits like Meltdown and Spectre?

The attacks for Meltdown and Spectre are completely different to this; I'm fairly certain that explicit features that could break memory protection would be tested. The point of those attacks is to trick branch prediction into circumventing these protections. It is of course possible this could be used in a way I haven't foreseen...

It looks like there is a significant miss penalty for aliasing. Does anyone know if Rust's ownership rules would help avoid these penalties?
