Hacker News new | past | comments | ask | show | jobs | submit login
An optimization guide for assembly programmers and compiler makers (2018) [pdf] (agner.org)
271 points by muricula on Feb 11, 2019 | hide | past | favorite | 59 comments



The page containing the link to this pdf "https://www.agner.org/optimize/" has links to four other pdfs describing other topics on optimization.


Clickable and non 404: https://agner.org/optimize/


Sorry, I had forgotten that HN doesn't like quotes around links.


  I like this site very much, its simple and old school.
Take a tour around the site for some other aspects of agners work, there is a good treatise regarding the fundamentals of digital electronics, as well as a proposed archetecture standard [ForwardCom]. some [as in a couple] of the links on agners site are stale or dead but do give you the general idea what to look for.

https://www.agner.org/digital/digital_electronics_agner_fog.... [PDF]

and:

https://forwardcom.info/


There is also uops.info which has a different, more formal characterization of latency:

https://arxiv.org/abs/1810.04610


And llvm-exegesis:

https://llvm.org/docs/CommandGuide/llvm-exegesis.html

"llvm-exegesis is a benchmarking tool that uses information available in LLVM to measure host machine instruction characteristics like latency, throughput, or port decomposition.

Given an LLVM opcode name and a benchmarking mode, llvm-exegesis generates a code snippet that makes execution as serial (resp. as parallel) as possible so that we can measure the latency (resp. inverse throughput/uop decomposition) of the instruction. The code snippet is jitted and executed on the host subtarget. The time taken (resp. resource usage) is measured using hardware performance counters. The result is printed out as YAML to the standard output.

The main goal of this tool is to automatically (in)validate the LLVM’s TableDef scheduling models. To that end, we also provide analysis of the results."


Sometimes I think I'm a 'full stack' developer, and then I read something like this. Sure I can dabble in simple assembly, but there's always a bigger fish.


Well, that's pretty humble of you. :)

I advocated for the term to only be used by folks that can build their own computer, language, OS, and apps. Look up Chuck Moore and Niklaus Wirth for examples focusing on simplicity. For full-stack architect w/ team building it, look up Bob Barton of Burrough B5000. I cant remember if early folks writing Lisp interpreters built the CPU's, too. I know Burger did for a Scheme CPU, though.


Not quite down to the hardware level, but the VPRI STEPS project had that kind of thing in mind, see (PDF): http://www.vpri.org/pdf/tr2012001_steps.pdf

The focus was on expressive power more than simplicity, though.


Well, remember what Carl Sagan said about full stack development:

  If you wish to make an apple pie from scratch, you must first invent the universe.



Heh, the people who can dabble in Assembly for a living won't be able to (typically) setup a major Web PHP + Database and all that. Assembly optimization is a different skill than web development, or other stuff in the stack.

We all gotta pick our specialties. For me, low level high-performance stuff is more of hobby project level. I got a few cool things I've learned by myself but it isn't really what I do for a living.


I dabbled in assembly, for a living, at a major software firm. I can still do it after brushing up a bit. And I have no trouble setting up a pg database with web frameworks and writing optimized code.

Web isn't some magical beast- its just rote programming.

Actual business logic is where the categorization starts.


It's all plumbing and bookkeeping.

If you want to be humbled, go to a conference on high energy physics.


They've spent their lives studying high energy physics. It's going to seem impressive to a non-expert, but to them it's mostly plumbing and bookkeeping.


sooo... where is SUSY .)


Not true at all; someone with the brains to understand how software works at the lowest level can figure out the higher levels easily, while the reverse apparently doesn't hold.


I've worked with enough backend developers who complain on how much tooling there is in frontend to know that this isn't true. And being a "fullstack developer" myself I've seen the stuff that sysadmins deal with and I know that thats a job I could never do. Knowing one part of the stack doesn't automatically qualify you for the other parts. Knowing assembly doesn't make you an expert in CSS, Webpack nor configuring IIS servers.


> Knowing assembly doesn't make you an expert in CSS, Webpack nor configuring IIS servers.

That isn't what the poster stated.


I friend of mine is a senior compiler engineer for a living (in a major player of the console gaming field). He said multiple time that he would not be able to keep up with the incredible mess that is web development nowadays (frontend in particular).


I've always said "full stack" == "iceberg tip".


The author is a cultural anthropologist, if it makes you feel any better.


Have you ever implemented optimized microcode for a pipelined, branch-predicting processor that you also had to build? MICMAC is the architecture we used once upon a time. I won all of the course's extra credit for both smallest microcode program and fastest (fewest micro ops) because I Huffman-encoded the macro ops of the test programs and used progressive decoding.

How about a Java to MIPS compiler? In C++ and then again in Java just to be self-hosting.

Or an all software OpenGL-compatible pipeline with painter-algorithm Z-buffering and trapezoid engine? Quaternions and all that state... fun-fun.

Another mandatory project in college was a non-pre-fork()ed select() caching HTTP/1.0 proxy in POSIX C89. IPC was a PITA back then.

There are still bigger fish, but one can be that much less of dilettante by digging deeper and mastering the why, not the what or the current fashion.


I still remember the olden days when

   XOR AX,AX
was faster than

   MOV AX,0
and Abrash'es book was the magnum opus of the pre-P6 era.

I guess I better throw most of those assumptions out for modern microarchitectures.


The XOR is still idiomatic because it is smaller than the MOV, because the immediate zero has to be encoded as the same size as the destination register.


Probably not directly faster, but "xor %rax, %rax" is still smaller (and thus faster if you're icache constrained) than "mov $0x0, %rax"


It is still faster because the register renamer handles it --- ditto for "sub reg, reg":

https://randomascii.wordpress.com/2012/12/29/the-surprising-...


"xor %eax, %eax" is even smaller (2 bytes vs. 3) and does the same thing since operations on the 32-bit subregisters are defined to zero the upper half [1]. This is what compilers tend to do when they want to zero a register.

[1] This is not true for 8- or 16-bit subregisters, causing all kinds of fun issues (see "partial register stalls", also discussed in Agner Fog's materials).


>"and Abrash'es book was the magnum opus of the pre-P6 era."

What book is that?


Probably Zen of Code Optimization. Good times.


The Zen of Assembler? Does that sound correct? That's the closest I could find.


Zen of Code Optimization 1994 ( https://www.amazon.com/Zen-Code-Optimization-Ultimate-Softwa...) is the follow up to Zen of Assembly Language 1990 (https://www.amazon.com/Zen-Assembly-Language-Knowledge-Progr...)


XOR was never faster, it is 2 bytes-vs 3 bytes instruction size.


I'm interested in seeing how the section about "Future branch prediction methods" (page 17) will change in our post-Spectre world.


This just came up today. Anybody know the answer? The problem involves a JIT (for opcodes, not scripting).

Chunks of code are generated. Sometimes, a chunk will need to branch to a chunk that isn't yet generated or even allocated. Chunks can be cleared away for various reasons like evictions to make room, so the branches to a chunk might need to not go there anymore.

One strategy is to use indirect branches. When the destination doesn't exist, the branch goes to code that will resolve that. It's a bit like how dynamic linkers work.

Another strategy is to replace the code. The processor may stumble a bit over what is essentially self-modifying code, but then it eventually runs faster. The degree to which the processor will pause is a key consideration here. Anybody know?

BTW, if you like digging into this stuff, we're hiring.


On a multithreaded implementation, you can do it with 2 stub functions and an array/hash of function pointers (hereto refered to as the "table") that itself acts as a spinlock array. Initialize the table with stub 1 and atomic write stub 1 to the table on eviction.

stub 1) upon entry, atomic write stub 2 into the table. Then compile or enqueue onto a compilation thread/process. At the end of compilation, atomic write the newly compiled address to the table.

stub 2) spinlock on function pointer array entry != Stub 2 address. At end of spinlock, call new function pointer. If you care about power consumption or timing attacks, the spinlock can have a backoff corresponding with the speed of your JIT compiler. This function should be able to fit into a single cacheline, so it should usually be hot.

inlined thunk: Given a function, read address from table and call it.

The client uses the thunk to call the function address in the table without knowing if it's compiled or not. On Intel, you shouldn't need a read fence due to MESI rules.

The branch/fetch predictor works by hashing the address of the branch instruction into its own lookup table. So by forcing stub 3 to be inlined, you're relying on the hash to prefetch the correct function address and call it directly.

You should also be able to put the JIT into a separate process that mmaps with the NX bit set, and have the client load without the NX bit set. So it would be difficult to exploit the JIT compiler itself (but the produced code could still be unsafe).

You might want to cachealign the function table or pack similar chunks in the same function table entry to prevent false sharing whenever there's an atomic write. Also, this table is going to thrash your cache by incurring an extra cacheline load for the trampoline that it does every time.

Anyways, that's my back of the napkin design for how to do it.


Thanks.

That is certainly a different way to think about threading. My instinct would be to keep a separate JIT cache for each thread, keeping the threads from stealing resources from each other but also keeping them from sharing the JIT effort.

I think your approach is pretty much how /lib/ld-linux.so.2 deals with the PLT. I didn't see an explanation for stub 3, but I guess you mean the part of the JIT output that does the indirect jump. I take it that you believe the speed advantage of a direct jump is not enough to overcome the occasional hit caused by modifying it, with all the invalidations (icache, trace cache, etc.) that it might cause. Being similar to the dynamic linker is probably well-aligned with what Intel is trying to optimize for upcoming chips. OTOH, the spectre problem might limit what Intel does.

Putting the JIT in a separate process would have to add latency. Upon hitting a missing chunk of code, translation can't really wait. The idea is to run modern stuff, such as a recent desktop OS, via the JIT. We do use more than one mapping on Linux, as required to keep SE Linux happy.


Yes, I'm assuming once code is JIT'ed, it will take a while before being JIT'ed again. But in the meantime the caches will cover some of the trampoline costs.

With regards to latency, if you have a dedicated JIT thread, you can read ahead and JIT the start of all the jump/calls up to the next current ret in parallel to executing the first chunk. Like a super fancy prefetch. Heuristically, I think it's safe to assume that most code is local but I could be wrong.

Also, if you load hwloc on x86-64, you can read your CPU topology at initialization time. If hyperthreading is present (which is somewhat common), you can set affinity for JIT threads to be on the same core as the emulated CPUs. This will minimize read concurrency overhead since you'll be reading/writing to the physical core's L1 almost every time. (This part is crazy, but you might even be able to get away without using atomics due to how cache-associativity works, but I've never tried it and it might not be guaranteed into the future.)


I found an old paper I wrote about my first DBT, and it has a nice figure about how I implemented indirect branch resolution: https://imgur.com/a/dCpY7V2. Feel free to ping me at pag on the Empire Hacking slack if you like talking about this stuff :-D


All JIT code runs into this issue. Modern CPUs are "Harvard" machines which split instruction-fetching and data-fetching. This means you need a memory fence of some kind to synchronize the separate L1 caches.

To properly handle all cases means forcing a memory fence whenever you are done writing code. I haven't thought too much about which fence is proper, but at minimum a write fence needs to ensure all data is written out as expected. You might need a full fence.

EDIT: Maybe a sfence at the location of the "writing" thread, while the JIT Code needs an lfence (to ensure all caches are properly updated). IIRC, fences on both sides are needed in these cases.

The "dynamic linker" idea seems better to my instinct, because it won't require these fences. Of course, testing would be required (Instincts are typically proven wrong in performance questions). IIRC, Intel Skylake and AMD Zen perform branch prediction on dynamic addresses, so you should get okay performance out of the dynamic lookup.


a memory fence is not required, as per my reading of the Intel documentation. i think you're just guessing and haven't actually written any SMC x86?

would you mind sharing what PRM sections back up this reading? maybe a careful re-read will help you figure out what "might" require a full fence.


I'm sorry, but I don't think I really should take your bait.

I'm generally willing to hold a technical discussion among peers. But it has to be on even terms with a decent amount of respect given both ways. Your tone in the post above, as well as your posting history, does not give me any hope towards having a good discussion with you.


bait? regardless of my tone, a memory fence is not required. no reading of the relevant documentation supports that, if you want to be wrong you're free to do so. but please don't pretend to educate others with wild uninformed guesses, that seems worse to me.

here's the documentation: https://software.intel.com/sites/default/files/managed/39/c5...

Section 8.1.3 Handling Self- and Cross-Modifying code

(* OPTION 1 ) Store modified code (as data) into code segment; Jump to new code or an intermediate location; Execute new code;

( OPTION 2 ) Store modified code (as data) into code segment; Execute a serializing instruction; ( For example, CPUID instruction *) Execute new code;

this doesn't strike me as contentious, there is no guideline that states a fence is necessary. you're possibly making the assumption that fences are "serializing instructions" and, well, again i'd question what documentation backs that interpretation up. is a sfence or lfence "more" serializing?


> regardless of my tone

I can't ignore your tone. Sorry. Maybe other posters can, but I can't and I won't.

EDIT: I had a lot of other stuff here. But my attacks vs you were likely unfair. So I've erased them.

Nonetheless, as I stated in my first response to you: I have no hope to have a positive discussion from this thread with you. I do believe you might be a stronger assembly programmer than I am. Maybe next time we can have a good discussion. But the tone was not set correctly in the initial post, so I do not wish to discuss this matter with you any further.


im sad i missed your edit, apparently filled with unhinged personal attacks, but ive limited my commentary to the technical details and it's amazing that you still claim some higher ground while steadfastly ignoring the subject matter, on which you are incorrect and misinforming others.

im not claiming to be a "better assembler programmer" im literally quoting the only relevant documentation on a tricky area where you were spreading lies charitably dressed up as half-guesses. if you hadn't barged in with some grossly incorrect theory about fences we wouldn't be talking at all. i still think lying and misleading while putting on airs of educating folks is worse than, uh, quoting documentation? are you thinking the documentation is being too harsh on your pet theory?

will you at least agree to stop spreading this dangerous misinformation? neither fence is serializing (like... you wouldn't want them to be...) and someone building a JIT around your guessing is heading towards an erroneous implementation.


> ive limited my commentary to the technical details

Prove it.

> apparently filled with unhinged personal attacks,

> you were spreading lies

> lying and misleading

Too late. You already failed at this within your very own post. The above words are not technical in any means, and I predicted that this would happen. Let me repeat myself one last time:

> I do not wish to discuss this matter with you any further

Perhaps I was not clear? But let me say it one more time. I DO NOT WISH TO DISCUSS THIS MATER WITH YOU ANY FUTHER. Am I clear this time?

EDIT: That does not mean I'm unwilling to discuss things with you in the future. As I stated earlier, it seems like you're an avid assembly programmer, and there aren't many of those in the world. So maybe next time, I'll feel confident to have a discussion with you. But in this thread, at this time? I do not wish to hold any technical discussion with you today.


i legit don't know what you're expecting? your lie was central to the technical discussion and at no point did you acknowledge the error.

if you can't handle being called out for providing false technical information... don't hand out false technical information? seems easy. it's not, as you so desperately want, a personal attack to identify areas where you're misrepresenting the technology.

im curious, how did you want this to go? DT, incorrectly:"fences are required" jv6:"ah, well, hm, er, here's section 8.1.3 which doesn't mention fences at all and specifies two methods (neither of which is a fence) that ARE required to ensure correct operation" like can i even tell people the truth? or must your egregious misrepresentation stand without question for you to find the "tone" acceptable?


Clearly, you won't leave me along after my previous hints. My only conclusion is you probably want something from me. Lemme be direct about it: What do you want from me?

I know one thing you want: you want to talk about assembly language / various low level details. Unfortunately, I don't want to do that with you in this thread. So I will NOT give you that. Is there anything else you might want from me?

As for what I want: I want to say goodbye to you, at least for now. Perhaps you and I will have a more pleasant discussion in the future. But this current discussion was poisoned from the start and I do not wish for this discussion to continue.


you should probably check the Programmer's Reference Manual (PRM)

i mean, they don't try to hide this information? Section 8.1.3 gives two options that are guaranteed to work across future iterations of hardware. the sibling comment that assumes a heavy serializing like CPUID is the ONLY method is shooting from the hip and only really appropriate if you're playing tricks with what LA you're using for the write/read.

you'll probably want to skim "11.6 Self-Modifying Code" as well. SMC is guaranteed to work, not work fast. Most folks want that second one and blasting out icache entries is going to put a damper on that.


There has recently been some work in the Linux kernel to implement some infrastructure to allow hot indirections through function pointers to be replaced by dynamically rewritten branches - this is entirely done for performance reasons, so that's at least a data point in favour of the "rewrite the jump target" option.

(This is particularly motivated by the fact that Spectre mitigation requires ordinary indirect jumps in the kernel to use a "retpoline", which prevents the processor speculating the jump target but is very slow - are you using retpolines for your indirect jumps?)

The most recent "static calls" patch set thread: https://lkml.org/lkml/2019/1/9/1112


At the end of the first chunk just ret. When the second chunk is generated, also emit a thunk which calls the first chunk and then calls the second chunk. Should be quite nice and efficient and will leverage the return history buffer. You're right, this stuff is fun.


Also, self modifying code typically requires full pipeline serialization via an instruction like cpuid (x86) or isb (arm). There are comments about this in the architecture optimization manuals, but in general try not to go around munging code all the time. While I'm thinking about it, you should consider an out of proc JIT architecture like Edge has, as JITs are fragile and easily exploited [1].

[1]: http://wenke.gtisc.gatech.edu/papers/sdcg.pdf


Out-of-proc would be too slow to be worthwhile in this case. This JIT isn't doing a high-level language like javascript, where large blobs of code can be generated and the performance is limited anyway by calls into other things. This JIT is for machine instructions. It's something like valgrind or qemu, in which we run modern software.


Depending on the size of your JIT code, it's possible to be more cache efficient if the JIT is pinned to its own core and uses lockless ring buffers to enqueue one-cacheline-sized chunks back and forth. But this only works if you spend more time instruction fetching to the JIT and back than it takes to atomic queue/dequeue.

In terms of latency, having the JIT in a process or thread is semantics at that point so you might as well process isolate it.

It all depends on how often your JIT instruction fetches outside of L2 due to the size of itself and the code that it generates. You can approximately measure it using cachegrind and then multiply the Data + instruction L1/L2 miss values by whatever they cost on your architecture.


Can't you place a trap where the target code would go? It would still be self-modifying code, but least you get rid of the indirection in the other way.


Not only is that code not produced yet, but the memory isn't allocated yet. The location is thus unknown.


Take a look at the various block chaining approaches of dynamic binary translators (DynamoRIO, PIN, etc.). There is a lot of work done in this space.


Probably not much. We'll probably just get speculative state barriers made full parts of the architectural specification.


I've heard this individual's name come up quite a few times(it's very distinctive)in discussions about the compiler/optimization/cpu architecture space.

Does anyone know if this individual still teaches at the Technical University of Denmark and if there might be any lectures of their's online?

I've been watching Onur Mutlu's CMU lectures online and this material is a nice addition to computer architecture self-study.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: