For example, the stack alignment restriction only came about because some SSE instructions initially required that their operands be aligned in memory. This is a rather odd restriction considering that x86 was always intelligent about unaligned accesses (they used to be slower, but the penalty has almost disappeared with the latest models), and later revisions of SSE added instructions for unaligned access. But this doesn't affect any other instructions - in particular, call/return don't care, so in your own code you don't need to align the stack every time - it's only when you call into some other code that requires/expects such a condition.
Ditto for calling conventions; they are defined just so you can interface with other code, not because of any inherent aspect of the instruction set. In fact, although there are explicit call/return instructions, nothing requires that you use them, nor does the concept of a "function" even exist at this level. This means you can write horribly convoluted "spaghetti code", but also do things that can't be done easily in a higher-level language with "unconventional" control flow, like coroutines (https://news.ycombinator.com/item?id=7676691 ).
Also, I believe that you should learn Asm not just so you can write the same code a compiler could generate, but so you can see and exploit the full potential of the machine. I feel this point is lost in a lot of material on this subject, maybe because their authors also never realised this, and so contribute to the belief that using Asm is extremely difficult and tedious with very little gain. IMHO it's only when you "break out" of the restricted way of thinking that HLLs impose, and see the machine for what it really is, that you can truly see the cost-benefit tradeoff and what it means for code to be efficient. Look at the demoscene 4k/64k productions for some inspiration. :-)
A simple overview is given in this recent article
Go inherited and extended the Plan 9 toolchain, but the calling convention was changed. The main reason Go returns on the stack is to support multiple return values.
x86-64 ABI point 3: Variadic functions need to have the number of vector arguments specified in %al.
Did you find any articles that explain the rationale behind the way "varargs" argument passing is done? Or better, specifications for it? It feels like a total mishmash.
For example, why is the number of vector arguments specified explicitly, but the number of total arguments is not? It seems so obviously useful to pass that total that you'd hope there would be a good reason not to do so.
And since printf() is using the format string to determine the number of variables anyway, why can't it just count the number of vector arguments itself? What does it do differently when the wrong number is given?
When va_start is used, it needs to save argument registers to the stack in the prologue of the function. The program is free to use different conventions (format string, sentinel value) to signal to the callee how many arguments there are. But the code generated for va_start has no way of knowing what convention the program happens to use.
I guess it makes sense to keep a consistent argument passing ABI, but I still find the answer quite sad: to preserve the ability to call functions without prototypes, you pass the arguments in registers and then immediately write them back to the stack.
Putting the number of vectors in %rax/%al seems at odds with the consistent ABI argument. Once you are changing things to require this, it seems like you might as well make some other useful changes as well: like passing the number of arguments and skipping the register to stack conversion.
It would be nice if there was a contortion-free entry point to the x64 printf() that started with the values on the stack. Can vprintf(const char *format, va_list ap) be used in this way instead? Is a 'va_list' just a block of memory containing the arguments?
I guess I need to study the article you referred to (and stare at the libc source a while: https://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-commo...)
"Back" to the stack is a big assumption. A lot of the time, there is no reason an argument would be on the stack in the first place; or even if it is a spilled variable, it usually wouldn't be in the right place (without some fairly uselessly clever stack layout), so it's just a matter of whether the caller or callee does the write.
For example, it's not necessary that all the args passed in registers need be written back to the stack by va_start --- ignoring for the moment the complication of structures and SSE registers, a va_list could just contain one field, which would initially hold an index, the "current register number" where the desired argument is stored. Then va_arg (which is now implemented as a compiler primitive) could generate code that checks this index, and if the index is less than the maximum number of arguments that can be passed in registers, resolve to an access to the particular register (here's where a "register-indirect" addressing mode would be really useful, since it could just use that index). Otherwise that field becomes a pointer into the stack like the traditional implementation. In other words, depending on that index, reading a scalar va_arg turns into reading a register or reading from the stack.
It's not that hard to extend this to accommodate structures; and I don't see any limiting reason on why the ABI couldn't pass structures partially in registers and partially on the stack. We just have to define the rules that let us do so, as a structure is only a concept of grouping data together, an HLL construct. There's nothing that restricts a structure to being always a contiguous set of registers or memory locations. Having to "reassemble" a structure in memory is only necessary if its address is taken, and then only if operations other than accessing its members via that address are performed; otherwise its components can be operated on independently, regardless of whether they're in registers or memory. Since the compiler knows where va_arg is used with a structure type, it can decide whether it really needs to generate the code to reassemble the structure, or only a series of scalar accesses. (I touch on this point somewhat here too: https://news.ycombinator.com/item?id=7683823 )
Extending this to the SSE regs is simpler than for structures, since this is merely another bank of registers that arguments can be read from: another field in va_list to hold the SSE register index/pointer.
I find it a bit odd that the ABI would define an actual implementation of varargs rather than only where the arguments will be expected to be, since how they're actually accessed should be outside the scope of an ABI spec; now every compiler vendor is tempted to just copy this (IMHO sub-optimal) implementation instead of thinking about how it could be done, and possibly coming up with something better.
You might be interested in Josh Haberman's post on using GDB's "breakpoint command lists" as an alternative to actually putting the variadic calls into the assembly: http://blog.reverberate.org/2013/06/printf-debugging-in-asse...
lea edi, [eax + myarray + ebx * 4]
lea myarray(%eax,%ebx,4), %edi
When using AT&T syntax, I also find it cleaner to drop the size suffix (movq vs mov) from the instruction whenever the size can be inferred from a register. It is still necessary in cases that are ambiguous (e.g. movq $imm, (mem), where neither operand is a register), so I suppose for teaching purposes it might not hurt to always include it.
By the way, the 32-bit x86 ABI is much simpler than x86-64, so if you're learning assembly language for the first time, x86-64 might not be the easiest place to start. x86-64 also has some unintuitive behavior like zero-extending 32-bit mov (mov ax, 0x1234 does not modify the high part of eax, but mov eax, 0x12345678 clears the top half of rax).
Personally I would say x86-64 is easier than x86 because of the extra registers. The System V ABI is a bit more complicated because arguments are passed in registers, which is great if they all fit, but more complicated if they don't (it does encourage you to try to make them fit, though). The ARM ABIs are very similar in this respect.
MIPS may not be as visible as ARM but it's still used in a ton of embedded devices - like routers.
So the tiny snippet at the top will become, for instance:
subq $8, %rsp
movq $0, %rdi
movq $1, %rax
movq $2, %rbx
as nothing.s -o nothing.o
ld nothing.o -o nothing -e _main
For Linux x86-64 syscalls, the proper way is to use the "syscall" instruction. For Linux x86-32 stuff, the best way is to make an indirect call through the vDSO with "call *%gs:0x10", so the kernel can dictate (via the vDSO) the best way to actually get you to the point of performing the syscall.
and, to your point, I recommend that anyone disassemble their code compiled with libc (and look at the differences between linux, os x, bsd, whatever)... it gets really interesting how much is added and how each OS handles it.
Either way, I am proceeding with the rest by using what someone else recommended I do, which is use `gcc -nostartfiles`. Looking at gcc in verbose mode reveals an ld command line I could use to run his code without modification:
ld nothing.o -o nothing -e _main -dynamic-linker /lib64/ld-linux-x86-64.so.2 -lc
What is indeed true is that the x86(-64) ISA alone does not give you enough information to predict performance accurately. Furthermore, due to the interactions between the different subsystems (e.g., caches, OoO buffers, etc.) it is essentially impossible to determine performance with cycle accuracy. But it is still possible to have a first-order approximation, for a given microarchitecture, of what is going on underneath the CISC mask.
Covers x86-64, MMX, SSE2/3/4, AVX.
The author, Chris Rose, has also written a free e-book:
"Assembly Language Succinctly"
-- (PDF) https://www.syncfusion.com/Content/downloads/ebook/Assembly_...
I also stumbled on The Art of Assembly which may prove promising.