A fundamental introduction to x86 assembly programming (nayuki.io)
320 points by nkurz on May 14, 2016 | 74 comments



If you want to learn x86 assembly, I recommend one of my favorite books Programming From The Ground Up:

http://savannah.nongnu.org/projects/pgubook/ (free pdf!)

This is a practical book and teaches assembly programming on Linux.

Author Jonathan Bartlett wrote this book because he was frustrated to no end with the existing books. At the end of them he could still ask, "How does the computer really work?" and not have a good answer. Jonathan's goal is to take you from knowing nothing about programming to understanding how to think, write, and learn like a programmer. You won't know everything, but you will have a background for how everything fits together.

Fun story: I remember how I went through this book in 2004, the day before a job interview, and I got asked exactly those questions: how C functions get compiled to assembly, and how the stack and memory management work. I got that job.


For "How does the computer really work?" questions, I recommend this book:

http://www.charlespetzold.com/code/


Funny you mention that one. I have been re-reading that book the past few weeks.

This book takes a bit of a different angle on explaining computers. I really enjoy the history.

My only complaint would be that I think he has a Microsoft bias (but then I guess I am biased myself).

And similarly, I have a minor annoyance with the OP's mention of Linux, and only Linux, when he touches on calling conventions. This bias toward one system, and ignorance of others, is typical of many websites and documentation. To be fair, the OP mostly avoids it.

To be clear, great explanations of computers to me are ones that either:

   1) take great care to stay completely neutral and only discuss universally shared traits across systems,

   2) go to great lengths to try to be as comprehensive as possible, including many systems and all their commonalities and idiosyncrasies, or

   3) focus only on one system and go into great detail how it works.
The more the author strays from 1, 2 or 3, the less likely I am to read their work.

Petzold pays ample attention to Morse code and similarly succinct ways of communicating information. In my opinion, this type of focus is the mark of a skilled coder. When I look at the entries to the IOCCC, it is no surprise to me that Morse code is (or at least was) a frequent focus of the entrants.


This book was what made C click for me (in the few chapters I digested way back when). I actually stopped reading twice because I suddenly understood something that had blocked my progress in C, and went on my way for a year or two until I decided to pick the book up again.

For a quick idea of what ASM can look like if you build up the foundations step-by-step, and understand what you're working with:

    .include "record-def.s"
    .include "linux.s"
    #PURPOSE: This function reads a record from the file descriptor
    #
    #INPUT: The file descriptor and a buffer
    #
    #OUTPUT: This function writes the data to the buffer
    # and returns a status code.
    #
    #STACK LOCAL VARIABLES
    .equ ST_READ_BUFFER, 8
    .equ ST_FILEDES, 12
    .section .text
    .globl read_record
    .type read_record, @function
    read_record:
        pushl %ebp
        movl  %esp, %ebp
        pushl %ebx
        movl  ST_FILEDES(%ebp), %ebx
        movl  ST_READ_BUFFER(%ebp), %ecx
        movl  $RECORD_SIZE, %edx
        movl  $SYS_READ, %eax
        int   $LINUX_SYSCALL
        # NOTE - %eax has the return value, which we will give back to our calling program
        popl  %ebx
        movl  %ebp, %esp
        popl  %ebp
        ret
  
With the definitions in place, it's like a whole different language.

For reference, the definitions are simply a text file with contents similar to:

    #System Call Numbers
    .equ SYS_EXIT, 1
    .equ SYS_READ, 3
    .equ SYS_WRITE, 4
    .equ SYS_OPEN, 5
    .equ SYS_CLOSE, 6
    .equ SYS_BRK, 45
  
Or (record-def.s in the example) clearly describing the data with:

    .equ RECORD_FIRSTNAME, 0
    .equ RECORD_LASTNAME, 40
    .equ RECORD_ADDRESS, 80
    .equ RECORD_AGE, 320
    .equ RECORD_SIZE, 324
  
This book opened my eyes to the concrete, data-driven nature of the problems I'm trying to solve, at an atomic level. It somehow dispelled all the magic behind programming, while exciting the mechanical side of my brain, leaving me with that "it's just a machine; I can solve any problem if I just trace things patiently until I understand the parts and how they interact" feeling.


Unrelated: the code is equivalent to read(fd, buf, RECORD_SIZE), which may read fewer than RECORD_SIZE bytes.

The function should read in a loop, to fulfill its contract (read a record).

  #include <unistd.h>
  
  ssize_t read_record(int fd, void* buf, size_t record_size, size_t* written) {
    for (*written = 0; *written < record_size; ) {
      ssize_t n = read(fd, (char*)buf + *written, record_size - *written);
      if (n < 1)
        return n; /* error or EOF */
      *written += n;
    }
    return *written;
  }
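To convince myself the loop really fulfills the contract, here's a quick harness (assuming POSIX pipes; the record contents and the demo function name are made up) that repeats the function above and pushes a 16-byte record through it:

```c
#include <string.h>
#include <unistd.h>

/* Same contract as read_record above: loop until a full record has
   arrived, returning early on error (-1) or EOF (0). */
static ssize_t read_record(int fd, void *buf, size_t record_size,
                           size_t *written)
{
    for (*written = 0; *written < record_size; ) {
        ssize_t n = read(fd, (char *)buf + *written,
                         record_size - *written);
        if (n < 1)
            return n; /* error or EOF */
        *written += n;
    }
    return (ssize_t)*written;
}

/* Push a 16-byte record through a pipe and read it back.
   Returns 0 on success. */
int read_record_demo(void)
{
    int fds[2];
    char buf[16];
    size_t written;
    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "hello, records!", 16) != 16)
        return -1;
    if (read_record(fds[0], buf, 16, &written) != 16 || written != 16)
        return -1;
    return memcmp(buf, "hello, records!", 16) == 0 ? 0 : -1;
}
```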


Ha, that page is frozen in 2004; it's like a web time capsule! The book is really good though. Not sure I would have found this otherwise, thanks for sharing.


I was into asm back when Amiga was around and lost the will and knowledge over time (mostly since programming is more of a hobby now, over the last couple of years). But, I had a strong desire to get into asm again and I did it a bit unconventionally. It worked, though.

Get a TIS-100 game and play it (that got me interested again). After that, I compiled simple(r) C programs with gcc and looked at their .S output. After that (along with that) grab a tool like ollydbg, x64dbg or (if you can!) IDA Pro and open up your favorite programs and modify them. Whenever you stumble upon an unknown (to you) instruction, look it up in the intel manual and google for it to see idioms people use. This process has worked really well for me, for now, albeit it feels like I'm cracking software or something like it (it's fun though). Along with that you can start writing asm blocks in your programming language of choice and/or full asm with any of the assemblers (flat, yasm, nasm, whatever).

Only thing you need to know beforehand are the basics of C and data/memory manipulation.


I was surprised to read that x64 apparently doesn't allow pushing or popping 32-bit values. I have a language that uses 32 bits as the basic unit for all values and I'm working toward x64 code generation. Should I just promote values to 64 bits and waste half the stack? Should I use mov instructions instead of push/pop? What solutions are other compiler-writers using?


x86-64 is really oriented around integer values being 64 bits. For example, 32-bit operations will zero-extend the result to write the full 64-bit integer register. The ABI also assumes integral values are promoted to 64 bits and that the stack is 64-bit aligned on calls.
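That zero-extension rule has a direct C analogue (an illustration of the rule, not the hardware itself): widening an unsigned 32-bit value to 64 bits always clears the upper half, just as writing EAX clears the upper half of RAX:

```c
#include <stdint.h>

/* Widening an unsigned 32-bit value zero-extends, mirroring what a
   32-bit register write (e.g. to EAX) does to the full 64-bit RAX. */
uint64_t zext32(uint32_t low)
{
    return (uint64_t)low; /* upper 32 bits are guaranteed zero */
}
```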

That said, as long as you keep RSP aligned you can do whatever you want. Consider this code:

    extern void value(int* a, int* b, int* c);

    int main() {
      int a, b, c;
      value(&a, &b, &c);
      return a+b+c;
    }
This is how LLVM compiles it:

    subq	$24, %rsp
    leaq	20(%rsp), %rdi
    leaq	16(%rsp), %rsi
    leaq	12(%rsp), %rdx
    callq	value
    movl	16(%rsp), %eax
    addl	20(%rsp), %eax
    addl	12(%rsp), %eax
    addq	$24, %rsp
    retq
Note that the int values are allocated at 4-byte alignment, but rsp is aligned to 8-bytes. If you add an additional parameter, 'd', you'll see that the compiler still allocates 24-bytes of stack, and stores the additional parameter at 8(%rsp) (which is unused in the code above).


Almost correct.

If you want to call C code conforming to the x86-64 SYSV ABI, RSP needs to be aligned to 16 bytes when you execute the call. If the code you generate never calls alien code, 8 byte alignment is enough.

Since 8 bytes are occupied by the return address pushed by the call which started your function, you need to decrease RSP by a further 8, 24, 40, 56, 72, ... bytes before calling code generated by others.

Reason: having the stack 16-byte aligned makes it easier to allocate aligned 16-byte stack variables, and this is useful because x86 has 16-byte registers (SSE) which are most efficiently loaded/stored at aligned addresses.

However, it isn't only performance that you lose by neglecting alignment. I learned the hard way that some code generated by gcc crashes if you call it with unaligned stack.

That's why in this example LLVM allocates 24 bytes, even though 16 would be enough for 3 ints.
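The arithmetic can be captured in a tiny helper (hypothetical, just to make the rule concrete): on entry RSP % 16 == 8 because of the pushed return address, so the amount subtracted must be 8 mod 16 for the next call to see a 16-byte-aligned RSP:

```c
#include <stddef.h>

/* Smallest amount to subtract from RSP on function entry so that,
   with at least `locals` bytes of stack space reserved, RSP is
   16-byte aligned again at the next call.  On entry RSP % 16 == 8
   (the call pushed an 8-byte return address), so valid answers are
   8, 24, 40, ... */
size_t stack_adjust(size_t locals)
{
    size_t s = locals;
    while (s % 16 != 8)
        s++;
    return s;
}
```

For three 4-byte ints this gives stack_adjust(12) == 24, matching the 24 bytes allocated in the LLVM output above.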

Another example (gcc):

  extern void bar();

  void foo() {
          bar();
  }

  0000000000000000 <foo>:
   0:   48 83 ec 08             sub    $0x8,%rsp
   4:   b8 00 00 00 00          mov    $0x0,%eax
   9:   e8 00 00 00 00          callq  e <foo+0xe>
   e:   48 83 c4 08             add    $0x8,%rsp
  12:   c3                      retq
To anyone writing x86-64 compilers, I recommend finding the x86-64 SYSV ABI spec and reading it. Saves debugging time.


Ah, good point. I forgot about the return address.


> I was surprised to read that x64 apparently doesn't allow pushing or popping 32-bit values.

A simple way to circumvent this problem (I don't claim it is the best) is

  sub rsp, 4
  mov [rsp], eax
where eax of course contains the value to push.


> The way the x87 FP stack works is a bit weird, and these days it’s better to do floating-point arithmetic using xmm registers

x87 allows calculations in extended precision, i.e. the word width is 80 bits. SSE is limited to double precision (64-bit words). The FPU also has some advanced math instructions, like sin, cos, tan, exp, etc., and these operations will be available in AVX512F.
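The precision difference is easy to observe from C (assuming gcc or clang on x86, where long double is the 80-bit x87 format): a term of 2^-53 is lost when rounding to double but survives in extended precision:

```c
#include <float.h>

/* Returns 1 if extended precision preserved a term that double lost.
   volatile blocks constant folding, so the arithmetic really happens
   in the declared types. */
int extended_precision_demo(void)
{
    volatile double eps = DBL_EPSILON;    /* 2^-52 */
    double d = 1.0 + eps / 2;             /* ties-to-even: rounds back to 1.0 */
    volatile long double leps = DBL_EPSILON;
    long double ld = 1.0L + leps / 2;     /* exactly representable in 64-bit mantissa */
    return d == 1.0 && ld > 1.0L;
}
```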


Have fun debugging arithmetic differences resulting from spilling an intermediate result to memory and rounding it to a 64-bit value vs. not spilling and keeping the intermediate as 80 bits.


FST/FLD can store/load full 80 bits if you need such precision. No problem whatsoever.


I've never seen anyone store 10 byte IEEE-754 values in memory.

People store doubles, and people get upset when compiler optimizations (like when to spill from registers to memory and whether to use one or two instructions for multiply-and-add) change not just the performance but also the result of computations.


> I've never seen anyone store 10 byte IEEE-754 values in memory.

But you can, if you are after precision as the OP apparently was.

I'm not sure what this c-word is doing in your post; I thought we were talking assembly. But as far as c-things go, at least gcc and clang represent long double as 80-bit extended precision on x86.

So, for example, compiling this beauty (which is too large for x87 stack):

        long double a[64], b[64], c[64], d[64];
        // load a,b,c,d from somewhere

        long double x =
                ((((((((a[0]+a[1])+(a[2]+a[3]))+((a[4]+a[5])+(a[6]+a[7])))
                +(((a[8]+a[9])+(a[10]+a[11]))+((a[12]+a[13])+(a[14]+a[15]))))
                +((((a[16]+a[17])+(a[18]+a[19]))+((a[20]+a[21])+(a[22]+a[23])))
                +(((a[24]+a[25])+(a[26]+a[27]))+((a[28]+a[29])+(a[30]+a[31])))))
                +(((((a[32]+a[33])+(a[34]+a[35]))+((a[36]+a[37])+(a[38]+a[39])))
                +(((a[40]+a[41])+(a[42]+a[43]))+((a[44]+a[45])+(a[46]+a[47]))))
                +((((a[48]+a[49])+(a[50]+a[51]))+((a[52]+a[53])+(a[54]+a[55])))
                +(((a[56]+a[57])+(a[58]+a[59]))+((a[60]+a[61])+(a[62]+a[63]))))))

                +((((((b[0]+b[1])+(b[2]+b[3]))+((b[4]+b[5])+(b[6]+b[7])))
                +(((b[8]+b[9])+(b[10]+b[11]))+((b[12]+b[13])+(b[14]+b[15]))))
                +((((b[16]+b[17])+(b[18]+b[19]))+((b[20]+b[21])+(b[22]+b[23])))
                +(((b[24]+b[25])+(b[26]+b[27]))+((b[28]+b[29])+(b[30]+b[31])))))
                +(((((b[32]+b[33])+(b[34]+b[35]))+((b[36]+b[37])+(b[38]+b[39])))
                +(((b[40]+b[41])+(b[42]+b[43]))+((b[44]+b[45])+(b[46]+b[47]))))
                +((((b[48]+b[49])+(b[50]+b[51]))+((b[52]+b[53])+(b[54]+b[55])))
                +(((b[56]+b[57])+(b[58]+b[59]))+((b[60]+b[61])+(b[62]+b[63])))))))

                +(((((((c[0]+c[1])+(c[2]+c[3]))+((c[4]+c[5])+(c[6]+c[7])))
                +(((c[8]+c[9])+(c[10]+c[11]))+((c[12]+c[13])+(c[14]+c[15]))))
                +((((c[16]+c[17])+(c[18]+c[19]))+((c[20]+c[21])+(c[22]+c[23])))
                +(((c[24]+c[25])+(c[26]+c[27]))+((c[28]+c[29])+(c[30]+c[31])))))
                +(((((c[32]+c[33])+(c[34]+c[35]))+((c[36]+c[37])+(c[38]+c[39])))
                +(((c[40]+c[41])+(c[42]+c[43]))+((c[44]+c[45])+(c[46]+c[47]))))
                +((((c[48]+c[49])+(c[50]+c[51]))+((c[52]+c[53])+(c[54]+c[55])))
                +(((c[56]+c[57])+(c[58]+c[59]))+((c[60]+c[61])+(c[62]+c[63]))))))

                +((((((d[0]+d[1])+(d[2]+d[3]))+((d[4]+d[5])+(d[6]+d[7])))
                +(((d[8]+d[9])+(d[10]+d[11]))+((d[12]+d[13])+(d[14]+d[15]))))
                +((((d[16]+d[17])+(d[18]+d[19]))+((d[20]+d[21])+(d[22]+d[23])))
                +(((d[24]+d[25])+(d[26]+d[27]))+((d[28]+d[29])+(d[30]+d[31])))))
                +(((((d[32]+d[33])+(d[34]+d[35]))+((d[36]+d[37])+(d[38]+d[39])))
                +(((d[40]+d[41])+(d[42]+d[43]))+((d[44]+d[45])+(d[46]+d[47]))))
                +((((d[48]+d[49])+(d[50]+d[51]))+((d[52]+d[53])+(d[54]+d[55])))
                +(((d[56]+d[57])+(d[58]+d[59]))+((d[60]+d[61])+(d[62]+d[63]))))))))
                ;
produces only 80b spills (fstpt in GNU syntax):

  $ objdump -d fpmonster |grep fst
  400457:       db 7c 1c 10             fstpt  0x10(%rsp,%rbx,1)
  400462:       db bc 1c 10 04 00 00    fstpt  0x410(%rsp,%rbx,1)
  400470:       db bc 1c 10 08 00 00    fstpt  0x810(%rsp,%rbx,1)
  40047e:       db bc 1c 10 0c 00 00    fstpt  0xc10(%rsp,%rbx,1)
  400b47:       db 7c 24 10             fstpt  0x10(%rsp)
  400d91:       db 3c 24                fstpt  (%rsp)
Clearly, extended precision can be done right both in C and raw assembly.

> People store doubles, and people get upset

That's their fault :) and another story altogether. For reproducible low precision, indeed SSE is the way to go.


Anytime I see anything named my-asm or mini.asm or anything like that, it instantly yanks me back to college. We had this awesome teacher who had been at DEC for decades and taught at night. He'd bring in chunks of core memory and tell us all about the old days in between course work. God, I loved that class.


I'm reading Hackers by Steven Levy right now, and DEC's PDP computer era as described seems like a real golden age. I'd enjoy more history book recommendations along this line if anyone knows of some.


The Soul of a New Machine, about building the first Data General machines.


Sunburst: The Ascent of Sun Microsystems. Terrific stories, so many lessons.


Specifically about British computing is Electronic Dreams (Tom Lean) (disclosure: I went to university with the author).

The Cogwheel Brain: Charles Babbage and the Quest to Build the First Computer (Doron Swade) is also an interesting read, especially as Babbage's ideas were eventually vindicated when the Science Museum built it in the 1990s and it worked.


A very nice book about assembly programming is "Assembly Language Step-by-Step: Programming with Linux, 3rd edition" (http://www.amazon.com/dp/0470497025).

The nice thing about this book is that it first guides the reader to understand how the machine works, and only then moves on to assembly programming.

The sad thing about this book is that it covers only 32-bit Intel-compatible processors.

My guess is that the original author has grown old and is not interested in producing a fourth edition of such a book.

On this matter, I would like to ask: is it worth learning assembly for the 32-bit x86 instructions, now that pretty much every computer is built on the amd64 architecture?


I work in an IT dept which supports almost a dozen departments that all told use about 30 or 40 apps, almost all of which are still 32-bit. The hardware is recent and all 64-bit (as is our OS) but even the MS Office we use is 32-bit because of interaction with other apps. We also have to default the browser to the 32-bit IE executable rather than the 64-bit because of plugins (even MS recommends this). Most vendors still aren't up to 64-bit yet because they don't want to shut out the customers that are still years behind on upgrading. I'm thinking 32-bit will still be around for another 10 years to be on the safe side.


When I worked at a university, we had several pieces of legacy 32-bit software, mostly written in C, which were essential to some courses. It became more and more difficult to run them as Linux distributions stopped shipping 32-bit libraries by default (I think Scientific Linux 7.1 caused a lot of problems because of this).


amd64 is essentially a superset of the 32-bit version, and so it makes sense to understand 32-bit first. Actually, it's more like a half-64-bit extension because not everything is consistently 64-bit and a lot of things like operand sizes and registers actually default to being 32 bits.


That's true. I started with 16b, and the transitions to 32b and 64b were pretty straightforward. It's mostly just the registers becoming wider.


I found this book during my Google searches and really enjoy reading the author's website. Here's a comment [1] he made 3 years ago about a new version:

"Well, that isn’t completely my decision. When the publisher wants a new edition, they contact me, and then I begin writing. I don’t see a new edition on the horizon for a couple of years yet. Worse, I need additional pages to cover 64 bit issues, and the number of pages I have in the book is limited. Unless I can persuade the publisher to go beyond the 600-page mark, I’m going to have to eliminate other material to cover 64-bit assembly."

[1] http://www.contrapositivediary.com/?page_id=1808#ASMSBS3E



Adding this to my read-later list. Looks like a good intro. Does anyone know of a similar resource for ARM assembly?


I'm not aware of a good free introduction to ARM assembly, but you could check out the reading lists from these courses:

http://studentnet.cs.manchester.ac.uk/ugt/COMP15111/syllabus... http://studentnet.cs.manchester.ac.uk/ugt/COMP22712/syllabus...

Despite its age, ARM System-on-Chip Architecture (Steve Furber) is still a good introduction to the processor and assembly (I'm re-reading it at the moment). ARM Assembly Language - an Introduction (J.R. Gibson) is worth reading if you want to learn ARM assembly.

The materials page for COMP22712 also has some interesting resources, including the lab manuals and a small ARM assembler written in C (source is also on GitHub: https://github.com/uomcs/aasm). Unfortunately COMP15111 is now hosted on Blackboard so you can only get at the materials if you're a current student enrolled on the course.

(I've been an undergrad, postgrad and staff in CS at Manchester, so I'm familiar with the courses - other universities may have similar resources with fewer access restrictions).


I'm planning on writing a calculator in x86 assembly, so this will probably be a good starting point for me. I want to do this to grasp a better understanding of assembly. I currently am looking to develop on my MacBook Pro. Does anyone have more resources and/or suggestions? Thank you.


That is a little vague. What type of calculator exactly? A GUI one, or just an expression parser/evaluator? For the latter, you might find this interesting:

http://www.hugi.scene.org/compo/compoold.htm#compo4


Randall Hyde's work is interesting for beginners and experienced users alike. Art of Assembly teaches you assembly, but he uses a high-level assembler to do it in pieces. So you can abstract away some things, as in an HLL, and ignore them until you understand enough to use the raw ASM. Likewise, if you use HLA for projects, you can do HLL stuff where understanding is more important than performance/memory. The standard library for HLA is so large the HTML reference about froze my browser, haha.

http://www.plantation-productions.com/Webster/


This is a nice fundamental introduction, but where would I go to see how to actually run code?


My best introduction to any asm language is just my C compiler putting out asm (gcc -S, I think). I can create small programs to do what I want, and see what the compiler puts out.


https://gcc.godbolt.org is also helpful; it colorizes the assembly output corresponding to the C/C++ source. Furthermore, you can simply switch between Intel and AT&T syntax.


The -S flag also works in clang, although the output may differ. For example, last time I checked, a comparison of i < 10 in a for loop would become:

  cmpl    $9, -4(%rbp)
  jle .L3
under GCC ('if i <= 9, loop') and:

  cmpl    $10, -8(%rbp)
  jge .LBB0_4
under clang ('if i >= 10, end the loop'). Neither GCC nor clang does an exact conversion, and they produce different assembly instructions (optimisations disabled in both cases).

If you build a cross-compiler, you can also output assembly for architectures other than your local machine, though this can be quite fiddly (see crosstool-ng for a project which has done most of the work for you).


I wonder if there is a nice and easy-to-use x86 simulator like the MARS simulator for MIPS [1]? Somewhere I can run (step through) little x86 assembly programs and study their effect on each register.

[1] http://courses.missouristate.edu/KenVollmar/MARS/


GDB?


By GDB, you mean the GNU debugger? I have not used it for over a decade. Is it a convenient interface for toying and experimenting with x86 assembler?



Am I the only one who feels the "intro to x86" market is oversaturated?


Seconding this. Is there even any advanced x86_64 assembly language material out there besides the AMD and Intel reference manuals?


Probably not. I think when you reach a certain level of understanding, you'll use the manuals. By then you know what you're looking for. :)


And once you start reading the docs on the more efficient but far less consistent 64-bit calling convention, you may find yourself choosing words to describe it other than the "improved" that this author opted for.


I take it you're a fan of the plan9 calling convention (f. ex.: all arguments and return value(s) are on the stack)?

Curiously, it doesn't actually appear all that inefficient. Go uses it AFAIK. I wonder whether anyone has studied it. I also wonder whether gccgo uses that convention or defaults to SysV x64.


I didn't know it was a System V thing. Thank you for cluing me in!

I don't think of it in terms of like-vs-dislike. My observation is that it's a difficult thing to get right without a compiler, and thus avoided for introductory material.

As far as I am aware, storing values in a register requires fewer instructions. However, I have never personally confirmed the performance difference of this calling convention.

Handling of all calling conventions is one of the many things that I am personally much happier leaving up to GCC and LLVM in practice.


It's somewhat analogous to how we went from 16 bit registers to 32 bit registers with the extended opcode variants, so if you understand x86, x86_64 isn't really much of a stretch.


I only have a vague understanding of assembly in general, and of x86_64 in particular, but I thought that with so many more registers, the recommended style of programming can change quite a bit (hence the aggressive use of registers for passing arguments to function calls in the x86_64 C ABIs)?


That's less of an assembly issue and more of a compiler/calling convention issue. Assembly allows you to pass parameters on the stack or in registers. Calling conventions simply define a protocol for doing this consistently.


Actually I'd say that it's mostly a crappy-processor-design issue: for example, UltraSPARC has 32 physical registers in 32-bit mode, and 256 virtual registers (through register windows, specifically designed for compilers). Even the Motorola 68000, with eight general purpose address registers and eight general purpose data registers, is far more elegant than a 32-bit four-register intel CPU.

intel CPU is just crap from a design standpoint, and since they had to remain backward compatible, it's gotten a lot faster with lots and lots of tricks, but it still sucks in 32-bit mode. No amount of tricks will change that. It has to be run in 64-bit mode to gain a performance boost and simplify the code, whereas processors with fixed 32-bit instruction encoding run faster in 32-bit mode and code simplicity is a constant.


While x86 is arguably pretty ugly, I think you're being unfair.

> ... far more elegant than a 32-bit four register intel CPU.

I count eight: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP.

x86 also has the advantage of supporting arbitrary immediate values, so you don't need to allocate registers just for constants.


eax is the accumulator register, ebx the base register, ecx the counter register, edx the data register, esi is the source index, edi is the destination index, ebp is the base pointer, and esp is the stack pointer.

When I originally wrote four general purpose registers, I had eax, ebx, ecx and edx in mind, but after fully listing them above, I revise my earlier statement: the x86 assembler has two general purpose registers. Crappy processor architecture with lots of specialized registers, but too few general purpose ones.

Compared to MOS 6502, Motorola MC680##, or SPARC, only the eax and edx are really general purpose registers -- even mentioning ecx, the counter register, would be iffy.


The 6502 has only the accumulator anyway. X and Y are not general purpose.

68k is pretty nice, d0-d7 registers are indeed interchangeable. Of course a0-a7 are just for addressing, I think a7 was usually stack pointer.

SPARC I've never programmed, so no comments about it.

I've written x86 code in the past (20 years ago) using all 8 registers for general purpose task -- yes, even ESP. It was faster that way to implement a texture mapper. Ugly but fast.

Those 8 x86 registers are mostly general purpose, apart from some exceptions.

Multiplication was the only annoying one, getting its result in EDX:EAX.

I always succeeded making x86 do whatever I wanted, despite some limitations with register use.


Indeed. While repurposing ESP might be rightfully considered ugly, repurposing EBP is quite common and EBX, ECX, ESI, EDI pretty much are general purpose because nobody has been using them for their fixed functions for two decades.


Is there something wrong with loop and friends?

I agree that stosb/rep would probably be confined to memory management, and saving/restoring registers around such ops isn't the end of the world. Not sure about movsb -- I suppose if you're copying enough data, save/restore is going to be negligible overhead in terms of speed, but if you're actually trying to write clear code, it would certainly be easier not to have to worry about the bookkeeping?


I don't know the current status quo, but for most of the time after the 80286/80386, "rep stos" and "rep movs" have been significantly slower than just (loading and) storing data in an unrolled loop. This limits their usefulness to very short spans, or to cases where code size is most important. But most short spans are also predictable (static), so compilers can often just generate an instruction or two instead (like xor eax, eax / mov <target>, eax).

Currently the fastest way to memset large chunks of memory is probably to use SSE or AVX. I'd guess this is what gets generated if the compiler's target arch allows it.

With SSE/AVX you also have an option to use non-temporal moves to avoid polluting caches. This might have a negative impact on any memset micro-benchmark [1], but significantly help any concurrently executing memory bound CPU cores.

Properly aligned (cache line 64-byte boundary) you might be able to avoid read-for-ownership as well, further reducing memory bus traffic.

So most use of rep-prefix might be pointless, unless you can accept the performance hit.

[1]: Just like micro-benchmarking any other resource constrained operation. Micro-benchmarks can give you very wrong idea of what is best for the system as a whole.
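In practice you rarely have to choose by hand: give the optimizer a plain clearing loop and it will typically (an assumption about gcc/clang at -O2, not a guarantee) emit SSE/AVX stores or a memset call for you:

```c
#include <stddef.h>
#include <stdint.h>

/* A plain byte-clearing loop.  At -O2 modern compilers usually either
   vectorize this with wide stores or replace it with a memset call,
   which is why hand-written "rep stos" is rarely a win anymore. */
void clear_buf(uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = 0;
}
```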


Those instructions are treated as "legacy PITA" by CPU vendors and, being complex and harder to implement than simpler ones, aren't implemented as efficiently.

The CPUs have lots of duplicated logic to process many instructions in parallel and, on "friendly" code, can sustain average throughput of 2 or more instructions per clock cycle, provided that the instructions are simple enough.

The end result is that a loop made with normal adds, cmps and jnes outperforms those dedicated looping instructions.

They are only used by compilers when optimizing for code size, and maybe by people who want concise hand-written assembly, though I'm not sure why they wouldn't just use C in that case.

See "Software Optimization Guides" released by AMD/Intel for more info.


Isn't it disingenuous to call esp/ebp (or rsp/rbp) general purpose registers, considering there are instructions that implicitly assume esp/rsp is the stack pointer (push, pop, call, ret, ...)?


> Isn't it disingenuous to call esp/ebp (or rsp/rbp) general purpose registers

In the sense of instruction encoding, they are. Additionally, the x86-64 calling convention typically does not use ebp as a frame pointer (except when you use something like alloca). Instead, functions typically allocate their necessary stack space at the beginning. So ebp is generally used as a general purpose register by most compilers on x86-64.


Is there something like SPARC or other RISC architectures?


Yes there are... I wouldn't recommend starting with SPARC asm though; it's nowhere near as fun as a CISC like x86, nor as easy as MIPS (which tends to be the "boring" go-to architecture CS courses use). ARM is more interesting than MIPS and easier than x86.


I've done a little bit of x86 and I have to say I'm not very impressed by some of it. It seems like the lowest common denominator where nothing 'fun' happened.

ARM has a few ecosystem problems that I'd rather not deal with. From what I understand, there is a lack of hardware discovery. ARM is more embedded than 'user' computing. Nothing is swappable.


You might find these size-optimisation challenges more fun:

http://www.hugi.scene.org/compo/compoold.htm

The 256B and below categories in the demoscene are also sources of interesting x86 Asm programs:

https://news.ycombinator.com/item?id=7960358


What do you mean by ARM being more interesting than MIPS?


- More ARM systems are shipping today than MIPS

- ARM has a standard MMU (well, two major revs). MIPS has quite a few variants, and they involve TLB invalidation/reload. They're both worth working with.

- ARM has made some interesting architectural choices over the years, and it's worth studying what they've done. MIPS is more of a static platform, with not as much market pressure or will to innovate.


Yes, The SPARC architecture manual. Punch that into Google or DuckDuckGo and it should be the second link (you want the SPARC V9 instruction set architecture). I would post the link I found, but since I actually have the physical book, I'm uncertain as to whether posting a link to a PDF of a possibly copyrighted book would get me in trouble with Hacker News or not.

I've done some SPARC assembler programming for fun, and for someone with a 6502 / MC680x0 assembler background, SPARC has a really exotic assembler (register windows and synthetic instructions, most notably). It also illustrates just how complex and powerful the Scalable Processor ARChitecture is; for a RISC CPU, it has loads and loads of advanced features all designed for high performance, which leads me to think that the existing compilers generating SPARC code must be crap, since some of the SPARC processors (notably UltraSPARC II, UltraSPARC III, and UltraSPARC T1) are generally known to be slow when it comes to non-parallelized number-crunching performance. The design of the SPARC processors and their assembler is in direct opposition to that.


It shouldn't be a problem to link to it, because you can find a newer version of it sitting on Oracle's site as well as SPARC's official site:

http://sparc.org/technical-documents/specifications/

> are generally known to be slow when it comes to non-parallelized, number crunching performance

I think that's due to a similar set of design decisions that led to the Itanium also being an amazing benchmark performer, but dismal in "general purpose" code. IMHO SPARC and Itanium are architectures that have optimised heavily for the sort of massively parallel, predictable/few-branches, predictable-memory-access-pattern work that "high performance computing" benchmarks show off, but at the expense of the small, branchy, less predictable "serial byte/bit manipulation" that tends to be common in other, more general-purpose code. To use a car analogy, it's like a dragster vs. a rally car.


How much does the hard copy run you? I am thinking of porting my hobby OS to the architecture so it's probably worth having around in hard copy form.


Prices vary. I might have bought it new off of Amazon for $90, but I don't remember any more.

http://www.amazon.com/SPARC-Architecture-Manual-Version9-Int...


> other RISC architectures...

For MIPS there is an excellent book called 'See MIPS Run' by Dominic Sweetman. You might find it to be quite instructive...


See the RISC-V User-Level ISA Specification v2.0 [1]. It's more detailed but still fairly easy reading.

[1] http://riscv.org/specifications/


Hi, Nayuki here. I'm happy to take any comments, questions, and constructive criticism on the article.



