
X86 assembly doesn’t have to be scary - signa11
https://blog.benjojo.co.uk/post/interactive-x86-bootloader-tutorial
======
tvmalsv
When I was about 14 I was enthralled with programming my new Commodore 64,
first in BASIC, then 6510 assembly. I had the opportunity to accompany my
mother to a one-day class on programming. Being just an intro on the subject,
I was well ahead of what they would be discussing, but thought it would be
interesting to talk to some adults who were also into programming.

I was talking to a couple of guys about what I had been doing on my C=64, and
when I mentioned the assembly stuff I was writing, one of them said, "How can
you possibly write anything with only three registers?!" (just the accumulator
and x/y registers). I was wondering what the big deal was since that was the
only architecture I had known at that point. Every game and utility I had was
using only three registers, so it was already proven to me that three were
"enough".

It's funny how you can just adapt and work with whatever is available, and
that becomes your norm. Especially when you don't even realize there are other
options out there.

Those were the days!

~~~
gruez
>"How can you possibly write anything with only three registers?!"

AFAIK you can even go down to 1 register, which is how stack machines work.
You might even say it’s 0 registers because it’s not something you can
directly access.

~~~
poizan42
There are many simple microcontroller architectures with only one register,
usually called A (accumulator) or W (work register). You just need to
constantly load memory to and from the register, nothing weird in that.
Sometimes they will call their memory locations registers, in which case they
have lots of registers - the terminology gets quite unclear when everything is
on the same chip anyways.
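
Even in x86 terms you can see the style: everything funnels through the one
register. A rough NASM-ish sketch (a, b and c are made-up memory labels),
computing c = a + b:

    mov al, [a]    ; load a into the accumulator
    add al, [b]    ; accumulate b
    mov [c], al    ; store the result back to memory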

A true stack machine does not have any registers, so that would be 0.

~~~
blattimwind
RPN calculators are simple stack machines.

------
zokier
If your intention is to avoid scaring newbies off, I'm not sure if 16-bit real
mode, the PC boot process, BIOS services and all the arcana that follows is the
best place to begin.

~~~
msla
I agree with you to an extent: The IBM PC was not a very elegant design, and
other computers certainly had less complexity to them. But there is a point I
want to make which defends this choice somewhat:

Newbies like me are more frustrated by thinking there's no path from
introductory material to something useful or realistic. Something that's more
immediately friendly would be a simplified virtual machine with no or trivial
peripheral hardware, like Redcode in Core War:

[http://vyznev.net/corewar/guide.html](http://vyznev.net/corewar/guide.html)

The death of education is "So What?": "OK, I've learned to play Core War and I
know enough to write a warrior that can occasionally beat other warriors
written by beginners. So far, so good... so what? I want to program computers
in assembly, not play games using assembly programs, so where's the path from
where I am now to where I want to be?"

In this case, the pathway from VM opcodes to native opcodes is hardware
interface and, while the IBM PC has some funky hardware, it's heartening to
know that your code could, in principle and barring emulation errors, run
unmodified on real hardware and do something. Not something useful, but you
can get to useful. You can ramp up to it, now that you have the pathway in
front of you. It might be a long pathway, but it's there.

~~~
zokier
Well, I had in mind something _more_ useful and realistic, not less; namely
teaching assembly in normal Linux environment. Sure, there are all sorts of
complexities there too, but in general I feel like they are also more
worthwhile.

~~~
msla
> Well, I had in mind something more useful and realistic, not less; namely
> teaching assembly in normal Linux environment.

This might sound odd, but as someone who's done some assembly programming
under Linux, it usually isn't different enough from C to be worth the effort.
Even if you eschew libc, the kernel APIs are still fairly high-level and, more
to the point, there are no new concepts relative to C: You have pointers and
pointer arithmetic, you have fixed-size buffers, you have ints and doubles,
and the rest is just syntax.
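
For instance, a bare write(2) call on 64-bit Linux looks roughly like this (a
NASM-ish sketch, untested; msg and msglen are made-up labels) - it's really
just the C call spelled differently:

    mov rax, 1           ; syscall number 1 = write
    mov rdi, 1           ; fd 1 = stdout
    lea rsi, [rel msg]   ; pointer to the buffer
    mov rdx, msglen      ; byte count
    syscall              ; same fd/pointer/length you'd pass from C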

The exception is doing something like a really tight high-performance kernel
using SIMD opcodes, which necessarily involves learning a lot about the
specific SIMD hardware and data organization to optimize cache use and other
details C can't express.

------
ChrisSD
I wonder if the history of x86 is holding us back in a big way. It started out
being close to the metal but now it's an abstraction that can mislead you if
you think processors are literally working the way x86 assembly describes.

And surely the whole Spectre issue could be lessened if we could be less
reliant on CPUs having to guess what to keep in cache, which code paths are
most likely, etc.?

~~~
blattimwind
> I wonder if the history of x86 is holding us back in a big way. It started
> out being close to the metal but now it's an abstraction that can mislead
> you if you think processors are literally working the way x86 assembly
> describes.

But when you compare high-end cores, you always see the same picture,
regardless of ISA: a large surface area (~1/3 of the core) used for insn
fetch/decode/schedule. They all look the same, whether it's POWERn or the
latest Sandy-Bridge/Haswell rehash... people claim that other ISAs would be a
lot easier/faster/more efficient to decode, but that doesn't seem to be true
in practice. It seems to me that any difference that might be there gets
dwarfed by the sheer complexity of OoOE and speculative execution.

What we _do_ know is that outside some niches (like DSPs, where VLIW is king),
static techniques essentially _do not work_ for application code, because its
behaviour is _impossible_ to predict statically.

~~~
thechao
Even 15 decoders are a pinprick on the core's area, at least when I saw die
area for LRB. Fetch & schedule were a bit larger. Most of the core area was
register files & floating-vector logic.

~~~
brandmeyer
In fairness, Larrabee was originally an in-order design.

~~~
gpderetta
It was also a simpler core, which means that for a complex OoO monster, the
decoder would be an even smaller area. The OoO instruction scheduler would be
a significant chunk, but that has nothing to do with being x86.

------
krylon
I think to many programmers, assembly is the "GOTO" of programming languages:
From the day you start learning to program, you are told that all this fancy
high-level-language stuff is there so you do not have to deal with assembly.
So most people never go there.

I _did_ go there, briefly, about ten years ago. It was wicked fun. But all in
all, I may have written maybe 20 or 30 instructions of assembly in total. I
did try to rewrite a few small, heavily-used functions from our code base in
assembly only to discover that the code I came up with was practically
identical to what the compiler emitted. At that point I figured that the
people who told me that "you can't beat the compiler" were probably right and
called it a day[0]. Alas, I never had the hardcore performance requirements
that would make me go back there. But it was fun to get a taste of it.

[0] At the same time, I was kind of proud that I did not get beat by the
compiler. Then again, those functions were fairly trivial.

~~~
scottlamb
> At that point I figured that the people who told me that "you can't beat the
> compiler" were probably right and called it a day

On that subject...my understanding [0] is:

These days, the best bet for beating the compiler is to use vendor intrinsics
(for SIMD, encryption, bit-twiddling, etc). Shaving an instruction off the
inner loop might give you a few percent; using SIMD lets you operate on 256 or
512 bits per instruction instead of 8, 16, 32, or 64. You might be able to
show your inner loop is memory-bound (and thus prove further improvements have
to come from algorithmic improvements / better cache locality, rather than
continuing to fiddle with instructions).

The compiler automatically uses SIMD sometimes, but it can't do so reliably:

* The transformations require things the compiler isn't allowed to do, like increasing alignment of key variables or altering the larger algorithm.

* code that might run on older processor revisions needs multiple implementations selected at runtime. I think gcc has some magic extension ("target_clones"?) to do this relatively easily; otherwise you might need to write your own logic to decide which function pointer to use.

Note that each "vendor intrinsic" matches one assembly instruction, and it's
valuable to understand assembly while writing them, but the actual code you
check in can end in .cc (C++) or .rs (Rust) or whatever. Doing so means it can
be inlined into functions written in the higher-level language, you don't have
to encode knowledge about the platform's calling convention into your code,
etc.
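
For a feel of what those instructions look like underneath (a rough NASM-style
sketch, untested, using 128-bit SSE for brevity; _mm_loadu_si128 and
_mm_add_epi32 are the intrinsics that map onto movdqu and paddd), here's an
inner loop adding two arrays of 32-bit ints four lanes at a time - label and
register assignments are arbitrary:

    ; rsi = src1, rdx = src2, rdi = dst, rcx = number of 4-int chunks
    vec_add_loop:
        movdqu  xmm0, [rsi]     ; load 128 bits (4 ints), unaligned
        movdqu  xmm1, [rdx]
        paddd   xmm0, xmm1      ; 4 additions in one instruction
        movdqu  [rdi], xmm0     ; store 4 results
        add     rsi, 16
        add     rdx, 16
        add     rdi, 16
        dec     rcx
        jnz     vec_add_loop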

[0] Not from personal experience. Corrections welcome.

~~~
glangdale
Your understanding is very good for someone without personal experience.

Automatic compiler use of SIMD is rarely that great unless you're in a nice
big loop doing nice regular things. I've pretty much never seen it on the
stuff I do.

Using intrinsics gets you 95% of the way there. I reach for asm only when I
absolutely have to. It is a huge PITA. My irritation at the "bro, just write a
.s file" people peaks when I'm trying to write a 200LOC function with 10
different variants based on (say) pipeline depth and unroll width. Yeah,
because I'd like to spend the next year doing register allocation by hand.

The compiler is really good at doing routine stuff, and when I hand-edit the
asm to do things that better fit my idea of regalloc and scheduling I usually
make things worse. Where the compiler falls down is instruction selection and
stuff that borders on algorithm design.

For example, I built a shift-or string matcher in SIMD where a first-stage was
OK to have false positives (positives in shift-or are represented by zeros in
the bit vector). I was able to get a big performance boost by tolerating these
false positives when shifting SIMD bits and bringing in some zeros, but no
compiler is going to know that a few false positives are OK in that
circumstance.

IMO the best way to work is with intrinsics, a tiny bit of embedded asm for
things that you can't get intrinsics for (I had to resort to gcc asm blocks to
make a cmov happen) and close inspection of your object file (at least on the
hotspots) to ensure that the code you're getting is what you think you're
getting. It's possible to make minor screwups and suddenly see dozens of extra
instructions pushing everything in and out of memory for no good reason.

The other place you can beat the compiler is by doing deeper/wider pipelining
of branch free code. This is a dark art. Often going branch free is 10-20%
worse than branchy when you have _1_ iteration happening at a time but it will
scale better when you are doing lots of stuff at once - if you have (say) 12
different copies of your loop body happening in one iteration, and there's a
mildly unpredictable branch per loop body, the branch miss on one iteration
stops all the others from progressing too!
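
For concreteness, the canonical building block for going branch free is cmov,
which replaces a compare-and-branch with a conditional register move. A tiny
sketch (register choice arbitrary):

    ; branchless max: rax = max(rax, rdi), nothing for the predictor to miss
    cmp    rax, rdi
    cmovl  rax, rdi    ; if rax < rdi (signed), copy rdi into rax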

I occasionally blog on these things at branchfree.org and have some more low-
level stuff brewing shortly.

------
pjmlp
Loved the article.

Then one can follow up with some MS-DOS classics.

[https://www.amazon.com/Peter-Nortons-Assembly-Language-
Book/...](https://www.amazon.com/Peter-Nortons-Assembly-Language-
Book/dp/0136619010)

[https://www.amazon.com/Advanced-S-DOS-Programming-
Microsoft-...](https://www.amazon.com/Advanced-S-DOS-Programming-Microsoft-
Programmers/dp/0914845772/ref=pd_lpo_sbs_14_t_2)

[https://www.amazon.com/Peter-Norton-Programmers-Bible-
progra...](https://www.amazon.com/Peter-Norton-Programmers-Bible-
programming/dp/1556155557/ref=pd_lpo_sbs_14_t_0)

[https://www.amazon.de/PC-Intern-Programming-Encyclopedia-
Dev...](https://www.amazon.de/PC-Intern-Programming-Encyclopedia-
Developers/dp/1557551456/ref=sr_1_2)

~~~
mhd
I see your Norton and raise you an Abrash:

[https://www.amazon.com/Zen-Assembly-Language-Knowledge-
Progr...](https://www.amazon.com/Zen-Assembly-Language-Knowledge-
Programming/dp/0673386023/)

[https://www.amazon.com/Zen-Graphics-Programming-2nd-
Applicat...](https://www.amazon.com/Zen-Graphics-Programming-2nd-
Applications/dp/1883577896/)

[https://www.amazon.com/Zen-Code-Optimization-Ultimate-
Softwa...](https://www.amazon.com/Zen-Code-Optimization-Ultimate-
Software/dp/1883577039/)

Zen of Asm is also online; I think a few of the other works are, too.

[http://www.jagregory.com/abrash-zen-of-asm/](http://www.jagregory.com/abrash-
zen-of-asm/)

~~~
disqard
Ahhh, Zen of Graphics Programming was one of my first graphics books, which
led straight to Foley, van Dam, et al., and I was hooked :)

It's safe to say Abrash's book had a big impact on me.

Norton's books also make me nostalgic, so thank you for that, GP.

------
setquk
I started with 6502, then ARM in 1989 (thanks, Acorn!), then a little bit of
x86. The latter was a culture shock. It was like grooming the devil's
genitals in comparison to ARM. It drove me to C and Unix where I’ve been happy
ever since (occasionally a bit of PIC assembly as well).

It's not scary, just nasty.

------
inamberclad
The problem I keep running into is that everybody's asm is different looking,
and not even internally consistent. Trying to parse all of this stuff:

    
    
        [BITS 16]  ;tell the assembler that it's 16-bit code
    

Okay, so this is an instruction to the assembler saying that we're only
working with 16-bit registers and to emit 16-bit code. Straightforward enough
here.

    
    
        [ORG 0x7C00] ;Origin, tell the assembler where the code will
                     ;be in memory after it has been loaded
    

This means that the first instruction will be at physical address 0x7c00 when
the code loads, right? What's the first instruction, the one below?

    
    
        mov ah, 0x0A ; Set the call type for the BIOS
        mov al, 66   ; Letter to display
        mov cx, 3    ; Times to print it
        mov bh, 0    ; Page number
    

Makes enough sense, just mov instructions: ah refers to the top 8 bits of
register ax and al refers to the bottom 8 bits. I see that register bh is
zeroed, but what about bl? Why not xor bx, bx or something like that? Don't
instructions normally go in a section named .text? How does the BIOS know
where to find this code and begin execution in the first place?

...

    
    
        TIMES 510 - ($ - $$) db 0    ;fill the rest of sector with 0
    

Wtf is that? What's a sector? What's the syntax here? Are we repeating this 510
times, or 510 minus some value? The db command is for writing bytes, but where
are the bytes written to?

    
    
        DW 0xAA55          ; add boot signature at the end of bootloader
    

Where are these bytes written? Why is DW capitalized here when db was left in
lowercase before?

~~~
wruza
ORG makes the assembler compute all addresses that follow as if the code were
loaded at the given offset. That means the external loading code is expected to
put the binary there; otherwise all non-relative instructions will load/store
at the wrong locations.

BL is not set because it is irrelevant to this call. How you clear/save a
register is up to you.
[https://en.wikipedia.org/wiki/INT_10H](https://en.wikipedia.org/wiki/INT_10H)

.text is a section of an executable format, but what you've got here is a boot
loader. It doesn't have sections; it's just the first 512-byte sector of the
boot disk, which the BIOS loads at 7C00h and jumps straight into.

$ is the "current" address being assembled and $$ is the start of the current
section, so $ - $$ is the number of bytes emitted so far. The expression turns
into e.g. TIMES 410 DB 0 if 100 bytes have been used so far.
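
Annotated, the padding from the article works out like this (a sketch):

    ; if the code above took 100 bytes, $ - $$ = 100, so this emits
    ; 410 zero bytes, bringing the total to 510...
    times 510 - ($ - $$) db 0
    dw 0xAA55    ; ...and the 2-byte signature makes exactly 512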

AA55 is the boot signature, something like a canary value: the BIOS checks the
last two bytes of the sector for it before it will treat the sector as
bootable.

Assemblers were mostly case-insensitive, like Pascal and BASIC. Case is up to
you, if you have any preference.

------
partycoder
Some free quality resources, ready to use:

\- Intel CPU programming guide
[https://www.intel.com/content/dam/www/public/us/en/documents...](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-
software-developer-instruction-set-reference-manual-325383.pdf)

\- Calling conventions
[https://en.wikipedia.org/wiki/X86_calling_conventions](https://en.wikipedia.org/wiki/X86_calling_conventions)
[https://en.wikibooks.org/wiki/X86_Disassembly/Calling_Conven...](https://en.wikibooks.org/wiki/X86_Disassembly/Calling_Conventions)

\- Web compiler output viewer [https://godbolt.org/](https://godbolt.org/)

\- Operation code (opcode) reference
[http://ref.x86asm.net/](http://ref.x86asm.net/)

\- Disassembler/debugger [https://github.com/eteran/edb-
debugger](https://github.com/eteran/edb-debugger),
[https://x64dbg.com](https://x64dbg.com)

\- Disassembler/Reverse engineering tools
[http://hte.sourceforge.net/](http://hte.sourceforge.net/)
[https://github.com/radareorg/cutter](https://github.com/radareorg/cutter)

And this:
[https://github.com/codilime/veles](https://github.com/codilime/veles) (binary
analysis tool)

~~~
zshrdlu
Probably also need some syscall references:
[https://fresh.flatassembler.net/lscr/](https://fresh.flatassembler.net/lscr/)

~~~
partycoder
Neatly organized. The "man" command also has documentation on syscalls.

A good introduction to syscalls, with some explanation is "The Linux
Programming Interface", [http://man7.org/tlpi/](http://man7.org/tlpi/) (book)

------
glangdale
Anyone who really wants to dig into this, and is willing to be a bit scared,
should review the materials for the 15-410 operating systems course at
Carnegie Mellon University. The fall edition of the page isn't up yet but
there are probably some older versions kicking around.

If you are at CMU you should definitely take this course, as long as your pain
tolerance is high.

This course teaches students to write a preemptive operating systems kernel
from scratch. It is _quite_ an experience (I TA'd it after finishing my PhD
while trying to figure out what to do next).

The kernels the students wrote used to be able to boot on standard (but old)
PC hardware. Sadly the modern USB stack is so complex that a conformant driver
for talking to a USB keyboard is pretty much as complex as the students' whole
project (but less educational). So non-legacy hardware that lacks a PS/2
keyboard no longer has this nice easy path to read/write stuff to console
without a lot of setup.

------
dcomp
What would be more interesting is the UEFI boot process. I don't think any new
computer comes booting into 16bit real mode anymore. I would love to see a
"Write your own UEFI bootable kernel" I'm sure going straight to a semi sane
32bit environment is much easier to deal with than starting at 16 and working
your way up.

~~~
kccqzy
> I don't think any new computer comes booting into 16bit real mode anymore

It most certainly does. See the chapter on 8086 emulation in the Intel
Software Developer's Manual, Volume 3.

------
mmjaa
Sometimes it's necessary to return to the past to remember the things we've
abandoned in the rush to modernity.

I'm speaking, of course, of the wonderful Prince of Persia and its delights.
So many treasures to be discovered for anyone interested in even a little bit
of assembly-language programming.. it definitely sharpens my chops, anyway:

[http://fabiensanglard.net/prince_of_persia/index.php](http://fabiensanglard.net/prince_of_persia/index.php)

PRINCE OF PERSIA CODE REVIEW.

------
secure
I love the built-in demos!

------
AnnoyingSwede
The key to x86 assembler/machine code was avoiding the software interrupts, as
they were painfully slow. Sure, when you are dealing with a 512-byte boot
sector you are better off offloading as much code as you can to software
interrupts, but for everything else the battle was to come up with something
faster than the software interrupts.

In many cases some basic routines used in QBasic were actually faster than
their interrupt-based counterparts in asm. A grand example of this would be
using mode 13h (320x200) and interrupt 10h to set all pixels on the screen to
a single color, which could take up to 2 seconds using machine code and BIOS
interrupts (as it verifies vertical and horizontal refresh prior to setting
each pixel). Using interrupts, however, is relatively pain-free, as the author
pointed out.
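
The fast path everyone ended up on was skipping int 10h for pixel work and
writing straight into the VGA framebuffer at A000:0000. A rough real-mode
sketch (untested):

    mov ax, 0x0013   ; int 10h just once, to switch to mode 13h (320x200x256)
    int 0x10
    mov ax, 0xA000
    mov es, ax       ; ES -> VGA framebuffer segment
    xor di, di
    mov al, 15       ; colour index to fill with
    mov cx, 320*200
    rep stosb        ; one instruction fills the whole screen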

------
nsxwolf
It's hard to find a tutorial that really tells you what is going on. You have
to start with whatever operating system you're on, then figure out what the
heck you're reading. What is x86 and what is some reserved word peculiar to
the assembler you're using?

Here's macOS Hello, world code I see floating around a lot, and I can't make
heads or tails of it:

    
    
      global start
    
      section .text
      start:
          push    dword msg.len
          push    dword msg
          push    dword 1
          mov     eax, 4
          sub     esp, 4
          int     0x80
          add     esp, 16
    
          push    dword 0
          mov     eax, 1
          sub     esp, 12
          int     0x80
    
      section .data
    
      msg:    db      "Hello, world!", 10
      .len:   equ     $ - msg

~~~
pjc50
Basically, everything left-aligned is for the assembler, and the indented
material is mnemonics from the x86 instruction set.

The code breaks down as:

    
    
          push    dword msg.len
          push    dword msg
          push    dword 1
    

Place a pointer to the message, its length, and 1 on the stack. Each of these
decrements esp by 4.

    
    
          mov     eax, 4
    

We'll be doing syscall 4 in a sec

    
    
          sub     esp, 4
    

Skip 4 bytes where a return address would normally sit; the BSD int 0x80
convention expects the arguments to start at esp+4.

    
    
          int     0x80
    

Do the syscall! See
[https://opensource.apple.com/source/xnu/xnu-2782.20.48/bsd/k...](https://opensource.apple.com/source/xnu/xnu-2782.20.48/bsd/kern/syscalls.master)
; this is 'write'

    
    
          add     esp, 16
    

Discard the padding and the three arguments from the stack (4 + 12 = 16 bytes);
the return value itself comes back in eax.

    
    
          push    dword 0
          mov     eax, 1
          sub     esp, 12
          int     0x80
    

Similarly set up and do syscall 1 to exit with status 0.

~~~
nsxwolf
But then there's that "msg.len", which I assume is some sort of assembler
shortcut for getting the length of an array and not part of the x86.

~~~
pjc50
That's referring to the _label_ msg.len, which is a value computed at compile
time in the 'data' section at the bottom.

    
    
      msg:    db      "Hello, world!", 10
      .len:   equ     $ - msg
    

'msg' is a label, which will be assigned to an _address_ at compile time. 'db',
short for 'declare bytes', puts some bytes at that address. '.len' is then
defined as the current output address ($) minus the start of the string.

~~~
e12e
In addition to all these excellent comments, it can also be instructive to go
"backwards" from C, here on Linux:

cat hello.c

    
    
      #include <unistd.h>
    
      char str[] = "Hello, World!\n";
    
      int main() {
        write(1, str, sizeof(str)-1);
        return 0;
      }
    

cat hello.s

    
    
      .file	"hello.c"
      	.intel_syntax noprefix
      	.globl	str
      	.data
      	.align 8
      	.type	str, @object
      	.size	str, 15
      str:
      	.string	"Hello, World!\n"
      	.text
      	.globl	main
      	.type	main, @function
      main:
      .LFB0:
      	.cfi_startproc
      	push	rbp
      	.cfi_def_cfa_offset 16
      	.cfi_offset 6, -16
      	mov	rbp, rsp
      	.cfi_def_cfa_register 6
      	mov	edx, 14
      	lea	rsi, str[rip]
      	mov	edi, 1
      	call	write@PLT
      	mov	eax, 0
      	pop	rbp
      	.cfi_def_cfa 7, 8
      	ret
      	.cfi_endproc
      .LFE0:
      	.size	main, .-main
      	.ident	"GCC: (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0"
      	.section	.note.GNU-stack,"",@progbits
    

Note the line "lea rsi, str[rip]", which I believe means something along the
lines of: load effective address (lea), into the rsi register, of the "str"
symbol's offset relative to the instruction pointer (rip, to allow for relative
addressing).

You can compile and run with:

gcc -std=c11 hello.c && ./a.out

Produce hello.s from hello.c with: gcc -std=c11 hello.c -S -masm=intel

~~~
cesarb
"lea" is a fun instruction.

Going piece by piece: on x86, each register has several sizes. For instance,
rax is a 64-bit register, with its lower 32-bit half being eax, its lower
16-bit half being ax, and its lower 8-bit half being al (for historical
reasons). So when a register name starts with "r", it's the 64-bit variant.
Therefore, "rip" is the 64-bit address of the next instruction.

The instruction "mov rsi, str[rip]" would get the 64-bit address of the next
instruction, add to it a fixed offset (which the assembler and linker compute
for you, as the exact offset you need to get to the data at the "str" label),
load 64 bits from that address, and put the result in the "rsi" register. And
that's not even the most complex addressing mode; you can get a register, add
to it another register multiplied by 2, 4, or 8, add to it a constant, and use
it as the memory address to load, store, or even modify in-place.

The "lea" instruction (load effective address) is a way to get directly at the
power of that complex address calculation logic for your own uses. Instead of
using the computed memory address to get at the memory, it's the memory
address that's put in the register. Therefore, where "mov rsi, str[rip]" would
read from memory, "lea rsi, str[rip]" would put in rsi the memory address the
"mov" would have read from.

This also allows for a few tricks. For instance, you can use "lea" to multiply
a number by five, without using the multiplier: just use a register and the
same register scaled by 4 (and you can also add a constant to the result,
still in a single complex instruction).
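
Concretely, that multiply-by-five trick looks something like this (a sketch;
register choice is arbitrary):

    lea rax, [rdi + rdi*4]        ; rax = rdi * 5, no multiplier, flags untouched
    lea rax, [rdi + rsi*8 + 12]   ; base + index*8 + constant, all in one instruction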

------
CrankyCanuck
Back in the late 80's, when I was in university, we had a second-year course
where we had to do x86 assembly on PC XT or AT clones. Assignment #1 was
messing with keyboard interrupts...easy peasy. Assignment #2? Write a video
game...basically we had to do the "snake game" with a never ending line you
could steer as it grew. Anyhow, I have to admit pretty much the whole class
figured it out...assembly was not nearly as insane as we thought. By the end
of that course, I was almost as comfy writing x86 assembly as Pascal. Fun
times.

------
adricnet
Thanks for the share, this looks great!

It aligns quite well with the study (in diagrams and commented assembly) of
x86 in POC || GTFO starting in pocorgtfo04.pdf chapter 3 provided by Shikhin
Sethi.

------
chaotic_clanger
It seems to me that x86 assembly is the least readable kind of assembly. I
always found the PowerPC instruction set more thought through.

------
bogomipz
Tangential question - does anyone know what model IBM that is?

~~~
wglb
It appears to be the original IBM-PC.

~~~
AnIdiotOnTheNet
Yes, the IBM 5150 PC. In fact, the image is the one used on the Wikipedia page
for "Desktop Computer".

~~~
bogomipz
Ah yes, the 5150. Thank you.

------
kowdermeister
I wasn't really familiar with WebAssembly, but when I saw this source for
Fibonacci I immediately recognized familiar features, which can't be said of
the rest of the assembly implementations:

[https://github.com/nebulet/nebulet/blob/master/wasm/fibonacc...](https://github.com/nebulet/nebulet/blob/master/wasm/fibonacci.wat)

[https://github.com/Hanks10100/wasm-
examples/blob/master/simp...](https://github.com/Hanks10100/wasm-
examples/blob/master/simple/math.wast)

