I think that playing around with it for 2 hours will teach you more than most classes on the topic. It really drives home why interactivity is such a big deal in education.
You should also try writing a script for counting instructions in binaries. It's pretty illuminating. Here are some sample statistics
You'll see the various ways the compiler avoids emitting an actual multiplication instruction. The smallest multiplier I found that actually produced an imul: 46.
GCC produces an lea that multiplies by 5 into %rax for the return value, then another lea that multiplies that by 2 and adds the original.
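Roughly (a sketch of a hypothetical case, not the exact output described above), multiplying by 11 comes out as two lea instructions and no imul:

    # long mul11(long x) { return x * 11; }   -- made-up example
    mul11:
        leaq (%rdi,%rdi,4), %rax    # rax = x + 4*x = 5*x
        leaq (%rdi,%rax,2), %rax    # rax = x + 2*(5*x) = 11*x
        ret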
As I step through an assembly language program, it gives me instant visual feedback of the contents of each register, flag, and memory.
I'm really surprised that there's nothing remotely like this on the web yet, and that I have to resort to running dosbox or freedos to have access to this super useful tool.
 - http://www.oocities.org/siliconvalley/office/6208/
I also just created a .tar.xz file of my own copy of the tutorial and uploaded it to:
I also did a base64 encoding of the same .tar.xz file and pasted it here:
The sha256 checksum of the .tar.xz file is:
Please note that this is not an official version of this tutorial. It's just what I had lying around on my disk after who knows how many years. Use at own risk.
If you want an official copy, you could try contacting the author at firstname.lastname@example.org
I love it even more for using Intel x86 assembly syntax, instead of the horrible AT&T syntax.
As an aside, I was looking at the header code generated by gcc to handle the initial function call. What's the convention on how we assign parameters to registers? I'm trying stuff out, and the first parameter always seems to be edi, then esi, eax. Except when I change the types, and it turns into xmm0 and the like. And the result returned is eax too, except when it's a float, then it's xmm0. How do things know where to look?
The compiler generates code that uses the correct register. So the compiler picks a register into which it will put the result and then generates code after the calling location that gets the result from the correct register.
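As a minimal sketch of the convention at work here (SysV AMD64, which Linux/gcc follow; add2 and caller are made-up names):

    # int add2(int a, int b) { return a + b; }
    add2:
        leal (%rdi,%rsi), %eax    # a arrives in %edi, b in %esi; the result goes in %eax
        ret
    caller:
        movl $1, %edi             # first integer argument
        movl $2, %esi             # second integer argument
        call add2
        # the caller now reads the sum from %eax; a float result would come back in %xmm0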
And yeah, there are quite a few surprises. E.g. I found out that gcc is smart enough to perform tail call optimization: https://godbolt.org/g/MZDmwP
Also that example, heh. I tried to go back through gcc versions to see if there was a case where it didn't do TCO - nope. Also, I like how returning 0 is "xor eax, eax".
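For reference, a tail call shows up as a jmp where you'd expect call/ret. A rough sketch (made-up wrapper, not the godbolt example itself):

    # int wrapper(int x) { return helper(x + 1); }
    wrapper:
        addl $1, %edi     # adjust the argument in place
        jmp  helper       # tail call: helper's ret returns straight to wrapper's caller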
Is this the same thing as, or related to, the calling conventions that used to be used in Microsoft DOS and Windows native language / C apps some years ago - things like "long far pascal" as function declaration /definition qualifiers, and things like the fact that C pushes function arguments onto the stack (before the call) from left to right, and Pascal does it from right to left (or the other way around)? (Did some of that stuff, but it's been a while).
I did read the surrounding comments to this one, and saw that some of the topic was about registers, not the stack.
So the choice of registers cannot be arbitrary, unless the compiler knows the function is only used within an object file.
The registers are predetermined by a convention, unless you use the 'static' keyword to signal that the function is only used internally to a module, in which case the compiler has complete freedom to choose registers.
By using information kept with the function, or perhaps even encoded into the function name itself (as already happens when distinguishing between different calling conventions, or in the case of C++ name mangling)?
Coming from an Asm background, where there basically is no one "calling convention", and programmers would document which registers (almost always registers, rarely the stack --- and that can make for some great efficiency gains) are for what, I've always wondered why that idea didn't seem to go far.
How would you do that with dynamically linked code, inspect functions you're calling at runtime before laying out your arguments?
> perhaps even encoded into the function name itself
That would mean name mangling in C and assembly.
> Coming from an Asm background, where there basically is no one "calling convention"
Right, because you can lay out memory however you want since you're at the assembly level. Higher-level code (C and up) can't do that, so instead you've got standard calling conventions for inter-library calls (inside a compilation unit, the compiler is free to use alternate calling conventions since it has complete control over both sides of the call site; that's also how it can inline calls entirely).
> programmers would document which registers (almost always registers, rarely the stack --- and that can make for some great efficiency gains)
Some standard CCs (though not the old CDECL) also use registers as far as they can, depending on the arch. The SystemV AMD64 ABI uses 6 x86_64 registers for integer/pointer arguments and 8 SSE registers for FP arguments, with the rest on the stack.
(And realistically, for all but the most trivial functions, having one convention is probably a highly reasonable default. Trivial functions should probably be made available to the compiler for inlining anyway. Note also that GCC allows you to override the number of arguments passed on the stack via an annotation on 32-bit x86, if you insist.)
Ah, interesting, I figured that the generated object files would just store some metadata on that basically.
Why is it so different with different optimisation levels? The default emits quite a bit of code, -O1 is `mov eax, 0`
Often you would not want the flags updated when simply clearing a register - you're hardly likely to test the Zero flag having just set something to zero, because it's obvious, and more importantly you may want to set something to zero while preserving the flags from a previous operation.
But often you don't care about the flags so you can use the slightly shorter and/or faster XOR operation. It used to generally be shorter and faster because the MOV instruction had the initial step of an immediate load of the zero from memory.
And that's why it changes with different optimisation levels - the compiler knows when the flags need to be preserved, and if they don't it can get away with using XOR.
> -O1 is `mov eax, 0`
Simply because it is shorter. On x86-64 (and x86-32):
xor eax, eax  ->  31h C0h or 33h C0h (2 bytes)
mov eax, 0    ->  B8h 00h 00h 00h 00h (5 bytes)
Having privately analyzed some 256b demos, I cannot imagine how one could even come up with the idea of using `mov r32, imm32` for zeroing a register (except that people don't want to understand how the assembly code is internally encoded) - the canonical way is `xor` (`sub` also works in principle, but `xor` is the way recommended by Intel).
EDIT: Here is an article about that topic: https://randomascii.wordpress.com/2012/12/29/the-surprising-...
If the condition flags have to be preserved, you are right. But otherwise, read the linked article (https://randomascii.wordpress.com/2012/12/29/the-surprising-...):
"On Sandybridge this gets even better. The register renamer detects certain instructions (xor reg, reg and sub reg, reg and various others) that always zero a register. In addition to realizing that these instructions do not really have data dependencies, the register renamer also knows how to execute these instructions – it can zero the registers itself. It doesn’t even bother sending the instructions to the execution engine, meaning that these instructions use zero execution resources, and have zero latency! See section 220.127.116.11 of Intel’s optimization manual where it talks about dependency breaking idioms. It turns out that the only thing faster than executing an instruction is not executing it."
A proficient human coder, on the other hand, writes assembler that is partly optimized by default.
But few humans could write code like a seriously optimizing compiler, esp. on modern pipelined architectures - that stuff is unintelligible. Which is as it should be, because modern processors are not designed to be programmed directly by humans.
I'd say that's better attributed to someone many, many years ago deciding they would not follow the official Intel syntax (for what reason I do not know), and somehow convincing the rest of the community to follow them. That's actually one of the things that could make for a very interesting article: how one processor family got two different and incompatible Asm syntaxes. The fact that the mnemonics and syntax don't correspond to those found in the manufacturer's datasheets and manuals just increases the barrier to understanding. As far as I know, the same didn't happen to ARM, MIPS, SPARC, and the others. Especially when the sense of the comparisons/conditional jumps is reversed, and some of the more advanced addressing modes look less-than-obvious, it's hard to imagine why anyone would adopt such a syntax:
Note that the GNU tools have option to use Intel syntax too, so you can avoid some of the confusion (in the DOS/Windows and embedded world at least until recently, Intel syntax is overwhelmingly the norm.)
For someone who grew up on normal processors (MC68000 and UltraSPARC) AT&T syntax is the best thing since sliced bread: it's perfectly logical to move something to somewhere, instead of "move to somewhere something".
cmp eax, 5 ; eax - 5
jg morethan5 ; eax > 5 ? then jump.
sub eax, ecx ; eax = eax - ecx
sub A, B ; B = B - A ??
And it's all arbitrary anyway. Some people might prefer a [src, dest] ordering, but it's not inherently any more natural than [dest, src]. Look at variable assignments: "x = y" in almost any programming language will assign y to x.
That is confusing as all hell to me: if I compare x to 5, and 5 to x, it's still the same comparison, so what difference does it make?
Anyway, on Motorola 68000 it would look like so, assuming data was in data register 0 (there are eight general purpose data registers, and eight general purpose address registers):
cmp.l #5, d0 ; d0 is unchanged by the comparison
; subtract the value of d1 from d0, and store the result
; in d0.
sub.l d1, d0
cmp.l #5, d0
The jump if greater/lesser/equal/greater-or-equal/etc instructions are all defined in terms of checking the sign, zero, carry and overflow flags. For instance, jz and je are the same instruction. You just use one or the other if it makes more sense in context (I used to write asm by hand for the art of 4096 byte demos).
I used Intel notation, because that's what the Asphyxia tutorials and Turbo Pascal used, so it's what I'm used to. I don't have much of an opinion about the order, except that it was an idiotic decision to swap it; either way, the confusion that caused outweighs whichever one would be more theoretically "right". All the sigils and pre/postfixes seem messy though.
I haven't done assembly code for a while, and was not an expert at it earlier, so guessing, but:
it may be because of what flags in the flags register (if there is one nowadays) get set - they could be different for the two versions of your comparison.
Are you sure? That was my whole point - that it may not be that way. As I said, it's been a while, but it seems to me that the bits that get set in the flags/status register on comparing A to B should be, in some sense at least, the opposite (maybe not for all the bits) of what would get set on comparing B to A. I thought the comparison would be done by subtracting A from B or B from A, and then setting (some of) those flag bits based on which was greater or equal. If that is so, comparing A to B will not leave the same result in the register as comparing B to A. And the reason I think so is that there are assembly instructions like JGE (Jump if Greater or Equal), JE (Jump if Equal), JNE (Jump if Not Equal), etc. - the meaning of those instructions, and hence the resulting action (jump or not jump), would change depending on whether the flags were set by comparing A to B or B to A.
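That intuition is right. A small Intel-syntax sketch of why the operand order matters:

    cmp eax, ebx    ; flags are set from eax - ebx
    jg  a_bigger    ; taken if eax > ebx

    cmp ebx, eax    ; flags are set from ebx - eax, the "opposite" sense
    jl  a_bigger    ; the same condition now needs jl instead of jg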
On any given processor, in this context, there is one and only one way to compare an immediate value with one in a register, so you don't have to worry about whether you're comparing 5 to %eax, or eax to 5: you can't subtract the value in %eax from 5, because 5 is an immediate value, not a memory location.
.align 4, 0x90
movl $17, %eax
cmpl $5, %eax
movl $5, %eax
> cc cmp.s -o cmp
> ./cmp; echo $?
.align 4, 0x90
movl $17, %eax
cmpl %eax, $5
movl $5, %eax
> cc cmp.s -o cmp
cmp.s:6:13: error: invalid operand for instruction
cmpl %eax, $5
Also, "move destination, source(s)" is consistent regardless how many sources there are (although I agree with you that "source -> destination" is more intuitive for us left-to-right people).
paddb - Adds 16 8-bit integers.
paddw - Adds 8 16-bit integers.
paddd - Adds 4 32-bit integers.
paddq - Adds 2 64-bit integers.
paddb mm0, mm1
paddb xmm0, xmm1
First convince yourself (for example by looking at page 26-39) that in the individual tables the column "Solaris Mnemonic" stands for the AT&T syntax and the column "Intel/AMD Mnemonic" stands for the Intel syntax.
Now look at page 48. Surprise: It is paddb, paddw, paddd, paddq both in Intel and AT&T syntax.
In Go's case it's because they wanted to have the syntax reflect the MachineInstr-like abstraction they have in place. I wouldn't be surprised if something similar was responsible for the AT&T syntax as well.
(In case it isn't clear, my preference is to do what LLVM does and to treat MachineInstr and machine code as distinct objects, and to use the vendor syntax for the human readable representation of the latter.)
To quote "The details vary with architecture, and we apologize for the imprecision; the situation is not well-defined." 
Since as had some issues with the Intel syntax I was using, I decided not to use its Intel-syntax support and converted the code to AT&T.
Never again, it is just plain ugly.
Compared with DOS/Windows Macro Assemblers (TASM, MASM, NASM, ...), AT&T ones are just plain prehistoric.
Do you know the reason for this?
(please don't upvote if you agree, i'm at 69 points :)
That would imply that the Intel syntax is not LL(1) or LR(1). Are you sure?
Break your programs into basic blocks! Reverse engineers never read assembly in a straight line. Instead, they read the control flow graphs of subroutines, which is the graph where nodes are runs of instructions ending in jumps and branches, and the edges are the jump targets. I hope this doesn't sound complicated, because it isn't: it's literally just what I wrote in this paragraph. It takes about 15 minutes for most platforms to learn enough to recover CFGs from subroutines by hand.
To get a decent understanding of what a chunk of assembly code is doing, all you really need is:
* The code broken into subroutines (this is usually your starting point) and then CFGs (good disassemblers do this for you, but it's easy to do by hand as a first pass)
* The CALLs (CALLs don't end basic blocks!)
* The platform's calling convention (how are arguments passed and return values returned from subroutines)
There are two tricks to reading large amounts of assembly:
1. Most of the code does not matter, and you won't be much better off for painstakingly grokking it.
2. Virtually all the assembly you'll see is produced by compilers, and compilers spit out assembly in patterns. Like the dude in The Matrix, after an hour or so of reading the CFGs of programs from a particular compiler, you'll stop having to read all the instructions and start seeing the "ifs" and "whiles" and variable assignments.
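As a sketch (a hypothetical function, roughly what a compiler emits for a simple if/else), the basic blocks are just the runs of instructions between labels and jumps:

    f:                      # block 1: entry, ends at the conditional branch
        cmpl $5, %edi
        jle  .Lelse         # edge to block 3; fall-through edge to block 2
        movl $1, %eax       # block 2: the "then" side
        jmp  .Ldone         # edge to block 4
    .Lelse:
        movl $2, %eax       # block 3: the "else" side, falls through
    .Ldone:
        ret                 # block 4: the join point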
Anyway, it shows three different encodings for the ADD instruction. The first:
 If the current code segment is designated as a 16-bit segment, then the w means 16 bits, unless a size override byte (an opcode prefix byte) is present, in which case it means 32-bits. If the current code segment is designated as a 32-bit segment, then the w means 32 bits, again unless a size override byte is present, in which case it means 16-bits.
 It seems to me that if w=1, then the s bit is extraneous and thus could be used to encode other instructions. I'm not sure if that is the case but it's common to use otherwise nonsensical instruction encoding to do something useful.
Opcode 82h is an alias for 80h --- it presumably sign-extends the immediate value into an internal temporary register, but the upper bits don't matter anyway since it's an 8-bit add. Some interesting discussion on that here, along with an example application:
The link shows nine ways to use the ADD instruction with each method resulting in a different opcode.
I.e., because of things like addressing modes, different invocations of an ADD instruction can map to different machine instructions. But one ADD invocation will always map to one machine instruction.
The parent comment sounded to me like one assembly instruction could map to several machine instructions, like one line of C is equivalent to several lines of assembly. Just wanted to clarify that that isn't the case.
That's not true on x86-16, x86-32 and x86-64. For example
"The assembler automatically embeds a "fingerprint" into the generated code through a particular choice of functionally equivalent instruction encodings. This makes it possible to tell if code was assembled with A86, and also to distinguish between registered and unregistered versions of the assembler, although access to the source code is required."
Said another way, when you write:
MOV eax, 5
This will map to _either_:
110111 _or_ 110110, but _not_ both in sequence.
(Almost wish that last link was cut off one letter earlier...)
Seeing the confusion and clarifications in the replies to your comment, I think it may have been more clear if you had said (and you probably meant):
"Can you point to a source for this? All x86 assemblers that I know of map one assembly instruction to one out of a set of machine instructions (where the chosen machine instruction depends on things like the addressing mode (immediate, indexed, indirect indexed, etc. - I'm using older terms for addressing mode, not sure if they are valid now with newer processors. but the concept is the same).
Or go learn Z80, x86's weird, 8-bit cousin (it had a 16-bit version, but it sold poorly), which had a greater emphasis on backwards compatibility (you can run code from the original 8080 on a Z80, unchanged), and is nicer to work with (because it wasn't extended in unanticipated directions far beyond its original capabilities, while keeping fetishistic backwards compatibility by stacking hack on top of hack on top of hack. It also didn't have memory segmentation, otherwise known as The Worst Thing.)
There are only two common reasons to learn Z80 assembler, though: to program the Gameboy (which runs on a modified Z80 with all the cool instructions removed), and to program a TI calculator, thus making all highschoolers in your area immensely happy.
TI calculators are a comically overpriced scam, that have only survived because of the College Board, but that's another story.
If you're ever in a situation where you need to read assembly, it's typically not up to you what ISA it'll be in. There are two situations where I've had to read assembly: either reverse-engineering a compiled binary or trying to understand the compiler output for a small piece of a program I'm working on in a higher-level language.
Besides, you don't have to worry about the worst of x86 in those cases.
I'm pretty sure everyone has to learn at least that much x86 at some point, like it or not.
But I'm only 15. I still have time.
Actually, it was more along the lines of that being a thing I should do at some point.
It's a good skill to have, and since not everyone has it you will often be able to solve problems that no one else can.
x86 is clearly not a beautiful ISA, but it is not as black as it is painted. The first thing that one should understand is that large parts of the encoding of the instructions make a lot more sense once one writes them down in octal instead of hexadecimal (something even the Intel people who wrote the reference manual seem to have missed):
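A small worked example of the octal view (hand-decoded, so double-check before relying on it):

    add ecx, eax    ; assembles to the two bytes 01 C1
                    ; opcode 0x01 = octal 001: the ADD r/m, reg row of the opcode map
                    ; ModR/M 0xC1 = octal 301: mod=3 (register direct),
                    ;   reg=0 (eax, the source), r/m=1 (ecx, the destination)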
Like POSIX, MS-Windows, X11, WinAPI, C++, HTML5, ... ;-)
Although POSIX and HTML5 hold up a bit better than Win32 and C++, IMHO. Especially POSIX (it's not great, but it works pretty well).
But I've heard X11 is absolutely miserable, so the Windows folks don't have a monopoly on satanically evil APIs with religious backwards-compatibility.
But fair enough.
Maybe I do need to do more research. Maybe I need to try new things.
I like to think I don't senselessly parrot, but that doesn't mean that I'm right.
Also reading the hard-won experience of others is much more efficient than trying to get it yourself. Books are wonderful. With enough books you can advance beyond the authors without having to tread the same paths, you parrot their findings as a base for your own. Or another case, you can at least be aware of common pitfalls your senior coworkers are constantly falling into because of an aversion to reading. Why do you know the pitfall is there? You're just parroting back what a book said. That doesn't make it untrue, or not useful to know, or not useful to share with other people, or even not useful to bring up to show there's a shared context.
I coded my first assembler program (a link relocation routine) when I was 13, so age is pretty meaningless in the context of assembler coding.
As forums go, HN is far from the worst. But it could be better.
Maybe I'm crazy, and I'm definitely inexperienced. So maybe take my opinion with a grain of salt?
But yes, so far as I've seen, trying to understand x86 in its present state is a painful experience.
The examples in the post are clearly x86-64 running in 64 bit mode. I.e. it's running in flat model... there is no segmentation to worry about.
Under Windows gs (under x86-32 fs) points to the Thread Information Block (TIB):
Under Linux fs (under x86-32 gs) is used for thread-local storage (TLS).
But "as a typical programmer" indeed you only have to worry about the internal details of this if you are an OS developer, otherwise you can simply use the appropriate segment override prefix and otherwise don't care about it.
Stupid outdated resources...
It depends if you're compiling for ARM or Thumb. The main rules are:
ARM and 32-bit Thumb instructions: 12 bits for arithmetic data processing instructions and load/store, 8 bits with even rotate right for bitwise data processing instructions + MOV and MVN
16-bit Thumb: 3 bits for arithmetic instructions with Rd != Rn, 5 bits for shifts (so the whole range is covered) and load / store (but shifted left by the size of the data, i.e. the maximum offset for this version of LDR is (0x1F << 2) = 124), 8 bits for arithmetic instructions with Rd == Rn, SP-relative loads / stores and literal loads.
I doubt x86's popularity has much to do with its instruction set at this point. In particular, the variable length instructions are a pain for decoders (both hardware and software).
Also, the whole set of exchange-register-with-itself instructions was defined but never used. E.g. xchg ax,ax, which does nothing, in one byte. In fact that one was considered useful: it's used as the 'no-op' instruction (0x90), right? But what about xchg bx,bx, xchg cx,cx and so on? Just wasted single-byte opcodes, leaving actually common instructions to use longer byte sequences.
So maybe an executable should begin with an opcode-decode-table that is loaded with the code, that tells the hardware what byte sequences mean what instructions. So each executable code can be essentially compressed, using optimum coding for exactly the instructions that code uses most often. Just thinking out loud.
Luckily in x86-16
xchg eax, eax
> But what about xchg bx,bx, xchg cx,cx and so on?
This (cleverly?) cannot be encoded in one byte. Here you have to use at least two bytes (0x87 followed by the ModR/M byte in x86-16; the same holds for their 32 bit counterparts xchg ebx, ebx; xchg ecx, ecx etc. in x86-32):
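For instance (bytes hand-assembled, as a sketch):

    xchg bx, bx    ; 87 DB in x86-16
    xchg cx, cx    ; 87 C9
    xchg ax, ax    ; 90 -- the one-byte form only exists when ax (or eax) is one operand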
> So maybe an executable should begin with an opcode-decode-table that is loaded with the code, that tells the hardware what byte sequences mean what instructions.
The engineers of the Transmeta Crusoe/Efficion processor tried something similar:
"Crusoe was notable for its method of achieving x86 compatibility. Instead of the instruction set architecture being implemented in hardware, or translated by specialized hardware, the Crusoe runs a software abstraction layer, or a virtual machine, known as the Code Morphing Software (CMS). The CMS translates machine code instructions received from programs into native instructions for the microprocessor. In this way, the Crusoe can emulate other instruction set architectures (ISAs).
This is used to allow the microprocessors to emulate the Intel x86 instruction set. In theory, it is possible for the CMS to be modified to emulate other ISAs."
Compilers just schedule destination registers carefully and never need to swap them around.
Actually, it can happen with certain instructions that need fixed register constraints (multiply, divide, string ops) --- I've encountered a few cases where, had the compiler known about the exchanges, it could've avoided using another register or spilling to memory. As far as I know, in modern x86 cores the reg-reg exchanges are handled in the register renamer, so they aren't slower than using an extra register and definitely faster than spilling to memory (which might happen anyway for something else if it needed the extra register.)
To wit, here is something no compiler (software) I know of can generate, even when given code that could generate it:
xchg eax, edx
add eax, edx
I thought it couldn't a second ago. I don't know why. Makes no sense to me now.
Here's a handy heuristic: if somebody claims to know every x86-64 instruction (or even every x86-32 instruction), you can be at least 90% sure they're lying.
I very much prefer ARM myself, but you can probably apply the same rule of thumb to it. There are upwards of 400 instructions both in AArch32 and AArch64, with a fair number of differences between the two.
Edit: I've also posted a breakdown of the immediate limitations for ARM in this thread. It's not that complicated when sticking to the standard instructions.
I should probably count up x86 against ARM, so I have more than guesses to go on here. Maybe that part of x86 actually better.
It is not easy to tell how many instruction there actually are on x86:
It is definitely, definitely not the ISA.
It is though.
What about the myriad of other (mostly vintage) computer systems and video game consoles out there? ;)
Sega Master System and Game Gear, for example.
As an aside: most of the old, classic 8-bit micros are complete pains to write modern code for, because modern programming languages all assume fast stacks. The Z80 has no stack-relative addressing, which means you need to reserve a precious index register as a frame pointer at the top of every function, and then indirect off that --- but the Z80 designers didn't realise that people would want to do it so often, and as a result it's verbose, dead slow, and doesn't handle 16-bit values. So you need to do:
ld h, [iy+8]
ld l, [iy+9]
The Game Boy processor (which doesn't have a snappy name) allows this:
ld hl, sp+8
ldi a, [hl] // load and increment
ld h, [hl]
ld l, a
If you look at the instruction encodings, the Z80's actually a pile of nasty hacks. The original 8080 is way more elegant; and there's lots of software and tooling for it, too. (But it still can't run C efficiently.)
>If you look at the instruction encodings, the Z80's actually a pile of nasty hacks. The original 8080 is way more elegant; and there's lots of software and tooling for it, too. (But it still can't run C efficiently.)
I don't know about what makes an instruction encoding elegant or inelegant, so can't help you there.
Yes, the 8080 is probably more elegant, but the extra features on the Z80 are incredibly useful (especially register exchange: The Z80 had two sets of registers, which you can exchange. No, Zachtronics didn't make that up: that was a real thing, on the Z80 at least). Also, the Z80 tooling is quite nice: asxxxx and WLA-DX are fine assemblers, and SDCC is a pretty good C compiler. It sure as heck beats cc65, in any case.
But yes, if you want to program your TRS-80 (but only the original: later ones were 6502), or your ZX* (How many of you lot know the ZX line? Spectrum? No?), or your Game Gear, or your Master System, or any of the various CP/M machines, you have to learn Z80.
The ZX Spectrum and its clones were very popular in the UK, Eastern Europe, and the former USSR.
I learned to program on the 6502 based VIC-20 using Lance Leventhal's "6502 Assembly Language Subroutines" as a guide by manually assembling and poking into memory. What a fun time.
And thanks for the book recommendation.
I can only assume you've never had to program a Burroughs B90 in assembly language.
After programming in it, I can assure you that programming in other assembly languages (including x86) is a breeze.
I should put my cheat-sheet on the web, if I can still find it.
There was also a B1900 built in Liège in Belgium, which was a 24 bit machine whose instruction set was designed to run virtual machines (i.e. interpreters). Those systems had a reputation for being slow. I don't know much about them.
The Liège plant closed around 1982 and the Cumbernauld plant closed around 1985.
Burroughs mainframes (B5000 onwards to A series) may be the ones you're thinking of. These are justifiably praised for being ahead of their time. They were high level stack based machines with 48 bit words + 3 tag bits, and programmed directly in an Algol 60 variant, with additional instructions to enable COBOL to execute efficiently. There was no assembly language needed.
Yes, Intel is really bad, especially for learning, and while ARM is certainly better, it's pretty esoteric, and also backwards (right to left) like Intel.
If you want a nice, orthogonal ISA to learn assembler on, MC68000 family is a song. The instructions are human readable, the processor is big endian, and the moves are src, dst. It's almost like a high level programming language.
The 6502's a bit less simple to learn, but I'd say it's worth it. It worked its way into many important computers, and is arguably one of the most emulated and most used processors in existence.
The Z80 only has a 16-bit address space.
My favorite CISC architecture is the 68000 series.
The 68k (and its 8-bit semi-cousin, the 6809) were very nice. However, unlike the Z80, the 6502, and x86, they're no longer being made, and are increasingly rare.
They're in the "legacy" line and labelled "not recommended for new design" but I think that's just design opinion and doesn't imply it's no longer made.
I'm sure this video has done the HN rounds before. It's a slow but fascinating watch. "Motorola 68000 Oral History Panel" from original Motorola team members. https://www.youtube.com/watch?v=UaHtGf4aRLs
But definitely Z80 over x86. x86 is pain.
MIPS and ARM are fun, though, AFAICT. As is 6502.
I think the video maker does a good job of mapping a simple C program to its disassembly.
This assertion comes up over and over again in the last 30 years. Every time I've had it asserted to me, it always came from non-assembler programmers, who always wrote in a high level language. I have yet to see evidence of optimizing compilers generating code even remotely close in efficiency to what we would code directly in assembler.
A coder would never write all that extra frame pointer setup code, nor would they waste encoding space and clock cycles shuffling values from one register to another. For example, a human might write the code from the article thus:
Add42: addb $42, %al
They can actually. Compiler optimizations have come a long way; even Java's JIT should be able to optimize that. (ok, not using the AL register)
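From memory (so treat it as a sketch), gcc -O2 turns an add42-style function into exactly that kind of thing, just via lea and the ABI registers:

    # int add42(int x) { return x + 42; }
    add42:
        leal 42(%rdi), %eax
        ret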
My personal story - I used to use exclusively assembler for 6502 and 8086 as it actually ran fast enough.
In the mid 90s I saw Delphi's code (and Delphi was not known for its optimizations), but it was able to use Pentium instruction pairing, which takes quite an effort to accomplish by hand.
While beating an old compiler was easy it was the time the compilers began making strides rivaling humans.
Still, hand-written inner loops in assembly might yield some performance (iirc, grep still relies on some), but overall there is a very limited number of settings where the difference would be significant enough to warrant the effort (incl. correctness and [micro]benchmarks).
But a human would almost never use some of those more complex instructions, for a very simple reason: they eat too many clock cycles. When one is coding in assembler, one usually targets two constraints:
1. the least amount of clock cycles needed to pull off an operation;
2. the least amount of bytes to encode the operation.
Where those two meet is where the best coders get unbelievable performance out of the hardware. At least that's the case in the demo scene, although many nowadays cheat by banging the GPUs in CUDA or OpenGL.
A human coding in assembler will know, at any given time, how many variables are in play, and will almost always manage to fit them all within the processor's registers. There were only two times in my life when I actually had more than eight variables within a subroutine and had to use the stack. Even then, on one occasion I didn't push everything, only as many registers as I was actually coming up short, and the rest I still stuffed in the available registers; the other time, I figured out a more efficient algorithm where I could fit everything within the seven general purpose address registers (a0 - a6, since a7 is the stack pointer). A human will also know whether the expected result is within a byte, word, longword, or quadword range, and will only use those instruction and register sizes; a compiler has no chance to figure that out. It's trivial for humans, but as far as I'm aware, impossible for a general compiler algorithm.
In fact, even the best optimizing compilers are so dumb, that one is not allowed to mix and match 8-, 16-, 32- or 64-bit code; one must either compile everything 32- or 64-bit (the linker won't let one link 32- and 64-bit object code together). A human could easily write correct assembler code using all of those instruction / register sizes at once, and we often do.
I have yet to see a compiler capable of inferring that. If you know of one, please show me the generated code. I'd love to use such a compiler.
moveq $52, %eax
You could purposely tell GCC to omit the frame pointer -fomit-frame-pointer, but then you just made the code extremely difficult to debug (and stack setup code will still be generated, as the compiler can't generate useful / functional code without it), so that's no solution either.
without explicit direction from humans, compilers can't generate code which calls subroutines or functions without using the stack at the very minimum, and that was my point.
And that's just GCC on intel; check out intel's compiler, it's even worse (click on "turn off intel syntax", and "compile to binary and disassemble the output"):
...and if you try other processors as well, it just gets worse and worse:
Intel syntax is much cleaner, in particular, Intel Ideal (as opposed to MASM), and specifically, FASM (flat assembler).
FASM makes it as clean as possible and turns writing assembly into a joy.
movl %fs:-10(%ebp), %eax ; AT&T
mov eax, dword ptr fs:[ebp-10] ; MASM
mov eax, [fs:ebp-10] ; FASM
Your example won't assemble because size can't be guessed, but this will:
mov dword [fs:ebp-10], 5
fs movs byte [edi], [esi]
movs byte [fs:edi], [esi]
movs byte [edi], [fs:esi]
Note that code can jump past some of the prefixes. The C library on Linux does this to bypass prefixes. Reasonable assembly syntax needs to be able to describe this. You need to be able to put a label right after a prefix.
fs rep gs fs ; Note: extraneous prefix
mov eax, [fs:ebx]
> Note that code can jump past some of the prefixes. The C library on Linux does this to bypass prefixes
What's the use case for this?
I wouldn't split the prefix from the instruction and would rather use label+1 where it's absolutely required.
Well, what isn't? I have a knack for languages but that beautiful beast seems like an almost impenetrable fortress of strangeness.
On the other hand, that encourages me to learn to read basic assembly.
"To write code that runs directly on your microprocessor you need to know how memory segmentation works"
Although you can't completely ignore segments, in practice at least on Linux the only segments in use are user code/data and kernel code/data segments.
Does anyone know why the author might suggest that understanding segmentation is necessary to write Assembly code?
When writing programs for user mode (ring 3 on x86), you hardly need to care (except for sometimes using segment override prefixes (cf. https://news.ycombinator.com/item?id=13052076), which pedantically is "dealing with the MMU", since it only works because of the MMU; but in my opinion it is not necessary to understand the technical details of why this works).
On the other hand, if you are an OS ("operating system", here I don't mean "open source") developer, you probably better know the details of the MMU.
Concerning https://news.ycombinator.com/item?id=13052892: I also consider the author's statement that one has to know how segmentation works misleading. Knowledge of segmentation is absolutely necessary for x86-16 (real mode), which many people tend to associate with assembly (because there seem to be many more assembly tutorials available for DOS/x86-16 than for x86-32 or even x86-64), but it is hardly relevant for people who just write user mode code.
It's not magic. The best way to learn assembly is to program in it. I learned on the Gameboy by getting a job and programming 2 games in it. Fun as hell, especially when the machine is small enough to really fit in your head and clock cycles count at 4Mhz.
No one particularly enjoyed working in raw ASM 100% of the time.
addl $42, %edi
add edi, 42
I was going to post this last night but figured it must have been posted already. I was wrong! Nice to see it on the front page.
I suppose this could work if, as you suggest, I manually called out to a libc function like printf instead.