I think that playing around with it for 2 hours will teach you more than most classes on the topic. It really drives home why interactivity is such a big deal in education.
One great example to try: add -O3 to the compiler options, then write a function returning its argument multiplied by 2, 3, 4, ...
You'll see the various ways the compiler avoids emitting an actual multiplication instruction. The smallest multiplier I found that actually produced an imul: 46.
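If you want to try that experiment, here's a minimal sketch (the function names are just illustrative; paste into the compiler explorer or run gcc -O3 -S yourself). The commented output is what gcc typically produces for x86-64, but it varies by compiler and version:

    int times2(int x)  { return x * 2;  }  /* typically an add or a single lea, no multiply   */
    int times9(int x)  { return x * 9;  }  /* typically lea eax, [rdi+rdi*8]                  */
    int times46(int x) { return x * 46; }  /* per the parent, the smallest to produce an imul */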
One of the most useful tools that helped me learn assembly was the Ketman Assembly Language Tutorial.[1]
As I step through an assembly language program, it gives me instant visual feedback of the contents of each register, flag, and memory.
I'm really surprised that there's nothing remotely like this on the web yet, and that I have to resort to running dosbox or freedos to have access to this super useful tool.
Please note that this is not an official version of this tutorial. It's just what I had lying around on my disk after who knows how many years. Use at own risk.
If you want an official copy, you could try contacting the author at btketman@btinternet.com
This is amazing. After a few minutes I learned so much already - try with an empty function that returns 0, then return an argument, then return an argument +1, argument^2, etc.
As an aside, I was looking at the header code generated by gcc to handle the initial function call. What's the convention on how we assign parameters to registers? I'm trying stuff out, and the first parameter always seems to be edi, then esi, then edx. Except when I change the types, and it turns into xmm0 and the like. And the result returned is in eax, except when it's a float, then it's xmm0. How do things know where to look?
This is determined by the calling convention [0]. If you're interacting with code that you didn't write, you need to know the calling convention that they use in order to set up arguments correctly.
The compiler generates code that uses the correct register. So the compiler picks a register into which it will put the result and then generates code after the calling location that gets the result from the correct register.
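For the registers being observed above, here's a hedged sketch of the System V AMD64 convention that gcc/clang use on Linux and most Unix-likes (Windows x64 uses different registers); function names are just examples, and the comments show where each argument and result typically lands:

    /* System V AMD64: integer/pointer args go in rdi, rsi, rdx, rcx, r8, r9
       (edi, esi, ... for 32-bit ints), float/double args in xmm0-xmm7;
       integer results come back in eax/rax, FP results in xmm0.
       Further arguments spill to the stack. */
    int    addi(int a, int b, int c) { return a + b + c; } /* a->edi, b->esi, c->edx; result in eax */
    double addd(double a, double b)  { return a + b; }     /* a->xmm0, b->xmm1; result in xmm0      */

    int caller(void) {
        return addi(1, 2, 3);  /* caller loads edi=1, esi=2, edx=3, then reads eax */
    }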
And yeah, there are quite a few surprises. E.g. I found out that gcc is smart enough to perform tail call optimization https://godbolt.org/g/MZDmwP
OK, so it's not a magical convention - it works it out bottom-up. First decide on registers for each parameter when you generate the code for the function, then based on that generate specific code for the instances where you call that function. Cool, thank you.
Also that example, heh. I tried to go back gcc versions to see if there was a case where it didn't do TCO - nope. Also, I like how returning 0 is "xor eax, eax".
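A hedged sketch of the kind of tail call gcc typically turns into a plain jmp (the helper function and the noinline attribute are only there to keep it from being inlined away; exact output varies by version and flags):

    __attribute__((noinline)) int helper(int x) { return x * 2; }

    int wrapper(int x) {
        return helper(x + 1);   /* with -O2, typically "add edi, 1; jmp helper" -- no call, no new frame */
    }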
It is a convention, it's called a procedure call standard. Compilers which conform to the PCS can call functions compiled by other compilers (that's how you can use libraries for example). If there's a bug in the compiler that results in non-PCS compliance, well that's a "fun" bug to track down.
>It is a convention, it's called a procedure call standard.
Is this the same thing as, or related to, the calling conventions that used to be used in Microsoft DOS and Windows native language / C apps some years ago - things like "long far pascal" as function declaration /definition qualifiers, and things like the fact that C pushes function arguments onto the stack (before the call) from left to right, and Pascal does it from right to left (or the other way around)? (Did some of that stuff, but it's been a while).
I did read the surrounding comments to this one, and saw that some of the topic was about registers, not the stack.
Suppose you're right and the registers are arbitrary. Then how would foreign function calls work? If you're compiling Rust code that calls into a C library, how does it know what registers to use?
So the choice of registers cannot be arbitrary, unless the compiler knows the function is only used within an object file.
The registers are predetermined by a convention unless you use the 'static' keyword to signal that the function is only used internally to a module, so the compiler has complete freedom to choose registers.
Then how would foreign function calls work? If you're compiling Rust code that calls into a C library, how does it know what registers to use?
By using information kept with the function, or perhaps even encoded into the function name itself (as already happens when distinguishing between different calling conventions, or in the case of C++ name mangling)?
Coming from an Asm background, where there basically is no one "calling convention", and programmers would document which registers (almost always registers, rarely the stack --- and that can make for some great efficiency gains) are for what, I've always wondered why that idea didn't seem to go far.
How would you do that with dynamically linked code, inspect functions you're calling at runtime before laying out your arguments?
> perhaps even encoded into the function name itself
That would mean name mangling in C and assembly.
> Coming from an Asm background, where there basically is no one "calling convention"
Right, because you can lay out memory however you want since you're at the assembly level. Higher-level code (C up) can't do that, so instead you've got standard calling conventions for inter-library call (inside a compilation unit, the compiler is free to use alternate calling conventions since it has complete control over both sides of the callsite, that's also how it can inline calls entirely).
> programmers would document which registers (almost always registers, rarely the stack --- and that can make for some great efficiency gains)
Some standard CCs (though not the old CDECL) also use registers, so far as they can, depending on the arch. The SystemV AMD64 ABI uses 6 x86_64 registers for integer/pointer arguments and 8 SSE registers for FP arguments, with the rest on the stack.
Win32 has various different calling conventions, and each function is annotated accordingly in the header files. It's all a bit of a mess, which is presumably why they drastically simplified it in the x64 transition.
(And realistically, for all but the most trivial functions, having one convention is probably a highly reasonable default. Trivial functions should probably be made available to the compiler for inlining anyway. Note also that GCC allows you to override the number of arguments passed on the stack via an annotation on 32-bit x86, if you insist.)
That can work for statically-linked object files, but what about dynamically linked? You can load one with a function call, get back a function pointer, and invoke it like any other function. Trying to use some metadata would slow down the function call significantly, even if you tried to cache it somewhere.
I used to be a native asm programmer in Z80 and 680x0, and one reason for using XOR rather than MOV is to do with condition codes: the XOR operation will most likely update the condition codes (notably, the Zero flag), whereas MOV will probably not.
Often you would not want the flags updated when simply clearing a register - you're hardly likely to test the Zero flag having just set something to zero, because it's obvious, and more importantly you may want to set something to zero while preserving the flags from a previous operation.
But often you don't care about the flags so you can use the slightly shorter and/or faster XOR operation. It used to generally be shorter and faster because the MOV instruction had the initial step of an immediate load of the zero from memory.
And that's why it changes with different optimisation levels - the compiler knows when the flags need to be preserved, and if they don't it can get away with using XOR.
It's been a while since I programmed low level but I think on the 68k series they started to introduce cache and multi stage instruction pipelines. By alternating instructions working on different things you could get a decent performance gain. If every instruction had to wait for the result of the previous instruction to complete then it wouldn't be running at its best. With careful planning you could insert 'free' instructions but you would have to watch how flags were altered. We used to spend quite a bit of time optimising code to this level, eeking every bit of performance out of the hardware. Great fun.
Sure, things have moved on a lot since those days. I think in modern RISC architectures you can even specify whether the instruction should set the condition flags.
> > Also, I like how returning 0 is "xor eax, eax".
> -O1 is `mov eax, 0`
Simply because it is shorter: On x86-64 (and x86-32)
xor eax,eax
encodes as
31h C0h or 33h C0h
(depending on the assembler; typically the first one is used) - 2 bytes, while
mov eax,0x0
encodes as
B8h 00h 00h 00h 00h
- 5 bytes.
Having privately analyzed some 256-byte demos, I cannot imagine how anyone would come up with the idea of using `mov r32, imm32` for zeroing a register (except that people don't want to understand how the assembly code is internally encoded) - the canonical way is `xor` (`sub` also works in principle, but `xor` is the way that is recommended by Intel).
It's not just shorter, it's also faster. But see my answer also: there are condition flag implications of using XOR and sometimes MOV will be preferable. The optimiser will always know best :)
"On Sandybridge this gets even better. The register renamer detects certain instructions (xor reg, reg and sub reg, reg and various others) that always zero a register. In addition to realizing that these instructions do not really have data dependencies, the register renamer also knows how to execute these instructions – it can zero the registers itself. It doesn’t even bother sending the instructions to the execution engine, meaning that these instructions use zero execution resources, and have zero latency! See section 2.1.3.1 of Intel’s optimization manual where it talks about dependency breaking idioms. It turns out that the only thing faster than executing an instruction is not executing it."
It's fascinating how far down the rabbit hole goes these days. One might think machine code as emitted by compilers would be pretty close to where the buck stops, but no. Named registers are just an abstraction on top of a larger register pool, opcodes get JIT compiled and optimized to microcode instructions, execution order is mostly just a hint for the processor to ignore if it can get things done faster by reordering or parallelizing... And memory access is probably the greatest illusion of all.
What I also find rather interesting is the concept of macro-op fusion that Intel introduced with the Core 2 processors: This means for example that a cmp ... (or test ...) followed by a conditional jump can/will be fused together to a single micro-op. In other words: Suddenly a sequence of two instruction maps to one internal micro-op. If you are interested in the details, read section 8.5 in http://www.agner.org/optimize/microarchitecture.pdf
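A hedged example of where that fusion typically applies: the back edge of a simple counted loop compiles to a cmp immediately followed by a conditional jump, the adjacent pair the decoder can fuse into one micro-op (whether it actually fuses depends on the microarchitecture and the exact instruction pair):

    long sum(const long *a, long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += a[i];     /* the loop back edge typically ends in cmp followed immediately by jne */
        return s;
    }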
The lower optimization levels are supposed to be more straightforward translations of the high-level language code. You can imagine this might be useful if you are debugging assembly.
On the other hand, I find O0 is significantly worse than what even a novice human Asm programmer would do if asked to manually compile code, and O1 would be around the same as a novice human.
Yes, I used to find that too. It's because, pre-optimization, on older architectures, the compiler outputs chunks of asm as if from a recipe book. Loads of unnecessary memory access, pointless moving data between registers, etc.
A proficient human coder, on the other hand, writes assembler that is partly optimized by default.
But few humans could write code like a seriously optimizing compiler, esp. on modern pipelined architectures - that stuff is unintelligible. Which is as it should be, because modern processors are not designed to be programmed directly by humans.
You still need an overview of the instructions and registers. Knowing where the operator and operand of the instruction is also helpful. (i always mix up intel and at&t notations)
What a train wreck! It’s hard to imagine a more confusing state of affairs.
I'd say that's more attributed to someone many many years ago deciding they would not follow the official Intel syntax (for what reason I do not know), and somehow convincing the rest of the community to follow them. That's actually one of the things that could make for a very interesting article: how one processor family got two different and incompatible Asm syntaxes. The fact that the mnemonics and syntax don't correspond to those found in the manufacturer's datasheets and manuals just increases the barrier to understanding. As far as I know, the same didn't happen to ARM, MIPS, SPARC, and the others. Especially when the sense of the comparisons/conditional jumps is reversed, and some of the more advanced addressing modes look less-than-obvious, it's hard to imagine why anyone would adopt such a syntax:
Note that the GNU tools have an option to use Intel syntax too, so you can avoid some of the confusion (in the DOS/Windows and embedded world, at least until recently, Intel syntax is overwhelmingly the norm).
This. The AT&T syntax for x86 thing is a huge mistake. All the official docs are Intel syntax. Intel syntax is easier to read and write. Half the gotchas in this article are problems that don't exist in Intel syntax, like the instruction suffixes. The instruction suffixes get even weirder when you get to the sign extending instructions. I wrote an article about this here: http://blog.reverberate.org/2009/07/giving-up-on-at-style-as...
For someone who grew up on normal processors (MC68000 and UltraSPARC) AT&T syntax is the best thing since sliced bread: it's perfectly logical to move something to somewhere, instead of "move to somewhere something".
I haven't done any 68K Asm and barely glanced at SPARC, but how does src, dst interact with noncommutative operations like subtraction and comparison? E.g. with x86 Intel syntax,
This is one of the most confusing things about AT&T x86 --- the comparisons and subtractions have their operands reversed, and you have to identify and manually reverse them to understand the code correctly. With Intel syntax, the operands to a subtraction appear in the usual arithmetic order. Or do those processors' syntaxes keep the order but instead replace the subtrahend with the result?
Exactly this. The instruction set is designed around Intel syntax. When you flip operands around because you prefer a different ordering, it messes up things like jg/ja/jl/jb/etc.
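To make the reversal concrete, here's a hedged sketch: the same compiled comparison rendered in both syntaxes (typical gcc/clang -O2 output for x86-64; details vary by compiler and version):

    int less(int a, int b) { return a < b; }
    /* Intel syntax:            AT&T syntax:
         xor  eax, eax            xorl %eax, %eax
         cmp  edi, esi            cmpl %esi, %edi
         setl al                  setl %al
         ret                      ret
       Same instruction bytes either way; in Intel order the setl reads
       naturally as "edi < esi", in AT&T the operands appear swapped. */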
And it's all arbitrary anyway. Some people might prefer a [src, dest] ordering, but it's not inherently any more natural than [dest, src]. Look at variable assignments: "x = y" in almost any programming language will assign y to x.
Yeah but in most assemblers you're not setting, but either loading or moving values into something, or from somewhere. Because of that, one never has to think in terms of x = y.
> This is one of the most confusing things about AT&T x86 --- the comparisons and subtractions have their operands reversed,
That is confusing as all hell to me: if I compare x to 5, and 5 to x, it's still the same comparison, so what difference does it make?
Anyway, on Motorola 68000 it would look like so, assuming data was in data register 0 (there are eight general purpose data registers, and eight general purpose address registers):
cmp.l #5, d0 ; d0 is unchanged by the comparison
bgt MoreThanFive
;
; subtract the value of d1 from d0, and store the result
; in d0.
;
sub.l d1, d0
However, we don't usually branch on greater or less; we simply compare whether a register is equal to some value:
In x86 assembly it matters because of what a comparison actually is: cmp is (very cleverly) defined as equivalent to sub (subtraction) without storing the result, only setting the flags.
The jump if greater/lesser/equal/greater-or-equal/etc. instructions are all defined in terms of checking the sign, zero, carry and overflow flags. For instance, jz and je are the same instruction; you just use one or the other if it makes more sense in context (I used to write asm by hand for the art of 4096-byte demos).
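A hedged illustration of that: the same cmp is followed by a different conditional check depending only on the signedness of the C types (typical -O2 style output in the comments; exact code varies):

    int sless(int a, int b)           { return a < b; }  /* cmp; setl / jl -- signed, looks at SF/OF */
    int uless(unsigned a, unsigned b) { return a < b; }  /* cmp; setb / jb -- unsigned, looks at CF  */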
I used Intel notation, because that's what the Asphyxia tutorials and Turbo Pascal used. So it's what I'm used to. I don't have much of an opinion about the order, except that swapping it was an idiotic decision either way; the confusion that caused isn't worth whichever one would be more theoretically "right". All the sigils and pre/postfixes seem messy, though.
>That is confusing as all hell to me: if I compare x to 5, and 5 to x, it's still the same comparison, so what difference does it make?
I haven't done assembly code for a while, and was not an expert at it earlier, so guessing, but:
It may be because of which flags in the flags register (if there is one nowadays) get set - they could be different for the two versions of your comparison.
Yes, there is a status register, every processor must have one (or else the processor couldn't function). Doesn't matter whether you compare 5 to a register (or memory location, depending on the processor), or memory / register to 5, the same bit(s) will still be set in the status register.
>Doesn't matter whether you compare 5 to a register (or memory location, depending on the processor), or memory / register to 5, the same bit(s) will still be set in the status register.
Are you sure? That was my whole point - that it may not be that way. As I said, it's been a while, but it seems to me that the bits that get set in the flags/status register, on comparing A to B, should be, in some sense at least, the opposite (maybe not for all the bits) of what would get set on comparing B to A; because I thought it would be done by subtracting A from B or B from A, and then setting (some of) those flag bits based on which was greater or equal. If that is so, comparing A to B will not have the same result in the register as comparing B to A. And the reason why I think so, is that there are assembly instructions like JGE (Jump if Greater or Equal), JE (Jump if Equal), JNE (Jump if Not Equal), etc. - the meaning of those instructions would get changed and so would the resulting action (jump or not jump) based on looking at the flags set on comparing A to B vs. B to A.
You're overthinking this way more than you need to.
On any given processor, in this context, there is one and only one way to compare an immediate value with one in a register, so you don't have to worry about whether you're comparing 5 to %eax, or eax to 5: you can't subtract the value in %eax from 5, because 5 is an immediate value, not a memory location.
It can't be done, because there is one and only one way to compare an immediate value with one in a register. Intel or AT&T syntax -- dst, src or src, dst -- the comparison is the same. Therefore, AT&T syntax is the best thing since sliced bread, because it's left to right instead of right to left, which is how we think in terms of taking something and moving it somewhere -- in the physical world, step 1. will be to take an object and step 2. will be to move that object somewhere.
Thats only part of the syntax differences between AT&T and Intel.
Also, "move destination, source(s)" is consistent regardless how many sources there are (although I agree with you that "source -> destination" is more intuitive for us left-to-right people).
I prefer Intel syntax and admit that you have a good argument in your article why the AT&T syntax might be problematic, but for newer extensions (SSE etc.) the instruction naming in Intel syntax actually converged towards AT&T syntax: Just to give an example from SSE2 (from http://softpixel.com/~cwright/programming/simd/sse2.php):
That's not really "AT&T syntax", since the suffixes here are denoting how the MMX/XMM register is being split up and not the whole operand size. Those instructions can still be used with either the 64-bit MMX or 128-bit XMM registers:
paddb mm0, mm1
paddb xmm0, xmm1
If it was really more like "AT&T" style (I don't know how GNU really does this, so I'm guessing), it would be more like paddqb for MMX and paddob for SSE.
I accept the argument that the suffixes mean something a little different in these SSE2 instructions than in the "classical" x86 instructions. But I think it should be clear where these suffixes come from (thus there is some convergence of the Intel syntax for new instruction towards the AT&T syntax). And indeed the instructions paddb, paddw, paddd, paddq are the same on Intel and AT&T syntax. Look at
First convince yourself (for example by looking at page 26-39) that in the individual tables the column "Solaris Mnemonic" stands for the AT&T syntax and the column "Intel/AMD Mnemonic" stands for the Intel syntax.
Now look at page 48. Surprise: It is paddb, paddw, paddd, paddq both in Intel and AT&T syntax.
And Golang uses a separate syntax from Intel syntax and AT&T syntax, so now there are three incompatible syntaxes in common use. What a mess. :(
In Go's case it's because they wanted to have the syntax reflect the MachineInstr-like abstraction they have in place. I wouldn't be surprised if something similar was responsible for the AT&T syntax as well.
(In case it isn't clear, my preference is to do what LLVM does and to treat MachineInstr and machine code as distinct objects, and to use the vendor syntax for the human readable representation of the latter.)
A few years ago I decided to convert the Asm syntax of my toy compiler from Intel to AT&T, because I wanted to rely only on as, not forcing people to install another Assembler.
Since as had some issues with the Intel syntax I was using, I decided to not use its support for the syntax and convert it to AT&T.
Never again, it is just plain ugly.
Compared with DOS/Windows Macro Assemblers (TASM, MASM, NASM, ...), AT&T ones are just plain pre-historic.
Familiarity with the VAX I'd say. Although that doesn't explain the % in front of registers. Maybe that came from SPARC? Looks like they tried to unify various asm syntaxes.
I remember reading a long time ago about the syntaxes and the author said that it is because AT&T assembled faster.
Makes sense considering that the Intel syntax requires some backtracking (or more temporary data).
(please don't upvote if you agree, i'm at 69 points :)
So, this is fantastic, but I want to make an appeal for the most important thing to understand about any assembly language, even before you work out the individual instructions:
Break your programs into basic blocks! Reverse engineers never read assembly in a straight line. Instead, they read the control flow graphs of subroutines, which is the graph where nodes are runs of instructions ending in jumps and branches, and the edges are the jump targets. I hope this doesn't sound complicated, because it isn't: it's literally just what I wrote in this paragraph. It takes about 15 minutes for most platforms to learn enough to recover CFGs from subroutines by hand.
To get a decent understanding of what a chunk of assembly code is doing, all you really need is:
* The code broken into subroutines (this is usually your starting point) and then CFGs (good disassemblers do this for you, but it's easy to do by hand as a first pass)
* The CALLs (CALLs don't end basic blocks!)
* The platform's calling convention (how are arguments passed and return values returned from subroutines)
There are two tricks to reading large amounts of assembly:
1. Most of the code does not matter, and you won't be much better off for painstakingly grokking it.
2. Virtually all the assembly you'll see is produced by compilers, and compilers spit out assembly in patterns. Like the dude in The Matrix, after an hour or so of reading the CFGs of programs from a particular compiler, you'll stop having to read all the instructions and start seeing the "ifs" and "whiles" and variable assignments.
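As a hedged, hand-written illustration of the CFG idea (not any particular compiler's output): each commented chunk below is one basic block, ending at a branch or return, and the edges are the branch targets.

    int clamp(int x, int lo, int hi) {
        if (x < lo) return lo;      /* block A: compare, branch to "return lo" or to B */
        if (x > hi) return hi;      /* block B: compare, branch to "return hi" or to C */
        return x;                   /* block C: fall-through, return                   */
    }
    /* CFG:  A -> {ret lo, B},  B -> {ret hi, C},  C -> ret x
       A call to another function inside any of these blocks would NOT end the block. */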
To understand assembly it really helps to know at least something about how computers work on a low level. When I first tried learning it (a long time ago, in high school) I had no idea how computers really work at such a low level - how the CPU addresses memory, registers and that kind of stuff - and while I managed to learn the syntax, even write some asm code, it was all really confusing to me. Only a few years later at university, after I'd learned in detail about the CPU architecture, registers, buses, DMA, etc., did it all suddenly start to make perfect sense and become 100x clearer and easier. So if you're interested in this, it will save you a lot of effort to invest some time first in learning the computer architecture basics, and then go from there to learn the assembly language. Just my $0.02
It's helpful to realize x86 assembly is not what's executed by the machine; machine code is. One assembly instruction, e.g. ADDL, is translated to several different machine code instructions depending on the destination, source, and addressing mode.
I'm looking at the Microsoft Macro Assembler 5.1 Reference manual (it was nearby and easily accessible to me; yes it's old (from the very late 80s or early 90s) but it covers the 32-bit 80386, which is still valid).
Anyway, it shows three different encodings for the ADD instruction. The first:
000000dw mod,reg,r/m
This adds register to register, or memory to register (either direction, the d above) using either 8 or 16/32 bits (the w above [1]). The second form:
100000sw mod,000,r/m
This adds an immediate value (8 or 16 bits, w again) to a register. The s bit is used to sign extend the data (s=1; otherwise, 0-extend it) if required [2]. The final form:
0000010w data
This adds an immediate value to the accumulator register (EAX, AX, AL) [1]. That's three different encodings for the "same" instruction. The MOV instruction (and again, I'm only talking about the 80386 here) has 8 different encodings, depending upon registers used.
[1] If the current code segment is designated as a 16-bit segment, then the w means 16 bits, unless a size override byte (an opcode prefix byte) is present, in which case it means 32-bits. If the current code segment is designated as a 32-bit segment, then the w means 32 bits, again unless a size override byte is present, in which case it means 16-bits.
[2] It seems to me that if w=1, then the s bit is extraneous and thus could be used to encode other instructions. I'm not sure if that is the case but it's common to use otherwise nonsensical instruction encoding to do something useful.
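In case a concrete byte-level example helps, here's a hedged sketch of those three ADD forms for 32-bit operands, written out as C byte arrays (encodings reconstructed from memory of the manual; worth double-checking before relying on them):

    unsigned char add_reg_reg[] = { 0x01, 0xD8 };                    /* 000000dw form: add eax, ebx             */
    unsigned char add_imm8[]    = { 0x83, 0xC0, 0x05 };              /* 100000sw form: add eax, 5 (s=1, imm8)   */
    unsigned char add_acc_imm[] = { 0x05, 0x05, 0x00, 0x00, 0x00 };  /* 0000010w form: add eax, 5 (imm32, EAX)  */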
> It seems to me that if w=1, then the s bit is extraneous and thus could be used to encode other instructions. I'm not sure if that is the case but it's common to use otherwise nonsensical instruction encoding to do something useful.
Opcode 82h is an alias for 80h --- it presumably sign-extends the immediate value into an internal temporary register, but the upper bits don't matter anyway since it's an 8-bit add. Some interesting discussion on that here, along with an example application:
There's a bit of a miscommunication going on. What I meant was, when you write an assembly instruction, that maps to one machine instruction.
I.e., because of things like addressing modes, different invocations of an ADD instruction can map to different machine instructions. But one ADD invocation will always map to one machine instruction.
The parent comment sounded to me like one assembly instruction could map to several machine instructions, like one line of C is equivalent to several lines of assembly. Just wanted to clarify that that isn't the case.
> There's a bit of a miscommunication going on. What I meant was, when you write an assembly instruction, that maps to one machine instruction.
That's not true on x86-16, x86-32 and x86-64. For example
060o, 310o
and
062o, 301o
(...o means "octal"; for the reason why I give this example in octal instead of hexadecimal cf. https://news.ycombinator.com/item?id=13051770) both stand for "xor al, cl" (the assembler you use will pick one of the two encodings) - for those people who really prefer hexadecimal here: It corresponds to
30h, C8h
and
32h, C1h
The fact that there are different ways to encode some instructions was used by the A86 assembler (https://en.wikipedia.org/w/index.php?title=A86_(software)&ol...) to watermark machine code that was generated by it; in particular to detect whether it was generated by a registered or unregistered version of A86:
"The assembler automatically embeds a "fingerprint" into the generated code through a particular choice of functionally equivalent instruction encodings. This makes it possible to tell if code was assembled with A86, and also to distinguish between registered and unregistered versions of the assembler, although access to the source code is required."
That is also true --- for example, "mov reg, reg" is a special case of "mov reg, r/m" or "mov r/m, reg" with the r/m specifying a register, so basically two separate sequences of bytes which perform the same operation. This has been exploited by copy-protection and steganography, going back to the A86 shareware assembler which was the first use of this technique that I can remember, to more recent developments:
>Can you point to a source for this? All x86 assemblers that I know of map one assembly instruction to one machine instruction.
Seeing the confusion and clarifications in the replies to your comment, I think it may have been more clear if you had said (and you probably meant):
"Can you point to a source for this? All x86 assemblers that I know of map one assembly instruction to one out of a set of machine instructions (where the chosen machine instruction depends on things like the addressing mode (immediate, indexed, indirect indexed, etc. - I'm using older terms for addressing mode, not sure if they are valid now with newer processors. but the concept is the same).
Yeah, stuff like DMA isn't getting enough attention in a lot of treatments of the subject. Also the in/out instructions (did you ever consider how a CPU talks to an HDD)?
Two ways. The first is IO mapped IO, which is what the x86 supports with the IN and OUT instructions. All this really is is a MOV to a different "address" space (16 bits of addressing). The second is memory mapped IO, in which hardware is mapped to the addressing space of the CPU (Motorola 68K uses this format). A CPU that allows IO mapped IO can also do memory mapped IO (it's not precluded).
PCI used it for discovery, and could use it for device access (though often memory-mapped instead). More info on the osdev wiki, which btw is a great place if you want to know about low-level initialization, the boot process etc.
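A hedged sketch of what the two styles look like from code (x86, GCC inline asm; this only works in privileged code such as a kernel or a DOS-era program, and the MMIO address below is purely a made-up example):

    #include <stdint.h>

    /* port-mapped I/O: the dedicated in/out instructions, 16-bit port space */
    static inline uint8_t inb(uint16_t port) {
        uint8_t v;
        __asm__ volatile ("inb %1, %0" : "=a"(v) : "Nd"(port));
        return v;
    }
    static inline void outb(uint16_t port, uint8_t val) {
        __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
    }

    /* memory-mapped I/O: the device's registers sit at some address and are
       accessed with ordinary loads/stores (hypothetical address) */
    #define DEVICE_REG ((volatile uint32_t *)0xFEB00000u)

    static inline uint32_t read_device_status(void) {
        return *DEVICE_REG;
    }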
While I like Petzold's "Code", it's really aimed at nontechnical audiences. Jon Stokes' "Inside the Machine" or Nisan and Schocken's textbook "The Elements of Computing Systems" are far better if you have a technical background.
I second the Stokes recommendation. I have very little hardware background, but quite a bit of science and tech (engineering). He provided the right amount of background and introduction for me.
x86 is the worst ISA. If you want to play with assembler without feeling a desire to stab yourself and end it all, I recommend ARM.
Or go learn Z80, x86's weird 8-bit cousin (it had a 16-bit version, but it sold poorly), which had a greater emphasis on backwards compatibility (you can run code from the original 8080 on a Z80, unchanged), and is nicer to work with (because it wasn't extended in unanticipated directions far beyond its original capabilities, while keeping fetishistic backwards compatibility by stacking hack on top of hack on top of hack. It also didn't have memory segmentation, otherwise known as The Worst Thing.)
There are only two common reasons to learn Z80 assembler, though: to program the Gameboy (which runs on a modified Z80 with all the cool instructions removed), and to program a TI calculator, thus making all highschoolers in your area immensely happy.
TI calculators are a comically overpriced scam, that have only survived because of the College Board, but that's another story.
This is about learning to read, not learning to write.
If you're ever in a situation where you need to read assembly, it's typically not up to you what ISA it'll be in. There are two situations where I've had to read assembly: either reverse-engineering a compiled binary or trying to understand the compiler output for a small piece of a program I'm working on in a higher-level language.
I'm not sure everyone has to learn even that much. I've managed to avoid it until very recently and I'm more willing to dive into underlying code than most people I know (I've contributed to a FreeBSD kernel patch in TCP and an OpenSSL key exchange interoperability fix)
I don't think that's true. I work at a major software company and certainly not all of my coworkers are interested enough in low-level stuff to ever need to read assembly.
It's a good skill to have, and since not everyone has it you will often be able to solve problems that no one else can.
x86 is clearly not a beautiful ISA, but it is not as black as it is painted. The first thing that one should understand is large parts of the encoding of the instructions make a lot more sense once one writes them down in octal instead of hexadecimal (something even the Intel people who wrote the reference manual seem to have missed):
...that really doesn't help with the issues I have: the instruction set is an absolutely ugly jumble of almost 40 years of religious backwards compatibility, odd hacks, and various extensions.
Although POSIX and HTML5 hold up a bit better than Win32 and C++, IMHO. Especially POSIX (it's not great, but it works pretty well).
But I've heard X11 is absolutely miserable, so the Windows folks don't have a monopoly on satanically evil APIs with religious backwards-compatibility.
It helps to have actual real life experience under your belt when making such claims. You seem to be parroting what countless rants have already repeated without much content.
The GP didn't have much content either, merely listing off other obsessively backwards compatible things. Your reply might be suitable in a formal debate setting (as would "fallacy!" claims be suitable when challenging faulty deductive logic) but this isn't a debate, it's a conversation. The source of one's claims doesn't matter, you only know they're probably not from experience because the Parent was kind enough to share their age. The claims are used to drive the conversation and establish a shared context for further conversation (or at least ranting about the shape of our industry), not to debate.
Also reading the hard-won experience of others is much more efficient than trying to get it yourself. Books are wonderful. With enough books you can advance beyond the authors without having to tread the same paths, you parrot their findings as a base for your own. Or another case, you can at least be aware of common pitfalls your senior coworkers are constantly falling into because of an aversion to reading. Why do you know the pitfall is there? You're just parroting back what a book said. That doesn't make it untrue, or not useful to know, or not useful to share with other people, or even not useful to bring up to show there's a shared context.
Given, I'm fairly weak in assembler (I've been learning), so I may not be the most reliable resource: GP has a point about my lack of experience, just not for the reason they think they do.
Probably because not breaking programs registers a lot higher on most people's priority list than making it easier to write -- especially when we're talking about assembly, which hardly anyone writes in the first place.
Everything you wrote is actually correct, so I'm really not sure why you're being voted down, but it's a behavioral pattern I've noticed on HN in general: write anything that's not high praise or in any way disagrees with what's popular and expect to be brutally censored. It really has me contemplating ditching HN altogether, if all we're ever going to do here is stroke eachothers' egos and pander to popular trends. And here I thought the point of HN was stimulating discussion.
It's not the worst. There are far worse places. And this sort of problem appears in all forums, but especially voting-based systems like HN (the "internet points" problem).
As forums go, HN is far from the worst. But it could be better.
In my day I wrote a fair amount of x86 assembly language. I found it fairly easy and straightforward. There are probably even fewer reasons to learn it these days, but nothing gives you a better idea of how a computer works (arguably, besides microcode, but that is a different kettle of fish).
The simple stuff's alright, but if you want to do anything performant, it gets hairy pretty fast, as you wade through almost 40 years of expansions and ugly hacks.
Under Linux gs is used for thread-local storage (TLS).
But "as a typical programmer" indeed you only have to worry about the internal details of this if you are an OS developer, otherwise you can simply use the appropriate segment override prefix and otherwise don't care about it.
I never felt more like stabbing myself than when trying to cipher out exactly which immediate values are possible on ARM, and which are not. X86 I happen to enjoy. It is not "the worst ISA" by any means. It has wonderful code density, which turns out to be very important. There's a reason that x86 won and continues to win.
> I never felt more like stabbing myself than when trying to cipher out exactly which immediate values are possible on ARM, and which are not.
It depends if you're compiling for ARM or Thumb. The main rules are:
ARM and 32-bit Thumb instructions: 12 bits for arithmetic data processing instructions and load/store, 8bits with even rotate right for bitwise data processing instructions + MOV and MVN
16-bit Thumb: 3 bits for arithmetic instructions with Rd != Rn, 5 bits for shifts (so the whole range is covered) and load / store (but shifted left by the size of the data, i.e. the maximum offset for this version of LDR is (0x1F << 2) = 124), 8 bits for arithmetic instructions with Rd == Rn, SP-relative loads / stores and literal loads.
I doubt x86's popularity has much to do with its instruction set at this point. In particular, the variable length instructions are a pain for decoders (both hardware and software).
When originally invented the x86 instruction set was efficient - the most-used instructions had shorter byte code sequences. But eventually some instructions got 'left behind' by the compilers. There are a whole host of single-byte instructions that are never, ever used by a compiler - the register exchange instructions for instance (xchg eax, ebx). Compilers just schedule destination registers carefully, never need to swap them around.
Also the whole set of exchange-register-with-itself instructions was defined but never used. E.g. xchg ax,ax, which does nothing in one byte. In fact that one was considered useful: it's used as the 'no-op' instruction (0x90), right? But what about xchg bx,bx, xchg cx,cx and so on? Just wasted single-byte opcodes, leaving actual common instructions to use longer bytecode sequences.
So maybe an executable should begin with an opcode-decode-table that is loaded with the code, that tells the hardware what byte sequences mean what instructions. So each executable code can be essentially compressed, using optimum coding for exactly the instructions that code uses most often. Just thinking out loud.
xchg ax, ax is simply encoded as 0x90, which is the same as nop (the same holds for
xchg eax, eax
in x86-32).
> But what about xchg bx,bx, xchg cx,cx and so on?
This (cleverly?) cannot be encoded in one byte. Here you have to use at least two bytes (0x87 followed by the ModR/M byte in x86-16; the same holds for their 32 bit counterparts xchg ebx, ebx; xchg ecx, ecx etc. in x86-32):
> So maybe an executable should begin with an opcode-decode-table that is loaded with the code, that tells the hardware what byte sequences mean what instructions.
The engineers of the Transmeta Crusoe/Efficion processor tried something similar:
"Crusoe was notable for its method of achieving x86 compatibility. Instead of the instruction set architecture being implemented in hardware, or translated by specialized hardware, the Crusoe runs a software abstraction layer, or a virtual machine, known as the Code Morphing Software (CMS). The CMS translates machine code instructions received from programs into native instructions for the microprocessor. In this way, the Crusoe can emulate other instruction set architectures (ISAs).
This is used to allow the microprocessors to emulate the Intel x86 instruction set. In theory, it is possible for the CMS to be modified to emulate other ISAs."
It is still efficient for general-purpose code compared to other ISAs; the majority of register-register and register-memory ops are 2-3 bytes, while for something like MIPS no instruction is ever shorter than 4 bytes.
> Compilers just schedule destination registers carefully, never need to swap them around.
Actually, it can happen with certain instructions that need fixed register constraints (multiply, divide, string ops) --- I've encountered a few cases where, had the compiler known about the exchanges, it could've avoided using another register or spilling to memory. As far as I know, in modern x86 cores the reg-reg exchanges are handled in the register renamer, so they aren't slower than using an extra register and definitely faster than spilling to memory (which might happen anyway for something else if it needed the extra register.)
To witness, here is something no compiler (software) I know of can generate, even when given code that could generate it:
Been done (somewhat). The PERQ (https://en.wikipedia.org/wiki/PERQ) had a writable instruction set. And I think Transmeta was trying to do something similar with translating x86 code to an internal format for execution.
Are you using the past tense because you are going to tell us about another contemporary ISA that has more code per byte? X86 _is_ compact and this continues to be relevant to performance today.
Well, yes. That part sucks. But on x86, everything is like that. I'd rather have one weird, inconsistent thing than an amorphous, ever-shifting mass of them. Which is x86 in a nutshell.
Here's a handy heuristic: if somebody claims to know every x86-64 instruction (or even every x86-32 instruction), you can be at least 90% sure they're lying.
> Here's a handy heuristic: if somebody claims to know every x86-64 instruction (or even every x86-32 instruction), you can be at least 90% sure they're lying.
I very much prefer ARM myself, but you can probably apply the same rule of thumb to it. There are upwards of 400 instructions both in AArch32 and AArch64, with a fair number of differences between the two.
Edit: I've also posted a breakdown of the immediate limitations for ARM in this thread. It's not that complicated when sticking to the standard instructions.
Fair enough, some instructions have many variants. But I work with ARM assembly almost daily and I still wouldn't remember that, for example, 'VQRDMULH' is a real instruction and it stands for 'Vector Saturating Rounding Doubling Multiply Returning High Half'.
The Game Boy doesn't quite use a Z80 --- like the Z80 it's based on the 8080, but in a different way. So you don't get things like the IX and IY or the alternate register banks, but you do get things like (a very crude set of) stack-relative addressing modes, which makes it a better fit for modern programming languages than the Z80.
As an aside: most of the old, classic 8-bit micros are complete pains to write modern code for, because modern programming languages all assume fast stacks. The Z80 has no stack-relative addressing, which means you need to reserve a precious index register as a frame pointer at the top of every function, and then indirect off that --- but the Z80 designers didn't realise that people would want to do it so often and as a result it's verbose, dead slow, and doesn't handle 16-bit values. So you need to do:
ld h, [iy+8]
ld l, [iy+9]
...for a total of 8 bytes of code and lots of cycles.
The Game Boy processor (which doesn't have a snappy name) allows this:
ld hl, sp+8
ldi a, [hl] // load and increment
ld l, [hl]
ld h, a
...which is (IIRC) five bytes. Still not great, but shorter, and also loads faster.
If you look at the instruction encodings, the Z80's actually a pile of nasty hacks. The original 8080 is way more elegant; and there's lots of software and tooling for it, too. (But it still can't run C efficiently.)
As I understood it, the GB is based on the Z80, not the 8080: That's why its nickname is the GBZ80.
>If you look at the instruction encodings, the Z80's actually a pile of nasty hacks. The original 8080 is way more elegant; and there's lots of software and tooling for it, too. (But it still can't run C efficiently.)
I don't know about what makes an instruction encoding elegant or inelegant, so can't help you there.
Yes, the 8080 is probably more elegant, but the extra features on the Z80 are incredibly useful (especially register exchange: The Z80 had two sets of registers, which you can exchange. No, Zachtronics didn't make that up: that was a real thing, on the Z80 at least). Also, the Z80 tooling is quite nice: asxxxx and WLA-DX are fine assemblers, and SDCC is a pretty good C compiler. It sure as heck beats cc65, in any case.
But yes, if you want to program your TRS-80 (but only the original: later ones were 6502), or your ZX* (How many of you lot know the ZX line? Spectrum? No?), or your Game Gear, or your Master System, or any of the various CP/M machines, you have to learn Z80.
... and for a brief period of time in the late '80s, "popular" in India as well. (I quoted popular because they were bloody expensive. In 7th grade, a kid in my class had one. He was the only one in all of the school to have a computer.).
Another good one for learning is the 6502. Don't forget to buy a copy of Lance A. Leventhal's "6502 Assembly Language Programming". Great book for learning not only assembly, but also the fundamentals leading up to assembly.
I learned to program on the 6502 based VIC-20 using Lance Leventhal's "6502 Assembly Language Subroutines" as a guide by manually assembling and poking into memory. What a fun time.
Of course. Who hasn't wanted to write their own chiptunes? Grab your (emulated) Ricoh 2A03 (and your emulated Konami VRC VI, for some extra fun), and get hacking.
It wasn't easy. Very asymmetric, not many registers, and tiny stack. The only things that were written directly in it were the operating system, and the virtual machines for higher level languages such as COBOL and MPL, because it was too hard to compile to. I worked on the virtual machines.
After programming in it, I can assure you that programming in other assembly languages (including x86) is a breeze.
I should put my cheat-sheet on the web, if I can still find it.
Isn't the high level Burroughs assembly a completely insane, almost high-level language? I looked into implementing it in my assembler and just ran in the other direction when I saw how much unnecessary complexity there was in the language.
B90 was an 8 bit machine and was built at Cumbernauld in Scotland, where I worked. The B900 had similar architecture.
There was also a B1900 built in Liège in Belgium, which was a 24 bit machine whose instruction set was designed to run virtual machines (i.e. interpreters). Those systems had a reputation for being slow. I don't know much about them.
The Liège plant closed around 1982 and the Cumbernauld plant closed around 1985.
Burroughs mainframes (B5000 onwards to A series) may be the ones you're thinking of. These are justifiably praised for being ahead of their time. They were high level stack based machines with 48 bit words + 3 tag bits, and programmed directly in an Algol 60 variant, with additional instructions to enable COBOL to execute efficiently. There was no assembly language needed.
x86 is the worst ISA. If you want to play with assembler without feeling a desire to stab yourself and end it all, I recommend ARM.
Yes, intel is really bad, especially for learning, and while ARM is certainly better, it's pretty esoteric, and also backwards (right to left) like intel.
If you want a nice, orthogonal ISA to learn assembler on, the MC68000 family is a joy. The instructions are human readable, the processor is big endian, and the moves are src, dst. It's almost like a high level programming language.
The 6502's a bit less simple to learn, but I'd say it's worth it. It worked its way into many important computers, and is arguably one of the most emulated and most used processors in existence.
I agree that the x86 ISA is pretty warty (though I have a strange fondness for it), but I'd recommend 6502 rather than Z80. There are a lot of fun retro-computer platforms that are 6502-based. Thinking of the zero-page functioning as a register-bank is really fun, too.
For 8-bits I'd recommend the 6809. Two 8-bit accumulators that can be used as a 16-bit accumulator, 4 16-bit index registers (that can largely be interchanged, except for S which is also the stack register) and you can generate pure relocatable code. And the zero-page isn't restricted to address $0000.
This sounds really interesting. I've never taken a look at the 6809 before. I happen to have a working Tandy CoCo I scored in a vintage computer haul. I'll definitely check out 6809 assembler.
The 68k (and its 8-bit semi-cousin, the 6809) were very nice. However, unlike the Z80, the 6502, and x86, they're no longer being made, and are increasingly rare.
They're in the "legacy" line and labelled "not recommended for new design" but I think that's just design opinion and doesn't imply it's no longer made.
I'm sure this video has done the HN rounds before. It's a slow but fascinating watch. "Motorola 68000 Oral History Panel" from original Motorola team members. https://www.youtube.com/watch?v=UaHtGf4aRLs
68k Macs are easy enough to find on eBay. Grab one of the old Powerbook 1xx series laptops and you have a self-contained development station that's less expensive than many modern microcontroller dev kits.
If you want something friendly and CISC-y and modern, check out the Renesas RX600 and friends. They have a nice instruction set and zero-wait-state RAM and ROM, so writing assembly by hand with predictable timing ought to be easy.
(author here) ...FYI I actually learned assembly language for the first time back in the 1970s and 80s using a 6809 inside a Radio Shack "Color Computer." It was super-fun at the time. I don't remember much of it now but I'm sure x86 isn't as clean or fun as 6809 assembly was.
And, of course, modern compilers will usually produce faster, more optimized code than you ever could, without making any mistakes.
This assertion comes up over and over again in the last 30 years. Every time I've had it asserted to me, it always came from non-assembler programmers, who always wrote in a high level language. I have yet to see evidence of optimizing compilers generating code even remotely close in efficiency to what we would code directly in assembler.
A coder would never write all that extra frame pointer setup code, nor would they waste encoding space and clock cycles shuffling values from one register to another. For example, a human might write the code from the article thus:
Add42: addb $42, %al
ret
and that's it. No frame pointer or stack setup, that's all unnecessary overhead because compiler algorithms can't reliably make such contextual decisions.
>>because compiler algorithms can't reliably make such contextual decisions
They can actually. Compiler optimizations have come long way, even Java's JIT should be able to optimize that. (ok, not using the AL register)
My personal story - I used to use exclusively assembler for 6502 and 8086 as it actually ran fast enough.
In the mid 90s I saw Delphi's code (and Delphi was not known for its optimizations) but it was able to use the Pentium instruction pairing, which takes quite an effort to accomplish by hand.
While beating an old compiler was easy it was the time the compilers began making strides rivaling humans.
Still, hand-written inner loops in assembly might yield some performance (iirc, grep still relies on some), but overall there are a very limited number of settings where there would be a significant difference... to warrant the effort (incl. correctness and [micro]benchmarks)
> My personal story - I used to use exclusively assembler for 6502 and 8086 as it actually ran fast enough. In the mid 90s I saw Delphi's code (and Delphi was not known for its optimizations) but it was able to use the Pentium instruction pairing which takes quite an effort to accomplish by hand.
But a human would almost never use some of those more complex instructions, for a very simple reason: they eat too many clock cycles. When one is coding in assembler, one usually targets two constraints:
1. the least amount of clock cycles needed to pull off an operation;
2. the least amount of bytes to encode the operation.
Where those two meet is where the best coders get unbelievable performance out of the hardware. At least that's the case in the demo scene, although many nowadays cheat by banging on GPUs via CUDA or OpenGL.
Why couldn't they though? Doesn't sound very hard to only generate the wordy prologues and epilogues when necessary (i.e. when you have to save any registers). Why they apparently don't do this is another question then.
I'm not aware of a general algorithm which is capable of deciding whether and how many processor's registers to use instead of setting up frame and stack pointers, and pushing a variable number of arguments on the stack, are you?
A human will know while coding in assembler, at any given time, how many variables are in the game; and will almost always manage to fit them all within processor's registers; There were only two times in my life where I actually had more than eight variables within a subroutine and had to use the stack, and even then, I didn't push everything, but only as many registers as I was actually coming up short, and the rest I still stuffed in the available registers. The other time, I figured out a more efficient algorithm where I could fit everything within the seven general purpose address registers (a0 - a6, since a7 is the stack pointer). A human will also know whether the expected result is within a byte, word, longword, or quadword range, and will only use those instruction and register sizes; a compiler has no chance to figure that out. It's trivial for humans, but as far as I'm aware, impossible for a general compiler algorithm.
In fact, even the best optimizing compilers are so dumb, that one is not allowed to mix and match 8-, 16-, 32- or 64-bit code; one must either compile everything 32- or 64-bit (the linker won't let one link 32- and 64-bit object code together). A human could easily write correct assembler code using all of those instruction / register sizes at once, and we often do.
I have yet to see a compiler capable of inferring that. If you know of one, please show me the generated code. I'd love to use such a compiler.
There isn't really much more to optimise: at higher optimization levels, the compiler will figure out whether that was a one-time operation or not, and if it determines that it was, it will simply hardcode:
movl $52, %eax
but all the extra cruft with stack and frame pointer setup will remain unchanged, and will still be there, if only to comply with the ABI calling conventions. I guesstimate that there are up to 50 clock cycles used for each setup and teardown of the stack and frame pointers; now multiply that by the number of times a function is called, and you can easily waste hundreds of thousands, or even millions of clock cycles pretty much doing no useful work, just housekeeping.
Then you end up with huge code sizes, where calls like
jsr ScrollRaster(pc)
end up repeating the ScrollRaster code over and over and over again, basically leading to macro expansion. And depending on the processor, your code might end up being too large to fit into the instruction cache... and you just kissed instruction burst mode bye-bye. This is unlikely to be the case on modern 80x86 processors as they have staggering amounts of cache, but who knows what embedded platforms and targets people are targeting this very moment as I write this, and who knows how small the instruction and data caches might be on those.
You have to have stack and frame setup if the code is to be ABI / target platform's calling convention compliant. Which compiler are you referring to, because GCC won't remove it, as far as I could tell looking at the generated assembler code?
You could purposely tell GCC to omit the frame pointer with -fomit-frame-pointer, but then you've just made the code extremely difficult to debug (and stack setup code will still be generated, as the compiler can't produce functional code without it), so that's no solution either.
Please provide instructions on how to reproduce / verify the assertion that GCC doesn't generate stack / frame pointer setup. I want to see this for myself.
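For anyone who wants to see it firsthand, a minimal recipe (assuming an x86-64 Linux box with a reasonably recent GCC; file and function names are arbitrary):

    /* leaf.c -- compile with: gcc -O2 -S leaf.c  and inspect leaf.s
       At -O2 this typically becomes just
            leal (%rdi,%rdi), %eax
            ret
       i.e. no pushq %rbp / movq %rsp, %rbp prologue; recompile with -O0
       to see the full frame setup reappear. */
    int twice(int x) {
        return x * 2;
    }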
That's a contrived example of inlining; now watch what happens when you provide more than just a trivial function which can be inlined; look at all the futzing it does with the stack:
Without explicit direction from humans, compilers can't generate code that calls subroutines or functions without using the stack at the very minimum, and that was my point.
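For comparison, a sketch of what happens once inlining is off the table (the noinline attribute is there purely to force GCC's hand; the names are made up):

    /* call.c -- gcc -O2 -S call.c
       With inlining blocked, -O2 emits a real transfer to helper (a call,
       or a jmp if it applies the tail-call transformation), and the
       argument and return value travel in whatever registers the ABI says. */
    __attribute__((noinline)) static int helper(int x) {
        return x + 1;
    }

    int wrapper(int x) {
        return helper(x);
    }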
And that's just GCC on Intel; check out Intel's compiler, it's even worse (click on "turn off intel syntax" and "compile to binary and disassemble the output"):
At this point I'm willing to believe AT&T syntax is still being used purposely to make x86 assembly hard and unpleasant to read and write.
Perhaps so people will want to stay away from it, and in a way, to reduce the amount of code that is tied to the x86 platform.
It spreads the impression that x86 assembly is terrible and ugly.
Intel syntax is much cleaner, in particular, Intel Ideal (as opposed to MASM), and specifically, FASM (flat assembler).
FASM makes it as clean as possible and turns writing assembly into a joy.
As I recall, the "dword ptr" stuff was only necessary if the instruction was otherwise ambiguous. Using EAX means you are using a 32-bit destination. But something like:
mov fs:[ebp-10], 5
is ambiguous. Is that an 8-bit constant? 16 bits? 32 bits?
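Spelled out in Intel syntax, the ambiguity is resolved with an explicit size keyword (the displacement and constant are taken from the example above):

    mov byte ptr fs:[ebp-10], 5     ; store an 8-bit 5
    mov word ptr fs:[ebp-10], 5     ; store a 16-bit 5
    mov dword ptr fs:[ebp-10], 5    ; store a 32-bit 5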
All of those are misleading because the FS segment override isn't specific to an operand. It applies to the whole instruction, which commonly has one place (a memory reference) for the override to take effect. You can have more than one override, but only the last one remains active. Normally you can have an override even if it isn't used. There are a few instructions with more than one memory access; the override only applies to one access and you don't get to choose which one.
How would you disassemble that instruction with more than one segment prefix in front of it? The hardware accepts this by ignoring all but the final segment prefix. For example, the prefixes might be: FS, REP, GS, FS, FS
Note that code can jump past some of the prefixes. The C library on Linux does this to bypass prefixes. Reasonable assembly syntax needs to be able to describe this. You need to be able to put a label right after a prefix.
Nice article! It takes a subject that is scary for many and does a great job of explaining a little bit of it very clearly, using good visual aids. I look forward to the next articles!
"To write code that runs directly on your microprocessor you need to know how memory segmentation works"
Although you can't completely ignore segments, in practice at least on Linux the only segments in use are user code/data and kernel code/data segments.
Does anyone know why the author might suggest that understanding segmentation is necessary to write Assembly code?
Probably because you need to deal with the MMU. You can't just write raw assembly and expect it to work (ignoring the MMU), but the kernel takes care of that for you.
I am not following - what is raw assembly? I can write a complete userland program using only Assembly and it will run just fine. When do I need to deal with the MMU exactly?
When writing programs for user mode (ring 3 on x86), you hardly need to care, except for occasionally using segment override prefixes (cf. https://news.ycombinator.com/item?id=13052076), which pedantically is "dealing with the MMU", since it's the MMU that makes this work; but in my opinion it's not necessary to understand the technical details of why it works.
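If you want one concrete user-mode example of such a segment override, thread-local storage is where most people meet it. A sketch (assuming x86-64 Linux, where TLS is typically addressed through %fs; names are made up):

    /* tls.c -- gcc -O2 -S tls.c
       The access to counter typically shows up in the asm as something like
       %fs:counter@tpoff -- a segment override that compiler and kernel set
       up for you; the program itself never touches descriptor tables. */
    __thread int counter;

    int bump(void) {
        return ++counter;
    }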
On the other hand, if you are an OS ("operating system", here I don't mean "open source") developer, you'd better know the details of the MMU.
Concerning https://news.ycombinator.com/item?id=13052892: I also consider the author's statement that one has to know how segmentation works misleading. Knowledge of segmentation is absolutely necessary for x86-16 (real mode), which many people tend to associate with assembly (because there seem to be many more assembly tutorials available for DOS/x86-16 than for x86-32 or even x86-64), but it is hardly relevant for people who just write user-mode code.
Reading assembly language is about having the computer in your head, just like regular programming. You read and execute the instructions just as you would in any other language; it's just that the operations are that much smaller and less abstract. Each instruction is a 'function call'; as in any low-level language, you leverage abstraction to build these operations up into larger units of execution, using macros and functions in logical ways to get the outcome you want.
It's not magic. The best way to learn assembly is to program in it. I learned on the Game Boy by getting a job and programming two games in it. Fun as hell, especially when the machine is small enough to really fit in your head and clock cycles count at 4 MHz.
When I worked at Intel in the Server BIOS group, the development process involved several iterations of developing macro abstractions until the x86 ASM code became more readable and maintainable.
No one particularly enjoyed working in raw ASM 100% of the time.
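For readers who haven't seen that style, the idea is roughly this kind of thing (a made-up MASM-style sketch, not actual BIOS code):

    ; wrap a recurring "write a byte to an I/O port" pattern in a macro
    OUTB    MACRO   port, value
            mov     al, value
            mov     dx, port
            out     dx, al
            ENDM

    ; call sites then read as intent instead of register shuffling:
            OUTB    70h, 0Ah        ; select CMOS register 0Ah

Layer a few of these and the raw instruction soup starts to look like a small domain-specific language.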
This is the first time I actually read all the way through an article on assembly. It was nice and concise. Granted, I'll probably forget this until the next article (due to focusing on other studies), but thank you nonetheless.
The examples seem odd to me: the argument order is reversed from every Intel disassembly (or assembler) I've ever used. "add edi, 43" is the normal way to say "add 43 to edi". The destination register is normally first and the source register second in the disassembly, right?
The examples in the article are in AT&T syntax, you seem to be used to Intel syntax. Just keep on reading, the article will discuss the differences between AT&T and Intel syntax.
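A quick side-by-side of the same operation in both syntaxes (register and constant taken from the comment above):

    add  edi, 43        ; Intel syntax: destination first, no decoration
    addl $43, %edi      # AT&T syntax: source first, $ on immediates, % on registers, operand size as a suffix on the mnemonic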
If you're experimenting with asm in Crystal, you might want to use --prelude=empty to remove the standard library and make the asm output cleaner. You can then require lib_c and use that directly.
I tried that while researching the article, but found the call to "puts" doesn't link without the standard library code. And without a "puts" or similar call to produce output LLVM optimized the entire program away :)
I suppose this could work if, as you suggest, I manually called out to a lib_c function like printf instead.
Yeah, puts is part of the standard library, and uses Crystal's evented IO framework, fiber scheduler and libevent. This is what most of the extra code in the asm output will be doing.
https://godbolt.org/
You should also try writing a script for counting instructions in binaries. It's pretty illuminating. Here are some sample statistics https://webcache.googleusercontent.com/search?q=cache:j0gebK...
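If you'd rather not start the script from scratch, something along these lines works on Linux (assuming GNU objdump; the binary path is just an example):

    # tally instruction mnemonics in a binary, most frequent first
    objdump -d --no-show-raw-insn /bin/ls \
      | awk '/^[[:space:]]+[0-9a-f]+:/ { print $2 }' \
      | sort | uniq -c | sort -rn | head -20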