
The Assembler Developer's Kit - mcbits
http://www.plantation-productions.com/Webster/RollYourOwn/index.html
======
userbinator
_The ADK contains over 75,000 lines of code that an assembler author will not
have to write themselves._

FASM ( [https://flatassembler.net/](https://flatassembler.net/) ) is also
written in Asm, and far less than 75kLoC. It is also a complete assembler.

IMHO HLA is not what I would consider "real" Asm --- it's more like an
intermediate language between Asm and C. The syntax is somewhat like what
you'd get if machine instructions were C functions. I don't think I've
actually seen any projects use it, besides itself.

I used to recommend the author's Art of Assembly book, before he switched to
using HLA.

~~~
pjmlp
TI has an assembler for some of their processors that also looks like that.

------
gspetr
Last time I seriously considered giving ASM a go, I checked Randall Hyde's
[author of this developer's kit] books on Amazon and the majority of reviewers
had the following problem with it - it has neither Microsoft's nor AT&T's
syntax but instead something that you have slim odds of encountering in the
real world, making things weird.

Which is why I also ultimately passed on it and looked into MASM instead,
hoping to have fun reversing games on Windows.

------
bogomipz
I read through chapter 15 on the "back end" of the assembler. I was hoping to
read about how the opcodes got turned into binary codes during code
generation, however I didn't see any literature on that part of the topic.

How is that done in practice, is it just a giant lookup table for the
particular ISA being targeted? Does anyone have a good resource for learning
about this part of the back end?

~~~
simias
I've never implemented an assembler but I did implement a few toy
disassemblers and emulators and you'd need a bit more than a simple LUT for
many architectures.

In particular some ISAs have "modular" opcodes so to speak, where a machine
opcode contains several fields that can be mixed and matched. For instance the
ARM instruction set has a 4 bit condition code in each instruction so "add",
"addeq" (add if equal) and "addne" (add if not equal) are basically the same
opcode with a different condition flag.
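
As a rough illustration of the condition field described above, here is a minimal Python sketch, assuming the classic 32-bit ARM data-processing layout (condition in bits 31..28, opcode in bits 24..21, Rn in bits 19..16, Rd in bits 15..12, Rm in the low bits). The helper name `encode_add` and the `COND` table are made up for this example and only cover three conditions.

```python
# Condition codes occupy bits 31..28 of every classic 32-bit ARM instruction.
# Only three of them are listed here; "al" (always) is the unconditional case.
COND = {"eq": 0x0, "ne": 0x1, "al": 0xE}

ADD_OPCODE = 0x4  # data-processing opcode field (bits 24..21) for ADD


def encode_add(cond, rd, rn, rm):
    """Encode `add<cond> rd, rn, rm` (plain register operand, no shift, S bit clear)."""
    return (COND[cond] << 28) | (ADD_OPCODE << 21) | (rn << 16) | (rd << 12) | rm


if __name__ == "__main__":
    for suffix in ("al", "eq", "ne"):
        word = encode_add(suffix, rd=0, rn=1, rm=2)
        print(f"add{suffix} r0, r1, r2 -> {word:#010x}")
```

The three words differ only in their top nibble, so the assembler can treat the condition as a separate field rather than as three separate table entries.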

Furthermore not all instructions have necessarily the same layout, so your
"LUT" must also contain enough information on how to pack the opcode and its
operands correctly.
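
One way to picture a table that also carries layout information is the sketch below, which uses MIPS as the example ISA. The `TABLE` structure, the `encode` helper and the tiny register map are hypothetical names invented for this illustration, not taken from any particular assembler; only the R-type and I-type field positions and the three opcodes shown come from the MIPS I encoding.

```python
# A tiny register map, just enough for the examples.
REG = {"$0": 0, "$t0": 8, "$t1": 9, "$t2": 10}

# mnemonic -> (format, opcode, funct). The format tells the encoder how to
# pack the operands; funct is only meaningful for R-type instructions.
TABLE = {
    "addu":  ("R", 0x00, 0x21),
    "addiu": ("I", 0x09, None),
    "lui":   ("I", 0x0F, None),
}


def encode(mnemonic, rd=0, rs=0, rt=0, imm=0):
    """Pack one instruction according to the format recorded in TABLE."""
    fmt, opcode, funct = TABLE[mnemonic]
    if fmt == "R":
        # R-type: opcode | rs | rt | rd | shamt (0 here) | funct
        return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | funct
    # I-type: opcode | rs | rt | 16-bit immediate
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


if __name__ == "__main__":
    print(hex(encode("addu", rd=REG["$t0"], rs=REG["$t1"], rt=REG["$t2"])))
    print(hex(encode("addiu", rt=REG["$t0"], rs=REG["$0"], imm=0x42)))
```

The lookup only selects a row; the row's format field then decides which packing code runs, which is the "bit more than a simple LUT" part.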

Lastly assembly opcodes don't necessarily map one-to-one with machine code,
it's not rare that a single assembly directive would result in multiple
machine opcodes.

For instance in MIPS assembly "li $t0, 0x42" will generate a single opcode
(probably something like "addiu $t0, $0, 0x42") while "li $t0, 0x12345" will
generate two ("lui $t0, 1; addiu $t0, $t0, 0x2345"). In ARM Thumb the "bl"
instruction is actually encoded with two successive 16bit opcodes.
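
The "li" expansion could be sketched roughly like this in Python. The function name `expand_li` and the tuple representation are assumptions made for the example. Note that it uses "ori" rather than "addiu" for the low half, a deliberate swap: "addiu" sign-extends its immediate, so a low half of 0x8000 or above would otherwise need the upper half adjusted.

```python
def expand_li(reg, value):
    """Expand the `li reg, value` pseudo-instruction into real MIPS instructions.

    Returns a list of tuples: (mnemonic, operand, ...).
    """
    if -0x8000 <= value <= 0x7FFF:
        # Fits in a sign-extended 16-bit immediate: one instruction is enough.
        return [("addiu", reg, "$0", value)]
    if 0 <= value <= 0xFFFF:
        # Fits in a zero-extended 16-bit immediate.
        return [("ori", reg, "$0", value)]
    value &= 0xFFFFFFFF
    hi, lo = value >> 16, value & 0xFFFF
    ops = [("lui", reg, hi)]  # load the upper 16 bits, low half becomes zero
    if lo:
        ops.append(("ori", reg, reg, lo))  # fill in the lower 16 bits
    return ops


if __name__ == "__main__":
    for v in (0x42, 0x12345):
        print(hex(v), expand_li("$t0", v))
```

Because the number of emitted instructions depends on the operand value, the assembler has to run this expansion before it can assign final addresses to labels.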

EDIT: actually I just remembered that I did implement a very basic MIPS I
assembler, here's the codegen portion:
[https://github.com/simias/rustation/blob/master/src/assemble...](https://github.com/simias/rustation/blob/master/src/assembler.rs#L340)

MIPS is a really simple ISA though, I suspect an assembler for x86 would be a
lot more... more.

~~~
bogomipz
Interesting about the ARM modular opcodes.

>"Lastly assembly opcodes don't necessarily map one-to-one with machine code,
it..."

Sure, I didn't mean to imply there was a 1 to 1 mapping between opcode and
machine code, only that there must be some indexing into a LUT as an initial
step.

>"MIPS is a really simple ISA though, I suspect an assembler for x86 would be
a lot more"

Indeed the x86 ISA is not for the faint of heart.

Thanks for the link to your MIPS assembler, very cool.

~~~
simias
ARM has actually a lot more modularity in its opcodes, in particular it has a
whole bunch of different addressing modes for its ALU operands (register and
register, register and immediate, register shifted by a register, register
shifted by an immediate, register rotated by a register, etc...).

There's also the "flag set" bit which says if the instruction is supposed to
update the CPSR status flags or not. And then there's the conditional
execution flags I talked about in my previous post.

So with all these combinations you can easily end up with hundreds of possible
encodings for an "add" mnemonic.
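
To get a feel for the numbers, here is a small Python sketch that just enumerates the assembly-level forms of "add" from three independent choices: condition, flag-setting suffix and second-operand form. The lists and the textual spelling of each variant (suffix order, `Rd`/`Rn`/`Rm`/`Rs` placeholders) are illustrative assumptions, and the RRX form is left out, so the total is a ballpark rather than an exact count of bit patterns.

```python
from itertools import product

# "" stands for the unconditional (always) case.
CONDITIONS = ["", "eq", "ne", "cs", "cc", "mi", "pl", "vs", "vc",
              "hi", "ls", "ge", "lt", "gt", "le"]
SET_FLAGS = ["", "s"]  # whether the instruction updates the status flags
SHIFTS = ["lsl", "lsr", "asr", "ror"]

# Forms the second ALU operand can take.
OPERAND2 = (
    ["#imm", "Rm"]
    + [f"Rm, {shift} #imm" for shift in SHIFTS]  # register shifted by immediate
    + [f"Rm, {shift} Rs" for shift in SHIFTS]    # register shifted by register
)

variants = [
    f"add{cond}{s} Rd, Rn, {op2}"
    for cond, s, op2 in product(CONDITIONS, SET_FLAGS, OPERAND2)
]

if __name__ == "__main__":
    print(len(variants), "variants, for example:", variants[-1])
```

Since the three choices are independent, an encoder can keep one small table per field and combine them, instead of storing every combination.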

And you have similar shenanigans for memory access instructions (load word,
load word and increment after, load word and increment before, load byte and
sign extend...)

So you'll probably end up using a combination of multiple LUTs and a bit of
code to put it all together.

~~~
bogomipz
So it sounds like the addressing modes on ARM are as varied as on x86 then?

~~~
simias
That's very much possible, but I'll admit that I never really looked very
closely at the x86 instruction encoding, mainly because I value my sanity.
Good thing I mostly work with ARM environments these days...

~~~
bogomipz
Can I ask what you develop? I'm also curious how long it took you to be
comfortable developing on ARM?

~~~
simias
Embedded development, which nowadays is synonymous with ARM.

But my knowledge of the instruction encoding comes mainly from writing this
pocketstation emulator:
[https://github.com/simias/pockystation/](https://github.com/simias/pockystation/)

It's not complete yet unfortunately, I'll have to get back to that.

