Bonus quirk: there's BSF/BSR, for which the Intel SDM states that on zero input, the destination has an undefined value. (AMD documents that the destination is not modified in that case.) And then there's glibc, which happily uses the undocumented fact that the destination is also unmodified on Intel [1]. It took me quite some time to track down the issue in my binary translator. (There's also TZCNT/LZCNT, which is BSF/BSR encoded with F3-prefix -- which is silently ignored on older processors not supporting the extension. So the same code will behave differently on different CPUs. At least, that's documented.)
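For illustration, a minimal sketch (GCC-style inline asm; my own example, not glibc's actual code) of the kind of pattern that only works if the destination survives a zero input:

    /* relies on BSF leaving the destination untouched for a zero input
       (documented by AMD, undocumented-but-true on Intel) */
    static unsigned ctz_or_32(unsigned x)
    {
        unsigned r = 32;                       /* fallback for x == 0 */
        __asm__ ("bsf %1, %0" : "+r"(r) : "rm"(x) : "cc");
        return r;                              /* bit index, or 32 if x == 0 */
    }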
Encoding: People often complain about prefixes, but IMHO, that's far from the worst thing. It is well known and somewhat well documented. There are worse quirks: For example, REX/VEX/EVEX.RXB extension bits are ignored when they do not apply (e.g., MMX registers); except for mask registers (k0-k7), where they trigger #UD -- also fine -- except if the register is encoded in ModRM.rm, in which case the extension bit is ignored again.
APX takes the number of quirks to a different level: the REX2 prefix can encode general-purpose registers r16-r31, but not xmm16-xmm31; the EVEX prefix has several opcode-dependent layouts; and which extension bits apply to a register depends on the register type (XMM registers use X3:B3:rm and V4:X3:idx; GP registers use B4:B3:rm, X4:X3:idx). I can't give a complete list yet; I still haven't finished my APX decoder after a year...
On and off over the last year I have been rewriting QEMU's x86 decoder. It started as a necessary task to incorporate AVX support, but I am now at a point where only a handful of opcodes are left to rewrite, after which it should not be too hard to add APX support. For EVEX my plan is to keep the raw bits until after the opcode has been read (i.e. before immediates and possibly before modrm) and the EVEX class identified.
My decoder is mostly based on the tables in the manual, and the code is mostly okay—not too much indentation and phases mostly easy to separate/identify. Because the output is JITted code, it's ok to not be super efficient and keep the code readable; it's not where most of the time is spent. Nevertheless there are several cases in which the manual is wrong or doesn't say the whole story. And the tables haven't been updated for several years (no K register instructions, for example), so going forward there will be more manual work to do. :(
(As I said above, there are still a few instructions handled by the old code predating the rewrite, notably BT/BTS/BTR/BTC. I have written the code but not merged it yet).
Thanks for the pointer to QEMU's decoder! I actually never looked at it before.
So you coded all the tables manually in C -- interesting, that's quite some effort. I opted to autogenerate the tables (and keep them as data only => smaller memory footprint) [1,2]. That's doable, because x86 encodings are mostly fairly consistent. I can also generate an encoder from it (ok, you don't need that). Re 'custom size "xh"': AVX-512 also has fourth and eighth. Also interesting that you have a separate row for "66+F2". I special case these two (CRC32, MOVBE) instructions with a flag.
I think the prefix decoding is not quite right for x86-64: 26/2e/36/3e are ignored in 64-bit mode, except for 2e/3e as branch-not-taken/taken hints and 3e as notrack. (See SDM Vol. 1 3.3.7.1 "Other segment override prefixes (CS, DS, ES, and SS) are ignored.") Also, REX prefixes that don't immediately precede the opcode (or VEX/EVEX prefix) are ignored. Anyhow, I need to take a closer look at the decoder with more time. :-)
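A byte-level illustration of the REX placement rule (encodings worked out by hand from the SDM tables, so treat them as a sketch):

    /* 89 C8 = mov eax, ecx; 48 = REX.W; 2E = CS override (ignored in 64-bit mode) */
    static const unsigned char rex_used[]    = { 0x2E, 0x48, 0x89, 0xC8 }; /* REX.W adjacent to opcode -> mov rax, rcx */
    static const unsigned char rex_dropped[] = { 0x48, 0x2E, 0x89, 0xC8 }; /* REX.W not adjacent -> ignored -> mov eax, ecx */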
> For EVEX my plan is to keep the raw bits until after the opcode has been read
I came to the same conclusion that this is necessary with APX. The map+prefix+opcode combination identifies how the other fields are to be interpreted. For AVX-512, storing the last byte was sufficient, but with APX, vvvv got a second meaning.
> Nevertheless there are several cases in which the manual is wrong or doesn't say the whole story.
Yes... especially for corner cases, getting real hardware is the only reliable way to find out how the CPU behaves.
It looks like someone started with Intel's XED code (which relies on custom tables to specify instructions, and compiles that to C tables at compile time) and hand-minimized the code into a single file. I'm guessing it's designed to never have any more code added to it.
> interesting that you have a separate row for "66+F2"
Yeah that's only for 0F38F0 to 0F38FF.
> Re 'custom size "xh"': AVX-512 also has fourth and eighth
Also AVX for VPMOVSX and VPMOVZX but those are handled differently. I probably should check if xh is actually redundant... EDIT: it's only needed for VCVTPS2PH, which is the only instruction with a half-sized destination.
> I think the prefix decoding is not quite right for x86-64: 26/2e/36/3e are ignored in 64-bit mode
Interesting, I need to check how they interact with the FS/GS prefixes (64/65).
> REX prefixes that don't immediately preceed the opcode (or VEX/EVEX prefix) are ignored
> how they interact with the FS/GS prefixes (64/65)
For memory operations, they are ignored: 64-2e-65-3e gives 65 as segment override. (From my memory and the resulting implementation, I did some tests with hardware a few years back.)
I do need to check myself how 2e/3e on branches interact with other segment overrides, though.
Another bonus quirk, from the 486 and Pentium era...
BSWAP EAX converts from little endian to big endian and vice versa. It was a 32-bit instruction to begin with.
However, we have the 0x66 prefix that switches between 16- and 32-bit operand size. If you apply that to BSWAP EAX, undefined, funky things happen.
On some CPUs (Intel vs. AMD differed) the prefix was just ignored. On others it did something I call an "inner swap": of the four bytes stored in EAX, bytes 1 and 2 are swapped.
Unfortunately I have not found any evidence of, nor reason to believe, that this "inner swap" behaviour you mention exists on any CPU -- except perhaps in some flawed emulators?
A fun thing is that e.g. "cmp ax, 0x4231" differs from "cmp eax, 0x87654321" only in the presence of the data16 prefix, and thus in the size of the immediate; it's the only significant case (I think?) of a prefix changing the total instruction size, and thus, for some such instructions, the 16-bit version is sometimes (but not always!) significantly slower. "But not always" as in: if you try to microbenchmark a loop of such, sometimes you can have entire microseconds of it consistently running at 0.25 cycles/instr avg, and sometimes that same exact code (in the same process!) will measure at 3 cycles/instr (tested on Haswell, but uops.info indicates this happens on all non-Atom Intel since Ivy Bridge).
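For reference, the two encodings side by side (bytes per the SDM opcode map; my own example, double-check before relying on it):

    /* same opcode 3D (cmp eAX, imm), but the 66 prefix shrinks the
       immediate from 4 bytes to 2, changing the total instruction length */
    static const unsigned char cmp_ax_imm16[]  = { 0x66, 0x3D, 0x31, 0x42 };       /* cmp ax,  0x4231     */
    static const unsigned char cmp_eax_imm32[] = { 0x3D, 0x21, 0x43, 0x65, 0x87 }; /* cmp eax, 0x87654321 */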
Probably, if the uops come from the uop cache you get the fast speed since the prefix and any decoding stalls don't have any impact in that case (that mess is effectively erased in the uop cache), but if it needs to be decoded you get a stall due to the length changing prefix.
Whether a bit of code comes from the uop cache is highly dependent on alignment, surrounding instructions, the specific microarchitecture, microcode version and even more esoteric things like how many incoming jumps target the nearby region of code (and in which order they were observed by the cache).
Yep, a lot of potential contributors. Though, my test was of a single plain 8x unrolled loop doing nothing else, running for tens of thousands of iterations to take a total of ~0.1ms, i.e. should trivially cache, and yet there's consistent inconsistency.
Did some 'perf stat'ting, comparing the same test with "cmp eax,1000" vs "cmp ax,1000"; per instruction, idq.mite_uops goes 0.04% → 35%, and lsd.uops goes 90% → 54%; so presumably sometimes the loop somehow makes it into the LSD, at which point dropping out of it is hard, while other times it perpetually gets stuck at MITE? (test is of 10 instrs - 8 copies of the cmp, and dec+jne that'd get fused, so 90% uops/instr makes sense)
> The following alignment situations can cause LCP stalls to trigger twice:
> · An instruction is encoded with a MODR/M and SIB byte, and the fetch line boundary crossing is between the MODR/M and the SIB bytes.
> · An instruction starts at offset 13 of a fetch line references a memory location using register and immediate byte offset addressing mode.
So that's the order of funkiness to be expected, fun.
> False LCP stalls occur when (a) instructions with LCP that are encoded using the F7 opcodes, and (b) are located at offset 14 of a fetch line. These instructions are: not, neg, div, idiv, mul, and imul. False LCP experiences delay because the instruction length decoder can not determine the length of the instruction before the next fetch line, which holds the exact opcode of the instruction in its MODR/M byte.
The "true" LCP stall for the F7 opcodes would be "test r16,imm16", but due to the split opcode info across the initial byte & ModR/M, the other F7's suffer too.
Did people just... do this by hand (in software), transistor by transistor, or was it laid out programmatically in some sense? As in, were segments created algorithmically, then repeated to obtain the desired outcome? CPU design baffles me, especially considering there are 134 BILLION transistors or so in the latest i7 CPU. How does the team even keep track of, work on, or even load the files to WORK on the CPUs?
It's written in an HDL; IIRC both Intel and AMD use verilog. A modern core is on the order of a million or so lines of verilog.
Some of that will be hand placed, quite a bit will just be thrown at the synthesizer. Other parts like SRAM blocks will have their cad generated directly from a macro and a description of the block in question.
To further expound on this. ASIC (like AMD CPUs) is a lot like software work. The engineers that create a lot of the digital logic aren't dealing with individual transistors, instead they are saying "give me an accumulator for this section of code" and the HDL provides it. The definition of that module exists elsewhere and is shared throughout the system.
This is how the complexity can be wrangled.
Now, MOST of the work is automated for digital logic. However, we live in an analog world. So, there is (As far as I'm aware) still quite a bit of work for analog engineers to bend the analog reality into digital. In the real world, changing current creates magnetic fields which means you need definitions limiting voltages and defining how close a signal line can be to avoid cross talk. Square waves are hard to come by, so there's effort in timing and voltage bands to make sure you aren't registering a "1" when it should have been a "0".
Several of my professors were intel engineers. From what they told me, the ratios of employment were something like 100 digital engineers to 10 analog engineers to 1 Physicist/materials engineer.
They use EDA (Electronic Design Automation) software, there are only a handful of vendors, the largest probably being Mentor Graphics, now owned by Siemens. So, yes, they use automation to algorithmically build and track/resolve refactors as they design CPUs. CPUs are /generally/ block-type designs these days, so particular functions get repeated identically in different places and can be somewhat abstracted away in your EDA.
It's still enormously complex, and way more complex than the last time I touched this stuff more than 15 years ago.
A lot of this is done in software (microcode). But even with that case, your statement still holds: "Can you imagine having to make all this logic work faithfully, let alone fast, in the chip itself?" Writing that microcode must be fiendishly difficult given all the functional units, out of order execution, register renaming...
The crazy parts that were mentioned in the parent comment are all part of the hot path. Microcode handles slow paths related to paging and segmentation, and very rare instructions. Not necessarily unimportant (many common privileged instructions are microcoded) but still rare compared to the usual ALU instructions.
But it's not a huge deal to program the quirky encoding in an HDL, it's just a waste of transistors. The really complicated part is the sequencing of micro operations and how they enter the (out of order) execution unit.
No, that hasn't been the case for more than 30 years. Microcode is only used for implementing some complex instructions (mostly system instructions). Most regular instructions (and the rest of the core) don't use microcode and their expansions into uOps are hardwired. Also the entire execution unit is hardwired.
There are typically some undocumented registers (MSRs on x86) that can control how the core behaves (e.g., kill switches for certain optimizations). These can then be changed by microcode updates.
The semantics of LZCNT combined with its encoding feels like an own goal: it’s encoded as a BSR instruction with a legacy-ignored prefix, but for nonzero inputs its return value is the operand size minus one, minus the return value of the legacy version. Yes, clz() is a function that exists, but the extra subtraction in its implementation feels like a small cost to pay for extra compatibility when LZCNT could’ve just been BSR with different zero-input semantics.
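Concretely, for a nonzero 32-bit input, LZCNT(x) == 31 - BSR(x). A minimal sketch of a BSR-based clz (GCC-style inline asm; the function name is mine, and a nonzero input is assumed):

    /* clz for nonzero x via the legacy instruction; the "31 -" is the
       extra subtraction being discussed */
    static unsigned clz_via_bsr(unsigned x)    /* caller guarantees x != 0 */
    {
        unsigned idx;
        __asm__ ("bsr %1, %0" : "=r"(idx) : "rm"(x) : "cc");
        return 31 - idx;
    }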
I'm not following: as long as you are introducing a new, incompatible instruction for leading zero counting, you'd definitely choose LZCNT over BSR as LZCNT has definitely won in retrospect over BSR as the primitive for this use case. BSR is just a historical anomaly which has a zero-input problem for no benefit.
What would be the point of offering a new variation BSR with different input semantics?
When it comes to TZCNT vs BSF, they are just compatible enough for a new compiler to use unconditionally (if we assume that BSF with a zero input leaves its output register unchanged, as it has for decades, and as documented by AMD who defined LZCNT): the instruction sequence sketched below
behaves identically on everything from the original 80386 up and is better on superscalars with TZCNT support due to avoiding the false dependency on ECX. The reason for that is that BSF with a nonzero input and TZCNT with a nonzero input have exactly the same output. That’s emphatically not true of BSR and LZCNT, so we’re stuck relegating LZCNT to compiler flags.
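(A sketch of such a sequence, as I understand the description -- my reconstruction, assuming 32 is preloaded as the zero-input fallback; GCC-style inline asm, which may need BMI enabled at assembly time:)

    static unsigned tzcnt_compat(unsigned x)   /* trailing zeros, 32 for x == 0 */
    {
        unsigned r = 32;   /* old CPUs: BSF leaves this in place when x == 0;
                              new CPUs: TZCNT writes 32 itself */
        /* assembles to F3 0F BC: TZCNT where supported, plain BSF (the F3
           prefix is ignored) on anything older, with identical results */
        __asm__ ("tzcnt %1, %0" : "+r"(r) : "rm"(x) : "cc");
        return r;
    }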
TZCNT and BSF are not completely identical even for non-zero input: BSF sets the ZF when the input is zero, TZCNT sets ZF when the output is zero (i.e., the least significant bit is set).
LZCNT and TZCNT are corrections (originally introduced by AMD) for the serious mistake made by the designers of the Intel 80386 when they defined BSF and BSR.
Because on the very slow 80386 the wrong definition for a null input did not matter much, they failed to foresee how bad it would become for future pipelined and superscalar CPUs, where having to insert a test for a null input can slow a program down many times.
Nevertheless, they should have paid more attention to earlier uses of such instructions. For instance, the Cray-1 had defined LZCNT the right way almost ten years earlier.
I know nothing about this space, but it would be interesting to hook up a JTAG interface to an x86 CPU and then step instruction by instruction and record all the register values.
You could then use this data to test whether or not your emulator perfectly emulates the hardware by running the same program through the emulated CPU and validate the state is the same at every instruction.
No need to JTAG; you can test most of the processor in virtualization. The only part you can’t check is the interface for that itself if you’re aiming to emulate it (most don’t). (Also I’m pretty sure the debugging interface is either fused off or locked out on modern Intel processors.)
The BSF/BSR quirk is annoying, but I think the reason for it is that they were only thinking about it being used in a loop (or maybe an if) with something like:
int mask = something;
...
for (int index; _bit_scan_forward(&index, mask); mask ^= 1<<index) {
...
}
Since it sets the ZF on a zero input, they thought that must be all you need. But there are many other uses for (trailing|leading) zero count operations, and it would have been much better for them to just write the register anyway.
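(For completeness, a runnable variant of that loop using a portable builtin instead of a raw BSF intrinsic -- the mask value is just a made-up example:)

    #include <stdio.h>

    int main(void)
    {
        unsigned mask = 0x1234;              /* stand-in for "something" */
        while (mask) {
            int index = __builtin_ctz(mask); /* trailing zero count, i.e. BSF */
            printf("bit %d is set\n", index);
            mask ^= 1u << index;             /* clear the bit we just handled */
        }
        return 0;
    }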
What a cool person. I really enjoy writing assembly, it feels so simple and I really enjoy the vertical aesthetic quality.
The closest I've ever come to something like OP (which is to say, not close at all) was when I was trying to help my JS friend understand the stack, and we ended up writing a mini vm with its own little ISA: https://gist.github.com/darighost/2d880fe27510e0c90f75680bfe...
This could have gone much deeper - i'd have enjoyed that, but doing so would have detracted from the original educational goal lol. I should contact that friend and see if he still wants to study with me. it's hard since he's making so much money doing fancy web dev, he has no time to go deep into stuff. whereas my unemployed ass is basically an infinite ocean of time and energy.
Thank you for this. Even though it took me some time to understand what's going on, it led me on a series of cool (and very challenging) assembly journeys. As a non-CS major, this code is a really nice entry point for me to start understanding how things work; I will definitely dig deeper.
I think both are useful, but designing a modern CPU from the gate level is out of reach for most folks, and I think there's a big gap between the sorts of CPUs we designed in college and the sort that run real code. I think creating an emulator of a modern CPU is a somewhat more accessible challenge, while still being very educational even if you only get something partially working.
I think the key word above was modern. I felt able to design a simple CPU when I finished my Computer Architecture course in university. I think I forgot most of it by now ;) There are a few basic concepts to wrap your head around but once you have them a simple CPU is doable. Doing this with TTL or other off the shelf components is mostly minimizing/adapting/optimizing to those components (or using a lot of chips ;) ). I have never looked at discrete component CPU designs, I imagine ROM and RAM chips play a dominant part (e.g. you don't just built RAM with 74x TTL flip-flops).
He probably used off-the-shelf RAM chips, after all, RAM is not part of the CPU.
In the early 70s, before the internet, even finding the information needed would be a fair amount of work.
I learned how flip flops worked, adders, and registers in college, and that could be extended to an ALU. But still, that was in college, not high school.
I've read some books on computer history, and they are frustratingly vague about how the machines actually worked. I suspect the authors didn't actually know. Sort of like the books on the history of Apple that gush over Woz's floppy disk interface, but give no details.
And the sort that are commercially viable in today's marketplace. The nature of the code has nothing to do with it. The types of machines we play around with today surpass the machines we used to land men on the moon. What's not "real code" about that?
This is an illusion and a red herring. RTL synthesis is the typical functional prototype stage reached which is generally sufficient for FPGA work. To burn an ASIC as part of an educational consortium run is doable, but it's uncommon.
Seconded. A microcoded, pipelined, superscalar, branch-predicting basic processor with L1 data & instruction caches and write-back L2 cache controller is nontrivial. Most software engineers have an incomplete grasp of data hazards, cache invalidation, or pipeline stalls.
IIRC, reading some Intel CPU design history, some of their designers are from a CS/software background. But I agree. Software is naturally very sequential, which is different from digital hardware, which is naturally/inherently parallel. A clock can change the state of a million flip-flops all at once; it's a very different way of thinking about computation (though of course at the theoretical level it's all the same), and then there's the physics and EE parts of a real-world CPU. Writing software and designing CPUs are just very different disciplines, and the CPU as it appears to the software developer isn't how it appears to the CPU designer.
I'm not talking about Intel's engineers, I'm saying very few software engineers today understand how a processor works. I'm sure Intel hires all sorts of engineers for different aspects of each ecosystem they maintain. Furthermore, very few software engineers have ever even touched a physical server because they're sequestered away from a significant fraction of the total stack.
Speculative and out-of-order execution requires synchronization to maintain dataflow and temporal consistency.
Computer Organization is a good book should anyone want to dive deeper.
Well, I think you're both right. It's satisfying as heck to sling 74xx chips together and you get a feel for the electrical side of things and internal tradeoffs.
When you get to doing that for the CPU that you want to do meaningful work with, you start to lose interest in that detail. Then the complexities of the behavior and spec become interesting and the emulator approach is more tractable, can cover more types of behavior.
I think trollied is correct actually. I work on a CPU emulator professionally and while it gives you a great understanding of the spec there are lots of details about why the spec is the way it is that are due to how you actually implement the microarchitecture. You only learn that stuff by actually implementing a microarchitecture.
Emulators tend not to have many features that you find in real chips, e.g. caches, speculative execution, out-of-order execution, branch predictors, pipelining, etc.
This isn't "the electrical side of things". When he said "gate level" he meant RTL (SystemVerilog/VHDL) which is pretty much entirely in the digital domain; you very rarely need to worry about actual electricity.
So far on my journey through Nand2Tetris (since I kind of dropped out of my real CS course) I've worked my way up from the gate level, and just finished the VM emulator chapter, which took an eternity. Now onto compilation.
OTOH, are you really going to be implementing memory segmenting in your gate-level CPU? I'd say actually creating a working CPU and _then_ emulating a real CPU (warts and all) are both necessary steps to real understanding.
This. I mean, why not start with wave theory and material science if you really want a good understanding :)
In my CS course I learned a hell of a lot from creating a 6800 emulator; though it wasn't on the course, building a working 6800 system was. The development involved running an assembler on a commercial *nix system and then typing the hex object code into an EPROM programmer. You get a lot of time to think about where your bugs are when you have to wait for a UV erase cycle...
I've written fast emulators for a dozen non-toy architectures and a few JIT translators for a few as well. x86 still gives me PTSD. I have never seen a messier architecture. There is history, and a reason for it, but still ... damn
Studying the x86 architecture is kind of like studying languages with lots of irregularities and vestigial bits, and with competing grammatical paradigms, e.g. French. Other architectures, like RISC-V and ARMv8, are much more consistent.
> Other architectures, like [...] ARMv8, are much more consistent.
From an instruction/operation perspective, AArch64 is more clean. However, from an instruction operand and encoding perspective, AArch64 is a lot less consistent than x86. Consider the different operand types: on x86, there are a dozen register types, immediate (8/16/32/64 bits), and memory operands (always the same layout). On AArch64, there's: GP regs, incremented GP reg (MOPS extension), extended GP reg (e.g., SXTB), shifted GP reg, stack pointer, FP reg, vector register, vector register element, vector register table, vector register table element, a dozen types of memory operands, conditions, and a dozen types of immediate encodings (including the fascinating and very useful, but also very non-trivial encoding of logical immediates [1]).
AArch64 also has some register constraints: some vector operations can only encode registers 0-15 or 0-7; not to mention SVE with its "movprfx" prefix instruction that is only valid in front of a few selected instructions.
As I'm currently implementing an AArch64 code generator for the D language dmd compiler, the inconsistency of its instructions is equaled and worsened by the clumsiness of the documentation for it :-/ But I'm slowly figuring it out.
(For example, some instructions with very different encodings have the same mnemonic. Arrgghh.)
It's probably no longer maintained, but a former colleague of mine did some work on this for C++: https://github.com/ainfosec/shoulder. Obviously if the docs are lying it doesn't help much, but there was another effort he had https://github.com/ainfosec/scapula that tried to automate detecting behavior differences between the docs and the hardware implementation.
Admittedly, I never wrote an assembler, but the encoding of x86-64 seems pretty convoluted [0] [1], with much of the information smeared across bits in the ModR/M and SIB bytes and then extended by prefix bytes. It's more complicated than you would assume from having only written assembly in text and then having the assembler encode the instructions.
One of the things that sticks out to me is that on x86-64, operations on 32-bit registers implicitly zero out the upper 32-bits of the corresponding 64-bit register (e.g. "MOV EAX, EBX" also zeroes out the upper 32-bits of RAX), except for the opcode 0x90, which, logically, would be the accumulator encoding of "XCHG EAX, EAX" but is special-cased to do nothing because it's the traditional NOP instruction for x86. So you have to use the other encoding, 87 C0.
The fact that AArch64 instructions are always 4 bytes, and the fact that they clearly thought out their instruction set more (e.g. instead of having CMP, they just use SUBS (subtract and set condition codes) and store the result in the zero register, which is always zero), is what made me say that it seemed more regular. I hadn't studied it in as close detail as x86-64, though. But you've written an ARM disassembler and I haven't; you seem to know more about it than me, and so I believe what you're saying.
> AArch64 also has some register constraints
x86-64 also has some, in that many instructions can't encode the upper 8 bits of a 16-bit register (AH, BH, DH, CH) if a REX prefix is used.
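A byte-level example (encodings worked out by hand from the ModRM tables, so treat as a sketch):

    /* 88 /r = mov r/m8, r8; ModRM E0: mod=11, reg=100, rm=000 (AL) */
    static const unsigned char no_rex[]   = { 0x88, 0xE0 };       /* mov al, ah  */
    static const unsigned char with_rex[] = { 0x40, 0x88, 0xE0 }; /* mov al, spl --
                                                 with any REX, reg=100 means SPL and AH is unreachable */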
One of the projects I work on every now and then is the "World's Worst X86 Decoder", where I'm trying to essentially automatically uncover x86 assembly semantics without ever having to build a list of x86 instructions, and this has forced me to look at the x86 ISA in a different way. The basic format I build for an opcode is this:
And all opcodes can optionally have an immediate of 1, 2, 3, 4, or 8 bytes and optionally have a ModR/M byte (which is a separate datastructure because I don't enter that information myself, I simply run a sandsifter-like program to execute every single opcode and work out the answer).
This isn't quite accurate to how an assembler would see it, as Intel will sometimes pack instructions with 0 or 1 operands into a ModR/M byte, so that, e.g., 0F.01 eax, [mem] is actually the SGDT instruction and 0F.01 ecx, [mem] is SIDT (and 0F.01 eax, ecx is actually VMCALL).
As long as you're internally making a distinction between the different operand forms of instructions (e.g., 8-bit ADD versus 16-bit versus 32-bit versus 64-bit, and register/register versus register/immediate versus register/memory), it's actually not all that difficult to deal with the instruction encoding, or even mapping IR to those instructions, at least until EVEX prefixes enter the picture.
I add to it in tandem with writing the code generator. It helps flush out bugs in both by doing this. I.e. generate the instruction, then disassemble it and compare with what I thought it should be.
It's quite a bit more complicated than the corresponding x86 disassembler:
x86 has a lot of special cases and it's becoming harder and harder to find non-code material with all instructions in a nice tabular format. But overall the encoding isn't that complicated. What's complex in the architecture is the amount of legacy stuff that still lives in privileged code.
> x86-64 also has some, in that many instructions can't encode the upper 8 bits of a 16-bit register (AH, BH, DH, CH) if a REX prefix is used.
You can't use both the high registers and the extended low registers in the same instruction, indeed. But in practice nobody is using the high registers anymore so it's a relatively rare limitation.
I think English may be a better example; we just stapled chunks of vulgar latin to an inconsistently simplified proto-germanic and then borrowed words from every language we met along the way. Add in 44 sounds serialized to the page with 26 letters and tada!
Itanium. Pretty much every time I open up the manual, I find a new thing that makes me go "what the hell were you guys thinking!?" without even trying to.
None, really. I just happened to get a copy of the manual and start idly reading it when my computer got stuck in a very long update-reboot cycle and I couldn't do anything other than read a physical book.
I recently implemented a good portion of an x86(-64) decoder for a side project [1] and was kinda surprised how much more complicated it has become in recent years. Sandpile.org [2] was really useful for my purpose.
> Writing a CPU emulator is, in my opinion, the best way to REALLY understand how a CPU works.
The 68k disassembler we wrote in college was such a Neo “I know kung fu” moment for me. It was the missing link that let me reason about code from high-level language down to transistors and back. I can only imagine writing a full emulator is an order of magnitude more effective. Great article!
I would say writing an ISA emulator is actually not helpful for understanding how a modern superscalar CPU works, because almost all of it is optimizations that are hidden from you.
Apparently my memory is wrong; I thought the salsa20 variants and machine code were originally on cryp.to, but Dan Bernstein's site is - https://cr.yp.to/
This was at a startup when we were looking at data-at-rest encryption, streaming encryption and other such things. Dan had a page with different implementations (cross-compiled from his assembler representation) targeting different chipsets and instruction sets. Using VMs (this was the early/mid 2000s) and such, it was interesting to see which of those instruction sets were supported. In testing, there would be occasional hiccups where an implementation wasn't fully supported even though the VM claimed it was.
It's funny to me how much grief x86 assembly generates when compared to RISC here, because I have the opposite problem when delinking code back into object files.
For this use-case, x86 is really easy to analyze whereas MIPS has been a nightmare to pull off. This is because all I mostly care about are references to code and data. x86 has pointer-sized immediate constants and MIPS has split HI16/LO16 relocation pairs, which leads to all sorts of trouble with register usage graphs, code flow and branch delay instructions.
That should not be construed as praise on my end for x86.
Yes, x86 is weird but the variable-length instructions are actually nice and easy to understand, once they're unpacked to text form anyway. The problem with them is they're insecure, because you can hide instructions in the middle of other instructions.
I think the biggest thing you learn in x86 assembly vs C is that signed/unsignedness becomes a property of the operation instead of the type.
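A small demonstration of that point (GCC-style inline asm sketch; the values are arbitrary):

    #include <stdio.h>

    int main(void)
    {
        unsigned a = 0xFFFFFFFFu, b = 1;   /* same bits, but -1 vs 1 if viewed as signed */
        unsigned char above, greater;
        __asm__ ("cmp %3, %2\n\t"          /* one CMP sets all the flags       */
                 "seta %0\n\t"             /* unsigned a > b (uses CF, ZF)     */
                 "setg %1"                 /* signed   a > b (uses SF, OF, ZF) */
                 : "=q"(above), "=q"(greater)
                 : "r"(a), "r"(b)
                 : "cc");
        printf("unsigned a>b: %u, signed a>b: %u\n",
               (unsigned)above, (unsigned)greater);  /* prints 1 and 0 */
        return 0;
    }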
It would be cool if you could use flags, which is easy/easier on some architectures like PPC/armv7, but x86 overwrites them too easily so it's too hard to use their values.
Interesting read. I have a lot of respect for people who develop emulators for x86 processors. It is a complicated processor, and from first-hand experience I know that developing and debugging emulators for CPUs can be very challenging. In the past year, I spent some time developing a very limited i386 emulator [1], including some system calls, for executing the first steps of live-bootstrap [2], primarily to figure out how it works. I learned a lot about system calls and ELF.
Most of the complexities lie in managing the various configurations of total system compatibility emulation, especially for timing, analog oddities, and whether to include bugs or not and for which steppings. If you want precise and accurate emulation, you have to have real hardware to validate behavior against. Then there are the cases of what not to emulate and offering better-than-original alternatives.
Haha! Writing an x86 emulator! I still remember writing a toy emulator which was able to execute something around the first 1000-ish lines of a real BIOS (and then it got stuck or looped when it started to access ports or so; I can't remember, it was too long ago and I didn't continue it as I started to get into DirectX and modern C++ more).
Intel architecture is loaded with historical artifacts. The switch in how segment registers were used as you went from real mode to protected mode was an incredible hardware hack to keep older software working. I blame Intel for why so many folks avoid assembly language. I programmed in assembly for years using TI's TMS34010 graphics chips and the design was gorgeous -- simple RISC instruction set, flat address space, and bit addressable! If during the earlier decades folks had been programming using chips with more elegant designs, far more folks would be programming in assembly language (or at least would know how to).
> I blame Intel for why so many folks avoid assembly language.
x86 (the worst assembly of any of the top 50 most popular ISAs by a massive margin) and tricky MIPS branch delay slots trivia questions at university have done more to turn off programmers from learning assembly than anything else and it's not even close.
This is one reason I'm hoping that RISC-V kills off x86. It actually has a chance of once again allowing your average programmer to learn useful assembly.
I think that's putting the cart before the horse. I think it wouldn't matter which architecture you choose as there will always be deep performance considerations that must be understood in order to write efficient software.
Otherwise your statement might amount down to "I hope there is an ISA that intentionally wastes performance and energy in deference to human standards of beauty."
It's why the annals of expertise rarely makes for good dinner table conversation.
x86 has a parity flag. It only takes the parity of the lowest 8 bits though. Why is it there? Because it was in the 8086 because it was in the 8080 because it was in the 8008 because Intel was trying to win a contract for the Datapoint 2200.
Sometimes the short instruction variant is correct, but not if it makes a single instruction break down into many uops as the microcode is 1000x slower.
Oh, but you need to use those longer variants without extra uops to align functions to cache boundaries because they perform better than NOP.
Floats and SIMD are a mess with x87, AMX (with incompatible variants), SSE1-4 (with incompatible variants), AVX, AVX2, and AVX512 (with incompatible variants) among others.
The segment-and-offset way of dealing with memory is painful.
What about the weird rules about which registers are reserved for multiply and divide? Half the “general purpose” registers are actually locked in at the ISA level.
Now APX is coming up and you get to choose between shorter instructions with 16 registers and 2-register syntax, or longer instructions with 32 registers and 3-register syntax.
And this just scratches the surface.
RISC-V is better in every way. The instruction density is significantly higher. The instructions are simpler and easier to understand while being just as powerful. Optimizing compilers are easier to write because there's generally just one way to do things and it's guaranteed to be optimized.
What do you find particularly problematic about x86 assembly, from a pedagogical standpoint? I've never noticed any glaring issues with it, except for the weird suffixes and sigils if you use AT&T syntax (which I generally avoid).
I suspect the biggest issue is that courses like to talk about how instructions are encoded, and that can be difficult with x86 considering how complex the encoding scheme is. Personally, I don't think x86 is all that bad as long as you look at a small useful subset of instructions and ignore legacy and encoding.
True, encoding is one thing that really sets x86 apart. But as you say, the assembly itself doesn't seem that uniquely horrible (at least not since the 32-bit era), which is why I found the sentiment confusing as it was phrased.
Maybe it's the haphazard SIMD instruction set, with every extension adding various subtly-different ways to permute bytes and whatnot? But that would hardly seem like a beginner's issue. The integer multiplication and division instructions can also be a bit wonky to use, but hardly unbearably so.
An ordinary developer cannot write performant x86 without a massive optimizing compiler.
Actual instruction encoding is horrible. If you’re arguing that you can write a high-level assembly over the top, then you aren’t so much writing assembly as you are writing something in between.
When you need to start caring about the actual assembly (padding a cache line, avoiding instructions with too many uops, or choosing between using APX 32 registers and more normal shorter instructions, etc) rather than some high level abstraction, the experience is worse than any other popular ISA.
> Actual instruction encoding is horrible. If you’re arguing that you can write a high-level assembly over the top, then you aren’t so much writing assembly as you are writing something in between.
One instruction in x86 assembly is one instruction in the machine code, and one instruction as recognized by the processor. And except for legacy instructions that we shouldn't teach people to use, each of these is not much higher-level than an instruction in any other assembly language. So I still don't see what the issue is, apart from "the byte encoding is wacky".
(There are μops beneath it of course, but these are reordered and placed into execution units in a very implementation-dependent manner that can't easily be optimized until runtime. Recall how VLIW failed at exposing this to the programmer/compiler.)
> padding a cache line, avoiding instructions with too many uops
Any realistic ARM or RISC-V processor these days also supports out-of-order execution with an instruction cache. The Cortex processors even have μops to support this! The classic 5-stage pipeline is obsolete outside the classroom. So if you're aiming for maximum performance, I don't see how these are concerns that arise far less in other assembly languages. E.g., you'll always have to be worried about register dependencies, execution units, optimal loop unrolling, etc. It's not like a typical program will be blocked on the μop cache in any case.
> APX 32 registers and more normal shorter instructions
APX doesn't exist yet, and I'd wager there's a good chance it will never reach consumer CPUs.
Which “one instruction” is the right one? You can have the exact same instruction represented many different ways in x86. Some are shorter and some are longer (and some are different, but the same length). When you say to add two numbers, which instruction variant is correct?
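To make that concrete, here are a few of the valid ways to encode the same addition (bytes per the SDM opcode tables; my own example):

    /* three encodings of "add eax, 1", from shortest to longest */
    static const unsigned char add_imm8[]    = { 0x83, 0xC0, 0x01 };                   /* 83 /0 ib */
    static const unsigned char add_eax_imm[] = { 0x05, 0x01, 0x00, 0x00, 0x00 };       /* 05 id    */
    static const unsigned char add_imm32[]   = { 0x81, 0xC0, 0x01, 0x00, 0x00, 0x00 }; /* 81 /0 id */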
This isn’t a straightforward answer. As I alluded to in another statement you quoted, padding cache lines is a fascinating example. Functions should ideally start at the beginning of a cache line. This means the preceding cache line needs to be filled with something. NOP seems like the perfect solution, but it adds unnecessary instructions in the uop cache which slow things down. Instead, the compiler goes to the code on the previous cache line and expands instructions to their largest size to fill the space because adding a few bytes of useless prefix is allowed and doesn’t generate NOPs in the uop cache.
Your uop statements aren’t what I was talking about. Almost all RISC-V instructions result in just one uop. On fast machines, they may even generate fewer than one uop each if they get fused together.
x86 has a different situation. If your instruction generates too many uops (or is too esoteric), it will skip the fast hardware decoders and be sent to a microcode decoder. There’s a massive penalty for doing this that slows performance to a crawl. To my knowledge, no such instructions exist in any modern ISA.
Intel only has one high performance core. When they introduce APX, it’ll be on everything. There’s good reason to believe this will be introduced. It adds lots of good features and offers increased code density in quite a few situations which is something we haven’t seen in a meaningful way since AMD64.
> x86 has a different situation. If your instruction generates too many uops (or is too esoteric), it will skip the fast hardware decoders and be sent to a microcode decoder. There’s a massive penalty for doing this that slows performance to a crawl. To my knowledge, no such instructions exist in any modern ISA.
Not surprising that modern ISAs are missing legacy instructions
What's crazy is that, depending on how deep you want to go, a lot of the information is not available in documents published by Intel. Fortunately, if it matters for emulators, it typically can be (or already has been) reverse engineered.
Wow, is that true? And the documentation is thousands of pages! I can't understand how Intel keep these processors consistent during development. It must be a really, really hard job.
It was a nice period in history when proper RISC was working well, when we could get stuff to run faster, but getting more transistors was expensive. Now we can't get it to run faster but we can get billions of transistors, making stuff more complicated being the way to more performance.
I wonder if we ever again will get a time where simplification is a valid option...
Notably things like undocumented opcodes, processor modes, undocumented registers, undocumented bits within registers, etc. It's not uncommon to release a specific set of documentation to trusted partners only. Back in the day intel called these "gold books".
Where did he say this? In any case, I can confirm it from personal experience. x86 isn't that annoying if you ignore most of it. The worst thing is the edge cases with how some instructions can only use some registers, like shifting/rotating using cl.
Just as an adjacent aside from a random about learning by doing:
Implementing a ThingDoer is a huge learning experience. I remember doing co-op "write-a-compiler" coursework with another person. We were doing great, everything was working and then we got to the oral exam...
"Why is your Stack Pointer growing upwards"?
... I was kinda stunned. I'd never thought about that. We understood most of the things, but sometimes we kind of just bashed at things until they worked... and it turned out upward-growing SP did work (up to a point) on the architecture our toy compiler was targeting.
[1]: https://sourceware.org/bugzilla/show_bug.cgi?id=31748