The microcode and hardware in the 8086 processor that perform string operations (righto.com)
165 points by picture on April 4, 2023 | 96 comments



The solution is that, unlike most instructions, a string instruction can be interrupted in the middle of its execution.

This is true even for later x86 processors that move entire cache lines at a time, and it also leads to a sneaky technique for detecting VMs or being debugged/emulated:

https://repzret.org/p/rep-prefix-and-detecting-valgrind/

https://silviocesare.wordpress.com/2009/02/02/anti-debugging...


Small nitpick: the REP STOSB isn't left in the prefetch queue while it's being executed by the microcode engine. The queue always contains the next instructions to execute (obviously it's not as simple for out-of-order architectures).

The following should load AL with 0 on an 8088, 1 on an 8086:

        mov bx,test_instr+1
        mov byte [bx],1    ;make sure immediate operand is set to one
        mov cl,255         ;max shift count to give time for prefetch
        cli                ;no interrupts allowed
    
        align 2
        shl byte [bx],cl  ;clears byte at [bx] after wasting lots of time
        nop               ;first byte in queue
        nop               ;2nd
        nop               ;3rd
        sti               ;4th (intr. recognized after next instruction)
    test_instr:
        mov al,0ffh       ;opcode is 5th, operand is 6th
The 255x left shift is done in an internal register, during which the 8086 has more than enough time to fetch all six following bytes. I didn't test this code but am 99% certain it works.


username checks out!


I’m curious what rep lodsb even does lol


Potentially read lots of stuff, without doing anything useful ;)

Sort of a self-deprecating name I chose. But it's also an interesting instruction, in that Intel had to actually write microcode specifically to support it in every x86 compatible chip going forward.

It loads CX bytes from memory at DS:SI into the accumulator, overwriting whatever was there previously each time.

The only uses I can think of:

     - test memory speed
     - if CX is known to be 0 or 1, conditionally read one byte
     - trigger some I/O device where the value read isn't important
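
For the memory-speed case, a minimal sketch (untested; assumes DS already points at the region to be read):

    ; time how long this takes with interrupts masked
        mov si,0        ; start offset within DS
        mov cx,8000h    ; read 32K bytes
        cld             ; increment SI after each load
        rep lodsb       ; AL ends up holding only the last byte read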


I think REP is expected to be combined with a condition. REPNE LODSB will read while not fetching 0, so it's a search for the end of a C string. Using the remaining CX or SI value you can determine the length fetched before the condition failed.


LODS, STOS, MOVS, INS and OUTS are all unconditional (doesn't matter which version of the prefix you use, but officially REP should be the same opcode as REPE).

SCAS is the one that scans for e.g. a nul byte.
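
For illustration, the classic strlen-style scan as a minimal sketch (untested; assumes ES:DI points at a nul-terminated string):

    ; count the bytes before the terminating 0
        xor al,al       ; value to scan for
        mov cx,0ffffh   ; maximum count
        cld
        repne scasb     ; compare AL against ES:[DI] until equal or CX = 0
        not cx          ; CX counted down from FFFFh, so NOT CX = bytes scanned
        dec cx          ; don't count the nul itself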


These 8086 instructions were pretty cool coming from a 6502 background. The usual way to copy a block of memory in 6502 was something like this:

    ldy #len
  loop:
    lda source-1,y   ; memory accesses: loads opcode, address, value
    sta dest-1,y     ; memory accesses: loads opcode, address, stores value
    dey              ; memory accesses: loads opcode
    bne loop         ; memory accesses: loads opcode, offset
So 4 instructions and a minimum of 9 memory accesses per byte copied, more if not in the zero page. Even unrolling the loop only gets you down to 6.

Compare this to the 8086:

      mov cx, len
      mov si, source
      mov di, dest
  rep movsb            ; memory accesses: read the opcodes, then 1 load and 1 store per byte
Even forgetting about word moves, you're down to 2 memory accesses per byte. No instruction opcode or branching overhead.
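
And the word-sized variant, as a minimal sketch (assumes len is even; on the 8088 each word still takes two byte-wide bus cycles):

      mov cx, len/2
      mov si, source
      mov di, dest
      rep movsw          ; one 16-bit load and one 16-bit store per word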


It's worse than this for the 6502! DEY takes 2 cycles, and BNE (when taken) takes 3. LDA abs,Y takes 4-5 cycles (the extra one happens if a page boundary is crossed when performing the address calculation - in practice you'd try to arrange for this not to happen), and STA abs,Y takes 5 cycles.

So out of the 14 cycles per iteration, there are 2 useful ones: 1 read and 1 write. You need to unroll this sort of loop a few times to get anywhere near 9 cycles per byte, and ~80% of the cycles are still going to be kind of wasted.

(Still annoys me to this day if I find myself writing code to copy stuff around. Why isn't it in the right place already?!)


The enhanced 6502 derivative in the PC Engine/TurboGrafx-16 games console of the early 90s came with block move instructions (MVI, MVN?) that worked similarly, I think. (Hu62C80 or similar was the CPU name...)


Sadly the HuC6280 sucked just as much. It had dedicated instructions, but the cycle cost was ridiculous (17 + 6x cycles), i.e. ~160KB/s: http://shu.emuunlim.com/download/pcedocs/pce_cpu.html

Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII)

For contrast, the five-years-older 80286 already did 'rep movsw' at (AFAIK) 2 cycles per byte. Six years later, the Pentium did 'rep movsd' at 4 bytes per cycle. Nowadays Cannonlake can do 'rep movsb' a full cache line at a time, at full cache/memory-controller speed.


Well the Z80 was worse. 21 cycles per byte! The reason is that instead of running a loop in microcode, it decremented PC by 2, then fetched the instruction again every time.


If you wanted to clear an area, LDIR/LDDR were slower than using a string of PUSH instructions.

When moving data PUSH/POP were slightly slower than using LDIR/LDDR, though.


Don't forget the direction bit in the 8086's status register.

[Don't forget the decimal mode bit in the 6502's status register :-)]


Learned that one the hard way. I built a 6502 single-board computer and tested it out running my custom-built monitor. I spent hours verifying the hardware to isolate a problem where programs would crash and memory would become corrupted. Turns out my monitor was popping garbage into the flags register when it returned control, sometimes setting decimal mode.


What's noteworthy about the decimal mode bit in OP's example? (Not familiar with 6502 nuances, so this sounds interesting.)


They're both called the "D" flag, and they both globally control the behavior of certain processor instructions. Besides that, they're totally different.

The 8086 D flag is the direction for string instructions, whether to increment or decrement the index registers after each repetition. One use was for moving memory between overlapping regions by starting at the end and working backwards.
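
A minimal sketch of that backward-copy use (untested; source, dest, and len are hypothetical):

        mov cx, len
        mov si, source+len-1   ; start at the last byte
        mov di, dest+len-1
        std                    ; D flag set: SI and DI decrement after each move
        rep movsb
        cld                    ; restore the conventional direction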

The 6502 D flag is for decimal mode. When set, the instructions ADC and SBC do BCD arithmetic instead of binary, e.g. 0x09 + 0x01 = 0x10 instead of 0x0A as you'd expect. The 8086 also has support for BCD, but took a different approach, using separate instructions DAA/DAS for the same purpose.
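
For contrast, the 8086's explicit adjustment as a minimal sketch:

        mov al,09h
        add al,01h      ; AL = 0Ah, plain binary result
        daa             ; AL = 10h, adjusted to packed BCD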

Both can lead to bugs when you assume the flag is in the "normal" state but it somehow got flipped.


It still boggles my mind that processors in the 1970s already had microcode— obviously it was vastly simpler and not doing the crazy branch prediction, speculative execution, and operation reordering that modern processors do, but it's still amazing to me that this architecture was there at all.


Historically, microcode was invented around 1951 and was heavily used in most of the IBM System/360 computers (1964). Once your instruction set is even moderately complicated, using microcode is a lot easier than hard-wired control circuitry.

You can view microcode as one of the many technologies that started in mainframes, moved into minicomputers, and then moved into microprocessors. In a sense, the idea behind RISC was the recognition that this natural progression was not necessarily the best way to build microprocessors.


There's something more fundamental going on here:

The concept of the ISA as a standard implemented by hardware, as opposed to the ISA being a document of what a specific hardware design does, goes back to the IBM System/360 family of computers from 1964. That was the first range of computers, from large to (let's be honest) somewhat less large, that could all run most of the same software, modulo some specialized instructions and peripheral hardware like disk drives not present on the cheapest systems, at different levels of performance. As others have said, microcode is plenty old, but the idea of architectural families is one of the things which really gets us to where we are now.


Microcode (the concept) is very much older than the 1970's. Per Wikipedia ([1]) the earliest system with something like 'microcode' dates back to 1947.

What is 'special' for the 8086 is that this was a low cost microprocessor using microcode, as opposed to a multi-million dollar mainframe CPU.

[1] https://en.wikipedia.org/wiki/Microcode#History

edit: fix typo


Babbage's analytical engine used a form of microcode ~100 years earlier. Complex operations such as multiplication and division were implemented by lower-level instructions encoded as a series of pegs on a rotating barrel. Each micro-instruction ("vertical" in Babbage's terminology) could take several seconds to execute so a complete multiplication or division would take minutes.


You mean the 8086, not the 8-bit 8080, right? The 8080 wasn't microcoded.


Indeed, that was a typo. Fixed now. Thanks.


Thanks for that link! The concept of "Directly Executable High Level Language" is also kind of fascinating— seems almost like an early pitch for something like a native LISP or Java machine.


I love that the TMS34082 math coprocessor had user-definable microcode! This was intended to work with 3D graphics, so you could really get the most out of the ALU by chaining operations for your specific need (e.g. divide-and-sqrt single instruction)

See chapter 8: http://www.bitsavers.org/components/ti/TMS340xx/TMS34082_Des...


Numerous machines had user-definable microcode. For example, the PDP-11/60 had a "Writeable Control Store" (WCS) option, as did the early VAX-11s. I once knew someone who actually did create his own VAX instruction (I want to say it was an FFT, but that might be wrong).


Besides the sibling comments, microcode is how all Xerox computers worked: each language flavour had its own bytecode (Smalltalk, Interlisp, Mesa, Mesa/Cedar), and the respective microcode for the specific bytecode would be loaded on boot.


The 286 added "in string" and "out string" instructions. It turned out that if peripherals had a small internal buffer it was more efficient to use these new instructions instead of using DMA hardware. This is why hard disk controllers for the original PC and XT used DMA but from the AT on they normally didn't.


It was the 186 which added those. It wasn't widely used in PCs because it included some built-in peripherals that weren't compatible with the "standard" ones used by IBM in their original PC.

And that standard PC's DMA controller was originally intended for use with the 8-bit 8080 (eighty-eighty, not eighty-eight) processor. It was slow and only supported 16-bit addresses. IBM added a latch that would store the high address bits, but they weren't incremented when the lower bits wrapped around. So buffers had to be allocated in a way that they didn't cross a 64 KB boundary, and the BIOS disk functions would return an error if your code didn't do this properly.


I grew up with a "Siemens PC-D", 80186 and not fully IBM compatible for the reasons you said. However, it seems that the folks at Siemens were very enthusiastic about it, and ported a lot of software that accessed the hardware more directly, including most Microsoft software (Windows, Word, the C compiler...) and GEM. The BIOS was actually on disk, so updatable, and the BIOS updates came with a README file from the PC-D team that again gave me the impression that at least some of them really liked working on it.

A marvelous machine. High resolution (if monochrome), Hercules-like graphics, for example.

But the kicker was that it actually had an MMU[1], built from discrete logic chips, in order to run SINIX (a Xenix 8086 derivative, but with added MMU support among other things) and enjoy full memory protection. I reverse engineered that MMU; it's quite fun. Paging wasn't possible, though.

[1] Early on as a variant called "PC-X", but they apparently did away with the early board revisions, and all machines I've seen say "PC-D/PC-X" on the back, so they all have the MMU and can be both.


The 80186 and 80286 were launched simultaneously in February 1982, so you cannot say that one of them introduced INS and OUTS before the other.

Nevertheless, due to their high cost at the time, neither of these two CPUs saw much use until two and a half years later, when IBM launched the IBM PC/AT in August 1984.

Most programmers first encountered INS and OUTS while using PC/AT clones with the 80286. Embedded computers with the 80186 became common only some years later, when the price of the 80186 dropped a lot.


NEC V20/V30 were also very popular as an easy cheap upgrade for older computers.


That is right and I have also used some cheap and fast PC/XT clones made with those.

However, they were launched only in 1984, and computers made with them became available only some years after the PC/AT with the 80286, most likely even later than the launch of the 80386, so even though they implemented all the 80186 extensions, they do not count when discussing the timeline of the INS/OUTS introduction.


Author here for all your 8086 questions :-)


It's always boggled my mind that they'd have 32 bits of addressing registers, but overlap 12 bits, leaving just 20 useful bits. What a waste. I guess the segment:offset scheme had some advantages, but honestly, I've never felt like I had a good understanding of them.

But, what if the overlap had been just 10 bits, or just 8, leaving a much larger functional address range before we got clobbered by the 20-bit limit; what if it was 22 or 24 bits of useful range? Can you speculate what effect such a decision would've had, and why it wasn't taken at the time? I understand in 1978 even 20 bits was a lot, but was that optimal?
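
(For reference, the scheme in question as a minimal sketch: the 16-bit segment is shifted left 4 bits and added to the 16-bit offset to form the 20-bit physical address.)

        mov ax,1234h
        mov ds,ax          ; DS = 1234h
        mov si,0abcdh      ; offset
        mov al,[si]        ; reads physical address 12340h + 0abcdh = 1cf0dh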


According to Steve Morse, the 8086 designer:

"Various alternatives for extending the 8080 address space were considered. One such alternative consisted of appending 8 rather than 4 low-order zero bits to the contents of a segment register, thereby providing a 24-bit physical address capable of addressing up to 16 megabytes of memory. This was rejected for the following reasons:

Segments would be forced to start on 256-byte boundaries, resulting in excessive memory fragmentation.

The 4 additional pins that would be required on the chip were not available.

It was felt that a 1-megabyte address space was sufficient. "

Ref: page 16 of https://www.stevemorse.org/8086history/8086history.pdf

Credit to mschaef for finding this.


And it's been stated in related threads numerous times (and by you as well if I recall correctly) how Intel at that time strangely treated their low pin count limit as almost a religion... apparently that influenced many decisions.


The 8086 was designed to be a 16-bit 8080, and 16 times the address space of that chip was a nice incremental leap forward. If you didn't expect the chip to become the foundation of the PC industry for the next 4 decades, it was probably a reasonable choice. At the same time, Intel made the iAPX432 chip, which was the one that was supposed to be future proof and look how that turned out.


But they revisited and used that alternative later, in designing the 80286. (Launched Feb'82, with ~135K transistors, according to Wikipedia.)

If only there was a famous Bill Gates quote, about how 16MB (the limit of 24-bit physical addresses) ought to be more than enough...


Only in protected mode, and then segments were defined by descriptor table entries having a full 24 bit base address.


> Segments would be forced to start on 256-byte boundaries, resulting in excessive memory fragmentation.

Well, they were dead wrong about /that/.


Depends on how many segments you allocate. With 16 byte alignment, it can be practical to have each object (e.g. a linked list node) in its own segment so that the pointers will be only 16 bits while giving access to 1 MB of address space.
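
A minimal sketch of that idea (untested; the node layout is hypothetical: a word value at offset 0 and the next node's segment at offset 2):

        mov es,bx          ; BX holds the 16-bit "pointer" (the node's segment)
        mov ax,[es:0]      ; load the node's value
        mov bx,[es:2]      ; follow the link to the next node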


There were even some DOS C compilers that supported it -- similar to near/far/huge pointers. I can't remember what they were called, though.


The 20 bits of address were likely a direct result of Intel's decision to package the 8086 in a 40-pin package.

They already multiplexed the 16-bit data bus onto 16 of the 20 address-bus pins. But with only 40 pins, that 20-bit address bus was already using half of the total chip pinout.

And at the time the 8086 was being designed (circa 1976-1978), with other microprocessors of the era having 16-bit or smaller address buses, the jump to 1M of possible total address space was likely seen as huge. We look back now, comfortable with 12+ GB of RAM as a common size, and see 1M as small. But when your common size is 16-64K, having 1M as a possibility would likely seem huge.


Packaging makes a lot of sense. I've handled a 68000 in DIP64 and it's just comically huge, and trying to fit the cursed thing into a socket quickly explains why DIPs larger than 40 are ultra rare.

I'm sure there must be architectures that use a multiplexed low/high address bus, like a latch signal that says "the 16 bits on the address bus right now are the segment", then a moment later "okay here's the offset", and leave it to the decoding circuitry on the motherboard to determine how much to overlap them, or not at all. Doing it this way, you could scale the same chip from 64kB to 4GB, and the decision would be part of the system architecture rather than the processor. (Could even have mode bits, like the infamous A20 Gate, that would vary the offset and thus the addressable space...)

But, yeah, it was surely seen as unnecessary at the time. Nobody was expecting the x86 to spawn forty-plus years of descendants, and even though Moore's Law was over a decade old at the time, it seems like nobody was wrapping their head around its full implications.


> I'm sure there must be architectures that use a multiplexed low/high address bus, like a latch signal that says "the 16 bits on the address bus right now are the segment", then a moment later "okay here's the offset"

Most modern architectures do that, see https://en.wikipedia.org/wiki/SDRAM#Control_signals for an older example (DDR4 and DDR5 are more complicated). The high address bits are sent first (RAS is active), then the low address bits are sent (CAS is active).


I think DRAM always worked this way. It's because the chips are physically organized into rows and columns, and the row must be selected first.


It seems obvious to access DRAM that way, but multiplexing the address pins was a big innovation from Mostek for the 4K RAM chips, and they "bet the company" on this idea. Earlier DRAMs weren't multiplexed; Intel just kept adding address pins and ended up with a 22-pin 4K chip. Mostek multiplexed the pins so they could use a 16-pin package.

The hard part is figuring out how to avoid slowing down the chip while you're waiting for both parts of the address. Mostek figured out how to implement the timing so the chip can access the row of memory cells while the column address is getting loaded. This required a bunch of clock generating circuitry on the chip, so it wasn't trivial.

I discuss this in more detail in one of my blog posts: https://www.righto.com/2020/11/reverse-engineering-classic-m...


That's great, and dovetails perfectly into a video I just watched on the topic, which I really enjoyed. It's a little flashy with the animations at first, but getting into the meat of things, they're used well:

https://www.youtube.com/watch?v=7J7X7aZvMXQ


Hahaha, somehow I convinced myself that the rationale for the segment register scheme was a long term plan to accommodate virtual memory and memory protection. The idea being that you would only have to validate constraints on access when a segment register was loaded, rather than every single memory access.


If you read through the PDF Ken links to in another note here, you find this quote starting the "Objectives and Constraints of the 8086" section:

"The processor was to be assembly-language-level-compatible with the 8080 so that existing 8080 software could be reassembled and correctly executed on the 8086. To allow for this, the 8080 register set and instruction set appear as logical subsets of the 8086 registers and instructions."

The segment registers provide a way to address more than a max of 64k, while also maintaining this "assembly-language-level-compatibl[ity]" with existing 8080 programs.


It was not so much an engineering decision as a simple requirement for the chip to be feasible. At that time, a 40-pin DIL was effectively the largest package that could be made at production scale.


Intel's decision to use a 40-pin chip was mainly because Intel had a weird drive for small integrated circuits. The Texas Instruments TMS9900 (1976) used a 64-pin package for instance, as did the Motorola 68000 (1979).

For the longest time, Intel was fixated on 16-pin chips, which is why the Intel 4004 processor was crammed into a 16-pin package. The 8008 designers were lucky that they were allowed 18 pins. The Oral History of Federico Faggin [1] describes how 16-pin packages were a completely silly requirement, but the "God-given 16 pins" was like a religion at Intel. He hated this requirement because it was throwing away performance. When Intel was forced to 18 pins by the 1103 memory chip, it "was like the sky had dropped from heaven" and he had "never seen so many long faces at Intel."

[1] pages 55-56 of http://archive.computerhistory.org/resources/text/Oral_Histo...


I'm really curious about the arguments for sticking to those lower pin numbers. Was it for compatibility? Or packaging costs? Like what was the downside of using more pins (up until a certain point, obviously, but it seems like other manufacturers had no issues with going for higher pin numbers!)


I think the argument for low pin counts was that Intel had a lot of investment in manufacturing and testing to support 16 pin packages, so it would cost money to upgrade to larger packages. But from what I read, it seems like one of those things that turned into a cultural, emotionally-invested issue rather than an accounting issue.


Doubtful. 42- and 48-pin DIPs of the same width as the 40-pin package were a thing around that time or not much later.


A similar thing occurs with 64-bit processors nowadays at the silicon level: they have a 64-bit virtual address space, but only 48 or so bits are used for physical RAM access. Not exactly an 8086 situation, but still an interesting observation.


Do you still write 8086 assembly for fun? I grew up with DOS, a86, and nasm and remember those times fondly. Once I had flat mode in Linux I never really looked back, and once I had x86_64 I never looked back, and now I rarely even delve below the interpreted/compiled level unfortunately. x86 had a closeness to the hardware that was fun to play with.


Strangely enough, I haven't written 8086 assembly since 1988, when I wrote the ellipse-drawing code used in Windows.


Aaron Giles was just lamenting the other day that the GDI ellipse drawing is apparently slightly lopsided, since he's trying to recreate it :)

https://corteximplant.com/@aaronsgiles/110121906721852861


Hopefully that's not my bug :-) Before my code [1], small ellipses looked terrible: a small circle looked like a stop sign and others looked corrugated.

[1] I can't take much credit. The code was Bresenham's algorithm, as explained to me by the OS/2 group. At the time, OS/2 was the cool operating system that was the wave of the future and Windows was this obscure operating system that nobody cared about. The impression I got was that the OS/2 people didn't really want to waste their time talking to us.


Pardon my interest, but do you happen to be a former Microsoftie?


I was an intern in the summer of 1988. Instead of staying, I went back to grad school which in retrospect probably cost me a fortune :-)


In the SCAS/CMPS annotated microcode, the path with the R DS microinstruction is for CMPS, not SCAS.

SCAS compares AX against [ES:DI] while CMPS compares [DS:SI] against [ES:DI], so the paragraph before is slightly incorrect (should be "SCAS compares against the character in the accumulator, while CMPS reads the comparison character from SI").
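
The operand difference as a minimal sketch (untested; buflen, buf1, and buf2 are hypothetical):

        mov di,buf1        ; SCAS compares AL against ES:[DI]
        mov al,'/'
        mov cx,buflen
        repne scasb        ; scan buf1 for '/'
        ;
        mov si,buf1        ; CMPS reads its comparison byte from DS:[SI]...
        mov di,buf2        ; ...and compares it against ES:[DI]
        mov cx,buflen
        repe cmpsb         ; memcmp-style compare of buf1 and buf2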


Thanks, I've fixed the post.


Ken, one tiny update. You have "a 4-megabyte (20-bit) address space" near the top of your post. 2^20 is a 1-megabyte (in base-2 "megabytes") address space, not 4 megabytes.


I got it fixed just before your comment :-)


Is there a performance advantage to using a microcoded instruction versus explicitly issuing each instruction? Or is it a space/bugs-while-coding savings?


A non-repeated string opcode is one byte for the CPU to fetch vs. several for the corresponding "RISC-like" series of operations (load, increment SI, store/compare, increment DI).

With the REP prefix (another single byte), an entire loop could run in microcode without any additional instruction fetches. Remember that each memory access took 4 clock cycles and there was no cache yet.
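
For concreteness, roughly what a single one-byte MOVSB stands in for (a sketch; byte counts are approximate and assume the direction flag is clear):

        mov al,[si]        ; load from DS:SI   (2-byte instruction)
        mov [es:di],al     ; store to ES:DI    (3 bytes with the ES prefix)
        inc si             ; 1 byte
        inc di             ; 1 byte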

--

Eliminating opcode fetches might still speed things up today in some situations, but modern x86 cores are optimized to decode and dispatch several simple operations each cycle without having to go through a microcode ROM (and thus have to use a slower path for the complex instructions).

Also the fact that compilers didn't emit most of the more complex/specialized instructions led to Intel not spending much effort on optimizing those.


> Also the fact that compilers didn't emit most of the more complex/specialized instructions led to Intel not spending much effort on optimizing those.

I believe some compilers will still emit rep movs for memcpy, rep stos for memset, rep cmps for memcmp, and sometimes even rep scas for memchr/strlen if given the appropriate options.
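
In 16-bit style to match the rest of the thread, roughly the kind of inlined fill a compiler might emit for memset (a sketch; a modern compiler would of course use 32- or 64-bit registers):

        mov di, dest
        mov cx, len
        xor al, al         ; fill value 0
        cld
        rep stosb          ; store AL to ES:[DI], CX times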


Directly implementing many & complex instructions in silicon is fast, but takes a lot of transistors. Microcode running on a much simpler "actual" processor is slower, but requires far fewer transistors. IIRC, ~all substantial CPUs of the past ~40 years have mixed the two approaches.

(Microcode also makes it possible to design in bug-patching features. Possibly including patches for bugs in your direct-implemented instructions.)


> to create a 4-megabyte (20-bit) address space consisting of 64K segments

sed "s/4-/1-/", perhaps?


Oops, thanks.


The 8086/88 does have an effective 4MB address space since it outputs the segment register in use during bus cycles, and I believe one of the official manuals did show that possibility, but I don't know of any systems that used such a configuration.


Today I learned. I briefly went through in my mind anything that would foil such a plan, but with no caches whatsoever... it should indeed have worked, even if it's weird (and deeply incompatible with what PCs did, of course).

The only thing that comes to mind is fetching vectors from the IDT. I guess it simply output CS in those cycles?


Once you got a pipelined architecture that stored instructions in a cache you could write a loop in assembly language that runs as fast as one of those string instructions, but if you didn't have that you'd be wasting most of the bandwidth to the RAM loading instructions instead of loading and storing data.


> Finally, the address adder increments and decrements the index registers used for block operations.

It's kind of surprising to me that it's the address adder that increments/decrements the index registers. Since it's a direct operation on a register, naïvely I would have assumed it's the ALU doing that. Is it busy during that time?

EDIT: Having read a bit further, I guess it's because of the microcode's simplicity... so, with the microcode being as it is, using the BL flag to instruct the logic to use the address adder to increment/decrement the index register prevents needing additional ALU uops, which would make the instructions slower? So then I guess it would not have worked to make that BL flag instruct the ALU instead? Does anyone know any details?


I think the motivation is that the address adder already has the capability to increment/decrement registers by various constants: incrementing the PC, incrementing the address for an unaligned word access, correcting the PC from the queue size, etc. If you updated the SI/DI registers through the ALU, you'd need a separate mechanism to get the constants into the ALU, as well as controlling the ALU to do this. But using the address adder gets this functionality almost for free.


> SI → IND R DS,BL 1: Read byte/word from SI

> IND → SI JMPS X0 8 test instruction bit 3: jump if LODS

> DI → IND W DA,BL MOVS path: write to DI

Is "W DA,BL" a typo, should it be "W DS,BL"?


Not a typo, but something I should clean up. The move instruction moves from DS:SI to ES:DI. The 8086 registers originally had different names (which is why AX, BX, CX, DX aren't in order) and the "Extra Segment" was the "Alternate Segment". So the 8086 patent uses "DA" to indicate what is now ES. ("A" for Alternate but I'm not sure why "D" is used for all the segments.) The register names were cleaned up before the 8086 was released.

Andrew Jenner's microcode disassembly used the original names. I've been changing the register names to the modern names, but missed that one. So it should probably be "W ES,BL", although I don't really like "BL" either. (Maybe "PM" would make more sense for plus/minus?)

A bit more on the original register names. The original 8080-inspired registers XA, BC, DE, HL were renamed to AX, CX, DX, BX. The original registers MP, IJ, IK were renamed to BP, SI, DI.


I forgot that it moves from DS:SI to ES:DI, but that's actually super interesting that the registers were named differently originally (and I did notice that the AX/BX/CX/DX order is weird before)... What were CS and SS called? DC and DS?


The original segment registers were RC (code base), RD (data base), RS (stack base), and RA (alternate). I think the "R" is for "relocation register". But in the microcode, DA used the alternate segment, DD used the data segment, DS used the stack segment, and I don't think the microcode ever explicitly accessed the code segment. D0 accessed "segment 0" for interrupt vectors, etc. I don't know what "D" stands for.

There's a diagram showing the original registers in the patent: https://patents.google.com/patent/US4449184A


Super interesting, thanks.

Another thing I just noticed:

> Apparently a comprehensive solution (e.g. counting the prefixes or providing a buffer to hold prefixed during an interrupt) was impractical for the 8086.

Did you have a particular kind of buffer in mind? The buffer could not be internal (i.e. not exposed to the programmer), because the interrupt handler itself could use string instructions. Or, it could just switch over to some other code entirely, and switch back much later.

I guess since interrupts already push not just the usual CS:IP on the stack like a far call would, but also the FLAGS register, so that the special IRET could restore it, the x86 could also have pushed additional data describing the "prefix state"?

Or maybe even try to cram it into the FLAGS bits that were still unused originally? With the 7 spare bits of the 8086, the 4 segment prefixes (of which only one can be active at any time, so 2 bits), plus LOCK, would fit and leave 4 bits to spare.


I didn't think about a solution too deeply, but I was thinking of something like "shadow" prefix registers: when the chip takes an interrupt, copy the prefix latches to the shadow register, and then when you return from the interrupt, restore the prefixes from the shadow register. This would handle string operations in the interrupt handler but wouldn't handle nested interrupts. Your FLAGS solution would handle both since the state goes on the stack.

It would be interesting to find out what the real solution was in later x86 processors. I'm sure someone has looked into this.


The 186 had another condition flag for 'segment prefix present' and decremented PC twice if it was set:

    ** 0.1010101x STOS
     0  IK    -> IND   0 F1   6     ;check count if repeated
     1  M     -> OPR   6 W DA,BL
     2                 1 DEC  tmpa
     3  IND   -> IK    0 NF1  7     ;exit if no repeat
     4  SIGMA -> BC    5 INT  RPTI  ;handle interrupt
     5  BC    -> tmpa  0 NZ   1     ;redundant check???
     6  BC    -> tmpa  0 NBCZ 1     ;loop while Not BC Zero
     7                 4 none  RNI   

    ** 1.10100100 string instr interrupted (RPTI:)
     0  PC    -> tmpb  4 none  SUSP
     1                 1 DEC  tmpb
     2  SIGMA -> tmpb  0 F1   5 
     3                 0 PFX  6 
     4  SIGMA -> PC    4 FLUSH RNI          ;PC corrected
     5  SIGMA -> tmpb  0 UNC  3             ;has REP
     6  SIGMA -> tmpb  0 UNC  4             ;has seg pfx
     7                 4 none  none
edit

- Posting code for STOS since it's simple, even though segment prefixes don't affect its function.

- No idea if line 5 could be removed, maybe needed for timing?

- MOVS, INS and OUTS use a special memory-to-memory/IO operation encoded as 'W IRQ DD,BL'. It doesn't use the IND register at all, maybe the internal DMA controller instead?


Is the 186 microcode from a patent as well?


No, from long hours of staring at photos of the actual die :)

The 8086 patent also only included a few examples, and the naming conventions for microcode registers and instructions. The full microcode for that one was disassembled by Andrew Jenner.


Yeah, I meant this specific example; I didn't know that the 80186 microcode was also disassembled, at least in part.

It would be interesting to know how they sped up multiplication and division between the 8086 and 80186!


> It would be interesting to find out what the real solution was in later x86 processors. I'm sure someone has looked into this.

At least now the solution is to simply pass the full PC of the instruction through the pipeline. Or at the very least mark the PC for each ROB entry, and offsets for each instruction in the ROB entry.

As an aside do you know yet what you're going to work on once you've mined out the 8086?


DA = "Data segment Alternate", or something like that. It refers to the segment register normally called ES, which string instruction implicitly use as destination (source being DS unless overridden).

The names are those used in a patent filed by Intel and come from before the official documentation for the 8086 was written. It might make the microcode a bit more readable to invent a different notation that is more in line with normal x86 assembly, but then again this is a very niche topic :)


Ah, I forgot that it's DS:SI to ES:DI. Thanks.


It's not - the "DA" means that the write happens to the ES segment, while "DS" means that the read happens from the DS segment. The use of different segments for the source and destination gives these instructions extra flexibility.


String Operations! Amazing. Putting the emphasis on “Complex” in CISC.


ARM, which is decidedly considered RISC, has STM/LDM.

It also uses microcode and Ken has not surprisingly written about it too: http://www.righto.com/2016/02/reverse-engineering-arm1-proce...


FWIW, there are a lot of people who consider ARM a hybrid RISC/CISC, among other reasons because of the presence of LDM/STM.

Yeah, it literally has RISC in the name, but the SuperFX is called RISC by its creators as well, and it's about as far away as you can get. RISC was a buzzword, to the point that it's still a bit muddled to this day, 40 years later.



