
I only have a vague understanding of assembly in general, and of x86_64 in particular - but I thought that with so many more registers, the recommended style of programming could change quite a bit (hence the aggressive use of registers for passing arguments to function calls in the x86_64 C ABIs)?



That's less of an assembly issue and more of a compiler/calling convention issue. Assembly allows you to pass parameters on the stack or in registers. Calling conventions simply define a protocol for doing this consistently.
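
Roughly, the same call under two common conventions looks like this (NASM-style sketch; add_two is a made-up function, and the details vary per ABI):

    ; 32-bit cdecl: arguments pushed right-to-left on the stack, caller cleans up.
    push dword 2          ; second argument
    push dword 1          ; first argument
    call add_two          ; result comes back in EAX
    add  esp, 8           ; caller pops the arguments

    ; x86-64 System V: the first integer arguments travel in RDI, RSI, RDX, RCX, R8, R9.
    mov  edi, 1           ; first argument
    mov  esi, 2           ; second argument
    call add_two          ; result comes back in RAX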


Actually, I'd say it's mostly a crappy-processor-design issue: for example, UltraSPARC exposes 32 integer registers at any time in 32-bit mode, plus a much larger set reachable through register windows (specifically designed with compilers in mind). Even the Motorola 68000, with eight address registers and eight general-purpose data registers, is far more elegant than a 32-bit, four-register Intel CPU.

The Intel CPU is just crap from a design standpoint, and since they had to remain backward compatible, it's gotten a lot faster through lots and lots of tricks, but it still sucks in 32-bit mode. No amount of tricks will change that. It has to be run in 64-bit mode to gain a performance boost and simplify the code, whereas processors with a fixed 32-bit instruction encoding run fast in 32-bit mode and their code simplicity is a constant.


While x86 is arguably pretty ugly, I think you're being unfair.

> ... far more elegant than a 32-bit four register intel CPU.

I count eight: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP.

x86 also has the advantage of supporting arbitrary immediate values, so you don't need to allocate registers just to hold constants.
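
For example (sketch; the SPARC lines are only there to show the contrast):

    ; x86 (NASM syntax): a full 32-bit constant is encoded directly in the instruction.
    add eax, 0x12345678

    ; A fixed-width RISC like SPARC typically has to build large constants
    ; in a register first, e.g.:
    ;   sethi %hi(0x12345678), %g1
    ;   or    %g1, %lo(0x12345678), %g1
    ;   add   %o0, %g1, %o0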


eax is the accumulator register, ebx the base register, ecx the counter register, edx the data register, esi is the source index, edi is the destination index, ebp is the base pointer, and esp is the stack pointer.

When I originally wrote "four general purpose registers", I had eax, ebx, ecx and edx in mind, but after fully listing them above, I revise my earlier statement: the x86 architecture has two general purpose registers. Crappy processor architecture with lots of specialized registers, but too few general purpose ones.

Compared to the MOS 6502, the Motorola MC680x0 family, or SPARC, only eax and edx are really general purpose registers -- even counting ecx, the counter register, would be iffy.


The 6502 has only an accumulator anyway; X and Y are not general purpose.

The 68k is pretty nice; the d0-d7 registers are indeed interchangeable. Of course a0-a7 are just for addressing, and I think a7 was usually the stack pointer.

SPARC I've never programmed, so no comments about it.

I've written x86 code in the past (20 years ago) using all 8 registers for general-purpose tasks -- yes, even ESP. It was faster that way to implement a texture mapper. Ugly but fast.

Those 8 x86 registers are mostly general purpose, apart from some exceptions.

Multiplication was the only annoying one, with the result landing in EDX:EAX.
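
To illustrate the annoyance (NASM-style sketch): the one-operand MUL widens the product and clobbers both halves.

    mov eax, 100000      ; implicit first factor
    mov ecx, 100000      ; second factor
    mul ecx              ; EDX:EAX = EAX * ECX (here 10,000,000,000)
    ; EAX now holds the low 32 bits, EDX the high 32 bits -- both overwritten.
    ; The two-operand form (imul eax, ecx) keeps only the low half and leaves EDX alone.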

I always succeeded in making x86 do whatever I wanted, despite some limitations on register use.


Indeed. While repurposing ESP might rightfully be considered ugly, repurposing EBP is quite common, and EBX, ECX, ESI and EDI are pretty much general purpose, because nobody has used them for their fixed functions in two decades.
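
A typical way of freeing EBP inside a leaf routine looks roughly like this (sketch, NASM syntax, assuming a cdecl-style stack argument):

    push ebp              ; preserve the caller's frame pointer
    mov  ebp, [esp + 8]   ; first stack argument, now an ordinary scratch pointer
    ; ... inner loop using EAX-EDI and EBP freely ...
    pop  ebp
    ret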


Is there something wrong with loop and friends?

I agree that rep stosb would probably be confined to memory management, and saving/restoring registers around such ops isn't the end of the world. Not sure about movsb -- I suppose if you're copying enough data, the save/restore is going to be negligible overhead in terms of speed, but if you're actually trying to write clear code, it would certainly be easier not to have to worry about the bookkeeping?


I don't know what the current status quo is, but for most of the time since the 80286/80386, "rep stos" and "rep movs" have been significantly slower than just (loading and) storing data in an unrolled loop. This limits their usefulness to very short spans, or to when code size is most important. But most short spans are also predictable (static), so compilers can often just generate an instruction or two instead (like xor eax, eax / mov <target>, eax).
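
Roughly what that looks like (NASM-style sketch, zeroing 64 bytes at EDI):

    ; String instruction: compact, but the microcoded startup cost dominates
    ; for short spans.
    xor eax, eax
    mov ecx, 16          ; 16 dwords
    rep stosd

    ; Unrolled plain stores, as a compiler might emit for a known small size:
    xor eax, eax
    mov [edi],     eax
    mov [edi + 4], eax
    mov [edi + 8], eax
    ; ... and so on up to [edi + 60]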

Currently the fastest way to memset large chunks of memory is probably to use SSE or AVX. I'd guess this is what gets generated if the compiler's target architecture allows it.

With SSE/AVX you also have the option of using non-temporal moves to avoid polluting the caches. This might have a negative impact on any memset micro-benchmark [1], but it can significantly help any concurrently executing, memory-bound CPU cores.

Properly aligned (to a 64-byte cache-line boundary), you might be able to avoid read-for-ownership as well, further reducing memory bus traffic.
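
A minimal sketch of that combination (NASM syntax; assumes EDI starts on a 64-byte cache-line boundary and ECX is a byte count that is a multiple of 64):

    pxor     xmm0, xmm0        ; the value to store (zero)
    fill_loop:
    movntdq  [edi],      xmm0  ; non-temporal: whole cache line written, caches bypassed
    movntdq  [edi + 16], xmm0
    movntdq  [edi + 32], xmm0
    movntdq  [edi + 48], xmm0
    add      edi, 64
    sub      ecx, 64
    jnz      fill_loop
    sfence                     ; make the streaming stores globally visible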

So most uses of the rep prefix might be pointless, unless you can accept the performance hit.

[1]: As with micro-benchmarking any other resource-constrained operation, micro-benchmarks can give you a very wrong idea of what is best for the system as a whole.


Those instructions are treated as "legacy PITA" by CPU vendors and, being complex and harder to implement than simpler ones, aren't implemented as efficiently.

The CPUs have lots of duplicated logic to process many instructions in parallel and, on "friendly" code, can sustain an average throughput of 2 or more instructions per clock cycle, provided that the instructions are simple enough.

The end result is that a loop made with normal adds, cmps and jnes outperforms those dedicated looping instructions.
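
For instance (NASM-style sketch), the dedicated instruction versus the plain idiom:

    mov ecx, 100
    slow_loop:
        ; ... loop body ...
        loop slow_loop        ; microcoded decrement-and-branch, noticeably slower

    mov ecx, 100
    fast_loop:
        ; ... loop body ...
        dec ecx
        jnz fast_loop         ; simple ops; many recent cores even fuse dec+jnz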

They are only used by compilers when optimizing for code size, and maybe by people who want concise hand-written assembly, though I'm not sure why they wouldn't just use C in that case.

See "Software Optimization Guides" released by AMD/Intel for more info.



