
The Hidden Power of BCD Instructions - nkurz
http://www.hugi.scene.org/online/coding/hugi%2017%20-%20coaax.htm
======
userbinator
AAD is basically an 8-bit immediate multiply-accumulate, so it could be said
that the 8088 had a MAC instruction - with self-modifying code you could do a
real 8-bit multiply-accumulate. ;-)
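
To make the multiply-accumulate reading concrete, here is a rough Python model of what AAD does to AH and AL. It is only a sketch of the documented behaviour (AL = AL + AH * imm8, truncated to 8 bits, then AH = 0); the function name `aad` and its argument order are made up for illustration, and flags are ignored.

```python
def aad(ah, al, base=10):
    """Model of AAD imm8: fold AH into AL as AL + AH * base, then clear AH.

    `base` stands in for the immediate byte (10 in the standard encoding).
    Returns the new (AH, AL) pair; flag updates are not modelled.
    """
    al = (al + ah * base) & 0xFF  # 8-bit multiply-accumulate, result truncated
    return 0, al                  # AH is always cleared


# Unpacked BCD digits 5 and 7 become the binary value 57.
print(aad(5, 7))
# With a different immediate it is a general a*b + c on bytes.
print(aad(3, 4, base=7))
```

The immediate is part of the instruction encoding, which is why a variable multiplier would need self-modifying code.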

There's also this 5-byte sequence to convert a nybble (0-F) in AL into the
appropriate ASCII hex digit (30h-39h, 41h-46h):

    
    
        cmp al, 10
        sbb al, 69h
        das
    

x86 code can be insanely dense - check out the 256-byte and below categories
in the demoscene, for example.
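
For anyone wondering why the CMP/SBB/DAS sequence above works, here is a small Python simulation. It is a sketch that follows the Intel manual's pseudocode for DAS as I understand it; the helper names (`cmp8`, `sbb8`, `das`, `nibble_to_hex`) are invented for this example and only the carry and auxiliary-carry flags are tracked.

```python
def cmp8(al, imm):
    """CMP AL, imm: return the carry flag (set when AL < imm, i.e. a borrow)."""
    return 1 if al < imm else 0


def sbb8(al, imm, cf):
    """SBB AL, imm: subtract imm and the incoming carry; return (AL, CF, AF)."""
    result = al - imm - cf
    # AF is the borrow out of the low nibble, CF the borrow out of the byte.
    af = 1 if (al & 0x0F) - (imm & 0x0F) - cf < 0 else 0
    cf_out = 1 if result < 0 else 0
    return result & 0xFF, cf_out, af


def das(al, cf, af):
    """DAS: decimal adjust AL after subtraction; return (AL, CF)."""
    old_al, old_cf = al, cf
    cf = 0
    if (al & 0x0F) > 9 or af:
        cf = 1 if (old_cf or al < 6) else 0  # borrow from AL - 6
        al = (al - 6) & 0xFF
    if old_al > 0x99 or old_cf:
        al = (al - 0x60) & 0xFF
        cf = 1
    return al, cf


def nibble_to_hex(n):
    """Run the three-instruction sequence on a value 0..15 held in AL."""
    cf = cmp8(n, 10)                # CF = 1 for 0..9, CF = 0 for 10..15
    al, cf, af = sbb8(n, 0x69, cf)  # subtracts 6Ah for digits, 69h for letters
    al, _ = das(al, cf, af)         # the adjustment lands on '0'-'9' or 'A'-'F'
    return chr(al)


print("".join(nibble_to_hex(n) for n in range(16)))
```

The extra 1 subtracted for 0-9 flips the auxiliary carry, which is what makes DAS apply a different correction to digits than to letters.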

~~~
aktau
That looks much cooler than the way I recently tried to print numbers in hex
when teaching myself some more asm
([https://gist.github.com/aktau/a85a925282fbe66d13b0](https://gist.github.com/aktau/a85a925282fbe66d13b0)).
I wonder how it performs... (afaik, old rarely used instructions like that
could get deprioritized by intel)

------
zxcdw
This is good stuff. If anybody wants to dig deeper into articles like this, I
have to mention the Hugi Coding Digest[1] (an executable "diskmag") from 2003
which contains all the articles related to programming from Hugi #11 to Hugi
#27, including this one.

The _topics_ of the articles are as follows: "Mathematics & Theoretical
Computer Science", "General Programming Techniques", "Searching & Sorting",
"Object-Orientated Programming", "File Formats", "Text Processing", "2D
Graphics Programming", "3D Graphics Programming", "Windows Graphics
Programming (GDI, DirectDraw, Direct3D)", "OpenGL", "Sound Programming",
"Synchronization & Scripting for Demos", "Hardware-centered Programming",
"Code Optimization, FPU", "Data Compression", "64k, 4k and even smaller
intros", "Windows", "Linux", "Other Non-Wintel platforms", "Active Server
Pages", "ActiveX", "Assembler", "C++", "Flash", "Java", "JavaScript", "PHP",
"Other Programming Languages", "Miscellaneous".

Hell, it also has some nice tracker music in the background.

Obviously the format is a bit cumbersome -- but I think it's a good dive into
the demoscene culture. Also most of the articles are written by hobbyists --
the real young hackers (oh and a few crackers too!) who just want to share
what they have learned.

I think it should run natively on Windows and runs on Linux via Wine. Just
launch the hugicode.exe -- of course with appropriate security caution, and if
you trust me, Hugi and scene.org to have no malicious intent. :)

Why is the hacking culture like this dead? It was still fairly alive
just 10 years ago, never mind 15 or 20 years ago. Even after so many years, it
still saddens me to look back into gems like this Hugi Special Digest from a
decade ago and see it forgotten and gone. Not just the contents or the release
itself, but the _computing culture_ which has died along with the demoscene.

[1]:
[https://www.scene.org/file.php?file=/mags/hugi/hugise01.zip&...](https://www.scene.org/file.php?file=/mags/hugi/hugise01.zip&fileinfo)

~~~
userbinator
_Why is the hacking culture like this dead?_

The demoscene is very much alive; if you look at places like pouet.net, there
are plenty of new demos released even in the sub-1k categories. The newest ones
there are from this month. However, you might be correct to say that it's
become less known amongst general computer users and programmers, and I think
the consumer-oriented nature of computers today (especially mobile devices) is
mostly to blame; users are restrained and actively discouraged from tinkering
with their machines software and hardware-wise, and isolated from knowledge by
many layers of abstraction and complexity. There's a big movement against
users sharing executables with each other and running them, and while the
security concerns are real, I think it's also had a chilling effect on the
hobbyists. The fact that antimalware software tends to detect packed demos as
suspicious/infected (false positives) doesn't help either. In addition, many
people probably found their way into the demoscene via the warez scene that it
grew from - and with the growing antipiracy concerns, that route is becoming
narrower too.

While I don't think the demoscene is currently "dead" per se, it's certainly
at risk of becoming even more of an obscure and fringe culture than it is now.

~~~
ChuckMcM
I agree with this. I think the fraction of people who are looking at computers
in this deep way is similar to what it has always been, but it is still a
small fraction. And as such its activities are swamped in the noise of other
things with the same name.

Perhaps part of the difference is that before (when RAM/CPU was
expensive/slow) you were forced to do this to make something impressive and
now we have an excess of compute and RAM. So to rekindle that challenge we set
an artificial limit.

------
davidst
This really brings back memories. 25 years ago I used (what was then) the
undocumented variation in the itoa() routine for the Borland C run-time
library. The purpose was to eliminate the need for a 16-byte table to generate
hex codes when base-16 output was desired. itoa() was a part of the printf()
library so this table became embedded in virtually every executable. Knocking
that out was a meaningful size optimization in those days.

------
Tuna-Fish
Note that all of the decimal arithmetic instructions are invalid in 64-bit
mode.

They had to scavenge opcode space from _somewhere_, and the BCD instructions
were deemed unnecessary.

~~~
userbinator
Unfortunately they (AMD) didn't reassign those opcodes for some other purpose
- they've just become completely invalid.

Instead, a whole row of useful general-purpose increment and decrement
instructions was replaced by 16 REX prefixes. A bit odd if you consider that
the number of BCD and segment prefix opcodes they made invalid would've been
more than enough to be assigned to the new REXes, and still maintain a
consistent encoding...
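
For reference, a minimal sketch of how the byte range 40h-4Fh reads in each mode, assuming the standard REX layout of 0100WRXB. The function names `decode_rex` and `legacy_meaning` are made up for illustration.

```python
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def legacy_meaning(byte):
    """What 40h-4Fh means outside 64-bit mode: one-byte INC/DEC of a register."""
    if 0x40 <= byte <= 0x47:
        return "inc " + REGS32[byte - 0x40]
    if 0x48 <= byte <= 0x4F:
        return "dec " + REGS32[byte - 0x48]
    return None  # not in the range taken over by REX


def decode_rex(byte):
    """What the same byte means in 64-bit mode: a REX prefix with four flag bits."""
    if byte & 0xF0 != 0x40:
        return None  # high nibble must be 0100
    return {
        "W": (byte >> 3) & 1,  # 64-bit operand size
        "R": (byte >> 2) & 1,  # extends ModRM.reg
        "X": (byte >> 1) & 1,  # extends SIB.index
        "B": byte & 1,         # extends ModRM.rm / SIB.base / opcode register
    }


for b in (0x40, 0x48, 0x4F):
    print(hex(b), legacy_meaning(b), decode_rex(b))
```

Four flag bits need exactly 16 encodings, which matches the 16 opcodes of the old INC/DEC row.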

------
2510c39011c5
Since both AAA (opcode 0x37) and AAS (opcode 0x3F) are 1 byte long, and a
sequential combination of them won't change the program semantics...another
bonus is that the ASCII characters for 0x37 and 0x3F ('7' and '?') are
printable characters...

This would make them a good replacement for a sequence of NOPs...

------
WalterBright
Anyone want to benchmark to see if these short sequences are faster?

~~~
dtech
The author himself states that they are slow. I reckon that these
instructions are translated into a whole slew of micro-ops. Rarely used
instructions like this often are not well optimized.

~~~
nkurz
I posted this before I looked up the timings. Some of the historically slow
instructions like BT/BTS/BTR are now fast. I was hoping these might be among
them, but unfortunately they are not.

For Haswell, Agner
([http://www.agner.org/optimize/](http://www.agner.org/optimize/)) says:

    
    
      AAA 2 uops, Latency 4
      AAS 2 uops, Latency 6
      AAD 3 uops, Latency 4
      AAM 8 uops, Latency 21

~~~
userbinator
They're relatively slow for a _single_ instruction, but if you compare the
uops they generate with the number of uops needed for an equivalent sequence
of multiple simple ("RISC style") instructions, I'd bet they're the same or
even slightly better - after all, an equivalent sequence of instructions would
need to perform the same operations the same way - by generating uops for the
execution units. The difference is that the RISC-style instructions take up
more cache and fetch and decode bandwidth, whereas these CISC instructions get
expanded into uops inside the decoder - so the speed of these instructions is
dependent upon how fast the decoder can emit the uops.

...and looking at the same Haswell instruction tables for the simpler sequence
of instructions, we find that:

AAA performs a compare and decides whether or not to add some constant, along
with some flag operations; 2 uops is what one compare and one add instruction
would already generate, plus you'd have to take into account a (possibly
mispredicted) conditional jump. If you find out a way to do it using a CMOV,
that alone is 2 uops and a latency of 2 cycles. AAS is similar except it's
doing a subtraction, but maybe there's something else that increases its
latency by another 2 clocks...
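
As a rough illustration of the "compare, then maybe add a constant" shape of AAA, here is a Python model based on the pseudocode in Intel's manual (AX + 106h when an adjustment is needed). The function name `aaa` and the tuple it returns are assumptions of this sketch, not anything from the article.

```python
def aaa(ah, al, af):
    """Model of AAA: adjust AL after adding two unpacked BCD digits.

    `af` is the auxiliary carry left by the preceding ADD.
    Returns (AH, AL, CF); AF ends up equal to CF.
    """
    if (al & 0x0F) > 9 or af:
        ax = (((ah << 8) | al) + 0x106) & 0xFFFF  # add 6 to AL and 1 to AH
        ah, al = ax >> 8, ax & 0xFF
        cf = 1
    else:
        cf = 0
    return ah, al & 0x0F, cf  # the high nibble of AL is always cleared


# 8 + 7 = 0Fh in binary, no auxiliary carry; AAA turns it into digits 1 and 5.
print(aaa(0, 0x0F, 0))
# 9 + 9 = 12h with auxiliary carry set; AAA turns it into digits 1 and 8.
print(aaa(0, 0x12, 1))
```

The branch here is the part a hand-written replacement would need a conditional jump or CMOV for.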

AAD is an 8-bit multiply followed by an addition and clearing of a register.
MUL/IMUL r8 generates 1 uop and has a latency of 3, ADD r, i is another uop
with a latency of 1, and clearing the register is another uop (no latency due
to register renaming, I'd guess.) This would be 3 uops and a latency of 4,
exactly the same as the single instruction.

AAM is an 8-bit divide; a DIV r8 generates 9 uops and has a latency of 22-25,
compared with 8 uops and a latency of 21 for AAM.
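
The divide is easy to see in a small model. This is a sketch of AAM's documented result (AH = AL / imm8, AL = AL mod imm8); the name `aam` and the `base` parameter are illustrative only, and flags are not modelled.

```python
def aam(al, base=10):
    """Model of AAM imm8: split AL into quotient (AH) and remainder (AL).

    `base` stands in for the immediate byte (10 in the standard encoding).
    A zero immediate raises a divide error on real hardware, mirrored here.
    """
    if base == 0:
        raise ZeroDivisionError("AAM with an immediate of 0 raises #DE")
    return al // base, al % base  # (AH, AL)


# Binary 57 becomes the unpacked BCD digits 5 and 7.
print(aam(57))
# With an immediate of 16 it splits a byte into its two nybbles.
print(aam(0xAB, base=16))
```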

So it would appear that Intel has pretty much made these instructions as fast
as they could for the microarchitecture, and glancing through the tables this
appears to have been true since the Pentium II (with two exceptions - the Atom
series, and the ill-fated NetBurst); e.g. in the Nehalem, we have

    
    
        AAA/AAS/DAA/DAS  1 uop, latency 3
        AAD 3 uops, latency 15(?)
        AAM 5 uops, latency 20
    

and the Pentium M has

    
    
        AAA/AAS/DAA/DAS 1 uop, no latency listed
        AAD 3 uops, latency 2
        AAM 4 uops, latency 15
    

The Atom is rather disappointing:

    
    
        AAA/AAS 13 uops, latency 16
        AAD 4 uops, latency 7
        AAM 10 uops, latency 24
    

The P4 is _extremely_ disappointing:

    
    
        AAA/AAS 4+27 uops, latency 90
        AAD 4+10 uops, latency 22
        AAM 4+22 uops, latency 56
    

AMD has historically been worse on the complex instructions, and although
they've improved a bit, are still behind Intel's latest; e.g. for Steamroller
the timings are

    
    
        AAA/AAS 10 uops, latency 6
        AAD 4 uops, latency 6
        AAM 10 uops, latency 15 (on par with 9 uops/latency 17-22 for DIV r8)
    

Edit: I benchmarked AAM vs DIV and AAD vs MUL+ADD (with a dependency chain, so
the real latencies are being tested instead of being hidden by something else)
on a Nehalem (i7 870) and for 500000 iterations,

    
    
        5250303 clock cycles for DIV
        5250258 clock cycles for AAM
        3818905 clock cycles for MUL+ADD
        3818907 clock cycles for AAD
    

So it's safe to say they're really just as fast.
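
For a per-iteration view of those totals, a quick bit of arithmetic in Python. The numbers are the ones quoted above; the helper names are made up, and this only post-processes the measurements rather than reproducing the benchmark.

```python
ITERATIONS = 500_000

# Total clock cycles reported above for each variant.
measured = {
    "DIV": 5250303,
    "AAM": 5250258,
    "MUL+ADD": 3818905,
    "AAD": 3818907,
}


def cycles_per_iteration(total, iterations=ITERATIONS):
    """Average cycles spent on one pass through the dependency chain."""
    return total / iterations


def relative_gap(a, b):
    """Difference between two totals as a fraction of the larger one."""
    return abs(a - b) / max(a, b)


for name, total in measured.items():
    print(f"{name:8s} {cycles_per_iteration(total):.3f} cycles/iteration")

print("DIV vs AAM gap:    ", relative_gap(measured["DIV"], measured["AAM"]))
print("MUL+ADD vs AAD gap:", relative_gap(measured["MUL+ADD"], measured["AAD"]))
```

That works out to roughly 10.5 cycles per iteration for DIV and AAM and roughly 7.6 for MUL+ADD and AAD, with the gaps within each pair far below anything measurement noise would let you distinguish.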

~~~
WalterBright
Very interesting. One reasonable benchmark would be uppercasing a string as
described in the article vs the usual way.

~~~
pbsd
As great as BCD instructions may be, they can't possibly compete with vector
instructions for that kind of operation.

