AVX Bitwise ternary logic instruction busted (arnaud-carre.github.io)
320 points by msephton 45 days ago | 68 comments



There is a simple way to get that immediate from the expression you want to calculate. For example, if you want to calculate the following expression:

    (NOT A) OR ((NOT B) XOR (C AND A))
then you simply write

    ~_MM_TERNLOG_A | (~_MM_TERNLOG_B ^ (_MM_TERNLOG_C & _MM_TERNLOG_A))
Literally the expression you want to calculate. It evaluates to the immediate, using the _MM_TERNLOG_A/B/C constants defined in the intrinsic headers, at least for gcc & clang:

    typedef enum {
      _MM_TERNLOG_A = 0xF0,
      _MM_TERNLOG_B = 0xCC,
      _MM_TERNLOG_C = 0xAA
    } _MM_TERNLOG_ENUM;
For MSVC you define them yourself.
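
A minimal sketch of how the trick plugs into the intrinsic (assuming AVX-512F and the gcc/clang constants above; the trailing & 0xFF just clamps the ~ results back to 8 bits):

    #include <immintrin.h>

    /* The expression from above, written on the magic constants.
       Evaluates to the 8-bit immediate 0x9F at compile time. */
    #define MY_TERNLOG_IMM ((~_MM_TERNLOG_A | \
        (~_MM_TERNLOG_B ^ (_MM_TERNLOG_C & _MM_TERNLOG_A))) & 0xFF)

    /* One vpternlogd computing (NOT a) OR ((NOT b) XOR (c AND a)) */
    __m512i my_op(__m512i a, __m512i b, __m512i c) {
        return _mm512_ternarylogic_epi32(a, b, c, MY_TERNLOG_IMM);
    }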


To take the magic away, write the constants in binary:

  A = 0b11110000
  B = 0b11001100
  C = 0b10101010

Each of the eight bit positions corresponds to one (A, B, C) input combination, so evaluating your expression on these constants computes the whole truth table at once.


The Amiga manual suggests normalizing to a conjunctive normal form.


The constant-math method above doesn't require that and works on denormalized expressions, too.

That said, the trick for turning four-or-more-argument bitwise operations into a series of vpternlogd operations has yet to be posted.


Suppose you have four boolean variables A, B, C, and D. You can calculate X as the result from A, B, C assuming that D is false, and Y as the result assuming that D is true. Then, a third ternary operation can switch between X and Y depending on D. This creates a tree of 3 operations, which I suspect is the best you can do in the worst case.

For five or more arguments, this naturally extends into a tree, though it likely isn't the most efficient encoding.
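
A rough sketch of that decomposition with AVX-512 intrinsics (made-up example function; the immediates are computed from A=0xF0, B=0xCC, C=0xAA as above, and 0xCA is the "a ? b : c" mux):

    #include <immintrin.h>

    /* Example 4-input function: f(a,b,c,d) = (a & b) ^ (c | d).
       Shannon-expand on d:
         x = f(a,b,c,0) = (a & b) ^ c  ->  imm (0xF0 & 0xCC) ^ 0xAA = 0x6A
         y = f(a,b,c,1) = ~(a & b)     ->  imm ~(0xF0 & 0xCC) & 0xFF = 0x3F */
    __m512i f4(__m512i a, __m512i b, __m512i c, __m512i d) {
        __m512i x = _mm512_ternarylogic_epi32(a, b, c, 0x6A);
        __m512i y = _mm512_ternarylogic_epi32(a, b, c, 0x3F);
        return _mm512_ternarylogic_epi32(d, y, x, 0xCA);  /* d ? y : x */
    }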


Oh, I thought the title was saying that the instruction doesn't work properly! (The article actually just explains how it works.)


Agreed on initial interpretation. Terrible title!


Amusingly, I had a third interpretation, which is "busted" as in too strong. Once the author started talking about the Amiga, though, I realized that's probably not what they meant (busted is a fairly modern gaming term, and I'd be surprised to see someone old enough to be familiar with the Amiga use it. Sorry to anyone who feels personally attacked by this description :P)


Is broken not cool enough anymore?


I’ve heard broken and cracked used in this way, usually in a game-mechanics balance context implying something is overpowered, but busted in that same context has a negative connotation to me for some reason, signifying something being underpowered or bugged, if that makes sense. I can’t even tell if this is ironic or unironic at this point.


My understanding of the "busted" phrase was that he'd caught the Intel folks being Amiga fans. Guess something was lost in translation to English from the author's native French.


Yeah, the thing is that it's just kind of the obvious way to design this instruction. So two different groups doing it the same way isn't really any evidence of copying.


My teenage self did not write "CRAP!" on that page of the hardware manual, but I stared at it for so long trying to figure it out.

In the end I did what pretty much everyone else did: found the BLTCON0 value for Bobs, straight copied it, and then pretended I never saw the thing.

I did however get an A+ in computational logic at university years later, so maybe some of the trauma turned out to be beneficial.


Yup. Learned so much from that page in the Amiga Hardware Reference Manual!


About the title: "ternary logic" usually means "logic with three truth values". But this piece covers a CPU instruction which handles all binary logic gates with three inputs.


The x86 instruction is named 'ternlog', and the intrinsic 'ternarylogic', though; so while perhaps unfortunate, the title is appropriate. (And even then, 'bitwise' already sort of takes the place of what 'ternary'-as-three-valued would; and 'ternary' is also very often three-input, so much so that 'a ? b : c' is often called the ternary operator. In fact ternlog can simulate that ternary operation, and the article is even about exactly that.)


Yeah, though the article describes 0xE2, which is 'b ? a : c'. 'a ? b : c' would be 0xCA.


It’s “ternary (logic instruction)”, not “(ternary logic) instruction”.


What’s logic with three truth values? Are you thinking of trinary and not ternary? I think of a ternary expression as one of the form “(a < b) ? 5 : 2” and that seems like the most common usage from a C++, JavaScript, and Python perspective. https://www.programiz.com/cpp-programming/ternary-operator

OTOH, it doesn’t matter what assumptions we make, right? Words and phrases often have multiple meanings. The word ternary means composed of three parts, and that fits here.


Ternary is the right word. Ternary numbers or logic have 0,1,2 or true, false, and something else. Ternary functions just have three arguments that could be binary logic, similar to a second order polynomial.

Ternary comes from the Latin root word ternarius (composed of three things), which contains the root ter- or tern-, an adverbial variation of the original root terti- meaning 3rd. Same with binary: it comes from the adverbial variation bi- or bin-. Bi- is cognate with the Greek dy- or di-. However, ternary is more correct than trinary because in "binary" we are not using the Latin root for the cardinal two, duo-; we use the adverbial version of two, bi-. Trinary just sounds better to some people because tri- and bi- rhyme. For a numerical system composed of four components we use quaternary, which also uses the adverbial form of 4, quater(n)-, instead of the cardinal form, quadr(i)-.

Personal opinion, saying trinary makes you sound like you don't have any formal education.


Haha don’t get me started! And people who say ‘base 3’ and 3-ary sound like complete morons.

You’re right; ternary is totally valid and does seem more common now that I look it up. Trinary is a valid synonym, and in the dictionary, and mentioned in the Wikipedia for ternary, and was in fact used by many people during my formal education. Your personal opinion sounds like it’s rather presumptuous, not correct, and designed to start a fight. You have the right to keep it to yourself… or change your mind.


That's not what ternary refers to here.

In C, '+' is a binary operator because it accepts two inputs. '?:' is a ternary operator because it accepts three inputs. It is usually referred to as the ternary operator because it is unique in C, but there's nothing fundamental about that.

vpternlogd implements all bitwise ternary operators - those operators that accept three inputs.


Agree, I was also confused on this point. I guess the name “evaluate a three-term binary expression” is less snappy though.


I'm also not sure what was "busted" here.


Is this similar to the Windows (since at least 3.1 I think?) BitBlt function, that takes an `op` parameter to decide how to combine the source, destination and mask?

I remember there are names for some of the codes like BLACKNESS for producing black whatever the inputs are, COPY (or something like that) to just copy the source to the destination etc. I always thought BLACKNESS and WHITENESS had a kind of poetic ring to them.

As far as I know (I think this is from Petzold), it's implemented in software, but the opcode is actually converted to custom assembly inside the function when you call it: a rare example of self-modifying code in the Windows operating system.


Yep. BitBlt originally used complex 16-bit "operation codes" that store the binary operations in reverse Polish notation. Then they added an "operation index" that stores the same information in a byte, like on the Amiga, which is shorter and more elegant. The coding is now redundant because each raster operation code contains both an operation index and an operation code. See https://devblogs.microsoft.com/oldnewthing/20180528-00/?p=98...
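
To illustrate with the standard wingdi.h constants (the high byte of a ternary ROP is the operation index over P=0xF0, S=0xCC, D=0xAA, i.e. exactly the Amiga minterm / ternlog scheme):

    #include <stdio.h>

    /* Standard raster-op codes from wingdi.h */
    #define BLACKNESS 0x00000042UL  /* index 0x00: always 0 */
    #define WHITENESS 0x00FF0062UL  /* index 0xFF: always 1 */
    #define SRCCOPY   0x00CC0020UL  /* index 0xCC: just S (copy source) */

    int main(void) {
        /* The low bytes redundantly encode the same op in RPN form */
        printf("BLACKNESS index 0x%02lX\n", (BLACKNESS >> 16) & 0xFF);
        printf("WHITENESS index 0x%02lX\n", (WHITENESS >> 16) & 0xFF);
        printf("SRCCOPY   index 0x%02lX\n", (SRCCOPY   >> 16) & 0xFF);
        return 0;
    }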


I'll point out that this is the same way that FPGAs implement arbitrary logic functions, as lookup tables (LUTs).


Basically CPUs, GPUs and FPGAs all converge to the crab equivalent of computation. They all expose the same capability with different areas of optimization.


All logic can be implemented with memory, and all memory can be implemented with logic (in some sort of feedback configuration).

Neither is typically done in practice except for specialized purposes like FPGAs and the instruction described in this article. High-speed registers and static RAM are sometimes built with logic but it's more common to build them directly with transistors than with gates.


Most but not all. Actel/Microsemi use a small tree of muxes and gates.


So does the 74181 ALU.


I don't think the 74181 is implemented with a LUT.

http://www.righto.com/2017/01/die-photos-and-reverse-enginee...


Head over to https://www.sandpile.org, and find VPTERNLOG on the 3-byte opcode page https://www.sandpile.org/x86/opc_3.htm and you will not only see Intel's apparent past plan for the variants with byte and word masking (AVX512BITALG2), but also the links from the Ib operand to the ternary logic table page https://www.sandpile.org/x86/ternlog.htm with all 256 cases.


Re the choice of function "E2" for the example in the docs: it's sort of the most basic, canonical boolean function on 3 inputs, named mux: A if B else C. It's universal -- you don't need to be an Amiga fan to pick it (though for all I know they might've been).


Another example of packing bitwise ops into an integer is win32's GDI ROP codes: https://learn.microsoft.com/en-us/windows/win32/gdi/ternary-...


I didn't have the official Amiga hardware manual, but instead the book "Mapping the Amiga". It said the same thing in a slightly more verbose way. I don't remember which minterms I used back then, but I think I managed to work things out from this book to do shadebobs, bobs, XOR 3D line drawing and other things.

The page in Mapping the Amiga: https://archive.org/details/1993-thomson-randy-rhett-anderso...


Nvidia SASS has a similar instruction too (LOP3.LUT).


It's nice that they're finally starting to "compress" the instruction space.

To take a related concept further, it would be nice if there were totally unportable, chip-superspecific ways of feeding uops directly, particularly with raw access to the unrenamed register file.

Say you have an inner loop, and a chip is popular. Let your compiler take a swing at it. If it's way faster than the ISA translation, add a special case to the fat binary for a single function.

Alas, it will probably never happen due to security, integrity, and testing costs.


Is it only security that stands in the way of compilers devising their own instructions? Certain architectures could potentially allow this, and most chips, even MCUs such as the ESP32 (which runs Xtensa LX7 cores), have microcode. It is apparent what the gains can be; what is the typical showstopper: security, or a proprietary microcode language?


The uops aren't smart or terribly proprietary IIRC. Aside from security/integrity/testing, the documentation would have to be automated, and releasing it early might tip off competitors to some stuff. I don't think you'd even need to roll your own ISA, you'd just issue uops as a native ultra-low-level ISA.

One issue (without further architectural help) is that you'd have to save and restore every register you touch, which is fine for an inner loop that's consequential enough. Another obstacle is any hardware that's shared among cores, such as the memory-coherency machinery.

In general, moving super-ad-hoc memory management into the compiler would suck for the compiler writers. Maybe some of your compiler can be an LLM?

If performance is key, I would totally deploy this if I have faith in my tests - and maybe a way for the user to turn it off just in case.


Thanks, very informative. The part about saving/restoring every register you touch I don't quite get. Is this a side effect of the uop execution, or did you mean something else?

I like the idea of an LLM assisting in the process: not because LLMs can reason better than a compiler, but because this particular translation task is something they should excel at. Perhaps we'll see some new compiler in 2025... There is some writing on it here https://arxiv.org/abs/2408.03408 and there is this https://medium.com/@zergtant/meta-releases-llm-compiler-a-co... both from 2024. The model is on HF.


Some systems in the past exposed the ability to directly generate microprograms, but these days it does not really provide much benefit, especially given how different the chips you run on can be.


As someone who fits the description rather too well (although neither my teenage nor current self would ever use a marker in the Hardware Reference, omg), this was really nice and satisfying to read.

In a weird sense it kind of helped me feel that, yes, I would probably understand stuff better if I tried re-learning the Amiga hardware today and also like I got a bit of it for free already! Is there such a thing as being protected from a nerd snipe? "This article was my nerd trench" ... or something. Thanks! :)


Holy cow. I remember reading that page in the Amiga reference manual, thinking it was utter crap and made up my own way of calculating the value (which worked, lol).


In fact, that means there is a dedicated AVX instruction for elementary cellular automata (https://en.wikipedia.org/wiki/Elementary_cellular_automaton).
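
Indeed: a Wolfram rule number is an 8-entry truth table over (left, center, right), which is exactly a ternlog immediate. A sketch, with the whole-register one-bit shifts left as assumed helpers (x86 has no single instruction for those):

    #include <immintrin.h>

    /* Assumed helpers, not real intrinsics: shift all 512 bits by one. */
    __m512i shift_left_1bit(__m512i v);
    __m512i shift_right_1bit(__m512i v);

    /* One generation of Rule 110 on 512 cells, one cell per bit.
       Bit ((l<<2)|(c<<1)|r) of the rule number is the new cell state,
       which matches the ternlog immediate convention. */
    __m512i rule110_step(__m512i cells) {
        __m512i l = shift_left_1bit(cells);
        __m512i r = shift_right_1bit(cells);
        return _mm512_ternarylogic_epi32(l, cells, r, 110);
    }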


This is an instruction I would like to add to RISC-V if it isn't there already (which, yeah, I know, isn't very RISC-like):

   movei (%r1),(%r2),(%r3),value
Move the contents of memory pointed to by r1 to the memory pointed to by r2, applying the boolean operator <value> with the memory pointed to by r3; then increment all three registers by 4 to point to the next word. There was something similar to this in the Intel 82786 graphics chip, which had a sort of minimal CPU part that could run simple "programs".
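
A C model of a single step of that (hypothetical semantics, following the description above; <value> taken as a 4-bit two-input truth table):

    #include <stdint.h>

    /* Combine *r1 and *r3 through the truth table 'value', store to *r2,
       then post-increment all three pointers by one 32-bit word. */
    static inline void movei_step(uint32_t **r1, uint32_t **r2,
                                  uint32_t **r3, unsigned value) {
        uint32_t a = *(*r1)++;
        uint32_t b = *(*r3)++;
        uint32_t out = 0;
        if (value & 1) out |= ~a & ~b;  /* row (a=0, b=0) */
        if (value & 2) out |= ~a &  b;  /* row (a=0, b=1) */
        if (value & 4) out |=  a & ~b;  /* row (a=1, b=0) */
        if (value & 8) out |=  a &  b;  /* row (a=1, b=1) */
        *(*r2)++ = out;
    }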

And yeah, I really enjoyed the blitter on the Amiga. It was a really cool bit of hardware.


Conditional move between registers is common. This is like masked load / stores with postincrement. Seems a reasonable instruction to me, if rather specialised.


What's the value (no pun intended) of such an instruction if "value" is an immediate and not register-based?

If it's an immediate then the compiler (human or machine) knows what the operation will be, so could just write that instead?


The value is that a two-instruction pipeline can do a blit at cache speed.


I totally get why you'd want mem-to-mem instructions, it just seemed more interesting in a blitting context (and others) to have the operation not be static but rather register-based. But perhaps in practice it matters not, been a long time since I wrote blitting code.

RISC-V already has what you want in terms of R-type instructions[1], i.e. "dest = src1 op src2" where "op" is effectively an immediate value, but being based on a load-store architecture it's only register-to-register of course.

Though I suppose ISA-wise there's nothing in the way of making an extension that introduces "M-type instructions" which act like the R-type instructions but are mem-to-mem instead. How much that messes with everything else I have no idea.

edit: ah, forgot about you wanting it to behave like movsb. Still, something that could be handled by the extension instruction.

[1]: https://github.com/riscv/riscv-isa-manual/releases/download/... (section 2.4.2)


But the “two instruction” part isn’t important there on most risc-v hardware, is it? That “memory to memory” instruction still routes the data through the CPU, so what does replacing a load instruction and a store instruction by a more complex but probably/likely slightly shorter load-store instruction bring if your CPU is much faster than your memory?


Couldn't every Boolean operation be "busted" as a lookup table?


Yes! But your lookup table will need 2^N bits for a function with N inputs. In this way you can easily enumerate all possible functions from N bits to 1 bit.

As a fun exercise, you can do this for all 2-bit -> 1-bit functions. There are only 16 of them, and most have very well-known names like "and" (LUT 1000) or "xor" (LUT 0110). Some don't depend on some of the inputs (e.g. LUT 1100 / 1010, which are "return A" and "return B" respectively) or even any of them (e.g. LUT 0000, which always returns 0).
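
A minimal sketch of that enumeration in C (bit ((a<<1)|b) of the LUT is the output for inputs (a,b), matching the MSB-first notation above, so "and" = 1000 = 8 and "xor" = 0110 = 6):

    #include <stdio.h>

    /* Evaluate a 2-input boolean function given its 4-bit truth table. */
    static int eval2(unsigned lut, int a, int b) {
        return (lut >> ((a << 1) | b)) & 1;
    }

    int main(void) {
        for (unsigned lut = 0; lut < 16; lut++) {  /* all 16 functions */
            printf("LUT %2u:", lut);
            for (int a = 0; a <= 1; a++)
                for (int b = 0; b <= 1; b++)
                    printf("  f(%d,%d)=%d", a, b, eval2(lut, a, b));
            printf("\n");
        }
        return 0;
    }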


The epiphany for me was that any boolean logic can be replaced by memory.

https://www.youtube.com/watch?v=BA12Z7gQ4P0 (ben eater)

I mean, it is obvious in retrospect, sort of along the lines of memoizing a function. But it was mind-blowing when I first saw that.

Not that at the hardware level memory is actually any simpler than whatever boolean logic it is pretending to be, but it feels simpler and is easily available in large quantities.


Do compilers actually output this instruction?

So many super-clever instructions are next to impossible for compilers to automatically use.



It has actually been supported since .NET 8.


This instruction at least is very trivial to emit: any sequence of bitwise logic ops over three inputs always compiles to one ternlog. This isn't a super-clever instruction at all; it's more of a super-boring one. If anything, it simplifies codegen compared to having to emit a varying number of instructions for three-operand logic and having to find the best sequence.


> The Amiga blitter user manual didn’t help much either. The “Amiga Hardware Reference Manual” from 1989 tried to explain minterm calculation using confusing symbols, which frustrated many young demo makers at the time.

That is super-normal logical calculus that any worthwhile CS degree teaches.

Granted, probably not what a teenager without access to a BBS, or Aminet, would be able to figure out.


Great little article! Thank you.


It looks like someone paid attention in their undergraduate Discrete Math class.


If you want to calculate the minterms, why don't you just use a K-map?


You don't even need to do that. The 1s and 0s of the 8-bit word are the 1s and 0s of a Karnaugh map.


It’s fundamentally just a lookup table.


Any pure function is just a glorified lookup table.

(Yes, I know, finite input domain etc.)


> an obscure instruction

Come on, vpternlog* is not obscure. It subsumes _all_ bitwise instructions, even loading the constant (-1) into a register.
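
For instance, deriving a few immediates from the usual A=0xF0, B=0xCC, C=0xAA constants:

    a & b & c  ->  0xF0 & 0xCC & 0xAA  =  0x80
    a | b | c  ->  0xF0 | 0xCC | 0xAA  =  0xFE
    a ^ b ^ c  ->  0xF0 ^ 0xCC ^ 0xAA  =  0x96
    all ones   ->  0xFF   (vpternlogd zmm0, zmm0, zmm0, 0xFF yields -1)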


Not if, like most people, you still aren't using a CPU with AVX-512 support. And I don't recall ever seeing it in compiler output in any case. It's not like boolean operations on three variables occur very frequently in most programs, especially (EDIT: this last part apparently isn't the case) not operations that can't be decomposed into a pair of two-variable operations with no worse performance.


As far as everything on uops.info goes, ternlog has the same throughput and latency as the two-operand logic instructions everywhere (with the mild exception of Zen 4 where it goes from 0.50 to 0.56 cycles/instr; which also shows as having 2-cycle latency to one operand but I think that might be measurement error), so it's always bad to decompose ternlog into two two-operand logic ops.



