Really great algorithm regardless!
Visually correct alpha blending requires all of the pixels being blended to be known at blend time, but the way real-time graphics pipelines work is too lossy to achieve that: intermediate fragments are discarded at each step.
In current game engines, when a semi-transparent object is rendered, it can only know the previous blended state of the pixel it is writing into. So instead of being able to do something like `blend(p1, p2, p3, p4, ...)`, you work around it with a blend model that looks like `blend(..., blend(p4, blend(p3, blend(p2, p1))))`. A common (partially working) solution uses a linked list of pixel colors, encoded in a texture (src: NVidia R&D), to preserve these colors until the final image is rendered. It is still costly in both computation and space. There are other solutions, but each has its own unavoidable drawbacks.
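To make the nesting concrete, here's a minimal sketch of that iterated blend, assuming straight-alpha "over" compositing and fragments already collected per pixel (e.g. out of a per-pixel linked list); the names are illustrative, not from any engine:

```python
def over(dst, src):
    """Composite src over dst. Each operand is ((r, g, b), alpha)."""
    (dr, dg, db), da = dst
    (sr, sg, sb), sa = src
    out_a = sa + da * (1.0 - sa)
    if out_a == 0.0:
        return ((0.0, 0.0, 0.0), 0.0)
    out_rgb = tuple((sc * sa + dc * da * (1.0 - sa)) / out_a
                    for sc, dc in zip((sr, sg, sb), (dr, dg, db)))
    return (out_rgb, out_a)

def blend_fragments(fragments):
    """fragments: list of (depth, ((r, g, b), alpha)).

    Sorted far-to-near, then folded one blend at a time, i.e. the
    blend(..., blend(p2, p1)) shape described above.
    """
    result = ((0.0, 0.0, 0.0), 0.0)
    for _, frag in sorted(fragments, key=lambda f: f[0], reverse=True):
        result = over(result, frag)
    return result
```

The point of the per-pixel list is exactly that `blend_fragments` needs *all* fragments for the pixel before it can sort and fold them, which the fixed-function blend unit can't do on its own.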
So I figured, as a newbie, maybe this hash table could store pixel location as the key and a linked list of colors as the value to mitigate the computation cost of NVidia's algorithm. The space cost is still there, though. Or maybe I'm entirely wrong.
Especially at around 15:06 they make a great argument that above a certain limit, the net effect of additional pixels is minimal (because they will be blended behind a significant fraction of the whole picture, so just mushing together a few semi-transparent pixels at the _back_ of the stack is reasonable).
The problem is getting each pixel's data as small as possible so the texture doesn't overflow the limits of texture formats in GPUs - mobile devices typically have a limit of 128 bits per pixel. Also, mobile devices don't really have enough GPU RAM to do this kind of order-independent transparency on the 4x anti-aliased framebuffer.
After solving the above problems you're ready to spill some pixels into the storage buffer (again, because most pixels on the screen don't have 8 semi-transparent pixels in them, so spilling to overflow for the few ones that do may be faster than trying to keep it all in the texture). Now you're ready to do something like this technique.
> The table uses power-of-two sizes instead of prime numbers because pow2/AND masking is a fast single instruction and the modulo operator is much slower.
I was under the impression that the idea of prime number hash table sizes being faster was shredded a couple decades ago? Maybe by Bob Jenkins?
These days, the preferred solution to this is usually to just use a better hash function without periodicity issues, so the bucketing no longer matters.
“The best hash table sizes are powers of 2. There is no need to do mod a prime (mod is sooo slow!). If you need less than 32 bits, use a bitmask. For example, if you need only 10 bits, do h = (h & hashmask(10));”
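For a power-of-two table size the two are interchangeable; the mask is just the cheap single-instruction form Jenkins describes:

```python
def bucket(h, bits):
    # Equivalent to h % (1 << bits), but a single AND on the hot path.
    return h & ((1 << bits) - 1)
```

This equivalence only holds for power-of-two sizes, which is exactly why the table size is constrained that way.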
If you have more than 32GB of RAM (which is not rare nowadays) you can directly map all 32-bit values into a 32GB chunk of memory and be probably faster than this GPU solution on regular CPUs.
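That's direct addressing: one slot per possible key, so no hashing and no collisions at all. The 32 GB figure is 2**32 keys times 8-byte values; a sketch of the same idea with 16-bit keys (so it actually fits in memory for a quick test), using 0 as the "empty" sentinel:

```python
from array import array

KEY_BITS = 16  # 2**16 slots of 8 bytes each; the comment's scheme uses 32

# One u64 slot per possible key, zero-initialized ("empty").
table = array("Q", [0]) * (1 << KEY_BITS)

def insert(table, key, value):
    table[key] = value  # O(1), no probing, no collisions

def lookup(table, key):
    return table[key]

insert(table, 12345, 999)
```

Whether this beats the GPU table in practice depends on access pattern: a 32 GB array is far bigger than any cache, so random lookups are one DRAM miss each, while the hash table's win is batching thousands of them in parallel.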
Can someone comment on what sort of latency would be involved in something like this? For example, the latency to send data to a PCIe device, the latency to react to and manipulate the data in a warp (NVidia?), and the latency to send the data back to the host system?