
Show HN: A Simple GPU Hash Table - nosferalatu75
http://nosferalatu.com/SimpleGPUHashTable.html
======
senderista
You might want to check out Bidirectional Linear Probing, which can handle
load factor >90% with good performance. I implemented and benchmarked it here:
[https://github.com/senderista/hashtable-benchmarks](https://github.com/senderista/hashtable-benchmarks). Storing
integer keys as their hash values (using a reversible hash function) also
helps performance. I implemented a dozen or so 32- and 64-bit reversible
integer hash functions in the same repo.

~~~
nosferalatu75
That's great work. Thanks! Using a reversible hash function is a good idea.

~~~
senderista
FYI here's a concurrent version of BLP (I haven't implemented it but I stole
their simplified insert algorithm):
[https://pdfs.semanticscholar.org/6d6c/ca94c57d408c0b1164d6ff...](https://pdfs.semanticscholar.org/6d6c/ca94c57d408c0b1164d6ff7faea25635fedb.pdf).
Also see a compact version (concurrent Cleary hash table) by the same authors:
[https://www.researchgate.net/profile/Alfons_Laarman/publicat...](https://www.researchgate.net/profile/Alfons_Laarman/publication/221274114_A_Parallel_Compact_Hash_Table/links/00b7d527256971ae55000000/A-Parallel-Compact-Hash-Table.pdf).

------
diegoperini
I wonder if this technique could be used to implement order-independent
transparency on the GPU in a performant manner.

Really great algorithm regardless!

Explanatory Edit:

Visually correct alpha blending requires all blended pixels to be known at
blend time, but the way real-time graphics pipelines work is too lossy to
achieve that: intermediate pixels are lost at each fragment step.

In current game engines, when a semi-transparent object is rendered, it can
only know the previous blended state of the pixel it is writing into. So
instead of being able to do something like `blend(p1, p2, p3, p4, ...)`, you
work around it with a blend model that looks like `blend(..., blend(p4,
blend(p3, blend(p2, p1))))`. A common, partially working solution uses a
linked list of pixel colors encoded in a texture (src: NVIDIA R&D) to
preserve these colors until the final image is rendered. It is still costly
in terms of computation and space. There are other solutions, but each has
its own unavoidable drawbacks.

So, as a newbie, I figured maybe this hash table could store a pixel
location as the key and a linked list of colors as the value, to mitigate
the computation cost of NVidia's algorithm. The space cost is still there,
though. Or maybe I'm entirely wrong.

~~~
sounds
Intel has a great set of papers; the idea is to store a few of the semi-
transparent pixels in a texture, then overflow to a storage buffer as needed.

[https://www.gdcvault.com/play/1014547/Adaptive-Order-Indepen...](https://www.gdcvault.com/play/1014547/Adaptive-Order-Independent-Transparency-A)

Especially at around 15:06 they make a great argument that above a certain
limit, the net effect of additional pixels is minimal (because they will be
blended behind a significant fraction of the whole picture, so just mushing
together a few semi-transparent pixels at the _back_ of the stack is
reasonable).

The problem is getting each pixel's data small enough that the texture
doesn't exceed the GPU's texture-size limits - mobile devices typically cap
textures at 128 bits per pixel. Mobile devices also don't really have enough
GPU RAM to do this kind of order-independent transparency on a 4x
anti-aliased framebuffer.

After solving the above problems you're ready to spill some pixels into the
storage buffer (again, because most pixels on the screen don't have 8 semi-
transparent pixels in them, so spilling to overflow for the few ones that do
may be faster than trying to keep it all in the texture). Now you're ready to
do something like this technique.

------
dahart
Awesome!

> The table uses power-of-two sizes instead of prime numbers because pow2/AND
> masking is a fast single instruction and the modulo operator is much slower.

I was under the impression that the idea of prime number hash table sizes
being faster was shredded a couple decades ago? Maybe by Bob Jenkins?

~~~
amalcon
The idea behind prime number sizes is that you can use a hash function with
periodicity issues, which would normally cause collisions. E.g. in C++, it's
totally standard-compliant for std::hash<size_t>::operator() to be the
identity function. The prime number of buckets will tend to mitigate this.

These days, the preferred solution to this is usually to just use a better
hash function without periodicity issues, so the bucketing no longer matters.

~~~
dahart
Exactly. I found the link I was thinking of: Bob Jenkins wrote about this in
his famous 1997 Dr. Dobb's article on hashes. Ironically, I was taught in
college, after 1997, that hash table sizes are ideally prime, and didn't
find this article until later - reading it was shocking at the time. I guess
best-practice knowledge takes time to percolate.

“The best hash table sizes are powers of 2. There is no need to do mod a prime
(mod is sooo slow!). If you need less than 32 bits, use a bitmask. For
example, if you need only 10 bits, do h = (h & hashmask(10));”

[https://burtleburtle.net/bob/hash/doobs.html](https://burtleburtle.net/bob/hash/doobs.html)

------
drfuchs
Concerning: “...as the table is linearly searched, the slot index must be
wrapped at each slot. The cost of a modulo operation at each slot quickly adds
up.” As you linearly search, you don’t need to re-modulo; you just have to set
the index to zero if it has reached the end of the table, so it’s a simple
compare-and-conditionally-assign operation.

------
sprash
A 32-bit key and 32-bit value are some serious constraints. The whole point of
hash tables is usually that you have variable length keys or values.

If you have more than 32GB of RAM (which is not rare nowadays) you can
directly map all 32-bit values into a 32GB chunk of memory and be probably
faster than this GPU solution on regular CPUs.

~~~
dmkolobov
Even so, GPU-based hash tables are useful for speeding up various kinds of
simulations for video games and animation:

[https://wickedengine.net/2018/05/21/scalabe-gpu-fluid-simula...](https://wickedengine.net/2018/05/21/scalabe-gpu-fluid-simulation/)

[https://developer.nvidia.com/gpugems/gpugems3/part-i-geometr...](https://developer.nvidia.com/gpugems/gpugems3/part-i-geometry/chapter-7-point-based-visualization-metaballs-gpu)

------
kristofarkas
How does the performance compare to other implementations, either GPU-based
or not?

------
hak8or
My understanding is that PCIe devices incur very high latency for getting
data to and from them, due to the PCIe device itself and the need to set up
a DMA descriptor. Sadly, this article doesn't look into that.

Can someone comment on what sort of latency would be involved in something
like this? For example, the latency to send data to the PCIe device, the
latency to react to and manipulate the data in a warp (NVIDIA?), and the
latency to send the data back to the host system?

------
malux85
This is just mind blowingly cool

------
michael-ax
That is a most awesome thing you did there. Thank you very much for sharing!

------
winterismute
Isn't the global atomic op a bad idea?

~~~
Impossible
It's not terrible, but it might be possible to get better performance within
a warp using warp-level primitives.

