Some time ago I wrote a little function to calculate the nth point's coordinates on a Hilbert curve. It uses pext and pdep intrinsics to avoid loops, it's completely branchless.
This is very good, but I wish it was more up to date:
From the link:
"Finally I optionally challenge the methods gather/scatter and sheep and goats utilizing PEXT and PDEP. I do not know how fast these methods will be since there is not yet any real hardware and no timing tables available. I guess that they are not faster than one AND and one SHL/SHR/ROL instruction, and so I optimize to masking and shifting if possible."
PEXT and PDEP have been in everything since Haswell (2013) and are latency 3, throughput 1. His guess is actually correct.
Another approach to bit permute is to use the byte-level permutes in SSE/AVX2/AVX512. It's slightly ponderous, especially given the fact that AVX512 won't have byte permute before the VBMI instruction extension. But there will be a short instruction sequence that can permute up to a 64-bit word (or even across a pair of 64-bit words) via a trip through SIMD. Smaller scale permutes can be done with PSHUFB now, of course, and sufficiently clever people can figure out how to compose these smaller scale permutes into bigger ones.
https://github.com/leni536/fast_hilbert_curve
https://github.com/leni536/fast_hilbert_curve/wiki/How-effic...
Maybe I will write an explanation how it works and implement the inverse function similarly.