
Hash functions can be as simple as a single modulo.


As they said though, the hash function on a string type requires looking at every element of the string. Modulo has also historically been slow. So they still have to deal with O(n) operations, where n is the length of the input string. If they can improve the memory-hopping problem associated with lists / graph structures (which they claim to have done in their library), then a trie can be fast enough, which is what they observed. Combined with their claim that 90% of queries against the trie are misses, you exit early a lot, whereas with the hash map strategy you always have to compute the hash over the whole input string.


The hash does not need to be computed over the whole string. I pointed this out in my other comment, but just as an example: a hash function could be as simple as XORing the first 16 and last 16 bits of the string, then indexing a 2^16 array. That means hashing is two pointer offsets and an XOR (no modulo required). If there are 100 strings that need to be removed, then ~99% of rejections will be a very fast O(1).
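A minimal Rust sketch of that idea (the bucket layout and names here are my own, purely illustrative, and it assumes keys are at least 2 bytes long):

    // Hash = first 2 bytes XOR last 2 bytes, used directly as an index
    // into a 2^16-entry bucket table, so no modulo is needed.
    fn tiny_hash(s: &[u8]) -> usize {
        // Assumption: keys are at least 2 bytes long.
        let head = u16::from_le_bytes([s[0], s[1]]);
        let tail = u16::from_le_bytes([s[s.len() - 2], s[s.len() - 1]]);
        (head ^ tail) as usize
    }

    struct TinySet {
        buckets: Vec<Vec<Vec<u8>>>, // 65536 buckets of candidate keys
    }

    impl TinySet {
        fn new(keys: &[&[u8]]) -> Self {
            let mut buckets = vec![Vec::new(); 1 << 16];
            for k in keys {
                buckets[tiny_hash(k)].push(k.to_vec());
            }
            TinySet { buckets }
        }

        fn contains(&self, s: &[u8]) -> bool {
            // With ~100 keys spread over 65536 buckets, most lookups hit
            // an empty bucket and reject in O(1); the rest fall back to a
            // full byte comparison.
            self.buckets[tiny_hash(s)].iter().any(|k| k.as_slice() == s)
        }
    }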

And in the case of a match, a fast hash + memcmp will be way faster than a trie traversal. In fact, according to the trie-hard github readme, std::HashMap is already much faster than their trie implementation when searching for a match.


> As they said though, the hash function on a string type requires looking at every element in the string.

Maybe for the default hash function. As another commenter pointed out, your data may make the following hash very effective: s[0] + s[len(s)//2] + s[-1], which would be very fast. The point is that spending a day seeing whether such a hash exists can be worth it.
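A quick Rust sketch of that kind of sampled hash (illustrative only; whether it distributes well depends entirely on your data):

    // Sample three bytes (first, middle, last) instead of hashing the
    // whole string, so the cost is O(1) regardless of key length.
    fn sampled_hash(s: &[u8]) -> u64 {
        // Assumption: s is non-empty; a length-1 string just samples
        // the same byte three times.
        s[0] as u64 + s[s.len() / 2] as u64 + s[s.len() - 1] as u64
    }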


Or as simple as using the hardware-accelerated CRC32 that we have in our x86 CPUs.

Last time I checked, CRC32 worked surprisingly well as a hash.
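For reference, a rough sketch of using the SSE4.2 CRC32 instruction as a string hash in Rust (byte-at-a-time for simplicity; a real version would feed it 8-byte chunks with _mm_crc32_u64):

    #[cfg(target_arch = "x86_64")]
    fn crc32_hash(s: &[u8]) -> u32 {
        use std::arch::x86_64::_mm_crc32_u8;
        // Runtime check so the intrinsic is only used where SSE4.2 exists.
        assert!(is_x86_feature_detected!("sse4.2"));
        let mut crc: u32 = !0;
        for &b in s {
            // SAFETY: guarded by the sse4.2 feature detection above.
            crc = unsafe { _mm_crc32_u8(crc, b) };
        }
        !crc
    }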


Hah, neat.

The 'weird' instructions can often be 3-4 orders of magnitude slower than arithmetic instructions, though I doubt that matters here.


CRC32 is the same speed as an integer multiply, going all the way back to Nehalem (2008). 3 cycles latency, and you can start a new one each cycle (or more than one, on Zen 5).


Sure, in the same way SIMD instructions get a linear speedup in theory.

If you Google the words "CRC32 is slow", you can see hundreds of people complaining about this.



