A quick and practical “MSI” hash table (nullprogram.com)
73 points by rcarmo on Aug 9, 2022 | 70 comments



C means rolling your own quick and practical hash table in 250 lines without tests, whenever you need it. Kinda answers the question of why people are looking for a better alternative to C that was discussed here recently. I wish we had a name for this, like "required non-innovative effort penalty" for using C, so that we can abstract some of these discussions that end up happening every time C is mentioned.


There is a phrase: accidental complexity.

Systems can be broken down into two sets of complexity: Accidental and Essential. Accidental complexity is there because of something to do with the system, but is not, strictly, inherent to the problem being solved. That would be essential complexity.

Fred Brooks' essay "No Silver Bullet": http://worrydream.com/refs/Brooks-NoSilverBullet.pdf

And a good paper on the topic is Out of the Tar Pit (http://curtclifton.net/papers/MoseleyMarks06a.pdf) and a good writeup on that is here: https://blog.acolyer.org/2015/03/20/out-of-the-tar-pit/


You added two definitions but I think you forgot to make a point?


> I wish we had a name for this, like "required non-innovative effort penalty"


It's not the same thing though, accidental/incidental complexity is a much broader term.


That’s one of the few places that genuinely itch in my C experience, but I haven’t seen it solved anywhere, even partially.

The hard part is, there are myriad tiny variations of each data structure, especially the simplest ones (even prehistoric Knuth vol 1 lists a dozen versions of a linked list); the C practice is generally to write the most appropriate one in each case (because there is no reason to value a premade one if you don’t have any, and because manual implementation makes you want and appreciate those small affordances that pointer-to-head / dummy-head / cyclic pointer-to-head / cyclic dummy-head / etc. give you). This does not always work well, both because it forces you to optimize with little data, and because natural laziness can lead you to avoid a marginally more complex data structure for longer than you really should (and possibly have the laziness-induced inefficiency be lost and forgotten).
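
For illustration, one of those variants, a cyclic dummy-head (sentinel) list in the spirit of the Linux/BSD list macros, looks roughly like this (a sketch, not anyone's exact code):

    typedef struct Node { struct Node *next, *prev; } Node;

    static void list_init(Node *head)  { head->next = head->prev = head; }
    static int  list_empty(Node *head) { return head->next == head; }
    static void list_insert(Node *head, Node *n) {  /* push front */
        n->next = head->next; n->prev = head;
        head->next->prev = n; head->next = n;
    }
    static void list_remove(Node *n) {
        n->prev->next = n->next; n->next->prev = n->prev;
    }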

But it does give working with data structures in C a certain feel and style that I haven’t seen reproduced with reusable implementations. When a language uses those, it’s either by choosing a small set of structures and running with them (Lua, Awk, Python, perhaps Tcl as a more interesting example) or by parametrizing everything to the hilt (C++ and D). Even in C, the language more steers you towards rolling your own than forces it (cue qsort & bsearch, BSD and Linux macros, etc.). Yet the small set is frequently too small, and the One Implementation to rule them all is bloated far beyond the mental capacity of a single person and still too inflexible.

Honestly, if there is something I’d like from a C successor, it’s reproducing the hand-rolled-data-structure style without the actual hand rolling (in most cases), and that’s as much a library feature as a language one. I can name a lot of other things I’d like to see in my low-level language, but if I had to choose one improvement, that would be it. So far, I haven’t seen a convincing attempt.


The Zig standard library includes some commonly used data structures, and the perfect time to add your favorites or to improve the APIs of the existing ones is before 1.0 :)


One of the things Gankra (I think it was her?) did a long time ago for Rust before 1.0 was pull out all the more esoteric data structures which had lived in its standard library and banish them to separate crates. Does somebody need an LRU cache? Yes. But unlike, say, a HashSet, most people will never need one even if you explain what it's for.

The standard library has a high maintenance cost. It's only worth it to pay that cost for data structures people need all the damn time, and for vocabulary types. Fewer people will ever need ControlFlow than HashSet, but having it in the standard library means everybody is agreed that's ControlFlow, there's no risk two popular things invent their own. If two popular Rust crates invent their own LRU cache nobody's day is ruined.

Because C is really old, it's understandable that K&R went really light on the standard library. I can't forgive their terrible string, or the garbage built-in array type, but I can totally imagine in the 1970s it's a lot to ask that the standard library should have say, hash tables. Today it isn't and your language should have them.


Blog post idea: how to build a zig library with an extern C API that provides access to a Zig data structure, build it, link it in a C program and use it from C.


On the other extreme there are environments like Rust/Cargo and JS/Node, where you pull in the hash table and you get 20 additional packages of untested who-knows-what that nobody asked for and that have lately been vectors for malware. Shall we call it 'non-required pseudo-innovative non-effort penalty'?


This is counterfactual information; you might want to double-check it if you use it to make decisions.

All you need in Rust is the standard library, which is well tested and doesn't pull in any crates.

In JS a hash table is built into the language and doesn't require any packages.


Can you name a modern language that doesn’t ship with a map collection type in its standard library?

Rust does. C++ does. Swift does. It’s basic table stakes.


FNV-1a seems like a questionable choice. You can get higher quality and throughput for short strings from a hash that uses AES instructions or wyhash, or for long strings maybe from xxhash.

> If your input is made of integers, or is a short, fixed length, use an integer permutation, particularly multiply-xorshift. It takes very little to get a sufficient distribution. Sometimes one multiplication does the trick. Fixed-sized, integer-permutation hashes tend to be the fastest, easily beating fancier SIMD-based hashes, including AES-NI. For example:

This is... probably not true? If you want to hash exactly 16 bytes, you can do it in 4 instructions on x64 like this:

  /* requires <immintrin.h>; compile with -maes */
  const __m128i chunk0 = _mm_loadu_si128((const __m128i *)data);
  const __m128i h10 = _mm_aesenc_si128(chunk0, key);
  const __m128i h20 = _mm_aesenc_si128(h10, key);
  const uint64_t hash = (uint64_t)_mm_extract_epi64(h20, 0);
The total latency of these instructions is a bit higher than the total latency of the 8 instructions in the provided uuid1_hash, but if you want to hash more than 1 value, you may find that hashing more values allows the 4-instruction approach to achieve higher throughput than the 8-instruction approach. Even 2 values at a time should do it. For example on Haswell the combined latency seems to be 16 cycles for load+aesenc+aesenc+store vs. 12 for 2 shifts, 2 multiplies, 2 xors and 2 adds (the adds and multiplies are arranged in such a way that lea may not be used). I tried uuid1_hash on godbolt.org and it actually emitted a bunch of mov's in addition to what I expected.

I am of course not an expert in these matters, and the fine author is invited to publish a benchmark or to demonstrate the superiority of his hash table and his method for hashing short strings at https://highload.fun/tasks/2


> You can get higher quality and throughput for short strings from a hash that uses AES instructions or wyhash, or for long strings maybe from xxhash.

What would you consider a short string? The latency of AESENC is 5 if I remember correctly. In that time you can FNV-1a a "short" string.
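
For reference, a byte-at-a-time FNV-1a like the one being discussed is roughly the following (64-bit variant; the constants are the standard FNV offset basis and prime):

    uint64_t fnv1a(const void *buf, size_t len) {
        const unsigned char *p = buf;
        uint64_t h = 0xcbf29ce484222325u;   /* offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 0x100000001b3u;            /* FNV prime */
        }
        return h;
    }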

For longer strings I agree that WyHash and xxHash are better. I have a benchmark of both further down on FastHash's github page[1]. Of course, that is only a measure of speed, but both are high quality as well[2].

Off the top of my head, t1ha, falkhash, meowhash and metrohash use AES-NI and none of them are particularly fast on short inputs, and at least two of them have severe issues, despite guarding against lots of vulnerabilities, which your construction does not.

You are right that bulk-keying with SIMD outperforms any kind of scalar hashing one value at a time. I've built a database that does just that (it requires keys to be available in bulk - like when doing bulk inserts) using SIMD, and the latency becomes negligible.

[1] https://github.com/Genbox/FastHash

[2] https://github.com/rurban/smhasher


When I recently went shopping for fast hashes for short strings, I settled on wyhash, but ahash[0] seemed like it would have been better if I had bothered to port it from Rust.

> In that time you can FNV-1a a "short" string.

Not if you read it one byte at a time like in TFA!

It looks like the best FNV for short strings in smhasher[1] is comparably fast to ahash[2] on short strings.

> From the top of my head, t1ha, falkhash, meowhash and metrohash are using AES-NI and none of them are particularly fast on short inputs, and at least two of them have severe issues, despite guarding against lots of vulnerabilities, which your construction does not.

For issues like reading past the ends of buffers and discarding the extra values, it would be nice if programmers could arrange to have buffers that could be used this way. I posted a thing for hashing strings of a fixed length though, to compare with the thing for hashing strings of a fixed length in TFA.

Edit: It looks like your fastest FNV[3] has some issues where it repeatedly increments `p` but reads relative to `uPtr` which is not being incremented. It also seems to expect C# to have some unusual properties like (*uPtr + 4) being parsed as (*(uPtr + 4)) and being equivalent to uPtr[1] when uPtr is a pointer to a type that's 4 bytes wide (I don't know if C# actually has these properties). This may explain why your fastest FNV is much, much faster than the fastest FNV at smhasher.

FNV1a-YT also employs a good trick to avoid the issue of long data dependency chains made of multiplies: it just has three of them and only does a multiply for half of the input words in each of those. While this is fine, I'm not sure if it's really FNV-1a anymore.

[0]: https://github.com/tkaitchuck/aHash/blob/master/src/aes_hash...

[1]: https://github.com/rurban/smhasher/blob/master/doc/FNV1a_YT....

[2]: https://github.com/rurban/smhasher/blob/master/doc/ahash64.t...

[3]: https://github.com/Genbox/FastHash/blob/master/src/FastHash/...


> It looks like your fastest FNV[3] has some issues where it repeatedly increments `p` but reads relative to `uPtr` which is not being incremented.

That is entirely possible. Some hashes, like the FNV-YoshimitsuTRIAD variant, are not fit for real-world use and lack test vectors to ensure that ports like mine are correct. I'm slowly but surely working my way through the C/C++/Rust/Go implementations out there and verifying them against my implementation to ensure it is correct. Luckily, a few authors provide extensive test vectors[1].

Thanks for the feedback, it is very much appreciated.

> FNV1a-YT also employs a good trick [...] While this is fine, I'm not sure if it's really FNV-1a anymore.

It was made (by Sanmayce) to optimize for instruction-level pipelining, and uses the fact that modern CPUs have multiple execution ports. But due to those changes, it is not compatible with FNV1a anymore.

The trick of reading in stripes is employed by many of the fastest hashes. It is kinda funny to see how one author prefers a switch case over for loops, where others prefer while loops. The differences can sometimes have a big impact on what optimizations the compiler decides to use.

[1] https://github.com/Genbox/FastHash/blob/master/src/FastHash....


It's nice, but I think the days of writing your own standard data structures are probably best left behind us.

If we use standard libraries, we can benefit from extreme optimization, both in the library and in the compiler (e.g. a smart enough compiler could see you setting an entry in a map and immediately reading the same entry, and it could optimize out the read).

The data structures can be optimized for CPU architectures and cache sizes for example - something that mini implementations like this would never bother with.

Likewise, a smart compiler could see that you never call "delete" on entries in a data structure, and use an implementation that is more efficient, yet doesn't support deletes.


> If we use standard libraries, we can benefit from extreme optimization

I did a quick test[0] about that since it is a common thing people claim ("use the standard library, it has been battle-tested and optimized to the heavens", etc). Since the article is mainly just about the hash part of the hash table, i only tried with hash sets instead of full hash maps, though the difference is trivial.

I generated a 256MB file with 67108864 32bit unique random integers and tried to use both C++'s unordered_set as well as a custom set built using the method described in the site (i also used the 32bit integer hash from a link in the article). I used C++ because it is the closest to C that has a standard library that provides a hash set.

The results on my machine (Ryzen 3700X, Linux, GCC 12.1.1) are:

    std version:    19504ms
    custom version: 3979ms
The "std version" is almost five times slower than the custom purpose-built version. Note that this is the "best case" scenario where i specify a number of buckets to minimize resizing - without that the std version needs 41382ms instead (i.e. a naive use of unordered_set would be almost twice as slow as with providing the bucket count and about 10 times slower than the custom version).

So yeah, i don't think the days of writing your own data structures are left behind, at least as far as purpose built structures for improving performance are concerned. Of course sometimes you just want to replace a very simple/naive algorithm with an at least theoretically faster one (i.e. replacing a linear search in a big unordered array with some map) so having the language provide some reusable data structures is convenient. But if that doesn't solve the performance problem you may have and the problem is in the data structure you can most likely do better than the generic structures the language has - after all you know your data and your code at a better and higher level than the compiler ever will.
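
For context, the custom set is along the lines of the article's MSI ("mask-step-index") scheme, an open-addressing table keyed by a multiply-xorshift hash; the constants and the zero-means-empty convention below are illustrative assumptions, not the exact test code:

    #include <stdint.h>

    static uint32_t hash32(uint32_t x) {      /* multiply-xorshift permutation */
        x ^= x >> 16; x *= 0x7feb352dU;
        x ^= x >> 15; x *= 0x846ca68bU;
        x ^= x >> 16;
        return x;
    }

    /* table has 1<<exp slots, 0 means "empty"; returns 1 if newly inserted.
       Assumes the table is never completely full. */
    static int set_insert(uint32_t *table, int exp, uint32_t key) {
        uint32_t h = hash32(key);
        uint32_t mask = ((uint32_t)1 << exp) - 1;
        uint32_t step = (h >> (32 - exp)) | 1;    /* odd step => full cycle */
        for (uint32_t i = h & mask;; i = (i + step) & mask) {
            if (table[i] == key) return 0;        /* already present */
            if (table[i] == 0) { table[i] = key; return 1; }
        }
    }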

[0] http://xtra.runtimeterror.com/snippets/dir?ci=tip&name=hash/...


note that the c++ built-in hash table is known to be really bad.


s/be really bad/have an interface that precludes many optimizations/

Saying things like "really bad" discourages using it at all, and for most uses std::unordered_map is fine.


It would be worth somebody gathering real world numbers from systems, to find out where people use std::unordered_map, how much are they leaving on the table, versus where people use say Folly F14, how small is the gain they're getting with their real workload.

I would guess there are a lot of maps out there which in practice have a dozen things in them and so it doesn't matter how it works, a naive linear search would be fine too. But I genuinely have no idea, maybe it's 99% maybe it's 1%.


Yeah i guessed (i haven't really used them much myself), but the argument was about using the functionality in the standard library.

Of course another practical alternative to both C++ and C (since the linked article is about C) is to use some 3rd party library.


The poor quality of, and inability to fix, the C and C++ std libraries are a large part of why (IMO) C and C++ aren't the future.


30ns per lookup is ok. I think you could do around 6ns per lookup (maybe less?) without changing the hash algorithm, using huge pages, reordering the lookups, or dropping the requirement that it be a hash table.


Note that the time is for both insertion and lookup, i didn't actually check lookup alone, just how general "use" compares between the two.

Pure lookup is

    std version:    8563ms
    custom version: 1660ms
(i.e. std version is ~5.1 times slower, again with the preallocated buckets - without preallocating the buckets it is 15298ms)

That puts lookup at ~24.8ns.

> I think you could do around 6ns per lookup (maybe less?) without changing the hash algorithm, using huge pages, reordering the lookups, or dropping the requirement that it be a hash table.

Yeah, considering this is just a "set of 32bit numbers" i could just as easily use a bitmap - a simple implementation with a bitmap array had it at ~6.3ns per lookup - but the discussion was on hash maps/sets and i wanted to keep it as close to the article.
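
(A bitmap set like that is only a couple of lines; a sketch, assuming the fixed 2^32 bits = 512MB is acceptable:)

    static uint64_t *bits;  /* calloc(1u << 26, sizeof *bits): one bit per possible key */

    static void bits_insert(uint32_t x)   { bits[x >> 6] |= 1ull << (x & 63); }
    static int  bits_contains(uint32_t x) { return (bits[x >> 6] >> (x & 63)) & 1; }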

I'm not sure what you refer to with "huge pages" though. Can you elaborate or provide some info?

Also about reordering the lookups you mean changing the order of the values i check? I wanted to have a random lookup since this was a test for how well arbitrary lookups would perform, this is why i generated a file with unique random numbers ahead of time.


How would pre-allocating affect the lookup time? Am I naive in thinking it would only be insert time that would be affected by pre-allocating?

Also I would recommend swapping the order in which you are testing them, just in case there are any issues there.

While I would have expected the std implementation to be slower, that certainly feels like an order of magnitude too slow for the standard library.


You specify the number of buckets but it can still affect how they're built and thus how they are used. This is a guess TBH since i don't know exactly what G++'s unordered_set does.

You are right about execution order, not sure what is going on since both are isolated (i literally just swapped the code around) but doing that almost doubled the std version performance :-/ - the custom one didn't change at all though.

Perhaps the optimizer somehow interferes with both tests since the code is next to each other and uses the same source data - a way to work around that might be to implement the lookup tests in separate shared objects, with the "fill set" and "lookup set" parts in separate functions. I might do that at some point later.


Oh I see, how many buckets did you choose?

( I assumed pre-allocate was a capacity allocation rather than bucket parameter )

I'd recommend 82139 which is the closest prime above 82137 which is Sqrt(Pi/2 * 2^32) which is the expected number of items before a collision.

I'd also be interested to hear the performance under the hash function:

    hash(x) = x
This is the CLR and JRE's preferred int32 hash. It's a poor hash for cryptographic purposes or for wanting random bits, but it has stellar performance for the most part.


I linked the source above. The bucket count is a bit higher than 82139... just 101473717 :-P. Changing it to 82139 does make it twice as slow though. I used 101473717 because that was the number i got from bucket_count() after running it with the default buckets.

About hash(x)=x, that is interesting, i never thought to just use x itself as the hash :-P. The performance is actually a tiny bit better (~17%) for checking if a number is in the set.


I was informally referring to the insert and the much later lookup as "2 lookups."

> I'm not sure what you refer to with "huge pages" though. Can you elaborate or provide some info?

CPUs for servers and desktops made within my lifetime (?) use what's called a translation lookaside buffer (TLB) to cache some of the most recently used mappings from virtual pages to physical pages. Page sizes on commonly used CPUs are 4KB except for Apple Silicon which uses 16KB pages. Since at least 1996[0], consumer CPUs have shipped with the ability to use 2MB (or larger) pages in addition to the default page size. This capability is awkwardly exposed on Linux. Linux users can either use transparent huge pages, which will slowly replace 2MB-aligned 2MB regions allocated by long-running programs with huge pages, or they can use hugetlbfs to directly request huge pages when calling mmap. Windows nominally supports this feature and calls it "large pages," but in practice users can either run their program as administrator right after booting the machine and then never close it, or give up. macOS has supported this feature since 2016 and calls it "superpages." I haven't used that so I can't comment on how usable it is in practice.

A lot of the runtime of a program like this is TLB misses. So you can just have way, way fewer TLB misses, like >99% fewer, if you use 2MB pages instead of 4KB pages. The mmap man page[1] says how to ask for huge pages when you call mmap. Here are some other[2] pages[3] about setting up hugetlbfs, which you must do before calling mmap in this way will work.
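
A minimal sketch of the mmap route on Linux (this assumes hugetlbfs pages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages):

    #include <sys/mman.h>

    void *buf = mmap(NULL, len,                   /* len: a multiple of 2MB */
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    if (buf == MAP_FAILED) { /* fall back to ordinary 4KB pages */ }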

> Also about reordering the lookups you mean changing the order of the values i check? I wanted to have a random lookup since this was a test for how well arbitrary lookups would perform, this is why i generated a file with unique random numbers ahead of time.

I just wanted to be specific, like maybe a program could compute the set and then check that the set was computed correctly more quickly by partially sorting the input. No sorting allowed!

0: http://www.rcollins.org/ddj/Jul96/

1: https://man7.org/linux/man-pages/man2/mmap.2.html

2: https://www.kernel.org/doc/html/latest/admin-guide/mm/hugetl...

3: https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt


I see, i was vaguely aware of them but not to the extent described here. Sadly it seems you need to configure the OS for them - might work for something like a server under your control, but it isn't something you can rely on in general use.

I tried to use posix_memalign + madvise MADV_HUGEPAGE to allocate the memory for use in the test program (if i understood the documentation correctly, the "transparent huge page support" that madvise relies on doesn't need any system setup) but didn't see any difference from just using malloc (perhaps the advice was ignored). Either way, both were much slower than just statically allocating the data like in the test program.


Okay but now the argument has switched from "Use your standard library, because its datastructures have been optimized to the extreme and are better than anything than you'd make yourself" to "use your standard library because it's fast enough, you probably don't spend enough time in hash table algorithms for that 5x performance improvement to be worth it".


To be clear, I am not saying anyone should use std::unordered_map. I am saying that if one has a go at it, one may be able to go 5x faster *again*


Ah, I see. My apologies.


C++'s `std::unordered_[map|set]` has a requirement that the addresses of objects in the map don't change on a reallocation -- this forces it to use buckets with linked lists under the hood. Try absl::flat_hash_set (https://github.com/abseil/abseil-cpp/blob/master/absl/contai...) and see how it goes.


Compilers are generally not permitted to change the layout of your data, and standard library data structures and algorithms are generally not that great. Sometimes this is because of unreasonable constraints imposed by standards or APIs. For example, c++ std::unordered_map basically must use separate chaining, even though this is really bad, and many standard library sort functions are unnecessarily slow because they expose an API that takes a comparator (so they cannot use radix sort).

Even very popular and reputedly performant data structure implementations like absl::flat_hash_map are pretty easy to beat for specific use cases. In this specific case, it's easy to beat because it incurs a cost of at least 2 cache misses per lookup, rather than 1.

Edit: As an aside, I have sort of lost faith in "sufficiently smart compilers" as a result of recent adventures in SIMD programming. Compilers seem to do an ok job of constraint propagation on integer values stored individually, but will not do the same sort of thing for values in SIMD registers. For example, if you'd like to compute the vector of u32's resulting from multiplying some vector of u32's `a` by some constant `b` and taking the high 32 bits of the result, there's no instruction for that (in AVX2). There is an instruction to compute the *full* 64-bit multiply result of the lower 32 bits of 64-bit words though. So you can get the high bits of the multiply results for the 1st, 3rd, 5th, and 7th u32's within each lane using that instruction. To get the other u32's into the right places you can use a 64-bit immediate shift by a constant, a shuffle of 8-bit values by a constant, a permute of 32-bit values by a constant, or probably some other operation, and the compiler won't replace the one you pick with the best one of these options even though it's trivial to show these are all equivalent. Then you can multiply that value by the constant. Now your results are in the 2nd, 4th, 6th, and 8th u32's within each lane of 2 vectors. You again must move some of them over using one of the aforementioned 3 or more equivalent methods, and then you must blend the two vectors together. If you literally write a blend here, you may find that it's slower than using a shuffle and a bitwise or, and the compiler will not perform this substitution either.
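
Concretely, the pattern described above looks something like this for AVX2 (a hedged sketch; the shift/mask choices here are exactly the kind of thing the compiler won't swap for a cheaper equivalent):

    #include <immintrin.h>
    #include <stdint.h>

    /* high 32 bits of a[i] * b for each of 8 unsigned 32-bit lanes */
    static __m256i mulhi_epu32(__m256i a, uint32_t b) {
        const __m256i bv = _mm256_set1_epi32((int)b);
        __m256i even = _mm256_mul_epu32(a, bv);                        /* lanes 0,2,4,6 */
        __m256i odd  = _mm256_mul_epu32(_mm256_srli_epi64(a, 32), bv); /* lanes 1,3,5,7 */
        even = _mm256_srli_epi64(even, 32);              /* move high bits to even slots */
        odd  = _mm256_and_si256(odd,                     /* keep high bits in odd slots */
                 _mm256_set1_epi64x((long long)0xFFFFFFFF00000000ULL));
        return _mm256_or_si256(even, odd);
    }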

Much of your comment is about much more difficult program transformations than these. I would like compilers to do these things, but I'm not holding my breath.


But all these things could be changed... Compilers could be permitted to change the layout of your data. Standard libraries could be great. Exposing an API that takes a comparator need not be slow if that API isn't used, or if the function passed in is fully understood by the compiler.

Basically, I'm saying that most hours of human effort put into optimising things should be put into compilers and standard libraries, because they benefit every project rather than just your personal project. It's hours better spent on a global basis.


The hard problems are always people problems. If they changed the layout of my data, I couldn't use that compiler, since I really do rely on my data being in a specific order. I know which fields in my struct are hot and which are cold, and a compiler really can't figure that out for me. That would also break every existing ABI out there. Maybe you'd add a struct attribute to say "keep this data ABI-compliant", but then everything would get that tag.

Standard libraries can't be great. They have to be safe, conservative and generic by default. std::vector<bool> is a classic example of what happens when a standard library tries to become too clever with optimization, and ends up screwing itself for an eternity.

C++ is too far gone to introduce good allocation patterns for its core data structures. A custom solution can be tailor-made by knowing what works well together.

I could spend 5,000 hours trying to get basic features through the C++ committee, or I could just write my own in half an hour. The idea that effort should always go to the compilers and standard library is countered by 20 years of complete nothing on any of those fronts (std::unordered_map still sucks), half-baked libraries like <random> making it into the standard, and compilers not even getting remotely better for things like SIMD, instead choosing to focus on more random undefined behavior passages to exploit.


This sounds good, to the extent that it's feasible.

I've recently been thinking about how it's sad that ~everyone uses much worse general-purpose lossless compression and decompression algorithms than those sold by RAD Game Tools. It's difficult to estimate the amount of energy spent daily on the difference between zstd and Oodle. If I had to choose how to spend my personal effort on fixing this issue, I would expect higher ROI over the next few years from submitting patches to zstd than from submitting patches to LLVM that try to get it to emit SIMD-ized inner loops for string processing algorithms like those in zstd.


> Even very popular and reputedly performant data structure implementations like absl::flat_hash_map are pretty easy to beat for specific use cases. In this specific case, it's easy to beat because it incurs a cost of at least 2 cache misses per lookup, rather than 1.

Can you elaborate on this one?


There's a talk about this data structure here: https://www.youtube.com/watch?v=JZE3_0qvrMg

Basically it stores one array of 1-byte markers that contain a full-or-empty bit and a 7-bit prefix of the hash, then a separate array of buckets. The lookup first loads a neighborhood of bytes in the former array to find entries that match in the first hash byte or empty entries. If the table is very highly loaded or if the hash function is low quality, some checks for equality of the full hash or equality of the value (the former probably just costs a cache miss and the latter might be very expensive) can be skipped, but under better conditions for the hash and load factor and known-pretty-bad conditions for randomness of consecutive lookups, it can be better to avoid the cost of doing one load in the small array of markers and another load in the large array of hash table entries. In the talk the presenter also says that they've changed the first load in such a way that it sometimes spans two cache lines, which makes it a bit more expensive, but that this was nonetheless a win for them.
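
The core of that first step is small; a sketch of the control-byte probe (SSE2; 16 one-byte tags compared against the 7-bit hash prefix at once; the names are mine, not Abseil's):

    #include <emmintrin.h>
    #include <stdint.h>

    /* bitmask of slots in this 16-slot group whose tag matches the hash prefix h2 */
    static uint32_t match_group(const uint8_t *ctrl, uint8_t h2) {
        __m128i group = _mm_loadu_si128((const __m128i *)ctrl);
        __m128i eq    = _mm_cmpeq_epi8(group, _mm_set1_epi8((char)h2));
        return (uint32_t)_mm_movemask_epi8(eq);
    }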


Standard libraries are often a lot less impressive than you might expect. I think it’s important that people retain the ability to write their own data structures so that when it is helpful and relevant that skill is in the toolkit.


Compilers don't do any of these things today... but they could.


Or instead of standard libraries, a compiler you can add optimisations to that recognise your structures.


I've been implementing data structures like hash tables for a project I've been working on so it's interesting to see alternative approaches for easy to implement hash tables.

One particular note of agreement is regarding deletion. An "Insert only hash table" is a very useful data structure which can have performance improvements given the extra overhead needed to support deletion.

This is very similar to the design that dotnet uses for HashSet[1]. For deletion they use the term "freelist" rather than "gravestones", which is definitely less emotive language. (They also use linear probing because it's generally much faster than double hashing: branch prediction, memory look-ahead, and other things I don't fully understand mean that it can just grab the next X items from memory all at once before starting to compare them, rather than having to jump around the array.)

One choice from OP's design which I find odd is the choice of power-of-2 for the table size. I was under the impression that choosing a table size of a prime number has significant advantages such as being able to more easily expand the table.

On a related note, my favourite hash-table is a fixed-size zero-probing "probabilistic" hash table where hash collisions are dealt with by just overwriting the hash. That makes the choice of a good size for the hash table a fun problem in itself. It's extremely fast to both implement and run if you don't mind the occasional bit of data loss, although the probability of data loss can be well controlled by choice of table size.
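
A sketch of that kind of table (the names and the 64-bit keys are assumptions; a colliding insert simply evicts whatever was there, and key 0 is effectively reserved as "empty"):

    #include <stdint.h>

    #define NSLOTS (1u << 20)                 /* fixed size, power of two */
    static struct { uint64_t key, value; } slots[NSLOTS];

    static void put(uint64_t hash, uint64_t key, uint64_t value) {
        slots[hash & (NSLOTS - 1)].key = key;       /* previous occupant is lost */
        slots[hash & (NSLOTS - 1)].value = value;
    }

    static int get(uint64_t hash, uint64_t key, uint64_t *value) {
        if (slots[hash & (NSLOTS - 1)].key != key) return 0;
        *value = slots[hash & (NSLOTS - 1)].value;
        return 1;
    }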

[1] https://github.com/dotnet/runtime/blob/4017327955f1d8ddc4398...


> One choice from OP's design which I find odd is the choice of power-of-2 for the table size. I was under the impression that choosing a table size of a prime number has significant advantages such as being able to more easily expand the table.

One advantage of power-of-two sizes is that you do not need an expensive integer modulus.
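
For instance, with a power-of-two size the reduction is just a mask (a one-line sketch; size must be a power of two):

    i = hash & (size - 1);   /* same bucket as hash % size, without the division */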


If your hash is decently uniform and fills 32 bits, you can avoid the modulus with

  i = ((uint64_t)hash * (uint64_t)size) >> 32;


Testing this using a size of 101, it only appears to give results up to 50. In fact, in general it appears to only give numbers up to size / 2; should that have been >> 31?

edit: Nevermind, I was converting to Uint but should have been returning Int, this gives results in (-50,50).

Adding size / 2 will get you back to the range (0,size-1) that you would want for assigning buckets.

I was also going to ask you why it needed to be "decently uniform", but then implementing it I saw that the first thousand numbers all hash to the same value.

In fact, the first 21,262,215 numbers all hash to the same value with size = 101.

Given that .net implements Int32.GetHashCode(value) as "return value"[1], I would avoid this trick in practice in .net.

Please correct me if I've stuffed up my implementation!

https://referencesource.microsoft.com/#mscorlib/system/int32...


might you have used a signed hash value instead of unsigned? (or just one that never sets the top bit)

(0xffffffff*101)>>32 is 100, which is the expected maximum.


Yes, this was it, thanks. I was trying to fit it around where I use GetHashCode in my workflow, which is signed in .NET.

As you say the algorithm was correct.


The algorithm above is a 64bit to 32bit reduction method. It performs a 64bit multiply and then discards the lower 32 bits. I'm not sure if Daniel Lemire invented it as an alternative to modulo, as I've seen the method used before, but he certainly made it popular[1].

There is a 64bit version as well (2x64bit -> 64bit) which, when compiled using 128bit multiplication, is quite fast too.
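
A sketch of that 64-bit variant using the compiler's 128-bit multiply (GCC/Clang __uint128_t):

    static uint64_t reduce64(uint64_t hash, uint64_t n) {
        return (uint64_t)(((__uint128_t)hash * n) >> 64);   /* maps hash into [0, n) */
    }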

The method made its way into the .NET runtime[2]. It is used in HashSet, Dictionary and ConcurrentDictionary today.

[1] https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-...

[2] https://github.com/dotnet/runtime/blob/c22f7f8c73766830b3262...


It is in my 1990 edition of Cormen, Leiserson & Rivest, chapter 12 “hash tables” section 12.3.2 “the multiplication method”, though they don’t explain the non-power-of-2 case in detail.

In this section of CLR the hash function is multiplication by A where 0 < A < 1, taking the fractional part of the result. In practice, as they explain, you can multiply by A * 2^32 and keep only the lower 32 bits of the result. Anyway, that gives you a full 32 bit hash (if A is good).

Then multiply by m (the size of the hash table) and take the floor (or drop the lower 32 bits of a 64 bit integer result). In CLR they discuss the case where m is a power of 2 and you don’t have a double-width multiply, so they pull the modulus out of the lower word, but the basic idea is still there.


Good trick. I thought about the approach of storing or computing the magic numbers for division by the primes you've chosen for the hash table lengths, but this is better.


> They also use linear probing because it's generally much faster than double-hashing

Optimal is probably to use linear probing for a few indices at a time (depending on size of hash table entries, say, 4 or 8), and double hashing to move to a new region if no empty elements can be found in that run.
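
A rough sketch of that hybrid probe (RUN, the power-of-two size, and "key 0 means empty" are assumptions for illustration):

    #include <stddef.h>
    #include <stdint.h>

    #define RUN 8
    /* slots[] has nslots entries (a power of two); returns the index or -1 on a miss */
    static ptrdiff_t probe(const uint64_t *slots, size_t nslots,
                           uint64_t h1, uint64_t h2, uint64_t key) {
        size_t mask = nslots - 1;
        size_t base = h1 & mask;
        size_t step = (h2 | 1) & mask;               /* odd step => visits every run */
        for (size_t tries = 0; tries < nslots; tries += RUN) {
            for (size_t j = 0; j < RUN; j++) {
                size_t i = (base + j) & mask;
                if (slots[i] == key) return (ptrdiff_t)i;
                if (slots[i] == 0)   return -1;      /* empty slot: not present */
            }
            base = (base + step) & mask;             /* double-hash to a new run */
        }
        return -1;
    }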


A similar idea is to think of the linear probe segments as large buckets that can store multiple items. The difference is that the hash identifies the bucket rather than an individual slot, so you always scan the whole bucket rather than maybe landing in the middle of a segment.


If you can choose to have a quality hash function, it's probably better to do that.


Sure, but if you have a sufficiently high quality hash function this approach is equivalent to simple linear probing, as you will never need to probe more than the initial batch. So why not have this behaviour as a fallback?


>probably

based on what?


The fact that this provides the benefits of both approaches, namely that it is (branch) predictable and amortises memory latency while obtaining the probabilistic benefits of double hashing.


How is a prime number better for expanding the hash table?


I imagine having a prime number size might reduce the chance of collision, probably as a result of some property of modular arithmetic.

But if you’re using a decent (i.e. actually pseudorandom) hash function I would think this is moot?


The acronym 'MSI' is already heavily in use:

https://en.wikipedia.org/wiki/MSI

Might be an idea to re-think the hash table name?


For an ultra simple hash table, you just need a few things (the following is off the top of my head).

First is a 'reasonable' hash function (although with a few more lines you can do better [1]):

    unsigned int H(char* s){ unsigned int h; for(h = 0; *s; s++) h = ((h << 5) + h) + *s; return h; }
And then to build your hash table:

    int L = <table length>; // Large enough to reduce collisions
    char** t = (char**)malloc(L * sizeof(char*));
    memset(t, 0, L * sizeof(char*)); // Set it all to empty by default
Then to use it:

    /* Put example */
    t[H("key string") % L] = "value string";
    /* Get example */
    char* v = t[H("some key") % L];
    if(v) printf("value -> %s\n", v);
    /* Remove example */
    t[H("another key") % L] = NULL;
For the most part you can get pretty far with just this in C.

[1] https://coffeespace.org.uk/projects/hash-functions-part-2.ht...


This insert routine can destroy an unrelated key, so I'm not sure it's appropriate to call it a hash table.


You would of course check whether there is a collision if you care. You would size it such that a collision is low probability though.


Good luck when key collision happens and wrong value comes out causing errors in other part of your code...


Key collisions are of course a problem. This is a quick and dirty implementation. If you care about checking for collisions, you could instead have:

    typedef struct{ char* k; char* v; } KV;
    int L = <table length>; // Large enough to reduce collisions
    KV* t = (KV*)malloc(L * sizeof(KV));
    memset(t, 0, L * sizeof(KV)); // Set it all to empty by default
Then:

    /* Put example (detect collisions) */
    char* k = "key string";
    unsigned int i = H(k) % L;
    if(t[i].k == NULL || strcmp(t[i].k, k) == 0)
      t[i] = (KV){ k, "value string" }; /* empty slot or same key: store it */
    else
      printf("Collision\n"); // Handle as you like
Of course this can get more and more advanced. The point wasn't to make a zero-collision table, it was just something that could be used quickly.


The point of the replies to your original comment is that it can't be used for most purposes, and writing all the code to make it usable results in something similar to the implementation in the article.


Hence why I wrote "For an ultra simple hash table". I think that for C that doesn't natively have such a data structure, a couple of lines to introduce such a structure really isn't too bad.


It's not clear what 'ultra simple' means. Could have just said that it doesn't handle collisions.

And I think for most cases, you do want to handle collisions.



