How to implement a hash table in C (benhoyt.com)
302 points by benhoyt 46 days ago | 156 comments



> When the hash table gets too full, we need to allocate a larger array and move the items over. This is absolutely required when the number of items in the hash table has reached the size of the array, but usually you want to do it when the table is half or three-quarters full.

Yes, but how you resize is important too: if you have a threshold size (like 3/4 full) at which you block and re-distribute all elements into a new array, you will incur a significant pause when this happens, e.g. https://log.kv.io/post/2009/05/15/leaky-hashtables

Instead, when you reach the threshold amount you can create the new array and then gradually migrate the keys in small batches either with each operation or with a background thread. So on `get` if we're in the process of resizing: first check the new table, then the old one, and before returning migrate N keys from the old table to the new one. Free the old array once all the keys are migrated.
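A rough sketch of what that get path can look like, with hypothetical types and helpers (illustrative only, not the dict.c linked below):

    #include <stddef.h>

    struct entry;                                              /* key/value slot */
    struct table { struct entry *buckets; size_t cap, count; };
    struct dict  { struct table tables[2]; int rehashing; };   /* [0]=old, [1]=new */

    void *table_find(struct table *t, const char *key);    /* probe one array */
    void  dict_migrate(struct dict *d, int nkeys);          /* move nkeys old -> new */
    void  dict_finish_rehash(struct dict *d);               /* free old array, swap */

    void *dict_get(struct dict *d, const char *key) {
        if (!d->rehashing)
            return table_find(&d->tables[0], key);

        /* resize in progress: check the new table first, then the old one */
        void *val = table_find(&d->tables[1], key);
        if (val == NULL)
            val = table_find(&d->tables[0], key);

        dict_migrate(d, 4);                 /* migrate a small batch per operation */
        if (d->tables[0].count == 0)
            dict_finish_rehash(d);          /* all keys moved: free the old array */
        return val;
    }

set and delete would do the same dance (new table first), and the migration batch size is the knob that trades per-operation overhead against how long the resize takes to finish.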

I wrote a small hash table implementation with gradual re-hashing a while back, search for DICT_MAX_LOAD and dict_rehash here: https://github.com/nicolasff/ht/blob/master/dict.c#L193


It’s usually not the right call to use a hash table like you’re describing. You’re not saving any work (all keys have to be migrated sooner or later) and by doing it like this you incur a non-trivial amount of overhead on most operations.

Hash tables generally work on the assumption that operations are O(1) in the amortized sense, that individual inserts might be expensive because of rehashing, but in aggregate they are very cheap. In practice, this is almost always a fine assumption, and it is how the vast majority of hash tables in all programming languages work.

The kinds of hash tables you are talking about have their uses in places where you have soft real-time constraints, but that is honestly a pretty niche use case.


The progressive rehashing described in the article is very similar to what Redis does [1].

Just thought I'd share an example of a use case where the incremental rehashing logic makes sense.

[1]: https://github.com/redis/redis/blob/unstable/src/dict.c


iirc memcached either uses or used to use https://en.wikipedia.org/wiki/Linear_hashing to avoid rehashing induced pauses. it definitely makes sense for that use case but you're correct that it's not the right call for a general purpose hash table implementation.


If your application is not negatively affected by such jitter why not use a garbage collected language?


Programs have different phases and jitter might be relevant only in some of them.

Imagine hash tables used to store configuration data. You don't care about performance of "reload config" operation, but once config is loaded and the worker is handling traffic, read operations must be fast.


Maybe you still want fine-grained control over code generation and memory allocation. But it's a valid question to ask.


I agree. I'm rarely using hash tables for growing data sets. In my experience, I've been more inclined to redesign the algorithm or use a different data structure at the point where a migrating hash table makes sense.


Agreed this seems like major overkill, but then again so does writing your own hash table from scratch in C. :shrug:


> Hash tables generally work on the assumption that operations are O(1) in the amortized sense, that individual inserts might be expensive because of rehashing, but in aggregate they are very cheap.

Hash tables aren't O(1) in an amortized sense, they are O(1) in a statistical sense. The O(1) complexity of hash tables is dependent entirely on the size of the underlying array, the method of collision resolution, and the quality of the (pre)hash function in relation to the distribution of data you're storing in the table. Whether you insert 1 item or 1 million items into the table, you're still getting O(1) computational complexity if the table is designed well. If not, you could be getting O(n) complexity in the worst case.

While the fixed cost of hashing is irrelevant in the O(1) nature of hash table operations, it does mean that a hash table in practice could be slower (in terms of wall time) than a linked list for insert/find/remove operations. Despite a linked list being O(n) for those operations, a linked list doesn't have to pay the fixed cost of hashing. So for low values of n, the linked list might be a better choice. I think this is what you mean by "O(1) in the amortized sense" but it's not what we mean when we say the computational complexity of these operations on a hash table is O(1).

What the OP is talking about is an example of amortization of resize(), essentially adding a fixed cost to insert() and remove() so you don't have to pay a huge sum when you call resize(). But that doesn't mean insert() and remove() are no longer O(1).


Amortized cost is a technical term in complexity analysis: https://en.wikipedia.org/wiki/Amortized_analysis

If you use the terminology precisely, the worst case cost of a regular hash table insertion is O(n) because of the potential resize, but its amortized cost is O(1) because that's what it "averages out to".
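To make "averages out to" concrete: with capacity doubling, the most recent resize copied at most n elements, the one before at most n/2, and so on, so over n inserts

    total elements copied  <=  n + n/2 + n/4 + ...  <  2n

which is fewer than 2 extra element moves per insert on average, i.e. the O(1) amortized bound.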


What happens if you have a sequence of puts that triggers an expansion and some keys get migrated, but all of a sudden you get a series of removes, such that you need to perform a shrink to keep memory usage linear?

Do you now end up with three live arrays? You can probably make things even more pathological...


Hash tables are not great for data structures that change dramatically in size. If you have this situation, the best thing to do is to copy one hash table into another and discard the original data structure.


I think parent’s construction works just fine if you disallow removes, so it grows monotonically, regardless of how dramatic the increase in keys, so long as you move over 2 keys for every update to your map.

My comment was a description of a precise “drastic change” where incrementality becomes sophisticated. (Though in part I was hoping for a response which identifies an elegant approach without the complication I mentioned.)


Hash tables do that automatically.

Rehashing is controlled by various tunable strategies: load factor (0.5-1), growth policy (prime or power-of-two sizes), growth factor (1.5, golden ratio, or 2).

Linked-list tables don't need to throw away nodes, they can just be relinked. But compaction is always better, because cache dominates. Most rehash implementations do exactly that, esp. multithreaded ones.

The best thing to do is to take a good one and don't touch it, unless you want a monster as in perl5. Eg the recent siphash security theater broke performance in all dynamic languages, even the linux kernel. Everything is perl now. You don't want that.


I'm not a particularly good programmer, so I may be missing something here, but I'd think that moving the items should be its own function that happens in between gets and sets, no? Besides adding a decidedly non-read-only operation to what should be read-only, shouldn't the movement happen when you're not trying to do anything with the hash table?


typically rehashing is done during insert. if you were going to do it while not doing anything else with the hash table you'd have to use a background thread, and then you'd have to do some form of synchronization, so there would be significant additional complexity and overhead.


If get() mutates the array internally, does that mean even "read-only" operations aren't thread-safe?


> So on `get` if we're in the process of resizing

Not just get, but set also, right? Otherwise I would think a large group of set operations that overflowed more than once might be problematic to deal with. Or maybe I'm missing something that makes that a non issue?


yes, set also. go's map implementation uses this technique - allocate a new array of buckets and gradually migrate from old to new buckets (i think it does the migration during set and delete only, not get).


You can also allocate a small hash table at first, and reallocate it in a large virtual memory region once it is about the size of a page (4 KB) so you won't have to reallocate it later.
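On POSIX systems, one way to get that "large virtual memory region" is to reserve address space up front and commit it lazily; a sketch (the region size and names are made up):

    #include <stddef.h>
    #include <sys/mman.h>

    #define RESERVE_SIZE ((size_t)1 << 30)   /* reserve 1 GiB of address space */

    /* Reserve a big region without committing physical memory. */
    void *table_reserve(void) {
        void *p = mmap(NULL, RESERVE_SIZE, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* Commit the first `bytes` of the reservation so the table can use them. */
    int table_commit(void *base, size_t bytes) {
        return mprotect(base, bytes, PROT_READ | PROT_WRITE);
    }

Growing the bucket array then never moves it in memory (you still rehash entries into the new bucket count, but the allocation itself is never copied).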


can you do hashtable sharding? like have a hashtable point to another hashtable by using something appended to the key? just a thought on a procrastination filled Friday afternoon.


Of course! This is a distributed hash table. You can have one hashing function determine the shard and another determine the index in the hash table.
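As a toy single-process sketch (names and hash choices are arbitrary): one hash picks the shard, a second one picks the bucket inside that shard's table.

    #include <stddef.h>
    #include <stdint.h>

    #define NUM_SHARDS 16

    /* Two FNV-1a style string hashes with different starting values (toy choice). */
    static uint64_t hash_a(const char *s) {
        uint64_t h = 14695981039346656037ULL;
        for (; *s; s++) { h ^= (unsigned char)*s; h *= 1099511628211ULL; }
        return h;
    }
    static uint64_t hash_b(const char *s) {
        uint64_t h = 1099511628211ULL;               /* different seed */
        for (; *s; s++) { h ^= (unsigned char)*s; h *= 1099511628211ULL; }
        return h;
    }

    size_t shard_of(const char *key)              { return hash_a(key) % NUM_SHARDS; }
    size_t bucket_of(const char *key, size_t cap) { return hash_b(key) % cap; }

In the distributed case, shard_of would pick a node instead of an in-process table.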


IIRC, a very simple implementation of a hash table was shown in the classic Kernighan & Ritchie C book, "The C Programming Language" (which was one of the first good programming books that I bought and read, early in my programming career).

It was in either the 1st (pre-ANSI C) or the 2nd (ANSI C) edition. Just a handful of lines of C code, and easy enough for even beginners to understand. Of course, it was not a production-quality hash table implementation, nor was it meant to be. I still remember my excitement at understanding the logic and its usefulness at the time, even though I did not know about the importance of data structures and algorithms then.

Edit: Even though it was simple, IIRC, it at least handled collisions.


Nice. Yeah, looks like it's in section "6.6 Table Lookup", and it implements a simple, fixed-size hash table (array size 101) of char* to char*, and it handles collisions via linked lists.
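For anyone without the book handy, it's roughly this shape (reconstructed from memory, not a verbatim quote; install() additionally mallocs a node, copies the strings, and links it at the head of its chain):

    #include <string.h>

    #define HASHSIZE 101

    struct nlist {                  /* chained entry */
        struct nlist *next;
        char *name;                 /* key */
        char *defn;                 /* value */
    };

    static struct nlist *hashtab[HASHSIZE];

    static unsigned hash(const char *s) {
        unsigned hashval;
        for (hashval = 0; *s != '\0'; s++)
            hashval = (unsigned char)*s + 31 * hashval;
        return hashval % HASHSIZE;
    }

    struct nlist *lookup(const char *s) {
        for (struct nlist *np = hashtab[hash(s)]; np != NULL; np = np->next)
            if (strcmp(s, np->name) == 0)
                return np;
        return NULL;
    }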


Cool, thanks for looking it up :)


>looking it up

Pun not intended but noticed, ha ha.


You skipped hs_remove, but depending on the implementation, the hs_remove strategy may change your hs_set, e.g. using tombstones. I've recently learnt that you don't need tombstones at all in linear-probing hash maps: you just do backshift removals, which may be a simple addition to your map. Check out these repos for example implementations:

(C) https://github.com/tezc/sc/tree/master/map

(C++) https://github.com/rigtorp/HashMap


I wrote a blog post [1] on this a couple of years ago. This is a quite old technique [2] but was not widely used until recently. I was surprised that I hadn't found this earlier.

[1] https://attractivechaos.wordpress.com/2019/12/28/deletion-fr...

[2] https://en.wikipedia.org/wiki/Linear_probing#Deletion


Thanks for this (and parent reply for mentioning it), that's simple and clever!

> The basic idea is not complex: when we delete an element, we move the next element to the deleted location if it is pushed away by the deleted element due to hash collision; we repeat this process until we come to an empty bucket.

I wonder, is it still amortized constant time? My gut (which is not very good at computer science) tells me that it still may be, as with an average probe length of ~1.4 (which I was seeing), you won't have to walk more than 1 or 2 elements on average.
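A minimal sketch of that deletion loop for a linear-probing table, with made-up field names (not the code from either of the linked repos):

    #include <stdbool.h>
    #include <stddef.h>

    struct slot  { const char *key; void *value; bool used; };
    struct table { struct slot *slots; size_t cap, count; };

    size_t key_home(const struct table *t, const char *key);   /* hash(key) % cap */

    /* Backward-shift deletion: fill the hole at i by pulling later elements
       of the same probe chain back toward their home slots. */
    void backshift_delete(struct table *t, size_t i) {
        size_t cap = t->cap, j = i;
        for (;;) {
            j = (j + 1) % cap;
            if (!t->slots[j].used)
                break;                                  /* end of the probe chain */
            size_t h = key_home(t, t->slots[j].key);
            /* slots[j] may move into the hole only if the hole lies cyclically
               between its home slot h and its current position j */
            bool movable = (j > i) ? (h <= i || h > j) : (h <= i && h > j);
            if (movable) {
                t->slots[i] = t->slots[j];
                i = j;                                  /* the hole moves to j */
            }
        }
        t->slots[i].used = false;
        t->count--;
    }

Lookups can then stop at the first empty slot, with no tombstone checks at all.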


It shouldn't take more operations (asymptotically) to delete than to probe/insert so as long as the original list is still amortized constant time then the delete operation still is.


Surely this is an older technique? When I took data structures in college I implemented deletions that way, because I wasn't smart enough to invent tombstones. Looking through the Wikipedia page's history, the sketch of it hadn't been added yet at that point.

I had wanted to use "better" techniques like quadratic probing, but I couldn't figure out how to perform deletions, so I fell back on linear probing.


I can't find it right now, but I saw an older version of the wiki page that had a C code example of how to do deletion this way. This is how I discovered it. I knew your blog, but apparently I hadn't visited it recently.

> I think the new deletion algorithm without tombstones is overall better than traditional algorithms. It should become the preferred way to implement hash tables

I agree with you 100%. Tombstones have no advantage over this.


The advantage of tombstones is that it is dead simple to implement. Backshift removals can cause a lot of memory shuffling if you are removing an element from a very congested part of the hash table.


Backshift has two potential disadvantages: 1) copying objects may be slower than testing equality (case dependent); 2) each deletion involves the entire probe chain (with tombstones, you inspect half of the probe chain on average). On the other hand, with tombstones, each probe chain is longer. This slows down searching and insertion as well as deletion. Tombstones also increase the frequency of rehashing.

> The advantage of tombstones is that it is dead simple to implement.

Backshift is harder to understand but it actually takes fewer LOC to implement in my hands. That is because insertion and searching become simpler (and slightly faster) with backshift.

> Backshift removals can cause a lot of memory shuffling

Not sure what you mean by memory shuffling. Backshift is applied to linear probing only. You don't jump around except when you wrap around the end of the bucket array, which is rare.


> Backshift has two potential disadvantages: 1) copying objects may be slower than testing equality (case dependent);

I would wager that is true in most cases.

> This slows down searching and insertion as well as deletion. Tombstones also increase the frequency of rehashing.

Yes, but you could argue that is an advantage; the hash table reaches its "natural size" faster.

> Backshift is harder to understand but it actually takes fewer LOC to implement in my hands. That is because insertion and searching become simpler (and slightly faster) with backshift.

I looked at your code and, afaict, it only supports linear probing. Backshift removals would be much harder or maybe even impossible to implement with quadratic probing or double hashing.


> I would wager that is true in most cases.

Strings. In C, comparing two strings involves a function call, a loop and a potential cache miss. Copying a string just moves a pointer, which is much cheaper. The additional copying costs little for strings.

> the hash table reaches its "natural size" faster.

Not sure what you mean by this. The "natural size" is the size without tombstones. Clearly backshift is the better one.

> it only supports linear probing

Yes, backshift only supports linear probing so far as I know. Most high-performance hash table libraries use linear probing these days. A few use quadratic probing. I rarely see double hashing.

Anyway, backshift always uses less memory and is always faster on insertions and searching. Under certain deletion loads, backshift may be slower due to 1) and 2), but the difference is fairly small considering the savings in memory.


> Strings. In C, comparing two strings involves a function call, a loop and a potential cache miss. Copying a string just moves a pointer, which is much cheaper. The additional copying costs little for strings.

A better approach would be to store the string's hashcode somewhere and only perform the element-wise comparison in case they match. For most strings they are unlikely to match.
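A small sketch of that idea (field names invented): cache the full hash next to the key, and only fall back to strcmp on a hash match.

    #include <stdint.h>
    #include <string.h>

    struct entry { uint64_t hash; const char *key; void *value; };

    /* cheap 64-bit comparison first; strcmp runs only on the rare hash match */
    static int entry_matches(const struct entry *e, uint64_t h, const char *key) {
        return e->hash == h && strcmp(e->key, key) == 0;
    }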

> Not sure what you mean by this. The "natural size" is the size without tombstones. Clearly backshift is the better one.

You can count the fraction of the table consumed by tombstones and grow it a little earlier if that fraction is significant.

> Yes, backshift only supports linear probing so far as I know. Most high-performance hash table libraries use linear probing these days. A few use quadratic probing. I rarely see double hashing.

Yes, but you are trading amazing performance in the average case against awful performance in the degenerate case. Poor hash functions or adversarial input could cause congestion in parts of the hash table.


> Yes, but you are trading amazing performance in the average case against awful performance in the degenerate case. Poor hash functions or adversarial input could cause congestion in parts of the hash table.

It is not just me. As I said, most high-performance hash tables these days (absl, F14, and all robin hood implementations) use linear probing. Providing a proper default hash function avoids most worse cases in practice. In addition, the hash library can use a secondary hash function [1] which makes the library much more robust to bad hash functions such as the identity hash function.

[1] https://attractivechaos.wordpress.com/2018/10/01/advanced-te...


Since you need to do collision counting for security reasons anyway, it's trivial to check the backshift threshold while searching for your key. If you got too many tombstones in your chain, backshift.


If you are using a load factor around 75%, it means you'll do 3 shifts at most. On average it is less than that, since you are re-ordering your map on each removal, and the only case where you do backshifts is when your items are "not ordered according to the hash function".

Tombstones are silly: add and remove 75% of your map's capacity and now you have to rehash your map even though it is empty. I can't think of a single scenario where tombstones perform better. They cause much more probing and many more cache misses than backshift deletion.


A more detailed explanation of "backward shift deletion": https://codecapsule.com/2013/11/17/robin-hood-hashing-backwa...


Well, he did say "simple hash table". In my experience, many applications have no need for single-element deletion.


Small nit: in table 2 with the columns labeled "Key, Hash, Hash modulo 16", the thing called "hash" is actually known as a "prehash", while the "hash modulo 16" is the hash. In the context of hash tables, pre-hashing is when you take something like a string or a data structure and turn it into a number. Hashing is the mapping from that number to one of the hash table bins. The distinction is often lost in practice, but if we're talking about hash table design and implementation it's important to make this clear at least once.
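In code the distinction looks something like this (FNV-1a used here purely as an example prehash):

    #include <stddef.h>
    #include <stdint.h>

    /* "prehash": turn the key into a number */
    static uint64_t prehash(const char *s) {
        uint64_t h = 14695981039346656037ULL;        /* FNV-1a offset basis */
        for (; *s; s++) { h ^= (unsigned char)*s; h *= 1099511628211ULL; }
        return h;
    }

    /* "hash" in the strict sense: map that number onto a bucket */
    static size_t bucket_index(const char *key, size_t capacity) {
        return (size_t)(prehash(key) % capacity);
    }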


Is that always true? It depends whether or not you think of the hash as a property of the hashtable, or a property of the key itself. e.g. you may have string objects that store their hash values internally - a useful optimisation if you are dealing with lots of long read-only strings, given that the hash function will have a cost of at least O(n) for the string length. Inserting a string into a hashtable wouldn't change its hash (what if the string is inserted into two tables of different sizes?)


Right, it does depend on this perspective, which is why I say the distinction is often lost in practice. But from the perspective of hash table designers (at least historically), the hash can in fact change and is dependent on the size of the table. The distinction is important because the two operations are separable. I guess people didn't want to call one thing a "hash" and the other "hash mod k" because there are ways to do this other than mod. They could have called one "hash" and the other "post hash" but it is what it is.

All I'm trying to say is if you want to know more about hash tables, and you bother to venture into the literature or some textbooks, you're going to run into this distinction sooner rather than later, so I wanted to point that out here.


And depending on your use case, you may not need to implement growing the hash table, or deletions, making it even simpler/less daunting to implement in C (since this avoids a lot of the memory management parts of the job).


Yeah, good point. And then you can allocate the hash table struct and all your buckets in one call to malloc().


When you need to scan billions of rows and get sorted unique strings, theory like this is really useful. You can’t just use a library and call it a day. Sometimes you gotta build your own hash table and profile it out. Then tweak it, and profile it until you get every last bit of juice from CPU and RAM.

Articles like this are really useful for someone who wants to look under the hood.


Unless you need the specific memory characteristics of a hash table, a trie data structure is a pretty good solution to a lot of the same problems and in my opinion is conceptually easier to understand.


I have never seen a trie described as easier than a hash table. Curious what makes you say they are easier?


I find their performance characteristics easier to naively understand without resorting to profiling with production data.


I should have worded my question more as "do you have a recommendation to describe tries?" In particular, construction is tough.


The C# implementation here is pretty approachable for just about any kind of programming background.

https://www.geeksforgeeks.org/trie-insert-and-search/


Isn't a faster way to, instead of using a backing array, to use a pool of memory that you just point into with pointers addressed with the hash?

Nasty stuff that you can do in a language like C :)


As far as the CPU is concerned, there isn't a big difference if you access a block of memory via pointers or via an array. Converting the hash result to a memory address is not the bottleneck. The main bottlenecks are the hash function itself and the logic for dealing with collisions.


I definitely agree that is where the main bottlenecks are!


I guess the idea is to contrast DIY bsearch vs DIY hash table - but it's a bit odd to mention bsearch is in the standard lib, then not use it?

> Here’s how you’d do it in C (with and without bsearch).

Judging by a random hit on the net, it would've been quite straightforward to use the stdlib version?

https://www.tutorialspoint.com/c_standard_library/c_function...


POSIX actually specifies a hash table interface in the standard library as well: https://man7.org/linux/man-pages/man3/hsearch.3.html

In typical Unix fashion, however, it is designed to use internal, global state, so you can only have one table at a time.

There is a GNU extension that provides the same functions with an `_r` suffix that allow you to work with more than one table.
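For anyone who hasn't run into it, basic usage looks roughly like this (error handling mostly omitted):

    #include <search.h>
    #include <stdio.h>

    int main(void) {
        if (!hcreate(32))                        /* one global table, capacity hint */
            return 1;

        ENTRY item = { .key = "apple", .data = "fruit" };
        hsearch(item, ENTER);                    /* insert */

        ENTRY query = { .key = "apple" };
        ENTRY *found = hsearch(query, FIND);     /* lookup */
        if (found)
            printf("%s -> %s\n", found->key, (char *)found->data);

        hdestroy();
        return 0;
    }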


I’ve never really understood why standard Unix functions rely on internal global state. For example, strtok():

https://en.cppreference.com/w/c/string/byte/strtok

Was there some reason it was optimal to write these functions this way back in the early days of Unix?


A fair amount of the timeline for Unix was when it supported multiple processes, but not multiple threads. SunOS/Solaris didn't have threads until 1992, for example, and it still didn't support SMP, so global state wasn't an issue initially. Strtok() predates 1992. They did add strtok_r() once Posix threads were a thing.


Well, if you wanted multiple dbm files or hash tables or whatever, it was easy enough to run multiple processes. Unix processes on the PDP-11 could only address 64 KiB of memory—PDP-11 addresses are 16-bit, and C strongly encourages a von Neumann model where your program and data are in the same address space, so you couldn't even use 64 KiB for your program and another 64 KiB for your data, though the PDP-11 hardware was capable of doing this. (On things like the AVR, this is handled by copying your data into RAM at startup unless you mark it PROGMEM, in which case you can't pass pointers to it to ordinary functions. PDP-11 Unix didn't have anything like this.)

As long as you don't have multiple threads (and you shouldn't, and on Unix you couldn't) the static-variable interface in these cases was usually more convenient, though more bug-prone.

There's also a micro-efficiency question.

Global variables are at a known memory address. If you put them in a struct and pass around a pointer to it, they are at a dynamically computed memory address. Every time you want to access them, you have to add the field offset to the struct's base pointer. This is a small amount of extra computation, and how much it actually costs you depends on the hardware—on the 8080 it's a mess that probably requires you to spill a couple of registers onto the stack, and on modern implementations of amd64 like the ones from Intel, it probably takes literally zero extra time, because you probably aren't keeping all of your addition-capable functional units busy anyway.

I don't know the PDP-11's instruction set well enough to say, but I suspect the cost of such indexed addressing was intermediate between these extremes: much like the 8086, it had an "index" addressing mode, mode 6, which added an immediate 16-bit offset to a base register https://pages.cpsc.ucalgary.ca/~dsb/PDP11/AddrModes.html, and it doesn't seem to have had a pure immediate addressing mode where the address of the variable is just an immediate word without any indexing. But if you didn't need a base register for your variables, you could use PC-relative addressing, thus saving a register and decreasing register pressure (the PDP-11 only had 8 general-purpose registers, and one of them was PC, so it really only had 7).

I don't have any idea what actual PDP-11 compilers did, though, and that matters a lot here.


It probably saved a few bytes of memory or disk. A PDP11 had 64k of memory and had to be shared by multiple simultaneous users. A lot of the Unix tradeoffs can be traced back to limitations of 1970s hardware.

Plus you weren't going to need more than one anyway, your program wasn't going to be big enough.


Because they were written with the expectation that they would be used in composable programs that each do one thing well.


hcreate_r() can be used on some of the BSDs as well, but of course they're not 100% compatible with GNU. I've implemented a platform-independent POSIX hsearch as a wrapper to a templated hashtable in Pottery:

https://github.com/ludocode/pottery/tree/master/examples/pot...

Be warned, hsearch is terrible. You should probably only use this if you have some existing code that depends on hsearch and you want to make it portable.


The article actually does both (uses bsearch, and also the home made one) presumably just to demonstrate both.


(I noticed this because I was wondering why the cmp() function was taking void pointers only to cast them to "item*", and it occurred to me it would make sense if it were passed as a function pointer to a generic library function with no knowledge of "item*"... as it is when bsearch is used.)


Heh, I missed that - just skimming the code on mobile. Maybe there should be a slight edit in the text and/or the custom binary search could be dropped (as it's not the main focus of the article).


> Even while writing this I realized I’d used malloc instead of calloc to allocate the entries array, which meant the keys may not have been initialized to NULL.

I think it is still UB to allocate a struct containing pointers with calloc without explicitly setting those pointers to 0 (or NULL). The C standard doesn't specify the binary representation of the null pointer value. In practice, that will work on virtually all current platforms, but it's still UB.
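The strictly portable fix is cheap anyway; with a hypothetical entry type it's just an explicit loop after the allocation:

    #include <stddef.h>
    #include <stdlib.h>

    typedef struct { const char *key; void *value; } entry;   /* hypothetical */

    entry *alloc_entries(size_t capacity) {
        entry *entries = calloc(capacity, sizeof(entry));
        if (entries == NULL)
            return NULL;
        /* don't rely on all-bits-zero being a valid null pointer representation */
        for (size_t i = 0; i < capacity; i++) {
            entries[i].key = NULL;
            entries[i].value = NULL;
        }
        return entries;
    }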


Implementation-defined as far as I'm aware. If pointers with all bits zero actually represent null pointers, things will be fine.

edit:

I haven't found any smoking gun quotation in the C11 standard on short notice, but there is footnote 296 (p.348) on calloc() zeroing all bits:

> Note that this need not be the same as the representation of floating-point zero or a null pointer constant.

Nothing about this being UB in cases where it does represent a null pointer constant.


Wow, thanks, I did not know that. Though that would be a kinda crazy platform or implementation! I'm betting that a lot of people depend on this.


I think that's true only in the narrowest sense of the C language standard. At the ABI and runtime linkage level, platforms (yes, every single one of them) document data in the .bss section to be zero, and C compilers place pointers there that are defined to be initialized to NULL per the standard.

So, no, in practice it's well-defined, fully specified behavior on any given platform.


Any particular platform can declare it to be defined, by fiat. "With our compiler, all zeroed raw storage treated as pointers are null."

Anywhere that they technically could, you can treat it as if they actually had so declared, and just didn't publicize it enough.


NULL is guaranteed to equal integer 0 by standard, and converting 0 to a pointer is guaranteed to be a NULL pointer.

> An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant. If a null pointer constant is converted to a pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal to a pointer to any object or function.


An integer constant expression is a syntactical construct, not a runtime entity. The C standard provides no guarantee that bytes (chars) with value zero can be read from memory as a null pointer.


wait what? my C is a bit rusty but when would reading 0 from memory & casting to a (void?) pointer type ever != NULL?


C implementations are allowed to use a not-all-bits-zero representation for null pointer values. They are merely required to convert integer constant expressions with value zero to a null pointer value — regardless of how the resulting null pointer value is represented in memory.

See also these FAQs:

http://c-faq.com/null/machnon0.html

http://c-faq.com/null/machexamp.html


This might not be the best place to ask, but I'm wondering what the best data structure is for these needs:

* there are fields and values, similar to a struct
* I know at initialization time what fields there are (assume a string)
* at run time I want to be able to get those values, accessing via the key/field
* I also want to update those values
* what I don't care about is inserting/deleting fields, because that's not allowed

I guess what I want is a struct-like data structure, but in an interpreted language.

Might've typed too fast...


You're looking for something similar to Python's namedtuple.[1] (Though modern python should use a dataclasses in the stdlib or the attrs package.)

Essentially, you have an array/list/vector under the hood, and a mapping of field names to array indices to govern field access.

If you're trying to implement this at a lower level, in the interpreted language itself, take a look at the slots mechanism[2] in Python.

[1]: https://docs.python.org/3/library/collections.html#collectio...

[2]: https://docs.python.org/3/reference/datamodel.html#object.__...
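In C terms the underlying idea is roughly this (hypothetical sketch, using a linear search over a fixed field-name list; a small name-to-index hash works just as well):

    #include <stddef.h>
    #include <string.h>

    /* field names fixed at initialization time */
    static const char *field_names[] = { "x", "y", "z" };
    #define NUM_FIELDS (sizeof field_names / sizeof field_names[0])

    typedef struct { double values[NUM_FIELDS]; } record;

    /* map a field name to its slot; -1 if unknown */
    static int field_index(const char *name) {
        for (size_t i = 0; i < NUM_FIELDS; i++)
            if (strcmp(field_names[i], name) == 0)
                return (int)i;
        return -1;
    }

    /* get/update both go through the index: r->values[field_index("y")] */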


That seems like the thing I'm looking for! Thanks for the info


Sounds like you want a struct except you want to be able to reflect on the property keys (which you can't do normally in C/C++/Rust)?

I got thrown off by "but in an interpreted language" so I'm not sure whether you're requesting this for C or for another language. In JS, at least the V8 engine does some of this for you: if you have a group of objects that always have the same set of properties, V8 will detect this in many cases and optimize them under the hood as a struct or class (it must also store the string keys somewhere since those can be iterated in JS).

In C/C++ you'd have to store the keys as strings somehow, and also add some way to index the struct via those keys (since you can't even do that with a normal struct). You may be stuck with a hashmap one way or another.


Rust allows you to do this with a macro. The serde macros are pretty close to what you need, but a slight variant could give you a much more ergonomic API.


Right, I thought about suggesting a macro for C/C++ but I don't know enough about their macro systems to know for sure that it would work


I believe C++ has "Runtime Type Information" (RTTI) that allows you to do this. Although I also hear that it's not great.


To expand on not great: RTTI is one of the few features of C++ where you pay for it even if you don't use it, so it's not all that popular. It can be disabled with compiler flags to shrink the binary. LLVM's coding standard prohibits use of RTTI, they instead use their own 'hand-rolled' solution. [0] RTTI is also banned by the Google C++ coding standard. [1]

[0] https://llvm.org/docs/CodingStandards.html#do-not-use-rtti-o...

[1] https://google.github.io/styleguide/cppguide.html#Run-Time_T...


"I guess what I want is a struct like data structure, but in an interpreted language"

Python ffi is an option: https://cffi.readthedocs.io/en/latest/using.html#working-wit...

Though I don't know what advantage that route would have over most interpreted languages' already existing associative arrays / hashmaps / dicts.

There may be some module already built that serves your needs. DiscoDB is a good example: a very fast hashmap that uses mmap(). https://discodb.readthedocs.io/en/latest/


It depends on what you want its footprint in memory to look like. Since you don't need to mutate the structure, a densely packed structure seems optimal. For small structures, linearly searching through an array would be fast. For larger structures, a sorted array could be binary searched efficiently.

It also depends on how you can represent the string and their typical size. If the strings are small, then allocating fixed size chunks to store them inline might make sense. If the strings are long, it might make sense to go crazy and use a "trie" data structure to minimize comparisons.

I believe python uses hash maps for classes by default, but you can alternatively use "slots", which are less mutable and presumably more densely packed.


Yes. At some point I encountered this trie with payloads (which would act as the field values).


Sorry if I wasn't clear enough. Yes, I want something like an object in JavaScript, like a struct in C but at runtime. But I want to implement this myself. At first I thought a hash table was the right choice, but I don't need the insert/delete/increase-capacity operations that this data structure typically needs to implement. I want to implement this in Mathematica. The language doesn't have native objects, given that the main idea is for things to be immutable. But I can still program data structures, so I'm looking for one...


A class? After all, a class is a struct with some additional features to handle functions.

You can use a static instance if you just need a single object.


if the fields are different types, how do you know the type of a key access's return value?


I guess that can be solved by creating an Expression type


Use an object?


Will this expression correctly handle overflow?

size_t mid = (low + high) / 2;


Not an issue since the following is at the top of the function:

  if (size + size < size) {
    return NULL; // size too big; avoid overflow
  }


Interesting question.

Since this is integer division, I think you can do:

   size_t low_s = low >> 1;
   size_t high_s = high >> 1;
   size_t mid = low_s + high_s;
...With no overflow worries. (expanded for clarity)

That would effectively be distributing the division. So it's equivalent to low / 2 + high / 2.


This gives a different result when both low and high are odd.


what about low>>1 + high>>1 + (1 & low & high)


Careful about operator precedence. low>>1 + high>>1 + (1&low&high) is the same as low>>(1 + high>>(1 + (1&low&high)))
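For what it's worth, the usual overflow-safe midpoint sidesteps both the overflow and the rounding/precedence issues:

    size_t mid = low + (high - low) / 2;   /* no overflow, since high >= low */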


99% of the effort is 'what hash function?'


If you had to implement the hash table every time you wanted to use one, you'd probably realize that there is a better alternative data structure. Quite often your data is not as randomly distributed as you may initially think it is.


One of the cool nuances of hash tables is that you can exploit the nonrandomness of your data to get a more uniform distribution of keys over hash buckets than you would get with an ideal random-oracle hash function. For example, it's very common to use hash functions which will produce four consecutive output values (possibly shuffled) for string keys "r0", "r1", "r2", and "r3", which guarantees that those keys will not collide with each other. Hashpjw is one such widely-used hash function; FNV is another. (Better not use linear probing with such a hash function, though! And it's probably better not to use them in cases where the keys might be chosen adversarially to produce a DoS.)


> Quite often you data is not as randomly distributed as you may initially think it is.

That's why it's important to pick the right hash function. A hash table usually goes a long way before it becomes a performance issue.


What?


I think the parent comment is highlighting that a hash table is just one implementation of a "Map" abstract data type. Depending on the use case, other implementations (such as using a heap, in the C++ STL) may have better performance.


True, but as far as I know it's rarely the worst choice you can make, and it's plenty fast for a lot of applications. I don't know that there's a single data structure that's always better, which is what the GP seems to be implying.


The STL is always the wrong solution for performance.

set: the worst. unordered_set: the worst. vector: no small vector optimizations. string: horrible. array: use small_vector instead. deque: slow. list and forward_list: nobody uses that for performance.


Or for the example of word frequencies in the article, a prefix trie.


I tried a trie with data that I thought would work perfectly with it. There were a good amount of shared prefixes. But it ended up performing worse than a hash map. I don't remember all the details now, but after seeing that I'll just keep reaching for hash maps first (unless I'm sure that there's only going to be a very small number of elements, then I'll just iterate in order over a vector).


JavaScript also lacks a proper hash table. I wonder how much faster a native implementation could possibly be over, say, `Map()` (which keeps insertion order by maintaining linked lists and does not have constant-time operations).


What's wrong with using objects as hashes?


There's a table at the bottom of https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe... that shows some reasons why.


They can be faster than `Map` but:

1) Slower than hash tables

2) `{}` is not empty but has default keys

3) `{}` has multiple read-only keys

2 and 3 are caused by the prototype chain (properties not present on the initial object will be attempted to be resolved via the prototype hierarchy) which can be avoided via `Object.create(null)`


FWIW there's a standard hash table API in POSIX.


Yeah, but it's pretty lame since you have to specify the size of the table up front and can't change it later. They didn't implement the hard stuff!


There are many reasons why it's lame! Here's a short list:

- There is only one global hash table for the whole process! If you want individual hash tables you need to use implementation-specific extensions (hcreate_r().)

- There is no way to remove a key from a hash table. No implementation I know of provides an extension to do it. If you want to truly remove a key you must destroy and rebuild the table.

- There is no requirement for a means to grow the table. On some platforms it's possible to run out of space. If you want to truly grow you must destroy and rebuild the table.

- There is no standard way to free keys or data stored in the table. Destroying the table leaks all contents; you must keep track of keys and data externally and free them yourself. Only OpenBSD frees keys by default, and only NetBSD has extensions to call callbacks to free keys and data.

- Keys must be strings. You cannot provide custom types or void pointers as keys. There is no way to specify custom hash or comparison functions. The quality of the hash function is unspecified.

- When using ENTER, you may not want to allocate the key unless it is necessary to insert a new entry. Since the given key is inserted directly, it's not necessarily safe to use a stack-allocated buffer for lookups. It's awkward to replace the key with an allocation after insertion and it's unclear whether it's allowed.

This doesn't even get into incompatibilities between the various implementations. You will encounter all of the above flaws even if your only target is glibc.

No one should ever be encouraged to use POSIX hash tables. They should be deprecated and we should all just pretend they don't exist.


IMHO this is one area where the C committee could improve the situation by mandating a sane and useful hashtable implementation for the stdlib.

Or at the very least just deprecate the old one. I can't remember ever seeing it used in code.


Hash tables in C are not part of the standard. They are part of POSIX, so it is mandated only on UNIX-ish environments. The standard committee has nothing to do with that.


Maybe they should be. I know C is all about staying minimal, but native hash tables have proven themselves to be extremely useful in every other language where they are present.


Wow, that's quite the list! Why was such a bad (overly simplistic?) implementation allowed into a standard library?


This might sound stupid, but could you store N-1 entries, and then store a pointer to the next hash as the Nth entry? So you dynamically add fixed tables as you run out of space?

Yes, I realize the lookup would get complicated.


Nope. The POSIX API lets you have one table.

There is a GNU extension that lets you have more than one table.


[flagged]


> Are you really going to recommend growing the table by duplicating it?

Um... yes.

You want the data linearly in memory in the first place so you can have constant time access by index. If you want to grow a dynamic array, doing a realloc can easily cause the same effect as the article does manually (alloc new, copy data, free old), because the memory immediately after may be occupied by something else if you did any other malloc/calloc. Copying over the data is linear in time, thus degrading your constant time insert to linear time as soon as the table is full if you do a realloc for single items instead.

You may find the choice of growing when half full as in the article questionable, because then there will always be an unused half, but pretty much every remotely practical dynamic array implementation in C (and probably a ton of other languages) does that (grow by capacity * 2 when full) to maintain amortized constant time when appending.


It's also how many popular, in-production hashtable implementations work.

From the uthash user guide, for example:

"Normal expansion - This hash attempts to keep fewer than 10 items in each bucket. When an item is added that would cause a bucket to exceed this number, the number of buckets in the hash is doubled and the items are redistributed into the new buckets. In an ideal world, each bucket will then contain half as many items as it did before."

So, the person questioning "why" should be instead explaining "why not".


This is the worst attitude I have seen on HN. Look out people - it's the internet's gatekeeper - making sure only legitimate/fully thought out things are published to the internet. By golly, we wouldn't want to waste bits and bytes on just random things being available on the internet.


Yes, I believe you are intended to duplicate the table once a certain load factor (fraction of elements filled) of your choosing is reached.

For linear probing, this is recommended to keep your probe lengths logarithmic:

https://www.sciencedirect.com/science/article/abs/pii/019667...

Removing elements for a hash table using linear probing is fairly non-trivial. Without using so-called “tombstones” to flag an element as deleted, you essentially have to shift elements up in the table.

Also, your rudeness is really quite unwarranted. If everyone took that attitude, where would the teaching examples ever come from?


>Are you really going to recommend growing the table by duplicating it?

Setting a reasonable initial size then doubling it and copying when (if!) it needs to grow is a pragmatic, reasonable approach that’s quite common!



A teaching tool is deliberately simplified so as not to overwhelm students with details they don't need to worry about yet. This is absolutely the right way to go about it.


I suppose it's a personal preference thing, but I like oversimplified implementations when the purpose is educational.

I am more bugged by the accusation that the author did this because "look what I can do". I don't see any evidence of that.


The problem with your way of thinking is: who draws the line to define what's "half-assed"? Because everything can be improved. You should take ideas from this kind of article, not fully working code.


do we really need more hostile, sneering, condescending, toxic pricks like you policing the web?


This comment bugs me.


100% agree with U.


What if this became the new “invert a binary tree on a whiteboard”?


I wouldn't ask someone to whiteboard it but I think "what is a hash table?" is a brilliant interview question. The depth someone can go into tells you a lot and can bring up lots of other topics to discuss. It tells you a lot if a senior developer doesn't know what a hash table is or only knows "it's a class from Java" too.


> I wouldn't ask someone to whiteboard it

I did. Works very well as an interview question.


Hashtables are probably more directly useful than all the binary tree gymnastics.

The K&R C Programming Language book has nearly the same example (although I recall it using linked-lists for hash collisions and not probing).

I would be mildly suspicious of anyone claiming to have a recent CS degree who wasn't familiar with the ideas.


This shouldn't be a mystery to someone who's had a data structures course. It's a fundamental data structure. Yes, optimization always goes down a rabbit hole, but Knuth's "Algorithms in C" was mandatory in most engineering schools' CS degrees in the early 90's, and it's essentially what this website describes, but with more rigor.


Algorithms in C, which I strongly recommend, is actually by Knuth's student Sedgewick.


Ok, but school was years ago. I've had years of relevant experience that would make me a good fit for some job since, but I've forgotten how to implement a hash table in C within 30 minutes. Is that really the gatekeeper we want?


Knowing how to implement a hash table in C doesn't strike me as something you can "forget". If you know how to write C and you know how a hash table works, what is there to forget?


Honestly it's such a simple concept it shouldn't be hard to remember.


A good developer doesn't need to remember how to implement something. A good developer can work it out from first principles when asked.


"Is that really the gatekeeper we want?" Yes, I think so? I mean implementing a hash table in C is really easy. What is there to "forget"? If you can't do it, either you don't know what a hash table is, or you don't know how to program in C.

    enum { table_size = 1024 };
    typedef struct item { int k, v, full; } item;
    static item table[table_size];

    void put(item kv)
    {
        size_t orig = kv.k % table_size;  // dumbest possible hash function
        size_t b = (orig + 1) % table_size;
        // use linear probing for collisions
        while (table[b].full && b != orig) b = (b + 1) % table_size;
        if (b == orig) abort();  // table full
        table[b] = kv;
        table[b].full = 1;
    }

    item *get(int k)
    {
        size_t orig = k % table_size;
        size_t b = (orig + 1) % table_size;
        while (table[b].full && table[b].k != k && b != orig) {
            b = (b + 1) % table_size;
        }
        return (table[b].full && table[b].k == k) ? &table[b] : NULL;
    }
I haven't tested that (just as I wouldn't in a whiteboard interview) so I don't know if it works. I did briefly skim the article we're talking about, but I concluded I didn't have anything to learn from it, so I didn't read it. I found a bunch of bugs in my implementation as I was writing it. And of course it's a pretty dumb hash table: linear probing is very suboptimal, it uses a statically-allocated non-resizable hash table, it's specialized for a certain type of keys, and so on.

But are you saying you can't even do that? Probably it would be bad to hire you for a C programming job, then, unless it was an entry-level position for you to learn C in.

Evidently the above comment took me 15 minutes to write. Oh, and now I see another bug or at least lacuna: if you try to update the value of an existing key, it doesn't fail, but it also doesn't update, instead adding a new entry that will never be read. And then I did write a minimal smoke test and run it, and unsurprisingly, it does work:

    int main()
    {
      put((item) { 4, 7 });
      put((item) { 1028, 9 });
      put((item) { 3, 25 });
      printf("%d %d %d %p %p\n", get(4)->v, get(1028)->v, get(3)->v, get(5), get(3));
      return 0;
    }
Writing the test and the above additional commentary, and running the test, took another 8 minutes.

Oh, and there's another bug where it will loop infinitely when the hash table is full and the key is negative. (Actually it invokes UB but typically the result will just be an infinite loop.)

If it takes you 60 or 100 minutes to write this, or to write something better, maybe you still know C and know what hash tables are. But if you throw up your hands and say "I don't remember, it's been a long time since I was in school"? Either you don't know C, or you don't know what a hash table is, which probably means you've never written anything in C but toy programs. (This is a toy program, of course, but many toy programs in C don't need hash tables.)

The compilable and fixed toy program in question is at http://canonical.org/~kragen/sw/dev3/dumbhash.c if you want to probe it for bugs.

It occurs to me that you might have been talking about something that isn't a C programming job. For example, we might be talking about a job programming an HTML database front-end in Python using Django. And of course it would be silly to reject someone for a job like that because they didn't know C, unless they claimed to know C on their résumé. And it's totally reasonable for a Python programmer, or a Python/Perl/JS/bash programmer, to not understand hash tables.


Sedgewick's?


Who said anything about interviews? This is just in the context of "something interesting to read with your morning cuppa".


This is pretty straightforward stuff.


Hash tables are deceptively complicated, like sorting algorithms.


Like sorting algorithms, they can be as complicated as you want them to be.

A merge sort isn't the fastest sort but it's pretty easy for anyone to implement. Something like a timsort is a bit more complex but is what a lot of languages use nowadays.

Hash tables are in the same boat. A basic hashtable is pretty easy to implement. It's only when you start looking at things like ideal item placement and optimal filling that things start to get more complex.


Dunno, I wrote https://news.ycombinator.com/item?id=26595520 just now and my super dumb hash table has two bugs in its 21 lines of code:

1. It doesn't update—attempts to change the value associated with an existing key are silently ignored, although they do consume buckets.

2. It loops forever if the table is full and the key it was looking for is negative, because that invokes conversion of a negative signed value to an unsigned value, which I think is UB in C.

Maybe it has more bugs I haven't found yet.

Additionally the reason it took me 15 minutes to write 21 lines of code was that I changed a bunch of things in midstream, before posting the comment:

1. At first there was no "full" field in the hash table bucket, so there was no way to distinguish full buckets from empty buckets.

2. And then when I added it, I added it as "dead", with the opposite Boolean sense of "full", so I would have needed a separate initialization routine (to set all the dead fields to something nonzero). Instead I swapped the sense.

3. I was trying to detect the case where we'd looped back around to our starting point, indicating that the table was full and the search was unsuccessful, and I realized that the condition I wanted was not b == orig, which would always fail on the first iteration, but (b + 1) % table_size == orig. But doing that on every iteration seemed obscene, so I incremented b (mod table_size) to start out with instead.

4. At some point I think I realized I'd forgotten to set the .full field on newly inserted items.

5. Also I realized I'd forgotten to test the table[b].k == k condition in `get` as well, which would have made every search unsuccessful.

So I think it's reasonable to say that even a pretty limited hash-table implementation has plenty of opportunities to write bugs, more than I expected. But maybe I'm just running on low blood sugar this afternoon or something.


By contrast this binary search tree code I wrote in OCaml seems to have worked as soon as I got it to compile, although its worst-case performance is pretty bad, and, unlike the much longer hash table code, very likely:

    let rec getp k = function `Empty -> None
      | `Node(k1, v1, x, y) ->
        if k1 = k then Some v1 else getp k (if k < k1 then x else y)
    and putp k v = function `Empty -> `Node(k, v, `Empty, `Empty)
      | `Node(k1, v1, x, y) when k1 = k -> `Node(k, v, x, y)
      | `Node(k1, v1, x, y) -> if k < k1 then `Node(k1, v1, putp k v x, y)
                                         else `Node(k1, v1, x, putp k v y)
And it's about a third the size of the hash-table code. OTOH I spent the last hour writing it.

With four more lines of code it's also a sorting algorithm:

    and dictofp = function [] -> `Empty | (k, v)::xs -> putp k v (dictofp xs)
    and itemsp d = let rec iter d tl = match d with `Empty -> tl
      | `Node(k, v, x, y) -> iter x ((k, v) :: iter y tl)
                   in iter d []
Look:

    # List.map (fun (x, y) -> x) (itemsp (dictofp (List.map (fun x -> (x, x)) ["this"; "is"; "a"; "list"; "of"; "strings"])));;
    - : string list = ["a"; "is"; "list"; "of"; "strings"; "this"]
Of course the same comments about worst-case performance apply...


This is a fun exercise, but if anyone is thinking about doing this for a project - don't - use a library.


Existing libraries make decisions that aren't necessarily suitable for all applications. Most hash table code makes excessive use of malloc/realloc or has power of two growth which is problematic in memory constrained systems. If you want to minimize waste by using high load factors you're going to want Robin Hood hash which is still relatively uncommon in lib code.


Although Robin Hood hashing has good performance under high load, it needs to record the probe length in each bucket. This partly cancels its advantage especially for small keys. In my evaluation, Robin Hood hashing doesn't use less memory in comparison to others.

> Most hash table code makes excessive use of malloc/realloc or has power of two growth

Libraries based on open addressing only call malloc when there is not enough room. That is not "excessive". Also, if you know the maximum size beforehand, many libraries allow you to preallocate buckets such that you never trigger rehashing in this case.


“But Doctor, I am Pagliacci!”



