
Write a hash table in C - jmlr
https://github.com/jamesroutley/write-a-hash-table
======
fizixer
I've heard of two possibilities (as a recommendation) for using data
structures in C without having to develop them:

\- BSD queue.h offers linked lists, queues, etc [1].

\- UT-hash offers hash tables [2].

I've also read some people being frustrated with them, probably due to the
quirky syntax/API. On the plus side (BSD at least) it's just a header file
that you can download, include and start using.

[1]
[http://bxr.su/OpenBSD/sys/sys/queue.h](http://bxr.su/OpenBSD/sys/sys/queue.h)

[2]
[https://troydhanson.github.io/uthash/](https://troydhanson.github.io/uthash/)

~~~
dottrap
Also there is the header-only khash.h [3], complete with benchmarks:

[https://attractivechaos.wordpress.com/2008/08/28/comparison-...](https://attractivechaos.wordpress.com/2008/08/28/comparison-
of-hash-table-libraries/)

(Also provides kbtree.h, ksort.h, kstream.h, kvec.h)

[3]
[https://attractivechaos.wordpress.com/programs/](https://attractivechaos.wordpress.com/programs/)

Also there are the stb header-only libraries. There is a hash table
implementation buried inside stb.h

[4] [https://github.com/nothings/stb](https://github.com/nothings/stb)

------
msarnoff
"The language doesn't come with [a hash table implementation] included"

The standard library does come with a half-baked hash table implementation[1]
(hcreate/hdestroy/hsearch) that only allows creation of _one_ global hash
table. GNU libc[2] adds hcreate_r/hdestroy_r/hsearch_r that allows multiple
tables to be created.

The APIs are strange and antiquated. On the other hand, the hash table
implementation described in the article presents a much nicer interface.

[1]
[http://pubs.opengroup.org/onlinepubs/009695399/functions/hcr...](http://pubs.opengroup.org/onlinepubs/009695399/functions/hcreate.html)

[2] [http://man7.org/linux/man-
pages/man3/hsearch.3.html](http://man7.org/linux/man-
pages/man3/hsearch.3.html)

~~~
to3m
The hcreate nonsense is POSIX, not C99, so a conforming C implementation might
not include it.

~~~
twic
How does junk like this ever get into a standard? Didn't a whole committee of
very experienced systems programmers have to sit down and say, "yes, these are
routines that i think people will find generally useful"?

It looks like the same header, search.h, also includes functions for working
with binary trees, where the user passes in the root pointer, and so which
support multiple trees:

[http://pubs.opengroup.org/onlinepubs/9699919799/functions/ts...](http://pubs.opengroup.org/onlinepubs/9699919799/functions/tsearch.html)

So this committee must have signed off on the hashtable stuff even though they
had an example of how to do it properly right next to it!

~~~
jws
A standards process is frequently about documenting what already exists in
common implementations so that there is a specification for future
implementations. POSIX was very much one of these processes.

~~~
to3m
OK... so how did junk like this ever get into enough implementations that
people were forced to include it in the standard? ;)

~~~
emmelaich
tsearch/hsearch/lsearch have been around for at _least_ 30 years. My guess is
that they've been around for a lot longer.

Why not have these anyway? Otherwise people will make their own (probably
buggy) equivalents.

I'm actually surprised that people aren't aware of them.

~~~
to3m
This is specifically about hcreate/hsearch, which are utterly useless, since
you only get one hash table at a time. The reason you don't want crap like
this is just that: it's crap.

The _r versions look perfectly sensible.

------
sillysaurus3
The link at the bottom of the intro is broken:
[https://github.com/jamesroutley/write-a-hash-
table/blob/v0.1...](https://github.com/jamesroutley/write-a-hash-
table/blob/v0.1.0/hash-table) preventing the reader from advancing.

I think the whole thing would be best as a single page, to be honest. Even if
it's long, people have scroll wheels. But it's a stylistic choice.

------
pieguy
"(long)pow(a, len_s - (i+1))" is both inefficient and inaccurate. You should
store the current power in a variable which you multiply by 'a' and reduce by
'm' on each iteration.

Also, your double hashing scheme has a bug. You identified that it's
problematic if hash_b returns 0, but adding 1 to the result doesn't actually
solve anything, because now you have the same problem if hash_b returns
num_buckets-1.

------
__s
Weird there isn't a src directory collecting everything together

Hashtables are one of those data structures one can easily get away with never
having to write on their own, yet still use every day. In scripting languages
it's impractical to roll your own, & in the more abstracted low level
languages there'll usually be a pretty good generic version. Yet there's much
room for variation

I had the joy to be out of reach of off-the-shelf hashtables for luwa,

init:
[https://github.com/serprex/luwa/blob/436c7271120fcd116af30e9...](https://github.com/serprex/luwa/blob/436c7271120fcd116af30e906919afe7a7e55254/rt/alloc.lua#L156)

get/set:
[https://github.com/serprex/luwa/blob/436c7271120fcd116af30e9...](https://github.com/serprex/luwa/blob/436c7271120fcd116af30e906919afe7a7e55254/rt/table.lua)

hash:
[https://github.com/serprex/luwa/blob/436c7271120fcd116af30e9...](https://github.com/serprex/luwa/blob/436c7271120fcd116af30e906919afe7a7e55254/rt/obj.lua#L130)

Still need to implement Lua's sequential-portion-as-array logic. I do like the
API not having a 'remove' method, opting intead for nil by default. nil value
entries get vacuumed out when rehashing. This ends up acting as tombstone. In
an earlier iteration I was using 0 as a marker distinct from nil, but an
implementation detail is that nil is stored at address 0.. oops

------
nathan_long
If you just want to see hash tables work as a data structure, here's a talk-
as-blog-post I made on the subject.

[http://nathanmlong.com/2015/10/reimplementing-rubys-
hash/](http://nathanmlong.com/2015/10/reimplementing-rubys-hash/)

~~~
igravious
Nathan, that was absolutely fan-fuckin-tastic. Pardon me for the potty mouthed
tmesis but that was a highly _highly_ amusing run through. How did you do the
graphs? (I'm always wondering how people do graphs.) And is that really you on
the sofa waking up from the dream?

------
dullgiulio
If your set is known at compile time, make a Perfect Hash Table with gperf
instead. It's commonly used in compilers and other software with a parsing or
interpreting stage.

------
benlorenzetti
In the intro section, on extending to unicode being non-trivial...is the
argument that if your input has some non-ascii data you probably have all non-
ascii data and the above 127 bucket will be too full? Honest question.

------
baron816
While the reasons for this that the author gave are valid, I think it would
also be valuable to do this in a high level language. I'm sure lot of people
who can read C code went through a CS program, and hopefully learned how hash
tables work then. People without CS degrees and who mostly write
JS/Ruby/Python/etc. and use hash tables on almost every line of code they
write are much less likely to know how they work. Just saying, someone should
do it for other languages.

~~~
eighthnate
The problem with writing hash tables or any fundamental data structures in
JS/Ruby/Python/etc is that most of the hard work is done for them.

The hard part is allocation memory, keeping track of the structure(array)
holding the hashtable, etc.

What else is there? Implementing the hash function/code?

If you don't have a CS degree, you probably don't even know how lists, queues,
stacks, etc really work. But the kind of programming that is done at a "high"
level, you might not even need know it.

Instead of saying people should implement it JS/Ruby/Python/etc ( all of whom
are OOPs ), they should learn C and implement it in C.

~~~
pfranz
There's room for both. There's a reason why a lot of CS programs have shuffled
from primarily C to Python (some had Java in between).

Not having to deal with boilerplate and housekeeping lets you focus on
learning the topic at hand. Especially in a classroom setting where you're
probably focusing on the specific subject instead of real-world dirtiness.

But after learning the basic concept, you do need an actual implementation if
you want to do anything.

~~~
eighthnate
> There's room for both.

Sure. I'm all for learning ML, Scheme, C/C++, Python, Ruby, Java/C#, etc.

> There's a reason why a lot of CS programs have shuffled from primarily C to
> Python (some had Java in between).

The reason is called dumbing down the curriculum. My fall freshman year, my
college tried to switch from C to Java in order to make it easier for the
general population to get into CS and there was a major pushback from the CS
student body, alumni and a bunch of the CS professors. We went back to C/C++
my sophomore year.

> Not having to deal with boilerplate and housekeeping lets you focus on
> learning the topic at hand.

I disagree. Having to deal with the boilerplate and housekeeping helps you
learn the topic at hand.

> But after learning the basic concept, you do need an actual implementation
> if you want to do anything.

I disagree. Giving them "training wheels" will keep them dependent on training
wheels. In an idealistic world, people go from Java/Python to C/Assembly, but
that's rarely the case.

CS should begin with Math ( boolean algebra/logic/etc ), Assembly and C. Then
go on to functional paradigms. Then to OOP. Etc ( in the language ). Of course
they should also learn about OS( kernel/filesystems/etc ), networks,
computation theory/etc, compilers, etc.

The dumbing down of the curriculum so that music/english majors can pass CS
classes doesn't do anyone any favors.

~~~
B1narySunset
_The dumbing down of the curriculum so that music /english majors can pass CS
classes doesn't do anyone any favors._

You sound butt-hurt

~~~
eighthnate
Huh? I'm not butthurt. Seems like you are though.

I'm sorry having standards and expectations is something that offends so many
people.

C isn't a "difficult" language. It's something any CS major should have an
understanding of.

I can't believe on hackernews, I'd get attacked for saying people should learn
C.

Nevermind that most OSes and most languages themselves are written in C. Even
python interpreters are written in C.

Since when is saying people should learn C for a CS degree considered being
"butthurt"? What's next? Saying people should learn latin/ancient greek for a
liberal arts degree is considered being butthurt?

~~~
B1narySunset
You're being attacked because of your condescending tone implying that music
and english majors aren't smart enough to pass the CS curriculum

~~~
grzm
_You 're being attacked_

Regardless of how much you may disagree with what someone said or how they
said it, attacking them for it (even verbally) is not an appropriate response.
It only contributes to worsening the discourse. If you can't find a way to
disagree or express an opinion constructively, it's better to refrain from
commenting at all.

------
clarry
Don't allocate these items individually; don't store pointers to such items.
They're tiny, and the pointer chasing + allocator bookkeeping overhead +
potential randomisation make it inefficient for no gain. Just allocate an
array of items. This way you'll also deal with fewer potential errors.
Speaking of errors, please handle them and fix the API so that the caller can
know if there's a problem.

~~~
pacaro
Yeah, reading through that jumped out at me.

Also the use of strdup, while a reasonable choice, should also come with
caveats about ownership and copying

I stopped at the use of pow for exponentiation in the hash function. Just
iteratively multiply your factor and take the modulus on each round.

This would handle utf-8 just fine if characters are cast to unsigned in the
hash function and a larger prime factor is used

None of these are real criticisms, I'd just love to see a pointer to more in
depth information on how to really write hash functions

I hope that i missed it, but this definitely needs a clear statement that this
is a pedagogical exercise and not a production ready implementation

------
smegel
This might seem academic, but I actually had a real reason to hand-roll a hash
table implementation that would work in process-shared memory (as an Apache
module). It was based on the hash table implemented in mod_ldap, and was part
of an LRU cache.

------
jay-anderson
This is a good mostly simple overview. The appendix should also mention robin
hood hashing as a method for addressing collisions. From my understanding it
is cpu cache friendly and works well in practice.

------
peterbotond
berkeley db 185. it has served me very good over the years and it was fast
enough for real time market data and indexing and many more.

------
7Anonymous
Just go back to the table of contents. The link works from there.

------
agumonkey
hah, just a few days ago I named a few files hshtbl.c trie.c and so on. All
after a Raymond Hettinger talk about how they rethought Python 3 dicts.

------
cletusw
Link should be changed to [https://github.com/jamesroutley/write-a-hash-
table](https://github.com/jamesroutley/write-a-hash-table)

~~~
sctb
OK, we've updated the link from [https://github.com/jamesroutley/write-a-hash-
table](https://github.com/jamesroutley/write-a-hash-table).

~~~
fao_
What's the difference in the URLs, I can't see any.

