
15-line hash table in C - api
http://pastes.archbsd.net/graphitemaster/15_line_hashtable
======
tsmith
People understand that the compiler/executable doesn't run any faster the less
newlines there are, right?

This is cute, but as others have pointed out it, it isn't really a correct
implementation of a hash table. Also, it wouldn't pass a code review anywhere
I've ever worked.

~~~
burkaman
> This is cute

Well that's the whole point, right? This is clearly written for fun, it's not
production code, it's not supposed to pass a code review. It's just supposed
to be cool and interesting, and a challenge for the author.

~~~
lelandbatey
Exactly! Building cute but ultimately useless[0] programs is a lot of fun and
makes for great brainteasers. It's also a great way to interact in a friendly
but still technical way with your programmer peers. I recommend it to
everyone!

[0] - [http://lelandbatey.com/posts/2014/09/binary-tree-
printer/](http://lelandbatey.com/posts/2014/09/binary-tree-printer/)

------
throwaway_yy2Di
There's no bounds testing in hget(): some valid sequences of operations will
cause buffer overflows. For example, this should segfault:

    
    
        int (**table)[2] = hnew();
    
        for (int j=0; j<40; ++j) {
          hset(table, (10 + j*SIZE), 0);
        }
    

The problem is, the probing function doesn't wrap (the "t += h" part), so if
you have have several colliding keys, it will probe for them past the end of
the table.

~~~
balakc
Every time a lookup is performed, isn't it linearly looking through the table
to find that key... That doesn't sound like a hash! Maybe am missing something
here.

~~~
jfoutz
Instead of a linked list to store hash collisions, it's using linear probing,
which just uses the next open spot in the table. (Maybe I'm misunderstanding
the question, it looks like they're indexing into the table correctly to me)

------
mathetic
I love how a comment thread about a fun little program turned into HN showing
off their wisdom in software engineering coupled with occasional boasting
about tight code reviews.

I will not understand how and why people's egos get threatened by things like
this.

~~~
jakobegger
I actually enjoyed reading this thread. Yes, there are some snarky comments
about missing newlines, but there are also a lot of constructive comments
here!

------
shaftoe
This code reminds me why clever is the enemy of good.

~~~
inDigiNeous
Exactly. Why people try to write code in less lines of code? More lines ==
better readability, better debugging, easier to move around, easier to fix
single parts and so on.

~~~
reacweb
I do not code much, but I have to read a lot of code when I investigate bugs.
I never see very concise code, but I have quite often verbose code. The
readability of verbose code is very bad: I have to scroll a lot more to
understand algorithm and often, it is split in many files.

In my experience more lines == reduced readability, more difficult to move
around, more difficult to fix single parts (any change requires to modify many
files) and so on. More lines of code is often against the simplicity.

I have seen code that was very difficult to understand. Rewriting it to make
it simple never needed to increase the number of lines.

~~~
bobchops
I agree in that I'd much rather read terse expressive code than pages of
dribble convey anything meaningful, having seen a lot of verbose code too.
That said, code should be well formed. When I started out in C I only used
curly braces when I needed to. After creating several bugs stemming from this
I now add them as a rule of thumb. That particular issue is compounded here by
the fact that multiple statements are shoved on a single line - also poor
form.

------
je-so
(See [http://pastebin.com/KypbginU](http://pastebin.com/KypbginU))

// corrected overflow hget

    
    
      for (int k2=k,o=0; t[k&(SIZE-1)] && **t[k&(SIZE-1)] != k2 && o<SIZE; ++k,++o);
        return t+(k&(SIZE-1));
    

// hset now allows overwrite

    
    
      for (int (**a)[2] = hget(t, k); a && (*a || (*a=malloc(sizeof(**t)))); (**a)[0]=k,(**a)[1]=v,a=0);

------
nine_k
I can see two legitimate cases for writing very short, very dense code.

1: Implement an algorithm in a very concise and straightforward, even if not
very efficient, way. An example is the classic quicksort in Haskell, which
most literally implements the idea of the algorithm:

    
    
        qsort [] = []
        qsort (p:xs) = qsort [ y | y <- xs, y < p ] ++ [p] ++
                       qsort [ y | y <- xs, y >= p ]
    

2: Implement an algorithm in a super-efficient, while non-obvious, way. An
example is the inverse square root calculation made by famous by John Carmack
([http://www.codemaestro.com/reviews/9](http://www.codemaestro.com/reviews/9)).

The linked hashtable implementation, to my mind, is neither very elegant nor
fiendishly clever, nor even reasonably correct.

~~~
taeric
There is something funny about stating that most "literally" implements the
idea of the algorithm, when the main "idea" of the algorithm is to be quick.
:)

Also, even in examples of super efficient ways, be wary. The inverse square
root you are referring to is actually slower than what many CPUs can do with a
single instruction nowdays.

Also, I think you are missing out on the main reason this code was written.
Essentially a puzzle to see if it can be done.

~~~
foota
Just because it's one instruction doesn't make it faster

~~~
to3m
In the case of a reciprocal square root estimate it probably is, particularly
compared to the code given. A fast reciprocal square root estimate function
usually takes less than 5 cycles - and is less than 1/4096th out, so the
accuracy is better too.

------
Demiurge
Thank god, I was just about to run out of newlines!

------
jakobegger
For those of us who actually ship code, I'd like to suggest khash[1], which is
a pretty nice implementation of a hash table in C. It uses lot's of macros, so
if you're looking for a brain teaser you will be satisfied as well :)

[1]:
[http://attractivechaos.github.io/klib/#Khash%3A%20generic%20...](http://attractivechaos.github.io/klib/#Khash%3A%20generic%20hash%20table:%5B%5BKhash%3A%20generic%20hash%20table%5D%5D)

~~~
AntiRush
uthash[1] is another good solution. It's pretty bulletproof and has lots of
options. Also implemented with macros, if that's a plus :-).

[1]:
[http://troydhanson.github.io/uthash/](http://troydhanson.github.io/uthash/)

~~~
gjulianm
I don't think being implemented with macros is a plus. I've seen some friends
use it and get stuck whenever it was failing because it's impossible to know
how it's working. It's the same type of cleverness some people criticize in
other comments. Yes, it may be complete and useful but it's opaque.

------
ivan_ah
will need the clockwise spiral rule to parse this:
[http://c-faq.com/decl/spiral.anderson.html](http://c-faq.com/decl/spiral.anderson.html)

~~~
HillRat
The author helpfully explains the code by way of un-optimizing it to something
readable:
[http://pastes.archbsd.net/graphitemaster/hashtable_explinati...](http://pastes.archbsd.net/graphitemaster/hashtable_explination)

~~~
chii
see why couldn't the code be written this way by default! then they would've
been understandable from the beginning!

------
vonmoltke
So, I have been looking at this implementation for several minutes. Ignoring
what others have already posted about the code being way too dense for no
benefit, how is this a hash table? It looks like a bog-standard associative
array.

~~~
jayvanguard
No doubt, I prefer to have actual hashing going on in my hash tables.

------
codeshaman
Why the hell does this thing get upvoted ? What's good about this
implementation? It's cryptic, it's buggy and it's useless for any practical
purpose.

Why is this better than an array of structs ? How much longer can that be ?

------
readme
If you rewrote this exact logic in Python it would be longer, because then the
code would actually be readable.

However, fortunately it's just two chars: {}

~~~
jeorgun
It seems to me that the exact logic behind those two chars is roughly 150
times longer :) ( [https://github.com/python-
git/python/blob/master/Objects/dic...](https://github.com/python-
git/python/blob/master/Objects/dictobject.c) )

------
viraptor
Is this some common style?

    
    
        int (**hnew())
    

I've never seen parens used like that. Usually it's:

    
    
        int **hnew()

~~~
userbinator
[] has higher precedence than *, so without the parens you get "function
returning array of two pointer to pointer to int" (which is illegal) instead
of the desired "function returning pointer to pointer to array of two int".

~~~
pmelendez
So would that be equal to this?

    
    
        typedef int arr2int[2];
        typedef arr2int** ptr_arr2int;
    
        #define SIZE 1024
        static ptr_arr2int hnew() {
            return calloc(sizeof(int**), SIZE);
        }

------
art187
This doesn't seem to use the hash in the first access. So the first hset,
seems to always go to the first row of the table.

Perhaps this is what is desired (fixing roll over issue as well):

    
    
        static int (**hget(int (**t)[2], int k))[2] {
          int (**t_old)[2] =   t;                                                                                                                                                                                   
          int h = k & (SIZE - 1);
          for (t = t + h; **t && ***t != k;                                                                                                                                                                         
               h = ((h + 1) & (SIZE - 1)), t += h, t = t_old + ((t - t_old) & (SIZE - 1)));
          return t;
        }

------
seyfulislam
everything is 15-line when you have semicolon.

------
valleyer
Won’t hget() easily run off the end of the array if it has a few collisions?
The function doesn’t save off the original value of t and then does an
unbounded number of (t += h) operations.

edit: someone else pointed this out already. How did this make it to the front
page of HN?

------
pmelendez
Is this even a hash table? It looks more like a regular associative array. It
seems like look ups are linear in time which is ditching one of the biggest
advantages with hash tables.

Also since keys are integers, how this would be useful? Isn't the same that a
regular array?

------
im2w1l
It took me some serious thinking to figure out why you could take __t, despite
_t being NULL.

If I managed to understand it correctly, it is because __t is an array, and
arrays are special.

_ t is a pointer to an array, i.e. it points to the first element of the
array. __t is an array, which means it behaves like a pointer to the first
element.

The remarkable consequence of this is that _t and_ _t have the same numerical
value, but different types._ t== __t evaluates to true.

------
OldWolf2
A bug nobody has pointed out yet is that star-star-t (sorry, can't figure out
this site's markup for escaping stars) dereferences a null pointer. It appears
to work because gcc decides to generate the same code for star-star-t as for
star-t (which is correct behaviour if star-t is not NULL, and moot if t is
NULL). The code should just be star-t there.

Also, sizeof(int star-star) in hnew should be sizeof( int (star)[2] ).

------
dllthomas
I'd call this 16 lines - multiple statements on one line is cheating; if we're
going that route this could be 3 lines.

------
deweerdt
A small improvement could be to hash the key using a prime multiplier and
successive multiplications, instead of using the key and increments of one.
It'd reduce the collisions at the expense of a more computationally expensive
hash function.

~~~
nowne
The problem with that method is that it doesn't have data access locality,
while linear probing does. Linear probing ends up being more efficient because
it is easy on the cache.

~~~
stormbrew
It seems to me this would only be true if the keys that collide are related to
each other, or you have a vastly oversized table that you collide a lot in, at
which point you're just accidentally synthesizing a smaller table.

What am I missing here?

[edit] I guess the other situation would be if the keys are largely
sequential, but then a hash table seems like an odd choice of data structure.

~~~
nowne
Linear probing works by initially hashing the value, call it h(x), then if
there is a collision, it checks h(x)+1, h(x)+2, ..., h(x) + k, until it finds
a open slot. Lookup works in the same way, and deletion is a bit more
complicated.

This model plays nicely with the cache, although its downside is there tend to
be more "runs" of contiguous filled slots in the hash table. This method still
provably takes an expected insert/lookup time of O(1) with a 5-wise
independent hash function and a load factor smaller than 1.

~~~
stormbrew
I do know how linear probing works, thanks. ;)

But now I see where the confusion lies. I was taking your post to be replying
more to the hash function part of the GP, but you were talking specifically
about the skip distance. Yes, now I see what you mean, and I'm not actually
sure how I misinterpreted so badly in the first place.

------
linuxlizard
(Insert my usual rant about not checking malloc()/calloc() for failure.)

~~~
bnegreve
Note that on many systems (e.g. Linux), malloc/calloc won't always return NULL
when you're out of memory because of lazy memory allocation policies. It will
only crash when you start reading / writing. That makes it arguably less
useful to test the return value.

edit: clarity.

~~~
lgeek
However, mmap and the functions which rely on it will return MAP_FAILED/NULL
if no gap large enough is found in the virtual address space or if you've used
MAP_FIXED and the area isn't free. If you don't check for errors, the
application could potentially end up trying to dereference a NULL pointer.

~~~
bnegreve
Ok, let me rephrase then: On modern systems, checking the return value of
malloc() is not a proper way to check that the system is out of memory and
therefore it doesn't guarantee that the program will run correctly in any
case.

That's what I wanted to point, and I agree that it remains useful to detect
other types of errors such as the ones you mentioned.

------
Aardwolf
What's with the hardcoded size though...

~~~
snorkel
Generally it's easier if the height of a hash table is a fixed value (which
can be calculated once at runtime based on how much memory the table can
consume) so that the hashing logic is:

target_row_number = table_height % int_value_of_key

If table height keeps changing then this formula would be inconsistent. This
code isn't using % but kind of doing the same using &

------
vivekvelankar
Just add this line to make it much more readable typedef int (
__table_row)[2];

Replace ( __int)[2] with table_row :)

------
youngthugger
Is learning how to create a hash table necessary for a software engineer?

~~~
fsloth
Yes. Hash table, linked list and a graph of some sort are those fundamental
datastructures that everyone should have implemented at least once by
themselves. IMO, the core to excellent software engineering his having a lucid
intuition about fundamental things such as those.

Mainly because they are simple, and lot of problems are solvable by using them
without resorting to Someone Elses Gigantic Platform Framework.

------
Buge
hset doesn't allow you to overwrite old entries. But you can do it manually
with hget.

------
Saltor
why would you a 2 element array of int here instead of defining a struct of 2
ints?

~~~
ExpiredLink
A wasteful struct - are you kidding?

~~~
Saltor
I'm not. Is there some performance penalty for using a struct over a 2 element
array?

~~~
vidarh
No performance penalty. It would make the code longer.

Consider "<Foo> in <bar> lines" type articles to be performance pieces - the
point is to make it short, not readable or user friendly.

------
jlink
unreadable code is unreadable

------
andyidsinga
just think how amazing and readable it could be in 100 lines

------
thrownaway2424
C++ wins again.

~~~
jevgeni
C++ never won.

