Solving “Two Sum” in C with a tiny hash table

Const-me · on June 29, 2023

Given the constraints, the proposed implementation is not efficient at all. Here’s a quick benchmark to compare two implementations: https://gist.github.com/Const-me/265b8a3e89f17f4268d34d64771...

In theory, the hash map is O(n), and std::sort approach is O(n*log(n))

And yet when measured on my computer, std::sort is more than twice as fast. Specifically, for hash map my program prints 3574736 (best of 10 tries), for std::sort only 1695712. My CPU base frequency is 3.8 GHz, so these numbers measured with RDTSC translate to 941 and 446 microseconds, respectively.

I’m using a CPU with AMD Zen 3 cores, and compiled that code for AVX2 ISA, with Visual Studio 2022.

attractivechaos · on June 29, 2023

std::unordered_map is notoriously slow, several times slower than a fast hashmap implementation like Google's absl or Martin's robin-hood-hashing [1]. In addition, your implementation lets the hashmap grow from a small size. As you know the total length, you could preallocate the hash table and avoid rehashing. As typical hash table implementations double the capacity during rehashing, preallocating the buckets should be about twice as fast if we don't consider the cache and the complex mechanism behind the STL allocator.

That said, std::sort is not the fastest sort implementation, either. It is hard to say which is faster for a small N of 10000. Your general point is still right in that an O(n*log(n)) algorithm is not necessarily slower than an O(n) algorithm in practice.

[1]: https://github.com/martinus/robin-hood-hashing

Const-me · on June 29, 2023

Adding `map.reserve( vec.size() );` indeed helped, but still slower than sorting these entries. Specifically, the time for the hash map is 740 μs this way.

I think the main reason for the observed behavior is memory management. The sorted vector takes under 80kb of RAM, allocated with a single malloc() call, and the entire vector fits in L2 cache which is very fast to access. OTOH, without advanced shenanigans like custom allocators, the hash map needs 1 malloc() call per entry.

On a general note, I noticed that when the data being handled fits in the faster in-core caches, dense buffers pretty often perform best, despite these theoretical big O-s. Even if this means spending some extra time to sort the data.

charlieyu1 · on June 29, 2023

Just learning c++, curious how reserve works. I thought map would allocate memory exponentially automatically, so an early reserve at best saves you a few reallocation

Const-me · on June 29, 2023

That reserve() method only reserves memory for the hash table inside the container. It does help with performance to an extent, but the memory for the items stored in that map is still allocated one by one, as new entries are inserted into the map.

I think the main reason why reserve() helped substantially weren’t saved malloc() calls, it were saved re-hashes. Each time the hash table grows, it needs to re-insert all items currently stored on the map into the new, larger table of these buckets.

tialaramex · on June 29, 2023

Sort of. For std::unordered_map there are two types of memory allocation, there's a table, of pointers, and then there are nodes, which themselves have a pointer to a next node if any, forming a "chain".

The table is indeed grown exponentially as you describe. If you tell std::unordered_map that you expect 10000 entries using reserve, it will calculate how big it would expect to grow a table for 10000 entries and make it that size immediately.

The nodes however are created when you insert data. Even though you called reserve, it is an allocation each time you insert to make space for the new node.

If you used a Swiss table (e.g. from Abseil) the reserve behaves more as a person might expect, it will allocate a data structure big enough to insert at least 10 000 entries successfully, and no further allocations are needed unless you keep going.

gumby · on June 30, 2023

Don't forget those reallocations require copying as well. Also reallocation can increase memory fragmentation, not that it matters for this toy problem.

Basically doing less computation is always a win when it's trivial to do so. This is not a c++ issue.

nly · on June 29, 2023

If you wanted to see the effect of memory allocations then you could use boosts intrusive unordered_set implementation

https://www.boost.org/doc/libs/1_82_0/doc/html/boost/intrusi...

It doesn't own the underlying memory, so you can preallocate your nodes

attractivechaos · on June 29, 2023

> the [std::unordered_map] hash map needs 1 malloc() call per entry.

This is a key reason why std::unordered_map is several times slower than fast hashmap implementations that only use a couple of long flat arrays per hashmap object. Nonetheless, common STL implementations use their own allocators by default and would not call malloc() on small objects frequently.

creata · on June 29, 2023

Can someone explain why std::unordered_map has to be slow?

I vaguely remember that it's because it isn't allowed to invalidate references when inserting elements.

tialaramex · on June 29, 2023

Yes, it's obliged to be the design you'd have been taught in the 1980s or maybe 1990s in a typical data structures class, linked lists of entry nodes rather than a modern open addressed map.

You can satisfy typical real world needs with better structures now, even when people want stable references to the nodes, but you can't implement the API that std::unordered_map promises with the faster choices. So there are "near drop-in" replacements, but you can't actually swap those into the C++ standard library.

The chained design has terrible locality, which means doing lookups (failed or successful) in the hash table will cause more memory fetches, which are very slow, compared to a modern design which was optimised for this.

Imagine we're looking at a Swiss Table, we're looking for 12530, we hash that, we get an index into our hash table, that causes our first memory fetch, to pull in the RAM where that part of the hash table lives. We check the key, this one is wrong, we step to the next, it's in the same blob of RAM but it's wrong, we step to the next, it's empty, we're done, one memory fetch, nothing found.

Now std::unordered_map, we're looking for 12530, we hash it, we get an index into our hash table, we do a memory fetch. We get a pointer to a node. We do another memory fetch, we have a key to examine, it's wrong, we get a pointer to another node, we do another memory fetch, we have another key to examine, this one is wrong too, but there are no more pointers, three memory fetches and nothing found.

Const-me · on June 30, 2023

I wonder why won’t they fix the performance by changing the API specification of these collections?

I don’t think they care about backward compatibility. For example, in C++/20 they decided UTF-8 string literals (available for nearly a decade since C++/11) now returns them in a brand new `char8_t` type, which won’t automatically cast to the old `char` type. I once had to waste couple days of work fixing the issues caused by C++ language upgrade.

kccqzy · on June 29, 2023

That's exactly it. Also because of that, it has to be implemented with chaining rather than open addressing.

_a_a_a_ · on June 29, 2023

> Your general point is still right in that an O(n*log(n)) algorithm is not necessarily slower than an O(n) algorithm in practice.

Sounds reasonable

> for a small N of 10000

Ah, now that is not what I would call a small N

tialaramex · on June 29, 2023

10_000 is easily in the ballpark of constant factors k which really happen in software.

Suppose my O(n) algorithm is a random access walk of nums. So I'm causing ~10_000 memory fetches, maybe they each take k=100 cycles, so it takes 1 million cycles total.

But the O(n*log(n)) algorithm happens to be linear, it does access some nearby objects more than once, but every damn access is in cache. That's only 100_000 cycles, it's ten times faster despite you not thinking 10000 is a small N.

Stalls are huge. A modern CPU can do a crazy amount of arithmetic in the time it takes to fetch from even an L3 shared cache, never mind main memory. If you need to go out, to the network or to "disk" it starts to look faster to do a whole bunch of difficult work again instead. Oh the 2417th digit of Pi is on the hard disk? Thanks but I'll just calculate it again instead 'cos I don't got all day.

_a_a_a_ · on June 29, 2023

Thank you for the lesson in memory access times I actually know something about the subject already :-)

> But the O(n*log(n)) algorithm

I think you're badly missing out the "* log(n)" part. But on reflection, you have made your point, thank you.

tialaramex · on June 29, 2023

> I think you're badly missing out the "* log(n)" part.

Nope? But I can see why you might have missed it, instinctively log(10_000) feels like it should be pretty big. It feels big, doesn't it. Not as big as n, but surely a sizeable quantity.

However the log to base 2 (which we might decide we prefer to care about here since this is computer science) is about 13. So yeah, the numbers work out fine.

I guess it's possible my use of the word "linear" confused you. I'm talking linear in comparison to random access, not suggesting that somehow a log-linear complexity is the same as a linear complexity. But then, you know "something" about this so maybe not.

EtA: Aha, I realised my mistake, I just used rough numbers, maybe you'd have realised if I'd written 132 870 cycles or something even though that's obviously not correct because it's excessive accuracy.

aidenn0 · on June 29, 2023

If you pick base 2 for the then log2(10000) is about 13. So if you are a constant factor of >13 faster for the work that dominates, the n*log(n) will be faster.

Slightly OT:

I remember in my algorithms class we used to joke that log(log(N)) was the same as a constant factor of 5 (in the 64-bit era, perhaps we should say 6).

tromp · on June 29, 2023

Sorting provides a alternative solution: sort the array, and then traverse a linear path over pairs of indices from (0,n-1) to (n-1,0) whose sum straddles the target:

    int main() {
      /* read in and sort array a */

      int i=0,j=N-1, sum;

      while ((sum = a[i] + a[j]) != tgt) {
        if (sum < tgt)
          i++;
        else
          j--;
      }

      printf("%d %d\n", i, j);
    }

nkurz · on June 29, 2023

I think it's a little trickier than that, since the problem asks for the indices not the numbers, which would be lost in the sorted version. You'd have to keep an additional mapping from the sorted to original positions. And are we counting the sorting and space for the maps as free? Or do we have to account for them somehow?

Mostly, I just don't like the problem specification. What does it actually mean to "solve it efficiently"? How many times are we going to be searching each input array? How much memory do we have available? What tradeoff between memory and computation is allowable? How much does a comparison cost us versus accessing memory? Are all targets equally likely?

Assuming as theoretical computer scientists tend to do that we can ignore all constant time operations, that access to memory is free, that an unknown but fixed amount memory is available, and that we'll be searching the same array forever ruins the problem for me. Fill in the numbers and we can talk about what's optimal, but the "spherical cow" approach does nothing for me.

tialaramex · on June 29, 2023

For the specifics, as someone else observed, walking nums to find the values we've discovered and identify their indices is very affordable, it's N operations, and if we can't afford N we're in trouble since we definitely need to do at least N comparisons.

To your general point, sure, one thing that's nice (but takes a colossal effort) about Advent of Code is that it has specific inputs for players, and they have specific answers, so although it's usual for participants to write code which could (you hope) solve the problem in the abstract, the actual test is whether you solved the input you actually had. Sometimes not all conceivable inputs are even tractable (and some people find this situation frustrating if it happens) only the inputs given to real players may be suitable.

For this exercise the provided Go solution gets to handwave a bunch of problems. For example, Go's "map" hash table allocates whenever it wants to and has amortized table growth on top. That can really add up for a problem like this.

If I attacked it in Rust I would make HashMap::with_capacity(nums.len()) which does a single allocation up front, I might not need all this space if I exit early, but it won't hurt. I don't know if there's even a way to write that in Go.

tromp · on June 29, 2023

> I think it's a little trickier than that, since the problem asks for the indices not the numbers, which would be lost in the sorted version.

Good point.

> You'd have to keep an additional mapping from the sorted to original positions.

It's simpler to just find the two numbers in the original unsorted array.

marginalia_nu · on June 29, 2023

For an input array of values I[], create an array A = [ 0 ... n ] so that n = len(I). Sort the indices in A based on their corresponding value in I. Now A is the mapping, A[n] is the index in I of the n+1:th largest value, and you can binary search in the same way, by comparing I[A[i]].

This is strictly more space efficient than a hash table. You're not going to win any awards for being cache efficient, but then again, when the competition is a damn hash table it's not too bad ;P

ape4 · on June 29, 2023

It seems like it could be a "waste" to make a hash or sorted list only to throw throw it away after one use. Depending on what's required.

raverbashing · on June 29, 2023

> You'd have to keep an additional mapping from the sorted to original positions.

2*O(n) is still O(n)

taeric · on June 29, 2023

Though, at the low level, it is perfectly valid to compare the full costs, not just the growth rates. Right?

colejohnson66 · on June 29, 2023

The hashmap solution here is ideally O(n) as you only iterate once (assuming O(1) lookups and inserts in the map). A sort/search method, even using quicksort would be O(n*logn).

tromp · on June 29, 2023

You can use radix sort which is linear for fixed width ints.

tromp · on June 29, 2023

Although hashmaps are O(n) in practice, they can be worse than O(n log n) on some rare inputs that cause lots of hash collisions. Several sorting algorithms such as mergesort and heapsort are guaranteed O(n log n).

enedil · on June 29, 2023

Even it it was O(1) not amortized, it's not possible to beat the sorting in actual execution time for reasonable inputs (less than 10B elements probably)

jjgreen · on June 29, 2023

O(n log n + n) = O(n log n)

topposter32 · on June 29, 2023

What's log base 2 of a million or a billion?

floor_ · on June 29, 2023

Just a warning that this is secretly a GO article for those who are GO-intolerant.

chongli · on June 29, 2023

Yes. There’s not a single line of C code in it. It’s practically a bait-and-switch.

Edit: you’re right. There is C code intermixed with Go. It’s tricky to tell apart on a skim. Need to read carefully

o1y32 · on June 29, 2023

... I'm sure there is some C code there. In fact most code is in C.

codetrotter · on June 29, 2023

Well, at least it’s not Rust am I right fellas? Hahahah

(I like Rust.)

mananaysiempre · on June 29, 2023

Linear probing for a hash table that you’re not going to delete things from. Textbooks usually tell you to prefer other probing patterns unless you need deletion (and e.g. Go interface casts use quadratic probing[1]). Or is it still advantageous for some reason? (Locality?)

[1] https://github.com/golang/go/blob/03cd8a7b0e47f2141a1dbdad9f... via https://lukasatkinson.de/2018/interface-dispatch/

Xeoncross · on June 29, 2023

I'm sure my next interview to require I know this "optimum solution".

Jokes aside, I'm very glad to see another article using more than a single language to showcase something. Chances are much higher I can read one language well, or parts of both.

taeric · on June 29, 2023

I used to use this as an interview question. I passed people more for being able to talk about different tradeoffs than any optimum solution, though. Gave a yes vote to more than a few that messed up and had bugs. (Granted, most of the bugs were in extending this to the "3 sum" idea.)

badtension · on June 29, 2023

I was really hoping for something more creative than the traditional hashmap approach.

Still pretty cool, I wish I was that confident with rolling my own data structures.

abhishekjha · on June 29, 2023

All you need is a mumrmur hash function to create a hash table. That is something that people should copy.

tialaramex · on June 29, 2023

This (a hash table) is a basic vocabulary type, and it's only barely understandable that C doesn't provide one given when it was invented. The hash table which people are likely to knock together this way in C isn't very good, which makes sense - but if the C library provided one then the effort to do it well only needs to happen once.

This isn't like say, Bloom filters where most people don't need one. It's like sort, which I notice C does provide in its standard library.

jandrese · on June 29, 2023

It's not technically C stdlib, but POSIX.1 is the next best thing:

https://linux.die.net/man/3/hcreate

That said, the implementation leaves a lot to be desired.

camel-cdr · on June 29, 2023

> but if the C library provided one then the effort to do it well only needs to happen once.

It would've happened only once, but in 1987, and then we would've stuck with it, because every body relies on the exact behavior.

Just look at `rand()`, it's basically unusable.

teo_zero · on June 30, 2023

> Just look at `rand()`, it's basically unusable.

It does have some limitations you must be aware of, but "unusable"?