Problems with Hash Tables (enfranchisedmind.com)
32 points by RiderOfGiraffes on Jan 8, 2010 | 30 comments

This doesn't make much sense to me. The author correctly identifies that hash resizes that are done by changing the size by a multiple remain O(1) amortized over their construction and use. But then asserts that many hash tables (who?!) don't do this, and that makes them O(N). I've written a ton of hash tables in my career. None have that characteristic. Nor do any I've studied in popular software.

Then he finishes up with a discussion about "bad hash functions" without actually pointing to any examples of tables that use them.

Worst of all, the point of the whole (rather long!) article is performance, and yet there's not a benchmark to be seen. Nothing is measured; it's just one assertion after another.

The whole thing is a giant straw man. Not worth reading.

The part I find particularly silly is where he writes:

  For example, assume that the hash table implementation
  shrinks the table if the number of full buckets drops
  below half, and grows the table if it becomes over full.
  If the table starts out with 100 buckets, and the number 
  of buckets falls to 49, the size is shrunk to 50.
I've never seen a hashtable that actually works this way. It's common knowledge that performance of hashtables degrades long before the "full" point. (I particularly like the explanation of that here: http://eigenclass.org/R2/writings/separate-chaining-vs-doubl...)

However, even if you are using a load factor like .72 or so to decide to grow, such a decision is still sensitive to flip-flopping if you use a very similar boundary for shrinking.

And this is trivially avoided by any simple hysteresis scheme: use one load-factor threshold for growing and a substantially lower one for shrinking.
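A minimal sketch of such hysteresis (the thresholds and names here are illustrative, not taken from any particular implementation):

```python
# Sketch: a resize policy with hysteresis. Growing at load factor
# 0.75 and shrinking only below 0.25 leaves a wide dead band, so a
# single insert/delete pair at a boundary can't trigger repeated
# resizes. The thresholds are illustrative choices.

GROW_AT = 0.75    # grow when count/buckets exceeds this
SHRINK_AT = 0.25  # shrink only when it falls below this
MIN_BUCKETS = 8

def next_size(count, buckets):
    """Return the new bucket count after an insert/delete."""
    load = count / buckets
    if load > GROW_AT:
        return buckets * 2
    if load < SHRINK_AT and buckets > MIN_BUCKETS:
        return buckets // 2
    return buckets
```

Note that after growing at load 0.75, the load drops to about 0.375, comfortably above the 0.25 shrink threshold, so no sequence of single operations can flip-flop.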

With respect, I disagree.

It may be that you've never made these mistakes, that you understand all the implications, and that you've never held these misconceptions. But of the people with whom I've discussed hash tables, I'd guess that around half have not understood these issues: they've thought that because it's a hash table it's pretty much automatically O(1), and haven't realized that the bound is amortized.

Not every article about performance needs to include benchmarks to make relevant points. Not every article that talks about implementation details needs to give specific examples.

EDIT: FWIW, he does actually give an example of a bad hash function.

In short, I suspect you simply have too much knowledge to find this article useful. That doesn't (necessarily) mean others won't.

So given that you clearly do have a lot of experience in these matters, and given that you say you've written lots of hash-tables in your career, may I ask why? Why not use existing library implementations?

Even if I weren't the best hash author ever (TM), I still wouldn't find much use in this article. You can't argue without evidence, and this provides none. If you want to argue that the amortization creates latency bugs, you need to measure resize latency on some hash or another. If you want to complain about bad hash functions being used, you need to find one and show me a collision rate analysis.

You're basically saying that computational complexity theory cannot make any valid claims about the actual complexity of algorithms. That's a pretty far-reaching assumption you're making there.

In my view, the author did two things, and maybe this debate results from his rather rambling style that made it a little difficult to separate the two.

One thing he did was to explain theoretical properties of possible (bad) hash table implementations and hash functions. And the other thing he did was to claim that these bad implementations are actually out there.

I did not find a flaw in his theoretical analysis. Maybe there is one that you can find, but if you do you should be able to prove him wrong without any empirical evidence. His second claim that these issues frequently occur in the wild is where we do indeed need empirical evidence.

I don't believe that bad hash table implementations are widely used. They are very central components of programming languages and libraries and I suspect that any glaring mistakes would have been found.

But one thing I can say for sure is that bad hash functions exist in the wild. I know because I have written some of them. It's not trivial to write a hash function that distributes hash values evenly across all buckets. I really do think that hardly anyone bothers to actually analyse the distribution of their ad-hoc hash functions.

In OO languages people write lots of little hash functions all the time. They're not well designed or well tested pieces of code.
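Checking the distribution takes only a few lines, which makes it all the more telling how rarely it's done. A sketch, using sum-of-character-codes as a stand-in for a typical ad-hoc hash:

```python
# Sketch: measuring how evenly a hash function fills buckets.
# `bad_hash` (sum of character codes) is a classic ad-hoc function:
# it collides for all anagrams and clusters similar strings.

from collections import Counter

def bad_hash(s):
    return sum(map(ord, s))

def bucket_counts(keys, hash_fn, buckets):
    return Counter(hash_fn(k) % buckets for k in keys)

keys = ["ab%03d" % i for i in range(1000)]
counts = bucket_counts(keys, bad_hash, 64)
# Ideal would be ~1000/64 ~= 16 keys per bucket. Sum-of-ords piles
# 1000 keys into only 28 buckets (the digit sums span just 28
# values), with a worst bucket of 75 entries.
print(len(counts), max(counts.values()))  # → 28 75
```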


  > You can't argue without evidence ...
I guess as a mathematician I find evidence-based arguments unconvincing, and exposition of principles more interesting. I'm now aware of the issues (although in truth I already was before reading this article), and so if I were implementing my own hash tables I would know what to be wary of, and be ready to do some research, analysis and measurements.

But each to their own. I think we'll simply have to agree to disagree.

As a mathematician, I find evidence-based arguments very convincing... where do you think that QED comes from, the planet of magic unicorns?

Purely mathematical proofs are nothing but supplying a pile of evidence, in the form of theorems and facts that both the proof-reader and proof-writer agree on, to support the argument that the proof is attempting to make.

What you have done in your article is this:

"X implies Y, QED."

And when people pointed this out, you balked, rather than writing a follow up that looked like this:

"X may imply Y under many circumstances. Take X. A, B, C, and Q are very common implementations of X, and all of them do Y, based on the following analysis..."

My article?

But leaving that aside, I simply don't understand your points at all. In what way is a mathematical proof "a pile of evidence"? Consider, as a simple, single example the Banach-Tarski theorem. To help me understand you, perhaps you could explain how the proof of that is "a pile of evidence."

As I said elsewhere, I'm baffled by the responses here, so I'm not going to reply any more until, and if, I understand the points of view being expressed.

EDIT: speeling cerrocted


Arguing from principle is great, but not only does he not give numerical evidence, he also doesn't give evidence in terms of which commonly used libraries or applications use which hashing method.

I don't see much exposition of principles going on when the author asserts that hash functions are much slower than comparison functions, for example. If you're going to make an empirical argument, you need empirical evidence.

I'm baffled by some of the comments here, I really am. I'm going to leave this for a bit and come back later to see if I can make sense of it. Clearly people disagree with me, but it seems to me that there are several principles and concerns that are validly raised.

This is an opinion piece - he doesn't like hash tables. For him personally, he prefers not to use them and instead to use other structures such as balanced trees. He says that there are gotchas with hash-tables that aren't always realised by those who don't read the small print and take care. That seems unassailable to me. There are gotchas.

He gives an example of a bad hash-function, and points out that you need good ones. Designing a good hash function is part-art, part-science, and can be very hard. Certainly it deserves a full scholarly article and justice couldn't be done in a blog like this.

You've highlighted:

   > the author asserts that hash functions are
   > much slower than comparison functions, for
   > example.
He said:

  > hash functions have to be blindingly fast - because
  > comparisons are already really fast.
That seems pretty clear to me. The reasons are then expanded, but the point remains. Crypto-hash functions are not fast, and yet I've seen them suggested as good hash functions. In many cases, though, they're not fast enough.

But this isn't really the place for a full point-by-point discussion of the issues. I think the piece makes useful and interesting points. It's not a journal article, it's a blog piece that allows the author to vent his prejudices with some supporting arguments.

If you want it done properly, read Knuth. I think many of the criticisms here are misplaced.

In your modded-down post, I think you were intending to mean numeric evidence like benchmarks, but you actually wrote that you weren't convinced by evidence, period.

The author makes some logical arguments, which could be considered evidence. But as the GP points out, a lot of his assertions at least seemed to lack both numeric and logical evidence - so your assertion comes off a bit muddle-headed.

I found some of the article's points interesting too, but I don't think this is the problem people are having with the earlier post.

I think the real objection is that it's a meandering rant, not a well-structured essay explaining the pitfalls of using hash tables.

  > The whole thing is a giant straw man. Not worth reading.

His would-be juicy bit about amortized analysis and hash table growth -- this is all freshman undergrad algorithms stuff. Though it is sad to say that lots of IT-shop pros out there don't know this stuff. (I have been the "guru" or "genius" far too many times, because I know that a naive array add implementation is O(n^2).)
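That freshman exercise, for reference: count how many element copies each growth policy performs when appending n items.

```python
# Sketch: total elements copied when appending n items under two
# growth policies. Fixed-increment growth copies O(n^2) elements in
# total; doubling copies O(n), which is what makes each append
# amortized O(1).

def total_copies(n, grow):
    cap, size, copied = 1, 0, 0
    for _ in range(n):
        if size == cap:
            copied += size      # reallocate: copy all existing elements
            cap = grow(cap)
        size += 1
    return copied

fixed = total_copies(1000, lambda c: c + 10)   # grow by a constant 10
doubled = total_copies(1000, lambda c: c * 2)  # grow geometrically
print(fixed, doubled)  # → 49600 1023
```

Roughly a 48x difference at a mere 1000 elements, and the gap widens linearly with n.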

I think it was a reasonable discussion, it just had some faulty conclusions. Using binary trees for overflow buckets is silly. They may bring a bucket's worst case down from O(N) to O(log N), but the constant factor can be much larger if you keep them balanced. A much better strategy for dealing with pathological cases is to keep some simple statistics, such as the maximum bucket size or the average size of occupied buckets. If you see a pathological case, change the hash function and rehash. No need for seeds, or trees, or all that.
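A sketch of that detect-and-rehash strategy (the class, the chain-length limit, and the tuple-salting trick are my own illustration, not from the article or any particular library):

```python
# Sketch: a chained hash table that tracks the longest chain and,
# when it exceeds a limit, rehashes everything with a new salt,
# effectively switching to a different hash function.

class SaltedTable:
    LIMIT = 8  # longest tolerated chain (illustrative choice)

    def __init__(self, nbuckets=16, salt=0):
        self.salt = salt
        self.buckets = [[] for _ in range(nbuckets)]

    def _index(self, key):
        # Salting the tuple changes the whole hash when salt changes.
        return hash((self.salt, key)) % len(self.buckets)

    def insert(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # update existing key
                return
        chain.append((key, value))
        if len(chain) > self.LIMIT:      # pathological chain detected
            self._rehash()

    def _rehash(self):
        items = [kv for chain in self.buckets for kv in chain]
        self.salt += 1                   # pick a "new hash function"
        self.buckets = [[] for _ in self.buckets]
        for k, v in items:
            self.buckets[self._index(k)].append((k, v))

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)
```

A real implementation would also grow the bucket array when the load factor rises; this sketch isolates just the statistics-and-rehash idea.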

The discussion of the cost of reallocation seems somewhat overblown. Everyone here knows that if you are going to have a million or billion entries, you don't start with the default allocation or grow by a fixed increment. Anyway, a good implementation should keep track of the allocation pattern and make adjustments along the way.

He does make a good point that you need to pay attention to your hash function. A good function should try to smear the bits out so that the beginning and end of (say) a string make similar contributions to the hash.
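To illustrate (FNV-1a is a real, standard string hash; the sum-of-character-codes function is the classic bad example):

```python
# Sketch: a hash that "smears" bits (FNV-1a) versus one that doesn't
# (summing character codes). The sum is order-blind, so strings that
# swap their beginning and end -- anagrams -- always collide.

def sum_hash(s):
    return sum(map(ord, s)) & 0xFFFFFFFF

def fnv1a(s):
    h = 0x811C9DC5                         # FNV-1a 32-bit offset basis
    for ch in s:
        h ^= ord(ch)                       # fold the character in
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime smears bits upward
    return h

assert sum_hash("ab") == sum_hash("ba")    # anagrams always collide
assert fnv1a("ab") != fnv1a("ba")          # position matters
```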

"Welcome to amortized analysis!"

Seriously, if the point is to educate people about the difference between O(1) and O(1) amortized, then fine, let people understand the difference. The problem is that he thinks the solution is to scrap hash tables altogether.

It's almost like, hey guys, Quicksort isn't really O(n log n), it only is on average! (Of course, the STL uses introsort.. ;))

Good criticism...

Yet, if you want to avoid your application occasionally freezing when the user hits a key, then isn't distinguishing worst-case scenarios good? You seem to agree with the article's claim that hash tables sometimes give you an O(n) hit. Even if this happens only occasionally, aren't there situations where that could be worse than a predictable O(log N) hit?

On the other hand, what keeps the hash algorithm from expanding the buckets little by little?

Sure, latencies can be a problem. Though even extraordinarily large tables aren't going to have user-visible latencies, really. I'd worry more about real time applications where occasionally high latencies would cause things like buffer underruns. Think packet routers, video transcoders, things like that.

But even then, the naive solution would just be to pre-size the hash table and be careful about insertion/deletion architecture. Not to chuck it in favor of an AVL tree or whatnot.

  > the naive solution would just be to pre-size the hash table
  > and be careful about insertion/deletion architecture. Not to
  > chuck it in favor of an AVL tree or whatnot.

Sometimes, a pre-sized hash table is the "best" solution, depending on context, and for some perhaps debatable value of "best." If I was doing hard real time (which is not what I generally do) and I didn't have any good idea of how much space the system would need, then I'd use a largish pre-sized table as a starting point then think about cooking up some sort of incremental growth/rehash code. Start out just using one hash table. When it gets too close to full, you allocate the bigger table and you start moving things to the larger one with each table access. To check for membership in this state, you check both tables. To add a new entry, you add to the larger table. Once you are done, you can throw away the old table. I don't do hard real time, but this is just what comes off the top of my head. (Note that table allocation might have to be done incrementally, if nulling out lots of entries would take too long.)

That's not the point. I don't think anyone is arguing against what he says about worst case for a well-implemented hash table. The problem is the straw-man: all of the 'bad implementations' he has seen, which is all he really attacks.

Potential benefit of hash tables: cache locality.

Assuming you already have the key in the CPU cache, computing the hash is relatively cheap and gets you directly to the desired value.

A binary tree structure always makes you traverse several levels until you get to the desired value. Of course, this can be minimized with high-fanout trees (see Clojure, for example).

Cache locality: especially with open addressing and linear probing.
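A sketch of why that combination is cache-friendly: with linear probing, the probe sequence for a single lookup is a contiguous run of slots.

```python
# Sketch: lookup and insert with open addressing and linear probing.
# All the probes for one lookup touch adjacent slots, so they tend to
# land in the same cache line -- unlike chasing list or tree pointers.

def probe_insert(slots, key, value):
    n = len(slots)
    i = hash(key) % n
    for _ in range(n):
        if slots[i] is None or slots[i][0] == key:
            slots[i] = (key, value)
            return
        i = (i + 1) % n              # linear probe: just the next slot
    raise RuntimeError("table full")

def probe_lookup(slots, key):
    """slots is a list of (key, value) pairs or None (empty)."""
    n = len(slots)
    i = hash(key) % n
    for _ in range(n):
        entry = slots[i]
        if entry is None:
            return None              # hit an empty slot: key is absent
        if entry[0] == key:
            return entry[1]
        i = (i + 1) % n
    return None

table = [None] * 16
for i in range(10):
    probe_insert(table, "key%d" % i, i)
print(probe_lookup(table, "key3"))   # → 3
```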

I've done a search and found that this was submitted a year ago ...


However, it got no comments, and replies are now closed, so I'll leave this here. Useful information, many lessons to learn. If you're not into algorithms or low level details, you can probably ignore it.

If you want a compromise that takes little space like hash tables, has average time complexity of O(log(N)) like balanced trees, and is as simple to implement as hash tables, take a look at skip lists.
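For reference, a minimal skip list looks something like this (search and membership only; a production version would also store values and support deletion):

```python
# Sketch: a minimal skip list. Each node gets a random height; search
# starts at the top level and drops down, giving expected O(log N)
# search and insert with nothing to rebalance.

import random

MAX_LEVEL = 16

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)  # sentinel, taller than any node

    def _random_level(self):
        lvl = 1
        while lvl < MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * MAX_LEVEL
        node = self.head
        for lvl in range(MAX_LEVEL - 1, -1, -1):
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]
            update[lvl] = node               # last node before key, per level
        new = Node(key, self._random_level())
        for lvl in range(len(new.next)):     # splice in at each of its levels
            new.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = new

    def __contains__(self, key):
        node = self.head
        for lvl in range(MAX_LEVEL - 1, -1, -1):
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]
        candidate = node.next[0]
        return candidate is not None and candidate.key == key
```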

The efficiency of hash tables (insertions in particular) is actually slightly worse than he indicates. They are not amortized O(1) for all data, but only for typical data (or on average over all possible data sets). To see this, suppose that every item inserted happens to go into the same bucket. The amortized time complexity is then linear.

On the other hand, insertion is not plain O(1) even for average data; there is always the periodic table resizing. So: amortized O(1) for average data, not per-operation O(1) for average data.

The upshot of all this is that hash tables are still pretty good, but insertion is only constant time in a "double average" sense: on average over a large number of consecutive operations (amortization) and over all possible data sets.
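The degenerate case is easy to exhibit with a toy chained table and a deliberately constant hash function:

```python
# Sketch: forcing the worst case. When every key hashes to the same
# bucket, lookup degenerates into a linear scan of one chain, so
# each operation is O(N) regardless of table size or resizing.

def insert_chained(buckets, key, value, hash_fn):
    buckets[hash_fn(key) % len(buckets)].append((key, value))

def probes_to_find(buckets, key, hash_fn):
    chain = buckets[hash_fn(key) % len(buckets)]
    for i, (k, _) in enumerate(chain):
        if k == key:
            return i + 1  # number of entries examined
    return None

N = 256
buckets = [[] for _ in range(64)]
constant_hash = lambda k: 42   # pathological: every key collides
for i in range(N):
    insert_chained(buckets, i, i, constant_hash)
# The last-inserted key sits at the end of the one long chain, so
# finding it takes N probes.
assert probes_to_find(buckets, N - 1, constant_hash) == N
```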

A note to those who say, "Why not just use a different hash function?": This can greatly decrease the probability of poor hash-table performance. But it does not change the worst case, unless we allow for an arbitrarily large number of hash functions, and one of them is guaranteed to be a good one. I've never seen this done (actually, I'm not sure anyone knows how to do it within reasonable efficiency constraints).

And a note to those who say, "But in the real world it doesn't matter": No, sometimes it matters, and sometimes it doesn't. (Don't use Python to program a pacemaker ....)

I would love to read this, except I am using Windows Mobile and am shown some "use a different browser!" page instead. This is annoying.

He's blocking Opera?

Hashtable routines are far easier to write and debug than balanced binary tree routines.

Of course there are open source implementations available for both. But I would guess it is easier to write your own hash function than learn to use somebody else's, and easier to learn somebody else's B-tree routines than to write your own.

So we live in a world with O(N) hash implementations and O(1) B-tree implementations, for N programmers.
