Then he finishes up with a discussion about "bad hash functions" without actually pointing to any examples of tables that use them.
Worse, the point of the whole (rather long!) article is about performance, and yet there's not a benchmark to be seen. Nothing is measured; it's just one assertion after another.
The whole thing is a giant straw man. Not worth reading.
For example, assume that the hash table implementation
shrinks the table if the number of full buckets drops
below half, and grows the table if it becomes over full.
If the table starts out with 100 buckets and the number of full buckets falls to 49, the table is shrunk to 50 buckets.
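Under that policy, a delete/insert pair sitting right on the boundary forces a full rehash on every operation. Here's a toy trace of the idea (my own numbers, reading "over full" as every bucket occupied):

```python
# Toy resize policy, purely for illustration: shrink to half when fewer than
# half the buckets are full, grow to double when every bucket is full.
def next_size(buckets: int, full: int) -> int:
    if full >= buckets:            # "over full" -> grow
        return buckets * 2
    if full < buckets // 2:        # below half full -> shrink
        return max(1, buckets // 2)
    return buckets                 # otherwise leave the table alone

buckets, full = 100, 50
for op in ["del", "ins", "del", "ins"]:      # hover right at the boundary
    full += -1 if op == "del" else 1
    resized = next_size(buckets, full)
    print(op, "full =", full, " buckets:", buckets, "->", resized)
    buckets = resized
```

Every one of those resizes rehashes every element, so a workload that hovers around the boundary pays O(n) per operation.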
It may be that you've never made these mistakes, that you understand all the implications, and that you've never had these misconceptions. But of the people with whom I've discussed hash tables, I'd guess around half have not understood these issues: they thought that because it's a hash table it's pretty much automatically O(1), and didn't realize that the bound is amortized.
Not every article about performance needs to include benchmarks to make relevant points. Not every article that talks about implementation details needs to give specific examples.
EDIT: FWIW, he does actually give an example of a bad hash function.
In short, I suspect you simply have too much knowledge to find this article useful. That doesn't (necessarily) mean others won't.
So given that you clearly do have a lot of experience in these matters, and given that you say you've written lots of hash-tables in your career, may I ask why? Why not use existing library implementations?
In my view, the author did two things, and maybe this debate results from his rather rambling style that made it a little difficult to separate the two.
One thing he did was to explain theoretical properties of possible (bad) hash table implementations and hash functions. And the other thing he did was to claim that these bad implementations are actually out there.
I did not find a flaw in his theoretical analysis. Maybe there is one that you can find, but if you do you should be able to prove him wrong without any empirical evidence. His second claim that these issues frequently occur in the wild is where we do indeed need empirical evidence.
I don't believe that bad hash table implementations are widely used. Hash table implementations are central components of programming languages and libraries, and I suspect that any glaring mistakes would have been found.
But one thing I can say for sure is that bad hash functions exist in the wild. I know because I have written some of them. It's not trivial to write a hash function that distributes hash values evenly across all buckets. I really do think that hardly anyone bothers to actually analyse the distribution of their ad-hoc hash functions.
In OO languages people write lots of little hash functions all the time. They're not well designed or well tested pieces of code.
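As a made-up but typical example (not one from the article), the classic ad-hoc attempt is to sum the character codes. It looks plausible and even works in small tests, but it dumps anagrams and short keys into the same buckets:

```python
from collections import Counter

def adhoc_hash(s: str) -> int:
    return sum(ord(c) for c in s)   # the classic ad-hoc attempt

keys = ["listen", "silent", "enlist", "tinsel", "inlets", "ab", "ba", "cat", "act"]
counts = Counter(adhoc_hash(k) % 8 for k in keys)
print(counts)   # all the anagrams land in exactly the same bucket
```

Nobody writing one of these as an afterthought ever prints that Counter.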
> You can't argue without evidence ...
But each to their own. I think we'll simply have to agree to disagree.
Purely mathematical proofs are nothing but supplying a pile of evidence, in the form of theorems and facts that both the proof-reader and proof-writer agree on, to support the argument that the proof is attempting to make.
What you have done in your article is this:
"X implies Y, QED."
And when people pointed this out, you balked, rather than writing a follow up that looked like this:
"X may imply Y under many circumstances. Take X. A, B, C, and Q are very common implementations of X, and all of them do Y, based on the following analysis..."
But leaving that aside, I simply don't understand your points at all. In what way is a mathematical proof "a pile of evidence"? Consider, as a simple, single example the Banach-Tarski theorem. To help me understand you, perhaps you could explain how the proof of that is "a pile of evidence."
As I said elsewhere, I'm baffled by the responses here, so I'm not going to reply any more until, and if, I understand the points of view being expressed.
EDIT: speeling cerrocted
Arguing from principle is great, but not only does he not give numerical evidence, he also doesn't give evidence by describing which commonly used libraries or applications use what hashing method.
This is an opinion piece - he doesn't like hash tables. For him personally, he prefers not to use them and instead to use other structures such as balanced trees. He says that there are gotchas with hash-tables that aren't always realised by those who don't read the small print and take care. That seems unassailable to me. There are gotchas.
He gives an example of a bad hash-function, and points out that you need good ones. Designing a good hash function is part-art, part-science, and can be very hard. Certainly it deserves a full scholarly article and justice couldn't be done in a blog like this.
> the author asserts that hash functions are much slower than comparison functions, for hash functions have to be blindingly fast - because comparisons are already really fast
But this isn't really the place for a full point-by-point discussion of the issues. I think the piece makes useful and interesting points. It's not a journal article, it's a blog piece that allows the author to vent his prejudices with some supporting arguments.
If you want it done properly, read Knuth. I think many of the criticisms here are misplaced.
The author makes some logical arguments, which could be considered evidence. But as the GP points out, a lot of his assertions at least seemed to lack both numeric and logical evidence, so your assertion comes off as a bit muddle-headed.
I found some of the article's points interesting too, but I don't think this is the problem people are having with the earlier post.
His would-be juicy bit about amortized analysis and hash table growth is all freshman undergrad algorithms stuff. Though, sad to say, lots of IT-shop pros out there don't know this stuff. (I have been the "guru" or "genius" far too many times, because I know that a naive array-append implementation is O(n^2).)
The discussion of the cost of reallocation seems somewhat overblown. Everyone here knows that if you are going to have a million or a billion entries, you don't start with the default allocation or grow by a fixed increment. Anyway, a good implementation should keep track of the allocation pattern and make adjustments along the way.
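Rough numbers (a toy count of element copies, not tied to any particular library) show why the growth policy matters far more than the cost of any single reallocation:

```python
def copies(total: int, grow) -> int:
    """Count element copies paid while growing a table to `total` entries."""
    cap, moved = 1, 0
    for n in range(1, total + 1):
        if n > cap:
            moved += n - 1            # a resize re-copies everything stored so far
            cap = grow(cap)
    return moved

N = 1_000_000
print("fixed +1000 increments:", copies(N, lambda c: c + 1000))  # ~N^2/2000 copies
print("doubling:              ", copies(N, lambda c: c * 2))     # ~N copies
print("pre-sized to N:        ", copies(N, lambda c: N))         # one early resize, then none
```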
He does make a good point that you need to pay attention to your hash function. A good function should try to smear the bits out so that the beginning and end of (say) a string make similar contributions to the hash.
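For instance (my example, not the article's), an FNV-1a-style loop is one well-known way to get that smearing: every byte is XORed in and then multiplied through the whole word, so early and late characters both affect every bit of the result.

```python
def fnv1a_32(s: str) -> int:
    h = 0x811C9DC5                                 # 32-bit FNV offset basis
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF    # XOR the byte in, multiply by the FNV prime
    return h

for k in ["listen", "silent", "abc", "abd"]:
    print(k, hex(fnv1a_32(k)))                     # order matters now, so anagrams hash differently
```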
Seriously, if the point is to educate people about the difference between O(1) and amortized O(1), then fine, let people understand the difference. The problem is that he thinks the solution is to scrap hash tables altogether.
It's almost like: hey guys, Quicksort isn't really O(n log n), it only is on average! (Of course, the STL uses introsort... ;))
Yet, if you want to avoid your application occasionally freezing when the user hits a key, then isn't distinguishing worst-case scenarios good? You seem to agree with the article's claim that hash tables sometimes give you an O(n) hit. Even if this happens only occasionally, aren't there situations where that could be worse than a predictable O(log n) hit?
On the other hand, what keeps the hash algorithm from expanding the buckets little by little?
But even then, the naive solution would just be to pre-size the hash table and be careful about insertion/deletion architecture. Not to chuck it in favor of an AVL tree or whatnot.
Sometimes, a pre-sized hash table is the "best" solution, depending on context, and for some perhaps debatable value of "best." If I was doing hard real time (which is not what I generally do) and I didn't have any good idea of how much space the system would need, then I'd use a largish pre-sized table as a starting point then think about cooking up some sort of incremental growth/rehash code. Start out just using one hash table. When it gets too close to full, you allocate the bigger table and you start moving things to the larger one with each table access. To check for membership in this state, you check both tables. To add a new entry, you add to the larger table. Once you are done, you can throw away the old table. I don't do hard real time, but this is just what comes off the top of my head. (Note that table allocation might have to be done incrementally, if nulling out lots of entries would take too long.)
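Something like this toy sketch of the two-table scheme (chained buckets, no overwrites or deletes, and nothing said about making the big allocation itself incremental):

```python
class IncrementalTable:
    """Toy incremental-rehash table: migrate a few entries on every access."""

    def __init__(self, buckets=8, migrate_per_op=2):
        self.new = [[] for _ in range(buckets)]   # current (possibly bigger) table
        self.old = None                           # table being drained, or None
        self.migrated = 0                         # how many old buckets are finished
        self.count = 0
        self.migrate_per_op = migrate_per_op

    def _bucket(self, table, key):
        return table[hash(key) % len(table)]

    def _step(self):
        # Every operation moves a little of the old table into the new one.
        if self.old is None:
            return
        moved = 0
        while self.migrated < len(self.old) and moved <= self.migrate_per_op:
            bucket = self.old[self.migrated]
            for k, v in bucket:
                self._bucket(self.new, k).append((k, v))
                moved += 1
            bucket.clear()
            self.migrated += 1
        if self.migrated == len(self.old):
            self.old = None                       # fully drained; throw the old table away

    def put(self, key, value):
        if self.old is None and self.count > len(self.new):    # "too close to full"
            self.old, self.migrated = self.new, 0
            self.new = [[] for _ in range(2 * len(self.old))]  # allocate the bigger table
        self._step()
        self._bucket(self.new, key).append((key, value))       # new entries go to the larger table
        self.count += 1

    def get(self, key):
        self._step()
        tables = [self.new] if self.old is None else [self.new, self.old]
        for table in tables:                                    # membership checks look in both tables
            for k, v in self._bucket(table, key):
                if k == key:
                    return v
        raise KeyError(key)
```

Each operation does only a bounded chunk of migration, so there's no single O(n) pause; the price is probing two tables while a migration is in flight.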
Assuming you already have the key in the CPU cache, computing the hash is relatively cheap and gets you directly to the desired value.
A binary tree structure always makes you traverse several levels until you get to the desired value. Of course, this can be minimized with high-fanout trees (see Clojure, for example).
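Roughly, the number of levels you traverse is log base fanout of N:

```python
import math

N = 1_000_000
for fanout in (2, 32):      # a binary tree vs. a 32-way tree (Clojure-style)
    print(fanout, "->", math.ceil(math.log(N, fanout)), "levels")
# A hash lookup is one hash computation plus, ideally, a single bucket probe,
# regardless of N.
```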
However, it got no comments, and replies are now closed, so I'll leave this here. Useful information, many lessons to learn. If you're not into algorithms or low level details, you can probably ignore it.
On the other hand, insertion is not O(1) even for average data; there is always the periodic table resizing. So: amortized O(1) for average data, not O(1) for average data.
The upshot of all this is that hash tables are still pretty good, but insertion is only constant time in a "double average" sense: on average over a large number of consecutive operations (amortization) and over all possible data sets.
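One way to see that "double average" for yourself is to time each individual insert into a Python dict. Almost all are tiny, with a handful of spikes where the table resizes (the exact resize points depend on the CPython version):

```python
import time

d, slowest = {}, []
for i in range(1_000_000):
    t0 = time.perf_counter()
    d[i] = i
    slowest.append((time.perf_counter() - t0, i))

slowest.sort(reverse=True)
print(slowest[:5])   # the outliers are mostly the resizes (plus scheduler noise);
                     # averaged over a million inserts they vanish, which is what
                     # amortized O(1) means
```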
A note to those who say, "Why not just use a different hash function?": This can greatly decrease the probability of poor hash-table performance. But it does not change the worst case, unless we allow for an arbitrarily large number of hash functions, and one of them is guaranteed to be a good one. I've never seen this done (actually, I'm not sure anyone knows how to do it within reasonable efficiency constraints).
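For what it's worth, when people do talk about switching hash functions they usually mean a salted family along these lines (a toy sketch, not a recommendation); picking the salt at random changes the probabilities, but for any fixed salt there is still some input set that collides, so the worst case stands:

```python
import random

def salted_hash(key: str, salt: int) -> int:
    # One member of a toy family of hash functions, indexed by the salt.
    h = salt ^ 0x811C9DC5
    for b in key.encode("utf-8"):
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

salt = random.getrandbits(32)        # chosen per table (or per process)
print(salted_hash("example", salt) % 64)
```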
And a note to those who say, "But in the real world it doesn't matter": No, sometimes it matters, and sometimes it doesn't. (Don't use Python to program a pacemaker ....)
Of course there are open source implementations available for both. But I would guess it is easier to write your own hash function than learn to use somebody else's, and easier to learn somebody else's B-tree routines than to write your own.
So we live in a world with O(N) hash implementations and O(1) B-tree implementations, for N programmers.