

Do hash tables work in constant time? - fogus
http://www.daniel-lemire.com/blog/archives/2009/08/18/do-hash-tables-work-in-constant-time/

======
jfoutz
This is fantastic: he starts with the claim that pure theory is wasteful and
that experiments must be run to validate models.

He asserts hashtables are not amortized O(1). OK, interesting, maybe something
with caching? Latency of disk access in _huge_ data stores? No. He concludes
hashtables are not O(1) because integer multiplication is not O(1). No mention
of putting an upper bound on the hash key size, like the size of addressable
memory or the number of electrons in the universe; just that it's not O(1).

Lucky for us, he's backed it up with a clever and insightful lack of
experiments.

 _edit_ spelling and grammar.

~~~
lemire
I'm giving an example (vectorization) where experiments would be needed:

"Am I being pedantic? Does the time required to multiply integers on modern
machine depend on the size of the integers? It certainly does if you are using
vectorization. And vectorization is used in commercial databases!"

Sorry, I did not run the experiments this time around.

~~~
jfoutz
You have to admit it's a little silly to start with "This is one of the
fundamental reason why pure theory is wasteful." and then not back it up with
numbers.

I'm a head-in-the-clouds kind of guy; I'd have been fine if you'd said, here's
all this extra stuff you might be overlooking. But when you frame the
discussion with concrete measurements... I go looking for concrete
measurements.

That said, I broke the rule about not saying something in writing that I
wouldn't say face to face. (Well, I would, but only after I knew you better.)

I apologize for being snarky. That was uncalled for.

~~~
lemire
Implementing vectorized hash tables is no joke, so I don't think it is silly
to have merely raised the possibility rather than implementing it. It would
have taken me days of work, possibly, and even then, people could have
questioned my code. I do have a day job, doing academic research and teaching
Computer Science. Maybe one day I will research vectorized hash tables, but
the result will not be a mere blog post.

~~~
jfoutz
fwiw:

    $ gcc-4.2 foo.c; time ./a.out
    foo.c: In function ‘main’:
    foo.c:16: warning: incompatible implicit declaration of built-in function ‘printf’
    0
    real    0m0.095s
    user    0m0.093s
    sys     0m0.002s

    $ gcc-4.2 -ftree-vectorize -ffast-math -ftree-vectorizer-verbose=3 \
          -funsafe-math-optimizations -O3 foo.c; time ./a.out
    foo.c: In function ‘main’:
    foo.c:16: warning: incompatible implicit declaration of built-in function ‘printf’
    foo.c:6: note: not vectorized: unsupported use in stmt.
    foo.c:11: note: LOOP VECTORIZED.
    foo.c:2: note: vectorized 1 loops in function.
    0
    real    0m0.018s
    user    0m0.017s
    sys     0m0.001s

And foo.c:

    int main() {
        int a[256], b[256], c[256];
        int count = 100000;
        int i;
        for (i = 0; i < 256; i++) {
            a[i] = i;
            b[i] = 255 - i;
        }
        do {
            for (i = 0; i < 256; i++) {
                c[i] = a[i] * b[i];
            }
        } while (count--);
        printf("%i", c[0]); /* avoid c being optimized away */
    }

So, I'd assert that vectorized multiplication is a speed win rather than a
loss. Upping the count a few orders of magnitude indicates that the running
time with vectorized multiplication grows more slowly than with regular
multiplication.

So, I claim the vectorization of multiplication is probably not detrimental to
the Big O of a hashtable.

ymmv.

------
jacquesm
I don't get it, somebody please correct me if I'm wrong.

Hashtables derive their efficiency from the number of keys that you can encode
for a given hash size. If the collision chance is low, the number of 'chained'
items in each hash bucket approaches one.

To take the length of the key and the number of bits in it as a factor in the
complexity of the hash table shifts the problem to a sub-problem that has
nothing to do with the original one. As far as the hash table algorithm is
concerned that is a different algorithm, whose fixed cost per item is taken
into account.

Sure, it is slower to compute a longer hash. But that is not where the
majority of the savings are. The majority of the savings are in not having to
traverse the N items in the keyspace to find a given key.

Constant factors - and this is one of them - are simply left out because in
the greater scheme of things they are but a very small item. Any constant
factor, be it 5, 10 or 50 as viewed from the far side of the code calling the
hash store is simply that, a constant.

You might be able to optimize on it by using a clever algorithm, you might be
able to work it down all the way to '1'.

But that will not change the complexity at all.
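
For illustration, here's a minimal chained hash table sketch in C (mine, not
from the article). The hash computation is the fixed cost per key; the savings
are in walking one short chain instead of all N keys:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 1024

    struct node {
        char *key;
        int value;
        struct node *next;
    };

    static struct node *buckets[NBUCKETS];

    /* Fixed cost per key: proportional to the key length,
       independent of how many items are in the table. */
    static unsigned hash(const char *key) {
        unsigned h = 5381;
        while (*key)
            h = h * 33 + (unsigned char)*key++;
        return h % NBUCKETS;
    }

    static void put(const char *key, int value) {
        unsigned b = hash(key);
        struct node *n = malloc(sizeof *n);
        n->key = strdup(key);
        n->value = value;
        n->next = buckets[b];
        buckets[b] = n;
    }

    /* The savings: walk one short chain, not all N keys. */
    static int get(const char *key, int *value) {
        struct node *n;
        for (n = buckets[hash(key)]; n; n = n->next)
            if (strcmp(n->key, key) == 0) {
                *value = n->value;
                return 1;
            }
        return 0;
    }

    int main(void) {
        int v;
        put("alpha", 1);
        put("beta", 2);
        if (get("beta", &v))
            printf("beta = %d\n", v);
        return 0;
    }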

------
ajross
This seems like a clumsy way to make a simpler and more generic point: _any_
key/value data structure (at least one that relies on inspecting the key in
its entirety) will scale as O(N log N) at very large sizes. The reason is that
a key needed to identify a unique entry in a table of size M is at least
log2(M) bits long, so the primitive operations on the keys are bounded below
by log2(M) and are not constant.

This is the same reason that "radix sort"
(<http://en.wikipedia.org/wiki/Radix_sort>) isn't a linear operation even
though it doesn't seem to have a log(N) dependence on the input set. The
number of radix buckets scales as log(N).
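
A small LSD radix sort sketch in C (my own, base 256 on 32-bit keys) makes the
dependence visible: the number of passes is fixed by the key width, and the
key width has to grow as log(N) if all N keys are to be distinct:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* One pass per byte of the key: 4 passes for 32-bit keys.
       Wider keys (needed to keep N keys distinct) mean more passes. */
    static void radix_sort(uint32_t *a, size_t n) {
        uint32_t *tmp = malloc(n * sizeof *tmp);
        int shift;
        for (shift = 0; shift < 32; shift += 8) {
            size_t count[257] = {0};
            size_t i;
            int b;
            for (i = 0; i < n; i++)
                count[((a[i] >> shift) & 0xFF) + 1]++;
            for (b = 0; b < 256; b++)          /* prefix sums: bucket offsets */
                count[b + 1] += count[b];
            for (i = 0; i < n; i++)            /* stable scatter by current byte */
                tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
            for (i = 0; i < n; i++)
                a[i] = tmp[i];
        }
        free(tmp);
    }

    int main(void) {
        uint32_t a[] = {170, 45, 75, 90, 802, 24, 2, 66};
        int i;
        radix_sort(a, 8);
        for (i = 0; i < 8; i++)
            printf("%u ", a[i]);
        printf("\n");
        return 0;
    }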

That said, of course, this doesn't mean much in practice, where primitive
operations on keys live in L1 cache and the pointer indirections require DRAM
cycles.

~~~
lemire
My post was specifically about vectorization. In which case, I do claim that
it matters.

------
ggchappell
Okay, he is saying that the time required for an arithmetic operation is
proportional to the number of bits operated on.

This is true, and it is applicable to other concepts besides hash tables. Can
you sort n numbers in O(n log n)? Well, a comparison is not really constant-
time in the wall-clock sense. If the n numbers are distinct, then each is
going to need at least log n bits. So a comparison is O(log n) by the clock,
and our sort becomes O(n log n log n).

Part of the solution is to be more precise. For example, when we say that
Merge Sort is O(n log n), we usually do not mean that it takes that long to
run; we mean that it can require _that many comparisons_. Counting primitive
operations can be useful, even if those operations are not constant-time from
a wall-clock point of view. And as long as we are clear on what the primitive
operations are, everything is fine.
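
To make "counting primitive operations" concrete, here's a small merge sort
sketch (mine) that counts comparisons instead of timing them; that count is
what the O(n log n) claim is actually about:

    #include <stdio.h>
    #include <string.h>

    static long comparisons = 0;  /* the quantity O(n log n) bounds */

    static void merge_sort(int *a, int *tmp, int lo, int hi) {
        int mid, i, j, k;
        if (hi - lo < 2)
            return;
        mid = (lo + hi) / 2;
        merge_sort(a, tmp, lo, mid);
        merge_sort(a, tmp, mid, hi);
        i = lo; j = mid; k = lo;
        while (i < mid && j < hi) {
            comparisons++;  /* count the primitive operation, not wall-clock time */
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        }
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi)  tmp[k++] = a[j++];
        memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
    }

    int main(void) {
        int a[] = {5, 3, 8, 1, 9, 2, 7, 4}, tmp[8], i;
        merge_sort(a, tmp, 0, 8);
        for (i = 0; i < 8; i++)
            printf("%d ", a[i]);
        printf("\n%ld comparisons\n", comparisons);
        return 0;
    }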

By the way, there is another problem with the statement that hash-table
insertion is amortized constant-time: what if all the keys end up being in the
same bucket? The true statement is that HT insertion is amortized constant-
time _for average data_. (So it's constant-time on average on average. :-)

~~~
codeodor
You should not be using n for the number of bits and n for the size of the
input.

It may take longer to multiply 128-bit data structures than 64-bit ones, but
that cost is constant with respect to the size of the hash table.

~~~
lemire
My post specifically refers to vectorization where you may use the fact that
you can multiply 4 pairs of 16-bit integers in the time it takes to multiply a
pair of 64-bit integers. So, you could operate four 16-bit hash tables in the
time it takes to operate a 64-bit hash table.

That's all somewhat theoretical sure, but the point is to challenge your
assumptions. (People do use vectorization, right now.)
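
A minimal SSE2 sketch of the idea (mine; the multiplier constant is an
arbitrary odd number chosen for illustration): _mm_mullo_epi16 multiplies
eight pairs of 16-bit integers in one instruction, where a scalar loop would
do them one at a time.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Eight 16-bit keys, all "hashed" (multiplied) at once. */
        __m128i keys = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
        __m128i mult = _mm_set1_epi16((short)0x9E37);  /* arbitrary odd constant */
        __m128i h    = _mm_mullo_epi16(keys, mult);    /* low 16 bits of each product */

        uint16_t out[8];
        int i;
        _mm_storeu_si128((__m128i *)out, h);
        for (i = 0; i < 8; i++)
            printf("%u ", out[i]);
        printf("\n");
        return 0;
    }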

~~~
nostrademons
"That's all somewhat theoretical sure, but the point is to challenge your
assumptions. (People do use vectorization, right now.)"

You're assuming that people are not aware that multiplication is not always a
constant-time operation. That assumption generally doesn't hold among people
with any sort of academic CS background. Multiplication algorithms - and the
fact that they were typically O(N) in number of bits - were covered in my
intro machine architectures course.

In practice, there are many, many other things that can negatively impact hash
table performance in a much bigger way. Like cache misses - one cache miss
costs far more than an integer multiplication. Or interpreter overhead from
using, say, Ruby instead of C (bad example, since Ruby's hashtables are
implemented in C, but the general point holds). Or the difference between -O0
and -O3. Or how early versions of Java Hashtables would often devolve into
linked lists because the hash function only looked at the first 8 characters
of a string.

~~~
lemire
I am not sure optimization, cache misses, and interpreter overhead would
obviously impact the computational complexity.

"You're assuming that people are not aware that multiplication is not always a
constant-time operation."

I made no such assumption. But I suggest people consider the case of
vectorization where, all of a sudden, it may matter quite a bit.

~~~
stcredzero
You're making a point about where we take cognitive shortcuts. Short version:
don't assume the classroom example assumptions always hold in the real world.

------
roundsquare
This is disappointing. I was really hoping for something more exciting about
the number of buckets vs. the number of elements or something like that.

In any event, he was correct when he said: "The problems that result from
picking the wrong model of computation are well known and addressed by most
textbooks. I have not discovered anything new." There are problems that you
can "solve in polynomial time" by encoding the problem as some HUGE number
(where the number of bits is exponential in the size of the problem) and doing
arithmetic. This is where you really see the breakdown of the constant time
arithmetic model.
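
A concrete instance of that breakdown, as a sketch using GMP (assuming it is
installed; link with -lgmp): n repeated squarings look like n "unit-cost"
multiplications, but the operand doubles in length every pass, so the real
cost is exponential in n.

    #include <stdio.h>
    #include <gmp.h>  /* GNU multiple-precision library; link with -lgmp */

    int main(void) {
        mpz_t x;
        int i, n = 20;
        mpz_init_set_ui(x, 2);
        for (i = 0; i < n; i++)     /* n "constant-time" multiplications... */
            mpz_mul(x, x, x);       /* x = x^2; bit length doubles each pass */
        /* ...but x is now 2^(2^n), so its bit length is exponential in n. */
        printf("bits in x after %d squarings: %zu\n", n, mpz_sizeinbase(x, 2));
        mpz_clear(x);
        return 0;
    }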

------
sorbits
Excerpt from article: _Multiplications of numbers in [0,m) can almost be done
in time O(log m). For the purposes of hash tables, the number m must be close
to the number of different keys n. Thus—formally—hash tables often run in time
O(log n)._

Yes, this guy is confused about Big-O notation — nothing to see here, please
move along :)

------
codeodor
It seems to me he is using the same variable to represent two different
quantities:

m - the size of the hash table
m - the size of each element in the hash table

The 2nd m should be a different variable, and it is constant with respect to
the first m, which is the one we care about.

------
tybris
The definition of a hash table is independent of the hash function. You should
separately consider the complexity of your hash function.

