> Another difference is that hash tables only provide average constant time accesses. Bad behaviour can be caused by hash collisions, either unintentional or malicious, but a bigger issue is rehashing. Occasionally, when the table is growing, your normally fast O(1) insert involves a slow O(n) scan to move to a bigger table. The impact of this can be reduced by using multi-level hash tables, but it still causes some unpredictability. Trees, on the other hand, have worst case performance of O(log n).
No. This discussion is about algorithmic complexity, typically counting comparisons. And of course, it ignores constants. If we're counting page accesses, we care very much about the constants.
And furthermore, a "multi-level hash table" is a tree, except an unordered one. What a useless beast!
Yes, mostly. There are better structures with these characteristics, like a skip list, a multi-hashed table (typically double hashing), or extensible hashing.
Mostly but not entirely. There are cases where you need a tree but you don't care about order, e.g. when order depends on a state that is unknown at time of writing or when all you want to encode are relationships.
"Occasionally, when the table is growing, your normally fast O(1) insert involves a slow O(n) scan to move to a bigger table."
This is my impression too, and it seems to imply that "hash tables have O(1) access" is bullshit as a general statement. The situation is essentially that hash tables can have O(1) access time at a certain size, if tuned for that size. BUT that seems like an abuse of big-O notation, which is meant (originally, in math) to describe behaviour "as you go to infinity", and you can't really have a data structure that maintains O(1) as its size increases to infinity.
> you can't really have a data structure that maintains O(1) as its size increases to infinity.
At the level of abstraction at which complexity analysis is usually done, you can. For example, a lookup in an array (a[X]) has O(1) worst-case complexity no matter how large the array is.
Of course, this relies on a few abstractions - that random memory access has uniform cost, and that basic numeric operations have uniform cost regardless of the size of the operands. In real computers, both of these assumptions are wrong (adding terabyte-sized numbers is not as fast as adding byte-sized numbers, large amounts of memory can't usually be stored on random-access hardware, etc.), but that does not necessarily impact the usefulness of complexity analysis.
For hash tables in particular, the worst-case complexity of key lookup depends on some guarantees from the hash algorithm. If you had a magical hash that is unique for every key in your domain, then lookup would be O(1) for a hash table of n elements, as n approaches infinity. Since that is not a good abstraction, we need to model the table using hash functions that can create collisions, and the worst case becomes that O(n) elements collide and get put into the same bucket, giving worst-case complexity of O(n). Adding elements to the hash table is a different operation, same as for the array.
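To make that worst case concrete, here is a toy separate-chaining table in C with a deliberately degenerate hash; the names (bad_hash, NBUCKETS) and sizes are made up for illustration:

    /* Toy separate-chaining hash table with a pathological hash: every key
     * lands in bucket 0, so lookup degrades to an O(n) list scan. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node { int key, value; struct node *next; };

    #define NBUCKETS 16
    static struct node *buckets[NBUCKETS];

    /* Deliberately bad hash: all keys collide (the worst case). */
    static unsigned bad_hash(int key) { (void)key; return 0; }

    static void insert(int key, int value) {
        unsigned b = bad_hash(key) % NBUCKETS;
        struct node *n = malloc(sizeof *n);
        n->key = key; n->value = value;
        n->next = buckets[b];
        buckets[b] = n;
    }

    static int lookup(int key, int *out) {
        /* With every key in one bucket, this walk is O(n), not O(1). */
        for (struct node *n = buckets[bad_hash(key) % NBUCKETS]; n; n = n->next)
            if (n->key == key) { *out = n->value; return 1; }
        return 0;
    }

    int main(void) {
        for (int i = 0; i < 1000; i++) insert(i, i * i);
        int v;
        if (lookup(0, &v))   /* key 0 sits at the end of the chain: full scan */
            printf("0 -> %d\n", v);
        return 0;
    }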
Formally it's amortized O(1) time and pessimistic O(n) time.
It's not abuse - many algorithms and data structures have different performance in the average and worst cases - for example, quicksort is O(n log n) on average but O(n^2) in the worst case. People just ignore the worst case unless worst-case guarantees are especially important for their system.
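For what it's worth, a quick sketch of that gap in C, using a naive first-element-pivot quicksort; the comparison counter and the fixed seed are just for illustration:

    /* Average vs. worst case: quicksort with the first element as pivot is
     * O(n log n) on random input but O(n^2) on already-sorted input, because
     * every partition is then maximally unbalanced. */
    #include <stdio.h>
    #include <stdlib.h>

    static long comparisons;

    static void quicksort(int *a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[lo], i = lo;              /* naive pivot: first element */
        for (int j = lo + 1; j <= hi; j++) {
            comparisons++;
            if (a[j] < pivot) {
                i++;
                int t = a[i]; a[i] = a[j]; a[j] = t;
            }
        }
        int t = a[lo]; a[lo] = a[i]; a[i] = t;
        quicksort(a, lo, i - 1);
        quicksort(a, i + 1, hi);
    }

    int main(void) {
        enum { N = 2000 };
        static int sorted[N], shuffled[N];
        for (int i = 0; i < N; i++) sorted[i] = shuffled[i] = i;

        /* Fisher-Yates shuffle with a fixed seed, for repeatability. */
        srand(42);
        for (int i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = t;
        }

        comparisons = 0; quicksort(shuffled, 0, N - 1);
        printf("random input:         %ld comparisons\n", comparisons);

        comparisons = 0; quicksort(sorted, 0, N - 1);
        printf("already-sorted input: %ld comparisons (~N^2/2)\n", comparisons);
        return 0;
    }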
Growing hash tables only takes amortized O(1); you just do two operations per logical operation, which is still O(1). One search term is "incremental resizing."
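A rough sketch of the idea in C, as a toy open-addressing integer set: while a resize is in flight, lookups check both tables and each insert migrates a bounded number of old slots. The names (iset_*, MIGRATE_PER_OP) and constants are invented for the example, and a real implementation would also handle deletion, duplicates already present in the old table, and so on:

    /* Toy integer set with incremental rehashing: when the table grows, the
     * old array is kept around and a few slots are migrated on every insert,
     * so no single insert pays the full O(n) move. */
    #include <stdio.h>
    #include <stdlib.h>

    #define EMPTY          (-1)   /* keys are assumed non-negative */
    #define MIGRATE_PER_OP 4      /* old slots examined per insert */

    struct iset {
        int *slots, *old_slots;        /* current table and, during a resize, the old one */
        size_t cap, old_cap, old_pos;  /* capacities and migration cursor */
        size_t count;                  /* total keys across both tables */
    };

    static int *new_slots(size_t cap) {
        int *s = malloc(cap * sizeof *s);
        for (size_t i = 0; i < cap; i++) s[i] = EMPTY;
        return s;
    }

    /* Linear probing: return the slot holding key, or the empty slot to use. */
    static size_t probe(const int *slots, size_t cap, int key) {
        size_t i = (size_t)key % cap;
        while (slots[i] != EMPTY && slots[i] != key)
            i = (i + 1) % cap;
        return i;
    }

    static void migrate_some(struct iset *t) {
        if (!t->old_slots) return;
        /* Examine only a bounded number of old slots per call. */
        for (int n = 0; n < MIGRATE_PER_OP && t->old_pos < t->old_cap; n++, t->old_pos++) {
            int k = t->old_slots[t->old_pos];
            if (k != EMPTY)
                t->slots[probe(t->slots, t->cap, k)] = k;
        }
        if (t->old_pos == t->old_cap) {        /* migration finished */
            free(t->old_slots);
            t->old_slots = NULL;
        }
    }

    static void iset_insert(struct iset *t, int key) {
        migrate_some(t);                       /* pay a small, bounded cost */
        if (2 * (t->count + 1) > t->cap && !t->old_slots) {
            /* Start a resize: allocate the bigger table, keep the old one. */
            t->old_slots = t->slots; t->old_cap = t->cap; t->old_pos = 0;
            t->cap *= 2; t->slots = new_slots(t->cap);
        }
        size_t i = probe(t->slots, t->cap, key);
        if (t->slots[i] == EMPTY) { t->slots[i] = key; t->count++; }
    }

    static int iset_contains(const struct iset *t, int key) {
        /* While a resize is in flight, a key may still live in the old table. */
        if (t->slots[probe(t->slots, t->cap, key)] == key) return 1;
        return t->old_slots &&
               t->old_slots[probe(t->old_slots, t->old_cap, key)] == key;
    }

    int main(void) {
        struct iset t = { new_slots(8), NULL, 8, 0, 0, 0 };
        for (int i = 0; i < 100; i++) iset_insert(&t, i);
        printf("contains 42: %d, contains 500: %d\n",
               iset_contains(&t, 42), iset_contains(&t, 500));
        return 0;
    }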
In practical applications, costs are often not amortized.
If lucky transactions are more than fast enough but unlucky transactions time out because they perform a rebuild, that's a defect that can't be excused by good average performance.
Most databases need to take care of worst case performance with techniques that increase complexity and degrade best-case and aggregate performance (e.g. maintaining two hash tables, with lookup accessing both, insertions into the preferred one, and background threads that move records from the old table to the new table).
I may have misspoken and should have left out "amortized," perhaps. It doesn't sound like you're familiar with incremental resizing or other real-time techniques?[1]
> Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually.
It's amortized O(1) even without any special incremental resizing. I think the goal of incremental resizing is to get closer to actual O(1).
But I'm not sure incremental resizing can get to actual O(1). Is the malloc() call for the new memory O(1)? And then after malloc() wouldn't the memory have to be initialized somehow (maybe just 0-initialized)?
Is there some reason that malloc() would be slower asymptotically than a GC? Why couldn't whatever technique is used for the fast GC be used for malloc() as well?
GCs are typically allowed to (and often do) move objects, while malloc/free do not. This means that malloc/free must maintain data structures tracking allocated vs. free space, while compacting GCs keep all the free space in a single contiguous address space, so allocating an object just increments the "start of free memory" pointer.
Of course, that just moves the cost from the allocation side to the GC side. This means that the GC must maintain data structures tracking the type of each object (to know which fields are references), the root references, and several others (like card tables) for an efficient implementation. For malloc/free, releasing an object can just add it to a free list (a pair of pointer writes), while a compacting GC has to update all references pointing to the live objects it moves.
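A minimal sketch of that contrast in C: bump-pointer allocation, as a compacting collector's nursery might do it, next to a free-list push, as a simple malloc/free might do it. The arena size and names are made up, and the bookkeeping each side needs beyond this point is elided:

    /* Contrast sketch: bump-pointer allocation (what a compacting GC's
     * nursery can do, because the collector may move objects later) vs. a
     * malloc-style free list push (objects never move, so free space must be
     * tracked). Fixed-size arena, no alignment or OOM handling. */
    #include <stddef.h>
    #include <stdio.h>

    /* --- bump allocator: allocation is a bounds check plus a pointer bump --- */
    static unsigned char arena[1 << 20];
    static size_t bump;

    static void *bump_alloc(size_t n) {
        if (bump + n > sizeof arena) return NULL;   /* a GC would run here */
        void *p = arena + bump;
        bump += n;
        return p;
    }

    /* --- free-list "free": releasing a block is two pointer writes --- */
    struct free_block { struct free_block *next; };
    static struct free_block *free_list;

    static void list_free(void *p) {
        struct free_block *b = p;
        b->next = free_list;     /* push the block onto the free list; a  */
        free_list = b;           /* later allocation can search this list */
    }

    int main(void) {
        void *a = bump_alloc(64);
        void *b = bump_alloc(64);
        printf("bump-allocated %p then %p\n", a, b);
        list_free(a);
        list_free(b);
        printf("free list head: %p\n", (void *)free_list);
        return 0;
    }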
Yep! And this is a pretty specific design element that I don't think gets appropriate attention when GCs are introduced as a concept. There's a big difference between saying that malloc vs. GC does "more work" or "less work" and that actually translating into faster or slower.
Doing more work, but elsewhere, can (possibly) make your program faster. GCs often eat more CPU and RAM, which hurts when you need more of either. But C/C++ can occasionally put malloc pressure on your critical path, hurting you even when you have plenty of resources, where a GC might actually help you.
If it was any simpler it wouldn't be so damn interesting and fun.
malloc implementations often involve some kind of search to find a free area that can satisfy the request. See http://www.gii.upv.es/tlsf/ for an O(1) implementation - use a large number of free lists, one per size class, using bit-scan instructions on the block size to select the list to use. To malloc, grab the first entry and put any remaining portion on the appropriate free list.
To free, add the block to the appropriate free list. Assume blocks participate in a doubly-linked list of all blocks, used and free alike, in address order - adjacent free blocks can be coalesced at this point.
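A heavily reduced sketch of that idea in C: one free list per power-of-two size class, plus a bitmap and GCC/Clang bit-scan builtins to pick a list in O(1). Real TLSF uses a two-level index and splits and coalesces blocks, none of which is shown here:

    /* Heavily reduced sketch of segregated free lists in the TLSF spirit:
     * one list per power-of-two size class, plus a bitmap of non-empty
     * classes so both alloc and free are O(1) via bit scans. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NCLASSES 32

    struct block { size_t size; struct block *next; };

    static struct block *classes[NCLASSES];  /* one free list per size class */
    static uint32_t nonempty;                /* bit c set <=> classes[c] != NULL */

    /* Size class = index of the highest set bit of the size (GCC/Clang builtin). */
    static int size_class(size_t n) {
        return 31 - __builtin_clz((unsigned)n);
    }

    static void free_block(struct block *b) {
        /* O(1): one bit scan to pick the class, then a list push.
         * (Real TLSF would also coalesce with adjacent free blocks here.) */
        int c = size_class(b->size);
        b->next = classes[c];
        classes[c] = b;
        nonempty |= 1u << c;
    }

    static struct block *alloc_block(size_t n) {
        /* O(1): mask away classes that may be too small, then one bit scan
         * finds the first non-empty list whose head is guaranteed to fit. */
        uint32_t mask = nonempty & ~((1u << (size_class(n) + 1)) - 1);
        if (!mask) return NULL;
        int c = __builtin_ctz(mask);
        struct block *b = classes[c];
        classes[c] = b->next;
        if (!classes[c]) nonempty &= ~(1u << c);
        return b;   /* real TLSF would split off and re-file any remainder */
    }

    int main(void) {
        /* Seed the allocator with two blocks carved from one big allocation. */
        unsigned char *arena = malloc(4096);
        struct block *b1 = (struct block *)arena;          b1->size = 256;
        struct block *b2 = (struct block *)(arena + 512);  b2->size = 1024;
        free_block(b1);
        free_block(b2);
        struct block *got = alloc_block(100);  /* served from the 256-byte block */
        printf("got a block of size %zu\n", got ? got->size : (size_t)0);
        return 0;
    }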