> Another difference is that hash tables only provide average constant time accesses. Bad behaviour can be caused by hash collisions, either unintentional or malicious, but a bigger issue is rehashing. Occasionally, when the table is growing, your normally fast O(1) insert involves a slow O(n) scan to move to a bigger table. The impact of this can be reduced by using multi-level hash tables, but it still causes some unpredictability. Trees, on the other hand, have worst case performance of O(log n).
No. This discussion is about algorithmic complexity, typically counting comparisons. And of course, it ignores constants. If we're counting page accesses, we care very much about the constants.
And furthermore, a "multi-level hash table" is a tree, except an unordered one. What a useless beast!
Yes, mostly. There are better structures with these characteristics, like a skip list, a multi-hashed table (typically double hashing), or extensible hashing.
Mostly but not entirely. There are cases where you need a tree but you don't care about order, e.g. when order depends on a state that is unknown at time of writing or when all you want to encode are relationships.
"Occasionally, when the table is growing, your normally fast O(1) insert involves a slow O(n) scan to move to a bigger table."
This is my impression too, and it seems to imply that "hash tables have O(1) access" is bullshit as a general statement. The situation is essentially that hash tables can have O(1) access time at a certain size, if tuned for that size. BUT that seems like an abuse of big-O notation, which is meant (originally, in math) to describe behaviour "as you go to infinity", and you can't really have a data structure that maintains O(1) as its size increases to infinity.
> you can't really have a data structure that maintains O(1) as its size increases to infinity.
At the level of abstraction at which complexity analysis is usually done, you can. For example, a lookup in an array (a[X]) has O(1) worst-case complexity no matter how large the array is.
Of course, this relies on a few abstractions - that random memory access has uniform cost, and that basic numeric operations have uniform cost regardless of the size of the operands. In real computers, both of these assumptions are wrong (adding terabyte-sized numbers is not as fast as adding byte-sized numbers, large amounts of memory can't usually be stored on random-access hardware, etc.), but that does not necessarily impact the usefulness of complexity analysis.
For hash tables in particular, the worst-case complexity of key lookup depends on some guarantees from the hash algorithm. If you had a magical hash that is unique for every key in your domain, then lookup would be O(1) for a hash table of n elements, as n approaches infinity. Since that is not a good abstraction, we need to model the table using hash functions that can create collisions, and the worst case becomes that O(n) elements collide and get put into the same bucket, giving worst-case complexity of O(n). Adding elements to the hash table is a different operation, same as for the array.
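To make that worst case concrete, here is a toy separate-chaining table in C with a deliberately degenerate hash; the names (bad_hash, NBUCKETS) and sizes are made up for illustration:

    /* Toy separate-chaining hash table with a pathological hash: every key
     * lands in bucket 0, so lookup degrades to an O(n) list scan. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node { int key, value; struct node *next; };

    #define NBUCKETS 16
    static struct node *buckets[NBUCKETS];

    /* Deliberately bad hash: all keys collide (the worst case). */
    static unsigned bad_hash(int key) { (void)key; return 0; }

    static void insert(int key, int value) {
        unsigned b = bad_hash(key) % NBUCKETS;
        struct node *n = malloc(sizeof *n);
        n->key = key; n->value = value;
        n->next = buckets[b];
        buckets[b] = n;
    }

    static int lookup(int key, int *out) {
        /* With every key in one bucket, this walk is O(n), not O(1). */
        for (struct node *n = buckets[bad_hash(key) % NBUCKETS]; n; n = n->next)
            if (n->key == key) { *out = n->value; return 1; }
        return 0;
    }

    int main(void) {
        for (int i = 0; i < 1000; i++) insert(i, i * i);
        int v;
        if (lookup(0, &v))   /* key 0 sits at the end of the chain: full scan */
            printf("0 -> %d\n", v);
        return 0;
    }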
Formally it's amortized O(1) time and pessimistic O(n) time.
It's not abuse - many algorithms and data structures have different performance in the average and worst cases - for example, quicksort is O(n log n) on average but O(n^2) in the worst case. People just ignore the worst case unless worst-case guarantees are especially important for their system.
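For what it's worth, a quick sketch of that gap in C, using a naive first-element-pivot quicksort; the comparison counter and the fixed seed are just for illustration:

    /* Average vs. worst case: quicksort with the first element as pivot is
     * O(n log n) on random input but O(n^2) on already-sorted input, because
     * every partition is then maximally unbalanced. */
    #include <stdio.h>
    #include <stdlib.h>

    static long comparisons;

    static void quicksort(int *a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[lo], i = lo;              /* naive pivot: first element */
        for (int j = lo + 1; j <= hi; j++) {
            comparisons++;
            if (a[j] < pivot) {
                i++;
                int t = a[i]; a[i] = a[j]; a[j] = t;
            }
        }
        int t = a[lo]; a[lo] = a[i]; a[i] = t;
        quicksort(a, lo, i - 1);
        quicksort(a, i + 1, hi);
    }

    int main(void) {
        enum { N = 2000 };
        static int sorted[N], shuffled[N];
        for (int i = 0; i < N; i++) sorted[i] = shuffled[i] = i;

        /* Fisher-Yates shuffle with a fixed seed, for repeatability. */
        srand(42);
        for (int i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = t;
        }

        comparisons = 0; quicksort(shuffled, 0, N - 1);
        printf("random input:         %ld comparisons\n", comparisons);

        comparisons = 0; quicksort(sorted, 0, N - 1);
        printf("already-sorted input: %ld comparisons (~N^2/2)\n", comparisons);
        return 0;
    }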
Growing hash tables only takes amortized O(1); you just do two operations per logical operation, which is still O(1). One search term is "incremental resizing."
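A rough sketch of the idea in C, as a toy open-addressing integer set: while a resize is in flight, lookups check both tables and each insert migrates a bounded number of old slots. The names (iset_*, MIGRATE_PER_OP) and constants are invented for the example, and a real implementation would also handle deletion, duplicates already present in the old table, and so on:

    /* Toy integer set with incremental rehashing: when the table grows, the
     * old array is kept around and a few slots are migrated on every insert,
     * so no single insert pays the full O(n) move. */
    #include <stdio.h>
    #include <stdlib.h>

    #define EMPTY          (-1)   /* keys are assumed non-negative */
    #define MIGRATE_PER_OP 4      /* old slots examined per insert */

    struct iset {
        int *slots, *old_slots;        /* current table and, during a resize, the old one */
        size_t cap, old_cap, old_pos;  /* capacities and migration cursor */
        size_t count;                  /* total keys across both tables */
    };

    static int *new_slots(size_t cap) {
        int *s = malloc(cap * sizeof *s);
        for (size_t i = 0; i < cap; i++) s[i] = EMPTY;
        return s;
    }

    /* Linear probing: return the slot holding key, or the empty slot to use. */
    static size_t probe(const int *slots, size_t cap, int key) {
        size_t i = (size_t)key % cap;
        while (slots[i] != EMPTY && slots[i] != key)
            i = (i + 1) % cap;
        return i;
    }

    static void migrate_some(struct iset *t) {
        if (!t->old_slots) return;
        /* Examine only a bounded number of old slots per call. */
        for (int n = 0; n < MIGRATE_PER_OP && t->old_pos < t->old_cap; n++, t->old_pos++) {
            int k = t->old_slots[t->old_pos];
            if (k != EMPTY)
                t->slots[probe(t->slots, t->cap, k)] = k;
        }
        if (t->old_pos == t->old_cap) {        /* migration finished */
            free(t->old_slots);
            t->old_slots = NULL;
        }
    }

    static void iset_insert(struct iset *t, int key) {
        migrate_some(t);                       /* pay a small, bounded cost */
        if (2 * (t->count + 1) > t->cap && !t->old_slots) {
            /* Start a resize: allocate the bigger table, keep the old one. */
            t->old_slots = t->slots; t->old_cap = t->cap; t->old_pos = 0;
            t->cap *= 2; t->slots = new_slots(t->cap);
        }
        size_t i = probe(t->slots, t->cap, key);
        if (t->slots[i] == EMPTY) { t->slots[i] = key; t->count++; }
    }

    static int iset_contains(const struct iset *t, int key) {
        /* While a resize is in flight, a key may still live in the old table. */
        if (t->slots[probe(t->slots, t->cap, key)] == key) return 1;
        return t->old_slots &&
               t->old_slots[probe(t->old_slots, t->old_cap, key)] == key;
    }

    int main(void) {
        struct iset t = { new_slots(8), NULL, 8, 0, 0, 0 };
        for (int i = 0; i < 100; i++) iset_insert(&t, i);
        printf("contains 42: %d, contains 500: %d\n",
               iset_contains(&t, 42), iset_contains(&t, 500));
        return 0;
    }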
In practical applications, costs are often not amortized.
If lucky transactions are more than fast enough but unlucky transactions time out because they perform a rebuild, that's a defect that can't be excused by good average performance.
Most databases need to take care of worst case performance with techniques that increase complexity and degrade best-case and aggregate performance (e.g. maintaining two hash tables, with lookup accessing both, insertions into the preferred one, and background threads that move records from the old table to the new table).
I may have misspoken and should have left out "amortized," perhaps. It doesn't sound like you're familiar with incremental resizing or other real-time techniques?[1]
> Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually.
It's amortized O(1) even without any special incremental resizing. I think the goal of incremental resizing is to get closer to actual O(1).
But I'm not sure incremental resizing can get to actual O(1). Is the malloc() call for the new memory O(1)? And then after malloc() wouldn't the memory have to be initialized somehow (maybe just 0-initialized)?
Is there some reason that malloc() would be slower asymptotically than a GC? Why couldn't whatever technique is used for the fast GC be used for malloc() as well?
GCs are typically allowed to (and often do) move objects, while malloc/free do not. This means that malloc/free must maintain data structures tracking allocated vs. free space, while compacting GCs keep all the free space in a single contiguous address space, so allocating an object just increments the "start of free memory" pointer.
Of course, that just moves the cost from the allocation side to the GC side. This means that the GC must maintain data structures tracking the type of each object (to know which fields are references), the root references, and several others (like card tables) for an efficient implementation. For malloc/free, releasing an object can just add it to a free list (a pair of pointer writes), while a compacting GC has to update all references pointing to the live objects it moves.
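A minimal sketch of that contrast in C: bump-pointer allocation, as a compacting collector's nursery might do it, next to a free-list push, as a simple malloc/free might do it. The arena size and names are made up, and the bookkeeping each side needs beyond this point is elided:

    /* Contrast sketch: bump-pointer allocation (what a compacting GC's
     * nursery can do, because the collector may move objects later) vs. a
     * malloc-style free list push (objects never move, so free space must be
     * tracked). Fixed-size arena, no alignment or OOM handling. */
    #include <stddef.h>
    #include <stdio.h>

    /* --- bump allocator: allocation is a bounds check plus a pointer bump --- */
    static unsigned char arena[1 << 20];
    static size_t bump;

    static void *bump_alloc(size_t n) {
        if (bump + n > sizeof arena) return NULL;   /* a GC would run here */
        void *p = arena + bump;
        bump += n;
        return p;
    }

    /* --- free-list "free": releasing a block is two pointer writes --- */
    struct free_block { struct free_block *next; };
    static struct free_block *free_list;

    static void list_free(void *p) {
        struct free_block *b = p;
        b->next = free_list;     /* push the block onto the free list; a  */
        free_list = b;           /* later allocation can search this list */
    }

    int main(void) {
        void *a = bump_alloc(64);
        void *b = bump_alloc(64);
        printf("bump-allocated %p then %p\n", a, b);
        list_free(a);
        list_free(b);
        printf("free list head: %p\n", (void *)free_list);
        return 0;
    }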
Yep! And this is a pretty specific design element that I don't think gets appropriate attention when GCs are introduced as a concept. There's a big difference between saying that malloc vs. GC does "more work" or "less work" and that actually translating into faster or slower.
Doing more work, but elsewhere, can (possibly) make your program faster. GCs often eat more CPU and RAM, which hurts when you need more of either. But C/C++ can occasionally put malloc pressure on your critical path, hurting you even when you have plenty of resources, where a GC might actually help you.
If it was any simpler it wouldn't be so damn interesting and fun.
malloc implementations often involve some kind of search to find a free area that can satisfy the request. See http://www.gii.upv.es/tlsf/ for an O(1) implementation - use a large number of free lists, one per size class, using bit-scan instructions on the block size to select the list to use. To malloc, grab the first entry and put any remaining portion on the appropriate free list.
To free, add the block to the appropriate free list. Assume blocks participate in a doubly-linked list of all blocks, used and free alike, in address order - adjacent free blocks can be coalesced at this point.
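A heavily reduced sketch of that idea in C: one free list per power-of-two size class, plus a bitmap and GCC/Clang bit-scan builtins to pick a list in O(1). Real TLSF uses a two-level index and splits and coalesces blocks, none of which is shown here:

    /* Heavily reduced sketch of segregated free lists in the TLSF spirit:
     * one list per power-of-two size class, plus a bitmap of non-empty
     * classes so both alloc and free are O(1) via bit scans. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NCLASSES 32

    struct block { size_t size; struct block *next; };

    static struct block *classes[NCLASSES];  /* one free list per size class */
    static uint32_t nonempty;                /* bit c set <=> classes[c] != NULL */

    /* Size class = index of the highest set bit of the size (GCC/Clang builtin). */
    static int size_class(size_t n) {
        return 31 - __builtin_clz((unsigned)n);
    }

    static void free_block(struct block *b) {
        /* O(1): one bit scan to pick the class, then a list push.
         * (Real TLSF would also coalesce with adjacent free blocks here.) */
        int c = size_class(b->size);
        b->next = classes[c];
        classes[c] = b;
        nonempty |= 1u << c;
    }

    static struct block *alloc_block(size_t n) {
        /* O(1): mask away classes that may be too small, then one bit scan
         * finds the first non-empty list whose head is guaranteed to fit. */
        uint32_t mask = nonempty & ~((1u << (size_class(n) + 1)) - 1);
        if (!mask) return NULL;
        int c = __builtin_ctz(mask);
        struct block *b = classes[c];
        classes[c] = b->next;
        if (!classes[c]) nonempty &= ~(1u << c);
        return b;   /* real TLSF would split off and re-file any remainder */
    }

    int main(void) {
        /* Seed the allocator with two blocks carved from one big allocation. */
        unsigned char *arena = malloc(4096);
        struct block *b1 = (struct block *)arena;          b1->size = 256;
        struct block *b2 = (struct block *)(arena + 512);  b2->size = 1024;
        free_block(b1);
        free_block(b2);
        struct block *got = alloc_block(100);  /* served from the 256-byte block */
        printf("got a block of size %zu\n", got ? got->size : (size_t)0);
        return 0;
    }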