Why databases use ordered indexes but programming uses hash tables (evanjones.ca)
429 points by matt_d 44 days ago | 197 comments



I think this article misses two points more important than anything else mentioned.

1) B-trees are dynamic while hash tables are not. A B-tree grows gracefully across orders of magnitude, and shrinks just as easily. Most hash tables do not have this property. Extensible hashing and linear hashing do grow and shrink gracefully, but I'm not sure how widely used they are.

2) In a database system, the concern was traditionally to minimize page accesses. CPU cost is not negligible, and there are now main-memory databases, as well as RAM sizes much larger than they were in the 70s and 80s. However, a lot of the main ideas in traditional OLTP databases were developed a long time ago, and the resulting architecture is still very much in use. So how many page reads to do a key lookup in a hash table? One. End of story. How many page reads to do a key lookup in a B-tree? Or, to be more precise, a B+-tree, which has a much higher branching factor? Probably one. The root, and most likely the second level of pages, stay cached, so it's really one page access to get a third-level page. And, of course, as mentioned in the article, B-trees give you sequential access in key order, for when that's important.
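
To see how the high branching factor keeps the tree shallow, here is a back-of-the-envelope sketch; the fanout of ~1000 keys per internal page is an assumed round number, not a figure from the comment or the article:

  // Rough sketch: levels needed for N keys at a given fanout. The fanout of
  // 1000 keys per internal page is an assumption for illustration only.
  #include <cmath>
  #include <cstdio>

  int main() {
      const double fanout = 1000.0;
      for (double n : {1e6, 1e9, 1e12}) {
          int levels = static_cast<int>(std::ceil(std::log(n) / std::log(fanout)));
          std::printf("%.0e keys -> %d levels (top levels usually stay cached)\n", n, levels);
      }
  }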


On modern hardware a key lookup in a hash table isn't necessarily a single page read! Sure, it's a single virtual memory access, but if that page isn't in your TLB you need to read the page table... and if the page containing that part of the page table isn't in the TLB you need to read that page...

On modern hardware, every memory access looks very much like a B-tree lookup.


The page-table hierarchy (what you walk on a TLB miss) is not a B-tree; it is a trie in all of these CPUs. Very different layout (not balanced and also fixed-size), and much faster on the happy path.

Making a B-tree over a 48-bit address space would have a bit of a memory problem, making the translation structures huge.

And then the CPU cache is an array. A single virtual memory access is bound to be a single physical access as well, with minor exceptions when NUMA nodes are crossed.


There's a reason I said "looks very much like" rather than "is". ;-)


But the point remains: walking the cache hierarchy once for a hash table is faster than walking it O(log n) times for a B-tree.


> and if the page containing that part of the page table isn't in the TLB you need to read that page...

I thought page tables used physical addresses, which are accessed directly without any TLB lookup (except when nested paging is used during virtualization, which adds another level of indirection). Of course, the processor still needs to read each level into the data cache(s) while doing a page table walk.


Typically. Although on a virtualized Arm what the guest views as a physical address is really an intermediate physical address that must be translated by the second stage MMU. So it’s possible that reading the first stage page tables can cause a page fault in the second stage MMU. I suspect modern x86 works similarly, but I’m less familiar with that.


Page tables can have multiple levels. For example, on x86_64 you'd have 4 levels, i.e. the virtual->physical mapping is implemented as a tree of depth 4, where each leaf and internal node of the tree is 4 KB (the page size). (As usual, the details are more complicated than that.)


Yes, and each level of the tree has the physical address of the next level, so no TLB lookup is necessary (the top of the tree, in the TTBRn or equivalent registers, is also a physical address).


oh yeah, I totally misread the comment: a page fetch is required, but that has nothing to do with the TLB indeed


what's the difference? "page fetch" is not really something that can be googled on the web.


The TLB is just one element of the process that resolves a virtual address into a physical one: it's a cache that holds the most recently resolved translations.

When the virtual address you're looking to resolve is not present in that cache (i.e. when you have a TLB miss), the CPU falls back to walking the page table hierarchy. At each level of the tree, the CPU reads the physical address of the next level and performs a memory fetch of that page table entry (in my previous comment I erroneously said a "page fetch", but it's actually only a cache-line-sized fetch), and it repeats this until it reaches the leaves of the tree, which contain the Page Table Entry holding the physical address of the (4 KB) physical page associated with the virtual page address you wanted to resolve.
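
To make that walk concrete, a rough sketch of the 4-level x86_64 layout (the Node type and bare pointers are simplifications for illustration; real page-table entries hold frame numbers plus flag bits):

  // Sketch of a 4-level x86_64-style page walk: a 48-bit virtual address is
  // split into four 9-bit indexes plus a 12-bit page offset, and the walk
  // follows one physical pointer per level.
  #include <cstdint>

  struct Node { Node* child[512]; };   // 512 entries of 8 bytes = one 4 KB page

  Node* walk(Node* root, std::uint64_t vaddr) {
      Node* n = root;                              // from CR3 / TTBRn (a physical address)
      for (int level = 3; level >= 0; --level) {
          unsigned idx = (vaddr >> (12 + 9 * level)) & 0x1FF;  // bits 47..39, 38..30, 29..21, 20..12
          n = n->child[idx];                       // one memory fetch per level
          if (n == nullptr) return nullptr;        // unmapped: page fault
      }
      return n;  // "frame" base; the physical address is this plus (vaddr & 0xFFF)
  }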


If huge pages are used, then it's very likely the translation is cached in the TLB.


Depends on your workload and how many TLB entries your CPU has for superpages. The Zen 2 TLB can hold tons (1000s) of 2MB superpages but relatively few (64) 1GB superpages. Older CPU models had worse capacity for 1GB and 2MB superpages. E.g., Haswell (2013) had only 32 entries for 2MB superpages and 4 entries for 1GB superpages (data).


In addition to the limited number of cache slots available for superpages (varies depending on cpu), remember that those can be invalidated (again, depending on cpu). If you're ping-ponging processes on a single CPU, you won't necessarily have what you need in the TLB.


> If you're ping-ponging processes on a single CPU, you won't necessarily have what you need in the TLB.

Is that really likely on a production database server?


Depends on the design. At a minimum you're ping-ponging between userland and kernel; but you might also be bouncing between a transport layer unwrapper, an authentication front-end, the database core, and a storage back-end.


Ideally, with io_uring(2)/DPDK/SR-IOV-NVMe, you can skip enough syscalls to drop their performance impact below 1%.


Spectre may have changed things, but for a long time a user->kernel switch didn't imply a full TLB flush.


And that’s assuming you’ve missed several layers of cache between the CPU and main (virtual) memory!


This is not how TLB lookup works


The OP is talking about disk page reads, not virtual memory pages.


Both of your points (very valid) could perhaps be wrapped up as:

Indexes have lower computational complexity for certain tasks (in particular range queries & sort), but higher constants than typical hashing.

Thus indexes make more sense with two-tiered memory: caching in RAM the metadata about data kept in block storage.

How that could be put to good use with the L1 cache being much faster than RAM is anyone's guess[1].

--

[1] without guessing, we already know a small interpreter that fits in L1 can be blazing fast.


Rust's default (edit: ordered) map structure is an in-memory B-tree, fwiw.


What do you mean by this? There is no default map type. The standard library has types such as BTreeMap and HashMap, which are pretty much what it says on the tin.


Ordered map.


Didn't the article address 1) with this?:

> Another difference is that hash tables only provide average constant time accesses. Bad behaviour can be caused by hash collisions, either unintentional or malicious, but a bigger issue is rehashing. Occasionally, when the table is growing, your normally fast O(1) insert involves a slow O(n) scan to move to a bigger table. The impact of this can be reduced by using multi-level hash tables, but it still causes some unpredictability. Trees, on the other hand, have worst case performance of O(log n).


No. This discussion is about algorithmic complexity, typically counting comparisons. And of course, it ignores constants. If we're counting page accesses, we care very much about the constants.

And furthermore, a "multi-level hash table" is a tree, except an unordered one. What a useless beast!


Yes, mostly. There are better structures with these characteristics, like a skip list, a multi-hashed table (typically double hashed), or extensible hashing.


>> What a useless beast!

Mostly but not entirely. There are cases where you need a tree but you don't care about order, e.g. when order depends on a state that is unknown at time of writing or when all you want to encode are relationships.


Also, JSON and its ilk.


"Occasionally, when the table is growing, your normally fast O(1) insert involves a slow O(n) scan to move to a bigger table."

This is my impression too, and it seems to imply that "hash tables have O(1) access" is bullshit taken as a general formulation. The situation is essentially that hash tables can have "O(1) access" time at a certain size and if tuned for that size. BUT that seems an abuse of big-O notation, which is meant (originally, in math) to describe behavior "as you go to infinity", and you can't really have a data structure that maintains O(1) as its size increases to infinity.


> you can't really have a data structure that maintains O(1) as its size increases to infinity.

At the level of abstraction at which complexity analysis is usually done, you can. For example, a lookup in an array (a[X]) has O(1) worst-case complexity no matter how large the array is.

Of course, this relies on a few abstractions - that random memory access has uniform cost; and that basic numeric operations have uniform cost regardless of the size of the operands. Of course, in real computers, both of these assumptions are wrong (adding terabyte-size numbers is not as fast as adding byte-sized numbers, large amounts of memory can't usually be stored on random-access hardware etc.), but that does not necessarily impact the usefulness of complexity analysis.

For hash tables in particular, the worst-case complexity for key lookup depends on some guarantees from the hash algorithm. If you had a magical hash that is unique for any key in your domain, then lookup would be O(1) for a hash table of n elements, as n approaches infinity. Since that is not a good abstraction, we need to model the table using hash functions that can create collisions, and the worst case becomes that O(n) elements collide and get put into the same bucket, giving worst-case complexity of O(n). Adding elements to the hash table is a different operation, same as for the array.


> that seems an abuse of big-O notation

Formally it's amortized O(1) time and pessimistic O(n) time.

It's not abuse - many algorithms and data structures have different performance in average and worst case - for example quicksort is O(n log n) but in worst case it's O(n^2). People just ignore the worst case unless worst-case guarantees are especially important for their system.


The actual analysis of inserts is that they take "amortized O(1) with a single-operation worst case of O(N)".

So N operations take N measurement units as N goes to infinity, but any individual operation may be slower or faster.
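
A quick way to see the amortized bound for the usual doubling strategy is a toy simulation (illustrative only, not any particular library's implementation):

  // Toy simulation of table doubling: count element moves over n inserts.
  // Total moves are bounded by n + n/2 + n/4 + ... < 2n, so the average
  // (amortized) cost per insert stays constant even though individual
  // inserts that trigger a resize cost O(current size).
  #include <cstdio>

  int main() {
      long long capacity = 1, size = 0, total_moves = 0;
      const long long n = 1'000'000;
      for (long long i = 0; i < n; ++i) {
          if (size == capacity) {           // resize: rehash every element
              total_moves += size;
              capacity *= 2;
          }
          ++size;                           // the insert itself
          ++total_moves;
      }
      std::printf("moves per insert: %.3f\n", double(total_moves) / n);  // ~2
  }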


Growing hash tables only takes amortized O(1); you just do two operations per logical operation, which is still O(1). One search term is "incremental resizing."

https://en.wikipedia.org/wiki/Hash_table#Dynamic_resizing
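
For reference, a minimal sketch of the incremental-resizing idea: two bucket arrays, with a couple of old buckets migrated per operation. This is illustrative, not how any particular library implements it:

  // Sketch of incremental resizing: keep an old and a new bucket array, and
  // migrate a couple of old buckets on every insert instead of rehashing all
  // at once. Lookups consult both arrays while a migration is in progress.
  #include <cstddef>
  #include <list>
  #include <vector>

  class IncrementalSet {
      std::vector<std::list<int>> cur_, old_;
      std::size_t migrated_ = 0, count_ = 0;

      static std::size_t slot(int key, std::size_t buckets) {
          return static_cast<std::size_t>(key) % buckets;
      }
      void migrate_some() {                        // move at most two old buckets
          for (int i = 0; i < 2 && migrated_ < old_.size(); ++i, ++migrated_)
              for (int k : old_[migrated_])
                  cur_[slot(k, cur_.size())].push_back(k);
          if (migrated_ == old_.size()) old_.clear();
      }
  public:
      IncrementalSet() : cur_(8) {}

      void insert(int key) {
          if (contains(key)) return;
          if (old_.empty() && count_ >= cur_.size()) {   // load factor 1: start resizing
              old_ = std::move(cur_);
              cur_.assign(old_.size() * 2, {});
              migrated_ = 0;
          }
          cur_[slot(key, cur_.size())].push_back(key);
          ++count_;
          migrate_some();
      }
      bool contains(int key) const {
          for (int k : cur_[slot(key, cur_.size())]) if (k == key) return true;
          if (!old_.empty())
              for (int k : old_[slot(key, old_.size())]) if (k == key) return true;
          return false;
      }
  };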


In practical applications, costs are often not amortized.

If lucky transactions are more than fast enough but unlucky transactions time out because they perform a rebuild, that's a defect that cannot be accepted just because average performance is good.

Most databases need to take care of worst case performance with techniques that increase complexity and degrade best-case and aggregate performance (e.g. maintaining two hash tables, with lookup accessing both, insertions into the preferred one, and background threads that move records from the old table to the new table).


I may have misspoken and should have left out "amortized," perhaps. It doesn't sound like you're familiar with incremental resizing or other real-time techniques?[1]

> Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually.

[1]: https://en.wikipedia.org/wiki/Hash_table#Alternatives_to_all...


It's amortized O(1) even without any special incremental resizing. I think the goal of incremental resizing is to get closer to actual O(1).

But I'm not sure incremental resizing can get to actual O(1). Is the malloc() call for the new memory O(1)? And then after malloc() wouldn't the memory have to be initialized somehow (maybe just 0-initialized)?


malloc() isn't O(1), but allocations in a normal GC are O(1).


Is there some reason that malloc() would be slower asymptotically than a GC? Why couldn't whatever technique is used for the fast GC be used for malloc() as well?


GCs are typically allowed to (and often do) move objects, while malloc/free do not. This means that malloc/free must maintain data structures tracking allocated vs. free space, while compacting GCs keep all the free space in a single contiguous address space, so allocating an object just increments the "start of free memory" pointer.
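
A minimal sketch of that bump-pointer idea (names and layout are hypothetical; a real compacting GC also keeps metadata so it can relocate objects later):

  // Bump-pointer allocation: the free region is one contiguous block, so
  // allocating is just advancing a pointer. Reclaiming space requires the
  // collector to later move live objects and reset the pointer.
  #include <cstddef>
  #include <cstdint>

  struct BumpRegion {
      std::uint8_t* cursor;   // start of free memory
      std::uint8_t* end;      // end of the region

      void* allocate(std::size_t bytes) {
          bytes = (bytes + 7) & ~std::size_t(7);      // 8-byte alignment
          if (cursor + bytes > end) return nullptr;   // would trigger a GC here
          void* result = cursor;
          cursor += bytes;                            // the whole "allocation"
          return result;
      }
  };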


Of course, that just moves the cost from the allocation side to the GC side. This means that the GC must maintain data structures tracking the type of each object (to know which fields are references), the root references, and several others (like card tables) for an efficient implementation. For malloc/free, releasing an object can just add it to a free list (a pair of pointer writes), while a compacting GC has to update all references pointing to the live objects it moves.


Yep! And this is a pretty specific design element that I don't think gets appropriate attention when introducing GCs as a concept. There's a big difference between saying malloc vs. GC does "more work" or "less work" and implying that it is therefore faster or slower.

Doing more work, but elsewhere, can (possibly) make your program faster. GCs often eat more CPU and RAM, which hurts when you need more of either. But C/C++ can occasionally put malloc pressure on your critical path, hurting you when you have plenty of resources, where a GC might actually help you.

If it was any simpler it wouldn't be so damn interesting and fun.


malloc implementations often involve some kind of a search to find a free area that can satisfy the request. See http://www.gii.upv.es/tlsf/ for an O(1) implementation - use a large number of free lists, one per size, using bit scan instructions on the block size to select the list to use. To malloc, grab the first entry, and put any remaining portion on the appropriate free list.

To free, add the block to the appropriate free list. Assume blocks participate in a doubly-linked list of all blocks, used and free alike, in address order - adjacent free blocks can be coalesced at this point.
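
A stripped-down sketch of the segregated-free-list idea (not TLSF itself; just power-of-two size classes selected with a bit scan, with splitting, coalescing, and the non-empty-class bitmap omitted):

  // One free list per power-of-two size class, chosen with a bit scan.
  #include <cstddef>

  struct Block { Block* next; std::size_t size; };

  constexpr int kClasses = 64;
  Block* free_lists[kClasses] = {};

  int request_class(std::size_t size) {            // smallest 2^k that covers `size`
      if (size <= 1) return 0;
      return 64 - __builtin_clzll(size - 1);       // GCC/Clang bit-scan builtin
  }
  int block_class(std::size_t size) {              // largest 2^k not above `size` (size > 0)
      return 63 - __builtin_clzll(size);
  }

  void* alloc(std::size_t size) {
      // TLSF would bit-scan a bitmap of non-empty classes here; a linear scan
      // over at most kClasses entries is shown for brevity.
      for (int cls = request_class(size); cls < kClasses; ++cls) {
          if (Block* b = free_lists[cls]) {
              free_lists[cls] = b->next;
              return b;                             // splitting the remainder omitted
          }
      }
      return nullptr;                               // would fall back to the OS here
  }

  void release(Block* b) {
      int cls = block_class(b->size);
      b->next = free_lists[cls];
      free_lists[cls] = b;                          // coalescing with neighbours omitted
  }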


Relational databases do mostly range queries (e.g. SELECT a WHERE b > ?). Hash tables suck for that since in general they reshuffle (hash) keys. In a tree structure, accessing a range of 1M keys is 1 random lookup plus 1M sequential accesses (cache-friendly). In a hash table that becomes 1M random lookups, which is not cache-friendly.

See section IV 4 E, "range queries" in "A comparison of adaptive radix trees and hash tables" (2015)

https://15721.courses.cs.cmu.edu/spring2019/papers/08-oltpin...

Even against the best hash table-based indexing, B+Trees win. And combining this with the big cost of resizing (growing) or rehashing a table (rebalancing, e.g. Cuckoo), they are not that useful.

Perhaps some adaptive hashing method could take care of this in the future.
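
To illustrate the range-query point above in ordinary in-memory terms, a small sketch using the C++ standard containers (illustrative, not database code):

  // Range query b in [10, 100]: an ordered map finds the start in O(log n)
  // and then walks neighbours sequentially; an unordered map has no better
  // option than scanning every entry.
  #include <map>
  #include <unordered_map>
  #include <vector>

  std::vector<int> range_ordered(const std::map<int, int>& m) {
      std::vector<int> out;
      for (auto it = m.lower_bound(10); it != m.end() && it->first <= 100; ++it)
          out.push_back(it->second);               // sequential, cache-friendlier
      return out;
  }

  std::vector<int> range_hashed(const std::unordered_map<int, int>& m) {
      std::vector<int> out;
      for (const auto& [k, v] : m)                 // must touch every bucket
          if (k >= 10 && k <= 100) out.push_back(v);
      return out;
  }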


I don't think this is correct (about mostly range queries). Scan the code base of any application, and I predict you will see 90%+ of queries not involving an inequality. That is for OLTP. For analytics, the answer might be different.


Yup, aligning the B-Tree nodes to database pages (which are themselves aligned to physical disk blocks) is the key.

No pun intended ;)


> In a database system, the concern was traditionally to minimize page accesses

Is it really about memory access or disk loads?


Disk.

Though B-trees are friendly to CPU caches too, really it's about minimising disk accesses.


Why can’t hash tables shrink and grow gracefully?


Because items are placed into buckets based on their hash, and as you add or remove buckets you need to redistribute items.

I guess it depends what your definition is of ‘graceful’.


A b-tree grows and shrinks gracefully because each update modifies O(log n) pages. And the base of the logarithm is pretty big.

A typical hash table is not graceful in this way, because while each insert is O(1), you occasionally need to rehash everything into a bigger table, and that is O(n). As I noted above, linear and extensible hashing avoid this problem. An insert that forces a reorg just reorgs one bucket.
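
For reference, a toy sketch of the linear-hashing idea (one bucket split per overflow); parameters and the chaining representation are made up for illustration:

  // Linear hashing sketch: when the load factor gets too high, split exactly
  // one bucket (the one at the split pointer) instead of rehashing the table.
  #include <cstddef>
  #include <list>
  #include <vector>

  class LinearHashSet {
      std::vector<std::list<unsigned>> buckets_;
      std::size_t round_size_ = 4;     // number of buckets at the start of this round
      std::size_t split_ = 0;          // next bucket to split
      std::size_t count_ = 0;

      std::size_t index(unsigned key) const {
          std::size_t i = key % round_size_;
          // Buckets before the split pointer were already split this round,
          // so they are addressed with the doubled modulus.
          return i < split_ ? key % (round_size_ * 2) : i;
      }
      void split_one() {
          buckets_.emplace_back();                       // partner bucket for bucket split_
          std::list<unsigned> old = std::move(buckets_[split_]);
          buckets_[split_].clear();
          if (++split_ == round_size_) { round_size_ *= 2; split_ = 0; }  // round finished
          for (unsigned k : old) buckets_[index(k)].push_back(k);  // redistribute one bucket only
      }
  public:
      LinearHashSet() : buckets_(4) {}

      void insert(unsigned key) {                        // duplicates not checked, for brevity
          buckets_[index(key)].push_back(key);
          if (++count_ > buckets_.size()) split_one();   // keep the load factor near 1
      }
      bool contains(unsigned key) const {
          for (unsigned k : buckets_[index(key)]) if (k == key) return true;
          return false;
      }
  };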


It might be down to memory/space allocation after all.

All hash tables require rehashing above a certain fill rate; otherwise they regress to linear lookup. For the two most popular implementations, linked-list based and open-addressing based, the common strategy for rehashing is to create another instance of the hash table and copy the KVs over. But on disk, this would be disastrous.


They can. See linear hashing and extensible hashing.


Nit-pick / pet-peeve regarding Big O notation!

> Hash tables provide constant time O(1) access for single values, while trees provide logarithmic time O(log n) access. For single value lookups, this means hash tables are faster,

> for small n, the hash table is better, but for large n, the cost of that rare scan dominates, and the tree is better

The Big O notation describes how the number of operations scales as the size of the input tends to infinity: the asymptotic complexity. This means that an O(1) algorithm can include some fixed number of operations, even enough to add _years_ onto the runtime, and still be classed as O(1).

Essentially I don’t think it’s really enough to consider the asymptotic complexity of algorithms to compare practical runtimes. You need to know what the smaller-order / constant terms are.


My favourite example of galactic algorithms is the most efficient way to multiply two numbers. As the Wikipedia page[1] states:

  An example of a galactic algorithm is the fastest known
  way to multiply two numbers,[2] which is based on a
  1729-dimensional Fourier transform. This means it will
  not reach its stated efficiency until the numbers have
  at least 21729 digits
[1] https://en.wikipedia.org/wiki/Galactic_algorithm


> 21729 digits

Not 21729 digits, but 2^1729 digits. Practically unusable in any real world case.


Thank you. I should have checked for copy-paste errors, but it's too late to fix it now.


Looking at the performance graphs of different hash table implementations is the best way to invalidate the idea of "constant time":

https://probablydance.com/2018/05/28/a-new-fast-hash-table-i...

HN submission with related comments: https://news.ycombinator.com/item?id=17177798


Also, most hash tables are not really O(1). The worst case scenario is more like O(n).


In some senses, no hash tables are O(1).

Hash tables are bounded below by the speed of arithmetic. A trillion-bit number takes a while to multiply. You may have never touched a trillion-bit number, but it's called "asymptotic" for a reason.

You only get O(1) if you use a model of computation in which arithmetic is constant time. It's simple to work with, even though it's not realistic, and opens up scenarios where you can encode a really complex operation into adding/multiplying some huge numbers, and then claim the overall computation is still fast.

Some theorists use a model of computation in which, if your input size is n, multiplying log(n)-bit numbers is O(1) but multiplying n-bit numbers is O(n). I have no idea how this is well-defined.


Constant-time arithmetic is realistic in most normal situations, like evaluating the performance of a hash table implementation. Perhaps your implementation uses 32 bits for the hash function, and can only scale to 4 billion slots. Maybe your implementation of a vector has the same limitation. In general, the reason you'll have trouble trying to store 2^trillion items is not because you chose a data structure with worse asymptotic performance than you expected, it's because you run out of space in the universe in which to store them.

So saying such a structure has O(1) asymptotic access time is not perfectly true in a mathematical sense, for unrealistic numbers. But it's certainly realistic.


Yes, it is, but the question is how do you make this notion of "realistic" formal and rigorous. The usual model for complexity analyses, the TM, can't do constant-time arithmetic. So we're interested in a more powerful model.

I'd proposed a model in a sibling comment that seems to be the one most people use for analyzing algorithms.


A Turing machine can do constant-time arithmetic (or more precisely, constant-bounded-time arithmetic) for operands of bounded size, as was suggested above.


Same applies to trees though: you have to multiply log n by the number of bits, because each naive comparison will take O(number of bits).


Indeed, in the TM model hash table lookups are necessarily Omega(n) in the length of data stored.

I'd say the model of computation used in these analyses and interview questions is a random-access machine with unboundedly many registers of unbounded size (and constant-time arithmetic), which is more powerful than a TM (and its siblings) for complexity analyses.

Of course, they are useless for analyzing complexity classes of computational problems, but you can use them to talk about the complexity of computer algorithms.


That's not what big O notation is about. It represents the growth rate. In the case of a hash table, the size of the thing you are hashing is not affected by the number of items present in the hash table. Putting a trillion-bit integer into a hash table of other integers is still O(1); it's a constant.


> In the case of a hash table, the size of the thing you are hashing is not affected by the number of items present in the hash table.

But it is, and that's precisely the reason why hash tables are not O(1) but rather O(log(n)). By the pigeonhole principle, it is impossible to store N items in a hash table using a key whose representation is less than log(N) bits without a collision. This means that there is a relationship between the size of the key being hashed and the number of elements being inserted into the hash map; specifically, the relationship is an element of O(log(n)).


> But it is, and that's precisely the reason why hash tables are not O(1) but rather O(log(n))

I'm sorry, but as a reader it is quite amusing that various posters are claiming (all without citing any sources) hashtables are O(1), O(n) and O(log(n))


It depends on the hash table. E.g., in the D programming language, if there is a collision it uses a red-black tree instead of an array.

This makes it effectively O(1) but worst-case O(log n).


Java does something similar, converting a HashMap bucket into a binary tree after enough collisions.


People keep skipping a word and a qualifier. The precise description is "hash tables with appropriately chosen growth condition have amortised insert cost of O(1)". The amortised bit effectively hides the occasional rehashing. (It's about throughput, not latency)


Rehashing can be done incrementally; aside from the additional memory allocation, incremental rehashing can be O(1) rather than O(n).


The general assumption is that the hash function used is a random oracle, if your hash function is H(s) = 0, then you will obviously end up with lots of collisions.


Another question: doesn't the O(1) property of hashtables require that there's no collisions? Because of the birthday paradox, you'll need an insanely large hashtable to avoid collisions, and if you don't (or get unlucky), the performance approaches O(n) as the hashtable gets fuller.

If you think about it, a hashtable isn't really that different from a b-tree with a very large value for "b".


Collisions are expected (fill factors are typically 2/3 to 3/4). You make every bucket in the table a linked list, and check each value in the list against the one you are searching for.
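
A minimal sketch of that chaining scheme (illustrative, not any particular library's implementation):

  // Separate chaining: each bucket is a list; lookup hashes to a bucket and
  // then compares against every entry in that bucket's chain.
  // Assumes a non-empty bucket array.
  #include <functional>
  #include <list>
  #include <vector>

  bool contains(const std::vector<std::list<int>>& buckets, int key) {
      const auto& chain = buckets[std::hash<int>{}(key) % buckets.size()];
      for (int k : chain)
          if (k == key) return true;   // chain length drives the worst case
      return false;
  }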


This, I believe, is exactly the behavior the parent comment was describing (and bemoaning). If you have to check against every entry in a linked list (an operation proportional to the length of the list), then your hash table ceases to be constant-time once you get collisions. In that case, if you have a hash table of fixed size but arbitrarily increase the number of entries in the table, the look-up time grows as O(n).


> if you have a hash table of fixed size

No HashTable implementation worth its salt is going to have a fixed-size table. It is going to dynamically grow the size of the table (double it) once a certain level of utilization is reached (say 50%). In practice this means that the chains at any given bucket will be of some small constant length, assuming a good hash function.


Hash tables with buckets are the simplest implementation, but also the least performant. Cuckoo hashing for example deals with collisions differently.


Java switches to a red-black tree after a bucket's linked list has grown to some constant length (8?).


Or use probing.


> Essentially I don’t think it’s really enough to consider the asymptotic complexity of algorithms to compare practical runtimes. You need to know what the smaller-order / constant terms are.

Nitpick: even if you know the constant terms, that would only tell you the number of operations required to execute the code. That's not enough to compare runtimes, because runtimes are also impacted by other factors like branch prediction rate and cache hit rate. You need to consider those factors as well, in order to compare runtimes.

At some point, articles become completely unreadable when they try to address every possible nitpick. Some tangents are better left unexplored in a blog post.


This is why benchmarking can be useful. I’m not sure why in a field where the materials cost of testing an idea is zero it’s done so much less frequently than theorizing.


The real costs of testing are clearly nonzero.

If they were zero, the testing would already be done.


Real costs are certainly more than zero, but it’s almost entirely time. “Materials cost” is what I rounded off to zero. Like if you want to benchmark an idea in aviation, you’re going to have to burn some fuel.


Maybe because that commits you to a specific implementation with all of its particular choices instead of being able to compare the class of problems and solutions.

Edit: Typo.


Also of note, hash tables (unless they are using perfect hashing) are not O(1). Big O notation refers to the worst case time, and the worst case time for a hash table is when the hash key is non-unique and the desired element needs to be looked up in a table. The relation between N and the complexity of this operation is somewhat implementation / data distribution dependent, but in the true worst case it is O(N) (all keys in the same bucket).


> Big O notation refers to the worst case time

Not true. Big O notation can refer to worst case time, average case time, best case time, or any other possible mathematical function from integers or real numbers to real numbers.

Big-O notation is in fact independent of computer science. It’s sometimes taught in calculus courses and used to describe arbitrary functions that have nothing to do with algorithm running times.


Big-O in Bachmann–Landau notation specifically means the upper bound.

There are lots of other interesting measures of asymptotic behavior, and it's true that frequently in the vernacular Big-O is bandied as if it can mean many different things.

But I've never heard anyone in the computer science domain (which is surely what we're talking about when we're talking about hash tables?) argue that Big-O, when used precisely, is anything but upper bound.


Yes. The upper bound of some function. That function is not necessarily the worst-case runtime.

Best-case, average-case, and worst-case can all have upper bounds.

If the best case runtime as a function of data size is n/2, average case is n^2 + 5n, and worst case is e^n + 5n^2, these functions are asymptotically upper bounded by n, 2n^2, and 2e^n respectively, so they are big-O (incidentally they are also big-Omega and big-Theta) of n, n^2, and e^n respectively.

Big-O means the same thing in CS as it does in mathematics; this is not an issue of different domains using terms in different ways.

See for example: https://stackoverflow.com/questions/42144251/why-is-big-oh-o... , https://stackoverflow.com/questions/3905355/meaning-of-avera...

A famous example is Quicksort, which is O(n log n) average case, but not worst case, where the tightest big-O bound is O(n^2). Any reputable reference you look up will agree with this. From Wikipedia, for example: “Mathematical analysis of quicksort shows that, on average, the algorithm takes O(n log n) comparisons to sort n items. In the worst case, it makes O(n^2) comparisons, though this behavior is rare.”


Quicksort has O(n log n) worst case complexity when you use https://en.wikipedia.org/wiki/Median_of_medians just FYI.


When referring to the O() of a particular algorithm without further specification, upper bound is what is meant. When discussing the asymptotics of quicksort you say that it is O(n log n) in the average case, not that it is O(n log n) in general.


> upper bound is what is meant

I tried to make this clear in my comment - (asymptotic) "upper bound" is a concept applicable to any function from N to R. So do you mean the upper bound of the worst case, the upper bound of the average case, the upper bound of the best case, the upper bound of the worst case memory usage, the upper bound of the median price of eggs in China on the nth day since 1970, or something else?


Ya, I just mean that without further qualification, it's the upper bound of the whole function that's being referred to, which is equivalent to the upper bound of the worst case.


No, you still don't understand. The worst-case runtime and the average-case runtime are different functions. There's no one "whole function" that both are part of.


They are not different functions. They are the same function. The average case is a special case of the general function in which some of the terms collapse.


I don't know what to say at this point. Any reputable reference will disagree with you, including the two Stack Overflow links I posted, or even the Wikipedia article on big-O notation, which doesn't say anywhere that it has to do with the worst case of algorithms.

> They are not different functions. They are the same function.

Sure, if you expand the term "function" to include "random variable", which is legitimate to do in some fields of math... but we're talking about deterministic functions from the natural numbers (or real numbers) to the real numbers here. Which is what big-O notation is defined for.

Can you write down the deterministic function from N->R giving the "runtime" of some algorithm? (I mean the actual function, not big-O).


This does not match my experience at all. I have never (including in academia) encountered use of Big O which implicitly referred to the worst case - it has _always_ referred to the average case. That being said, usage has rarely been implicit in my experience - average or worst has almost always been explicitly stated.


You're wrong about this. g in O(f) is defined as: there exist c, m such that for all n > m, c*f(n) > g(n). Big O notation doesn't care about algorithms or worst cases.


No, it doesn't care about worst cases. But the asymptotics of the generalized function of the runtime are by necessity its worst case.


It is an upper bound on the growth. But _what_ you're upper bounding is not specified.

Big-O notation works on mathematical functions. In the usual case, the function you're upper bounding is the worst-case run time of a procedure for a given input size. But that's _far_ from the only case.


Big O can be the upper bound for the amortized worst case, though.

The other notations (small o and omega) are about scaling (small o doesn't allow constant factors) and pinching (omega means there are two constant factors, one of which forms a lower bound and the other an upper bound).

True, the word amortized (i.e. average) is left out here, which still makes people's usage imprecise. But there is a decent argument that amortized is more meaningful than worst-case. Because amortized gives expected throughput and latency, whereas worst-case says very little about throughput and only informs you on the 99th percentile of latency.


In colloquial usage among programmers "Big-O" almost always really means Big-Theta.


This is true, but it's not related to what we're talking about in this thread.


* In most databases you can explicitly specify the kind of index, for example Postgres: CREATE INDEX name ON table USING hash (column); https://www.postgresql.org/docs/9.1/indexes-types.html

* In databases, ORDER BY is regularly used, making hash indexes as a default a bad choice

* See also https://en.wikipedia.org/wiki/B-tree#Advantages_of_B-tree_us...: The B-tree uses all of the ideas described above. In particular, a B-tree: keeps keys in sorted order for sequential traversing; uses a hierarchical index to minimize the number of disk reads; uses partially full blocks to speed insertions and deletions; and keeps the index balanced with a recursive algorithm. In addition, a B-tree minimizes waste by making sure the interior nodes are at least half full. A B-tree can handle an arbitrary number of insertions and deletions.


Real implementations of B-Trees (i.e. B+Trees) don't preserve the traditional guarantee about nodes being half full. However, space utilization is still a big advantage.

You can't do efficient retail deletes with hash indexes in the presence of duplicates because it isn't possible to append a unique-ifier (e.g. table TID) to the key, for exactly the same reason as it isn't generally possible to support multi-column indexes. In the case of Postgres hash indexes, it doesn't matter as much because VACUUM doesn't currently do retail deletions in any case. Postgres B-Tree indexes should support retail deletion at some point in the future, but that hasn't happened yet.

Also, deleting whole B-Tree pages (removing them from the tree entirely) can be thought of (and implemented) as an operation that is symmetric to page splits. Page deletion is especially complicated at the best of times, because it's particularly hard to reason about race conditions (Postgres has many layers of indirection to get this to work). I can't imagine anything that would work as well, but with hash indexes.


As a programmer, I always use an ordered map (Java TreeMap) in preference to hash tables. My reasons:

- It is much easier to write a transparently correct comparison method for a new object that I want to use as a key. With hash tables you need to worry that your hash function might not be properly distributed.

- With hash tables you have to worry that your language's under-the-hood resize/reallocate algorithm works well, or you need to make sure you choose the capacity and load factor numbers correctly. Even if things work as expected, you might end up paying a penalty because, for example, the array backing the hashtable was auto-resized to 128 million elements when you only need 70 million.

- Java TreeMap has lots of cool utility methods like pollFirstEntry (pull off the first K/V pair), higherMap (view of map that only includes pairs with higher key), etc.

- TreeMap key traversal is deterministic!! The worst bug I encountered in years of programming work was caused by non-determinism in HashMap traversal (you could blame Java's implementation here, I suppose).

- The O(1) vs O(log N) performance difference is minor to the point of being undetectable in every case I've ever encountered.

- When you do something with the data in the map, you often want to sort by key anyway: might as well have the sort taken care of by the data structure.


You should reach for the best tool for the job instead of giving job-specific reasons for your general choice. Without going into detail explaining how your bullets either don't apply to most uses, prematurely code for a future case that may not exist, dismiss performance, reference other languages as if language implementation is not a factor in the choice, etc... suffice to say they both have their place and in general you probably shouldn't reach for an ordered map over a unordered one without reason. I.e. it shouldn't be your default choice in most languages and it'd be bad advice to give other programmers.


An ordered map is much better as a default simply because it can do everything that an unordered map can. Choosing to use an unordered map instead is a premature optimization that is nearly always unnecessary.


IMO, premature optimization doesn't apply to cases where you are making general decisions about policy or defaults, otherwise you can get into a "death by a thousand cuts" situation where none of your functions are slow individually but the whole program is slow (i.e. big overheads). To me, premature optimization means don't optimize a specific function unless you know it is slow. I'd also say that premature optimization doesn't apply in cases where both alternatives are equally easy to write.


On the other hand, if you pick a TreeMap--- odds are it will be a non-issue either way.

If it does show up on your profiles, it's easy to reconsider and say "I don't think I'll ever access this in order, after all..." and change it and see if you see a benefit.


> I'd also say that premature optimization doesn't apply in cases where both alternatives are equally easy to write.

I think that premature optimisation applies not only to ease of writing, but even more to ease of debugging.


Prematurely optimizing for death by a thousand cuts is what you're proposing.


That argument works both ways, though.

An unordered map has the narrower interface and generally better performance, so it's much better as the default. Choosing to use an ordered map instead is YAGNI interface bloat that is nearly always unnecessary.


Premature optimization concerns effort. We do a lot of things when writing code that could be considered premature optimization by your vague definition. By this definition too, staying in Java land, I should probably not instantiate my ArrayList or HashMap with a capacity even if I know it...technically premature optimization.

All else equal wrt simplicity, choose the highest performing option that contains the features you need. Dogmatic premature optimization rules are unhelpful. We can apply reasonable thinking, and we should.


> Premature optimization concerns effort

And the OP is correct that writing a correct comparison function needed for an ordered map is easier than writing a correct hash function that is properly distributed. So the ordered map is lower effort for that reason alone, although the other properties the OP lists also produce valid reductions of effort.


The problem with choosing "the best tool for the job" for every job is that now you have a lot more unknown unknowns and they don't generalize when you find them.


Especially when an ordered map is backed by a binary tree (TreeMap), which is the worst data structure. Btree ordered maps good, hash tables good (if point queries are sufficient); binary trees have terrible cache locality.


Cache locality issues can be fixed by using a good allocator (or a good compacting GC).


How can an allocator optimize the layout of a dynamically sized structure that's being changed without compacting?

Now, compacting is an O(n) memory move plus additional calculation to figure out optimal-ish layout.

(Can be amortized, but it's same as amortizing hash table resizing. Optimal layout cannot be decided without profiling, and that's expensive.)


doesn't this force you to define a comparator for everything though?


As opposed to defining a hash code for everything, yes. Parent asserts that writing obviously-correct comparators is easier.


IDEs have had good equals/hashcode auto-generation mechanisms for years.

Java itself introduced Objects.hash(Object...) in 1.7 https://docs.oracle.com/javase/7/docs/api/java/util/Objects....

It's never really been an issue at all to implement equals/hashcode, which you're required to do for effectively all Collections-based usage anyway.


funny enough, a hash code defines an order on a set


Is it in any sense a useful order though? You can assign any order to any set of objects, but you'd want it to be useful. A hash - the more hash-ey it is, the less the order means AFAICS.


For the purpose of making a tree data structure, all linear orderings would be equally useful. Hashing only provides a partial order, so it's important to choose a hash function that won't assign the same hash to large batches of items.

But it only needs to have this property for hashes of one particular size, whereas a hash map requires a function that behaves well for a variety of hash sizes. So there's an opportunity for using a cheaper hash function.


For one, a database may receive a query such as "WHERE x >=10 AND x <=100". I.e. having an ordered index is useful for accessing ranges, whereas a hash lookup is always just for a single entry. The question then becomes - why are lookups in RAM more likely to be single lookups rather than ranges (and stats on ranges)? Partly at least because DBs provide a query language that make range based queries easy to do, and just because that's where most business data sits (most of it necessarily needs to be in a persistent store).


Everything is driven by the business case. Is there a business case to optimise for range queries? If so, optimise for that. Is there a business case for individual record queries? Then optimise for that. Is there a case for both? Then optimise for that.

The biggest lie of the 21st century is convincing JavaScript/Ruby/Python/Clojure/whatever programmers that web development is something sexier/holier/worthier than boring old CRUD Oracle Forms database development.

It's exactly the same but with fonts and animated transitions.


Oracle Forms vs modern web development is a little off topic, but I’ll bite.

Could gmail have been built in Oracle forms, how about slack? How about Asana? How about an LMS such as blackboard, moodle, or D2L? Can you name any popular mainstream product that could be built upon and run off the Oracle forms product?

Oracle forms will get you 80% of the way there, and the final 20% will be impossible.


> Could gmail have been built in Oracle forms, how about slack? How about Asana? How about an LMS such as blackboard, moodle, or D2L? Can you name any popular mainstream product that could be built upon and run off the Oracle forms product?

Yes, it could.

Not going to judge whether it's a good idea or not, but it most certainly could be implemented.

I suspect you're not aware just how much capability relational database management systems have.


Don't know why this post was voted down. I used to work on Gmail, which uses a totally proprietary storage and database stack, like everything else at Google.

One of the most insightful design docs I ever read was an exploration by an engineer of how GMail 1.0 could have been implemented with commodity tech and how it'd compare cost wise. The rather sobering conclusion was that it would have had similar functionality and been cheaper to develop/run.


Sometimes databases use hash tables for indexing.

This is a natural match for a key-value store.

In particular, the following KV stores offer hash table-based indexing:

-Berkeley DB [1]

-Tokyo Cabinet (and its successor, Kyoto Cabinet) [2]

-Bitcask (used in Riak) [3]

In many benchmarks, the hash-based systems perform better than tree-based ones.

The real advantage of trees, as Berkeley DB manual says:

> Btrees are better for range-based searches, as when the application needs to find all records with keys between some starting and ending value. Btrees also do a better job of exploiting locality of reference. If the application is likely to touch keys near each other at the same time, the Btrees work well.

And there is the actual answer to the question in the title.

[1]https://docs.oracle.com/cd/E17275_01/html/programmer_referen...

[2]https://fallabs.com/tokyocabinet/perldoc/

[3]http://highscalability.com/blog/2011/1/10/riaks-bitcask-a-lo...

-----------------------

TL;DR: Why? Because you can do range queries and traverse sequential elements faster.

Don't need that? Use hash.


Came here basically to say this; you will also find that well-architected HBase/Big Table setups often work this way as well.


Databases use hashes all the time, and ordered indexes are very frequently used in programming (e.g. the C++ STL).

I think ordered indexes are a good default for use in programming languages. They are deterministic and do not have strange corner cases that can be exploited.

E.g. the order when iterating over elements in a hashtable is pretty much random, whereas the order of elements in an ordered tree based data structure is deterministic. That can very easily lead to the output of a program being nondeterministic, e.g. when serializing the contents of a hashtable. This is a frequent cause of compilers having non-deterministic builds.

I like my programs to be deterministic and want to precisely control where randomness enters the execution.

Hash collisions can lead to pathological performance. This can be exploited for DOS attacks, unless the hash data structure in question uses good randomization of the hash function. E.g. the rust hash data structures do a good job with this, but the issue just does not exist for an ordered tree based data structure.

The performance difference is often very small unless having very complex keys or very large sets/maps. A hash based data structure is a valid choice for me once the performance actually matters.


I agree, but instead I would use the term "path-dependent" for hash tables rather than "non-deterministic" because, after all, unless you salt your hash functions, there is really no randomness anywhere. What you think of as non-deterministic really is deterministic but it is determined by the precise order of insertion. This is sometimes also called memorylessness.

Even though tree-based data structures give you an iteration order that is predictable and useful, the structure itself is still not fully memoryless. For example in a red-black tree if you insert an element smaller than all preexisting elements and then immediately delete the minimum, the tree could be different.


> I agree, but instead I would use the term "path-dependent" for hash tables rather than "non-deterministic" because, after all, unless you salt your hash functions, there is really no randomness anywhere. What you think of as non-deterministic really is deterministic but it is determined by the precise order of insertion. This is sometimes also called memorylessness.

You are right that non-deterministic is not strictly correct for non-salted hash functions. That makes it even more annoying, since it is deterministic enough that you might accidentally rely on the order. It is close enough to nondeterministic since iteration order can change when the hash function changes, but also when some minor internal parameter of the hashtable changes.

> Even though tree-based data structures give you an iteration order that is predictable and useful, the structure itself is still not fully memoryless. For example in a red-black tree if you insert an element smaller than all preexisting elements and then immediately delete the minimum, the tree could be different.

Yes, I am aware and a very big fan of data structures that are actually fully memoryless, such as all kinds of binary tries, radix trees, or just wrapped sorted arrays. The latter are totally underrated, since they have very compact in memory representation (just the elements, single piece of heap) and have very competitive lookup performance.

They of course totally suck when you want to do random single element updates, but then just use something else...

I am going to release an entire collections library based on sorted arrays for rust over Christmas...


If you like sorted arrays and Rust, you might want to take a look at my sorted-vec crate, which has O(sqrt(N)) insertions and deletions: https://github.com/senderista/sorted-vec. In my tests it takes less than half the memory of BTreeSet.

PS: It isn't difficult to devise hash tables with path-independent iteration order. You can simply order the keys by their hash code values. See e.g., my implementation of bidirectional linear probing: https://github.com/senderista/hashtable-benchmarks/blob/mast....


Yes, saw that one. Neat.

My stuff is even simpler. It is just a flat array in memory and should just not be used when you want random single element inserts or deletions. But I find that rather uncommon anyway in many use cases. I made sure that building a new collection from an iterator is fast.

What I do use is a minimum comparison sort that makes very common cases very fast in terms of number of comparisons, and some unsafe stuff to allow in place updates to prevent allocations.

Not quite ready for prime time yet, but you can see the basic ideas being used in this data structure:

https://docs.rs/range-collections/0.1.0/range_collections/ra...

You could do a deterministic hash table by using a cryptographic hash function like sha256 and then use a binary trie of the hashes. No need to deal with collisions as far as we know. The hash is expensive, but if you need the hash for something else anyway, this is pretty neat...


I have played with hash tries in the past for determinism/dynamism compared to hash tables. I think they're an excellent choice for integer sets, where you can use a reversible hash function to losslessly randomize the elements (I use the same idea in the integer hash table I linked above). If I didn't need graceful resizing, though, I would use either Cleary hash tables or bucketized cuckoo tables for this application.


The hash set and hash map for scala are actually based on hash tries, but with a non-cryptographic hash, so collisions still need to be handled.

I wrote efficient set operations for scala.collection.immutable.HashSet once. Lots of bit-twiddling, which is not as fun on the JVM compared to rust...


You can also implement a hash map as a tree ;-)


Been there, done that... :-)


One major issue with disk-based hash tables is rehashing. Actually, in-memory hash tables have the same problem, but most people don't notice it. Imagine your database suddenly duplicating itself because you inserted one row and overloaded a bucket.

Some hash tables that are designed to minimize the potential for DDOS attacks and improve the amortization of work use gradual rehashing (Go's hash table does this), but this is also not ideal for databases because the associated disk accesses are so expensive, and it requires significantly more disk space.

Many real world implementations just don't rehash unless you ask, which makes them pretty inefficient for most use cases.


I've seen it happen in Azure Blob Storage when a container hit about 10 million files; it took 2 hours.


Isn't this problem solved entirely by consistent hashing? Even for a non-distributed hash table, you can "distribute" your table over a collection of fixed-size blocks in memory or on disk. Maximum cost of rehashing is the cost of splitting a block, which can be made arbitrarily small. The cost can even be paid in parallel in non-pathological scenarios with a push-down/write-through splitting mechanism.


Yes, rehashing is very frustrating, especially because the default hash tables in most languages have that issue (e.g. std::unordered_map). It means that you may need 3x more available memory than what you actually want to store, because when the hash table resizes it has to allocate twice the size of the hash table, while keeping the original hash table in memory.


That doesn't happen in databases. The standard incremental hashing algorithms (linear and extendible hashing) were devised for databases. Both of these algorithms split only one bucket at a time and thus avoid large allocations.


I've always thought b-trees should be the starting "default" for everything, because their speed is far more consistent, you get extra features, and you don't have to make decisions in advance about allocating extra memory or worry whether the keys aren't well-distributed enough that you'll get horrible collisions. And how often do the extra few lookups for a tree materially affect the speed of your software?

Hash tables to me aren't a go-to -- they're an optimization I reach for only when I have a real performance reason where I have to use them, and my data is highly predictable so I have confidence how to set them up. But honestly, that just doesn't happen that often. I swear, I think I've used Bloom filters more often than hash tables in my life.


I suspect the author of this article and/or his audience has done most of their work in dynamic languages where "associative lookup data structure" means a hash table.


Of the data structures you mentioned, which are easier to work with? To implement? I code in Java and I usually use hashmaps as a go-to. They're a simple and fast way to store relationships between data.


std::map in C++ is way easier to work with than std::unordered_map (which didn't even exist until C++11).


Define "way easier". The interface, at least for day to day operations, is basically identical.


Supports more features like ordered traversal, generally faster in most cases (yup!), consistent api cost, doesn’t invalidate iterators on certain operations like moving between maps. That’s just off the top of my head.


In C++, if you’re using the STL, std::map (tree) usage is way more common than a hash (std::unordered_map). I couldn’t tell you if it's for a good reason though, or if it's just because “map” is a lot shorter than “unordered_map”.


std::unordered_map was only added in C++11, which explains why it's not used in legacy code. Habit, inertia, and a desire to keep codebases internally consistent explain why it's less common now.


Developers primarily use sorted maps because the sorted map is the default map type in C++, by name. (It would be different if you had map/sorted_map instead.)

The real question is why C++ designers chose to prioritize sorted maps (RBTrees) over unsorted maps (hash maps).

I don't think it's a matter of speed though. I think it's because RBTrees are safer to use. Things can go really wrong with a hashmap if you have a poor hashing function (worst case search is O(n) vs. O(log(n)) for an RBTree).


Also std::unordered_map suffers from a design flaw that forces a poor implementation.


Depending on what you required, it can still be significantly faster than std::map. Any tree-based data structure is going to play havoc with cache. Also, it will perform an allocation for every single node inserted, which can become rather expensive (unless you're using some customized allocator for this).

Honestly, sorted vectors can often be a good replacement, depending on the workload, if you must have ordered data (this is what boost flat_map uses). All 3 (map, unordered_map, sorted vector) can have their uses, depending on what you are doing with the data.

If you need a better performing hash map, there's always Abseil as well.
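
A tiny sketch of the sorted-vector idea mentioned above (similar in spirit to boost's flat_map, but written from scratch here purely as an illustration):

  // Sorted vector as an ordered "map": contiguous storage, binary search for
  // lookup, and cheap in-order/range traversal. Inserts are O(n) because
  // elements must be shifted, so it suits read-mostly workloads.
  #include <algorithm>
  #include <utility>
  #include <vector>

  using Entry = std::pair<int, int>;                  // (key, value)

  void insert(std::vector<Entry>& v, int key, int value) {
      auto it = std::lower_bound(v.begin(), v.end(), key,
                                 [](const Entry& e, int k) { return e.first < k; });
      if (it != v.end() && it->first == key) it->second = value;   // overwrite
      else v.insert(it, {key, value});                             // O(n) shift
  }

  const int* find(const std::vector<Entry>& v, int key) {
      auto it = std::lower_bound(v.begin(), v.end(), key,
                                 [](const Entry& e, int k) { return e.first < k; });
      return (it != v.end() && it->first == key) ? &it->second : nullptr;
  }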


Unfortunately unordered_map also requires lots and lots of allocations, because it's forced to use linked-list buckets.


Would you be willing to expand on that? I couldn't find any obvious well-known issues with a quick search.


Apparently the standard imposes a set of requirements that the implementation must abide by thus making it a lot slower.

[0] : https://www.reddit.com/r/programming/comments/5pwgtn/hash_ma...

[1] : https://stackoverflow.com/questions/42588264/why-is-stdunord...

[2] : https://www.youtube.com/watch?v=ncHmEUmJZf4


The interface forces an implementation as a map with linked-list buckets which is suboptimal for many use cases. So people who care about performance usually just implement their own hash map or use a library.


Ah, that makes sense. Thanks!


It probably varies by company, and by team. In my old hobby code there was a lot of std::map, but in my more recent professional experience I see a lot more absl::flat_hash_map and friends.


Is this something you’ve actually seen? I’ve often wondered.


Well, it was the case at my last three jobs and in most open source I've seen. It is anecdotal, though.


I would very simply summarize it like this: if you know which data you're looking for and say "give me this piece", then you want a hash table. If you don't know exactly what you're looking for and need to filter/search, then you use a B-tree. So in practice you want to use both structures regularly.

Not sure how well that general principle really applies or if it's just a quirk my brain came up with, though. If you look at distributed hash tables, for instance, they are hash tables on a bus network in some sense, but they are used for search/filter tasks over multiple nodes.


This isn't a bad article, but I would be interested in seeing a follow-up where, instead of guessing, the author asks some database professionals. They are out there! We don't have to reinvent from first principles. I would guess this question could be asked and answered by known authorities on e.g. the PostgreSQL mailing lists.


I think part of it is inertia. There was a time when C++ only had a tree based map, but eventually it got a hash map too.

And part of it is that a database needs a sorted index. Because a lot of common queries would be awful without one. And when you have a sorted index already, a hash index becomes a nice-to-have.


Adding to that, a database typically has to be ready for all kinds of queries. But when choosing a data structure in a program, one often knows a great deal about exactly what queries will show up.


Also, the log N constant is much smaller on disk than in memory, because the branching factor is really high.


I believe it's more due to the fact that sorting and less-than/greater-than operators come up way more often in databases, so an ordered index is a much better fit for those use cases.

And the other reason is MVCC. It's easy to maintain multiple snapshots of a b-tree index and swap a root's ptr to point to a new sub-tree.
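
A toy sketch of that path-copying idea, using an unbalanced binary search tree instead of a real B-tree, purely to show the "copy the path, publish a new root pointer" mechanism (not how any particular database implements it):

    // Toy sketch of MVCC-style snapshots via path copying.
    // Unbalanced BST instead of a real B-tree; shared_ptr shares untouched subtrees.
    #include <cstdio>
    #include <memory>

    struct Node {
        int key;
        std::shared_ptr<const Node> left, right;
    };
    using Tree = std::shared_ptr<const Node>;

    // Insert returns a *new* root; nodes off the insertion path are shared with the old tree.
    Tree insert(const Tree& t, int key) {
        if (!t) return std::make_shared<const Node>(Node{key, nullptr, nullptr});
        if (key < t->key)
            return std::make_shared<const Node>(Node{t->key, insert(t->left, key), t->right});
        if (key > t->key)
            return std::make_shared<const Node>(Node{t->key, t->left, insert(t->right, key)});
        return t;  // key already present
    }

    bool contains(const Tree& t, int key) {
        if (!t) return false;
        if (key < t->key) return contains(t->left, key);
        if (key > t->key) return contains(t->right, key);
        return true;
    }

    int main() {
        Tree v1 = insert(insert(insert(nullptr, 5), 2), 8);
        Tree v2 = insert(v1, 3);  // "commit": readers holding v1 still see the old snapshot

        std::printf("v1 has 3? %d\n", contains(v1, 3));  // 0
        std::printf("v2 has 3? %d\n", contains(v2, 3));  // 1
    }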


Sorting in a database makes sense even when you never use ORDER BY. You can do streaming joins/unions, and implement streaming, non-blocking aggregation operators (because you know when you've seen the last row for a grouping key, so you can output the result for that group immediately and evict its state from memory). Query plans will sometimes sort inputs (or intermediate results) for these reasons even when the final output is unordered; this is known as "interesting orders".
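
A minimal sketch of the streaming-aggregation point (the hardcoded rows stand in for a sorted index scan; real operators also handle spilling, NULLs, etc.):

    // Sketch: streaming (non-blocking) GROUP BY over input that arrives sorted on the grouping key.
    // The vector below stands in for rows streaming out of an ordered index scan.
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Row { std::string key; long value; };

    int main() {
        std::vector<Row> sorted_scan = {
            {"apple", 3}, {"apple", 5}, {"banana", 1}, {"cherry", 2}, {"cherry", 4}};

        std::string current_key;
        long running_sum = 0;
        bool have_group = false;

        for (const Row& row : sorted_scan) {
            if (have_group && row.key != current_key) {
                // Key changed, so no more rows for this group can appear: emit the result now
                // and evict its state. No hash table of all groups is ever needed.
                std::printf("%s -> %ld\n", current_key.c_str(), running_sum);
                running_sum = 0;
            }
            current_key = row.key;
            running_sum += row.value;
            have_group = true;
        }
        if (have_group) std::printf("%s -> %ld\n", current_key.c_str(), running_sum);
    }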


Correct. There are many parts of the DB planner that check whether the output is sorted in order to take a more optimal path.


AFAIK only LMDB implements MVCC with immutable pages and path copying. Everybody else uses multiple row versions distinguished by transaction ID. That works perfectly fine with B-trees or hash indexes.


Indexes on things that’re correlated with time (ID, timestamp) tend to concentrate recently added data towards the “end” of the table/index, which is a big help when you have years of data but mostly people are asking about the last few weeks.

It gets a couple sentences in the post, but prefix searches are key (ha!): a sorted index on (foo, bar, baz) helps not only searches on all three attributes but also foo=x and bar=y or just foo=x. It can also be useful when you want to look things up by a prefix of an individual column: filter to a given day or month using a timestamp index, or filter to errors reported from client version 3.2.x. As a “covering index” it can somewhat help with queries that limit to foo=x and do something with bar (filter on it, return it, sum it, etc.). Finally, it helps with multiple scenarios for aggregation, not only searching: a timestamp index can let you get a count of orders by day, hour, or second without needing to make another temporary table/index.

Often the crucial thing for keeping a DB humming is not getting the last couple percent out of already-fast operations but making sure you don’t have any disastrous queries, so an index being versatile for many kinds of search is especially valuable.

Related to that last point, databases are trying to prepare for queries that haven’t even been written yet, while when you’re coding some algorithm that involves a hashtable, you already know exactly what lookup you’re going to do a few lines later. That on its own is going to push you towards more flexible types of indexing.

(Crossposted from https://lobste.rs/s/68dgox/, where other folks have some good comments.)
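
An in-memory analogue of the prefix-search point, using std::map with a tuple key standing in for an index on (foo, bar, baz); the column names and values are made up:

    // Sketch: std::map keyed by a tuple as a stand-in for a composite index on (foo, bar, baz).
    // "WHERE foo = 'x'" becomes a lower_bound plus a forward scan until foo changes.
    #include <climits>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <tuple>

    int main() {
        using Key = std::tuple<std::string /*foo*/, int /*bar*/, int /*baz*/>;
        std::map<Key, double> index = {{{"x", 1, 10}, 1.0}, {{"x", 2, 20}, 2.0}, {{"y", 1, 30}, 3.0}};

        // Start at the smallest possible key with foo == "x" and scan forward in key order.
        const std::string foo = "x";
        for (auto it = index.lower_bound(Key{foo, INT_MIN, INT_MIN});
             it != index.end() && std::get<0>(it->first) == foo; ++it)
            std::printf("(%s, %d, %d) -> %.1f\n", std::get<0>(it->first).c_str(),
                        std::get<1>(it->first), std::get<2>(it->first), it->second);
    }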


I was going to guess that a reason databases prefer ordered tables is that, for two large tables (much bigger than RAM), equijoins are easy to do efficiently: just do a merge.

I wasn't aware of any obvious way to do the same with two on-disk hash tables. But it appears some databases (MySQL?) do this.

So now my question is how.

Use the same hash function for the two tables to be joined, then loop over buckets, matching corresponding buckets between tables? And then what, treat those subsets similar to a join of two unsorted tables? I guess that's okay as long as each of your hash buckets only has very few things in it. Is there a better way?


There are a couple of common hash join algorithms that databases use: symmetric hash join and asymmetric hash join. The latter is actually simpler. Take the first (smaller) table and read it into a hash table (assuming it can fit in memory for simplicity's sake). Then stream all rows from the second table, looking up the join key for each input row in the hash table containing the first table. If you get a match, emit an output row. For symmetric hash join, you stream both input tables simultaneously: for each input row, you first check the other input table's hash table for a match (emitting an output row for each match), and then add the input row to that input table's hash table. (When you've exhausted one of the input tables, you can actually delete the other input table's hash table from memory, since you won't produce any more matches from it.) Again I've oversimplified in assuming that the input tables fit in memory, but the algorithms for spilling to disk are pretty simple (basically hash the inputs on the join key into disk partitions which do fit into memory and build hash tables from each partition).

Parallel databases also use hash join for performing distributed joins: each node hashes its inputs on the join key to distribute onto the other nodes, so that all inputs with the same join key end up on the same node. Then you can apply one of the hash join algorithms above.
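
For concreteness, a sketch of the asymmetric (build/probe) version, assuming the smaller side fits in memory and ignoring spilling; the tables and join keys are made up:

    // Sketch: asymmetric (build/probe) hash join, smaller table assumed to fit in memory.
    // Table contents are made up; partitioning/spilling to disk is ignored.
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct BuildRow { int key; std::string payload; };
    struct ProbeRow { int key; double amount; };

    int main() {
        std::vector<BuildRow> small_table = {{1, "alice"}, {2, "bob"}};
        std::vector<ProbeRow> big_table = {{2, 9.99}, {1, 4.50}, {3, 7.00}, {2, 1.25}};

        // Build phase: hash the smaller table on the join key.
        std::unordered_multimap<int, const BuildRow*> ht;
        for (const BuildRow& row : small_table) ht.emplace(row.key, &row);

        // Probe phase: stream the larger table, emitting one output row per match.
        for (const ProbeRow& probe : big_table) {
            auto [lo, hi] = ht.equal_range(probe.key);
            for (auto it = lo; it != hi; ++it)
                std::printf("key=%d payload=%s amount=%.2f\n",
                            probe.key, it->second->payload.c_str(), probe.amount);
        }
    }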


Since the case where the smaller table fits in RAM is easy and obvious, it's the other case that I'm interested in.

It sounds like you might be saying when two large hash-based tables need to be joined, you're basically starting from scratch in that you're not taking advantage of the existing hash data structure. (At least that's how I interpret "build hash tables", in contrast to somehow using what already exists on disk.)

This sounds pretty slow to me compared to a merge join of ordered lists. There seems to be a lot of I/O (including writes, temp files, and a few passes) whereas with a merge join it's just reads and they're more or less sequential I/O.

So this would be a reason for databases to lean toward ordered storage. But only if disk-based hash joins are as slow as I think they are, which is the part I'm not sure about.


It might not be clear that symmetric hash join is a streaming, non-blocking operator (i.e., it can start producing results immediately without waiting for a hash table to be built). Unlike merge join, though, it requires all input rows to be kept in memory (at least until the other table is exhausted).

I suspect query optimizers haven’t taken hash indexes into account for joins because they’re so rare in practice, but it’s probably worth considering.


Most extant relational databases were originally designed pre-internet as a way to store and batch-process business records, rather than optimizing for things like single-record latency.

For the batch processing use case, you usually want to be able to do range queries with the interior nodes of the tree in memory and the actual data ending up resident on disk.

For the single-record lookup case (i.e. serving a user's data to them via a webpage), something like linear hashing actually does have better performance characteristics, although trees are usually good enough.


I think people just take the "hash tables are O(1) and trees are O(log n)" part too seriously. But you have to consider the cost of computing the hash (in real life it's sometimes worse than the tree traversal), rebalancing, the loss of ordering, and programmer issues (some just ignore the fact that hash collisions do happen). At least in my case, I have found that skip lists have similar performance to hash tables (ops/sec for real-life loads, not infinite items) and use less memory (at least the Open vSwitch implementation does).


Even if you need range queries when programming, it's rare that you need them over a dynamic dataset. This means you can get better constant factors just by sorting your data and binary searching. You only need trees if you need to insert/delete/update items in the middle.

Most programming tasks work with static datasets because memory is volatile. We are always loading the data from somewhere else and usually just once because reactively updating the in-memory collection is nontrivial to hook up. On the other hand, this is the default for databases.


That's just a wrong assumption. If all you need are '=' or 'IN' queries, you are better off with a hash index rather than an ordered index. Hash-based partitioning is available in many commercially used DBs as well.

I would also like to point out that radix based storing is not a bad idea at all, and these data structures are never completely exclusive of each other :-)

I tend to disagree with the author once we get into the details, but the article is a good read for someone who wants to understand how DBs, B-trees, and scans work.


I use two structures for persistent data: Extendible hashing for unordered data and skip lists for ordered data. Extendible hashing is great for looking up an object given its UID. Skip lists are easier than B-trees (to me) for everything else. I also use skip lists occasionally for in-memory data because they're just so darn handy. I wish both were better known.


For densely packed IDs (32 or 64 bit integers...) I think it's even faster to use a keyed trie structure, as it can use fast bit-shifting operations in contrast to using standard B+-trees.

Also, this can be used as a persistent, on-disk / flash-drive data structure, or as a hash array mapped trie for HashMap implementations in memory (Scala or Clojure, for instance).

For a persistent data structure (both in the sense of functional programming and of being stored on non-volatile disk), this is implemented for instance in SirixDB (https://sirix.io), a temporal data store prototype I'm maintaining and developing :-)
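
A toy illustration of the bit-shifting idea: a fixed-stride radix trie over 32-bit keys, 8 bits per level (just a sketch, not SirixDB's actual on-disk layout):

    // Toy fixed-stride radix trie over 32-bit keys (8 bits per level, 4 levels).
    // Child selection is a shift-and-mask instead of key comparisons; in-memory only.
    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <memory>
    #include <optional>

    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 256> children{};
        std::optional<std::uint64_t> value;  // only set on the leaf level
    };

    void insert(TrieNode& root, std::uint32_t key, std::uint64_t value) {
        TrieNode* node = &root;
        for (int level = 3; level >= 0; --level) {
            std::uint8_t index = (key >> (level * 8)) & 0xFF;  // pick 8 bits per level
            if (!node->children[index]) node->children[index] = std::make_unique<TrieNode>();
            node = node->children[index].get();
        }
        node->value = value;
    }

    std::optional<std::uint64_t> lookup(const TrieNode& root, std::uint32_t key) {
        const TrieNode* node = &root;
        for (int level = 3; level >= 0; --level) {
            std::uint8_t index = (key >> (level * 8)) & 0xFF;
            if (!node->children[index]) return std::nullopt;
            node = node->children[index].get();
        }
        return node->value;
    }

    int main() {
        TrieNode root;
        insert(root, 12345, 999);
        if (auto v = lookup(root, 12345)) std::printf("12345 -> %llu\n", (unsigned long long)*v);
    }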


See also:

Extendible hashing—a fast access method for dynamic files

https://dl.acm.org/citation.cfm?doid=320083.320092


> Why is the "default" choice different between programs and databases, when at the end of the day they both do the same thing: accessing data for our code?

Because databases are designed to be efficient at scale. A hash table just doesn't have the functionality that a database with a B-tree variant does.


I never understood using hash tables for anything but difficult-to-compare values like long strings. If you're working with numbers, trees seem like a much nicer data structure.


Databases are part of programming.

Hash tables are easier to use on arbitrary data because they don't require Comparator to be defined.

Trees are faster when data is large enough that performance matters.


Hashing is a technique used in databases also.


Databases use hashing all the time. One popular join strategy is the hash join, which is essentially repeated hash table lookups.


Thanks for posting this article. The link to the B-tree monograph itself is gold. Kudos!


I thought it was because of spinning disks or tapes.


Adding an index is O(n log n) time to build, not O(n).


Databases need sequential access because they often scan tables and indexes. Unless a SELECT query uses equality, the database engine needs to test many values to see if they match predicates.



