
Why databases use ordered indexes but programming uses hash tables - matt_d
https://www.evanjones.ca/ordered-vs-unordered-indexes.html
======
geophile
I think this article misses two points more important than anything else
mentioned.

1) B-trees are dynamic while hash tables are not. A B-tree grows gracefully
across orders of magnitude, and shrinks just as easily. Most hash tables do
not have this property. Extendible hashing and linear hashing do grow and
shrink gracefully, but I'm not sure how widely used they are.

2) In a database system, the concern was traditionally to minimize page
accesses. CPU is not negligible, and now there are main memory databases, as
well as RAM sizes _much_ larger than they were in the 70s and 80s. However, a
lot of the main ideas in traditional OLTP databases were developed a long time
ago, and the resulting architecture is still very much in use. So how many
page reads to do a key lookup in a hash table? One. End of story. How many
page reads to do a key lookup in a B-tree? Or to be more precise, a B+-tree,
which has a much higher branching factor? Probably one. The root, and most
likely the second level of pages stay cached, so it's really one page access
to get a 3rd-level page. And, of course, as mentioned in the article, B-trees
give you sequential access in key order, for when that's important.
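
(For a rough, made-up example: with 8 KB pages and ~16-byte separator entries,
an interior node holds on the order of 500 children, so three levels cover
about 500^3, roughly 125 million keys, while the top two levels are only ~500
pages, a few megabytes that easily stay cached.)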

~~~
cperciva
On modern hardware a key lookup in a hash table isn't necessarily a single
page read! Sure, it's a single _virtual_ memory access, but if that page isn't
in your TLB you need to read the page table... and if the page containing that
part of the page table isn't in the TLB you need to read _that_ page...

On modern hardware, _every_ memory access looks very much like a B-tree
lookup.

~~~
cesarb
> and if the page containing that part of the page table isn't in the TLB you
> need to read that page...

I thought page tables used physical addresses, which are accessed directly
without any TLB lookup (except with nested paging during virtualization, which
adds another level of indirection). Of course, the processor still needs to
read each level into the data cache(s) while doing a page table walk.

~~~
ithkuil
Page tables can have multiple levels. For example, on x86_64 you'd have 4
levels, i.e. the virtual->physical mapping is implemented as a tree of depth
4, where each leaf and internal node of the tree is 4 KB (the page size). (As
usual, the details are more complicated than that.)

~~~
cesarb
Yes, and each level of the tree has the _physical_ address of the next level,
so no TLB lookup is necessary (the top of the tree, in the TTBRn or equivalent
registers, is also a physical address).

~~~
ithkuil
oh yeah, I totally misread the comment: a page _fetch_ is required, but that
has nothing to do with the TLB indeed

~~~
narnianal
what's the difference? "page fetch" is not really something that can be
googled on the web.

~~~
ithkuil
the TLB is just one element of the process that resolves a virtual address
into a physical one: it's a cache that holds the most recently resolved
addresses.

When the virtual address you're looking to resolve is not present in that
cache (i.e. when you have a TLB miss), the CPU falls back to walking the page
table hierarchy. At each level of the tree, the CPU reads the physical address
of the next level of the tree and performs a _memory fetch_ of that page table
entry (in my previous comment I erroneously said a "page fetch", but it's
actually only a cache-line sized fetch), and repeats this until it reaches the
leaves of the tree, which contain the Page Table Entry holding the physical
address of the (4k) physical page associated with the virtual page address you
wanted to resolve.
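
As a rough illustration (Java, with a made-up address value), this is how a
48-bit x86_64 virtual address splits into the four 9-bit table indices plus
the 12-bit page offset that drive the walk described above:

    // Illustrative only: split a 48-bit x86_64 virtual address into the four
    // 9-bit page-table indices (one per level) and the 12-bit page offset.
    public class PageWalkIndices {
        public static void main(String[] args) {
            long vaddr = 0x00007f3a1c2d4e5fL;        // made-up virtual address
            long offset = vaddr & 0xFFF;             // bits 0-11: offset within the 4 KiB page
            long pt   = (vaddr >>> 12) & 0x1FF;      // bits 12-20: page table index
            long pd   = (vaddr >>> 21) & 0x1FF;      // bits 21-29: page directory index
            long pdpt = (vaddr >>> 30) & 0x1FF;      // bits 30-38: page directory pointer index
            long pml4 = (vaddr >>> 39) & 0x1FF;      // bits 39-47: PML4 index
            System.out.printf("PML4=%d PDPT=%d PD=%d PT=%d offset=0x%x%n",
                    pml4, pdpt, pd, pt, offset);
        }
    }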

------
michalc
Nit-pick / pet-peeve regarding Big O notation!

> Hash tables provide constant time O(1) access for single values, while trees
> provide logarithmic time O(log n) access. For single value lookups, this
> means hash tables are faster,

> for small n, the hash table is better, but for large n, the cost of that
> rare scan dominates, and the tree is better

The Big O notation describes how the number of operations scales as the size
of the input tends to infinity: the asymptotic complexity. This means that an
O(1) algorithm can include some fixed number of operations, even enough to add
_years_ onto the runtime, and still be classed as O(1).

Essentially I don’t think it’s really enough to consider the asymptotic
complexity of algorithms to compare practical runtimes. You need to know what
the smaller-order / constant terms are.
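
(A hedged back-of-the-envelope example: if a hash lookup costs a flat 500 ns,
say for hashing a long key plus a cache miss or two, while a balanced tree
costs 20 ns per level, then the O(log n) tree is actually faster for any n up
to about 2^25, i.e. tens of millions of entries, despite the "worse"
asymptotic bound.)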

~~~
amelius
Also, most hash tables are not really O(1). The worst-case scenario is more
like O(n).

~~~
Darmani
In some senses, no hash tables are O(1).

Hash tables are bounded below by the speed of arithmetic. A trillion-bit
number takes a while to multiply. You may have never touched a trillion-bit
number, but it's called "asymptotic" for a reason.

You only get O(1) if you use a model of computation in which arithmetic is
constant time. It's simple to work with, even though it's not realistic, and
opens up scenarios where you can encode a really complex operation into
adding/multiplying some huge numbers, and then claim the overall computation
is still fast.

Some theorists use a model of computation in which, if your input size is n,
multiplying log-n-bit numbers is O(1), but multiplying an n-bit number is
O(n). I have no idea how this is well-defined.

~~~
dmurray
Constant-time arithmetic is realistic in most normal situations, like
evaluating the performance of a hash table implementation. Perhaps your
implementation uses 32 bits for the hash function, and can only scale to 4
billion slots. Maybe your implementation of a vector has the same limitation.
In general, the reason you'll have trouble trying to store 2^trillion items is
not because you chose a data structure with worse asymptotic performance than
you expected, it's because you run out of space in the universe in which to
store them.

So saying such a structure has O(1) asymptotic access time is not perfectly
true in a mathematical sense, for unrealistic numbers. But it's certainly
realistic.

~~~
jhanschoo
Yes, it is, but the question is how do you make this notion of "realistic"
formal and rigorous. The usual model for complexity analyses, the TM, can't do
constant-time arithmetic. So we're interested in a more powerful model.

I proposed a model in a sibling comment that seems to be the one most people
use for analyzing algorithms.

~~~
johncolanduoni
A Turing machine can do constant-time arithmetic (or more precisely,
arithmetic bounded by a constant time) for operands of bounded size, as was
suggested above.

------
chupa-chups
* In most databases you can explicitly specify the kind of index, for example Postgres: CREATE INDEX name ON table USING hash (column); [https://www.postgresql.org/docs/9.1/indexes-types.html](https://www.postgresql.org/docs/9.1/indexes-types.html)

* In databases, ORDER BY is regularly used, making hash indexes as a default a bad choice

* See also [https://en.wikipedia.org/wiki/B-tree#Advantages_of_B-tree_us...](https://en.wikipedia.org/wiki/B-tree#Advantages_of_B-tree_usage_for_databases): "The B-tree uses all of the ideas described above. In particular, a B-tree: keeps keys in sorted order for sequential traversing; uses a hierarchical index to minimize the number of disk reads; uses partially full blocks to speed insertions and deletions; and keeps the index balanced with a recursive algorithm. In addition, a B-tree minimizes waste by making sure the interior nodes are at least half full. A B-tree can handle an arbitrary number of insertions and deletions."

~~~
petergeoghegan
Real implementations of B-Trees (i.e. B+Trees) don't preserve the traditional
guarantee about nodes being half full. However, space utilization is still a
big advantage.

You can't do efficient retail deletes with hash indexes in the presence of
duplicates because it isn't possible to append a unique-ifier (e.g. table TID)
to the key, for exactly the same reason as it isn't generally possible to
support multi-column indexes. In the case of Postgres hash indexes, it doesn't
matter as much because VACUUM doesn't currently do retail deletions in any
case. Postgres B-Tree indexes should support retail deletion at some point in
the future, but that hasn't happened yet.

Also, deleting whole B-Tree pages (removing them from the tree entirely) can
be thought of (and implemented) as an operation that is symmetric to page
splits. Page deletion is especially complicated at the best of times, because
it's particularly hard to reason about race conditions (Postgres has many
layers of indirection to get this to work). I can't imagine anything that
would work as well with hash indexes.

------
d_burfoot
As a programmer, I always use an ordered map (Java TreeMap) in preference to
hash tables. My reasons:

\- It is much easier to write a transparently correct comparison method for a
new object that I want to use as a key. With hash tables you need to worry
that your hash function might not be properly distributed.

\- With hash tables you have to worry that your language's under-the-hood
resize/reallocate algorithm works well, or you need to make sure you choose
the capacity and load factor numbers correctly. Even if things work as
expected, you might end up paying a penalty because, for example, the array
backing the hashtable was auto-resized to 128 million elements when you only
need 70 million.

\- Java TreeMap has lots of cool utility methods like pollFirstEntry (pull off
the first K/V pair), tailMap (a view of the map restricted to entries with
higher keys), etc. (see the sketch after this list).

\- TreeMap key traversal is deterministic!! The worst bug I encountered in
_years_ of programming work was caused by non-determinism in HashMap traversal
(you could blame Java's implementation here, I suppose).

\- The O(1) vs O(log N) performance difference is minor to the point of being
undetectable in every case I've ever encountered.

\- When you do something with the data in the map, you often want to sort by
key anyway: might as well have the sort taken care of by the data structure.
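
Here is a tiny, illustrative sketch of those NavigableMap conveniences
(made-up keys and values):

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Small illustration of the NavigableMap conveniences mentioned above.
    public class TreeMapDemo {
        public static void main(String[] args) {
            TreeMap<Integer, String> m = new TreeMap<>();
            m.put(10, "a");
            m.put(20, "b");
            m.put(30, "c");

            Map.Entry<Integer, String> first = m.pollFirstEntry();      // removes and returns 10=a
            NavigableMap<Integer, String> above = m.tailMap(20, false); // live view of keys > 20
            System.out.println(first + " " + above);                    // prints: 10=a {30=c}
        }
    }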

~~~
kodablah
You should reach for the best tool for the job instead of giving job-specific
reasons for your general choice. Without going into detail explaining how your
bullets either don't apply to most uses, prematurely code for a future case
that may not exist, dismiss performance, reference other languages as if
language implementation is not a factor in the choice, etc... suffice to say
they both have their place and in general you probably shouldn't reach for an
ordered map over an unordered one without reason. I.e. it shouldn't be your
default choice in most languages and it'd be bad advice to give other
programmers.

~~~
ummonk
An ordered map is much better as a default simply because it can do everything
that an unordered map can. Choosing to use an unordered map instead is a
premature optimization that is nearly always unnecessary.

~~~
mantap
IMO, premature optimization doesn't apply to cases where you are making
general decisions about policy or defaults, otherwise you can get into a
"death by a thousand cuts" situation where none of your functions are slow
individually but the whole program _is_ slow (i.e. big overheads). To me,
premature optimization means don't optimize a _specific_ function unless you
know it is slow. I'd also say that premature optimization doesn't apply in
cases where both alternatives are equally easy to write.

~~~
mlyle
On the other hand, if you pick a TreeMap--- odds are it will be a non-issue
either way.

If it _does_ show up on your profiles, it's easy to reconsider and say "I
don't think I'll ever access this in order, after all..." and change it and
see if you see a benefit.

------
redcalx
For one, a database may receive a query such as "WHERE x >=10 AND x <=100".
I.e. having an ordered index is useful for accessing ranges, whereas a hash
lookup is always just for a single entry. The question then becomes - why are
lookups in RAM more likely to be single lookups rather than ranges (and stats
on ranges)? Partly at least because DBs provide a query language that makes
range-based queries easy to do, and just because that's where most business
data sits (most of it necessarily needs to be in a persistent store).

~~~
jamesfmilne
Everything is driven by the business case. Is there a business case to
optimise for range queries? If so, optimise for that. Is there a business case
for individual record queries? Then optimise for that. Is there a case for
both? Then optimise for that.

The biggest lie of the 21st century is convincing
JavaScript/Ruby/Python/Clojure/whatever programmers that web development is
something sexier/holier/worthier than boring old CRUD Oracle Forms database
development.

It's exactly the same but with fonts and animated transitions.

~~~
zadokshi
Oracle Forms vs modern web development is a little off topic, but I'll bite.

Could gmail have been built in Oracle forms, how about slack? How about Asana?
How about an LMS such as blackboard, moodle, or D2L? Can you name any popular
mainstream product that could be built upon and run off the Oracle forms
product?

Oracle forms will get you 80% of the way there, and the final 20% will be
impossible.

~~~
garmaine
> Could gmail have been built in Oracle forms, how about slack? How about
> Asana? How about an LMS such as blackboard, moodle, or D2L? Can you name any
> popular mainstream product that could be built upon and run off the Oracle
> forms product?

Yes, it could.

Not going to judge whether it's a good idea or not, but it most certainly
could be implemented.

I suspect you're not aware just how much capability relational database
management systems have.

~~~
thu2111
Don't know why this post was voted down. I used to work on Gmail, which uses a
totally proprietary storage and database stack, like everything else at
Google.

One of the most insightful design docs I ever read was an exploration by an
engineer of how GMail 1.0 could have been implemented with commodity tech and
how it'd compare cost-wise. The rather sobering conclusion was that it would
have had similar functionality and been cheaper to develop and run.

------
romwell
Sometimes databases use hash tables for indexing.

This is a natural match for a key-value store.

In particular, the following KV stores offer hash table-based indexing:

-Berkeley DB [1]

-Tokyo Cabinet (and its successor, Kyoto Cabinet) [2]

-Bitcask (used in Riak) [3]

In many benchmarks, the hash-based systems perform better than tree-based
ones.

The real advantage of trees, as the Berkeley DB manual says:

> Btrees are better for range-based searches, as when the application needs to
> find all records with keys between some starting and ending value. Btrees
> also do a better job of exploiting locality of reference. If the application
> is likely to touch keys near each other at the same time, the Btrees work
> well.

And there is the actual answer to the question in the title.

[1][https://docs.oracle.com/cd/E17275_01/html/programmer_referen...](https://docs.oracle.com/cd/E17275_01/html/programmer_reference/intro_dbis.html)

[2][https://fallabs.com/tokyocabinet/perldoc/](https://fallabs.com/tokyocabinet/perldoc/)

[3][http://highscalability.com/blog/2011/1/10/riaks-bitcask-a-
lo...](http://highscalability.com/blog/2011/1/10/riaks-bitcask-a-log-
structured-hash-table-for-fast-keyvalue.html)

\-----------------------

TL;DR: Why? Because you can do range queries and traverse sequential elements
faster.

Don't need that? Use hash.

~~~
batoure
Came here basically to say this you will also find that well architected
HBase/Big Table often work this way as well

------
rklaehn
Databases use hashes all the time, and ordered indexes are very frequently
used in programming (e.g. the C++ STL).

I think ordered indexes are a good default for use in programming languages.
They are deterministic and do not have strange corner cases that can be
exploited.

E.g. the order when iterating over elements in a hashtable is pretty much
random, whereas the order of elements in an ordered tree-based data structure
is deterministic. That can very easily lead to the output of a program being
nondeterministic, e.g. when serializing the contents of a hashtable. This is a
frequent cause of compilers producing non-deterministic builds.
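
A minimal illustration in Java (arbitrary example keys): the tree map always
iterates in key order, while the hash map's order falls out of the hash codes
and current table size, and can change across library versions or resizes:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    // Iteration-order illustration: sorted vs. hash-dependent.
    public class IterationOrder {
        public static void main(String[] args) {
            Map<String, Integer> hashed = new HashMap<>();
            Map<String, Integer> sorted = new TreeMap<>();
            for (String k : new String[] {"pear", "apple", "fig", "cherry"}) {
                hashed.put(k, k.length());
                sorted.put(k, k.length());
            }
            System.out.println(hashed.keySet()); // some hash/table-size dependent order
            System.out.println(sorted.keySet()); // always [apple, cherry, fig, pear]
        }
    }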

I like my programs to be deterministic and want to precisely control where
randomness enters the execution.

Hash collisions can lead to pathological performance. This can be exploited
for DOS attacks, unless the hash data structure in question uses good
randomization of the hash function. E.g. the rust hash data structures do a
good job with this, but the issue just does not exist for an ordered tree
based data structure.

The performance difference is often very small unless you have very complex
keys or very large sets/maps. A hash-based data structure is a valid choice
for me once the performance actually matters.

~~~
kccqzy
I agree, but instead I would use the term "path-dependent" for hash tables
rather than "non-deterministic" because, after all, unless you salt your hash
functions, there is really no randomness anywhere. What you think of as non-
deterministic really is deterministic but it is determined by the precise
order of insertion. This is sometimes also called memorylessness.

Even though tree-based data structures give you an iteration order that is
predictable and useful, the structure itself is still not fully memoryless.
For example in a red-black tree if you insert an element smaller than all
preexisting elements and then immediately delete the minimum, the tree could
be different.

~~~
rklaehn
> I agree, but instead I would use the term "path-dependent" for hash tables
> rather than "non-deterministic" because, after all, unless you salt your
> hash functions, there is really no randomness anywhere. What you think of as
> non-deterministic really is deterministic but it is determined by the
> precise order of insertion. This is sometimes also called memorylessness.

You are right that non-deterministic is not strictly correct for non-salted
hash functions. That makes it even more annoying, since it is deterministic
enough that you might accidentally rely on the order. It is close enough to
nondeterministic since iteration order can change when the hash function
changes, but also when some minor internal parameter of the hashtable changes.

> Even though tree-based data structures give you an iteration order that is
> predictable and useful, the structure itself is still not fully memoryless.
> For example in a red-black tree if you insert an element smaller than all
> preexisting elements and then immediately delete the minimum, the tree could
> be different.

Yes, I am aware and a very big fan of data structures that are actually fully
memoryless, such as all kinds of binary tries, radix trees, or just wrapped
sorted arrays. The latter are totally underrated, since they have a very
compact in-memory representation (just the elements, in a single piece of
heap) and very competitive lookup performance.

They of course totally suck when you want to do random single element updates,
but then just use something else...

I am going to release an entire collections library based on sorted arrays for
rust over Christmas...

~~~
senderista
If you like sorted arrays and Rust, you might want to take a look at my
sorted-vec crate, which has O(sqrt(N)) insertions and deletions:
[https://github.com/senderista/sorted-
vec](https://github.com/senderista/sorted-vec). In my tests it takes less than
half the memory of BTreeSet.

PS: It isn't difficult to devise hash tables with path-independent iteration
order. You can simply order the keys by their hash code values. See e.g., my
implementation of bidirectional linear probing:
[https://github.com/senderista/hashtable-
benchmarks/blob/mast...](https://github.com/senderista/hashtable-
benchmarks/blob/master/src/main/java/set/int32/BLPIntHashSet.java).

~~~
rklaehn
Yes, saw that one. Neat.

My stuff is even simpler. It is just a flat array in memory and should _just
not be used_ when you want random single element inserts or deletions. But I
find that rather uncommon anyway in many use cases. I made sure that building
a new collection from an iterator is fast.

What I do use is a minimum comparison sort that makes very common cases very
fast in terms of number of comparisons, and some unsafe stuff to allow in
place updates to prevent allocations.

Not quite ready for prime time yet, but you can see the basic ideas being used
in this data structure:

[https://docs.rs/range-
collections/0.1.0/range_collections/ra...](https://docs.rs/range-
collections/0.1.0/range_collections/range_set/struct.RangeSet.html)

You could do a deterministic hash table by using a cryptographic hash function
like SHA-256 and then using a binary trie of the hashes. No need to deal with
collisions as far as we know. The hash is expensive, but if you need the hash
for something else anyway, this is pretty neat...

~~~
senderista
I have played with hash tries in the past for determinism/dynamism compared to
hash tables. I think they're an excellent choice for integer sets, where you
can use a reversible hash function to losslessly randomize the elements (I use
the same idea in the integer hash table I linked above). If I didn't need
graceful resizing, though, I would use either Cleary hash tables or bucketized
cuckoo tables for this application.

~~~
rklaehn
The hash set and hash map for scala are actually based on hash tries, but with
a non-cryptographic hash, so collisions still need to be handled.

I wrote efficient set operations for scala.collection.immutable.HashSet once.
Lots of bit-twiddling, which is not as fun on the JVM compared to rust...

------
blaisio
One major issue with disk based hash tables is rehashing. Actually, in-memory
hash tables have the same problem, but most people don't notice it. Imagine
your database suddenly duplicating itself because you inserted one row and
suddenly overloaded a bucket.
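
(As a rough, hedged illustration: if the table already holds 70 million rows
and a resize doubles the bucket array, the rehash has to re-read and re-write
essentially every entry in a more or less random order, so one unlucky insert
can turn into many gigabytes of I/O and a long stall.)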

Some hash tables that are designed to minimize the potential for DDoS attacks
and improve the amortization of work use gradual rehashing (Go's hash table
does this), but this is also not ideal for databases because the associated
disk accesses are so expensive, and it requires significantly more disk space.

Many real-world implementations just don't rehash unless you ask, which makes
them pretty inefficient for most use cases.

~~~
kgwxd
I've seen it happen in Azure Blob Storage when a container hit about 10
million files; it took 2 hours.

------
crazygringo
I've always thought b-trees should be the starting "default" for everything,
because their speed is far more consistent, you get extra features, and you
don't have to make decisions in advance about allocating extra memory or worry
whether the keys are well-distributed enough to avoid horrible
collisions. And how often do the extra few lookups for a tree materially
affect the speed of your software?

Hash tables to me aren't a go-to -- they're an optimization I reach for only
when I have a real performance reason where I _have_ to use them, and my data
is highly predictable so I have confidence how to set them up. But honestly,
that just doesn't happen that often. I swear, I think I've used Bloom filters
more often than hash tables in my life.

~~~
dajohnson89
Of the data structures you mentioned, which are easier to work with? To
implement? I code in Java and I usually use hash maps as a go-to. They're a
simple and fast way to store relationships between data.

~~~
garmaine
std::map in C++ is way easier to work with than std::unordered_map (which
didn't even exist until C++11).

~~~
yuushi
Define "way easier". The interface, at least for day to day operations, is
basically identical.

~~~
garmaine
Supports more features like ordered traversal, generally faster in most cases
(yup!), consistent API cost, doesn’t invalidate iterators on certain
operations like moving between maps. That’s just off the top of my head.

------
overgard
In C++ if you’re using the STL, std::map (tree) usage is way more common than
a hash (std::unordered_map). I couldn’t tell you if it’s for a good reason,
though, or if it’s just because “map” is a lot shorter than “unordered_map”.

~~~
taneq
std::unordered_map was only added in C++11, which explains why it's not used
in legacy code. Habit, inertia and a desire to keep codebases internally
consistent explains why it's less common now.

~~~
adrianN
Also std::unordered_map suffers from a design flaw that forces a poor
implementation.

~~~
taneq
Would you be willing to expand on that? I couldn't find any obvious well-known
issues with a quick search.

~~~
adrianN
The interface forces an implementation as a map with linked-list buckets which
is suboptimal for many use cases. So people who care about performance usually
just implement their own hash map or use a library.

~~~
taneq
Ah, that makes sense. Thanks!

------
narnianal
I would very simply summarize it like this: If you know which data you look
for and say "give me this piece" then you want to use a hash table. If you
don't know what you are looking for and need to filter/search, then you use a
b-tree. So actually you want to use both structures regularly.

Not sure how that general principle really applies or if it's just a quirk my
brain came up with, though. If you look at distributed hash tables, for
instance, they are hash tables on a bus network in some sense, but they are
used for search/filter tasks over multiple nodes.

------
asdfasgasdgasdg
This isn't a bad article, but I would be interested in seeing a follow-up
where, instead of guessing, the author asks some database professionals. They
are out there! We don't have to reinvent from first principles. I would guess
this question could be asked and answered by known authorities on e.g. the
PostgreSQL mailing lists.

------
im3w1l
I think part of it is inertia. There was a time when C++ only had a tree-based
map, but eventually it got a hash map too.

And part of it is that a database _needs_ a sorted index. Because a lot of
common queries would be awful without one. And when you have a sorted index
already, a hash index becomes a nice-to-have.

~~~
downerending
Adding to that, a database typically has to be ready for all kinds of queries.
But when choosing a data structure in a program, one often knows a great deal
about exactly what queries will show up.

------
SamReidHughes
Also, the log N constant is much smaller on disk than in memory, because the
branching factor is really high.
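
(For instance, with a branching factor of ~500, a billion keys need only
log_500(10^9) ≈ 3.3 levels, versus log_2(10^9) ≈ 30 for a binary tree in
memory.)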

------
halayli
I believe it's more due to the fact that sorting and less-than/greater-than
operators come up way more often in databases, and an ordered index serves
those use cases much better.

And the other reason is MVCC. It's easy to maintain multiple snapshots of a
b-tree index and swap a root's ptr to point to a new sub-tree.

~~~
senderista
Sorting in a database makes sense even when you never use ORDER BY. You can do
streaming joins/unions, and implement streaming, non-blocking aggregation
operators (because you know when you've seen the last row for a grouping key,
so you can output the result for that group immediately and evict its state
from memory). Query plans will sometimes sort inputs (or intermediate results)
for these reasons even when the final output is unordered; this is known as
"interesting orders".

~~~
halayli
Correct. There are many parts of the DB planner that check whether the output
is sorted in order to take a more optimal path.

------
twotwotwo
Indexes on things that’re correlated with time (ID, timestamp) tend to
concentrate recently added data towards the “end” of the table/index, which is
a big help when you have years of data but mostly people are asking about the
last few weeks.

It gets a couple sentences in the post, but prefix searches are key (ha!): a
sorted index on (foo, bar, baz) helps not only searches on all three
attributes but also foo=x and bar=y or just foo=x. It can also be useful when
you want to look things up by a prefix of an individual column: filter to a
given day or month using a timestamp index, or filter to errors reported from
client version 3.2.x. As a “covering index” it can somewhat help with queries
that limit to foo=x and do something with bar (filter on it, return it, sum
it, etc.). Finally, it helps with multiple scenarios for aggregation, not only
searching: a timestamp index can let you get a count of orders by day, hour,
or second without needing to make another temporary table/index.
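
A rough in-memory analogue of that composite-index prefix behaviour, sketched
in Java with made-up keys: store the columns concatenated into an ordered
map's key, and a range scan over the prefix serves the foo=x lookup on its
own:

    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Rough analogue of a (foo, bar, baz) composite index: keys are "foo|bar|baz"
    // strings, and a half-open range scan over the "x|" prefix finds every row
    // with foo = x, without touching bar or baz.
    public class PrefixScan {
        public static void main(String[] args) {
            NavigableMap<String, Integer> index = new TreeMap<>();
            index.put("x|1|a", 101);
            index.put("x|2|b", 102);
            index.put("y|1|c", 201);

            String prefix = "x|";
            System.out.println(index.subMap(prefix, prefix + Character.MAX_VALUE));
            // prints: {x|1|a=101, x|2|b=102}
        }
    }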

Often the crucial thing for keeping a DB humming is not getting the last
couple percent out of already-fast operations but making sure you don’t have
any disastrous queries, so an index being versatile for many kinds of search
is especially valuable.

Related to that last point, databases are trying to prepare for queries that
haven’t even been written yet, while when you’re coding some algorithm that
involves a hashtable, you already know exactly what lookup you’re going to do
a few lines later. That on its own is going to push you towards more flexible
types of indexing.

(Crossposted from [https://lobste.rs/s/68dgox/](https://lobste.rs/s/68dgox/),
where other folks have some good comments.)

------
adrianmonk
I was going to guess that a reason databases prefer ordered tables is that,
for two large tables (lots bigger than RAM), equijoins are easy to do
efficiently: just do a merge.

I wasn't aware of any obvious way to do the same with two on-disk hash tables.
But it appears some databases (MySQL?) do this.

So now my question is how.

Use the same hash function for the two tables to be joined, then loop over
buckets, matching corresponding buckets between tables? And then what, treat
those subsets similar to a join of two unsorted tables? I guess that's okay as
long as each of your hash buckets only has very few things in it. Is there a
better way?

~~~
senderista
There are a couple of common hash join algorithms that databases use:
symmetric hash join and asymmetric hash join. The latter is actually simpler.
Take the first (smaller) table and read it into a hash table (assuming it can
fit in memory for simplicity's sake). Then stream all rows from the second
table, looking up the join key for each input row in the hash table containing
the first table. If you get a match, emit an output row. For symmetric hash
join, you stream both input tables simultaneously: for each input row, you
first check the other input table's hash table for a match (emitting an output
row for each match), and then add the input row to that input table's hash
table. (When you've exhausted one of the input tables, you can actually delete
the other input table's hash table from memory, since you won't produce any
more matches from it.) Again I've oversimplified in assuming that the input
tables fit in memory, but the algorithms for spilling to disk are pretty
simple (basically hash the inputs on the join key into disk partitions which
do fit into memory and build hash tables from each partition).

Parallel databases also use hash join for performing distributed joins: each
node hashes its inputs on the join key to distribute onto the other nodes, so
that all inputs with the same join key end up on the same node. Then you can
apply one of the hash join algorithms above.
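
A minimal sketch of the asymmetric variant in Java (illustrative row type and
data, and assuming the build side fits in memory):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Asymmetric (classic) hash join sketch: build a hash table on the smaller
    // input, then stream the larger input and probe it.
    public class HashJoin {
        record Row(int key, String payload) {}

        static void join(List<Row> small, List<Row> large) {
            // Build phase: hash the smaller input on the join key.
            Map<Integer, List<Row>> built = new HashMap<>();
            for (Row r : small) {
                built.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
            }
            // Probe phase: stream the larger input, emit an output row per match.
            for (Row probe : large) {
                for (Row match : built.getOrDefault(probe.key(), List.of())) {
                    System.out.println(match.payload() + " | " + probe.payload());
                }
            }
        }

        public static void main(String[] args) {
            join(List.of(new Row(1, "alice"), new Row(2, "bob")),
                 List.of(new Row(2, "order-17"), new Row(1, "order-42"), new Row(3, "order-99")));
        }
    }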

~~~
adrianmonk
Since the case where the smaller table fits in RAM is easy and obvious, it's
the other case that I'm interested in.

It sounds like you might be saying when two large hash-based tables need to be
joined, you're basically starting from scratch in that you're not taking
advantage of the existing hash data structure. (At least that's how I
interpret "build hash tables", in contrast to somehow using what already
exists on disk.)

This sounds pretty slow to me compared to a merge join of ordered lists. There
seems to be a lot of I/O (including writes, temp files, and a few passes)
whereas with a merge join it's just reads and they're more or less sequential
I/O.

So this would be a reason for databases to lean toward ordered storage. But
only if disk-based hash joins are as slow as I think they are, which is the
part I'm not sure about.

~~~
senderista
It might not be clear that symmetric hash join is a streaming, non-blocking
operator (i.e., it can start producing results immediately without waiting for
a hash table to be built). Unlike merge join, though, it requires all input
rows to be kept in memory (at least until the other table is exhausted).

I suspect query optimizers haven’t taken hash indexes into account for joins
because they’re so rare in practice, but it’s probably worth considering.

------
kyrieeschaton
Most extant relational databases were originally designed pre-internet as a
way to store and batch-process business records, rather than optimizing for
things like single-record latency.

For the batch processing use case, you usually want to be able to do range
queries with the interior nodes of the tree in memory and the actual data
ending up resident on disk.

For the single-record lookup case (i.e. serving a user's data to them via a
webpage) something like linear hashing actually does have better performance
characteristics, although trees are usually good enough.

------
estebarb
I think that people just take too seriously the part of "hash tables are O(1)
and trees are O(log(n))". But you have to consider the complexity of hash
generation (in real life it's sometimes worse than tree traversal),
rebalancing, loss of ordering, and programmer issues (some just ignore that
hash collisions do happen). At least in my case, I have found that skip lists
have similar performance to hash tables (ops/sec for real-life loads, not
infinite items), and use less memory (at least the Open vSwitch
implementation).

------
fyp
Even if you need range queries when programming, it's rare that you need it
for a _dynamic_ dataset. This means you can get better constant factors just
by sorting your data and binary searching. You only need trees if you need to
insert/delete/update items in the middle.
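
For example (a rough Java sketch with made-up data): sort once after loading,
then answer range queries with two binary searches and no tree at all:

    import java.util.Arrays;

    // "Sort once, binary-search many times" for a static dataset:
    // a range query on a plain sorted array, no tree needed.
    public class StaticRangeQuery {
        // Index of the first element >= target (lower bound).
        static int lowerBound(long[] a, long target) {
            int i = Arrays.binarySearch(a, target);
            if (i < 0) return -i - 1;                // insertion point for a missing key
            while (i > 0 && a[i - 1] == target) i--; // step back over duplicates
            return i;
        }

        public static void main(String[] args) {
            long[] keys = {42, 7, 99, 13, 7, 56};
            Arrays.sort(keys);                       // one-time cost after loading the data
            int from = lowerBound(keys, 10);         // first key >= 10
            int to = lowerBound(keys, 60);           // first key >= 60 (exclusive end)
            System.out.println(Arrays.toString(Arrays.copyOfRange(keys, from, to))); // [13, 42, 56]
        }
    }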

Most programming tasks work with static datasets because memory is volatile.
We are always loading the data from somewhere else and usually just once
because reactively updating the in-memory collection is nontrivial to hook up.
On the other hand, this is the default for databases.

------
vkaku
That's just a wrong assumption. In case anyone wants an '=' or 'IN' query, you
are better off keeping a hash index as opposed to an ordered index. Hash-based
partitioning is available on many commercially used DBs as well.

I would also like to point out that radix based storing is not a bad idea at
all, and these data structures are never completely exclusive of each other
:-)

I tend to disagree with the author because we can get into the details, but
the article is a good read for someone who wants to understand how DBs/B-trees
and scans work.

------
dreamcompiler
I use two structures for persistent data: Extendible hashing for unordered
data and skip lists for ordered data. Extendible hashing is great for looking
up an object given its UID. Skip lists are easier than B-trees (to me) for
everything else. I also use skip lists occasionally for in-memory data because
they're just so darn handy. I wish both were better known.

------
lichtenberger
For densely packed IDs (32 or 64 bit integers...) I think it's even faster to
use a keyed trie structure, as it can use fast bit-shifting operations in
contrast to using standard B+-trees.

Also, this can be used as a persistent, on-disk / flash drive data structure
or as a hash array mapped trie for in-memory HashMap implementations (Scala or
Clojure, for instance).

For a persistent data structure (both in the sense of functional programming
and of storing on non-volatile disk), this is for instance implemented in
SirixDB ([https://sirix.io](https://sirix.io)), a temporal data store
prototype I'm maintaining and developing :-)

------
xjia
See also:

Extendible hashing—a fast access method for dynamic files

[https://dl.acm.org/citation.cfm?doid=320083.320092](https://dl.acm.org/citation.cfm?doid=320083.320092)

------
peterwwillis
> Why is the "default" choice different between programs and databases, when
> at the end of the day they both do the same thing: accessing data for our
> code?

Because databases are designed to be efficient at scale. A hash table just
doesn't have the functionality that a database with a B-tree variant does.

------
hoseja
I never understood using hash tables for anything but difficult-to-compare
values like long strings. If you're working with numbers, trees seem like a
much nicer data structure.

------
lonelappde
Databases are part of programming.

Hash tables are easier to use on arbitrary data because they don't require
Comparator to be defined.

Trees are faster when data is large enough that performance matters.

------
kazinator
Hashing is a technique used in databases also.

------
quotemstr
Databases use hashing all the time. One popular join strategy is the hash
join, which is repeated hash table lookup.

------
rumanator
Thanks for posting this article. The link to the B-tree monograph itself is
gold. Kudos!

------
jbverschoor
I thought it was because of spinning disks or tapes.

------
stephenwilcock
Adding an index takes O(n log n) time to build, not O(n).

------
timClicks
Databases need sequential access because they often scan tables and indexes.
Unless a SELECT query uses equality, the database engine needs to test many
values to see if they match predicates.

