
Memory-Efficient Search Trees for Database Management Systems [pdf] - ngaut
http://reports-archive.adm.cs.cmu.edu/anon/2020/CMU-CS-20-101.pdf
======
willvarfar
A lot to digest. Presumably the compression schemes outlined in the paper
would benefit columns generally, not just keys, and would have a positive
impact on row size, which would in turn benefit overall performance?

One short-string-compression library to compare with might be
[https://github.com/antirez/smaz](https://github.com/antirez/smaz) by
antirez, the author of Redis.

In the last decade TokuDB and then MyRocks showed how to build a fast storage
engine for data too big to fit in a traditional RDBMS. They are multi-level,
they do page-level compression, etc. And yet there are still so many easy wins
to be had.

Generally, databases are deeply inefficient and there is just so much low-
hanging fruit; if the team working on e.g. MyRocks were doubled in size and
tasked with looking at the inefficiencies in the server as well as the storage
engine, things might change. I have a list in my head of the various really
promising ideas that databases don't actually do:

* Linux doesn't have a syscall batching mechanism, but if it did, the number of context switches could be cut dramatically. Research in 2010 [https://www.usenix.org/legacy/events/osdi10/tech/full_papers...](https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Soares.pdf) demonstrated this, and it wouldn't just be databases that benefit. These days context switching is more expensive than ever.

* database engines all use blocking io. Finally io_uring offers workable async io and that would benefit database storage engines immensely. See [https://fosdem.org/2020/schedule/event/rust_techniques_sled/](https://fosdem.org/2020/schedule/event/rust_techniques_sled/)

* TokuDB showed that simply not tracking the affected row count can speed things up massively (TokuDB called this NOAR)

* query engines often don't handle things that a few more lines of code would handle efficiently. I've got tables with big compound keys and often do prefix searches on them; why isn't MySQL evaluating the constraints it can against the other columns in the key before dereferencing the row? Arrgh. Dumb dumb dumb. Etc.
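To make the syscall-batching point concrete, here's a toy sketch of mine using plain old writev(2) as a stand-in: many buffers, one syscall, where a naive loop would pay one syscall per buffer. Syscall batching as in the FlexSC paper linked above generalizes this idea to arbitrary syscalls. (Unix-only; this is my illustration, not FlexSC itself.)

```python
import os
import tempfile

# One writev() call submits many buffers in a single syscall, where a
# naive loop would issue one write() per buffer.
bufs = [b"row-%03d\n" % i for i in range(100)]   # 100 x 8-byte buffers

fd, path = tempfile.mkstemp()
try:
    written = os.writev(fd, bufs)                # one syscall, 100 buffers
finally:
    os.close(fd)
    os.unlink(path)

print(written)  # 800
```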
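The compound-key complaint is essentially "index condition pushdown": evaluate predicates on the non-prefix columns stored in the index itself before dereferencing the row. A toy sketch (all names made up for illustration):

```python
import bisect

# Compound index on (country, city) -> rowid; `heap` holds the full rows.
index = sorted([("de", "berlin", 0), ("de", "munich", 1),
                ("us", "austin", 2), ("us", "boston", 3),
                ("us", "nyc", 4)])
heap = ["row-%d" % i for i in range(5)]
fetches = 0

def fetch(rowid):
    # simulates the expensive row dereference
    global fetches
    fetches += 1
    return heap[rowid]

# Query: country = "us" AND city >= "boston".
# Prefix search on country, then filter on city using only index data,
# so no row is fetched for "austin".
lo = bisect.bisect_left(index, ("us", "", -1))
hits = [fetch(rid) for c, city, rid in index[lo:]
        if c == "us" and city >= "boston"]

print(hits, fetches)  # ['row-3', 'row-4'] 2
```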

~~~
qaq
"database engines all use blocking io" Well sled is using io_uring

~~~
samatman
probably why that paragraph had a link to a video titled "sled and rio: modern
database engineering with io_uring"...

~~~
qaq
which was my point :)

------
derefr
I've always wondered why I haven't seen a DBMS built as a unikernel. A DBMS is
already a "managed runtime" for data, with its own memory allocator,
scheduler, filesystem (in some sense), etc. And you're _almost_ always going
to want to run a DBMS "workload" on its own dedicated hardware/VM, anyway, for
predictability.

So why not just take that set of DBMS services and put them in ring-0, where
they won't need any context-switch overhead, will have fine-grained control
over their own queuing for kernel resources, and where they can pass data
structures by reference all the way from the network to the disk and back?

In Linux, we already have Open-iSCSI, which just has the control plane in
userspace, while the data plane is entirely a Linux kernel service, gaining it
all these advantages. This architecture works very well there; I'm unclear on
why others attempting to provide the higher-level "data-management solutions",
with the same high-throughput/low-latency requirements, haven't copied it.

~~~
aratno
Sounds like you would be interested in this paper on TabulaROSA:
[https://arxiv.org/pdf/1807.05308.pdf](https://arxiv.org/pdf/1807.05308.pdf)

------
jandrewrogers
This is a good paper and I appreciate the holistic focus on cache efficiency,
an area where multiple orders of magnitude of performance improvement are
often easily attainable compared to many common implementations. However, it
also highlights the gap between academic literature and the state-of-the-art
in database engine design. For example, adaptive succinct indexing structures
have been used for at least a decade in closed source databases. Structures
similar to the ideas presented in the paper have been reduced to practice in
real systems for a long time.

Last month I delivered yet another database engine, benchmarked against the
best open source comparables, which provides a rough but concrete example of
the gap:

The designed memory:storage ratio was 1:1000, an order of magnitude higher
than even the 1:100 ratio mentioned as aggressive in the paper. In fairness,
my prior systems were designed much closer to a 1:100 ratio, and this one used
new CS research to significantly extend the ratio without materially
sacrificing performance. For data models with fairly complex indexing
requirements, insertion performance was >100x(!) that of the best open source
comparables.

A large part of this performance is due to dramatic improvements in cache
efficiency that are not even particularly novel -- the gains attributable to
improved cache efficiency in the paper are eminently believable. The data-to-
index ratio in the above is around a million-to-one, making the index small
enough to fit in CPU cache for many TB-scale data models. The high data-to-
index ratio is largely attributable to using search structures that forego
total order and balancing, which enables dramatic improvements in succinctness
with minimal reductions in selectivity.
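To illustrate the flavor of trade-off (a generic zone-map-style sketch, not the actual structures in question): keep only a (min, max) summary per block of keys in arrival order, instead of a totally ordered, balanced tree over every key. The summary easily fits in CPU cache; the price is that overlapping block ranges can occasionally force scanning a block that doesn't hold the key.

```python
import random

random.seed(1)
BLOCK = 1024
# roughly time-ordered keys with local jitter, as in many real feeds
keys = [100 * i + random.randrange(5000) for i in range(100_000)]
blocks = [keys[i:i + BLOCK] for i in range(0, len(keys), BLOCK)]
zone_map = [(min(b), max(b)) for b in blocks]   # 2 ints per 1024 keys

def lookup(key):
    scanned = 0
    for (lo, hi), blk in zip(zone_map, blocks):
        if lo <= key <= hi:        # candidate block per the summary
            scanned += 1
            if key in blk:
                return True, scanned
    return False, scanned

found, scanned = lookup(keys[50_000])
print(found, scanned)  # True 1
```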

The other major contributor to performance is scheduler design, which wasn't
really touched on in the paper and is largely ignored entirely in open source
databases.

tl;dr: current open source database engine designs leave a massive amount of
performance on the table due to very poor cache efficiency, and this paper
correctly touches on some of the ways this is materially improved in closed
source database engines.

~~~
willvarfar
Can you give names of closed-source database engines that have these kinds of
performance improvements?

I mean, mainstream RDBMSs like Oracle, DB2, etc. don't seem to be ahead of
the open source databases; they are all stagnant too!

~~~
jandrewrogers
Databases are severely constrained by the architecture choices made when they
were designed; you can't backport modern database architecture and computer
science to e.g. an Oracle or DB2. To integrate new computer science advances
you often need to write a new kernel from first principles. I sunset the
designs I license to companies every several years, starting over from a clean
sheet.

Most new high-performance database engines are intended to give the developing
company a qualitative uplift in capability, scale, or operational efficiency.
No one sells public licenses these days. You've heard of the organizations
that are buying or building these semi-bespoke database engines, but they are
intended for internal use only.

The reason no one sells these capabilities as a product anymore is pragmatic:
it is extremely expensive to design a database engine for general public
consumption, and the economics are difficult to justify as an investment. But
many large companies are willing to pay many millions of dollars for narrowly
focused database capabilities, and the reduced scope makes the development
cost more palatable.

~~~
eternalban
I was skimming your SpaceCurve writeup [1] and your mention of discrete
internal components caught my eye. Are you open to expanding a bit on this
statement:

"Discrete topology internals largely obviate secondary indexing"

[1]:
[https://www.jandrewrogers.com/2015/10/08/spacecurve/](https://www.jandrewrogers.com/2015/10/08/spacecurve/)

~~~
jandrewrogers
Secondary indexing is a hack to address the reality that most indexing
algorithms can only represent a single type of relationship efficiently and
typically only in a single dimension. This is not a law of the universe, it is
just how our algorithms tend to work. If you could eliminate secondary
indexing without sacrificing selectivity, it would be a massive win for
performance and scalability. However, this would require a single indexing
algorithm for complex data models that preserved an arbitrary mix of
relational, time-series, spatial, graph, etc relationships for searches and
joins.

To make this work in a practical database engine, you can't index the data
model per se but you can make it work by indexing a _moduli space_ into which
arbitrary data models can be mapped. These tend to naturally expose the
topological structure of the underlying data model for computational purposes
even though you are not computing on the data model per se. Designing very
general moduli spaces for database purposes is non-trivial and, to make
matters worse, they are pathologically incompatible with typical surrounding
database infrastructure once you figure out how to construct them. But you can
use the exposed topology to execute complex searches and joins on the
underlying data model.

None of my database engines use secondary indexing at all, hence the excellent
scaling and write performance, even for complex mixed-mode data models. A
decade ago the representations were pretty brittle and limited because I
didn't know how to express many things, but these days I know how to elegantly
express just about every common data model.

~~~
eternalban
Thanks!

> a moduli space into which arbitrary data models can be mapped

Very interesting. Somehow reminds me of using lattices for deterministic
concurrency. Is this a topic that is discussed in public literature, or an
innovation of yours? I'd love to learn more about this.

~~~
jandrewrogers
The work is mostly mine but I've had collaborators for some of the research
over the years.

I accidentally invented it many years ago when I discovered an unorthodox but
elegant solution to an algorithm problem which had stymied researchers for
decades (and which I needed to solve for a specific application). Some months
later, a research group I was working with were convinced that the unusual
algorithm construction might be applicable to an unrelated algorithm problem
they had been working on. Six weeks later I had solved their problem too by
extending the concepts developed for my original algorithm.

By that point I realized that while those two very different algorithms were
cool, the half-baked computer science I had developed to construct them was
even cooler. I spent the next decade fleshing out, generalizing, and extending
the scope of the computer science while figuring out how to efficiently
express that theory in practical software (which is not trivial). While I
wrote quite a few papers on this computer science many years ago when it was
new, the distribution was required to be non-public. I sporadically teach bits
of it but writing up hundreds of pages of old research in my spare time is a
lot less fun than working on my backlog of interesting computer science
projects.

------
alecco
Note: by compression they mean keeping internal blocks closer to full. It
looks like a good thesis and the advisors are reputable.

But this comes with a trade-off. As blocks are kept full, inserts more often
trigger a cascade of splits. Batching inserts helps, but once you need to
apply the batch it could take a long time, potentially rebalancing the whole
tree. This adds a fat tail to insert latencies. In many read-heavy scenarios,
though, it is a good trade-off.
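A toy simulation of the effect (my own model, not from the thesis): leaf blocks of capacity C that split in half on overflow. The same stream of random inserts hits blocks kept 100% full far harder than blocks kept 50% full.

```python
import bisect
import random

C = 64  # leaf block capacity

def build(sorted_keys, fill):
    # chunk pre-sorted keys into blocks at the given fill factor
    step = int(C * fill)
    return [sorted_keys[i:i + step] for i in range(0, len(sorted_keys), step)]

def insert(blocks, key):
    # route to the block whose key range should hold `key`
    i = min(bisect.bisect_right([b[-1] for b in blocks], key),
            len(blocks) - 1)
    bisect.insort(blocks[i], key)
    if len(blocks[i]) > C:                        # overflow: split in half
        blocks[i:i + 1] = [blocks[i][:C // 2], blocks[i][C // 2:]]
        return 1
    return 0

random.seed(7)
keys = sorted(random.sample(range(0, 2_000_000, 2), 20_000))   # even keys
inserts = random.sample(range(1, 2_000_000, 2), 5_000)         # odd keys

results = {}
for fill in (1.0, 0.5):
    blocks = build(keys, fill)
    results[fill] = sum(insert(blocks, k) for k in inserts)
    print(f"fill={fill:.0%}: {results[fill]} splits")
```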

~~~
dgacmu
There was an aspect of that in the first part of the thesis, but the rest has
techniques that are independent. You can roughly break the thesis down by
paper:

* Hybrid Indexes (a read-only, full-block kinda thing where you have to handle inserts by using a second read-write tree)

* SuRF, a succinct range filter data structure

* Order-Preserving Key Compression for In-Memory Search Trees

Each of them can be used independently of the others, or combined. You can
find two of those papers on Huanchen's web page:
[http://www.cs.cmu.edu/~huanche1/](http://www.cs.cmu.edu/~huanche1/); the
third is to appear, but you can find a preprint on arXiv:
[https://arxiv.org/abs/2003.02391](https://arxiv.org/abs/2003.02391)

~~~
alecco
Excellent. Great job!

------
scandum
How does it perform against googlebtree?

[https://www.tommyds.it/doc/benchmark](https://www.tommyds.it/doc/benchmark)

------
7532yahoogmail
Off topic to OP's paper, but consider the case of VoltDB, an all-in-memory
SQL DB with replication. In the sales pitch, its write speed was attributed
to fewer locks, latches, and less disk I/O. But somewhere else I read that a
big culprit in fact isn't that: it's formatting data to be written to disk
and then decomposing a disk block back into memory for use. All in-memory
DBs avoid that. Thoughts?
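Concretely, the round-trip I mean looks something like this (a toy layout of my own, not VoltDB's actual format): a disk-based engine packs each row into an on-disk block layout and must decompose it on every read, while an in-memory engine just follows a pointer to the native object.

```python
import struct

ROW = struct.Struct("<q32sd")           # id, fixed-width name, balance

def to_block(rows):                     # "format data to be written"
    return b"".join(ROW.pack(i, n.encode().ljust(32, b"\0"), b)
                    for i, n, b in rows)

def from_block(block):                  # "decompose block into memory"
    return [(i, n.rstrip(b"\0").decode(), b)
            for i, n, b in ROW.iter_unpack(block)]

rows = [(1, "alice", 10.5), (2, "bob", -3.25)]
assert from_block(to_block(rows)) == rows   # lossless, but pure overhead

# The in-memory path: no pack/unpack at all, just native structures.
by_id = {i: (i, n, b) for i, n, b in rows}
print(by_id[2])  # (2, 'bob', -3.25)
```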

------
7532yahoogmail
Working my way through the paper. Looks very cool, and practical... It's also
exceptionally well written. It's clear. Nice job.

------
rini17
Only read the conclusion and it's not mentioned there: did they consider
locality and thus cache/paging misses?

