Memory-Efficient Search Trees for Database Management Systems [pdf] (cmu.edu)
251 points by ngaut on March 11, 2020 | hide | past | favorite | 59 comments

A lot to digest. Presumably the compression schemes outlined in the paper would benefit columns generally, and not just keys, and would have a positive impact on row-size which would in turn benefit general performance?

One short-string-compression library to compare with might be https://github.com/antirez/smaz by antirez, the author of Redis.

In the last decade TokuDB and then MyRocks showed how to build a fast storage engine for data too big to fit a traditional RDBMS. They are multi-level, they do page-level compression, etc. And yet there are still so many easy wins to be had.

Generally, databases are completely inefficient and there is just so much low-hanging fruit; if the team size working on e.g. myrocks could be doubled, and tasked with looking at the inefficiencies in the server as well as the storage engine, things might change. I have a list in my head of the various really-promising-ideas that databases don't actually do:

* linux doesn't have a syscall batching system, but if it did, the number of context switches would be cut down dramatically. Research in 2010 https://www.usenix.org/legacy/events/osdi10/tech/full_papers... proved this and it wouldn't just be databases that benefit. These days context switching is more expensive than ever.

* database engines all use blocking io. Finally io_uring offers workable async io and that would benefit database storage engines immensely. See https://fosdem.org/2020/schedule/event/rust_techniques_sled/

* tokudb showed that simply not tracking the affected row count could speed things up massively (tokudb called it NOAR)

* query engines often don't handle things that a few more lines of code would handle efficiently. I've got some tables with big compound keys and often do prefix searches on them, and why isn't MySQL evaluating the constraints it can against the other columns in the key before dereferencing the row? Arrgh. Dumb dumb dumb. Etc.

> Generally, databases are completely inefficient and there is just so much low-hanging fruit;

Not generally, unless you consider only popular open-source databases on Linux to be the "general" case.

> Linux doesn't have a syscall batching system

Windows has scatter-gather I/O, and notably SQL Server uses it.

> Database engines all use blocking io

SQL Server almost exclusively uses asynchronous I/O, and the entire Windows kernel I/O stack has been async since the 1990s.

> I've got some tables with big compound keys and often do prefix searches in them, and why isn't mysql

That's because MySQL barely qualifies as a database engine. It has only a fraction of the features of any commercial database, but it is cheap and it has low latency for simple queries, which makes it popular.

PostgreSQL is substantially more feature-rich, but they made tradeoffs elsewhere in the design that result in high write amplification, unfortunately making it a no-go for many potential use cases.

You would be shocked at the concurrency levels achievable with modern in-memory databases, and the compression levels available with technologies like SQL Server's ColumnStore.

Unfortunately, all of that technology comes with a licensing cost...

> Unfortunately, all of that technology comes with a licensing cost...

Yes, so in the end it doesn't decrease TCO—it decreases OpEx but increases CapEx.

(Or, to put that another way: anything one SQL Server instance can do, five-to-ten Postgres read-replicas can do for the same price.)

Low-hanging fruit is "low-hanging" (i.e. tempting to pluck) because it'd be high-positive-ROI to solve these problems. Switching to an RDBMS that has already solved these problems, but has a licensing cost, is more likely just ROI-neutral.

>> the entire Windows kernel I/O stack has been async since the 1990s.

Sincere Qs from a novice database developer: what good is all that async IO when you also have to consider data integrity? How would you e.g. uphold the integrity of a file that stores the state of an autoincrementing field if you didn't use a singleton writer? In what DB layer would I be helped by async IO?

Not involved, but giving an educated guess:

Imagine a simple search like SELECT * FROM my_table WHERE foo = ?

Imagine that the column foo is indexed.

With blocking IO, we go through the foo index, finding the rows that have the correct value of foo.

The rows we find are on some kind of pages in the storage of the main table. So we now have a list of pages we have to fetch.

With blocking IO, we would fetch a page at a time, process it, then request the next page.

With async IO, we would request the fetching of all pages, as we discover we need them, and then process those pages in the order the IO subsystem can return them to us.

Writes are the same; we have some number of pages we want to write to disk. With blocking IO, we write a page at a time and wait for it to be written. With async you can buffer up the writes, let the OS write them in the order it chooses, and wait for them all to finish.
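A toy sketch of that difference, using Python's asyncio; the page IDs, the simulated latency, and the fetch_page helper are all invented for illustration, with asyncio.sleep standing in for a disk read:

```python
import asyncio
import random

async def fetch_page(page_id: int) -> str:
    # Stand-in for one disk read; a real engine would issue an async read here.
    await asyncio.sleep(random.uniform(0.001, 0.005))
    return f"page-{page_id}"

async def blocking_style(page_ids) -> list:
    # One page at a time: total latency is the sum of every fetch.
    return [await fetch_page(p) for p in page_ids]

async def async_style(page_ids) -> list:
    # All requests in flight at once; process each page in whatever
    # order the IO completes. Total latency is roughly the slowest fetch.
    tasks = [asyncio.create_task(fetch_page(p)) for p in page_ids]
    return [await t for t in asyncio.as_completed(tasks)]

pages = asyncio.run(async_style(range(8)))
```

The same set of pages comes back either way; the async version just overlaps the waiting.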

Note that writes are asynchronous on Linux too, and only become synchronous when the kernel buffer for writing is full. No special code needed.

Stuff like that is even more confusing because your performance analysis at moderate load tells you absolutely nothing about how the system will behave under high load (people who complain about GC often mention this as one of their grievances)

Turns out this causes a lot of issues when working with large diverse teams in particular. If you have a team of 4 you can socialize the idea of avoiding situations like this.

> How would you e.g. uphold the integrity of file that stores the state of a autoincrementing field if you didn't use a singleton writer?

I mean, you wouldn't get write-concurrency for that field. And in fact you'd avoid serial PKs in general in a write-concurrent design, in favor of e.g. client-generated UUIDs.

Where you'd get wins is when 1000 different clients are each opening a transaction and upserting 1000 rows, with, say, 50% of them being duplicates of other clients' rows. Rather than linearizing+deduping all of those right away, you can just write down all 1000 individual MVCC-world-state WAL log fragments concurrently, as 1000 queued write-batches. This lets you take actual advantage of all those write IOPS that RAIDed NVMe gets you.
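As a minimal illustration of the first point, client-generated UUIDv4 keys need no coordination at all, unlike a shared counter (the row shape here is invented):

```python
import uuid

def new_row(payload: str) -> dict:
    # Each client mints its own primary key; no shared counter, no singleton writer.
    return {"id": uuid.uuid4(), "payload": payload}

# 1000 "clients" generating keys independently would collide with negligible
# probability: each UUIDv4 carries 122 random bits.
rows = [new_row(f"client-{i}") for i in range(1000)]
assert len({r["id"] for r in rows}) == len(rows)
```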

Using threads for in-flight IOs means burning up lots of memory with stacks, and burning ever more CPU in context switches as IOs complete. In practice this means using some kind of state machine (either explicit or in the form of coroutines) to pick up where logic paused at the initiation of the IO. The logic with respect to locking and concurrency control doesn't change; the costs in memory and CPU do.

The CPU cost is increasing all the time due to all the cache attacks being revealed and the mitigations blowing away various cache contents, which is especially damaging in VM contexts like AWS where the mitigations can't be reasonably disabled.

Data integrity and asynchronous I/O are unrelated concerns. The value of async I/O is that it exposes the concurrency and parallelism of the storage hardware to the software layer, enabling many types of performance optimizations. The vast majority of I/O operations in many database architectures do not have any sequential dependencies on other I/O operations, even within a single execution context.

Data integrity is a function of design and implementation correctness, in addition to having robust mechanisms for dealing with data corruption during I/O, hardware failures, etc.

> database engines all use blocking io

Only in open source; most advanced closed-source database engines have not been designed with blocking I/O in a very long time. I have not worked on any database engines with blocking I/O in the last decade.

This is an area of easy integer factor performance improvement for open source databases, and one of the primary reasons they are not competitive with the performance of many closed source designs.

Are there any benchmarks openly available to compare the performance of different open/closed source DB’s?

An older comparison of MySQL vs SQL Server that I dug up:


> linux doesn't have a syscall batching system

Well, you already mention io_uring. That's your syscall batching system. It doesn't cover all existing syscalls because not everything is safe to do asynchronously.

Yes, I'd like both!

Asynchronous io_uring is good for things that take a lot of time, e.g. IO. It's really just a queue between user space and kernel threads.

But it doesn't replace just making a synchronous syscall for small things, e.g. stat'ing files.

A lot of those syscalls are called in quick succession, often with no logic between them.

The Soares paper I linked to showed how buffering up a list of syscalls, doing a single context switch to the kernel, executing them, and then switching back is a major gain for a wide variety of applications.

Synchronous syscall batching would be much more straightforward to adopt than async IO; imagine if the C stdlib file routines all used buffered syscalls - everyone would benefit! Whereas I would imagine adopting true async IO would be very hard, because the stdlib exposes a blocking API to its users.

You can use io_uring for synchronous batching too. All you have to do is submit a batch and set min_complete to the number of submitted items when calling io_uring_enter. That's a single syscall, assuming you already have some thread-local ring lying around just for batching.

You probably also need to set the flag that enables the events to be linked, so that they happen in the correct order.

Only when order matters. For the given example of stat'ing a lot of files, it wouldn't.

> A lot of those syscalls are called in quick succession, often with no logic between them.

Speaking of which, I noticed on macOS that the “du” command-line utility seems much slower than a third-party GUI utility called DaisyDisk [1] at collecting the sizes of everything on a disk.

This has me wondering what DaisyDisk is doing, if anything, that “du” isn’t, to achieve this.

One theory of mine is that perhaps APFS caches metadata about file sizes and that this can be accessed faster than walking all directories and stat’ing all the things.

An alternative theory I have is that perhaps GUI apps just run with a higher priority.

If anyone knows, I’d be delighted to learn.

A third possibility is that all of this was just due to "noise". For example, disks happening to be under less I/O pressure when I ran DaisyDisk than when I've run "du". My perception of their relative speed is based on very little scientific rigor.

[1]: https://daisydiskapp.com/

One of the features of APFS is supposedly a cache for total size of directory trees.

As far as I remember, it is something the OS decides to turn on or off for different folders. It’s wasteful for /tmp, for example, because it makes file creation and deletion slightly slower.

I’m not entirely sure if du would want to tap into that. A lot has changed in MacOS, and it’s not entirely clear if we still have a universal definition of “file size”. You can let iCloud store large files now, for example, truncating the local file when it is rarely accessed.

Don’t remember the exact name, sorry.

There is a MacOS-native way to explore recursive size, by the way: about this Mac -> storage -> manage -> documents -> file browser. This is limited to your user folder by default, but you can just create a soft link to take you to /

> You can let iCloud store large files now, for example, truncating the local file when it is rarely accessed.

Do you know how that translates to the BSD layer? Does such a file appear to exist and have a certain size, but is e.g. exclusively read-locked, unless you manually download it? Or does fopen(2) block, trigger an iCloud download, and then unblock once it's complete?

It might just be using more threads and thus filling command queues better.

Right. It's likely multithreaded whereas du is not. You can find some multithreaded versions of du though.
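A toy multithreaded walker is only a few lines in Python. os.scandir also helps on its own, because on most platforms it returns stat information cached from the directory read instead of issuing a separate stat syscall per file. This is a sketch of the general technique, not DaisyDisk's actual implementation:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def dir_size(path: str) -> int:
    """Total size in bytes of regular files under path, scanning subdirectories in parallel."""
    total = 0
    subdirs = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                # scandir's cached data usually avoids an extra stat syscall here.
                total += entry.stat(follow_symlinks=False).st_size
            elif entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
    if subdirs:
        # Overlap the IO of sibling subtrees; each level gets its own small pool.
        with ThreadPoolExecutor(max_workers=8) as pool:
            total += sum(pool.map(dir_size, subdirs))
    return total
```

The thread pool mostly buys overlapped IO waits, which is why it can beat a single-threaded du even on one disk.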

Stat and open are supported by io_uring in 5.6

It also doesn't (yet) handle use cases like open -> read -> close, where all of them are queued at once. This is because the read and close need the file descriptor that the open operation produces.

Perhaps not a problem for databases (which probably keep all relevant files open anyhow).

"database engines all use blocking io" Well sled is using io_uring

probably why that paragraph had a link to a video titled "sled and rio: modern database engineering with io_uring"...

which was my point :)


I thought I saw a PR fly by for mysql but I can't find it now so maybe I was imagining things.

InnoDB (MySQL) uses aio where available.

MySQL is the original NoSQL database

I've always wondered why I haven't seen a DBMS built as a unikernel. A DBMS is already a "managed runtime" for data, with its own memory allocator, scheduler, filesystem (in some sense), etc. And you're almost always going to want to run a DBMS "workload" on its own dedicated hardware/VM, anyway, for predictability.

So why not just take that set of DBMS services and put them in ring-0, where they won't need any context-switch overhead, will have fine-grained control over their own queuing for kernel resources, and where they can pass data structures by reference all the way from the network to the disk and back?

In Linux, we already have Open-iSCSI, which just has the control plane in userspace, while the data plane is entirely a Linux kernel service, gaining it all these advantages. This architecture works very well there; I'm unclear on why others attempting to provide the higher-level "data-management solutions", with the same high-throughput/low-latency requirements, haven't copied it.

Sounds like you would be interested in this paper on TabulaROSA: https://arxiv.org/pdf/1807.05308.pdf

Scylla is available as a unikernel.

This is a good paper and I appreciate the holistic focus on cache efficiency, an area where multiple orders of magnitude of performance improvement are often easily attainable compared to many common implementations. However, it also highlights the gap between academic literature and the state-of-the-art in database engine design. For example, adaptive succinct indexing structures have been used for at least a decade in closed source databases. Structures similar to the ideas presented in the paper have been reduced to practice in real systems for a long time.

Last month I delivered yet another database engine, benchmarked against the best open source comparables, which provides a rough but concrete example of the gap:

The designed memory:storage ratio was 1:1000, an order of magnitude higher than even the 1:100 ratio mentioned as aggressive in the paper. In fairness, my prior systems were designed much closer to the 1:100 ratio; this one used new CS research to significantly extend the ratio without materially sacrificing performance. For data models with fairly complex indexing requirements, insertion performance was >100x(!) that of the best open source comparables.

A large part of this performance is due to dramatic improvements in cache efficiency that are not even particularly novel -- the gains attributable to improved cache efficiency in the paper are eminently believable. The data-to-index ratio in the above is around a million-to-one, small enough to fit in CPU cache for many TB scale data models. The high data-to-index ratio is largely attributable to using search structures that forego total order and balancing, which enables dramatic improvements in succinctness with minimal reductions in selectivity.

The other major contributor to performance is scheduler design, which wasn't really touched on in the paper and is largely ignored entirely in open source databases.

tl;dr: current open source database engine designs leave a massive amount of performance on the table due to very poor cache efficiency, and this paper correctly touches on some of the ways this is materially improved in closed source database engines.


> ...adaptive succinct indexing structures

We've been optimising for low memory, and compact/succinct data structures have been super impactful (especially static succinct radix trees and data sketches like Count-Min Sketch). Apart from the excellent sdsl-lite [0], roaring-bitmaps [1], apache data-sketches [2], and pizza-chilli [3], what other resources would you recommend we take a look at?

> ...storage ratio was 1:1000, an order of magnitude

This is insane: is it due to compression techniques or... Could you please point out a few reasons why the gap is this high?

> The other major contributor to performance is scheduler design...

Out of curiosity: Any pointers to papers that we can take a look at?


[0] https://github.com/simongog/sdsl-lite

[1] https://roaringbitmap.org/

[2] https://datasketches.apache.org/

[3] http://pizzachili.dcc.uchile.cl/

Physical storage density per server is driving this. You can fit upward of a petabyte of physical storage in a server and there are many applications where this makes sense. RAM is still on the order of a terabyte, and expensive. Everything else follows from trying to use all of this physical storage effectively. Needless to say, when working with this much storage you aren't installing a filesystem on it.

When working with storage this dense, write performance matters immensely if you can't wait (literally) a year to index your data model. Scaling write throughput on indexing structures has a problem of write sparsity: given a fixed amount of cache, there is a scaling point where virtually every single insert requires at least one page fault. Indexing structures like B+Trees that consume large amounts of space will not only push the data pages out of cache, the index itself may no longer fit into cache at high densities. Every database exhibits this characteristic write cliff with indexing structures but the cache:data ratio where this occurs varies with indexing algorithm design. (This also impacts query performance but we'll ignore that for now.)

Succinct indexing structures can eliminate the competition with data pages for cache space. Write sparsity is intrinsic though, so this only lets you maintain write performance for maybe another 1-2 orders of magnitude. Eventually the set of actively written data pages will exceed the size of the cache and write throughput will drop precipitously before leveling off.

Being able to index tens of terabytes of data at wire speed is much better than typical but at least an order of magnitude short of what is desirable for a petabyte of storage in a single server. Ideally, you'd want the flat write throughput to extend out to the end of your storage, which means extending the write cliff at least another order of magnitude. There is not an obvious way of doing this without adversely impacting performance in other unacceptable ways.
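The write cliff is easy to reproduce with a toy LRU model. The sizes below are invented, and uniform random access is a worst case (real workloads are skewed, which is exactly what cache-conscious designs exploit), but the shape of the curve is the point:

```python
import random
from collections import OrderedDict

def miss_rate(num_pages: int, cache_pages: int, n_ops: int = 20_000) -> float:
    """Fraction of uniform random page touches that fault, under an LRU page cache."""
    cache = OrderedDict()
    misses = 0
    for _ in range(n_ops):
        page = random.randrange(num_pages)
        if page in cache:
            cache.move_to_end(page)  # mark most recently used
        else:
            misses += 1
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict least recently used
    return misses / n_ops

# Working set fits in cache: only compulsory misses during warm-up.
hot = miss_rate(num_pages=1_000, cache_pages=2_000)
# Working set far beyond cache: virtually every touch faults.
cold = miss_rate(num_pages=200_000, cache_pages=2_000)
```

Between those two regimes the miss rate transitions sharply once the actively written pages exceed the cache, which is the cliff described above.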

I started researching this problem a few years ago from a very different direction that does not require directly improving cache efficiency per se, since that is essentially tapped out as a strategy. Most software engineers don't understand the nature of caches as abstract mathematical objects but there are some interesting NP-Hard problems surrounding the behavior of caches that we essentially ignore for database architecture purposes because (1) NP-Hard and (2) exploitable efficient approximations are incompatible with many common database architectures in any case. I've worked through a couple new algorithm prototypes that attack these properties to significantly extend the performance party for even higher storage densities. I'm just starting to design kernels that are purpose-built to take advantage of this research but the heavy lifting is done in the schedulers.

Yes, I do this kind of thing for fun.

Could you provide some references for this work on caching and schedulers?

I’m also curious what application domains are storing close to petabytes on a single machine with only a few terabytes of dram.

Ultra-dense database engines are typically used for data models created from measuring entities or places in the physical world, sometimes in real-time. This includes telemetry and sensors from mobile phones, automobiles, satellites, features extracted from video, network packets, the web, etc and sometimes all of the above. These are the largest data models that I know of; most medium-sized non-tech companies I work with have 10-100 PB data models now. The size doubles quickly even if you regularly truncate old data. The extremely high storage density saves both hardware and software engineering costs by massively reducing the required cluster size.

This has become a great emerging market for new database tech, tremendous amounts of money are being spent on these data models and traditional data infrastructures are very poor for this. The size and velocity of the data implies that you need a single database instance that can run ingestion workloads concurrent with analysis workloads.

The research on caching algorithms for ultra-dense storage is new, it is literally just starting to be worked into production designs, so I wouldn't expect references. Most advanced database research is done outside academia, so a significant fraction is published long after implementation or not formally published at all. Publication is not the objective per se and there is little time or incentive to do that work. Database research has a discovery problem, I hear about most interesting new research via informal community channels and much of that never shows up in literature.

Thanks for the reply. I would be very interested if you could point me towards any private companies providing these kinds of database solutions. I am particularly interested in their story around concurrent ingestion + analytics workloads. This is a tricky problem that most existing works solve either by doing stupid things à la having a single-version system, or by using lots of synchronization plus complex version-friendly concurrent data structures. Solutions I have seen in academia are usually tested well below the 1TB data range, and I don't imagine they scale to the kind of data sizes you mention.

The fact that the research is being done outside of academia and not being published is pretty sad imo. Hopefully those in the know decide to publish at least tech-reports/whitepapers once they have a share of the market.

I'm not sure if shakti-db [0] fits the bill of what jandrewrogers is talking about, but the same folks behind it built one of the fastest time-series databases in-use today, the kdb+ [1].

[0] https://shakti.com

[1] https://news.ycombinator.com/item?id=13481824

Can you give names of closed-source database engines that have these kinds of performance improvements?

I mean, the mainstream RDBMSs like Oracle, DB2, etc. don't seem to be ahead of the open source databases; they are all stagnant too!

Databases are severely constrained by the architecture choices from when they were designed, you can't back port modern database architecture and computer science to e.g. an Oracle or DB2. To integrate new computer science advances you often need to write a new kernel from first principles. I sunset the designs I license to companies every several years, starting over from a clean sheet.

Most new high-performance database engines are intended to give the developing company a new qualitative uplift in capability, scale, or operational efficiency. No one sells public licenses these days. You've heard of the organizations that are building these semi-bespoke database engines, but they are intended for internal use only.

The reason no one sells these capabilities as a product anymore is pragmatic: it is extremely expensive to design a database engine for general public consumption and the economics are difficult to justify as an investment. But many large companies are willing to pay many millions of dollars for a narrowly focused database capabilities, and the reduced scope makes the development cost more palatable.

Was skimming your SpaceCurve writeup [1] and your mention of discrete internal components caught my eye. Are you open to expanding a bit on this statement:

"Discrete topology internals largely obviate secondary indexing"

[1]: https://www.jandrewrogers.com/2015/10/08/spacecurve/

Secondary indexing is a hack to address the reality that most indexing algorithms can only represent a single type of relationship efficiently and typically only in a single dimension. This is not a law of the universe, it is just how our algorithms tend to work. If you could eliminate secondary indexing without sacrificing selectivity, it would be a massive win for performance and scalability. However, this would require a single indexing algorithm for complex data models that preserved an arbitrary mix of relational, time-series, spatial, graph, etc relationships for searches and joins.

To make this work in a practical database engine, you can't index the data model per se but you can make it work by indexing a moduli space into which arbitrary data models can be mapped. These tend to naturally expose the topological structure of the underlying data model for computational purposes even though you are not computing on the data model per se. Designing very general moduli spaces for database purposes is non-trivial and, to make matters worse, they are pathologically incompatible with typical surrounding database infrastructure once you figure out how to construct them. But you can use the exposed topology to execute complex searches and joins on the underlying data model.

None of my database engines use secondary indexing at all, hence the excellent scaling and write performance, even for complex mixed-mode data models. A decade ago the representations were pretty brittle and limited because I didn't know how to express many things, but these days I know how to elegantly express just about every common data model.


> a moduli space into which arbitrary data models can be mapped

Very interesting. Somehow reminds me of using latices for deterministic concurrency. Is this a topic that is discussed in public literature or an innovation of yours? Love to learn more about this.

The work is mostly mine but I've had collaborators for some of the research over the years.

I accidentally invented it many years ago when I discovered an unorthodox but elegant solution to an algorithm problem which had stymied researchers for decades (and which I needed to solve for a specific application). Some months later, a research group I was working with were convinced that the unusual algorithm construction might be applicable to an unrelated algorithm problem they had been working on. Six weeks later I had solved their problem too by extending the concepts developed for my original algorithm.

By that point I realized that while those two very different algorithms were cool, the half-baked computer science I had developed to construct them was even cooler. I spent the next decade fleshing out, generalizing, and extending the scope of the computer science while figuring out how to efficiently express that theory in practical software (which is not trivial). While I wrote quite a few papers on this computer science many years ago when it was new, the distribution was required to be non-public. I sporadically teach bits of it but writing up hundreds of pages of old research in my spare time is a lot less fun than working on my backlog of interesting computer science projects.

Ah, the penny dropped for me here. Most likely you mean the content-addressable segmented architecture.

Actian Vector, formerly Vectorwise: https://docs.actian.com/vector/5.0/index.html

From the "vectorized columnar execution, operator at a time" school.

HyPer: https://hyper-db.de/

That's from the "row-oriented execution, query plan JIT compilation" school.

The hyper-db folks do great work. Indeed, Huanchen collaborated with Viktor Leis (one of the senior folks on HyPer) on the SuRF "Succinct Range Filter" (think Bloom filter but for range queries) work: http://www.cs.cmu.edu/~huanche1/publications/surf_paper.pdf and github: https://github.com/efficient/SuRF

and both Viktor and Thomas Neumann's work in this area is fantastic and everyone should go read it. :)

The code for several of Huanchen's projects is available in our group's github repo: https://github.com/efficient/

The order-preserving compression work outlined in his thesis will be appearing soon at SIGMOD, but there's no need to wait for it, since it's roughly the same in both. We'll get the source to that released by the time of SIGMOD.

Is your benchmark and data published as a paper or article somewhere. I would be interested to see it

Unfortunately, this database engine implementation (like virtually all I work on) was for a customer. The details are not public.

Note that by compression they mean keeping internal blocks closer to full. It looks like a good thesis, and the advisors are reputable.

But this comes with a trade-off. As blocks are kept full, inserts more often trigger a cascade effect. Batching inserts helps, but once you need to apply the batch, it could take a long time to rebalance potentially the whole tree. This adds a fat tail to insert times. But in many read-heavy scenarios it is a good trade-off.
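A toy model shows the effect: fixed-capacity sorted blocks that split when overfull. This is not the thesis's actual structure, but it illustrates why fuller blocks mean more reorganization per insert:

```python
import bisect
import random

class BlockedList:
    """Toy index: sorted keys in fixed-capacity blocks, splitting a block when it overflows."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = [[]]   # ordered list of sorted key runs
        self.splits = 0      # reorganizations triggered by inserts

    def insert(self, key: int) -> None:
        # Route the key to the last block whose first key is <= key.
        i = 0
        while i + 1 < len(self.blocks) and self.blocks[i + 1] and key >= self.blocks[i + 1][0]:
            i += 1
        block = self.blocks[i]
        bisect.insort(block, key)
        if len(block) > self.capacity:
            # Overflow: split in half. The fuller blocks are kept, the more often
            # an insert pays this reorganization cost.
            mid = len(block) // 2
            self.blocks[i:i + 1] = [block[:mid], block[mid:]]
            self.splits += 1

packed = BlockedList(capacity=4)    # blocks kept near full
slack = BlockedList(capacity=16)    # blocks with room to spare
keys = random.sample(range(100_000), 2000)
for k in keys:
    packed.insert(k)
    slack.insert(k)
# packed.splits ends up consistently several times slack.splits
```

Real B+Trees can cascade splits up the tree rather than just splitting one block, which is where the fat tail comes from.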

There was an aspect of that in the first part of the thesis, but the rest has techniques that are independent. You can roughly break the thesis down by paper:

  * Hybrid Indexes (a read-only, full-block kinda thing where you have to handle inserts by using a second read-write tree)
  * SuRF, a succinct range filter data structure
  * Order-Preserving Key Compression for In-Memory Search Trees
Each of them can be used independently of the others, or combined. You can find two of those papers on Huanchen's web page: http://www.cs.cmu.edu/~huanche1/ The third is to appear, but you can find a preprint on arXiv: https://arxiv.org/abs/2003.02391

Excellent. Great job!

How does it perform against Google's btree?


Off topic to OP's paper, but consider the case of VoltDB, an all-in-memory SQL DB with replication. In the sales-y material, the write speed was put down to fewer locks, latches, and disk I/O. But somewhere else I read that a big culprit in fact isn't that: it's formatting data to be written to disk and then decomposing a disk block back into memory for use. All-memory DBs avoid that. Thoughts?

Working my way through the paper. Looks very cool, and practical. It's also exceptionally well written. It's clear. Nice job.

Only read the conclusion and it's not mentioned there: did they consider locality and thus cache/paging misses?
