One short-string-compression library to compare with might be https://github.com/antirez/smaz by antirez, the author of Redis.
In the last decade TokuDB and then MyRocks showed how to do a fast storage engine for data too big to fit in a traditional RDBMS. They are multi-level, they do page-level compression, etc. And yet there are still so many easy wins to be had.
Generally, databases are full of inefficiencies and there is just so much low-hanging fruit; if the team working on e.g. MyRocks could be doubled in size and tasked with looking at the inefficiencies in the server as well as the storage engine, things might change. I have a list in my head of really promising ideas that databases don't actually do:
* Linux doesn't have a syscall batching system, but if it did, the number of context switches would be cut down dramatically. Research in 2010 (https://www.usenix.org/legacy/events/osdi10/tech/full_papers...) demonstrated this, and it wouldn't just be databases that benefit. These days context switching is more expensive than ever.
* Database engines all use blocking I/O. Finally io_uring offers workable async I/O, and that would benefit database storage engines immensely. See https://fosdem.org/2020/schedule/event/rust_techniques_sled/
* TokuDB showed that simply not tracking the affected row count could speed things up massively (TokuDB called it NOAR)
* Query engines often don't handle things that a few more lines of code could handle efficiently. I've got some tables with big compound keys and often do prefix searches on them; why isn't MySQL evaluating the constraints it can against the other columns in the key before dereferencing the row? Arrgh. Dumb dumb dumb. Etc.
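The last point - evaluating predicates against the other key columns before the expensive row dereference, often called index condition pushdown - can be sketched with a toy in-memory model. All names and data here are made up for illustration:

```python
# Sketch of index condition pushdown (toy in-memory model).
# An index entry carries the full compound key, so predicates on any
# key column can be checked *before* dereferencing the row.

# Compound index on (country, city): maps key -> row id.
index = [
    (("DE", "Berlin"), 1),
    (("DE", "Munich"), 2),
    (("US", "Boston"), 3),
    (("US", "Austin"), 4),
]

rows = {1: "row1", 2: "row2", 3: "row3", 4: "row4"}

fetches = 0
def fetch_row(rid):
    """Stand-in for the expensive heap/page lookup."""
    global fetches
    fetches += 1
    return rows[rid]

# Query: country = 'US' AND city LIKE 'A%'
# Without pushdown we'd dereference every 'US' entry (2 fetches);
# with pushdown we filter on the second key column first (1 fetch).
results = [
    fetch_row(rid)
    for key, rid in index
    if key[0] == "US" and key[1].startswith("A")  # checked on the index entry
]
print(results, fetches)  # ['row4'] 1
```

The win is entirely in `fetches`: the filter runs against data the index lookup already has in hand, so only matching rows ever touch the base table.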
Not generally, unless you consider only popular open-source databases on Linux to be the "general" case.
> Linux doesn't have a syscall batching system
Windows has scatter-gather I/O, and notably SQL Server uses it.
> Database engines all use blocking io
SQL Server almost exclusively uses asynchronous I/O, and the entire Windows kernel I/O stack has been async since the 1990s.
> I've got some tables with big compound keys and often do prefix searches in them, and why isn't mysql
That's because MySQL barely qualifies as a database engine. It has only a fraction of the features of any commercial database, but it is cheap and it has low latency for simple queries, which makes it popular.
PostgreSQL is substantially more feature-rich, but they made tradeoffs elsewhere in the design that result in high write amplification, unfortunately making it a no-go for many potential use cases.
You would be shocked at the concurrency levels achievable with modern in-memory databases, and the compression levels available with technologies like SQL Server's ColumnStore.
Unfortunately, all of that technology comes with a licensing cost...
Yes, so in the end it doesn't decrease TCO—it decreases OpEx but increases CapEx.
(Or, to put that another way: anything one SQL Server instance can do, five-to-ten Postgres read-replicas can do for the same price.)
Low-hanging fruit is "low-hanging" (i.e. tempting to pluck) because it'd be high-positive-ROI to solve these problems. Switching to an RDBMS that has already solved these problems, but has a licensing cost, is more likely just ROI-neutral.
Sincere Qs from a novice database developer: what good is all that async I/O when you also have to consider data integrity? How would you e.g. uphold the integrity of a file that stores the state of an autoincrementing field if you didn't use a singleton writer? In what DB layer would I be helped by async I/O?
Imagine a simple search like SELECT * FROM my_table WHERE foo = ?
Imagine that the column foo is indexed.
With blocking IO, we go through the foo index, finding the rows that have the correct value of foo.
The rows we find are on some kind of pages in the storage of the main table. So we now have a list of pages we have to fetch.
With blocking IO, we would fetch a page at a time, process it, then request the next page.
With async IO, we would request the fetching of all pages, as we discover we need them, and then process those pages in the order the IO subsystem can return them to us.
Writes are the same; we have some number of pages we want to write to disk. With blocking IO, we write a page at a time and wait for it to be written. With async you can buffer up the writes, let the OS write them in the order it chooses, and wait for them all to finish.
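The read pattern described above can be sketched with asyncio standing in for the I/O subsystem. This is a toy model - the page contents and latencies are made up - but it shows the structural difference: one outstanding request at a time versus all requests in flight, processed in arrival order:

```python
import asyncio
import random

# Toy model: "pages" take variable time to arrive, like real disk reads.
PAGES = {1: "alpha", 2: "beta", 3: "gamma", 4: "delta"}

async def read_page(page_id):
    # Stand-in for an async page read; latency varies per page.
    await asyncio.sleep(random.uniform(0.001, 0.01))
    return page_id, PAGES[page_id]

async def blocking_style(page_ids):
    # One page at a time: total latency is the *sum* of all reads.
    return [await read_page(p) for p in page_ids]

async def async_style(page_ids):
    # Issue all reads at once; process pages in whatever order they arrive,
    # so total latency approaches the *slowest single* read.
    tasks = [asyncio.create_task(read_page(p)) for p in page_ids]
    out = []
    for fut in asyncio.as_completed(tasks):
        out.append(await fut)  # arrival order, not request order
    return out

results = asyncio.run(async_style([1, 2, 3, 4]))
print(sorted(r[0] for r in results))  # [1, 2, 3, 4]
```

Either way every page is processed exactly once; only the waiting overlaps.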
Turns out this causes a lot of issues when working with large diverse teams in particular. If you have a team of 4 you can socialize the idea of avoiding situations like this.
I mean, you wouldn't get write-concurrency for that field. And in fact you'd avoid serial PKs in general in a write-concurrent design, in favor of e.g. client-generated UUIDs.
Where you'd get wins is when 1000 different clients are each opening a transaction and upserting 1000 rows, with, say, 50% of them being duplicates of other clients' rows. Rather than linearizing+deduping all of those right away, you can just write down all 1000 individual MVCC-world-state WAL log fragments concurrently, as 1000 queued write-batches. This lets you take actual advantage of all those write IOPS that RAIDed NVMe gets you.
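A toy model of that pattern - the WAL as a shared append-only log, with deduplication deferred to a later merge step rather than done on the hot write path (all names hypothetical):

```python
import threading

# Toy model: N clients each append a write-batch to a shared WAL without
# coordinating on duplicates; dedup happens once, at merge/apply time.

wal = []
wal_lock = threading.Lock()

def client(batch):
    with wal_lock:  # append-only on the hot path; no dedup here
        wal.append(batch)

threads = []
for cid in range(4):
    # Each client's batch deliberately overlaps with its neighbours'.
    batch = {("key", k): f"c{cid}" for k in range(cid, cid + 3)}
    t = threading.Thread(target=client, args=(batch,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()

# Merge step: duplicates collapse here, not at write time; the last
# batch applied wins for any overlapping key.
state = {}
for batch in wal:
    state.update(batch)
print(len(wal), sorted(k for _, k in state))  # 4 batches, keys 0..5
```

The point is that the expensive linearize-and-dedup work is batched and moved off the write path, so concurrent writers only contend on a cheap append.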
The CPU cost is increasing all the time due to all the cache attacks being revealed and the mitigations blowing away various cache contents, which is especially damaging in VM contexts like AWS where the mitigations can't be reasonably disabled.
Data integrity is a function of design and implementation correctness, in addition to having robust mechanisms for dealing with data corruption during I/O, hardware failures, etc.
Only in open source. Most advanced closed-source database engines have not been designed with blocking I/O in a very long time; I have not worked on any database engine with blocking I/O in the last decade.
This is an area of easy integer factor performance improvement for open source databases, and one of the primary reasons they are not competitive with the performance of many closed source designs.
Well, you already mention io_uring. That's your syscall batching system. It doesn't cover all existing syscalls because not everything is safe to do asynchronously.
Asynchronous io_uring is good for things that take a lot of time, e.g. I/O. It's really just a queue between user space and kernel threads.
But it doesn't replace just making a synchronous syscall for small things, e.g. stat'ing files.
A lot of those syscalls are called in quick succession, often with no logic between them.
The Soares paper I linked to showed that buffering up a list of syscalls, doing a single context switch into the kernel, executing them, and then switching back is a major gain for a wide variety of applications.
Synchronous syscall batching would be much more straightforward to adopt than async I/O; imagine if the C standard library's file routines all used buffered syscalls - everyone would benefit! Whereas adopting true async I/O would be very hard, because the library exposes a blocking API to its users.
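A rough sketch of what such a batched interface might look like. `submit_batch` is entirely hypothetical and is simulated in user space here; in a real FlexSC-style design the whole list would be handed to the kernel and executed on the far side of a single context switch:

```python
import os
import tempfile

# Hypothetical FlexSC-style interface: queue up syscall descriptors,
# submit once, read back all the results.  The "kernel side" below is
# simulated with ordinary Python calls; the idea is that a real
# implementation crosses into the kernel one time for the whole batch.

def submit_batch(batch):
    """Execute a list of (syscall_name, args) entries in one submission."""
    table = {"stat": os.stat, "open": os.open, "close": os.close}
    return [table[name](*args) for name, args in batch]

with tempfile.TemporaryDirectory() as d:
    paths = []
    for i in range(3):
        p = os.path.join(d, f"f{i}")
        with open(p, "w") as f:
            f.write("x" * (i + 1))
        paths.append(p)

    # Three stats, one (conceptual) context switch instead of three.
    stats = submit_batch([("stat", (p,)) for p in paths])
    sizes = [s.st_size for s in stats]
    print(sizes)  # [1, 2, 3]
```

The caller's code barely changes shape versus a loop of plain `os.stat` calls, which is why a buffered libc could adopt this transparently.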
Speaking of which, I noticed on macOS that the “du” command-line utility seems much slower than a third-party GUI utility called DaisyDisk at collecting the sizes of everything on a disk.
This has me wondering what DaisyDisk is doing, if anything, that “du” isn’t, to achieve this.
One theory of mine is that perhaps APFS caches metadata about file sizes and that this can be accessed faster than walking all directories and stat’ing all the things.
An alternative theory I have is that perhaps GUI apps just run with a higher priority.
If anyone knows, I’d be delighted to learn.
A third possibility is that all of this was just due to “noise” - for example, the disk happening to be under less I/O pressure when I ran DaisyDisk than when I’ve run “du”. My perception of their relative speed is based on very little scientific rigor.
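For what it’s worth, a single-pass scandir-style traversal is one plausible way a tool could beat a naive stat-per-file walk, since directory enumeration can return type (and on some platforms size) information alongside each name. This is a guess at the mechanism, not a claim about what DaisyDisk actually does:

```python
import os
import tempfile

def tree_size(path):
    """Recursive apparent size using one scandir stream per directory.

    DirEntry.is_dir() can often be answered from data the OS returned
    during enumeration (d_type on many Unix filesystems), avoiding a
    separate stat(2) per entry; on Windows even stat() is free.
    """
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                total += tree_size(entry.path)
            elif entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total

# Tiny demo tree: 5 bytes at the top level, 3 bytes one level down.
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "sub"))
    with open(os.path.join(d, "a"), "w") as f:
        f.write("12345")
    with open(os.path.join(d, "sub", "b"), "w") as f:
        f.write("123")
    size = tree_size(d)
print(size)  # 8
```

On APFS specifically there may be faster bulk-attribute paths than this, but even the scandir-level savings add up over millions of entries.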
As far as I remember, it is something the OS decides to turn on or off for different folders. It’s wasteful for /tmp, for example, because it makes file creation and deletion slightly slower.
I’m not entirely sure if du would want to tap into that. A lot has changed in macOS, and it’s not entirely clear if we still have a universal definition of “file size”. You can let iCloud store large files now, for example, truncating the local file when it is rarely accessed.
Don’t remember the exact name, sorry.
There is a macOS-native way to explore recursive size, by the way: About This Mac -> Storage -> Manage -> Documents -> File Browser. This is limited to your user folder by default, but you can just create a soft link to take you to /
Do you know how that translates to the BSD layer? Does such a file appear to exist and have a certain size, but is e.g. exclusively read-locked unless you manually download it? Or does open(2) block, trigger an iCloud download, and then unblock once it’s complete?
Perhaps not a problem for databases (which probably keep all relevant files open anyhow).
I thought I saw a PR fly by for mysql but I can't find it now so maybe I was imagining things.
So why not just take that set of DBMS services and put them in ring-0, where they won't need any context-switch overhead, will have fine-grained control over their own queuing for kernel resources, and where they can pass data structures by reference all the way from the network to the disk and back?
In Linux, we already have Open-iSCSI, which just has the control plane in userspace, while the data plane is entirely a Linux kernel service, gaining it all these advantages. This architecture works very well there; I'm unclear on why others attempting to provide the higher-level "data-management solutions", with the same high-throughput/low-latency requirements, haven't copied it.
Last month I delivered yet another database engine, benchmarked against the best open source comparables, which provides a rough but concrete example of the gap:
The designed memory:storage ratio was 1:1000, an order of magnitude higher than even the 1:100 ratio mentioned as aggressive in the paper. In fairness, my prior systems were designed much closer to a 1:100 ratio, and this one used new CS research to significantly extend the ratio without materially sacrificing performance. For data models with fairly complex indexing requirements, insertion performance was >100x(!) that of the best open source comparables.
A large part of this performance is due to dramatic improvements in cache efficiency that are not even particularly novel -- the gains attributable to improved cache efficiency in the paper are eminently believable. The data-to-index ratio in the above is around a million-to-one, small enough to fit in CPU cache for many TB scale data models. The high data-to-index ratio is largely attributable to using search structures that forego total order and balancing, which enables dramatic improvements in succinctness with minimal reductions in selectivity.
The other major contributor to performance is scheduler design, which wasn't really touched on in the paper and is largely ignored entirely in open source databases.
tl;dr: current open source database engine designs leave a massive amount of performance on the table due to very poor cache efficiency, and this paper correctly touches on some of the ways this is materially improved in closed source database engines.
> ...adaptive succinct indexing structures
We've been optimising for low memory, and compact/succinct data structures have been super impactful (especially static succinct radix trees and data sketches like CountMinSketch). Apart from the excellent sdsl-lite, roaring-bitmaps, apache data-sketches, and pizza-chilli, what other resources would you recommend we take a look at?
> ...storage ratio was 1:1000, an order of magnitude
This is insane. Is it due to compression techniques, or...? Could you point to a few reasons why the gap is this high?
> The other major contributor to performance is scheduler design...
Out of curiosity: Any pointers to papers that we can take a look at?
When working with storage this dense, write performance matters immensely if you can't wait (literally) a year to index your data model. Scaling write throughput on indexing structures has a problem of write sparsity: given a fixed amount of cache, there is a scaling point where virtually every single insert requires at least one page fault. Indexing structures like B+Trees that consume large amounts of space will not only push the data pages out of cache, the index itself may no longer fit into cache at high densities. Every database exhibits this characteristic write cliff with indexing structures but the cache:data ratio where this occurs varies with indexing algorithm design. (This also impacts query performance but we'll ignore that for now.)
Succinct indexing structures can eliminate the competition with data pages for cache space. Write sparsity is intrinsic though, so this only lets you maintain write performance for maybe another 1-2 orders of magnitude. Eventually the set of actively written data pages will exceed the size of the cache and write throughput will drop precipitously before leveling off.
Being able to index tens of terabytes of data at wire speed is much better than typical but at least an order of magnitude short of what is desirable for a petabyte of storage in a single server. Ideally, you'd want the flat write throughput to extend out to the end of your storage, which means extending the write cliff at least another order of magnitude. There is not an obvious way of doing this without adversely impacting performance in other unacceptable ways.
I started researching this problem a few years ago from a very different direction that does not require directly improving cache efficiency per se, since that is essentially tapped out as a strategy. Most software engineers don't understand the nature of caches as abstract mathematical objects but there are some interesting NP-Hard problems surrounding the behavior of caches that we essentially ignore for database architecture purposes because (1) NP-Hard and (2) exploitable efficient approximations are incompatible with many common database architectures in any case. I've worked through a couple new algorithm prototypes that attack these properties to significantly extend the performance party for even higher storage densities. I'm just starting to design kernels that are purpose-built to take advantage of this research but the heavy lifting is done in the schedulers.
Yes, I do this kind of thing for fun.
I’m also curious what application domains are storing close to petabytes on a single machine with only a few terabytes of DRAM.
This has become a great emerging market for new database tech, tremendous amounts of money are being spent on these data models and traditional data infrastructures are very poor for this. The size and velocity of the data implies that you need a single database instance that can run ingestion workloads concurrent with analysis workloads.
The research on caching algorithms for ultra-dense storage is new, it is literally just starting to be worked into production designs, so I wouldn't expect references. Most advanced database research is done outside academia, so a significant fraction is published long after implementation or not formally published at all. Publication is not the objective per se and there is little time or incentive to do that work. Database research has a discovery problem, I hear about most interesting new research via informal community channels and much of that never shows up in literature.
The fact that the research is being done outside of academia and not being published is pretty sad imo. Hopefully those in the know decide to publish at least tech-reports/whitepapers once they have a share of the market.
I mean, the mainstream RDBMS like Oracle, DB2 etc don't seem to be ahead of the open source databases; they are all stagnant too!
Most new high-performance database engines are intended to give the developing company a new qualitative uplift in capability, scale, or operational efficiency. No one sells public licenses these days. You've heard of the organizations that are building these semi-bespoke database engines, but they are intended for internal use only.
The reason no one sells these capabilities as a product anymore is pragmatic: it is extremely expensive to design a database engine for general public consumption, and the economics are difficult to justify as an investment. But many large companies are willing to pay many millions of dollars for narrowly focused database capabilities, and the reduced scope makes the development cost more palatable.
"Discrete topology internals largely obviate secondary indexing"
To make this work in a practical database engine, you can't index the data model per se but you can make it work by indexing a moduli space into which arbitrary data models can be mapped. These tend to naturally expose the topological structure of the underlying data model for computational purposes even though you are not computing on the data model per se. Designing very general moduli spaces for database purposes is non-trivial and, to make matters worse, they are pathologically incompatible with typical surrounding database infrastructure once you figure out how to construct them. But you can use the exposed topology to execute complex searches and joins on the underlying data model.
None of my database engines use secondary indexing at all, hence the excellent scaling and write performance, even for complex mixed-mode data models. A decade ago the representations were pretty brittle and limited because I didn't know how to express many things, but these days I know how to elegantly express just about every common data model.
> a moduli space into which arbitrary data models can be mapped
Very interesting. Somehow reminds me of using lattices for deterministic concurrency. Is this a topic that is discussed in public literature or an innovation of yours? Love to learn more about this.
I accidentally invented it many years ago when I discovered an unorthodox but elegant solution to an algorithm problem which had stymied researchers for decades (and which I needed to solve for a specific application). Some months later, a research group I was working with were convinced that the unusual algorithm construction might be applicable to an unrelated algorithm problem they had been working on. Six weeks later I had solved their problem too by extending the concepts developed for my original algorithm.
By that point I realized that while those two very different algorithms were cool, the half-baked computer science I had developed to construct them was even cooler. I spent the next decade fleshing out, generalizing, and extending the scope of the computer science while figuring out how to efficiently express that theory in practical software (which is not trivial). While I wrote quite a few papers on this computer science many years ago when it was new, the distribution was required to be non-public. I sporadically teach bits of it but writing up hundreds of pages of old research in my spare time is a lot less fun than working on my backlog of interesting computer science projects.
From the "vectorized columnar execution, operator at a time" school.
That's from the "row-oriented execution, query plan JIT compilation" school.
and both Viktor's and Thomas Neumann's work in this area is fantastic and everyone should go read it. :)
The code for several of Huanchen's projects is available in our group's github repo: https://github.com/efficient/
The order-preserving compression work outlined in his thesis will be appearing soon at SIGMOD, but there's no need to wait for it, since it's roughly the same in both. We'll get the source to that released by the time of SIGMOD.
But this comes with a trade-off. As blocks fill up, inserts more often trigger a cascade effect. Batching inserts helps, but once you need to apply the batch, that could take a long time and potentially rebalance the whole tree. This adds a fat tail to insert times. But in many read-heavy scenarios it is a good trade-off.
* Hybrid Indexes (a read-only, full-block kinda thing where you have to handle inserts by using a second read-write tree)
* SuRF, a succinct range filter data structure
* Order-Preserving Key Compression for In-Memory Search Trees
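The hybrid-index idea in the first bullet - a compact read-only structure plus a small read-write side, merged periodically - can be sketched roughly as follows. The structure names and merge threshold are illustrative, not taken from the actual Hybrid Indexes implementation:

```python
import bisect

class HybridIndex:
    """Sketch of a hybrid index: a compact sorted read-only array plus a
    small mutable delta.  Inserts go to the delta; when it grows past a
    threshold, it is merged into a rebuilt read-only structure.  The
    merge is where the fat-tail insert latency mentioned above lives."""

    def __init__(self, merge_threshold=4):
        self.static_keys = []   # sorted, read-only between merges
        self.static_vals = []
        self.delta = {}         # small read-write side
        self.merge_threshold = merge_threshold

    def insert(self, key, val):
        self.delta[key] = val
        if len(self.delta) >= self.merge_threshold:
            self._merge()       # occasionally expensive: rebuilds everything

    def _merge(self):
        merged = dict(zip(self.static_keys, self.static_vals))
        merged.update(self.delta)           # newest data wins
        items = sorted(merged.items())
        self.static_keys = [k for k, _ in items]
        self.static_vals = [v for _, v in items]
        self.delta = {}

    def get(self, key):
        if key in self.delta:               # check the write side first
            return self.delta[key]
        i = bisect.bisect_left(self.static_keys, key)
        if i < len(self.static_keys) and self.static_keys[i] == key:
            return self.static_vals[i]
        return None

idx = HybridIndex()
for k in [5, 1, 9, 3, 7]:
    idx.insert(k, f"v{k}")
print(idx.get(3), idx.get(7), idx.get(2))  # v3 v7 None
```

A real implementation would use a far more compact read-only representation than a sorted array (that's where the succinct-structure work comes in), but the two-sided lookup and the periodic merge are the essence of the design.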