I've always asserted that the reason we've never seen a proper kernel bypass database engine in open source is that the Minimum Viable Product is too complex. A bare, stripped-down, low-level database engine that does full bypass of the operating system is usually at least 100kLoC of low-level C/C++, and that is before you add all the features a database user will actually want. That is a big initial investment by some people with fairly rare software design skills.
Not only is the initial development expensive, so is the maintenance burden: it makes every new idea cost more to implement.
Postgres has been extraordinarily innovative, offering things like transactional DDL, advanced indexing, first-class extensibility, serializable snapshot isolation, per-transaction durability, sophisticated constraints (e.g. non-overlapping ranges), etc. These features are possible because postgres didn't get bogged down reimplementing and maintaining a filesystem.
That being said, there are some low-level features that are really worth doing. Robert Haas (the author) did some great work with lockless algorithms, achieving great concurrency with a manageable maintenance burden.
You are correct that the initial development is steep. However, once the infrastructure is there it really is not much different than working with the operating system infrastructure and you gain a level of predictability and stability in terms of behavior that saves engineering time. Also, bypass implementations have almost no locking internally (either "lock-free" types or heavier types) which reduces complexity considerably.
Some bypass kernel code bases allow you to compile with the bypass implementation disabled, using highly-optimized PostgreSQL-like internals. I've seen and run quite a few comparative benchmarks on the same design with and without bypass enabled, as well as absolute benchmarks against engines like PostgreSQL. We don't have to guess about single node performance.
Broadly speaking, a properly designed bypass kernel buys you 2-3x the throughput of a highly optimized non-bypass kernel in my experience. If it were only 25%, no one would bother. Furthermore, for massively parallel databases, you essentially require a bypass kernel to design a well-behaved system, due to the adaptive operation scheduling requirements.
I agree that it is a lot of work but it is also entirely worth it if you need to either (1) maximize throughput on a single node or (2) build a well-behaved massively parallel database kernel. The differences are not trivial.
Also, you dismiss ideas that help the database and the OS work together better. For instance, I did "synchronized scans" for postgres. It coordinates sequential scans to start from the block another scan is already reading, improving cache behavior and dramatically reducing seeks. This could have been done by lots of extra code controlling the I/O very carefully (as at least one paper seemed to suggest was a good idea). But I chose to do it the simple way, just start the scan off in the same place as another scan, and concurrent scans got almost ideal behavior -- each ran in about the same time as if no other scan was in progress (with no overhead in the single scan case).
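To make the "simple way" concrete, here is a minimal sketch of the synchronized-scan idea (hypothetical names, ignoring locking and wraparound bookkeeping — not the actual postgres code): each sequential scan periodically publishes the block it is currently reading, and a newly starting scan begins there instead of at block 0, so concurrent scans share the same cached pages.

```c
/* Minimal sketch of synchronized scans (hypothetical names).
 * A scan reports its position as it goes; a new scan on the same
 * table starts from that position (then wraps around), so both
 * scans read the same blocks at roughly the same time. */

static long scan_hint = -1;   /* position of the most recent active scan */

/* Where should a new sequential scan start? */
long scan_choose_start(void)
{
    return scan_hint >= 0 ? scan_hint : 0;
}

/* Called by a running scan as it advances through the table. */
void scan_report_position(long blockno)
{
    scan_hint = blockno;      /* real code would need atomicity/locking */
}
```

The first scan starts at block 0 and reports its progress; a second scan that arrives later piggybacks on the first one's current position instead of seeking back to the start of the table.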
Linux is clearly interested in allowing more hooks and making them more useful. From an engineering standpoint, that makes more sense to me.
Two other points:
* I'm a little skeptical that such a bypass can easily be made resilient to some strange/degenerate cases.
* You say that the reason an open source system won't do it is because the MVP is too expensive. But the MVP for a cost-based optimizer is also very expensive, and postgres has one of those. I think that was a much better investment than investment in the filesystem/scheduling layer.
While the increased throughput is a complex function of hardware, workload, etc., it is also consistently substantial. The reason it works is simple: the database processes have a nearly omniscient view of hardware and state, and there is (in modern designs) only a single process per core. Consequently, even if you have thousands of concurrent high-level database operations, each process can dynamically select and continuously reorder the low-level operations to (nearly) optimally maximize throughput for the execution graph at that moment, because the execution is completely cooperative. You can do the "synchronized scan" optimization for CPU caches that you do for disk systems. You can schedule around any conflicts in the execution graph, and even the impact of outside CPU interrupts can be detected and optimized around. And it is easy to track the aggregate costs of these choices. To the extent possible, every clock cycle is spent on end-user database work instead of database internals overhead.
So minimal processing stalls, micro or macro, and no context-switching or coordination overhead. All combined with incredible locality knowledge (by inference) that is not available if you let the OS manage things for you.
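As a toy illustration of the scheduling half of this (hypothetical structures, not any real engine's code): with one process per core and a fully cooperative execution graph, the inner loop can simply run whichever ready operation the runtime estimates is cheapest right now, instead of letting the OS time-slice threads.

```c
#include <stddef.h>

/* Toy sketch of cooperative per-core scheduling. Each low-level
 * operation carries a readiness flag (dependencies in the execution
 * graph satisfied) and the runtime's own cost estimate for running
 * it at this moment (e.g. lower if its data is already in cache). */
struct op {
    int ready;       /* can this op run now? */
    int est_cost;    /* estimated cost of running it right now */
};

/* Pick the index of the cheapest ready op, or -1 if none is ready.
 * A real engine would also re-estimate costs as the cache state and
 * execution graph change between iterations. */
int pick_next(const struct op *ops, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (ops[i].ready && (best < 0 || ops[i].est_cost < ops[best].est_cost))
            best = (int)i;
    return best;
}
```

Because the process never blocks in the kernel and never gets preempted by a sibling, this selection can be redone continuously and cheaply, which is the "dynamic reordering" described above.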
On your other two points:
- Bypass is generally more resilient partly because the software has more explicit and immediate knowledge of the nature of the fault and can do something sensible about it. Obviously you have to handle faults when they occur. A lot of OS behavior when faults occur is pathological from the standpoint of optimizing databases. It is like memory management in C; it requires extra effort but also adds extra power if you handle it well.
- Postgres has expensive capability add-ons to an existing, useful system so it is more incremental in nature. The problem with OS bypass database kernels (and I learned this the hard way) is that (1) they are huge in terms of LoC long before rudimentary functionality is available and (2) it takes many years of atypical software design experience to be competent at trying to write one. It could be done, but it would require a critical mass of a tiny demographic willing to do a lot of work. My argument in this regard was less about inevitability and more about statistical probability.
I spent a lot of years hacking on and customizing Postgres. I recommend it to anyone and everyone that will listen because it is a great piece of engineering and would still use it for many OLTP systems. But it does leave a lot of performance on the table for a variety of reasons that probably make sense for a portable, open source project. The fact remains that I can design and have built bypass kernels that are substantially faster largely by exploiting the optimizations bypassing offers.
Postgres leaves a lot of performance on the table in much more basic ways, too, so I certainly am not suggesting that postgres is anywhere near optimal.
What are the key things that a kernel bypass version does different? Can these be separated out in a concise way which would lead to multiple DB implementations being able to use these same interfaces? Essentially for any major DB system, you'd want the kernel tailored anyway - you're not going to be doing much else on your DB server (are you?)
However, 3-6 months later you'll get comparable performance from improved kernels, CPUs, and disks. Are those 20% in performance for 6 months worth the premium Oracle is charging (which, in part, reflects their harder work)? For most customers, most of the time, the answer is no.
If you depend on performance, you don't use Oracle in the first place - Vyahu, OneTick, kdb, Vertica are the speed demons (as well as TimesTen which was acquired by Oracle - but is distinct from their "standard" offering)
Once you have these resources, you can organize them and use them as you see fit. Because it is not going to the kernel for any resources or buffering or scheduling or memory etc, there is little opportunity for the OS to do the wrong thing with resources that already are tightly controlled by the runtime. However, this is also why it is an "all or nothing" kind of situation.
Most commercial analytical databases are not bypass, due in large part to the fact that most of them are based on Postgres, ironically.
Maybe I don't understand this stuff, but maybe the kernel should have some bypass API for high performance applications, instead of coders finding curious ways to fight it.
To be clear, while the bypass APIs are simple to use you actually have to know what you are doing since you become responsible for doing things the OS used to do for you. I/O scheduling, disk caching, process scheduling, memory management, etc all have to be reimplemented in userspace.
It is why I mentioned that the skill set required to build bypass kernels is fairly rarefied. You can't just reimplement what the OS already does; you need to implement something that is different from the OS design but also better at providing the functionality the OS provides for the use case. You are essentially writing a purpose-optimized OS without the device drivers.
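As one small example of "reimplementing what the page cache used to do for you", here is a minimal sketch of a userspace block cache (hypothetical structure; a real engine would use a far more sophisticated replacement policy than this direct-mapped one):

```c
#include <stddef.h>
#include <string.h>

/* Toy direct-mapped userspace block cache: one of the pieces a bypass
 * kernel must provide once the OS page cache is out of the picture. */
#define CACHE_SLOTS 1024
#define BLOCK_SIZE  4096

struct cache_slot {
    long blockno;            /* disk block held in this slot; -1 = empty */
    char data[BLOCK_SIZE];
};

static struct cache_slot cache[CACHE_SLOTS];

void cache_init(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        cache[i].blockno = -1;
}

/* Return a pointer to the cached copy of blockno, or NULL on a miss. */
char *cache_lookup(long blockno)
{
    struct cache_slot *s = &cache[blockno % CACHE_SLOTS];
    return s->blockno == blockno ? s->data : NULL;
}

/* Install a block, silently evicting whatever shared its slot. */
void cache_insert(long blockno, const char *data)
{
    struct cache_slot *s = &cache[blockno % CACHE_SLOTS];
    s->blockno = blockno;
    memcpy(s->data, data, BLOCK_SIZE);
}
```

This is trivial on its own; the hard part is that eviction, write-back scheduling, read-ahead, and sizing decisions all become your problem too, and have to be better than the OS's generic answers to justify the effort.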
You think that VMs aren't subject to the host OS scheduler, caching, and memory allocation quirks?
eh, it could pin threads at realtime priority
vm wouldn't use the OS cache
it can easily allocate memory up front
Stuff like this really needs to make it into the PG tuning guide (https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv...). The only place where it will ultimately be seen by a worthwhile audience.
What are you, a school bully like the ones we see on American shows?
I can't wait for "Subtly Bad Things Linux May Be Doing To 2048"
Since when is "in depth" and "technical" equivalent to "infinitely nerdy"? We're professionals using that kind of information for work, not nerds doing infinitely nerdy stuff.
Could anybody wiser than me tell me whether I should be concerned, and what the possible implications of these decisions are? Should I invest in alternative platforms?
If you are running a huge database and need every possible bit of performance this will matter, otherwise it's not something to worry about.
It was always a fight with the storage guys, because they wanted to use their fancy Veritas File System for optimizing disk utilization, and us prima-donna DBAs wanted raw LUNs and allow the database engine to manage our disk, because it maximized our transaction throughput. Some DBAs even wanted whole disks allocated, so they could control where data lived from a disk geometry POV. There were (mostly) valid arguments for doing this, most of which have gone away over the years.
This is an issue like my disk issue -- corner cases that need to be thought about in situations where you are investing lots of engineering effort into your databases. If you don't have a couple of angry DBAs whom you're always arguing with, you don't need to worry about this.
But it's one of a myriad of little things, not something that could inform a platform decision. It's much more interesting for kernel devs than it is for postgresql users.
I think what this shows is that issues related to interactions between RAM, caches and CPU cores are becoming a lot more complex on all platforms.
The second issue is relevant for postgresql mostly only if you use a very large shared_buffers, which is not recommended for general workloads anyway. Writing a page that exists on disk and was not read shortly before is not an especially common thing to do.
ZFS enables some interesting things for pgsql:
* http://open-zfs.org/wiki/Performance_tuning#PostgreSQL - it seems like the primarycache setting prevents the double buffering problem that Linux' page cache has
I run pgsql on FreeBSD/ZFS and have no complaints but am not taxing the system.
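For reference, the primarycache tuning mentioned in that link is a one-line dataset property; pairing it with an 8 KiB recordsize is commonly suggested to match Postgres' page size (the dataset name below is a placeholder):

```shell
# Keep only metadata in the ARC for the PostgreSQL dataset, letting
# shared_buffers do the data caching (avoids double buffering).
# "tank/pgdata" is a hypothetical dataset name.
zfs set primarycache=metadata tank/pgdata
zfs set recordsize=8k tank/pgdata    # match PostgreSQL's 8 KiB pages
```

Whether primarycache=metadata actually helps depends on how much of the working set fits in shared_buffers, so benchmark before committing to it.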
One clarification: Oracle can't actually cherry-pick back OpenZFS bug fixes because they are (ironically) violating the CDDL by not making available source code. This isn't an issue for the code for which they hold copyright -- but that doesn't include any of the bug fixes and features that we've seen in OpenZFS since 2010. And yes, it is absolutely negligent, but of a different sort than you intended...
SHMMAXPGS, SHMMAX in postgresql.conf
It was necessary to increase the default limits for larger shared_buffers, etc on pgsql prior to 9.3 where SysV shared memory was used, but this was common to many *nix operating systems.
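For anyone who still runs into this on an older install: on Linux the limits in question are ordinary sysctls, along these lines (values purely illustrative, not recommendations):

```shell
# /etc/sysctl.conf -- raise SysV shared memory limits for a large
# shared_buffers on PostgreSQL < 9.3 (illustrative values only)
kernel.shmmax = 17179869184    # largest single segment, in bytes (16 GiB)
kernel.shmall = 4194304        # total shared memory, in system pages
```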
You can demonstrate this with a program like the following:
#include <fcntl.h>
#include <unistd.h>

static char pattern[8192];          /* two full 4 KiB pages */

int main(void)
{
    size_t i;
    int fd;

    for (i = 0; i < sizeof pattern; i++)
        pattern[i] = 'X';
    fd = open("testfile", O_RDWR | O_CREAT | O_EXCL, 0666);
    if (fd < 0)
        return 1;
    pwrite(fd, pattern, sizeof pattern, 0);
    posix_fadvise(fd, 0, sizeof pattern, POSIX_FADV_DONTNEED);
    return 0;
}
On the other hand, if you subtract one from the size of 'pattern', you'll see that you also get reads (as partially writing the last page requires a read-modify-write cycle).
2: Not sure; this may be specific to the FS, or something to do with the behaviour of mmapped files. I also don't know how you guarantee that what you're writing corresponds to a single block in the FS (unless you're writing directly to /dev/sda, and maybe not even then).
So basically most physical servers.
In other words: could a database have only minimal built-in caching and instead rely on the OS cache?
Consider an inventory system for a big box retailer. I can't think of anything better than a fat-ass RDBMS as the primary data store. Sharding sounds like a horrific idea. There are myriad workloads like this.
Personally, I've seen pgsql handle terabytes of data just fine and it wasn't really noteworthy or a source of problems to even bring up considering something else. YMMV but it's a good idea to use logic and reason to dictate architecture instead of following the shiny thing or hubris.
Everybody knows that relational databases don't scale because they use JOINs and write to disk.
Also, relational databases weren't built for web scale. MongoDB handles web scale. You turn it on and it scales right up.
And before you knock shards, shards are the secret ingredient in the web scale sauce. They just work.
Furthermore, relational databases have impedance mismatch, and Postgresql is slow as a dog. MongoDB will run circles around Postgresql because MongoDB is web scale.
Edit: Whoops, just read your reply :)
"without a clear indication of the author's intent, it is difficult or impossible to tell the difference between an expression of sincere extremism and a parody of extremism"
I am a happy Postgres user and always default to it unless I am really sure a project calls for something else.
And same here, I tell people to start their datastore selection with looking for a reason NOT to use Postgres.
However, if you have any of the following: (1) vastly different security requirements for different parts of your datastore (2) vastly different backup schedules or temporal sensitivities (3) privacy requirements deriving from different legal jurisdictions (4) wish to scale by running on commodity hardware (5) cannot tolerate any downtime whatsoever ... and probably many other cases ... then in my experience you are going to meet some serious issues with conventional RDBMS, at least with the vast majority of configurations.
I'm all for logic and reason too... but your comments seem closer to name-calling and a single example.
Anyway, in a lot of environments it's applications that drive the choice of database engine, not the other way round.
I can't think of anything that is significantly easier or better at addressing your numbers, especially all together. #4 seems less relevant: is it really cheaper than operationalizing a distributed system? These days, likely only for situations where consistency can be relaxed. Not so for many business workloads.
Can you enlighten us with some example products for your numbers?
IN DEFENSE OF BEING LAZY AS A PROGRAMMER
The essential mission of a computer programmer is to use computers to solve problems. Being lazy can come in one of two forms:
1) Solving problems badly or not solving them at all, or
2) Relying on someone else's solution instead of coming up with your own.
Using a RDBMS is Type-2 Lazy. Now, I want you to get out a pen and paper and write this next bit down, because it is the most important thing you will ever learn:
EVERYBODY SHOULD BE TYPE-2 LAZY BY DEFAULT, ONLY DEVIATING FROM THIS IF THERE IS A COMPELLING REASON NOT TO.
1) Other people's solutions have been used, which means they've been tested in real-world use. Problems you haven't thought of yet (because you don't yet have a working solution) have at least been discovered by the people using it. Sometimes they've even been addressed.
2) Other people's solutions may have tools, documentation and communities built around them, making them easier to learn about, use and work with.
There are two decades of work put into Postgres itself, and even longer periods of work put into the general field of relational databases. Corner cases you can't even conceive of have been encountered and patched for. The entire codebase of Postgres contains large amounts of accumulated wisdom on how to store data in a safe and retrievable fashion. And large communities have sprung up, to provide you with tools and wisdom on how to use it to best suit your needs.
NoSQL databases are useful for certain workloads and setups. It would be absolutely wrong to dismiss them out of hand. Having said that, anyone whose DEFAULT PREFERENCE is to eschew traditional RDBMS as a data store in favor of software that has been around for less than a quarter of the time that even the newer of the popular RDBMS systems have been around because using well-tested solutions is LAZY needs to have a restraining order keeping them at least 100 yards away from a keyboard.