- Row-level anything introduces write alignment and fsync alignment problems; pages are easier to align than arbitrary-sized rows
- PostgreSQL is very conservative (maybe extremely) conservative about data safety (mostly achieved via fsync-ing at the right times), and that propagates through the IO stack, including SSD firmware, to cause slowdowns
- MVCC is very nice for concurrent access - the Oriole doc doesn't say with what concurrency are the graphs achieved
- The title of the Oriole doc and its intro text center about solving VACUUM, which is of course a good goal, but I don't think they show that the "square wave" graphs they achieve for PostgreSQL are really in majority caused by VACUUM. Other benchmarks, like Percona's (https://www.percona.com/blog/evaluating-checkpointing-in-pos...) don't yield this very distinctive square wave pattern.
I'm sure the authors are aware of these issues, so maybe they will write an overview of how they approached them.
> - Row-level anything introduces write alignment and fsync alignment problems; pages are easier to align than arbitrary-sized rows
OrioleDB uses row-level WAL, but still uses pages. The row-level WAL becomes possible thanks to copy-on-write checkpoints, providing structurally consistent images of B-tree. Check the architecture docs for details.
https://github.com/orioledb/orioledb/blob/main/doc/arch.md
> - PostgreSQL is very conservative (maybe extremely) conservative about data safety (mostly achieved via fsync-ing at the right times), and that propagates through the IO stack, including SSD firmware, to cause slowdowns
This is why our first goal is to become pure extension. Becoming part of PostgreSQL would require test of time.
> - MVCC is very nice for concurrent access - the Oriole doc doesn't say with what concurrency are the graphs achieved
Good catch. I've added information about VM type and concurrency to the blog post.
> - The title of the Oriole doc and its intro text center about solving VACUUM, which is of course a good goal, but I don't think they show that the "square wave" graphs they achieve for PostgreSQL are really in majority caused by VACUUM. Other benchmarks, like Percona's (https://www.percona.com/blog/evaluating-checkpointing-in-pos...) don't yield this very distinctive square wave pattern.
Yes, it's true. The square patters is because of checkpointing. The reason of improvements here is actually not VACUUM, but modification of relevant indexes only (and row-level WAL, which decreases overall IO).
How do you plan to make your new project keep up to date with the release cadence of the parent project?
...because otherwise, I can't see how this is a good idea.
Look, I have the same reaction whenever someone does this.
If someone goes and forks rust and creates a new programming language call dust that solves I dunno, the fundamental async compatibility story, or adds (somehow) a zero cost native GC type back into the language, I'd say the same thing.
You've taken a big open source project, forked it and laid some significant changes on it, which you don't believe this be accepted upstream.
Ok...is this a toy that you made for fun?
...or a serious project you expect to maintain?
If the answer is 'serious project', please make explicit your plans to avoid becoming abandonware in the future, your plans to fold future release from (original project) into yours, or your plans to diverge henceforth into an entirely new project.
To be fair, I get it, this is an extension that seems like it could... probably... receive changes that are made upstream in postgres; but, if it was that easy, it belongs as part of the postgres projecct; so, I guess, it's not that easy.
The author wrote this, answering the question in a reply on the post:
> Yes, sure! But that's the long way to go. Right now OrioleDB is an extension, which comes with PostgreSQL core patch. The mid-term goal for OrioleDB is to become a pure extension. The long-term goal is to make OrioleDB part of PostgreSQL core.
are you considering using this in production somewhere in the next few days? your reply comes off as absurdly aggressive, especially when you mention no intention of supporting the project monetarily. and that's on top of this question being addressed already, as other commenters pointed out.
It could likely have been better phrased but I found it more pointed than aggressive.
There've been quite a few postgres forks that have either died or stayed based on 7.x versions and when you're replacing the entire storage engine - and hence also the on-disk format - migrating away from it if circumstances require it later is going to be annoyingly non-trivial.
So while I think I agree with "don't come in so hot", "absurdly aggressive" may nonetheless be over-egging it slightly given the context.
OrioleDB is based on earlier work done in the core Postgres to introduce a storage extension framework called Table Access Methods.
This phase adds significant enhancements to make the OrioleDB extension feasible and aggressively performant.. and the delta code to be committed upstream is less than 2K LOC.
Comparing this project to earlier forks that got stuck at 7.x and 8.x would be a huge disservice to the maturity and extensibility of the Postgres project.
On your latter point, OrioleDB does not “replace” the built-in storage engine (which works quite well for many many use-cases), it “augments” the core capabilities with an additional storage engine optimised for many use-cases where the legacy engine struggles.
> .. and the delta code to be committed upstream is less than 2K LOC.
Then mainline it?
The last commit that /postgres/postgres and /orioledb/postgres share (1) is 6 months old for 15.2?
I mean, this is literally what I'm pointing out in my comment; you're chasing a moving target. Every change postgres makes, you have to merge in check it doesn't conflict, roll out a new patch set...
...and, you're already falling behind on it.
So, going forward, how do you plan to keep up to date...? ...because it looks to me, like a < 2K LOC isn't a fix for this problem; it hasn't solved it for you, as time goes forwards, it will continue not to be a solution to the problem; annd...
> Currently the changes needed to Postgres core are less than a 1000 lines of code. Due to the separate development schedules for Postgres and OrioleDB, these changes cannot be unstreamed in time for v.15. (2)
^ They were < 1000 lines a year ago, apparently. So the problem is getting worse over time.
Seems like the solution is mainlining those changes... not just intending to mainline them.
Until then... I really just see this as a postgres fork.
Reasonable points. As an onlooker who has run into VACUUM in years past, I have wondered: is this a fundamental necessity, or could it possibly be fixed? Seeing an example of it being fixed is certainly helpful. If I work at a company with the resources to maintain a Postgres fork, great! If not, we can evaluate whether the challenges of using this fork are worth the performance benefit.
I with people would stop with the "Uber migrated from Postgres to MySQL" thing. Uber migrated from Postgres used as relational database to something that is basically their own non-relational database using MySQL as distributed key-value store. It is not really situation applicable to most users of Postgres.
Anyway, this design of MVCC which moves older data into undo logs / segments is used by Oracle DB, so it definitely works. The common challenge with it is that reading older versions of data is slower, because you have to look it up in a log, and sometimes the data is removed from the log before your transactions finishes, getting the dreaded "Snapshot Too Old" error.
E: I don't see in the article when rows get evicted from the undo logs. If when they are no longer needed, I'm not sure where the improvement comes from because it should be similar amount of bookkeeping? If it's a circular buffer that can ran out of space like Oracle does it that would mean under high write load long-running transactions starts to fail which is pretty unpleasant.
> E: I don't see in the article when rows get evicted from the undo logs.
The undo records are truncated once they aren't needed for any transaction.
> If when they are no longer needed, I'm not sure where the improvement comes from because it should be similar amount of bookkeeping?
It depends on what exactly is "bookkeeping".
If we consider amount of work, then improvement comes because old undo records can be just bulk deleted very cheap (corresponding files get unliked). No vacuum scan is needed.
If we consider amount of space occupied, then indeed the same amount of versions take the same amount of space. But saving old versions of rows in the separate storage can save their primary storage from long-term degradation. Also, note that OrioleDB implements automatic merging of sparse pages.
> If it's a circular buffer that can ran out of space like Oracle does it that would mean under high write load long-running transactions starts to fail which is pretty unpleasant.
OrioleDB implements in-memory circular buffer for undo logs. Once circular buffer can't handle all the undo records, least recent records are evicted to the storage. Currently, we don't place limitation on the site of undo logs. Undo records are kept while any transaction can need them. So, no "Snapshot Too Old" errors. However, we can consider implementing this Oracle-like error as an option, which allows to limit the undo size.
That's because they don't store non-current versions of rows in the table itself, so why would they need a vacuum? MySQL does need to vacuum indexes, however.
I'm pretty sure SQL Server and MySQL use locking instead of MultiVersion Concurrency Control so they don't keep more copies of data around. No vacuum needed but there's a possibility of things blocking.
InnoDB (MySQL's default storage engine) implements MVCC using undo logging and background purge threads. It scales to highly concurrent OLTP workloads quite well. It doesn't work well with OLAP workloads / long-running transactions though. The oldest active transaction will block purging of anything newer than that transaction's snapshot.
Yes, but doesn't it require opt-in to enable snapshopt isolation? Most T-SQL devs will probably default to locking (TABLLOCK, etc) becuase that's what the bulk of google search results for "how do I fix my broken query?" tell people to do: it's only very, very rarely do I see a stackoverflow or dba.se answer that mentions MVCC-related topics.
I'm excited about the title, but I have to say that my initial impression has left me frustrated. The main README on GitHub[1] smells of corporate-speak. So far I've learned that:
- OrioleDB is a new storage engine for PostgreSQL
- PostgreSQL is most-loved (whatever that means)
- OrioleDB is an extension that builds on.. other extensions?
- OrioleDB opens the door to the cloud!
In the wake of crypto and other Web 3.0 grift, this is not the tact that I'd take to release something that extends and improves on something as important as PostgreSQL.
> OrioleDB is an extension that builds on.. other extensions
I assume you are referring to this part:
> OrioleDB consists of an extension, building on the innovative table access method framework and other standard Postgres extension interfaces.
I don't know how they could be more clear? Table access methods were introduced in PostgreSQL to support alternative storage methods (like zheap, which tries to do something very similar, or possibly columnar data stores).
Mentioning this fact is important, because there are a bunch of forks of PostgreSQL with alternative data storage systems; this is designed to work as an extension for an unforked PostgreSQL. (It doesn't yet)
The Readme seems very clear if you are familiar with PostgreSQL.
The PostgresBuild 2021 slides of OrioleDB [1] (also linked in the GitHub project's readme) mention that there is a 1K LoC patch that adds features to the extension interface. I guess the patch is larger by now in 2023.
Oriole's design seems to require transaction-aware indexes with point entry removal, which has its own cost.
E.g. a GiST equivalent (for e.g. spatial indexes) would be a hassle to maintain due to its nature of having no precise knowledge about the location of each index tuple, GIN (e.g. FTS indexing) could be extremely bulky due to a lack of compressibility in posting trees, and I can't imagine how they'd implement an equivalent to BRIN (which allows for quickly eliminating huge portions of a physical table from a query result if they contain no interesting data), given their use of index-organized tables. Sure, you can partition on PK ranges instead of block ranges, but value density in a primary key can vary wildly over both time and value range.
Does the author have any info on how they plan to implement these more complex (but extremely useful) index methods?
This doesn't even consider the issues that might appear if the ordering rules (collation) change. Postgres' heap and vacuuming is ordering-unaware, meaning you can often fix corruption caused by collation changes by removing and reinserting the rows that are in the wrong location after the collation changed, with vacuum eventually getting rid of the broken tuples. I'm not sure Oriole can do that, as it won't be able to find the original tuple that it needed to remove with point lookup queries, thus probably requiring a full index rebuild to fix known corruption cases in the index, which sounds like a lot of additional maintenance.
> Does the author have any info on how they plan to implement these more complex (but extremely useful) index methods?
Regarding GiST analogue my plan is to build B-tree over some space-filling curve. Also, I'm planning to add union keys to the internal pages to make search over this tree faster and simpler.
Regarding GIN analogue, it would be still possible to compress the posting lists. The possible option would be to associate undo record not with posting list item, but with the whole posting list.
Regarding BRIN, I don't think we can do some direct analogue since we're using index-organized tables. But we can do something interesting with union keys in the internal pages of PK.
> This doesn't even consider the issues that might appear if the ordering rules (collation) change.
You're right, collation issue is serious. We will need to stick every collation-aware index to particular libicu collation version, before we go to GA.
> Regarding GiST analogue my plan is to build B-tree over some space-filling curve. Also, I'm planning to add union keys to the internal pages to make search over this tree faster and simpler.
This is my first time hearing of "union keys", and I can't seem to find it using DDG or arxiv. Would you mind explaining the concept (or pointing me in the right direction)?
The article makes some convincing arguments and the benchmarks seem to corroborate their performance claims, but I don't understand the dichotomy between this proposed new storage engine (OrioleDB?) and PostgreSQL itself.
Besides the commercial motivations and wanting to profit from the innovations discussed in the article, is there any reason why this needs to be a whole new database marketed as OrioleDB versus contributing these improvements upstream?
I'm seeing OrioleDB as a future engine for PostgreSQL. I'd like to see it as the default engine.
However, the changes in OrioleDB are too big to be made incrementally. This is why I'm comparing the current PostgreSQL engine (with more than just heap, but many other subsystems as well) with OrioleDB.
Alexander Korotkov (OrioleDb author) idea, - based on his Postgres committer experience, I believe, - is that these changes are way too big to be ever accepted upstream, hence separate engine. More info https://www.socallinuxexpo.org/sites/default/files/presentat..., see esp. slides 9-11
Since this is an engine extension, I wonder if it would have any effect when combined with others. For example, timescaledb [0] acts on underlying tables. I wonder if this would have some effect if you did something like
create table xyz(...) using orioledb;
select create_hypertable(xyz, ts);
Is OrioleDB interested in committing to a stable on-disk format removing the need for an upgrade process between Postgres major versions? Seems like an opportunity to solve this problem.
> stable on-disk format removing the need for an upgrade process between Postgres major versions?
The need for the PostgreSQL upgrade process doesn't generally arise from the low-level on-disk formats of Postgres' heap and OrioleDB's table access method, but from changes in Postgres' catalogs. Things like the addition of a new type and its support functions will need to be inserted by some upgrade process. Then there are other catalog changes that change the column layout of the catalog tables, which also requires a process to update the stored data between the versions.
Without an upgrade process, you cannot change the catalogs, which is why only minor version upgrades of PostgreSQL can be done with only the swap of a binary, and can be rolled back safely without issue. It would limit upgrades to only internal APIs, planner, and executor changes, which would severely limit development.
I doubt that OrioleDB would be able to remove this need for an upgrade process for you.
Not sure how that would help. PostgreSQL already has basically already committed to a stable on disk format for tables and indexes which is why pg_upgrade works. Most of the work upgrading a cluster cones from rewriting the catalog tables. And under most workloads pg_upgrade is very quick so I am also unsure where you think the big gain would be.
I love the whole “2.3x less CPU overhead per transaction” where Postgres scales from 5% to 65% CPU usage and Oriole sits constantly at 90%. That doesn’t seem like a huge success to me? The predictability sure is nice, but moving the lower end up by 85% is something I’d be rather worried about
You generally want to keep your CPU fully utilized. It looks like Oriole is doing significantly more transactions and is CPU-bound, due to much lower IO requirements. The good news is that it implies you could get even more performance out of Oriole by vertically scaling to a more powerful CPU, whereas Postgres would not continue to increase in performance this way.
Those idle times on the Postgres server could be used for something else, if you're thinking in a desktop OS mindset. But for servers, you tend to want machines that are doing one thing and are optimized for that thing.
> You generally want to keep your CPU fully utilized.
Not in real life concurrent systems where latency matters. In addition to the queuing/random request arrival rate reasons, all kinds of funky latency hiccups start happening both at the DB and OS level when you run your CPU average utilization near 100%. Spinlocks, priority inversion, etc. Some bugs show up that don’t manifest when running with lower CPU utilization etc.
This is a benchmark that tries to execute as many queries as possible, so the interesting stat is transactions per second, not CPU usage. This benchmark is testing top speed, not real world behaviour.
If you tested both systems with the same workload (eg. a specific number of queries per second), then the average CPU usage would be much lower for the more efficient engine.
The low CPU usage in this benchmark is just a sign that the performance is not CPU bound, but limited by other factors like locking or IO.
Here the system is doing a pretty consistent 750k TPS instead of oscillating between 0 and 225k-- often sitting near 0TPS for tens of seconds. Which system do you think will have better latency for any given loading?
So what you're saying is that an acceptable way to compensate the system's bugginess is by making it more inefficient? I'd rather use a system that's stable under load.
A service should use 100% CPU when loaded fully, anything else means you are suffering from bottlenecks that are actively limiting your throughput. Having lower CPU load because your disks are barely hanging in there is certainly not better.
If you want lower max CPU load, just limit its resources (e.g., CPU quota, cpuset limitation) or load it less.
> You generally want to keep your CPU fully utilized
Only if your load is very predictable. If there is a chance of a spike, you often want enough headroom to handle it. Even if you have some kind of automated scaling, that can take time, and you probably want a buffer until your new capacity is available.
I think many here is misunderstanding what was likely meant: postgresql was not able to use all the available CPU under this situation, in that it was oscillating from 10% to 70% CPU use. That 40% average cpu use isn't an asset on a dedicated database server: it just means that the other 60% of available cycles are a perishable resource that are immediately spoiling.
In that sense, you want to be able to have your database be able to use all the resources available: all the IOPS, all the CPU cycles, etc.
And, of course, the real thing is the amount of work you get done: this thing does more work-- partially by using more CPU cycles, and partially by doing more work per CPU cycle.
It’s hard to generalize on these points. In a situation where the throughput was inverted but the proportional system usage was the same, you would instead say “you can still vertically scale by adding more disks”, rather than saying adding bigger cpu. It’s not meaningful in isolation.
It may be reasonable to suggest that for a new code base that is cpu bound there’s a good chance there is low hanging fruit for cpu optimizations that may further increase the throughput gap. It’s also the case however that the prior engines tuning starting life on much older computer architectures, drastically different proportional syscall costs and so on, it very often means that there’s low hanging fruit in configuration to improve baseline benchmarks such as these. High io time suggests poor caching which in many scenarios you’d consider a suboptimal deployed configuration.
It’s not just the devil that’s in the details, it’s everything.
To be a little more clear on what the detail of the benchmark in question is: it’s a benchmark that explicitly exercises a pathological use case for postgresqls current design, one that nonetheless functions, and demonstrates that the advertised engine does not have that pathology. A key takeaway should probably be, if you’re a Postgres user: if your workload looks exactly like this (sparse upserts into a large data set at a high rate) then you might want to evaluate your the runway of your architecture before the geometric costs or latency stalls become relevant - just as for cost analysis of any other system. What is somewhat interesting in this article, and not super clearly presented, is that this workload is actually fairly pathological for most existing engines offering this set of structural and query facilities, and that’s interesting, if this is the niche you need. Most people do some amount of this, but not always at a super high rate, and there are ways to get the same effective writable/readable data using a different schema, while avoiding it. Nice thing here is you can do the one-liner version.
> you can still vertically scale by adding more disks
Parallelizing IO is a lot different from scaling up CPU power, though. I'd imagine DB server IO performance has a lot less lower-hanging fruit than CPU/software performance.
That depends, the ratio of free bus capacity for data fetch, and free capacity for inter-CPU synchronization is skewed _massively_ in favor of capacity for data fetch. An x86 system is already under-capacity at the cpu/bus interface, which is why we keep throwing more and more cache at the problem and it works.
Similarly in the cloud on AWS fro example, you have publicly available scalability options starting from 5k IOPS up to 2M IOPS, >400x or 3 orders of magnitude. By contrast you're going from 1vcpu to 192 cores, about half the raise, and a lower performance scaling due to the increased cost of cross-package shootdowns.
Yup, they're different, for sure, but the implication that CPU is easier is not all that clear. In either case, with a database style workload, and with either of these engines in practice you're going to hit a limit at the bus in practice long before you hit a limit on compute or disk io, for any sustained workload - bursts are different.
My read is that it's at 90% because they are saturating the CPU to that point with the TPS threshold they use for comparison, the TPS of Oriole is constant and way higher than pg in these charts at least.
I'd think the CPU will drop proportionally to the TPS, they just want to show how high it can go here.
Yes, but now that your CPU utilization is uncapped, you can more easily scale the utilization down and retain some form of proportional performance, so it doesn't matter. If you capped the system to 60% of your CPU, it might change the overall numbers, but say you're doing 1.8x more TPS at the same usage, it's a win either way. It's not a marketing trick; those numbers come across as "Very good", to me.
If Expensive Server CPU = X dollars per unit, and it's only used at 60% capacity and can realistically only be used at that capacity, then you have effectively just set .4*X amount of dollars on fire, per unit. If you can vertically take a workload and scale it to saturate 90% of a machine, it's generally easy to apply QOS and other isolation techniques to achieve lower saturation and retain some proportional level of performance. The reverse is not true: if you can only hit 60% of your total machine saturation before you need to scale out, then the only way to get to 90% or higher saturation is through a redesign. Which is exactly what has happened here.
With the same equipment, your performance is now five times better. (5X higher TPS) We need to test again with more hardware, but if you can maintain 3 times the performance at the lower end, it could be a good alternative for some users.
"As the cumulative result of the improvements discussed above, OrioleDB provides:
It was a performance test, where presumably the objective was to apply the maximum possible load each DB engine could handle, and apply that load continuous for a long period of time.
The CPU load jumping up and down isn’t Postgres “scaling” it Postgres hitting performance bottlenecks on a regular basis, presumably driven by the need to perform vacuums which are very IO insensitive. So instead of using IO to serve queries, Postgres is using IO for janitorial work, and TPS (and thus CPU usage) crater.
Oriole on the other hand manages much higher throughput, and much more consistently than Postgres.
What would you prefer a car that does a constant 100mph when your foot’s down. Or one that wildly oscillates between 40mph and 70mph, despite you trying to put the pedal through the floor?
The article contains a link with the rather curious title "10 things that Richard Branson hates about PostgreSQL".... Turns out the guy who wrote that blog is called Rick Branson, not Richard.
In fact, I'd argue that many of the most effective ways to mislead people involve sticking rigidly to literal truth, because it makes them so much harder to counter. When there's no literal untruth to correct, it's natural to end up implying bad faith _without having any definitive proof_, and that is mighty unstable ground from which to argue.
I get what you’re saying, but imagine your name was Richard Branson. You’d hear no end to the jokes. At what point can you consider this an internalized behavior of the author? Is it still clickbait if the author believes his main raison d’etre is to have a meme name?
I think I basically agree with you. And this example is pretty benign — I'm not actually meaning to criticize anyone here.
However I will not that the author in question refers to himself as "Rick Branson", and the article title is "10 Things I Hate About PostgreSQL". So I think it's just the person who made the link who is being a bit cheeky.
They may have simply been intending to be relatively formal as a mark of respect. (there's enough language and culture dependencies in how one decides such things that 'may' is very much load bearing in that sentence, mind)
Certainly it wouldn't've occurred to me to think it was the businessman rather than a name collision.
But, eh, agreed on tangent, and I'm not intending to criticise either.
See the guy on Bluesky who is called Steve Wozniak and isn’t trying to pretend to be Woz and yet has issues while Bluesky also let someone with a racial slur username get an account
If this engine is so much better than the internal one shouldn't we expect that at least the big cloud providers will use it on their managed servers? They have an economic incentive to do so. If that happens eventually the PostgreSQL project itself will replace the default engine, or am I wrong?
I’m not sure the risk of an immature engine is worth it to them. Customers pay for hosted Postgres because they want to not worry about doing it themselves for cheaper. They are paying for reliability.
I think you’re correct about the existence of an economic incentive for the cloud providers, but I anticipate it would be offered as a distinct product to “vanilla” (at least in the sort term).
Things get interesting though because this space of database products has trended towards restricting who can host in their license terms (TimeScale, ClickHouse, etc). If that’s Orioles cash-in play then maybe cloud providers can’t use it anyway.
I suspect the fate of the engine will be determined by its funding source
But Aurora RDS is a separate product - it's not sold as standard Postgres which you can also get. It's not like they are trying to pass off aurora as the same thing.
Sorry for the very late reply, but Aurora PostgreSQL isn't just wire protocol compatible like CockRoachDB. It actually is PostgreSQL modified for their service. It includes all your favorite PG extensions and even large object functionality.
I read object relational? Can someone enlighten me? Entity relational fine but what makes it object relational? Has someone flipped on the buzzword in the years I did not pay attention
The oldest I can find is from 1998 (PostgreSQL 6.3), but it was probably in use even before.
> Postgres offers substantial additional power by incorporating the following four additional basic concepts in such a way that users can easily extend the system:
classes
inheritance
types
functions
Other features provide additional power and flexibility:
constraints
triggers
rules
transaction integrity
These features put Postgres into the category of databases referred to as object-relational
Object Databases were once a thing, and PostgreSQL PostgreSQL used the term Object Relational to indicate it could be used as both an Object Database (it supports table inheritance) and/or a Relational Database. Not that you should ever use the feature, being a historical artifact full of historical gotchas and your clever design becomes a maintenance burden.
Ledger-like (only inserts and selects) table design and management just remove the need of vacuuming.
Vacuuming becomes important with large tables. In those cases, naive design (with row updates and deletions) instead of a ledger-like one (without) is the real culprit, IMHO.
If you have billion rows tables I can imagine all those data are relevant. So, why not using a ledger-like approach and also keep a history as an extra bonus?
I generally put Postgres WAL on nonvolatile RAM (battery backed thing) and the database on a bunch of NVMe RAIDz3 arrays striped with two 64-core AMD EPYC CPUs. Is fast.
Experimental format to help readability of a long rant:
1.
According to the OP, there's a "terrifying tale of VACUUM in PostgreSQL," dating back to "a historical artifact that traces its roots back to the Berkeley Postgres project." (1986?)
2.
Maybe the whole idea of "use X, it has been battle-tested for [TIME], is robust, all the bugs have been and keep being fixed," etc., should not really be that attractive or realistic for at least a large subset of projects.
3.
In the case of Postgres, on top of piles of "historic code" and cruft, there's the fact that each user of Postgres installs and runs a huge software artifact with hundreds or even thousands of features and dependencies, of which every particular user may only use a tiny subset.
4.
In Kleppmann's DDOA [1], after explaining why the declarative SQL language is "better," he writes: "in databases, declarative query languages like SQL turned out to be much better than imperative query APIs." I find this footnote to the paragraph a bit ironic: "IMS and CODASYL both used imperative query APIs. Applications typically used COBOL code to iterate over records in the database, one record at a time." So, SQL was better than CODASYL and COBOL in a number of ways... big surprise?
Postgres' own PL/pgSQL [2] is a language that (I imagine) most people would rather NOT use: hence a bunch of alternatives, including PL/v8, on its own a huge mass of additional complexity. SQL is definitely "COBOLESQUE" itself.
5.
Could we come up with something more minimal than SQL and looking less like COBOL? (Hopefully also getting rid of ORMs in the process). Also, I have found inspiring to see some people creating databases for themselves. Perhaps not a bad idea for small applications? For instance, I found BuntDB [3], which the developer seems to be using to run his own business [4]. Also, HYTRADBOI? :-) [5].
6.
A usual objection to use anything other than a stablished relational DB is "creating a database is too difficult for the average programmer." How about debugging PostgreSQL issues, developing new storage engines for it, or even building expertise on how to set up the instances properly and keep it alive and performant? Is that easier?
I personally feel more capable of implementing a small, well-tested, problem-specific, small implementation of a B-Tree than learning how to develop Postgres extensions, become an expert in its configuration and internals, or debug its many issues.
Another common opinion is "SQL is easy to use for non-programmers." But every person that knows SQL had to learn it somehow. I'm 100% confident that anyone able to learn SQL should be able to learn a simple, domain-specific, programming language designed for querying DBs. And how many of these people that are not able to program imperatively would be able to read a SQL EXPLAIN output and fix deficient queries? If they can, that supports even more the idea that they should be able to learn something different than SQL.
> I personally feel more capable of implementing a small, well-tested, problem-specific, small implementation of a B-Tree than learning how to develop Postgres extensions, become an expert in its configuration and internals, or debug its many issues.
It gets harder as you delve into high concurrency and ensuring ACID: if you are using an established database, these are simply problems you don't have to deal with (or rather more truthfully, there are known ways to deal with them like issuing an "UPDATE x=x+1" instead of fetching x and then setting it to x+1).
Still, writing an application expecting the datastore to ensure consistency is one thing, and ensuring that consistency are different problems requiring a different mindset (you are thinking of hard problems of your business logic, but you also have to think of hard problems common to db engines at the same time?).
> But every person that knows SQL had to learn it somehow. I'm 100% confident that anyone able to learn SQL should be able to learn a simple, domain-specific, programming language designed for querying DBs.
The benefit of languages as ubiquitous as SQL is that once you need something that you did not think of, SQL already enables it. But plenty of non-relational databases provide their own non-SQL APIs already (ElasticSearch, Redis, MongoDB, DynamoDB...), and as you suggest, developers cope with them just fine.
However, people used to expressiveness of SQL (even if we all know it's imperfect), always miss what they can achieve with a single query moving performance (and some correctness) considerations to the database. The idea is as old as programming: transfer responsibilities for accessing data performantly to whatever is managing that data, even if we know that there are always cases where it's an uphill battle.
It's that combination of good-enough performance, good-enough expressiveness, impressive consistency and correctness, and relational databases (and SQL) are a great choice for most applications today.
The ACID and concurrency aspects are definitely harder to deal with, but it also depends on what you need. I wonder if many people would find a nice perf increase by running a simpler, well designed db that runs in a single process of a beefy modern computer in a compiled language. In any case, writing any multithreading or multiprocess code is hard, and I doubt a multi-million LoC codebase makes it any easier.
> you are thinking of hard problems of your business logic, but you also have to think of hard problems common to db engines at the same time?
YES! everyone is complaining these days about slow software in our beefy machines. I guess the core of my rant is that it feels like all of us programmers should start caring a lot more about data organization, code size, minimizing dependencies, data oriented design and "mechanical sympathy". Advances in languages, tooling and accessibility to information should demystify the how-to of managing our own application data ourselves.
I symphatise with your last point! And I agree that great developers should understand how to build a sufficiently performant database for their app, even if they won't build one.
However, I think our applications are not slow due to database access, but one too many layers of indirection otherwise: eg even ORMs usually introduce a huge performance and complexity cost.
Just like we are trying to come up with better and less error prone concurrency models in code (async/await, coroutines...), I get that you are trying to come up with better tooling support for data access, and we should.
But we also need to be aware that some people simply want to solve a problem more efficiently, but not most efficiently (look at most ML code and you can barf at it — yet it still makes a huge progress in one area they care about).
> A usual objection to use anything other than a stablished relational DB is "creating a database is too difficult for the average programmer." How about debugging PostgreSQL issues, developing new storage engines for it
that's exactly what OP company is doing: they are building storage engine for postgres.
I’m not sure the cpu load being higher is a bad thing, isn’t that basically showing that it is using less IO so it can use more of the cpu?
The throughput is way, way higher, so it’s using less cpu per transaction. If this were showing equal numbers of transactions the CPU usage would be lower.
Ideally, in a benchmark, I think we’d be seeing basically 100% cpu usage because that would mean the test hardware is being fully utilized and the software being tested isn’t being bottlenecked in some way.
I think you might have misread the graphs. The graph is showing a 4x peak/~6x average improvement in TPS. Because of this the load is less I/O bound and thus CPU is able to be fully utilised. If you want to measure efficiency you would instead measure at constant TPS.
I suppose that more CPU load is because more cores are loaded by doing tasks in parallel. Less CPU load is actually indicator that performance is limited by something else (IO, locks, memory coherency) that couldn't be scaled with the ease of CPU's adding.
So bigger CPU load and bigger performance mean that parallelising of tasks is more efficient, which is a clear benefit.
It is not about my "conceptualising", I was simply referring to the fact that the conclusions refer to something different than the graphs. The author himself uses the term "load", so I guess we should stick to what he meant. Still, graph is showing the absolute value of the load, not per-transaction value. Then it takes an extra effort to actually realize that maybe the author's claim is valid, but in a per-transaction context. Why then the graph wasn't made to plot per-transaction values if that was author's point? It adds unnecessary confusion. That confusion is perceived just after reading another inconsistency - that there is supposed 5x tps speedup while we see 4× (visible 80k divided by visible 20k is 4 not 5). So why 5x? Was this 5x based on a median perhaps, or on some percentile- then why such a median or percentile wasn't shown on the graph? ..and so on.
Please don't get me wrong. 4x tps speedup is nice achievement already. It's great enough to congratulate the author and be happy. But it's also presentation of the result that matters, if there are inconsistencies, or the author based his claims on a different measurements than what is shown, then it's natural that it can make one to raise in eyebrow. It doesn't solidify the trust, as opposed to presenting the conclusions matching the graphs exactly.
- Row-level anything introduces write alignment and fsync alignment problems; pages are easier to align than arbitrary-sized rows
- PostgreSQL is very conservative (maybe extremely) conservative about data safety (mostly achieved via fsync-ing at the right times), and that propagates through the IO stack, including SSD firmware, to cause slowdowns
- MVCC is very nice for concurrent access - the Oriole doc doesn't say with what concurrency are the graphs achieved
- The title of the Oriole doc and its intro text center about solving VACUUM, which is of course a good goal, but I don't think they show that the "square wave" graphs they achieve for PostgreSQL are really in majority caused by VACUUM. Other benchmarks, like Percona's (https://www.percona.com/blog/evaluating-checkpointing-in-pos...) don't yield this very distinctive square wave pattern.
I'm sure the authors are aware of these issues, so maybe they will write an overview of how they approached them.