Zheap – Reinvented PostgreSQL Storage (cybertec-postgresql.github.io)
263 points by PhilipTrauner 3 months ago | 69 comments



For those unaware, Zheap is a new storage engine for Postgres that handles updates and deletes in a different way. Currently, when you update a row in Postgres, Postgres creates a new copy of the row and marks the old one as deleted. This is done for several reasons; in particular, it makes handling concurrency easier and it makes rolling back transactions easier.

The issue with this approach is that over time it leads to lots of "dead rows" - deleted rows that are still taking up space in your DB. Postgres has a background job called the "vacuum" which effectively garbage collects all deleted rows. Depending on your settings, the vacuum can be a pretty expensive job, and even with the vacuum you can still wind up using a lot more space than you actually need.
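If you want to see this in action on the current heap storage, here's a rough sketch (table and column names are made up for the example):

    -- every UPDATE on the current heap leaves a dead row version behind
    CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
    INSERT INTO accounts SELECT g, 0 FROM generate_series(1, 100000) g;

    UPDATE accounts SET balance = balance + 1;  -- rewrites every row: ~100k dead tuples

    -- how much garbage is waiting for (auto)vacuum?
    SELECT n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = 'accounts';

    -- reclaims dead tuples; the space becomes reusable but is generally
    -- not returned to the operating system
    VACUUM (VERBOSE) accounts;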

Zheap addresses these problems with a different approach. When performing an update, instead of marking the old row as deleted and inserting a new one, Postgres will replace the old row with the new one and write a copy of the old row to a separate file. This means the main table file doesn't need to grow just to keep track of dead rows.

Zheap does lead to lots of tricky scenarios. If for any reason you need to access the old copy of the row, you have to fetch it from the separate file. If the transaction that performed that update is rolled back, you need to replace the new version of the row with the old version of the row. This sounds straightforward, but gets really tricky really fast. For example, what happens if I have row X that takes up 1kb. I replace it with row Y that takes up 500b. I then write row Z after row Y that takes up 500b. If you want to rollback the original transaction, row X will no longer fit in its original spot because row Z is now taking up part of the space it used to occupy.


> Zheap does lead to lots of tricky scenarios. If for any reason you need to access the old copy of the row, you have to fetch it from the separate file. If the transaction that performed that update is rolled back, you need to replace the new version of the row with the old version of the row. This sounds straightforward, but gets really tricky really fast.

This is pretty much what Oracle is doing.


I have the same impression. It solves all the autovacuum problems, but has a different set of tradeoffs, e.g. the infamous 'ORA-01555 snapshot too old' when your query tries to read the old version but it has already been cleaned away.


Ah the memories! Yes, the "other file" talked about in these comments is also not an unlimited resource and, depending on implementation specifics, I've found it can be more limiting than the current PostgreSQL MVCC approach. In Oracle, I used to see data-maintenance procedures that would loop over and update some large data set, but intermittently commit progress every so often... this would end up exhausting the available undo. The more frequent the commits, the faster you'd hit ORA-01555 before the end of the run. Not saying the procedure was right... just that the pattern was common, as was the error.

Also, not too long ago PostgreSQL added Stored Procedure support... which allows for mid-transaction commits. Maybe it will be a case of "what goes around, comes around".
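For those who haven't seen it, the PG 11+ procedure syntax allows exactly that kind of batched-commit loop. A rough sketch (the table and the boolean 'stale' column are placeholders):

    CREATE PROCEDURE purge_stale_rows() LANGUAGE plpgsql AS $$
    DECLARE
        deleted int;
    BEGIN
        LOOP
            -- 'big_table' and its 'stale' flag are made-up names for the example
            DELETE FROM big_table
            WHERE id IN (SELECT id FROM big_table WHERE stale LIMIT 10000);
            GET DIAGNOSTICS deleted = ROW_COUNT;
            EXIT WHEN deleted = 0;
            COMMIT;  -- mid-procedure commit, much like the Oracle pattern above
        END LOOP;
    END;
    $$;

    CALL purge_stale_rows();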

I should note I'm not at all against this Zheap idea, stored procedures, or multiple approaches to MVCC/storage back ends in PostgreSQL... being able to choose my effective pros/cons for a project is really desirable. But, solving the vacuum problem will create new problems.


EnterpriseDB (which started this project) specializes in providing a PostgreSQL version that can emulate most of Oracle database functionality, so this is not surprising.


No, it provides a basis to ease your migration from Oracle. There is a lot it doesn't do. Basically, if you have a homegrown app on Oracle, it takes care of a lot of the low-hanging fruit in the migration, but you aren't going to run PeopleSoft or something like that on it.


I personally never used it, but I worked for a company that at the time used EDB to migrate away from Oracle. They apparently had a ton of stored procedures and they mostly worked, but it is possible they didn't use any of the things you mentioned.

Anyway, maybe I should put it differently: maybe they don't provide most of the functionality, but their goal is to provide a database that helps you migrate away from Oracle, and this storage engine, which appears to mimic the one in Oracle (and, I'm guessing, has similar tradeoffs), fits that model.


> emulate most of Oracle database functionality

What is the basis for this statement?

* MATCH_RECOGNIZE isn't supported

* reference partitions aren't supported

* TIMESTAMP is incompatible and the default compile options are comically bad

These are the first three things that came to mind and none of them are supported so I stopped looking further.


It's also how MySQL works, using overwrite plus an undo log rather than multiple row versions in the same table.


Thanks for this. So I'm fairly familiar with how MVCC relates to CoW, and the way I read this is that it's kind of the inverse of CoW: it's an in-place update plus a copy of the old data.

I can completely imagine the corner cases this ends up touching, as there must be an insane amount of logic in Postgres that assumes immutability of data on disk: if I read data block X for transaction version 1234, it will always remain the same.

It's a very interesting approach though. Once all the corner cases are solved and dealt with, I wonder: are there any reasons not to choose zheap over the standard storage engine?


> I can completely imagine the corner cases this ends up touching, as there must be an insane amount of logic in Postgres that assumes immutability of data on disk: if I read data block X for transaction version 1234, it will always remain the same.

Most of the code having such low level assumptions has been centralized for PG12, as part of the introduction of pluggable table storage. Also note that that assumption as stated isn't actually true - we don't expect on-disk data to be immutable. There is on-access "pruning" of dead rows, there's VACUUM, there's new row versions on the same page etc.


There are probably some cases where the traditional heap would be better than (ideal, finished) zheap. Some ideas: Crash recovery is faster without an UNDO phase. Accessing data that has been modified many times since you took your snapshot may be slower if you have to walk back in time through multiple hops to find the right version of each tuple (compared to the regular heap, which also has to step over lots of dead tuples in that case but can do so sequentially). There may be space management problems when there are too many active transactions interested in a page. There may be microscopic overheads that show up in some workloads that don't benefit from updating in place, perhaps because they are insert/delete-only.

I guess both options would be good to have, and it's interesting that recent SQL Server can do it both ways at the user's option (see "Accelerated Database Recovery", something that sounds conceptually a bit like PostgreSQL vacuum instead of ARIES style rollback where you have to undo changes).


The CoW approach likely plays better with flash-based storage, since rewrites happen less often. So if you've got a lot of otherwise spare storage then maybe it's a good idea. Though if you run on top of something like F2FS, maybe that will handle this better than PG does on its own.


IIRC one issue with the current design is that as new rows are written in a different place, all indexes on that table have to be updated as well, leading to write amplification. If you do an update in place, you don't need to modify indices (but now that I think of it, what about concurrently running transactions that might want to read the old values using the old indices...?)

Also, during vacuum there are a lot of writes marking rows as available.
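You can actually see how often updates manage to skip the index writes on stock Postgres (the HOT optimization) in the statistics views; 'accounts' is just a placeholder table name here:

    SELECT relname,
           n_tup_upd     AS total_updates,
           n_tup_hot_upd AS hot_updates_that_skipped_index_writes
    FROM pg_stat_user_tables
    WHERE relname = 'accounts';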


In theory, yes, but PostgreSQL's current storage is not optimized for flash.


> row X that takes up 1kb. I replace it with row Y that takes up 500b. I then write row Z after row Y that takes up 500b.

I thought there was a fixed amount of space allocated per row, with variable-length blobs stored elsewhere; is my understanding wrong?


Yes. Rows are variable length. Think null values, strings, etc.

Only really large values that don't fit into a page are stored elsewhere, and only if they can't be compressed and made to fit. I think it's > 3kb. But I don't remember why that number comes to mind. Pages are 8kb.


2kb by default, configurable per-table with the toast_tuple_target storage parameter if you want more control: https://www.postgresql.org/docs/current/sql-createtable.html
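So if you want to control when long values get pushed out of line, something like this should work (the table is just an example):

    -- raise the threshold at which long values are moved to TOAST storage
    -- (the default is roughly 2kb on the standard 8kb page size)
    CREATE TABLE documents (
        id   bigint PRIMARY KEY,
        body text
    ) WITH (toast_tuple_target = 4096);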


Your write-up (or something similar) should be on their front page. Most projects assume you already know everything about them.

I don't remember the name, but there was a NoSQL DB on HN a couple of weeks ago. I had to Google and read different websites for a solid 10 minutes before I understood that it was indeed a NoSQL DB and what the differentiating factor from other DBs was.


Oof. That sounds super hairy. Don’t you run into all sorts of weirdness with things like bitmaps expecting to find a specific version of a tuple at a specific location in the heap?


Nope. Any time you are looking at a tuple in the existing heap, you have to test it for visibility against your snapshot. With zheap, it's much the same, except that, rather than just "yes or no", the answer could be that you have to follow an update chain into the undo log to get an older version.


Ah gotcha. Thanks for the explanation, makes sense. So I guess you still have transaction ids you need to honour on your way back.


This is more or less how MySQL works, hence its lack of vacuum.


This sounds like temporal tables, does it not, except deep inside the storage engine? The historical copy goes to a separate table; the current row stays in the current table.

I wonder if there is a logical overlap in the ideas between temporal table handling and this low level storage engine optimization.


Indeed, Postgres used to have "time travel" [1], allowing you to view a table as it was at a given point in time. It was removed in 6.2.

Of course, supporting time travel doesn't require Postgres-style MVCC. Oracle has time travel, which it calls Flashback ("SELECT AS OF"), even though it uses a different type of concurrency control.

[1] https://www.postgresql.org/docs/6.3/c0503.htm


PostgreSQL used to have a limited form of unitemporalism by exploiting its MVCC implementation. You could retrieve unvacuumed rows through some additional magic.

For true bitemporalism you need more than current-row/historical-row separations, though.


> I have row X that takes up 1kb. I replace it with row Y that takes up 500b. I then write row Z after row Y that takes up 500b. If you want to rollback the original transaction, row X will no longer fit in its original spot because row Z is now taking up part of the space it used to occupy.

But this can only happen if the replace-X-with-Y and the add-Z are in the same transaction. If you roll back the transaction, then you start by removing Z, then replace Y with X. What am I missing here?


> But this can only happen if the replace-X-with-Y and the add-Z are in the same transaction. If you roll back the transaction, then you start by removing Z, then replace Y with X. What am I missing here?

Z can be in a different transaction than X and Y. If two transactions run concurrently, one that replaces X with Y and a second one that inserts Z, the above scenario can happen.


No, it can't. If they run concurrently, Z has no notion of Y happening. This is the whole point of transactions.


Transactions also have no "notion" of storage implementation details, so your point really isn't relevant to the topic at hand. They operate at a higher level of abstraction.


Thanks for the explanation. In layman's terms, how big is the benefit of keeping the old rows out of the main table vs. the cost of accessing them from a separate file?

Just curious why it's being done now, as it sounds like a major design decision that would have been considered a long time ago.


In Postgres 12, the pluggable storage API was introduced. It allows multiple different storage engines to be used, with Zheap being one of the new ones.

IIRC, one of the big reasons for implementing pluggable storage was for Zheap.
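For anyone curious what that looks like in practice: with the PG 12+ table access method API, a build that ships the zheap AM would expose it roughly like this (whether a 'zheap' access method is actually available depends on the build you're running, so treat this as a sketch):

    -- list the table access methods installed in this cluster
    SELECT amname FROM pg_am WHERE amtype = 't';

    -- create a table on a non-default access method
    CREATE TABLE events (
        id      bigint,
        payload text
    ) USING zheap;

    -- or make it the default for tables created afterwards
    SET default_table_access_method = 'zheap';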

FWIW, the existing implementation has worked really well for most use cases.


It's a complex tradeoff. It depends on the specific workload which scheme is better. There's no clear winner. This is why it's being introduced as a pluggable choice vs a wholesale change.

Andy Pavlo's class has a lot of good general information on the topic: https://15721.courses.cs.cmu.edu/spring2020/slides/03-mvcc1....


To put it simply (very simply), most of the time you don't need old rows (rollbacks are rare, crashes are rare, etc.). If you store old rows (undo) in a separate place, you don't have to read (and skip) them all the time while reading actual data. That's the main benefit.


> replace the old row with the new one and write a copy of the old row to a separate file

Does it do this in the reverse order? If not, how does it prevent data loss in case the main table update succeeds but writing to the "old copy" table fails?


Well, conceptually it happens atomically. The change is described in a single WAL (write ahead log) record, and data in the buffers representing the table and undo log isn't allowed to touch the disk until the WAL is on disk, so in any crash/restart scenario that leaves the job half done, we'll simply do it again when we replay all changes since the last checkpoint. A related question is: what happens to the uncommitted change after that? The system recognises that the change belongs to an uncommitted transaction, and rolls it back, which in this case means that we'll copy the old copy back into the main table before allowing any user to access it, and free up the space in the undo log. This is approximately how most other RDBMSes work.


Ah of course, how silly of me to forget about the log. Thank you, that makes perfect sense.


It's more complex than the above. The key constraint is actually that the transaction is not allowed to return as committed until the WAL is stable on disk. Whether the writes it did to various pages are allowed before that point or held until after is an implementation choice, and different databases make different choices. It gets complex fast, but the short summary is that the recovery protocols understand how to reconstruct the state after the fact, redo what needs to be redone, and undo what needs to be undone.


> Does it do this in the reverse order?

I would guess so, but I haven't looked up the implementation so I'm not sure. There's a ton of race conditions that can come up depending on the exact order you write things so I'm sure the actual implementation is pretty messy.


Have you stress tested it? How does this end up faring with large scale deployments?


If you look at the most recent status update on the project:

> 12-10-2020: Most regression tests are passing, but write-speeds are still low.


In theory this is much better than what Postgres currently does. Performance depends on autovacuum cleaning up the mess you leave behind; as the number of dead tuples increases you also get index bloat, which leads to reading much, much more data than necessary. I see 3-5x read amplification before vacuum kicks in on update-heavy tables (with vacuum tuned a lot).
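For reference, the kind of per-table tuning I mean is along these lines (the numbers are illustrative, not recommendations):

    ALTER TABLE hot_table SET (
        autovacuum_vacuum_scale_factor = 0.01,  -- start vacuuming at ~1% dead tuples
        autovacuum_vacuum_cost_delay   = 1      -- throttle the vacuum worker less
    );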

This is basically what Oracle does with UNDO, and it is obviously very scalable. Vacuum, on the other hand, has its limits; there is only so much you can squeeze out of a single-threaded (at the table level) process.

What I look forward to most is that it should enable FLASHBACK queries.


I doubt that, because in practice PostgreSQL is faster on many workloads than Oracle and InnoDB, despite both of them using models similar to zheap. If the theoretical difference were large, then we should also expect to see big differences in practice, which is not the case.

While zheap may be a superior design I do not think you should expect any major performance improvements other than on some very specific workloads.


Undo logging vs redo logging (except both are MVCC).


Confusingly, there is another new PostgreSQL storage effort called "Zedstore": https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5...

Disclosure: I work for VMware, who are sponsoring zedstore development via Greenplum.


Are you working on zedstore? It is one of the postgres developments that I am anticipating the most. cstore_fdw looked pretty cool but it never lived up to the hype and had a lot of not so intuitive limitations. A storage engine built using the new table access APIs seems like the right foundation.

Any word on when you are expecting stability? I'd love to see this in RDS.


I'm not working on zedstore. I just like peppering the folks who do with my wishlist items (bitemporalism! query time forecasting and progress measurements! support for query results in binary formats like Avro, ORC or Arrow! kittens and unicorns!) while they try to do useful work.


There is a super informative blog post from last week at https://www.cybertec-postgresql.com/en/zheap-reinvented-post... - discussed at https://news.ycombinator.com/item?id=24710765


> What happens if the new row does not fit in? In this case performance will be worse because zheap essentially has to perform a DELETE / INSERT operation which is of course not as efficient as an in-place UPDATE.

I would be wary of this. InnoDB, for example, also has an optimistic (in-place) UPDATE mode and a pessimistic UPDATE mode.

Repeatedly updating the same row under the pessimistic mode would end up stalling the database, even at a rather low QPS.

https://bugs.mysql.com/bug.php?id=53825 was originally reported by Facebook 10 years ago, and is still not fixed in 8.0.21 with VATS / CATS scheduling.

And then there is also the performance gap between the ideal case, where the undo log is still in memory, and the pathological case, where the undo log needs to be fetched from disk.

The last thing postgres needs is something that looks good on paper / in benchmark, but has a bunch of issues in production.


It is nice that PG got another storage engine. I hope it will ease the implementation of column-based storage in official PG, as the row-based storage is a major problem with very large datasets.


When can we expect Zheap to land? 14?

With the new pluggable storage API, are there any other new storage engines besides Zheap?


Great to see Antonin and others working on this! I did some work on some pieces of this (along with many others), and I gave a talk about an earlier prototype of the undo log machinery and how it fits at PGCon 2019 (which now seems like a whole lifetime ago!). Slides and recording here in case the background is interesting:

https://speakerdeck.com/macdice/transactions-in-postgresql-a...


This seems neat!

Last year I helped a friend diagnose an issue they had with Postgres. A database table with a scheduled full DELETE/INSERT had slowed to the point of failure. It turned out that having slightly less I/O than needed caused the auto-VACUUM process to get further and further behind each time it ran.

My friend simply provisioned more IO and moved on. Another option would be to rewrite the process to naturally produce fewer dead rows. It would be great to have a third feasible option.


Such workflows with a scheduled DELETE/INSERT often mean that the data is "derived"; there's an unlogged tables feature for that in PostgreSQL. Tables configured as unlogged are not written to WAL and thus generate much, much less I/O. The downside is that after a crash the table might be empty (until it is repopulated by the DELETE/INSERT workflow again).
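A minimal sketch of that pattern (names are made up); the refresh job has to be able to rebuild the table from scratch after a crash:

    CREATE UNLOGGED TABLE derived_report AS
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id;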


It should be an atomic replace - create a new table, do your inserts, swap the new table in. This way you aren’t left without the data during the rebuild, and you are guaranteed a consistent state since DDL is transactional in PostgreSQL.

Make the new table unlogged while it’s being populated as well for that performance improvement if you want, since the atomic replace guarantees your consistency anyway.

Alternatively, if it's not a full rebuild, move from a delete/insert to INSERT ... ON CONFLICT DO UPDATE where possible. I do this for cache tables that store materialized summary views that are updated by pgAgent jobs.

If you are deleting “old” data then maybe use partitioned tables, so you can just DROP the partition containing the data you want gone.
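Roughly, the atomic-swap and upsert variants look like this (table names are placeholders):

    -- full rebuild behind a transactional swap: readers never see an empty table
    BEGIN;
    CREATE TABLE report_new (LIKE report INCLUDING ALL);
    INSERT INTO report_new SELECT * FROM report_source;  -- repopulate from wherever
    DROP TABLE report;
    ALTER TABLE report_new RENAME TO report;
    COMMIT;

    -- incremental refresh without the delete/insert churn
    -- (assumes a unique constraint on report(id))
    INSERT INTO report (id, total)
    SELECT id, total FROM staging
    ON CONFLICT (id) DO UPDATE SET total = EXCLUDED.total;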


Yeah, transactional DDL is a good thing. The potential problem with this solution, though, is that if you have views pointing to the old table, they won't be updated to point at the new table and you will have to deal with them as well.


UNLOGGED tables is new to me, thanks for sharing.


It's tremendously useful for ETL.


That's super interesting. Can you share some more information on why it's so useful for ETL?


Think of it as a temporary table that doesn't automatically disappear at the end of the session. If nothing but your ETL process cares about the data until it's done being processed, and you still have the original data until the processing is done, then in the worst-case scenario (the server crashing) you just have to start your ETL process over.


Why not use partitioned tables? Simply drop/archive the child tables that you don't need, and avoid the vacuum overhead.
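Sketch of what that looks like with declarative partitioning (names and date ranges are illustrative):

    CREATE TABLE measurements (
        id         bigint,
        created_at date NOT NULL
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE measurements_2020_09 PARTITION OF measurements
        FOR VALUES FROM ('2020-09-01') TO ('2020-10-01');

    -- later: old data disappears without generating dead tuples for vacuum
    ALTER TABLE measurements DETACH PARTITION measurements_2020_09;
    DROP TABLE measurements_2020_09;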


That’s a different thing. In some cases the whole table needs to be replaced. Then inserts into UNLOGGED tables are still much faster than into regular tables (even if there wasn’t a DELETE before).


Well, there is TRUNCATE for that (not MVCC-safe).


The table needed to never be visible as empty, so truncate wasn't a short term option.


I solved this by using partitions (at the time it was done through inheritance, where you could make one table out of several smaller ones), and just dropping the table with the unneeded data.


This should solve one of the problems Uber had with PostgreSQL reported by Christophe Pettus back in 2017[1].

1- https://thebuild.com/presentations/uber-perconalive-2017.pdf

Previous discussion from 2018 https://news.ycombinator.com/item?id=16526623


I am cautiously optimistic that my own recent work on version churn in B-Tree indexes will go a long way towards fixing those problems:

https://postgr.es/m/CAH2-Wz=CEKFa74EScx_hFVshCOn6AA5T-ajFAST...

This can be used without changing anything in the heap. It's less than a thousand lines of C. You could say that it's complementary to zheap, actually.

zheap makes it possible to update the same row many times without requiring new index entries, even when there is a long running transaction that holds back VACUUM. However, it does not avoid the problem of requiring a whole new set of index entries for all indexes in the event of even one indexed column being modified by updates.

Strictly speaking my patch doesn't "fix" that problem, either, but it comes pretty close. It teaches the indexes to "fight back" against version churn caused by updates that cannot use the HOT optimization. It makes non-HOT updates responsible for cleaning up their own mess -- no more negative externalities. In practice this seems to more or less fix the exact thing that the Uber blog post complained about, which was the "write amplification" incurred in all indexes when only one indexed column was changed by an update.

(I am omitting some subtleties here, but that's the general thrust of it.)


I wonder how this approach will compare with an LSM tree style index that has gained popularity recently (mostly via RocksDB). LSM trees are also write optimized and allow for some optimizations by batching up updates and merging them in bulk.


Look no further: after playing around with TimescaleDB, there are no updates anymore and we can put this problem into a vacuum :)



