
A new storage engine for PostgreSQL to provide better control over bloat - amitkapila
http://amitkapila16.blogspot.com/2018/03/zheap-storage-engine-to-provide-better.html
======
jacobkg
This is potentially BIG news. At my last company Postgres bloat was a gigantic
problem. We did a lot of data updates and sometimes customer databases (we
sharded by schema) would be bloated by as much as 300%. This really hurts when
some customers are 50+ GB. Not only does bloat take up more space but it slows
down queries.

We would occasionally resort to taking the customer offline for scheduled
maintenance to do a pg_dump/pg_restore to get their DB back in shape.

I'm very curious to hear when this might become available for production use
cases.

EDIT: Article states this has been proposed for Postgres 12

~~~
felixge
Are you aware of pg_repack [1]? I haven't had a chance to use it myself yet,
but it's designed for reclaiming bloated space without downtime.

[1] https://github.com/reorg/pg_repack/

~~~
jacobkg
We tested pg_repack early on. Our experience (at least 5 years ago) was that it
didn't reliably clean up after itself in all cases, so we were left with random
triggers in the database. With thousands of customer databases this was a huge
potential hassle.

I have heard plenty of people use pg_repack, so maybe it has gotten better, or
maybe we didn't test it enough.

~~~
al_james
We use pg_repack and hit that exact problem (leftover triggers, etc.). In the
end we resorted to creating and dropping the pg_repack extension after each
run.

(edit: Missed an "and")
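
For illustration, a minimal sketch of that create/repack/drop cycle, assuming
psycopg2 and the pg_repack client binary are installed (the database and table
names are hypothetical):

    # Sketch of the "create extension, repack, drop extension" cycle described above.
    # Assumes psycopg2 and the pg_repack client binary; names are illustrative.
    import subprocess
    import psycopg2

    DBNAME = "customer_db"

    conn = psycopg2.connect(dbname=DBNAME)
    conn.autocommit = True  # commit immediately so the separate pg_repack process sees the extension
    cur = conn.cursor()
    try:
        cur.execute("CREATE EXTENSION IF NOT EXISTS pg_repack")
        # Rebuild one bloated table online; -d selects the database, -t the table.
        subprocess.run(["pg_repack", "-d", DBNAME, "-t", "public.big_table"], check=True)
    finally:
        # Dropping the extension removes any repack objects left behind by an aborted run.
        cur.execute("DROP EXTENSION IF EXISTS pg_repack CASCADE")
        cur.close()
        conn.close()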

~~~
kawsper
Do you have any idea why their docs don't mention this, or why the software
doesn't do that itself?

~~~
cookiecaper
I haven't used pg_repack in production, but on my development box with PgSQL
10.x, if you end the process early, pg_repack will complain on the next run
and ask you to clean up by dropping and re-adding the extension. Maybe this
occurs in the wild if the cleanup code is not reached, for whatever reason.

------
_pj_
I really wish the EDB guys (both Robert Haas and Amit Kapila) would mention in
their blog posts that this feature is meant to use the pluggable storage API,
which is also slated for PG12. That would make it clearer that we are not
abandoning the current storage format, just giving users the choice (per table)
to use what works best for their use case. It would also alleviate the worries
expressed in the comments on both blog posts.

~~~
fanf2
Where can we find a summary of the pluggable storage API?

~~~
_pj_
I don't think there is a good summary; it hasn't been settled yet. The (rather
lengthy) discussion is, as with everything, on the -hackers mailing list:
https://www.postgresql.org/message-id/flat/20160812231527.GA690404@alvherre.pgsql

------
macdice
There was an intelligent comment about the extra writes involved in writing
UNDO and the extra reads needed whenever anyone has to read UNDO, suggesting
that this might not always beat the traditional system, which just has to write
a new tuple. It was deleted while I was composing my reply, so here, without
context, is my reply!

Certainly there are complicated trade-offs here. Some things to consider: (1)
unless the pages get flushed to disk due to memory pressure or a checkpoint,
the UNDO data won't hit the disk, so that '2x write' isn't a disk write. (2)
In-place updates don't require every index to be updated (only indexes whose
key changed), skipping the famous "write amplification" problem (more
generally than HOT can). (3) The 2x reads for transactions that need modified
rows apply only to transactions running with old snapshots; in the traditional
system _every_ transaction pays for reading past the bloat that can't be
cleaned up (in other words, an old snapshot is a tax on every transaction in
the traditional design, but UNDO is only a tax on the transaction that's using
the old snapshot in this proposed system).
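
For point (2), a toy back-of-envelope model in Python of how many structures a
single UPDATE touches under each design; the numbers are purely illustrative
and ignore WAL, checkpoints, fillfactor and so on, so this is about the shape
of the trade-off, not actual I/O:

    # Toy model of point (2): structures touched by a single UPDATE under each design.
    # Purely illustrative; ignores WAL, checkpoints, fillfactor, HOT details, etc.
    def heap_update(num_indexes, hot_applies=False):
        # Traditional heap: a new tuple version is written; unless HOT applies,
        # every index gets a new entry pointing at the new version.
        return 1 + (0 if hot_applies else num_indexes)

    def zheap_update(changed_key_indexes=0):
        # zheap: the row is updated in place plus one undo record (which may never
        # reach disk); only indexes whose key actually changed are touched.
        return 1 + 1 + changed_key_indexes

    print(heap_update(num_indexes=5))            # 6 writes when HOT can't help
    print(zheap_update(changed_key_indexes=1))   # 3 writes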

------
CWuestefeld
I'm a PostgreSQL newbie [1]. From the discussion below, it seems that most of
what's being addressed [2] is only really relevant in cases where there are
really long-lived transactions. So a better-behaved system in which the
transactions are of fleeting duration wouldn't see as much benefit because
they don't suffer much from the problem.

Am I understanding that correctly?

[1] Our DBs are currently all SQL Server, but due to licensing costs, we're
going to do all new stuff in pg, and over time, migrate the legacy stuff.

[2] I see there are other across-the-board benefits from e.g., smaller tuple
headers; I'm talking about the main thrust.

------
garyclarke27
Considering that: storage costs are falling rapidly - update in place is a
poor design, immutable is far superior - pg is rock solid because of this
architecture, including DDL rollback - vacuum can recover dead space anyway! -
this does seem to me to be a bit of a wasted (large) effort. Reducing the size
of headers and padding sounds good though. I would much rather see effort spent
on incremental auto refresh of materialized views. NOT IN foreign keys would
also be useful to enforce XOR constraints. And table input parameters for
functions.

~~~
amitkapila
Vacuum can't recover dead space in all cases. If the dead rows are distributed
throughout the table, vacuum can't return that space to the operating system;
it can only make it available for future insertions. Also, the work it does to
achieve even that is quite heavy for many workloads, as it can slow down the
system. I think you haven't fully looked at the other benefits of this system,
which include reducing write amplification (a) by avoiding rewrites due to
freezing and hint bits and (b) by making it possible to do an update that
touches indexed columns without updating every index.
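
A toy illustration of the first point (hypothetical Python, not how vacuum is
actually implemented): vacuum can only truncate the contiguous run of empty
pages at the end of a table file, so dead space scattered through the file
merely becomes internally reusable free space.

    # pages: live-tuple count per page of a table file.
    # Returns (pages left on disk, pages truncated and returned to the OS).
    def toy_vacuum(pages):
        kept = list(pages)
        truncated = 0
        while kept and kept[-1] == 0:  # only trailing empty pages can be cut off
            kept.pop()
            truncated += 1
        return len(kept), truncated

    print(toy_vacuum([0, 3, 0, 5, 0, 2]))  # (6, 0): scattered dead pages, the file cannot shrink
    print(toy_vacuum([3, 5, 2, 0, 0, 0]))  # (3, 3): trailing empty pages, space goes back to the OS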

I think there are other benefits too in terms of just raw performance because
it avoids a lot of work like HOT pruning. Yes, it instead needs to discard
undo, but that is much cheaper.

In general, I think it is an overstatement to say that this is a wasted
effort.

~~~
garyclarke27
Yes I didn’t mean completely wasted, just feels to me that this requires a
huge effort and is a risk to Postgres’s legendary reliability - for a
relatively small gain, especially considering that ssd storage capacity will
continue to increase dramatically and plummet in price. Would be better I
think to prepare for new paradigm of unified immutable memory coming v soon,
(Optane already shipping) which will enable Postgres to compete with ‘in
memory, databases like SAP Hanna

------
jayflux
> zheap will prevent bloat (a) by allowing in-place updates in common cases
> and (b) by reusing space as soon as a transaction that has performed a
> delete or non-in-place update has committed

I’m not going to pretend to understand all of this, but is there not a trade-off
here? Isn’t MVCC fast and concurrent precisely because it creates new rows
instead of trying to mutate the old/current ones? Would writes be slower?

In other words, are we sacrificing speed for lower storage consumption?

~~~
amitkapila
It may appear that we are sacrificing speed in some cases, but that is not
actually true. In the current heap, we have to modify the current row, add a
new row, and then WAL-log the diff of both rows (or in some cases write both
full rows), and routinely perform HOT pruning. When the new row can't fit on
the same page, we have to write both pages: the page that contains the old row
and the page that contains the new row.

In zheap, we need to write the old row to undo, but in many cases it won't be
written to disk, as the undo worker will discard the undo very quickly unless
there are open snapshots. We also don't need to perform HOT pruning. So there
is an additional cost of a memory copy, but that is more than compensated by
the savings.

You can read macdice's reply in this regard.

