"Add support for LZ4 with compression of full-page writes in WAL
The logic is implemented so as there can be a choice in the compression
used when building a WAL record, and an extra per-record bit is used to
track down if a block is compressed with PGLZ, LZ4 or nothing.
wal_compression, the existing parameter, is changed to an enum with
support for the following backward-compatible values:
- "off", the default, to not use compression.
- "pglz" or "on", to compress FPWs with PGLZ.
- "lz4", the new mode, to compress FPWs with LZ4.
Benchmarking has showed that LZ4 outclasses easily PGLZ. ZSTD would be
also an interesting choice, but going just with LZ4 for now makes the
patch minimalistic as toast compression is already able to use LZ4, so
there is no need to worry about any build-related needs for this
implementation.
Author: Andrey Borodin, Justin Pryzby
Reviewed-by: Dilip Kumar, Michael Paquier"
Actually, that's not true. LZ4 is faster at compression and decompression, zstd (even at the minimum setting) is slower and has higher compression ratio.
zstd-fast is almost as fast as LZ4 and gets a tiny bit better compression, though not a lot of software supports that mode. ZFS does, hence I have some numbers to back that up.
I just had a little look into these somewhat new zstd options --fast 1 to fast 5. fast 5 does seem almost equivalent to lz4 performance in compression ratio and compression 'speed' (or cpu load) except decompression 'speed' is still almost half of lz4, so at that extreme of zstds options it is squarely beaten by lz4. But in terms of compression ratio - which in most cases should be the more crucial characteristic for any compressor than cpu load - this setting seems to be well under zstd's knee of diminishing returns. --fast 1 has about 15% more cpu load for about 112% of compression performance than --fast5. And then zstds lightest non-fast setting (zstd-1) for 50% more load in compression and decompression, compresses 32% better than --fast5 (and lzo and lz4).
The decompression load is an eye catching feature of lz4, but its compression ratio is not just marginally compromised, it is a much less powerful compressor in that prime characteristic than zstds weakest non-fast setting. I can appreciate why Btrfs is in no hurry to include lz4 while it supports zstd:1 to 15. Although it would be interesting to test if lz4 could have a niche on fast storage combined with slow cpus.
There are use cases for lz4, I've had one in an embedded project. Slowish SoC (i.MX6 DualLite IIRC), relatively fast storage (eMMC - these tend to perform better than regular MMC). I measured boot time for the available kernel compression options and lz4 won by 100-200 ms or so compared to the next best option, which may have been gzip. It didn't make a big dent in overall boot time, but not bad for just setting an option. zstd, if available, which it wasn't, would have gotten a marginally worse result probably. IIRC, lzma was slower than uncompressed.
gzip would seem to be a surprising runner up as its decompression cpu load is reportedly around 3-4 times as heavy as zstd:1 with a similar compression ratio.
I would not rule zstd out, it seems to better every alternative given the right level setting, except lz4 which is the outlier but fixed at a relatively low compression level.
As the high performing outlier it tantalisingly suggests room for improvements :]
I read it was only mainlined last year following this thread[1] where a kernel dev presented benchmarks and advised "zstd is an improvement on almost all metrics"
Of course zstd somewhat of a portfolio project of facebook developers, and they seem to have integrated it quite nicely into Btrfs to atone for their debt to society - I'll take that :D
Post that uses specious word: "performant" and a reply starting with "Actually, " - can we not create a prize for this? Maybe hn bingo cards? Dang, that would be good.
Would it be a good idea to use zstd's dictionary compression for databases? It seems like it could be a big win since you are compressing relatively small blocks with known structure.
I've wondered the same. Maybe a per-table dictionary created as part of CREATE STATISTICS? I guess the difficulty will be updating the dictionary, you might need some kind of page epoch to look up what the dict was at the time the page was written.
LZ4 is amazing - it's basically the same idea as LZ77, only implemented carefully, with all the dozens parameters and tradeoffs made in a certain way. Couple this with modern amounts of brute force thrown in an optimal parser (LZ4HC) and you get something quite extraordinary, on the "Pareto frontier" of compression efficiency. There's a tendency to focus on algorithms and scoff at implementation details, but LZ4 was possible as an algorithm 30 years ago, yet the implementation details crystallized to perfection only, say, five years ago.
We also have voluminous amounts of cache and multicore processors that can decompress in cache faster than reading uncompressed data. The tech did need to catch up.
If you're on your way down this rabbit hole, there's a bunch of old-machine-specific compression algorithms, developed by the emulator community, e.g. LZSA: https://github.com/emmanuel-marty/lzsa
LZ4's choices may have also made less sense then due to memory latency, cache sizes, deep pipelining changing missed branch penalty, and other changes since the 70s.
If you are using ZFS, I strongly recommend using LZ4 or ZSTD compression with PostgreSQL. Performance is still awesome. On average I get 2x compressionratio with LZ4 and 4x with ZSTD.
With this, you are compressing everything, not just columns. And ZFS has dynamic block sizes which works really great together with compression. For example a 8kb PostgreSQL page, may be stored as a 1kb compressed block on disk.
For high-speed pgsql (on ZFS) enable zfs-compression (lz4) just on the filesystem, and disable zfs-caching (the arc, just leave metadata on) because postgres has it's own cache.
Have not used this in anger but - from what I've read best practice is also to use separate datasets for the WAL and the rest of the data (including committed data) - this helps ZFS understand that WAL and committed are both separate datasets that both need separate QOS handling within ARC rather than just letting the WAL eat the whole cache. It'll try its best to balance them regardless, of course.
You can still snapshot them atomically by making them both children of a parent dataset and performing a recursive snapshot on the parent. so you have dataset:
myservice/pgdata
myservice/pgwal
or
myservice/pgdata
myservice/pgdata/pgwal
or whatever.
So question, given that this article is about LZ4 memory compression for postgres, would you want it enabled for both/either/none of those datasets? Obviously compression-on-compression doesn't generally help at all, but lz4 performance obviously doesn't hurt much at all either, so if there's anything that postgres doesn't compress, maybe it would be faster in a few niche situations.
>would you want it enabled for both/either/none of those datasets?
Well zfs probes and stops when the data is non compressible, that why stuff like mp3 gets not compressed, so if the pg-datas are already compressed zfs would try it and give it up after some kb's.
I've seen it debated both ways as to whether the zfs `recordsize` tunable should be set to 8k to match postgres or if you should keep it to something a little bit higher. Here's a talk by a FreeBSD/hashicorp guy about performance wins with a compromise of 16K [0]; this lets compression work better (the larger the block, the more compressible it is) and doesn't increase writes too much, although it would probably depend on what usecases you put postgres to and the nature of your reads/writes.
The same slides mention `primary_cache` and the suggestion is use `metadata` if db working set fits in RAM and use the default of `all` if it doesn't.
Yes in a virtual machine just one vdev/dataset (if it's about speed, if redundancy...well that depends on you host-filesystem, then maybe two virtual disks?), maybe some tuning with the write-through etc on the Host to that virtual-disk? And raw would be a good thing too.
Compressed ARC has been a thing in ZFS for a long time. So let's say you allow ZFS cache to take 50% of total memory. With 4x compress ratio from ZSTD, you will be able to cache 2x the total memory. It is amazing.
Yes, but on the other hand, my toast relation now needs (e.g.) 4x the amount of blocks it would have had when compression was enabled. So we have a net 0 storage savings, except that we now need to request 4x as many pages from the OS to get the full toast data, those pages taking up 4x as much space in Postgres' shared buffers, also evicting 4x as many pages from the buffers as it would have otherwise.
Unless your read operations to the file system are somehow very cheap, I don't think disabling toast compression is a good idea.
I see your point. Perhaps the best way is to use TOAST compression for wide field values, and then rely on ZFS filesystem compression for others? For some context, in MySQL, both the table-level compression and page-level compression are problematic, while ZFS has served me well, thus I have a lot more faith in it.
> a Perhaps the best way is to use TOAST compression for wide field values
You don’t have to do anything. Toast compression is enabled for compatible types by default but never kicks in until a cell hits 2K bytes in size (by default).
Unfortunately that's just how postgres works? I thought the rule of thumb is still to allocate 25% of memory for postgres 's shared buffer, and 50% of memory for OS buffer cache (i.e. ARC when using ZFS)?
For a transaction processing workload (where hitting indexes is the all-important thing), that's exactly right. But if you have more of an analytic workload involving sequential scans, Postgres definitely expects the underlying storage system to be handling at least some of the caching.
Are you finding the benefits and features of ZFS worth the perf penalty? Or maybe even its compression makes it overall a faster alternative to say ext4 for a database?
I’ve considered in the past running a DB in AWS with direct nvme drive used for SLOG and for L2ARC cache while the main storage would have been a mounted EBS volume, I haven’t tried it but am still wondering if it’s a better alternative to running a DB on top of bare EBS in Amazon (it should be a lot cheaper than say io2 type volume with provisioned IOPS)
I compared ZFS to XFS on a number of large database deployments and XFS was not uniformly faster. All databases were on fancy hardware with a bunch of U.2 SSDs, which is theoretically where XFS would be a big advantage. Postgres’ speed was very similar in both cases. Upsides of ZFS included vastly lower disk usage (with huge gains on poorly normalized databases) and also way easier/faster offsite backup than the non-ZFS Postgres backup strategy.
It’s essentially impossible to beat ZFS incremental snapshot send. It was literally like 100x faster than rsync (about 10x more bandwidth efficient and 10x higher bandwidth) when I was doing these tests.
With cloud network-based storage like EBS or pd-ssd, I doubt you will notice any IOPS-related perf penalty when running a RDBMS on top of ZFS. Assuming a mixed read/write workload, you will likely hit the disk write throughput limit first, as ZFS just writes ... a lot more, due to how blocks are stored internally: https://github.com/openzfs/zfs/issues/6584#issuecomment-3848...
On an innodb database, I get about 3x compress ratio with ZSTD compression, and ZFS still has to write about 2x more than EXT4.
There just isn't much of a perf penalty. It uses more CPU, yes, but with today's highly-threaded CPUs I can hardly tell there's a cost.
For any data you want to keep, I'd say it's well worth the performance cost. Even for data you don't want to keep, so long as it isn't absolutely performance-critical I find it still worth the cost.
> There just isn't much of a perf penalty. It uses more CPU, yes
Not in my (although small) experience with ZFS vs. databases; the bottleneck is the disk (although it's correct that ZFS is quite CPU-intensive).
Databases have their logging strategies, and ZFS has its own too. So the expense of a write operation is (in principle) doubled.
AFAIK, the typical strategy to mitigate the ZFS overhead is to use a log on a separate, fast storage ("SLOG"). On the db side, in MySQL, it's possible to disable the doublewrite buffer (which is another form of overhead, which, in theory, ZFS should avoid), but nobody really knows if that is 100% reliable or not.
> On the db side, in MySQL, it's possible to disable the doublewrite buffer (which is another form of overhead, which, in theory, ZFS should avoid), but nobody really knows if that is 100% reliable or not.
It is. Actually, that's the best thing about running MySQL on ZFS. The doublewrite buffer is not just another form of overhead, the doublewrite buffer is *the bottleneck* for any workload with moderate amount of write (until it got revamped in 8.0: https://dev.mysql.com/worklog/task/?id=5655).
> It is. Actually, that's the best thing about running MySQL on ZFS. The doublewrite buffer is not just another form of overhead
There has been an interesting discussion on the MariaDB mailing list, questioning this, called "Is disabling doublewrite safe on ZFS?"¹.
The last post has been from the InnoDB lead, saying²:
> I believe that it is technically possible for a copy-on-write filesystem like ZFS to support atomic writes, but for that to be possible in practice, the interfaces inside the kernel must be implemented in an appropriate way.
It doesn't sound like the guys understand how ZFS CoW works? So let's say the machine crashed after InnoDB had written 4KB out of a 16KB data file page to the filesystem. Due to CoW, the old data file page would still be there, unmodified. In contrast, on XFS or EXT4, without the doublewrite buffer, the data file page would be corrupted.
I have a lot of respect for Marko Mäkelä, as he seems like the only one who tries to steer InnoDB away from an evolution dead end. For example, in https://jira.mariadb.org/browse/MDEV-24449, he fixed a corruption bug that had existed since the very first InnoDB commit. At the same time, he also tries to simplify the internals, unlike Oracle who just churns out features along with regressions.
But there are always more corruption bugs within InnoDB. I am not even sure the log checkpoint mechanism is sound, where it does a fsync of the data files first, before getting the checkpoint position from memory, which may have moved forward after the fsync. This behavior also exists since the very first InnoDB commit.
> he also tries to simplify the internals, unlike Oracle who just churns out features along with regressions.
Sadly, very true. I don't know how it was on 5.7 and before, however, at least since v8.0, even patch updates have an alarm probability of breaking existing installations, due to either bugs, or subtle changes in existing functionality.
Regarding the general point about PostgreSQL and ZFS it might be noted that the community are not too keen on it (mostly from a performance perspective) as per recent mailing list thread[1].
> And ZFS has dynamic block sizes which works really great together with compression
This is not just a nice-to-have but is the reason why it is viable at all. In comparison, compression in Btrfs for this sort of workload produces insane write amplification (20x or more), as it forces large blocks for compression
LZ4 is just such a nice little algorithm for compression. It's downright simple, and fast, and still works. It's so simple seems like it should have been invented long ago, but it's only 10 years old.
Initial release (according to Wikipedia) 24 April 2011.
It is derived from the LZ77 family of compression algorithms, which have papers published in 1977 and 1978 and thus US patent expiration around the year 2000. It is understandable that it would take time for discourse, inclusion in educational curriculum, and programmers feeling free to tinker with the concepts to produce useful new art.
It's almost like patents are a bad thing for the progress of science and skilled arts.
Was LZ77 patented? As far as I can tell, patents related to various techniques in LZ77 were filed by various parties in the mid 80s and 90s, but that wouldn't matter, since LZ77 was already published at that point.
PNG was explicitly designed to be unencumbered by patents, and it was released in 1996.
Yes, but GIF is based on LZW, as the article you linked to says. LZW was derived from LZ78. They both IIRC use a dictionary coding approach with an explicit dictionary that is constructed during processing. Indices into the dictionary are used to encode repetitions.
PNG uses deflate, which is based on a combination of Huffman coding and LZSS, which is derived from LZ77. LZ77 encodes repetitions using back references into a sliding window. To my understanding, LZSS also uses a dictionary based approach, but only for finding such repetitions during encoding.
So this does not answer the question if LZ77 was patented or not.
The LZS page mentions a law suite sounding similar to the one from the talk page, but contains no direct reference to the patent: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Sta... It also states that the LZS patent was filed in 1993 and expired in 2007 (due to not paying their fees).
At very least, tech patents really should be shortened to something like 5 years instead of being 20 years. It's nuts that the SSE2 instruction set has JUST come out of patent protection.
I implemented LZ77 when I was at university as I was trying to load gifs for an image processing project. I was absolutely blown away by the simplicity and elegance algorithm. Need to have a look at LZ4.
I believe GIFs use LZW which is based on LZ78 - the one that builds a dictionary as it goes. (It's been decades since I looked at this one.)
LZ77 emits verbatim text and instructions to go back x and copy y characters. A lot of compressors are based on this, including PNG, gzip, lz4, snappy. I think this is partially due to earlier patent issues with LZ78. (LZ4 or Snappy would be a good intro, gzip has extra stuff like huffman encoding tacked on.)
You might also be interested in the Burrows-Wheeler transform, which bzip2 is based on. It's completely different and kinda magical. The original paper is here:
Not sure about postgresql, but in MS SQL turning on the page or row compression usually improves the performance AND storage, esp. on spinning discs, because it's faster to transfer compressed + decompress on the fly that transfer uncompressed - which is a win-win situation.
"The following options are available:
compression: [none|pglz|zstd|lz4|lz4hc] - set the compression type for newly-inserted data. Existing data will not be recompressed/decompressed. The default value is zstd (if support has been compiled in).
compression_level: <integer> - Sets compression level. Valid settings are from 1 through 19. If the compression method does not support the level chosen, the closest level will be selected instead.
stripe_row_limit: <integer> - the maximum number of rows per stripe for newly-inserted data. Existing stripes of data will not be changed and may have more rows than this maximum value. The default value is 150000.
chunk_group_row_limit: <integer> - the maximum number of rows per chunk for newly-inserted data. Existing chunks of data will not be changed and may have more rows than this maximum value. The default value is 10000."
The columnar implementations are cool and all, but they don't really do what you'd expect columnar storage to do—i.e. make each column independently retrievable / make each column effectively its own "single data column plus shared CTID" table for the dataset. Instead, when you fetch one column of a row from e.g. a Citus "USING columnar" table, that still bakes down to the query engine fetching the entire row-tuple for that row (and then discarding most of it), which in turn means that the columnar storage engine must read in + decompress the equivalent compressed blocks for every column participating in that row.
So, unlike in a columnar database, you still have to worry about splitting up your narrow+hot "fact" columns from your wide+cold "dimension" columns, and joining them back together later only when you need them, in order to not be polluting your page cache with the wide+cold column data.
Maybe I'm misunderstanding the feature that you would like to have, but I think Citus columnar supports what you want. Citus columnar storage has projection pushdown, so it only fetches the columns that you requested in your query.
To quote from the blogpost[1]:
What are the Benefits of Columnar?
- Compression reduces storage requirements
- Compression reduces the IO needed to scan the table
- Projection Pushdown means that queries can skip over the columns that they don’t need, further reducing IO
- Chunk Group Filtering allows queries to skip over Chunk Groups of data if the metadata indicates that none of the data in the chunk group will match the predicate. In other words, for certain kinds of queries and data sets, it can skip past a lot of the data quickly, without even decompressing it!
Disclaimer: I work on the Citus team at Microsoft, but have not worked on the columnar storage.
I remember benchmarking fast compression algorithms a while back and had lz4 as one of the top contenders, but I remember going with Snappy over lz4, because I ran into some compatibility issues in lz4 land... I can't recall any specifics, but I think it was some Hadoop libraries not being able to read files lz4-compressed by some other versions of the library.
Has anyone run into the same issue? I'm considering reopening this investigation (even though I'm very happy with Snappy).
An index is typically a mapping of values to database pages. Nothing about that cares whether the data has been compressed or not. It may or may not be compressed, but it is still on the page somewhere.
To be clear, an index is compression agnostic, it's the uncompressed value that will appear in the index, or the value selected e.g. bigJsonbDoc->>'email' is that right?
Indexes are generally a mapping from the key to a row location (or sometimes to the primary key of the row). You can compress the index itself, but it's usually accessed randomly, so it might be detrimental.
Depending on the database, you might have compressed pages (sets of consecutive rows), rows, or columns in a row. Postgres does the latter (it's called TOAST) where the large columns are compressed and/or stored in a hidden table. So you first search the index for the key, figure out where your row is, read the row from the table, then, if needed, you read the columns you need from the TOAST table and uncompress them.
It would be nice if AWS built their PG14 offering for RDS with this supported. PG14 is available on their beta channel now, but I haven't had a chance to play. Has anyone had a chance to explore this yet?
The LZ77 family of compression is really nice. So many use cases. ZSTD has saved me many times. But it is perhaps not the right tool in this case. Too bad so many tools are ZIP exclusive.
Another one of these magical down votes on this garbage forum. Seriously, what is there to down vote. LZ77 is not nice? ZSTD is the the right tool for Postgres compression even though it does not perform as well for smaller amounts of data? It is good to waste so much time in zipping things like the horrors of Docker?
https://github.com/postgres/postgres/commit/4035cd5d4eee4dae...
"Add support for LZ4 with compression of full-page writes in WAL The logic is implemented so as there can be a choice in the compression used when building a WAL record, and an extra per-record bit is used to track down if a block is compressed with PGLZ, LZ4 or nothing.
wal_compression, the existing parameter, is changed to an enum with support for the following backward-compatible values: - "off", the default, to not use compression. - "pglz" or "on", to compress FPWs with PGLZ. - "lz4", the new mode, to compress FPWs with LZ4.
Benchmarking has showed that LZ4 outclasses easily PGLZ. ZSTD would be also an interesting choice, but going just with LZ4 for now makes the patch minimalistic as toast compression is already able to use LZ4, so there is no need to worry about any build-related needs for this implementation.
Author: Andrey Borodin, Justin Pryzby Reviewed-by: Dilip Kumar, Michael Paquier"