Introduce ZSTD compression to ZFS (github.com/openzfs)
164 points by c0d3z3r0 on May 17, 2020 | 38 comments



See BSDCan 2018 "Implementing ZSTD in OpenZFS on FreeBSD" by Allan Jude:

* https://www.bsdcan.org/2018/schedule/events/947.en.html

* https://twitter.com/allanjude


This is an iteration of Allan's code


It looks pretty mature. Why has this compression not been merged yet? I checked out the ZFS YouTube channel, but it is nowhere mentioned in the years of monthly sync-ups.


There's one thing I don't understand. Each time a new compression algorithm is introduced, it's the Next Big Thing. Why isn't the implementation of the algorithm as simple as linking in the related library, assuming they'd all have a similar interface? After all, it seems like what you need is a header and a function that converts a compressed block to a decompressed one and the other way round. Where's the complexity from the API/implementation perspective given that a library is already there?


In this case it could use the Linux crypto API, which already has ZSTD support and provides compress/decompress functions. But that is exported as GPLv2-only, so ZFS needs to do its own thing. And I don't know if the API is sufficient w.r.t. e.g. workspace management/reuse.

W.r.t. ZSTD: The usual thing is to provide a zlib compatible API, which ZSTD does ( https://github.com/facebook/zstd/tree/dev/zlibWrapper ). But the ZSTD zlib compatibility layer causes lower performance, so it is better to use it directly. Maybe all future compression algorithms can provide a ZSTD compatible API, so we have to do less work in the future?
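
For reference, the naive "header plus two functions" view from the parent question really does exist: zstd's simple one-shot API looks roughly like this. The function names are from zstd.h, but the round-trip helper below is only an illustrative userland sketch, not how the ZFS patch wires it up.

    /* Rough userland sketch of zstd's simple one-shot API. Function names
     * are from zstd.h; the round-trip helper and its buffer handling are
     * illustrative only. */
    #include <stdlib.h>
    #include <string.h>
    #include <zstd.h>

    int zstd_roundtrip(const void *src, size_t srcSize, int level)
    {
        size_t bound = ZSTD_compressBound(srcSize);   /* worst-case output size */
        void *comp = malloc(bound);
        void *back = malloc(srcSize);
        int ok = -1;

        if (comp && back) {
            size_t csize = ZSTD_compress(comp, bound, src, srcSize, level);
            if (!ZSTD_isError(csize)) {
                size_t dsize = ZSTD_decompress(back, srcSize, comp, csize);
                if (!ZSTD_isError(dsize) && dsize == srcSize &&
                    memcmp(src, back, srcSize) == 0)
                    ok = 0;
            }
        }
        free(comp);
        free(back);
        return ok;
    }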


>In this case it could use the Linux crypto API

Could it? That doesn't sound very portable, and ZFS works on FreeBSD too.


Zstd is dual-licensed BSD and GPL. It is kind of weird that the kernel limits it to GPL modules.


For one thing, kernel modules don't link against libraries. So a new implementation is almost always written.


This incorporates the zstd source code ‘as is’. FTA:

Frequently Asked Questions

“Q: Why is it so many lines of code, I can't review all of that...

A: Most of this code is the ZSTD library, which has not been altered.”

One reason this needs testing is that this is a file system. If it breaks, it can lose you much more data than just the file being worked on.

There also may be serious performance degradation on hardware or configurations the developers didn’t look at.

There is also some new code added to call the zstd library.


It often works that way: when binary RPM compression switched from xz to ZSTD in Fedora recently, it really was just about making sure the tooling can now read ZSTD while keeping xz support in place for compatibility (basically making sure it links to ZSTD and can use it at runtime). Then you just switch the Fedora build system to compress binary RPMs with ZSTD when they are built and you are done. :)


It's usually the same as linking, except there are tunables in every library, and the hard part is figuring out which ones are appropriate for the job. Maybe this call is to ask people how this library works out in the real world, on the current datasets that people have. Besides, they are talking about testing implementation-specific limitations, and maybe they want to test for that.


> Why isn't the implementation of the algorithm as simple as linking in the related library

Because filesystems need to store compression metadata and compressor settings differently than individual archive streams do. Because of the way ZFS stores configuration like this, in the previous version of these patches, they had to choose a subset of the available compression levels when adding zstd support.

Different compressors and decompressors also have different state sizes for different settings. Allocating, reusing, and discarding buffers for compression/decompression state in a sensible way inside an operating system kernel is not trivial.
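
To make that concrete, here's a rough userland sketch using zstd's static-context API (ZSTD_estimateCCtxSize / ZSTD_initStaticCCtx, behind ZSTD_STATIC_LINKING_ONLY). The struct and helper names are made up; the takeaway is that the workspace size depends on the compression level and somebody has to own, size, and reuse that buffer, which is exactly the part that gets awkward inside a kernel.

    /* Userland sketch of preallocating compression state with zstd's
     * static-context API (needs ZSTD_STATIC_LINKING_ONLY). The struct and
     * helper names are invented; the point is that workspace size is
     * level-dependent and must be reused rather than allocated per write. */
    #define ZSTD_STATIC_LINKING_ONLY
    #include <stdlib.h>
    #include <zstd.h>

    struct zstd_worker {
        void   *wksp;       /* preallocated once, reused for every call */
        size_t  wksp_size;
        int     level;
    };

    static int worker_init(struct zstd_worker *w, int level)
    {
        w->level = level;
        w->wksp_size = ZSTD_estimateCCtxSize(level);   /* level-dependent */
        w->wksp = malloc(w->wksp_size);
        return w->wksp ? 0 : -1;
    }

    /* Compresses using only the caller-owned workspace; returns a zstd
     * size/error code that the caller checks with ZSTD_isError(). */
    static size_t worker_compress(struct zstd_worker *w,
                                  void *dst, size_t dstCap,
                                  const void *src, size_t srcSize)
    {
        ZSTD_CCtx *cctx = ZSTD_initStaticCCtx(w->wksp, w->wksp_size);
        if (cctx == NULL)
            return (size_t)-1;   /* workspace too small */
        return ZSTD_compressCCtx(cctx, dst, dstCap, src, srcSize, w->level);
    }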


lz4 performs pretty well, better than most zstd-fast settings. It will not become obsolete after ZSTD is added and will most likely stay the go-to algorithm if you are unsure what to choose.


FYI lz4 and zstd are made by the same guy, so it is no surprise that they complement each other. lz4 targets the (de)compress-as-fast-as-possible domain, while zstd covers the rest (good (de)compression at lower speed).

For something that is write-once read-many-times, like a filesystem, a good compression algorithm might be more interesting in the future. lz4 is more targeted at write-once read-once, like file transfers for example


Isn't it the opposite? If LZ4 is optimized for decompression speed as you say, then you would want to use it when you read the same file many times, very fast.

From a quick read, ZSTD looks more about saving space while keeping reasonable speeds, both at write and read.

And I'd assume there are other algorithms that focus only on size, trading speed for it.


Depends what "read many times" means.

- If you mean that the same archive will be distributed to many peers, such as is the case in package distribution, then in practice archives will be read only once by each process, so one "slow" compression translates into decompression-speed gains multiplied across every reader. That's the reason Archlinux switched to zstd for its packages (https://www.archlinux.org/news/now-using-zstandard-instead-o...)

- If you mean that the same archive will be read multiple times by the same machine, I don't really know what kind of scenario that is; I'd extract the archive into its original form once and then let processes access that folder directly. Note that zstd claims it isn't that much slower in decompression than competitors; even if you always use compressed archives, the difference will be minimal.

zstd was built more or less to "replace" all formats that favor compression over speed. Going by their benchmarks (for what they're worth), whatever compression/speed tradeoff you want, zstd is going to be better than all of them, with one hard exception: the extremely-fast end, which is still the kingdom of lz4.


By read many times I mean a file that is opened many times. For instance, the kernel and libraries every time you boot.

From my understanding (but please correct me if I am wrong), LZ4 will decompress them significantly faster than ZSTD even if the latter compresses more.

In other words, the decompression speed is measured on the decompressed data, right?


Yes, speeds are measured relative to the uncompressed data. ZSTD outputs around 600-2000 MiB/s per core (in all the higher levels), depending on how fast your processor is.


It also depends on your workload and the speed of your disks.

If your disks are faster than your decompression algorithm when that algorithm is running alongside the rest of your workload (generally not the case) then it can make sense to use the faster decompressor (lz4). In my understanding of the tradeoffs of zstd though, having used it recently in an application, chances are you have a free hardware thread that can saturate your disk without affecting your compute workload.
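
As a back-of-the-envelope model of that tradeoff (numbers invented for illustration, not taken from the ZFS test results): effective read throughput of compressed data is capped by whichever is slower, the decompressor or the disk scaled up by the compression ratio.

    /* Toy model of that tradeoff, with invented numbers -- not a benchmark
     * and not from the ZFS test results. */
    #include <stdio.h>

    /* ratio = uncompressed bytes / compressed bytes */
    static double effective_read_mib_s(double disk_mib_s,
                                       double decomp_mib_s,
                                       double ratio)
    {
        /* The disk delivers disk_mib_s of compressed data, which expands to
         * disk_mib_s * ratio of logical data -- unless the decompressor
         * can't keep up, in which case it becomes the cap. */
        double disk_limited = disk_mib_s * ratio;
        return disk_limited < decomp_mib_s ? disk_limited : decomp_mib_s;
    }

    int main(void)
    {
        /* Spinning disk at ~200 MiB/s, zstd decompressing at ~800 MiB/s on
         * a spare core, 2.5x ratio: you read 500 MiB/s of logical data and
         * the disk, not the CPU, is still the limit. */
        printf("%.0f MiB/s\n", effective_read_mib_s(200.0, 800.0, 2.5));
        return 0;
    }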


Considering that nowadays many people have an SSD, for boot files that would mean LZ4 is best.

But perhaps for your data files that you don't open often, ZSTD is best because you save space on the SSD.


> Considering that nowadays many people have an SSD, for boot files that would mean LZ4 is best.

That depends on how concurrent boot is, and how fast your CPU and memory are. It may be true on a Celeron, but maybe not on a ThreadRipper.

Furthermore, if you look at the performance testing, the sequential read performance was almost always better with zstd than with lz4, in ZFS; and the zstd-fast mode was about as fast as the lz4 mode in sequential writes. This may be a matter of their specific integration of lz4, but nonetheless it pays to look at the actual numbers before drawing conclusions.


Indeed, in these cases lz4 makes sense. I'm thinking of what lz4 was created to "replace", namely snappy in SSTables: the pattern is mostly write once, read at most once or twice, and both of those are sequential to benefit from non-SSD speeds. In these situations fast (de)compression schemes are definitely a gain.


It depends on whether read rates are more likely to be limited by CPU or by disk speed. If the former, you'll probably want something like LZ4 to minimize CPU usage. If the latter, it may be better to use a more CPU intensive algorithm which provides better compression, so that there is less compressed data to read from the disk.


xz would be an example of the latter kind.


Does xz actually outperform ZSTD anywhere? IIRC even at high compression ratios ZSTD is a lot faster than xz.

Edit: Arch Linux switched from xz to ZSTD for their package manager and somebody compared both: https://sysdfree.wordpress.com/2020/01/04/293/

The Arch Linux developers state that they expect a 0.8% increase in package size but a 1300% speedup in decompression. Not too shabby.

I'm running Arch on my personal system and it's really noticeable, especially when I build my own packages: compression doesn't take longer than compiling anymore.


There definitely exist certain kinds of data which xz can compress better than zstd, or as well but faster, even taking into account the more extreme compression options offered by zstd. I know because we have many terabytes of such data, and I've done thorough comparisons between xz and zstd at all the different compression levels.

However in all cases zstd is much faster for decompression. It just happens that getting the best compression ratio in a not insane amount of time is still the better tradeoff for us.



The guy who stands behind this has an insane CV:

2015- Jagiellonian University, Institute of Computer Science, assistant professor,

2013-2014 Purdue University, NSF Center for Science of Information, Postdoctoral researcher (webpage),

2006-2012 Jagiellonian University, Cracow, PhD in Theoretical Physics (thesis)

2004-2010 Jagiellonian University, Cracow, PhD in Theoretical Computer Science (thesis)

2001-2006 Jagiellonian University, Cracow, MSc in Theoretical Physics (thesis)

2000-2005 Jagiellonian University, Cracow, MSc in Theoretical Mathematics (thesis)

1999-2004 Jagiellonian University, Cracow, MSc in Computer Science (thesis)


IIRC it doesn't use ANS for every compression level, can't remember much more than that, though


Basically yes, but it also depends on whether you're I/O bound and have free CPU cycles. If there's limited bandwidth to the underlying storage, zstd makes it possible to get more data in/out.


I think it's likely that LZ4 remains the default. There are almost no downsides to having LZ4 enabled.

zstd is more likely to make gzip obsolete. It surpasses gzip pretty much always.


Just going by those graphs, I could double my compression ratio by going from lz4 to zstd-1 without dropping below the speeds the drives in my pool can manage. The usual caveats apply, but it seems to me that this is a pretty good upgrade for the usual case where you're using hard drives instead of fast SSDs in a pool.


This sounds good for fileservers. But generally speaking, computers usually do more than i/o.


An option in the future to write data with a fast zstd level, so everything is speedy, and then recompress blocks which have not changed in some time with a higher-ratio level would be really great. You would have almost no performance penalty when writing data and a very high compression ratio for old data.


Fast compression levels of a given algorithm mean a lower compression ratio.

I don't know if ZFS supports variable compression levels (maybe per dataset), but Btrfs ZSTD support uses a mount option, e.g. mount -o compress=zstd:[1-15]

Thus it's possible to use a higher level (high compression ratio, slower speed, more CPU and RAM) for e.g. an initial archive. And later use a lower level (or even no compression) when doing updates. Writes use the compression algorithm and level set at mount time; and it's possible to change it while remaining mounted, using -o remount.


Btrfs is soon getting support for specifying the level on a per-file basis: https://github.com/kdave/btrfs-progs/issues/184


> I don't know if ZFS supports variable compression levels (maybe per dataset)

> Thus it's possible to use a higher level (high compression ratio, slower speed, more CPU and RAM) for e.g. an initial archive. And later use a lower level (or even no compression) when doing updates.

Yup, it works the same in ZFS. You can change the compression setting any time you like, for future writes.


I think that would require the same architectural changes needed for offline deduplication (i.e. probably not going to happen any time soon, unless I've missed some recent developments).





