zstd is incredible, but just in case the thought hasn't occurred to someone here who may benefit from it: if you're in control of both the send and the receive, type-specific compression is hard to beat.

For example, if you know you're dealing with text, you can use snappy; if you know you're dealing with images, webp; for videos, x264 (or x265 if you only care about decode speed and encoded size); and so on, falling back to zstd only when you don't have a specific compressor for the file type.
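A minimal sketch of that dispatch-and-fallback shape in Python (the registry and its entries are hypothetical; real entries would call webp/x264/etc. encoders, and the zstd fallback assumes the third-party `zstandard` bindings):

    # Route payloads to a type-specific encoder when one exists,
    # otherwise fall back to a general-purpose compressor.
    from typing import Callable, Dict

    import zstandard  # pip install zstandard

    SPECIALIZED: Dict[str, Callable[[bytes], bytes]] = {
        # "image": encode_webp,  # hypothetical type-specific encoders
        # "video": encode_x264,  # would be registered here
    }

    def compress(payload: bytes, kind: str) -> bytes:
        encoder = SPECIALIZED.get(kind)
        if encoder is not None:
            return encoder(payload)
        return zstandard.ZstdCompressor().compress(payload)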




Agreed that if you have the control (and potentially the time, depending on the algorithm), type-specific compression is the way to go. That said, zstd beats Snappy handily for text ^_^

On enwik8 (100MB of Wikipedia XML-encoded articles, mostly just text), zstd gets you to ~36MB and Snappy only to ~58MB, while gzip also lands at about 36MB. If you turn up the compression dials on zstd, you can get down to 27MB, though compression takes 52 seconds on my laptop instead of 2. Decompression takes ~0.3 seconds at either compression level.
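For reference, the level knob through the Python `zstandard` bindings looks roughly like this (assumes the package is installed and enwik8 is on disk):

    # Same input, two levels: level 19 compresses far slower than
    # level 3 but decompresses at roughly the same speed.
    import zstandard  # pip install zstandard

    data = open("enwik8", "rb").read()
    fast = zstandard.ZstdCompressor(level=3).compress(data)
    small = zstandard.ZstdCompressor(level=19).compress(data)
    print(len(fast), len(small))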


ppmd will get you better compression on text (better than zstd, and even better than lzma), but it is slow to compress and decompress. Use `7z a -m0=PPMd demo.7z demo.txt`.


Why not use an XML/XSD-specific compression format like EXI for that?

https://www.w3.org/TR/exi-primer/


It really is mostly just text. It's not _quite_

<Article>... [30 kb of text] ...</Article>

but almost.


It really depends. I tend to use a specialized compression tool if I need to compress once and send/decompress often, but use zstd when I compress and decompress a lot. In my experience, if you have a fixed, small amount of time (single-digit minutes or less), zstd is the one that will compress to the smallest size. I even often pick `-3`, as it is typically a lot faster than `-4` and higher levels, for not a huge difference in resulting size.

In my experience, if compression time is not a factor, lzip is the best for text (non-random letters and numbers). I recently had to redistribute the data from Python's nltk internally and tried compressing/decompressing it with different tools. These were my results, as compression time, compressed size, and decompression time (I picked lzip again):

    gzip -9                 10 m  503 MiB  31 s
    zstd -19                29 m  360 MiB  29 s
    7za a -si               26 m  348 MiB     s
    lzip -9                 78 m  310 MiB  50 s
    lrzip -z -L 9 (ZPAQ)   125 m  253 MiB  95 m


I did some tests myself on a 22MB SQL file and it turns out:

* 7za -m0=PPMd produced the smallest file while also being faster than bzip2

* bzip2 turned out to be way faster than both lzip (684%) and xz (644%) and produced a smaller file

* xz is marginally faster than lzip; the compressed sizes are about the same, with the xz file a tad smaller

* without any switches, 7za produces an archive a bit bigger than xz's and lzip's, in about the same amount of time

* gzip and zstd produce about the same compressed size, but zstd is a lot faster (517%) than gzip

The 7z file was produced using the -m0=PPMd switch; no command-line switches were supplied for the other files. Here are the file sizes:

  23668150 file.sql
   3899477 file.sql.7z
   4149962 file.sql.bz2
   5954982 file.sql.gz
   4540628 file.sql.lz
   4506720 file.sql.xz
   5961291 file.sql.zst


When going for smallest size, it'd be interesting to see your comparison using command-line switches for best compression (it makes a big difference, both in terms of time and size).

Was bzip2 slightly or considerably slower than zstd?


Bzip2 is slower than gzip, yes, and considerably slower than zstd. Yet zstd -19 produced a bigger file (4.3M) in about the same amount of time as bzip2.

If I remember correctly: zstd = 0.2s, gzip = 0.8s, 7zip (PPMd) = 2.1s, bzip2 = 2.7s, lzip, xz, and 7zip (lzma) = 15-16s. This is CPU time from memory, so it might not be fully accurate.

I'd say zstd and gzip are better suited for general use, while bzip2 and 7zip (PPMd) are better suited for high compression of text files.


We've also had great success using zstd with dictionary training. We dump a lot of JSON data into Kafka, most of which shares a similar schema, and training a dictionary on it easily gave a 2-3x reduction in size over lz4.
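In case it helps anyone, a rough sketch of that workflow with the Python `zstandard` bindings; the sample messages here are made up, and real training wants many varied samples:

    # Train a shared dictionary on many small, similar JSON messages,
    # then use it on both the compress and decompress sides.
    import zstandard  # pip install zstandard

    samples = [('{"user": %d, "event": "click"}' % i).encode()
               for i in range(1000)]  # stand-in training corpus

    dict_data = zstandard.train_dictionary(16 * 1024, samples)
    comp = zstandard.ZstdCompressor(dict_data=dict_data)
    decomp = zstandard.ZstdDecompressor(dict_data=dict_data)

    msg = b'{"user": 42, "event": "view"}'
    assert decomp.decompress(comp.compress(msg)) == msg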


Is there a type-specific algorithm for data that mostly consists of close numbers? I figure if I send only the deltas, it would be just a sequence of small numbers that standard compression libraries could compress easily.

An example of a close-number sequence is a simple graph: your CPU temperature is 78 degrees, and most likely it'll be 77, 78, or 79 the next tick, so the values are almost always close and the deltas will usually be 0s and 1s.


Compression via next-symbol prediction seems to be what you'd be looking for. That's what the PAQ compression schemes focus on, although they're very slow and definitely overkill for non-archival purposes. You'd probably just want to write out that data as deltas manually and have the reader know a delta format is being used. So I guess the answer is actually delta encoding, because that's a compression algorithm too.
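Something like this, say, where zlib stands in for any general-purpose compressor and the random walk is a stand-in for a slowly drifting metric:

    # Delta-encode a slowly moving series: the deltas are mostly
    # -1/0/1, which compresses far better than the raw wandering values.
    import random
    import struct
    import zlib

    random.seed(0)
    temps = [500]
    for _ in range(10000):
        temps.append(temps[-1] + random.choice([-1, 0, 1]))

    deltas = [temps[0]] + [b - a for a, b in zip(temps, temps[1:])]
    raw = struct.pack("%dh" % len(temps), *temps)
    enc = struct.pack("%dh" % len(deltas), *deltas)
    print(len(zlib.compress(raw)), len(zlib.compress(enc)))

    # The reader reverses it with a running sum.
    acc, out = 0, []
    for d in deltas:
        acc += d
        out.append(acc)
    assert out == temps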


Good point. While it won't beat a hand-optimized algorithm for a specific use case, compression with a dictionary (like zstd supports/encourages) is sort of a partial specialization of the algorithm to the type of data you're compressing.


Provided that the compression stems from repetition of blocks. If you have a file of 16-bit integers, each exactly 1 or 2 greater than the previous one, you will have 128K with no repetition that zstd will be unable to compress. However, if you transpose the bytes (all high-order bytes, followed by all low-order bytes), the compression will be very significant; similarly if you replace the numbers with their differences (see the sketch below).

The correct term from information theory is that it approximates a "universal" compressor with respect to observable Markov or FSMX sources.
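A quick illustration of the transposition point above, again with zlib standing in for zstd:

    # 16-bit counters stepping by +1 or +2: the raw bytes barely
    # compress, but splitting high and low bytes into separate streams
    # (or storing the deltas) compresses much better.
    import random
    import struct
    import zlib

    random.seed(0)
    vals, cur = [], 0
    for _ in range(30000):  # stays within 16-bit range
        cur += random.choice([1, 2])
        vals.append(cur)

    raw = struct.pack("<%dH" % len(vals), *vals)
    transposed = bytes(v >> 8 for v in vals) + bytes(v & 0xFF for v in vals)
    deltas = bytes(b - a for a, b in zip([0] + vals, vals))  # all 1s and 2s

    for name, blob in [("raw", raw), ("hi+lo", transposed), ("deltas", deltas)]:
        print(name, len(zlib.compress(blob)))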


zstd is in snappy's domain: a low-overhead way of reducing bandwidth usage. You've got devs writing some service that talks JSON or protobuf; just rub a little zstd on it and, bingo, your bandwidth is reduced.


I'd pick zstd over snappy for text.



