See the timestamp in the mailing list message: http://lists.gnu.org/archive/html/bug-gzip/2017-10/msg00000....
> Poorly tested obscure format support is a goldmine for hackers looking for exploitable code.
IMO the issue is when these code paths are automatically invoked. This is how you get gstreamer running 6502 opcodes for a codec no one has ever heard of when you download a file from the internet.
Ok, I'll bite: context?
Suppose you have a shell script which does something like "blah ... | gunzip > somewhere", where input to the "gunzip" step is under control of the attacker. Requiring a command line flag, or even an environment variable, would be enough to avoid exposing the code in question to the attacker-controlled input.
Usually, it would be too late to add a new command line flag, since people might have scripts which depend on being able to unpack files without passing that flag. In this case however, since it was broken for years and nobody else complained, it's very probable that nobody actually depends on this feature working, so requiring a new command line flag or even removing the feature would not cause many problems.
I, too, am concerned about future data loss due to codec extinction. A solution seems like it’d take the shape of an archive of the codecs in a simple, pure C subset with an actively maintained compiler. Unused code should be refactored out of mainstream tools.
It’d probably require some sort of VM specification too; maybe Java is a better choice. Keeping the code “alive” by keeping it in a project when it’s never used doesn’t seem like a strong guarantee that it will still work when you need it.
"ANSI C" is a bit of a misnomer, since a) there have been updates to the ANSI standard which are not considered "ANSI C", and b) all of the updates are sworn to be backward compatible.
But if we're talking about obscure applications being preserved for posterity, they're not going to get built.
Really I was talking about the other direction though: a way to confirm that a given program is written using only ANSI C. There's no testsuite for that, all you can do is test with the compilers that exist - and then sometimes your program suddenly stops working in the next version of the compiler.
It can’t prove a program is 100% conformant, but it can detect a much broader range of undefined behaviors than regular compilers, and does other things more strictly (like, IIRC, limiting the declarations exposed by standard library includes to what the standards guarantees they expose).
Well, they don't really need to be, that's the point.
> Really I was talking about the other direction though: a way to confirm that a given program is written using only ANSI C. There's no testsuite for that, all you can do is test with the compilers that exist - and then sometimes your program suddenly stops working in the next version of the compiler.
See the -std=c89 or -ansi flags on GCC or Clang. Another way to prevent other sources of rot is to avoid UB that is frequently subject to change. Most UB is defined, at worst, by the constraints of real ISAs (i.e. the behaviour of idiv when dividing INT_MIN by -1 is an exception on x86, but the ARMv7 equivalent simply produces a number), and most UB-altering characteristics are standard across popular ISAs, and new (general purpose) ISAs tend to go with the status quo when in doubt.
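To make the divide case concrete, here's a minimal toy example of mine - the division below is undefined behaviour in C even though each ISA has a perfectly deterministic answer (x86 traps, ARM just returns a value):

  /* div_ub.c - INT_MIN / -1 is undefined behaviour in C.
     On x86 the idiv instruction raises a divide exception (usually seen as
     SIGFPE); on ARM, sdiv simply yields INT_MIN. The compiler may assume
     the case never happens. */
  #include <limits.h>
  #include <stdio.h>

  int divide(int a, int b)
  {
      return a / b;   /* UB when a == INT_MIN && b == -1 */
  }

  int main(void)
  {
      int a, b;
      /* read the operands at run time so the compiler can't fold the division */
      if (scanf("%d %d", &a, &b) != 2)
          return 1;
      printf("%d\n", divide(a, b));   /* try: echo "-2147483648 -1" | ./div_ub */
      return 0;
  }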
And if all else fails, you can still run your code in an ISA simulator, and compile it with an old compiler, to figure out exactly what it meant in the past.
They won't help you if your code relies on behaviour not defined in the standard, which is almost always what happens in practice.
> Most UB is defined, at worst, by the constraints of real ISAs (i.e. the behaviour of idiv when dividing INT_MIN by -1 is an exception on x86, but the ARMv7 equivalent simply produces a number), and most UB-altering characteristics are standard across popular ISAs, and new (general purpose) ISAs tend to go with the status quo when in doubt.
ISA extensions aren't always so conservative - the x86 family never used to have alignment requirements, but some of the newer SSE-family instructions do. And modern C compilers will optimise code that invokes UB in surprising ways - codepaths that do signed integer overflow on x86 used to silently wrap around in two's-complement fashion, and oversized shifts used to behave the way the hardware did, but these days compilers may simply not generate code for such a codepath at all, since the behaviour is officially undefined.
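As a concrete sketch of the latter (my own example; the exact outcome depends on compiler and optimisation level), a wrap-around check written this way may simply be deleted by the optimiser, because signed overflow is undefined and the compiler is allowed to assume it never occurs:

  /* overflow_check.c - a post-hoc overflow test that relies on wrap-around.
     At -O2, GCC and Clang may fold "x + 1 < x" to false for signed x and
     drop the branch entirely, because a UB-free program can never reach it. */
  #include <limits.h>
  #include <stdio.h>

  int next(int x)
  {
      if (x + 1 < x) {                /* "did it wrap?" - UB for signed int */
          fprintf(stderr, "overflow\n");
          return INT_MAX;
      }
      return x + 1;
  }

  int main(void)
  {
      printf("%d\n", next(INT_MAX));  /* unoptimised builds often print 2147483647,
                                         optimised ones often print -2147483648 */
      return 0;
  }

The portable version compares against INT_MAX before adding, which stays defined under every standard revision and optimisation level.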
> And if all else fails, you can still run your code in an ISA simulator, and compile it with an old compiler, to figure out exactly what it meant in the past.
Sure, but in that case you're not gaining anything from using ANSI C, and might as well stick with whatever the code is written in, or just keep the binaries and run them on an emulator.
Not being able to compile and run your code, with no changes or only minor ones, is an edge case. I frequently run unchanged or minimally changed code from before ANSI C. Sure, if you are definitely going to need a simulator to understand the code, you might as well just use all the original infrastructure; but if you'd like to run it in the interim, C is nice, and sticking to a restricted subset means you'll have fewer things to change if you want to adapt it to a new system.
Not my experience. A lot of pre-ANSI code would segfault if you didn't build with -fwritable-strings, for example, and there was certainly talk of removing that option from GCC entirely.
> if you'd like to run it in the interim, C is nice, and sticking to a restricted subset means you'll have fewer things to change if you want to adapt it to a new system.
I wouldn't call C nice, but if you are using it I would stick with the style today's popular compilers and programs use - popularity is the best guarantee of future compatibility. Using "ANSI C" might be understood to mean e.g. using trigraphs to replace special characters, which would probably be a net negative for long-term maintainability (I think some compilers are already removing support?).
I'm not convinced it's the compression software author's responsibility.
I once fired up a TOPS-10 VM on SIMH to resurrect an old system for historical research purposes:
ditto 4.2BSD and so on, various .tar.Z archives of yore, etc.
Quite a lot of good code that already does what you want to do was written decades before you realized you wanted to do it.
The original versions of both can be found in the Unix Tree, which I've generally found to be an incredible resource/time-sink.
compress: 302 MiB (-61%)
pack: 548 MiB (-30%)
The easily confused names don't help with the obscurity.
- A retrospective on how compression methods have advanced; it's quite interesting to see how much smaller the same file gets as you survey the distinct compression methods.
- nice explanation of Huffman coding
- mention of code golfing
A quintessential HN article. I think each of the 3 items above is worth 10 hacking nerd bonus points (collect 50 points to get a Devo Energy Dome).
If code golfing is done using Befunge or Brainf*, multiply score by 3.
EDIT: BTW, Brainf____ source code probably compresses nicely with Pack.
Meanwhile, xz is seekable and lzip is not. Both formats use the same compression algorithm under the hood, so size differences are minimal. And lzip is licensed under GPLv3, which means it cannot become a universally adopted standard, which is surely important for long-term storage.
"The lzip format is as simple as possible (but not simpler). The lzip manual provides the source code of a simple decompressor along with a detailed explanation of how it works, so that with the only help of the lzip manual it would be possible for a digital archaeologist to extract the data from a lzip file long after quantum computers eventually render LZMA obsolete."
Additionally there is a separate public domain version for those allergic to GPL:
lzip is included in every Linux distro's package repository, and in the three major BSDs.
Ah. I wasn't misremembering, though: it seems it was previously GPLv3, but changed to v2+ with lzip 1.16 from late 2014.
In any case, lzip has already lost the popularity war. If XZ were actually "inadequate" and "inadvisable" as the lzip author claims, I might consider that a shame and work to counter it. In reality, the blog post comes off more like sour grapes. Some other things:
- The post spends several paragraphs complaining that some things in XZ are padded to a multiple of four bytes. Seriously? It may be unnecessary, but it's also very common in binary formats and makes very little difference.
- I was mistaken when I interpreted the blog post as claiming that XZ is "very slightly less likely to detect corrupted data". Actually, it's probably more likely, because it uses CRC64 whereas other formats (including lzip) use CRC32. (Although supposedly lzip makes up for this by being more likely to detect errors through the decoding process itself.) The post notes this, but claims that reducing what it calls "false negatives" (i.e. thinking a corrupted file is not corrupted) is not important "if the number of false negatives is already insignificant".
The post does make a decent case that XZ should have some kind of check or redundancy for length fields - though arguably the index serves as that, and there's a case to be made that lzip is worse because it doesn't have one.
But where the post spends most of its time claiming lzip is superior is in reducing false positives - that is, thinking a file is corrupted when it is really intact! How is that even possible? Well, because the checksum itself could get corrupted while the actual data in the file remains intact. The post assumes that each bit written has a given chance of being corrupted, so a larger file will, on average, have more corruption than a smaller file. Because a longer checksum is slightly larger (even though checksums are still a tiny fraction of the overall file size), it's supposedly a greater risk.
Which is such a load of bunk, especially when it comes to long-term archiving. In practice, the probability of error is not independent per bit. Rather, you should have a setup that uses error-correcting codes (whether in the form of RAID parity, built-in ECC on a NAND chip, etc.) to make the probability negligible that you will ever see any errors. Which is not to say that there aren't bad setups that can encounter errors… or that there aren't people who stick their treasured photos on some $99 USB hard disk that frequently takes nosedives to the floor. But even then, one error is not just one error; it's often a precursor to the entire drive becoming unreadable. In any case, the solution is not "store very slightly less data to reduce what's exposed to damage", it's "fix your setup"!

On the other hand, for anyone who thinks they have a good setup, it's very, very important to detect if an error has occurred, as an error would indicate a problem that needs to be fixed to protect the rest of the data. (Or in other cases, such as corruption during transmission, it could indicate the need to re-copy that file from elsewhere.) So "false negatives" are extremely harmful. Meanwhile, "false positives" aren't necessarily bad at all, because if a given file is only detected as corrupted because the checksum itself was corrupted, it still demonstrates that there was an error! I'd go so far as to say you can never have too many checksums - but at any rate, it's absurd to treat false positives as more important than false negatives.
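To put a rough number on the size argument (my own back-of-envelope, using the post's own independent-bit-flip assumption and a hypothetical 1 MiB compressed member):

  P(a random flip lands in the check field) ~= check size / member size
    CRC64: 64 bits / 8,388,608 bits ~= 7.6e-6
    CRC32: 32 bits / 8,388,608 bits ~= 3.8e-6

So the extra "exposure" a 64-bit check adds is a few flips per million per member, while each extra check bit roughly halves the chance of a corrupted member slipping through verification undetected.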
- Oh, this is rich. Another part of the post makes a big deal about lzip being able to handle arbitrary data appended to the end of the file (because apparently not supporting that means "telling you what you can't do with your files"). But lzip also supports concatenating two lzip files together to get something that decodes to the concatenation, and this is how "multi-member" streams work (which you get e.g. from parallel lzip, or by using an option to limit member size). There's nothing to indicate whether or not a "member" is the last one in the file; the decompressor just checks whether the data that follows starts with the magic bytes "LZIP".
But what if there's a bitflip in one of the magic bytes? How do you know that what follows is a corrupted member, and not some random binary data that someone decided to append, which can be safely ignored?
You don't. I tried it. If you make any change to the magic bytes of a member after the first one, decompressing with `lzip -d` silently truncates the output at that point, without reporting any error. That's right - the supposedly safer-for-archiving lzip can have silent data corruption with a single bit flip.
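For anyone who wants to reproduce that sort of experiment, here's a rough sketch of a one-bit flipper (my own throwaway, nothing lzip-specific about it - you still have to locate a later member's "LZIP" magic offset first, e.g. with a hex dump):

  /* flipbit.c - flip a single bit at a given byte offset in a file.
     Usage: ./flipbit FILE BYTE_OFFSET BIT(0-7)
     Find the offset of a later member's "LZIP" magic with a hex dump first. */
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      if (argc != 4) {
          fprintf(stderr, "usage: %s FILE BYTE_OFFSET BIT\n", argv[0]);
          return 1;
      }
      FILE *f = fopen(argv[1], "r+b");
      if (!f) { perror("fopen"); return 1; }

      long off = atol(argv[2]);
      int bit = atoi(argv[3]);
      if (bit < 0 || bit > 7) { fprintf(stderr, "bit must be 0-7\n"); return 1; }

      if (fseek(f, off, SEEK_SET) != 0) { perror("fseek"); return 1; }
      int c = fgetc(f);
      if (c == EOF) { fprintf(stderr, "offset past end of file\n"); return 1; }

      /* An update stream needs a seek between the read and the write. */
      fseek(f, off, SEEK_SET);
      fputc(c ^ (1 << bit), f);
      fclose(f);
      return 0;
  }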
The manual actually mentions this, but brushes it off:
> In very rare cases, trailing data could be the corrupt header of another member. In multimember or concatenated files the probability of corruption happening in the magic bytes is 5 times smaller than the probability of getting a false positive caused by the corruption of the integrity information itself. Therefore it can be considered to be below the noise level.
So it's not just a question of "the false negative rate is already low, and we should also get the false positive rate as low as possible". Rather, the manual directly compares the probabilities of a bit flip either (a) causing an error when theoretically the data could be fully recovered, or (b) silently corrupting data. Since the chance of (b) is a whole 5 times smaller than (a), its reasoning goes, it can be disregarded. That makes no sense at all! Those are not comparable scenarios! If you have a backup, then any detected error is recoverable no matter what bytes it affects - but an undetected error could be fatal, as backups are rotated over time.
I was curious: if you flip a random bit in an .lz file, what's the chance of it being unprotected? In other words, what fraction of the file is made up of 'LZIP' headers? Using plzip (parallel lzip) with default settings - which makes one member per 16MB of uncompressed data - I measured this as ranging anywhere from ~2 in 10 million for mostly uncompressible data, to over 1 in 1000 (!) for extremely compressible data (all zeroes).
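If you want to reproduce a rough version of that measurement, one crude way is to scan the file for the 4-byte magic and see how many bytes it accounts for (sketch of mine below; it will also count any "LZIP" that happens to occur inside compressed data, so treat the result as an approximation rather than a proper parse of the member structure):

  /* lzip_magic_fraction.c - approximate what fraction of an .lz file's bytes
     are "LZIP" member magics by scanning for the 4-byte string.
     A real measurement would walk the member headers instead. */
  #include <stdio.h>
  #include <string.h>

  int main(int argc, char **argv)
  {
      if (argc != 2) {
          fprintf(stderr, "usage: %s FILE.lz\n", argv[0]);
          return 1;
      }
      FILE *f = fopen(argv[1], "rb");
      if (!f) { perror("fopen"); return 1; }

      unsigned char win[4] = {0};
      unsigned long long total = 0, hits = 0;
      int c;
      while ((c = fgetc(f)) != EOF) {
          /* slide a 4-byte window over the file */
          win[0] = win[1]; win[1] = win[2]; win[2] = win[3];
          win[3] = (unsigned char)c;
          total++;
          if (total >= 4 && memcmp(win, "LZIP", 4) == 0)
              hits++;
      }
      fclose(f);

      printf("%llu magics in %llu bytes (~%g of the file)\n",
             hits, total, total ? (double)(hits * 4) / (double)total : 0.0);
      return 0;
  }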
To be fair, I am not sure whether xz has any comparable vulnerabilities. But if you're going to talk about how superior your format is in the face of bitflips, your format had better be actually resilient to bitflips!
Speaking of which, is there even a unixy way of adding/applying error correction information to a stream or file?
Ideally, one should be able to do this in a pipe.
Basically, you want the tool "par2".
# 151M linux-4.14-rc6.tar.gz
# GZIP decompression
~$ time gzip -dk linux-4.14-rc6.tar.gz
# 787M linux-4.14-rc6.tar
# LRZIP compression
~$ time lrzip -v linux-4.14-rc6.tar
linux-4.14-rc6.tar - Compression Ratio: 7.718. Average Compression Speed: 13.789MB/s.
Total time: 00:00:56.37
# 137M linux-4.14-rc6.tar.lrz
# LRZIP decompression
~$ time lrzip -dv linux-4.14-rc6.tar.lrz
100% 786.16 / 786.16 MB
Average DeCompression Speed: 131.000MB/s
Output filename is: linux-4.14-rc6.tar: [OK] - 824350720 bytes
Total time: 00:00:06.35
~$ du -hs linux* | sort -h
~$ time xz -vk linux-4.14-rc6.tar
100 % 98.9 MiB / 786.2 MiB = 0.126 3.0 MiB/s 4:25
~$ du -hs linux* | sort -h
$ time xz linux-4.14-rc6.tar
$ wc -c linux-4.14-rc6.tar.xz
$ time zstd --ultra -20 linux-4.14-rc6.tar
linux-4.14-rc6.tar : 12.81% (824350720 => 105564246 bytes, linux-4.14-rc6.tar.zst)
$ time cat linux-4.14-rc6.tar.xz | xz -d > out1
$ time cat linux-4.14-rc6.tar.zst | zstd -d > out2
ar is used as the format of .a/.lib files by almost all C compilers (IIRC even by MSVC). Libraries are simply archives of .o files, on some platforms with an additional symbol->file index, which is what gets added by running the .a through ranlib.
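For the curious, here's a sketch of walking that classic format (field names are mine, mirroring the traditional <ar.h>); the symbol index ranlib adds is just another member, conventionally named "/" (GNU) or "__.SYMDEF" (BSD):

  /* ar_list.c - walk a classic "!<arch>\n" archive and print member names.
     Header fields are fixed-width ASCII padded with spaces; member bodies
     start on even offsets, so odd-sized members get one pad byte. Special
     members such as the "/" symbol index or the "//" long-name table show
     up here as ordinary entries. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct ar_hdr {
      char name[16];        /* member name, often '/'-terminated (GNU) */
      char date[12];        /* decimal mtime                           */
      char uid[6], gid[6];  /* decimal owner ids                       */
      char mode[8];         /* octal permissions                       */
      char size[10];        /* decimal size of the member body         */
      char fmag[2];         /* always "`\n"                            */
  };

  int main(int argc, char **argv)
  {
      if (argc != 2) { fprintf(stderr, "usage: %s lib.a\n", argv[0]); return 1; }
      FILE *f = fopen(argv[1], "rb");
      if (!f) { perror("fopen"); return 1; }

      char magic[8];
      if (fread(magic, 1, 8, f) != 8 || memcmp(magic, "!<arch>\n", 8) != 0) {
          fprintf(stderr, "not an ar archive\n");
          return 1;
      }

      struct ar_hdr h;
      while (fread(&h, 1, sizeof h, f) == sizeof h) {
          char sizebuf[11];
          memcpy(sizebuf, h.size, 10);
          sizebuf[10] = '\0';
          long size = strtol(sizebuf, NULL, 10);

          printf("%.16s  %ld bytes\n", h.name, size);

          /* skip the body, plus the pad byte after odd-sized members */
          if (fseek(f, size + (size & 1), SEEK_CUR) != 0) break;
      }
      fclose(f);
      return 0;
  }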