The Elegance of Deflate (codersnotes.com)
251 points by ingve on Aug 22, 2016 | 82 comments



There were a lot of encoders at the time using this general scheme (with a few extra values to indicate match lengths or distances). PKZIP won at the time because it was faster, and PK had the name recognition from his PKARC, which was a superfast implementation of SEA ARC (the dominant archiver on PCs at the time).

PK had to stop working on PKARC because of a C&D request from SEA. He wrote the first algorithms of PKZIP, which were on par with SEA ARC on compression (and with PKARC on speed), but weren't much better. (And they have been deprecated since the 1990s, if I'm not mistaken.)

Then the Japanese encoders started to rule the compression ratio (and had comparably reasonable compression times) - LHArc, LZARI, I don't remember the rest of the names. LHArc or LHA (I don't remember which) had basically the same scheme that PKZIP converged on, except it used adaptive arithmetic coding. PK replaced that with static Huffman coding, trading a little compression for a lot of speed, and the format we now know and love as "zlib compression" was born (and quickly took the world by storm, being in a sweet spot of compression and speed).

There's another non-trivial thing that PKZIP had going for it - it put the directory at the end, which meant you could see the list of files in the archive without reading the entire archive! This sounds simple, but back then everyone adopted the tar-style "file header then data" appendable style, which meant that just listing the files inside a 300KB zip file (almost a whole floppy disk!) meant reading that entire floppy (30 seconds at least). PKZIP could do it in about 2 seconds.
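For anyone curious what that looks like in practice, here's a rough sketch in C (not PKZIP's code, just an illustration that ignores archive comments, multi-disk archives and Zip64) of listing a zip's contents by seeking straight to the end-of-central-directory record:

    /* Sketch: list a zip's files by reading only the end of the archive.
       Assumes a single-disk archive with an empty archive comment; real
       readers scan backwards for the EOCD signature and handle Zip64. */
    #include <stdio.h>
    #include <stdint.h>

    static uint16_t rd16(const unsigned char *p) { return p[0] | (p[1] << 8); }
    static uint32_t rd32(const unsigned char *p) {
        return p[0] | (p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }

    int list_zip(const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;

        unsigned char eocd[22];                 /* fixed part of the EOCD record */
        fseek(f, -22L, SEEK_END);               /* only valid if the comment is empty */
        if (fread(eocd, 1, 22, f) != 22 || rd32(eocd) != 0x06054b50) {
            fclose(f);
            return -1;
        }

        uint16_t entries   = rd16(eocd + 10);   /* total central directory entries */
        uint32_t cd_offset = rd32(eocd + 16);   /* where the central directory starts */

        fseek(f, (long)cd_offset, SEEK_SET);
        for (uint16_t i = 0; i < entries; i++) {
            unsigned char hdr[46];              /* fixed part of a CD file header */
            if (fread(hdr, 1, 46, f) != 46 || rd32(hdr) != 0x02014b50) break;
            uint16_t name_len  = rd16(hdr + 28);
            uint16_t extra_len = rd16(hdr + 30);
            uint16_t cmt_len   = rd16(hdr + 32);
            char name[260] = {0};
            size_t take = name_len < 259 ? name_len : 259;
            if (fread(name, 1, take, f) != take) break;
            printf("%s\n", name);
            /* skip any leftover name bytes, the extra field and the comment */
            fseek(f, (long)(name_len - take) + extra_len + cmt_len, SEEK_CUR);
        }
        fclose(f);
        return 0;
    }

    int main(int argc, char **argv) {
        return argc > 1 ? list_zip(argv[1]) : 1;
    }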


> There's another non-trivial thing that PKZIP had going for it - it put the directory at the end, which meant you could see the list of files in the archive without reading the entire archive!

At the beginning and the end, which continues to bite us in the ass to this day: we regularly stumble over bugs where one part of a toolchain uses one entry and the rest uses the other.


The directory is only at the end in PKZIP; there is additional info before each compressed file too, like in TAR. It's actually good to have both, as it allows recovering individual files even when there's corruption in other files or in the directory.


Yes, it's possible to use that information to recover from corruption.

The problem is: what happens if the two are different? This can happen by accident or maliciously. If one tool uses the first and another uses the second, then you'll end up with different results, and it can be hard to figure out why they are different.

It's like a CD, where audio players interpret the disc differently than CD-ROM drives. Some anti-copying techniques tried to take advantage of this to produce un-rippable CDs. A problem, however, was that some car CD players used CD-ROM drives, so these CDs weren't playable in those cars. (https://en.wikipedia.org/wiki/Compact_Disc_and_DVD_copy_prot... )

Or some HTTP header attacks based on duplicate headers. Let's say there's a firewall which allows only access requests for "Content-Type: text/plain". If the attacker includes the header twice, once as text/plain and the other as image/gif, then the firewall might only check the first while the back-end web server interprets the second. This is a silly example; I couldn't remember the real attacks which take advantage of the same mechanism.


Zip inconsistency vulnerability: https://nakedsecurity.sophos.com/2013/08/09/android-master-k...

Duplicate HTTP header vulnerabilities:

https://bugzilla.mozilla.org/show_bug.cgi?id=376756

https://splash.riverbed.com/thread/7772

If a spec is open for interpretation, you can be sure that some software gets it wrong.


The "zip inconsistency" linked is not a problem of the Zip file format specification. The goal of the zip format was never to provide a tamper-proof cryptographically secure binary package. Whoever uses this structure for some more security-wise complex demands is responsible to make sure that the security assumptions he needs are respected by the code he implements. One simple method were signing the resulting archive, then no modification is possible, and therefore also isn't possible adding a second entry with the same name by the attacker.

I really don't care about HTTP headers or Reed-Solomon, as they are completely irrelevant to the Zip format, designed in the eighties for very specific purposes (being fast on a floppy disk and a 4 MHz 8088 processor, and allowing recovery of as much as possible from the floppy even after a sector failure). I just claim, throughout this whole conversation, that the Zip format is not a bad format for having a central directory and the metadata outside of it too, and I gave my arguments for that. NTFS also keeps some metadata in more than one place, and it's an intentional design decision solving real problems: in Zip's case, allowing fast access to the list of files in the archive, but also easy recovery of individual files when some part of the archive gets corrupted.

I also claim that the Zip format is not in any way "bad" for not making it impossible, per the archive format design and specification, to have two identical file names in the archive. I can even imagine use cases where allowing exactly that is/was beneficial.


I have news for you: you'll find the same "problem" (cache consistency) everywhere in computing. Still, having the cache is a good thing. Maybe even the most important thing.

"Almost all programming can be viewed as an exercise in caching." —Terje Mathisen

The directory at the end was a fantastic cache as the seek times were unacceptably high. It's still worth maintaining for a lot of use cases. If you have a different use case where its nonexistence would make your life easier, fine, but that doesn't mean nobody needs it or that it's "wrong" to have it.


I didn't argue that was right or wrong. I explained why creshal's comment about it biting them in the ass made sense.

Your first argument was about the benefits of fixing data corruption, not about latency.

Duplication is a poor way to handle data corruption. If you really wanted to recover data robustly, there are better methods for error correction.

Yes, you are free to make a new argument that there are other benefits to data duplication. Do be aware that the HTTP header duplication example I gave is not one of cache consistency, but of the more general problem of data consistency.


> If you really wanted to recover data robustly, there are better methods for error correction.

Please tell me which format is better for recovering most of the files from the multifile archive if the corruption occurs somewhere (e.g. a small part of the whole archive file gets corrupted): the tar.gz or the Zip?

Which archive format would you use that is "better"? Is it widely used? What are its disadvantages?


> Please tell me which format is better for recovering most of the files from the multifile archive if the corruption occurs somewhere (e.g. a small part of the whole archive file gets corrupted): the tar.gz or the Zip?

For the same filesize as the zip you could transmit the tar.gz + a par2, which would be much better for recovery from such corruption.


Sometimes the corruption is in the metadata, in which case you are lucky and the duplication can help.

But neither method you mentioned lets you recover your data if the data itself is corrupted.

If you really want to recover data, use something with error-correction, like Parchive. RAR 5.0's recovery record is another solution, also based on Reed-Solomon error correction codes.

Archive tools usually don't use those methods because performance and compression size are the prime reasons people use an archive system, not robustness to error. In large part that's because the storage and network protocols are so good that such errors rarely occur - because they themselves use error-correction codes.


> But neither method you mentioned lets you recover your data if the data itself is corrupted.

My experience is different: if I've made a zip archive of 1000 broadly similar files, even if it's a few GB big, and I'm left with a third of the archive, I can easily recover the first third of the files; if I have half of the archive, around half of the files. That's exactly because the common zip archiver implementations recognize that the directory at the end is missing but still let me extract everything they can read, as they use the data in front of every file, even with the "central directory" completely missing.

If the corruption affects file 3 of 1000 files, I can extract both the first two and all of 4-1000. If the corruption is in the first third of the tar.gz, most probably I won't be able to extract anything but files 1 and 2. All the remaining 998 are then lost.

And listing the files inside a non-corrupted ZIP archive, no matter how big, is effectively instantaneous, thanks to the same central directory. Try to list the files inside a tar.gz that is a few GB big, then compare. ZIP is a very convenient format for fast access to the list of files and to a single file in the archive, and still very convenient when part of the archive is missing.
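To make the salvage part concrete, here's a rough sketch of the kind of scan such tools do: walk the surviving bytes looking for the local file header signature and report whatever entries look intact. This isn't any particular archiver's code, and it ignores data descriptors, Zip64 and encryption:

    /* Sketch: salvage listing from a damaged zip by scanning for local file
       headers (signature PK\x03\x04), even with the central directory gone. */
    #include <stdio.h>
    #include <stdint.h>

    static uint16_t rd16(const unsigned char *p) { return p[0] | (p[1] << 8); }
    static uint32_t rd32(const unsigned char *p) {
        return p[0] | (p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }

    void salvage_scan(FILE *f) {
        unsigned char hdr[30];                  /* fixed part of a local file header */
        while (fread(hdr, 1, 30, f) == 30) {
            if (rd32(hdr) != 0x04034b50) {      /* not a local header: resync byte-wise */
                fseek(f, -29L, SEEK_CUR);
                continue;
            }
            uint32_t comp_size = rd32(hdr + 18);
            uint16_t name_len  = rd16(hdr + 26);
            uint16_t extra_len = rd16(hdr + 28);
            char name[260] = {0};
            size_t take = name_len < 259 ? name_len : 259;
            if (fread(name, 1, take, f) != take) break;
            printf("found entry: %s (%u compressed bytes)\n", name, comp_size);
            /* skip leftover name bytes, the extra field and the compressed data */
            long skip = (long)(name_len - take) + extra_len + (long)comp_size;
            fseek(f, skip, SEEK_CUR);
        }
    }

    int main(int argc, char **argv) {
        FILE *f = argc > 1 ? fopen(argv[1], "rb") : NULL;
        if (!f) return 1;
        salvage_scan(f);
        fclose(f);
        return 0;
    }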

Edit, this is a response to your response: I never mentioned Reed-Solomon anywhere in the discussion. I've defended what is in my opinion a good design decision of Zip archives, having the central directory and the metadata in front of every compressed file, and I just gave some real-life examples of corrupted files I've encountered where I've had the best recovery ratio with the ZIP format. And I've never experienced the kind of "data mismatch" you refer to myself. You are welcome to describe it more; I'd like to read how exactly you experienced it. If the answer is that you yourself wrote a program that ignored the full Zip format, or used such a bad library, then that is the real problem. Once again: "Almost all programming can be viewed as an exercise in caching." —Terje Mathisen


Please pick a topic and stick with it.

You asked about corruption, specifically when "a small part of the whole archive file gets corrupted". I addressed that point. I'm not going to cover all of the advantages and disadvantages between the two systems, because they are well-known and boring.

You of course must choose the method that matches your expected corruption modes. An incomplete write is a different failure mode than the one you wanted me to address. Reed-Solomon isn't designed for that. Other methods are, and they are often used in databases.

And you of course must choose a method which fits your access model.

Just like you would of course choose tar over zip if you want to preserve all of the filesystem metadata on a unix-based machine. Zip doesn't handle hard links, for example.

My point again is that data duplication causes well-known problems when there is a data mismatch, and while the duplication in zip, as a result of its append-only/journalling style of format, helps with some forms of error recovery, there are better methods for error recovery if that is your primary concern.


What you call "data duplication" is actually metadata redundancy, which results in faster access to the list of all the files in the archive (a good example of pre-computed, pre-cached info). The format is good; if some library can't cope with the format, the problem is with the library corrupting the archive. And that archive is still recoverable, even after being corrupted by a wrongly written tool, because of the existing redundancy.

The concept of metadata redundancy isn't specific to archives; filesystems do it too. It's always an engineering trade-off, and I consider the one in the Zip format a good one, based on the real-life examples I've given.


I agree that it's a good format. It's an engineering compromise. The benefits you see also cause creshal's comment:

> At the beginning and the end, which continues to bite us in the ass to this day: we regularly stumble over bugs where one part of a toolchain uses one entry and the rest uses the other.

Do you think creshal is mistaken?

Of course they are due to bugs in the toolchain. That's a natural consequence of the file format. That's the whole point of creshal's comment.

What do you think I'm trying to say? Because I don't understand your responses, given what I thought I was trying to say.


Garbage in, Garbage out.


Weren't there also (alleged) patents on arithmetic coding back then? Huffman coding had that as an advantage, too, especially when GNU were looking for an alternative to Unix' compress program.


Yes there were. JPEG was also affected, and most decoders still don't support the optional arithmetic mode.


LZH actually remains very common in BIOS images, probably because of its superior compression in relation to the other formats along with its low implementation complexity.

See this article, for example:

http://www.intel.com/content/www/xr/ar/support/boards-and-ki...

The file it references can be downloaded here:

https://downloadcenter.intel.com/download/8375/Logo-Update-U...

...and it's somewhat amazing that it contains the original LHA.EXE from 25 years ago:

    LHA version 2.13                      Copyright (c) Haruyasu Yoshizaki, 1988-91
    === <<< A High-Performance File-Compression Program >>> ========  07/20/91  ===


Hey, the first link seems to be to an RTL-language version of the page; here is the left-to-right version, which should look better:

https://www-ssl.intel.com/content/www/xr/en/support/boards-a...


And "there's another non-trivial thing that PKZIP had going for it":

https://en.wikipedia.org/wiki/Phil_Katz

"SEA was attempting to retroactively declare the ARC file format to be closed and proprietary. Katz received positive publicity by releasing the APPNOTE.TXT specification, documenting the Zip file format, and declaring that the Zip file format would always be free for competing software to implement."


Really enjoyed reading this, I think I'd like to look more closely at an implementation to understand it further.

Reading the article, I was reminded of a nagging question I've had in the back of my mind for a bit.

The ASCII table was invented in the '70s, when the cost of disk storage was much higher than it is today. But it's a shared dictionary, a standard that we can all agree upon, something that's already on every computer.

The thing I've wondered about is whether there could be any advantage to creating new generations of shared dictionaries that are domain-specific, and much larger.

For example, in the specific (and over-simplified) case of transmitting large quantities of English text from a server to a client, you could reduce the amount of data sent over the wire if both parties shared a lookup table that contained the entire English dictionary. In that case, you wouldn't transmit a book as a collection of characters, but rather as a collection of words.

Furthermore, it would seem like you could apply the same traditional compression methods described in the article to further reduce the amount of data being sent. Rather than identifying repeating patterns of letters, you would identify repeating patterns of words.

Of course, the obvious drawback is that an English lookup table is useless for transmitting anything other than English text. But again, disk storage being as cheap as it is, I wonder if it wouldn't be such a monumental problem to store many domain-specific dictionaries.

Of course, you'd always want to keep the ASCII table as a fallback. Much in the same way that a conversation in sign language (ASL, specifically) is largely composed of hand gestures referring to words or concepts, but the language still includes the complete alphabet as a fallback.

The thing I don't understand well enough is whether modern compression already incorporates concepts that are similar enough such that a shared-word dictionary would be useless. It seems like the "LZXX" style of compression essentially includes a similar but dynamically-generated sort of dictionary at the beginning of the transmission, and subsequently refers to that in order to express the message itself. Would the gains of having enormous shared dictionaries cancel out the advantage of that approach, only in a much more "wasteful" way?


The cost-benefit is just too low for this to be practical in most cases:

* The large data being transferred is almost never text; it's video, images, or audio, which already uses advanced (sometimes lossy) compression

* Even if it is text, it's probably in some document format or HTML, which greatly complicates the shared-dictionary scenario (and would require a separate dictionary for each format)

In reality even deflate is already doing what you propose, but it builds the dictionary dynamically and includes it with each message. The upside is that its dictionary can find common phrases and patterns that wouldn't be in your fixed dictionary. As the size of the text grows, the dictionary becomes a smaller and smaller percentage of the total.
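As an aside, zlib does expose a hook for the fixed-dictionary idea: you can prime the compressor's 32 KB window with a preset dictionary, and the decompressor supplies the same bytes when asked for them. A minimal sketch (error handling mostly omitted; the dictionary contents here are just an illustration):

    /* Sketch: shared "preset dictionary" with zlib. Both sides must agree on
       the dictionary bytes out of band; only an Adler-32 of it is transmitted. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    static const Bytef dict[] = "the quick brown fox jumps over the lazy dog";

    int main(void) {
        const char *msg = "the quick brown fox was here; the lazy dog was not";
        Bytef comp[256], out[256];

        /* compress with the preset dictionary */
        z_stream c = {0};
        deflateInit(&c, Z_BEST_COMPRESSION);
        deflateSetDictionary(&c, dict, sizeof dict - 1);
        c.next_in  = (Bytef *)msg;  c.avail_in  = (uInt)strlen(msg);
        c.next_out = comp;          c.avail_out = sizeof comp;
        deflate(&c, Z_FINISH);
        uLong comp_len = sizeof comp - c.avail_out;
        deflateEnd(&c);

        /* decompress: zlib asks for the dictionary when it needs it */
        z_stream d = {0};
        inflateInit(&d);
        d.next_in  = comp;  d.avail_in  = (uInt)comp_len;
        d.next_out = out;   d.avail_out = sizeof out;
        if (inflate(&d, Z_FINISH) == Z_NEED_DICT) {
            inflateSetDictionary(&d, dict, sizeof dict - 1);
            inflate(&d, Z_FINISH);
        }
        printf("%.*s\n", (int)(sizeof out - d.avail_out), out);
        inflateEnd(&d);
        return 0;
    }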


Even text is usually precompressed. Both OOXML and ODF use deflate internally, epub as well, and PDF has such a high bloat-to-text ratio that compressing only the text is moot (assuming you don't have one of the increasingly common PDFs where even the text is rendered as image, aaaaargh…).


I've looked into PDFs a little for a personal project of mine that worked with auto-generated PDFs. I've found they can vary wildly in the amount of bloat, and that much of the bloat seems to come from embedded fonts. If you actually look at the PDF file's binary, it's not that hard to understand, and the actual text is stored mainly as ASCII/UTF-8, I think. So, if I wanted to really efficiently store a bunch of PDFs that come from the same source, it seems like it should be possible to copy the "bloat" sections (likely embedded fonts), which are all completely identical, use those as a dictionary in a custom compressor, and then just use zlib for the rest.

I've also noticed that some PDF generators include far, far more bloat than others, though I'm not sure why.


It sounds like what you're thinking of has been included in the Brotli compression algorithm, which uses a relatively large initial dictionary. There are similar ideas with a more limited scale in HTTP/2's HPACK algorithm for header compression.


Google's SDCH is kind of that, but each website can in advance decide what the contents of the shared dictionary are. So when you download a SDCH compressed page you may also download a dictionary. Information and experiences with usage are sparse, here's one article from LinkedIn: https://engineering.linkedin.com/shared-dictionary-compressi... -- "Average additional compression across our static content was about 24%"


Microsoft implemented a large pre-shared dictionary for their binary XML format. When used in WCF, the XML is a SOAP envelope with some data-contract serialized messages, so a lot of XML elements/attributes/values are known in advance. It works very well in practice; here's my blog post about that: http://const.me/articles/net-tcp/


I considered a project loosely along those lines. My idea was a One-Meg-of-Data project: define a megabyte of data that everyone would have in a bit-identical form.

The appeal of this to me would be in the code-golf style of approach to specifying the actual data. Apart from plain useful raw data, it would include a bare-minimum VM that can be used to generate larger outputs from the data. More complex VMs would be permitted provided a reference implementation can be run on the minimal VM. The code for the reference implementation would, of course, be included in the one megabyte.

A program using the data could request

* raw bytes

* output from the VM given code and parameters (with perhaps a cycle limit to stop endless loop hangs)

* output from the VM given code and parameters specified at an index in the data.

* output from the VM given code and parameters specified at an index in the data plus user parameters.

The system would be stateless, the VM would only exist for the duration of the data generation call. The same parameters would always produce the same output.

I have doubts as to its fundamental usefulness, but it would certainly make a fun collaborative project at the very least.


Not exactly your idea, I think, but perhaps similar:

- A patent from IBM for some previous dynamic dictionary [1]

- Brotli: 120kb predefined dictionary [2]

[1] http://www.google.com.br/patents/US8610606

[2] https://en.m.wikipedia.org/wiki/Brotli


Doesn't RAR have something kind of like a VM? https://github.com/taviso/rarvmtools


You exactly describe Brotli. It has a 100KB pre shared table trained on a large corpus of web content.


The ASCII standard was from the 1960s, not 1970s. LBJ in 1968 mandated that all federal computers purchased by the government must support ASCII.

This historical correction does not alter the rest of your comment.


> Would the gains of having enormous shared dictionaries cancel out the advantage of that approach, only in a much more "wasteful" way?

It does. The extreme of it is storing all information inside pi, as is done here: https://news.ycombinator.com/item?id=6698852. In the end the dictionary is so big that the indexes themselves start to be too big.

If you really had infinite space and immediate access to all digits, there actually could be some way to make it work, though.


> The thing I've wondered about is whether there could be any advantage to creating new generations of shared dictionaries that are domain-specific, and much larger.

The HPACK header compression used by HTTP2 has a static dictionary.

> The thing I don't understand well enough is whether modern compression already incorporates concepts that are similar enough such that a shared-word dictionary would be useless.

I believe this is generally the case.


I think this is a good idea and could reduce file sizes for compressing text. Moreover, people's active vocabulary is quite small, so even more might be possible. Just as an aside, a cliche was originally a single metal slug used to cast a repeatedly used phrase. So phrase-wise compression of English text might be useful.

However, I don't know the practical improvement obtained over the currently existing methods. This is a good project to investigate, I think. Even if text compression might not be important in the larger picture of data transmission on the web, this sounds interesting on its own merit.


It's a good system. Though the full DEFLATE spec has quirks, like storing uncompressed blocks prefixed by their length and then the one's complement of their length.

The next step in improving a simple compression algorithm is to replace Huffman encoding with arithmetic/range encoding. Huffman encoding proceeds one bit at a time, treating everything as having a probability that's a power of 1/2. Arithmetic/range encoding uses more accurate probabilities and then packs them together, letting you save a fraction of a bit with every symbol. As an analogy, consider how to encode 3 decimal digits. You could spend 4 bits on each digit, and it would take 12 bits total. Or you could combine them into a single 10 bit number.
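To spell out the arithmetic of that analogy: each decimal digit carries log2(10) ≈ 3.32 bits of information, so three of them need about 9.97 bits, which a single 10-bit number nearly achieves, while three separate 4-bit fields waste about 2 bits. A throwaway illustration:

    /* Three decimal digits: 4 bits each = 12 bits, versus one 10-bit number.
       The same effect is what arithmetic/range coding buys you for symbols
       whose probabilities aren't powers of 1/2. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int d0 = 7, d1 = 2, d2 = 9;
        unsigned packed = d0 * 100 + d1 * 10 + d2;   /* 0..999 fits in 10 bits */
        printf("separate: 12 bits, packed: %u (10 bits)\n", packed);
        printf("ideal: 3 * log2(10) = %.2f bits\n", 3 * log2(10.0));
        return 0;
    }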


I really enjoyed this article. It's just enough detail to leave you thinking, "Hey, I should write my own DEFLATE library"


Would be fun to do that in Haskell

Edit: https://github.com/GaloisInc/pure-zlib?files=1. Beautiful they've done just the decompression! Challenge awaits


>>"Hey, I should write my own DEFLATE library"

Did that in Java (2011) and the implementation was faster than the native one (partly due to the lack of proper customization of zlib from Java code)


The article mentions LZW. In terms of algorithmic elegance, LZW should be head and shoulders above 'deflate'. LZW got patented, which (aside from GIFs) screwed up its popularity. 'Deflate' allows custom dictionaries, which adds some extra complexity.

I'd strongly disagree with "In many people's eyes compression was still synonymous with run-length encoding", as ARJ was extremely popular, to the point it was used to cure full-stealth viruses (V512, for example). It's hard to provide references now, but I'm willing to search some old BBSes.

Other than that - fairly nice/nostalgic article.


arj ... daaamn. That brings back some sweet memories, to the soundtrack of a modem handshake whistle. Good thing we still have Norton Commander in the form of Far Manager to show some of the past glory to youngsters :)


I very recently had to look at an LZW decoder, and it is indeed more elegant than Deflate as described here. However, LZW doesn't have a Huffman post-encoding step, so it does not compress as well as Deflate.


Asymptotically it doesn't matter - that's the genius of all LZ schemes (LZ77 =~ LZSS, LZ78 =~ LZW). All the other details just take care of the early, pre-asymptotic stage.


No, it still does - LZW assumes flat probabilities for each entry in its dictionary. After a while, both the encoder and decoder will know more accurate probabilities, but LZW has no way to take advantage of that.


But asymptotically it doesn't matter because the LZ78/LZW code words represent longer and longer sequences, with the more probable sequences getting shorter codes. The additional letter (per each code word, in the case of LZ78) doesn't matter asymptotically. Practically, of course, it does.


How do the more probable sequences get shorter codes?


That's basically the objective of every compression algorithm. Huffman coding does it directly by construction; Lempel-Ziv does it indirectly - in LZ78/LZW, you get essentially and asymptotically (though randomly, per individual sequence) a code length of about -log_2 P(sequence), because the sequences added to the dictionary are added in order of probability.

In LZ77/LZSS it is basically the same, except the dictionary is not constructed explicitly but rather implicitly from the history. In both cases, you need to refer to Lempel and Ziv's proof that the codeword length indeed converges to -log_2 P(sequence), which is optimal by Shannon's source coding theorem.

I highly recommend the Cover & Thomas book on information theory if you find that interesting and are not afraid of math. And Shannon's original 1948 paper ("A Mathematical Theory of Communication") is surprisingly and exceptionally readable - though it doesn't cover LZ for obvious reasons.


Okay, so it's because the most probable sequences are more likely to be encountered first, which assumes the data is more or less uniform. (I.e., it doesn't work as well when the data starts with a large blob of random data, and the more regular data comes afterwards. But I guess that's true for most general compression algorithms. At least LZW has a reset code.)

I'll see if I can find the C&T book. Shannon's paper was part of the course material at uni, IIRC; at least I think I read it long ago.


It's not just a matter of being encountered first, it's a matter of being encountered more often ("encountered first" is a special case of "more often"). But you really need to delve into the proof to figure out why that's enough for optimality. At the very least, _I personally_ don't know of a simple intuitive way to explain it.

It is important to remember what each compression algorithm is optimal for.

LZ77 and LZ78 are optimal for observable Markov sources -- that is, sources in which, given the x most recent consecutive outputs (where x is the Markov order), the distribution of the next symbol is fixed. While this is rarely ever the case, it is reasonable to assume that e.g. English text is a 10th-order Markov model (with respect to characters) or a 3rd-order one (with respect to words).

The source you described is NOT Markov overall, although it might be Markov in the tail (past the random blob). Asymptotically, of course, it is Markov :)


Hmm... I thought that in LZW, once a sequence has been assigned a code, it can never be assigned a shorter code; codes are assigned in order, and they only get longer over time (until a reset is triggered). So the frequency of a particular sequence cannot influence the length of the code assigned to it. I guess I'll read up on it a bit more until I comment further. I do see that it could end up to be optimal in the asymptotic case.


You are, of course, correct. However, in the same way that you cannot reassign the codes for the letters A-Z and yet you can still compress by assigning a longer code to a sequence (the code is longer but is still shorter than the individual codes needed to represent the sequence), even though you cannot reassign a code, you will get a new code for a longer sequence that includes the shorter sequence, represents more letters, and overall is more likely to occur than the shorter sequence followed by something else.

Assuming no reset, over time the shorter codes fall out of use because the longer ones eventually include everything they represent, extended by all possibilities (with the more probable extensions getting shorter codes by virtue of appearing more often / earlier).


> This paragraph has five-hundred and sixty-one lower-case letters in

It must have taken some trial-and-error to make this sentence correct, since the number itself adds some letters.


Interestingly, it can work in-place if you change it to "This paragraph has five-hundred and sixty-three lower-case letters in"


Pick a value that's close enough, and tweak everything else until it's true :) This post has one hundred and thirty-two chars in it.


This, ten


The most clever part of deflate is actually how it stores the Huffman trees: not the frequencies, but rather the length in bits of each symbol's code. Combining the length of each symbol with its lexicographical ordering is enough to uniquely specify the actual tree.

Ok, some research showed me that the idea can probably be traced back to Schwartz & Kallick 1964 (can't find the text), popularized by TAOCP vol. 1...


The technique is known as "Canonical Huffman" and is quite elegant indeed. It is best understood by considering the codes as binary numbers; then the tree structure naturally appears. There's a very short algorithm to generate the appropriate encoding and decoding tables in the JPEG standard, and it involves not much more than shifts and adds.
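For concreteness, here's a small sketch of that reconstruction, following the code-assignment procedure spelled out in RFC 1951 section 3.2.2: count how many codes there are of each bit length, compute the smallest code for each length, then hand out codes in symbol order.

    /* Sketch: assign canonical Huffman codes from code lengths alone,
       as described in RFC 1951, section 3.2.2. */
    #include <stdio.h>

    #define MAX_BITS 15

    /* lengths[i] is the code length in bits of symbol i (0 = unused). */
    void canonical_codes(const int *lengths, int nsyms, unsigned *codes) {
        int bl_count[MAX_BITS + 1] = {0};
        unsigned next_code[MAX_BITS + 1] = {0};

        for (int i = 0; i < nsyms; i++)
            bl_count[lengths[i]]++;
        bl_count[0] = 0;

        /* smallest code for each bit length */
        unsigned code = 0;
        for (int bits = 1; bits <= MAX_BITS; bits++) {
            code = (code + bl_count[bits - 1]) << 1;
            next_code[bits] = code;
        }

        /* hand out codes in symbol (lexicographic) order */
        for (int i = 0; i < nsyms; i++)
            if (lengths[i] != 0)
                codes[i] = next_code[lengths[i]]++;
    }

    int main(void) {
        /* the ABCDEFGH example from RFC 1951: lengths 3,3,3,3,3,2,4,4
           should yield F=00, A=010 ... H=1111 */
        int lengths[8] = {3, 3, 3, 3, 3, 2, 4, 4};
        unsigned codes[8];
        canonical_codes(lengths, 8, codes);
        for (int i = 0; i < 8; i++)
            printf("%c: length %d, code %u\n", 'A' + i, lengths[i], codes[i]);
        return 0;
    }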


I wouldn't say that entropy coding is "separate" from dictionary / window compression. DEFLATE surely isn't the first algorithm to combine LZ77 and some form of entropy coding, and surely won't be the last.

I highly recommend that everybody go out and build an INFLATE implementation; it's a lot of fun.

Mine can be found at https://github.com/magcius/libt2/blob/master/t2_inflate.h


I did this in JavaScript years ago (I had done Snappy first, but wanted smaller data). It's fun and not too difficult.

The deflate side seems a little more complex. I played with it while on vacation about a month ago, but stalled after doing the LZ77 and Huffman table generation, because I really just wanted to learn how the LZ77 bits worked, and I don't actually have a project that needs it.


Anyone know the history behind setjmp/longjmp as 'poor man's exceptions' error handling? Since C works perfectly fine without exceptions, why add such a hack? I've seen this in libpng and I think libjpeg has it too (one of many reasons I'm avoiding those libs).


If you have ever done any non-trivial coding in C, you will have come across the need (or the advantages) of using goto for local error handling within a function.

You can write the error handler code for a given function (which usually just resets some resources and returns an appropriate error value) in a single place, usually at the end, but jump to it from within any loop or structure within the function, as soon as you detect that the computation cannot proceed.

setjmp / longjmp is the natural extension of this mechanism to an entire library, as in a set of tightly coupled functions that are hopefully hidden behind a public API. I think it's a good design choice. If done properly it makes the code more readable, more easily maintainable, and more optimized for the most common execution path. I wouldn't avoid using those libraries just for that design choice.
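A toy illustration of the pattern, with invented names (decoder, decode_file, fail) rather than any real library's API: the public entry point does setjmp once, and any helper deep in the call tree bails out with longjmp instead of threading error returns through every level.

    /* Sketch: setjmp/longjmp as library-wide error handling, in the spirit
       of libpng/libjpeg. All names here are made up for illustration. */
    #include <setjmp.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        jmp_buf on_error;   /* where to jump back to on any failure */
        FILE   *input;
        void   *scratch;
    } decoder;

    /* deep inside the library: bail out without returning through every caller */
    static void fail(decoder *d, const char *msg) {
        fprintf(stderr, "decode error: %s\n", msg);
        longjmp(d->on_error, 1);
    }

    static void read_header(decoder *d) {
        if (fgetc(d->input) == EOF) fail(d, "truncated header");
        /* ... parse the rest ... */
    }

    static void read_body(decoder *d) {
        d->scratch = malloc(1 << 16);
        if (!d->scratch) fail(d, "out of memory");
        /* ... decode, calling fail() on any malformed input ... */
    }

    /* the public API boundary: one setjmp, one cleanup path */
    int decode_file(const char *path) {
        decoder *d = calloc(1, sizeof *d);
        if (!d) return -1;
        d->input = fopen(path, "rb");
        if (!d->input) { free(d); return -1; }

        if (setjmp(d->on_error)) {      /* we land here whenever fail() is called */
            free(d->scratch);
            fclose(d->input);
            free(d);
            return -1;
        }

        read_header(d);
        read_body(d);

        free(d->scratch);
        fclose(d->input);
        free(d);
        return 0;
    }

    int main(int argc, char **argv) {
        return argc > 1 ? decode_file(argv[1]) : 1;
    }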


C lacks "zero-cost" exceptions, where the code should be just as fast as it would be if no exceptions were involved as long as one isn't thrown. Although they are much more expensive when thrown than checking a return value and using indirection for any logical return values, they are a bit cheaper when you don't expect the exception to happen very often, or don't care about performance if it does (e.g. a corrupt png or jpeg file).

My dream is something like the UX of Swift's exceptions (basically normal C error handling, except it does all the tedious pointer work for you) with the ability to tell the compiler to implement the exceptions as either return values or stack unwinds for a particular code path with the flick of a switch.


My big hope in this regard is Rust, which has a very good track record of implementing good, usable, zero-cost abstractions.

If they manage to implement zero-cost futures [1][2], maybe they will also implement some kind of zero-cost error handling abstraction (either via exceptions, or via some other useful abstraction for that purpose).

[1] https://aturon.github.io/blog/2016/08/11/futures/

[2] https://news.ycombinator.com/item?id=12268988


Rust's abstractions are only zero cost if you're looking in the wrong place for your costs. I would be very surprised if the Result / try! stuff in particular is faster than a good exception approach, for example, particularly on the happy path. It's amortised throughout though, while different exception-throwing mechanisms have dramatically different performance, so people's intuitions aren't reliable.

I like Rust a lot. Zero cost is not the reason, though; my reasons are no GC and not being as bad as C or C++.


Don't forget exceptions aren't free themselves; you need landing pads, etc., none of which are required otherwise.


Transporting and checking for errors throughout the call graph definitely isn't free. Simple exception unwinding isn't very expensive either - store handler info on the stack and use stack frames and it's just a linked list search. Keep popping until you're done. Freeing locals in a way compatible with the language semantics along the way is the main tricky bit. But you have that cost with error results too, it's not extra.

The real cost is that exceptions force the compiler to be pessimistic and pedantically cope with exceptions everywhere. I really don't think there's much win at all, compared to that, in return values the compiler forces you to check. And if you skip the checking, that's a different kind of pain.

Checked exceptions are a different but related story. I think they're a fine idea in the small but appallingly bad in the large; when gluing together 30 different libraries with their own exception types, everything ends up throwing anything, particularly when some higher-level abstractions have been applied, like iterator streams or monads. Forced return code error checking like Rust's is isomorphic with checked exceptions. It is my biggest worry for the language.

That is, the problem isn't performance - the isomorphism shows that there is no necessary difference when rules are consistently applied - the problem is that Java was a good demonstration that checked exceptions were misguided.


Rust's try macro will automatically use any available conversions when moving exceptions up the chain, so if you want to wrap exception A in B this can be done without pain. Rust also relegates a lot of what would have been exceptions in Java to panics, so that things like logic errors or issues with arguments can be kept from complicating error handling.

These alleviate a lot of what made Java's checked exceptions so unhelpful, so I don't think you can really conflate the two.


C doesn't work fine without exceptions. You have to write an enormous amount of boilerplate code to propagate error-returns correctly, or else an enormous amount of boilerplate duplication of your error-handling code. Or you use macros to reduce your boilerplate but it's difficult to do that in a way that handles different return types correctly, and once you get to the point of using a different return convention you're basically implementing your own language.


You don't have to use setjmp/longjmp in libpng, it's just the default behavior. You can instead set an error callback to do whatever you want.


I see setjmp/longjmp used a lot in recursive descent parsers in C.

I think people got tired of using up their return values for error codes. And it is certainly more efficient to use this style than manually breaking out of each stack frame.

C is full of things like this which break abstractions -- it doesn't seem particularly odd to me. Though I do remember that it was one of the things at the end of C books that I didn't understand for a long time.

What's the downside?

It doesn't play well with C++ because destructors have to be run at the end of every scope, but that doesn't apply to C.

And I guess because it mutates globals, it is a bit confusing in a threaded context.


> And I guess because it mutates globals, it is a bit confusing in a threaded context.

What globals?


My understanding is it works a bit like errno. errno is a global that is set by the callee, and then the caller can check it whenever, not necessarily right after the call or even in the same function. But it's actually a thread-local, so it works fine in a threaded context; it's just a little odd to look at.


It's the only way to do a non-local return, otherwise you have to make sure you check every error return condition. That's part of why Rust forces the programmer to check all return values, I believe.


Deflate is so ubiquitous (being a web standard) that I'd like to see a hardware Deflate encoder/decoder.

It really needs to be in the core instruction set of x86 & ARM.



Using additional bits to encode lengths isn't really that different from having a marker indicating whether the payload is a match vector or not. In fact, it could be argued to be less efficient for certain types of data, depending on the Huffman distribution, since all "words" are larger than a byte. It also prevents you from encoding longer matches. I treat the decision as a compromise between the two extremes that "worked" "well enough" for the majority of types of data, for some definition of both quoted words. The entire idea works well enough, but I wouldn't go out of my way to declare it a marvel of engineering or anything.


"it could be argued to be less efficient for certain types of data" This is true of all lossless compression.


You're invoking a stronger "technically correct" interpretation of the statement I'm making. Let me qualify. In a blob with many short repeated sequences, the algorithm as presented is better. In a blob with many long repeated sequences, a marking bit is better. Honestly, neither example is that farfetched.


The single-bit method can actually be seen as an instance of the huffman-coded method, where the match lengths are all assigned equal entropy estimates and the probability of "match vector" versus "literal character" is hard-wired to 0.5.


This comment shows a baffling level of arrogance. And it can hardly be considered a happy accident; compression methods are tested against real corpora.


I never said happy accident. As with most such choices in algorithms (the reallocation size of a vector when it reaches capacity, the widths of slots in a timer wheel, the hierarchical sizes of a generational GC, etc.), the "correct" answer is determined experimentally. My point is that the choice of encoding small lengths is just a decision, like every other decision, and is no more or less elegant than encoding a payload-discriminating marker bit, and shouldn't be celebrated independently. Calling me "bafflingly arrogant", though, is a bit perplexing. Can we disagree without resorting to petty jabs at character?



