“Should you encrypt or compress first?” (appcanary.com)
463 points by phillmv on June 28, 2016 | 235 comments

There's no compress or encrypt _first_.

It's just compress or not, before encrypting. If security is important, the answer to that is no, unless you're an expert and familiar with CRIME and related attacks.

Compression after encryption is useless, as there should be NO recognizable patterns to exploit after the encryption.

> If security is important, the answer to that is no

It's a little more nuanced than that. Compression may cause information leaks or it can prevent them depending on the circumstances. If you're encrypting an audio stream, then compressing it first can cause leaks. If you're encrypting a document, then compressing it first may prevent leaks.

I think compression is a read herring here. Think about keyboard-interactive ssh authentication. Whether or not the data is compressed, the timing of the packets depends on the timing between when you pressed the keys on your keyboard. Thus information is leaked, and compression was not involved.

Therefore, I feel like compression complicates the discussion unnecessarily. Information that's not encrypted (the time a key was typed or the spacing between phonemes in a word) can be recovered by an attacker without breaking the encryption. Obvious when you say it like that ;)

Well, yeah, but the point is that compression can cause the leakage of information through timing that would not otherwise have leaked.

Yes, in other words compression can move information from the content of the data (which is encrypted) to the size of the data (which is not encrypted).

Which is one of the selling points of block ciphers. They may not conceal the magnitude of the payload, but they do obscure the exact byte count by rounding up. But that probably doesn't stop CRIME from being workable.

Hmm, I don't think it is really a selling point. A block cipher also implies needing a technique like CBC to prevent watermarking, and using CBC actually opens you up to padding-oracle attacks.

An interesting paper on using padding to obscure length: http://cihangir.forgottenlance.com/papers/lengthhiding-corre...

Oh, I'm a fan of prepadding, and there are other surveillance problems that TLS doesn't solve that are similar to the voice analysis attack. Web pages on a site have distinct patterns of load operations, and I'm certain that if we haven't already established how to profile a visitor by the pattern of packets sent and received, it's only a matter of time until we do.

Well, my first comment would be: if your encryption algorithm produces output that can be compressed, your encryption algorithm is broken.

As for compression: my understanding is that the insecurity has nothing much to do with timing, and everything to do with giving an attacker some "known" plaintext they can use Bayesian (Turing?) analysis against to shorten the key space.

No, read the dang article. You leak information about the content of the plaintext because you're changing the length. The crib doesn't matter with secure constructions.

One encrypted 512-bit block is always more secure than twelve encrypted 512-bit blocks (an average compression ratio). A larger size is always a bad thing.

Problem with compress is the protocol headers give you a known part at the start of every message. Which you can then very quickly test keys against rainbow tables or other statistical means.

The best solution i know for this is to compress, then xor with a stream cipher then run into a decent block cipher.

The "compress after encrypt" question is a trick question. If your data doesn't get bigger after that, your encryption implementation is royally broken.

But you definitely don't want every "plaintext" message starting with "..gzip".

> read herring

Freudian slip? I like it anyway.

I swear HN's font size is now too small for my eyes by default. I can't make any post without introducing a major tyop these days :)

You can't compress after encryption.

Encrypted content should be indistinguishable from random data. So encrypt-then-compress shouldn't be able to yield any reasonable compression.

With truly random data, patterns will occasionally appear by chance.

There's a pretty easy proof of why compression of truly random data is impossible. Let C be a compression algorithm, and assume: (1) for all bit strings S, C(S) is not longer than S; (2) there exists at least one string S' such that C(S') is shorter than S'; (3) for any two distinct bit strings S1 and S2, C(S1) != C(S2).

Let S'' be the shortest such string: the shortest string that C shortens. Let the length of S'' be M, and the length of C(S'') be N < M. Note N >= 1, and by the choice of S'', every string of length less than M - in particular, every string of length N - does not have its length changed by C.

There are 2^N strings of length N, and since the images of any pair of N-bit strings under C are distinct, every string of N bits is the image of some N-bit string. But now S'' has length M yet has an image of length N, which necessarily collides with the image of some string of length N: a contradiction.

Thus, for any compression algorithm which is injective (in effect, reversible) and successfully shortens some strings, the algorithm must lengthen some others.

What this short proof shows is that even though random data may have patterns, no compression algorithm can successfully leverage this for every random string.
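The counting step at the heart of that proof is easy to check directly: there are 2^n strings of length n, but only 2^n - 1 strings strictly shorter (including the empty string), so an injective compressor can never shrink them all. A quick sketch:

```python
# A lossless compressor must be injective, so it cannot map every n-bit
# string to a strictly shorter bit string: there is always exactly one
# too few shorter strings to serve as distinct outputs.
for n in range(1, 25):
    n_bit_strings = 2 ** n
    strictly_shorter = sum(2 ** k for k in range(n))  # includes the empty string
    assert strictly_shorter == n_bit_strings - 1
```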

Encrypted data is "indistinguishable from random data" but it is not actually random data. No patterns will appear.

If patterns appear in random data, and no patterns appear in encrypted data, then it's trivially distinguishable from random data: look for patterns.

To be clear, I'm not claiming that encrypted data is compressible. I'm just pointing out that any detectable "patterns" that appear in random data but not encrypted data are inconsistent with the claim that random and encrypted data are indistinguishable. Also if you find such a "pattern" that's probably a good starting point for breaking whatever cipher you're using.

I'm sorry, that article is far too long to tell me what you are trying to say. All I can gather is, I was wrong

tl;dr Random data may appear to have patterns, but those patterns do not repeat, so they cannot be compressed.

Ah, but they can repeat! There is a finite chance that any sequence of any length will occur in a random sequence.

So while a compression alg applied to a very large amount of random data is unlikely to reduce the size, it is quite possible to achieve some compression on smaller blocks.

Any given _example_ of a random sequence may have less than maximum entropy, and so may be compressible, but a truly random sequence _in general_ will not be compressible. So you can make a compression algorithm that will compress random data in some cases but it will never be able to compress any arbitrary random data. Does this help?

You should take a look at the almost 15-year-old (and unsolved) "Random Compression Challenge"


That is still your problem and you need to take responsibility for it. It is wiser to learn and know what you're talking about than to add ambient, useless, incorrect noise.

Saying something true but burying it is not wise either.

Let's say you are using a one-time pad, and the encrypted text is a string of a million A characters. While unlikely to the point that it is barely worth mentioning, it is possible. And could trivially be compressed.

I think your main point is correct, just presented too strongly. In many cases compression after encryption will yield poor results. But I wouldn't go so far as to say that it can't be done.

This seems overly pedantic. In my mind "incompressible" does not mean "there are absolutely zero situations in which the output could be shorter than the input." There will always be (compression scheme, input) combinations that produce smaller outputs. The qualifying criteria is "does expected input data typically compress?"

For a reasonable definition of incompressible, properly encrypted data is it.

My intuition says that if your ciphertext after using a one-time pad is trivially compressible, it means there's a strong correlation between your plaintext and your key. It's an interesting thought.

Consider using basic modular addition of letters, with A=1, B=2 ... Z = 26. If your plaintext is HELLO, and you get AAAAA as ciphertext, then your key has to be SVOOL. If you look, that's just an inversion of each letter. H is the 8th letter of the alphabet, whereas S is the 8th from the end, etc. So, your one-time pad was incidentally a simple translation of your message.
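A minimal sketch of that toy scheme (letters A=1 ... Z=26, addition mod 26; the function name is made up for illustration):

```python
def otp_add(plain: str, key: str) -> str:
    """Toy one-time pad: add letter values (A=1 ... Z=26) mod 26."""
    out = []
    for p, k in zip(plain, key):
        total = (ord(p) - ord('A') + 1) + (ord(k) - ord('A') + 1)
        out.append(chr((total - 1) % 26 + ord('A')))
    return "".join(out)

# HELLO under the "inverted alphabet" pad SVOOL collapses to all A's,
# exactly the degenerate ciphertext described above.
assert otp_add("HELLO", "SVOOL") == "AAAAA"
```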

I'm not sure if there's any interesting information being leaked this way. Perhaps by seeing that the encrypted message can be heavily compressed, or knowing that it has been heavily compressed, you can assume that the encryption key has some very improbable patterns, and use that to bound guesses about the encryption key.

Not an expert. Not even an intermediate.

Could you elaborate on what kind of scenario would work on an audio stream but not on a document? In order to use CRIME, the attacker would need to repeatedly inject content before the compression stage in order to test whether it matches some part of the source. Uncompressed audio is kind of heavy and isn't normally streamed as such.

A raw audio stream is a constant bit rate stream. If you compress it, it becomes (in general) a variable bit rate stream. The bit rate contains information about the raw audio which survives encryption.

If you compress a document, that simply obscures the size of the original document.

> If you compress a document, that simply obscures the size of the original document.

That is not generally true - if you compress AAA and ABC, then the compressed versions also leak information about the content, exactly like in the case of an audio stream, and they reveal more information than the lengths of the uncompressed documents, which don't differ.
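The point is easy to demonstrate with any off-the-shelf compressor (a sketch using Python's zlib): two equal-length documents produce very different compressed lengths, and length is exactly what encryption does not hide.

```python
import zlib

# Two documents of identical length: one repetitive, one not.
doc_aaa = b"A" * 300
doc_abc = (bytes(range(256)) * 2)[:300]
assert len(doc_aaa) == len(doc_abc)

# After compression their lengths differ dramatically -- and the length
# of the result survives encryption.
assert len(zlib.compress(doc_aaa)) < len(zlib.compress(doc_abc))
```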

Yes, you can concoct theoretical cases where compressing documents leaks information. But there's a huge difference: compressing a document only leaks a single data point whereas compressing a stream produces a continual information leak. Furthermore, there are common use cases (with audio being the poster child) where compression leaks in streams can be used in practical attacks. I'm not aware of even a single case of a compression leak being used in a real attack on an encrypted document.

Imagine a voice call, using a variable-bitrate codec, broken into small packets (not necessarily over the internet), of interest to a passive eavesdropper who cannot decrypt the content, but can observe metadata: the timing of packets and their sizes.

Feed that information into a prebuilt phoneme-in-context model trained on that codec and language, and said eavesdropper could probably reconstruct a pretty good estimated transcript - different phonemes compress differently.

There is no reason this would not work in bulk: I am under the impression GCHQ have done a fair amount of research in this area, for example.

So, it's not just compression per se, but compression plus real-time plus some knowledge about the underlying information.

These sorts of attacks don't matter against a compressed and then encrypted file of indeterminate data, correct? Similarly, an amalgamation of many types of data, such as a disk or archive?

If you know absolutely nothing about the uncompressed original, then it seems indeed pretty hard to infer anything from the compressed version. But as soon as you know something, for example the rough size of the uncompressed original, you can for example start to infer that the content is either more like AAA or more like ABC because repetitions will compress better. And in reality it is really unlikely that an attacker knows nothing at all about the content. You know the type of data, you know what the header structure of the file format looks like. Or you could try to learn something from the statistics of the observed messages. Maybe the data contains a time stamp that correlates with the time you observe the message. The possibilities are endless.

Sure, but isn't there more implied by what you just said? It's not just that you know the size, but if you know the header format, then you also know the type (which you might). But I'm not sure how much more or less information compression provides for files of non-trivial size. Where individual packets are encrypted, you have quite a bit of information. Where you have one file of non-trivial size, there is much less information exposed.

For example, you know its type, and it's 150KB. Even if you know it's a text document, or a WAV file, or bitmap, I think you get far less information from the compressed version than from the non-compressed version. The non-compressed version of the text document may give you approximate word count, and the WAV file will give you some probable lengths of the recording at common sampling rates, and the bitmap will give you some probable dimensions. Compressed versions hide this information within the natural variance of the compression.

The problem seems less to do with compression, and more to do with splitting larger data (or streams of data) into smaller chunks to encrypt, which in itself loses some of the benefits of hiding information in the variance of the data presented. Compression seems to exacerbate this situation by amplifying variances in very small data sets in a way that yields additional information, but it seems to me that's just a natural progression of encrypting very small amounts of data being not nearly as effective as large amounts of data.

At least, that's what I can intuit about the situation. It may be partly (or totally) wrong given some more advanced security theory (I am not a security professional).

Thank you, that was the tldr I was looking for in this thread.

Thanks, it's good to have specifics when thinking about security.

I would guess that phones and VoIP are a perfect target for this, as those systems are often designed to filter out noise. I guess it would also be almost impossible to use this to transcribe an encrypted movie or music, as the unpredictable sounds would break the prediction model.

Could this threat be neutralized by salting before compressing?

No. Salting is to stop pre-image attacks on hashes. It doesn't affect compression.

And bottles are used for holding liquids rather than smashing heads... until you find yourself in a bar fight.

I might be wrong, but I think the parent had something like this in mind:

Assuming you are encrypting with AES-CFB, each plaintext block P[i] produces a ciphertext block C[i] (with key K and initialization vector IV) according to the rule:

    C[0] = AES_encrypt(K, IV) ^ P[0], 
    C[i] = AES_encrypt(K, C[i-1]) ^ P[i].
Given that the IV is assumed to be known by the attacker, if the plaintext is a vanilla-compressed stream, it is more likely than not that most of the bits in P[0] are going to be known as well, which might allow some sort of pruned brute-force attack on key K, given a small set of (C[0] ^ P'[0]), for all P'[0] that satisfy the known bits in P[0].

This particular implementation would then benefit from adding a pseudorandom P[0] block (a "salt", if I understood correctly) that the receiver is to discard on arrival. I don't know enough cryptanalysis to tell whether the above scenario is valid, but it sounds like a legitimate question at least.
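For what it's worth, that recurrence is easy to play with. Here's a sketch of the CFB construction above with truncated SHA-256 standing in for AES_encrypt (an illustrative stand-in only, not real crypto):

```python
import hashlib

BLOCK = 16

def fake_block_encrypt(key: bytes, block: bytes) -> bytes:
    # Stand-in for AES_encrypt(K, block); any pseudorandom function
    # suffices to illustrate the mode's structure.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cfb_encrypt(key, iv, plain_blocks):
    out, prev = [], iv
    for p in plain_blocks:               # C[i] = E(K, C[i-1]) ^ P[i]
        c = xor(fake_block_encrypt(key, prev), p)
        out.append(c)
        prev = c
    return out

def cfb_decrypt(key, iv, cipher_blocks):
    out, prev = [], iv
    for c in cipher_blocks:              # P[i] = E(K, C[i-1]) ^ C[i]
        out.append(xor(fake_block_encrypt(key, prev), c))
        prev = c
    return out

key, iv = b"k" * BLOCK, b"i" * BLOCK
blocks = [b"known gzip hdr!!", b"rest of stream.."]
assert cfb_decrypt(key, iv, cfb_encrypt(key, iv, blocks)) == blocks
```

If P[0] is largely known (a compression header), an attacker can XOR their guess against C[0] and test candidate keys against the single value E(K, IV), which is the pruning the comment describes.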

The original article links to a paper describing an attack against an audio stream by exploiting knowledge of how the stream is encoded before encryption, then reconstructing it from the length of the encrypted stream:


> If you're encrypting a document, then compressing it first may prevent leaks.

I think this is only true if you use a bad encryption algorithm where same plaintext blocks have the same ciphertext.

Oh, I'm not talking about leaking plaintext, I'm talking about leaking the length of the document.

It's more about whether the attacker has control of any portion of the compressed contents. If they have control over the stream, then you don't want compression; if they don't, you do.

No, control is not necessary for compression to leak information. It's enough for the attacker simply to know something about the original stream content (e.g. that it is audio) and how it compresses.

Interesting. KeePass[0] (password store) compresses with GZip by default before encryption.

[0] http://keepass.info/help/v2/dbsettings.html#compression

Are you talking about compressing plain text larger than a block, then encrypting per each block of the compressed output?

Yes, of course. What would be the alternative?

Thank you. Your comment wasn't clear for several reasons: stream ciphers are regularly used in TLS, and compression may be done per block, not just against a file larger than a single block.


Stream ciphers.

Why would you use a stream cipher to encrypt a document?

In GCM, the 'C' is for counter, turning AES into a stream cipher; this part is usually known as 'AES-CTR'. The 'G' is for 'Galois field multiplication', which allows for a parallelizable way of computing a MAC. AES-GCM packages AES-CTR (a stream cipher made from a block cipher) and GMAC (the Galois MAC) together into one primitive. This type of scheme, which combines confidentiality, integrity, and authenticity, is called 'authenticated encryption with associated data' (AEAD) [1]. A stream cipher is the easiest way to ensure the cipherstream safely expands to cover all of the plaintext.

Other famous AEAD schemes are:

- CCM (Counter with CBC-MAC), packages AES-CTR and CBC-MAC together in an authenticate-then-encrypt regime

- ChaCha20-Poly1305, which packages together the stream cipher ChaCha20 and the MAC Poly1305.

[1] https://en.wikipedia.org/wiki/Authenticated_encryption
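To illustrate how CTR turns a block primitive into a stream cipher: encrypt successive counter values to get a keystream, then XOR it with the plaintext. A sketch with SHA-256 standing in for the AES block encryption (an illustrative stand-in, not real AES-CTR):

```python
import hashlib

def keystream(key: bytes, nonce: bytes, nbytes: int) -> bytes:
    # Encrypt successive counter values to produce a pseudorandom keystream.
    out = bytearray()
    counter = 0
    while len(out) < nbytes:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:nbytes])

def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Encryption and decryption are the same operation: XOR with keystream.
    return bytes(a ^ b for a, b in zip(data, keystream(key, nonce, len(data))))

msg = b"any length plaintext, no padding required"
ct = ctr_xor(b"k" * 16, b"n" * 8, msg)
assert len(ct) == len(msg)                       # no block padding
assert ctr_xor(b"k" * 16, b"n" * 8, ct) == msg   # same op decrypts
```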

The AES-GCM mode of operation turns the AES block cipher into a stream cipher. Doing this makes it possible to add authentication on top relatively cheaply. ChaCha20-Poly1305 has a similar construction. These are the two most secure and efficient ciphers available in TLS, so they get a lot of use. If you are sending documents over HTTPS, you definitely want them encrypted with one of these two.

Fair enough, I concede the point.

Why wouldn't you?


Not that this really matters. The advantages/disadvantages of compression are the same in both cases.

The rule is: never compress something secret together with something potentially attacker-influenced.

If the attacker can influence the traffic, they can potentially gather information about the secret by examining the effect of differing traffic patterns on the size of the encrypted result.
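A toy version of that attack needs nothing but a compressor and the output length. (SECRET, the cookie format, and zlib standing in for protocol-level compression are all illustrative assumptions.)

```python
import zlib

SECRET = b"sessionid=7f3a9c1e"  # hypothetical cookie the attacker wants

def observed_length(attacker_data: bytes) -> int:
    # The eavesdropper can't read the encrypted bytes, but sees their
    # length -- which depends on how well the attacker-influenced data
    # and the secret compress together.
    return len(zlib.compress(attacker_data + b"\n" + SECRET))

# A guess that matches the secret compresses better (deflate can
# back-reference the repeated text), so a hit is observably no longer
# than a miss -- enough signal, repeated, to recover the secret.
hit = observed_length(b"sessionid=7f3a9c1e")
miss = observed_length(b"sessionid=xxxxxxxx")
assert hit <= miss
```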

No, the rule is: never compress something secret together with something potentially attacker-influenced, at least if the length of the compressed data leaks.

Fixed. That was a typo/thinko.

That's an excellent rule, but that doesn't cover the other example used about voice communications.

This is exactly what the article says.

Yeah, but the article uses a lot more words to say it...

...Eventually. It takes quite a while to get there, and still treats the question as legitimate, when it's loaded and really makes no sense.

The article seems to suggest that encrypt-then-compress would be the right solution in some circumstances, but really that's just as effective as encryption only with no compression. I don't feel the article is totally clear about this.

The troublesome part is the headline.

By putting quotation marks around it I think we can eliminate the misleading bit.

Exactly. This is not a binary choice.

Does this really need to be said though?

I may be too close to the problem. I've had to explain this to project managers and customers of course, but this is hacker news.

It feels like a three page article on why you should put your socks on before your shoes and not after.

> but this is hacker news.

So? There are too many domains to be familiar with. You just happen to be familiar with this one. I bet there are plenty of things I'd consider obvious in many other domains where you'd be clueless.

Once upon a time I was at a talk given by a visiting professor (associated, I think, with the Horus project[1]; I'm reasonably sure some of you would recognize the name if I could remember it) on designing network protocols using plug-n-play software components. The illustrations he showed used Lego bricks.

His big example was a Lego brick labelled "Compression" and another brick labelled "Encryption", and how you could arbitrarily compose them to achieve different protocols.

Meanwhile, I was sitting at the back of the room wearing my "I'm not a security guy, but this is garbage" look.

[1] http://www.cs.cornell.edu/info/projects/horus/main.html

Perhaps you're referring to Prof. Ken Birman, inventor of Paxos and slayer of CAP? I could barely contain myself at his recent MesosCon keynote[1,2].

[1] https://mesosconna2016.sched.org/event/6ljt/keynote-a-real-t... [2] http://go.linuxfoundation.org/mesoscon-north-america-2016-vi...

It's not just you. I've never used encryption or compression in any serious way, but the right answer seems obvious if you know the definitions of encryption and compression.

Really? I've been reading, and following encryption news for 10+ years, and it had never occurred to me that you should not compress prior to encrypting. The article was an eye opener for me - in hindsight, maybe obvious, but I bet 90% of your average technical audience wouldn't realize that you should not compress prior to encrypting.

mea culpa.

I am guilty of not waiting around for him to get to his point 3/4 of the way through the article. If you're trying to bring up subtle issues, don't bury the lede.

What he's really trying to say is that you shouldn't compress sensitive data at all. And now we're into the non obvious stuff that some of us are clearly talking past each other about.

Personally, I think he's painting with too broad a brush: some domains don't have this problem, and in some there are other factors at play, such as insufficient block size giving away too much information.

You could for instance probably figure out who is speaking just by the pattern of pauses, without even trying to decrypt what is said.

And then there's session cookies, which I despair of ever being secure. Because of the chosen plaintext of CRIME, even a large block size would only make the setup phase take a bit longer (finding a message that is one byte bigger than the block size). Encryption is insufficient to protect shared secrets.

It's considered obvious now, after the CRIME attack was published, but 15 years ago everyone would have said that you obviously should compress first! (I remember this being asked as a homework question when I was in college, with "compress first" being the intended answer.)

Indeed, I think the conventional wisdom was that compressing first would in fact improve security a little bit. This idea goes back all the way to Shannon in 1949, who noted that compressing before encrypting should improve information-theoretic security, because if the messages contain redundancy, then an adversary with infinite computing power can use that to decode the message. (Just try all possible keys, and see if it decrypts to an actual English sentence.) On the other hand, if you first compress using an ideal compressor, then every compressed plaintext will look the same (just random noise), so every possible encryption key will produce some plausible plaintext.

[1] http://netlab.cs.ucla.edu/wiki/files/shannon1949.pdf ; see sections 16 - 19.
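A toy illustration of Shannon's point: with a redundant plaintext, brute force over keys can single out the right one just by checking which decryption "looks like English". A sketch with a one-byte-XOR "cipher" (an ideal compressor would strip exactly the redundancy this filter exploits, leaving every key plausible):

```python
plaintext = b"the quick brown fox jumps over the lazy dog"
key = 0x5A
ciphertext = bytes(b ^ key for b in plaintext)

def plausible(data: bytes) -> bool:
    # Redundancy check: "real English" here means lowercase letters and spaces.
    return all(c == 0x20 or 0x61 <= c <= 0x7A for c in data)

# Try all 256 keys; only the decryption that matches the plaintext's
# redundancy survives the filter.
survivors = [k for k in range(256) if plausible(bytes(b ^ k for b in ciphertext))]
assert survivors == [key]
```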

Even a few years ago, the "obvious answer" in the infosec space was that you should compress first.

I think most of my upvotes are from people who only read half the article waiting for him to say something interesting and giving up. My grade schooler tells better stories.

As someone else said, the title is terrible. What we are left with in some scenarios is choosing one or the other, not one before the other.

You were a newbie too, once. Some people read Hacker News precisely to learn things like this.

Cut the arrogance. It's not obvious to all of us, we're not all cryptonerds.

Arrogance is probably better than name calling.

"Maybe I'm too close to the problem."

From my very not-a-crypto-person perspective, I had thought compression was fine as long as you don't pad it to the expected block length before encrypting?

something something padding oracle

The paper linked in the article: "Phonotactic Reconstruction of Encrypted VoIP Conversations"[1] specifically works in a situation without padding because the compression scheme affects the length of packets in a semantic way.

[1] http://www.cs.unc.edu/~fabian/papers/foniks-oak11.pdf

>Compression after encryption is useless, as there should be NO recognizable patterns to exploit after the encryption.

Not to nitpick, but this is incorrect. An encrypted file can very well have something like 100 X's in a row, which the compression system could turn from XXXXXXXXXXXXXXXX.... into (100 X's go here) - lousy example, I know, but it gets the point across.

It's also easy enough to test: just encrypt a file, then compress it. Sometimes it will be smaller. (Your reply indicated that in 100% of attempts compression is impossible.)

No. If the encrypted output is randomly distributed, as it should be, then the expected number of bits cannot be reduced through compression. If it successfully compresses some files, it will make some files larger. For random data, the change will be negligible with overwhelming probability, plus there will be overhead.


That's a text file, which doesn't use the full range of a byte (0-255), so each character takes less than a byte.

I'm happy to reproduce this in other formats. The contents are still random, which is the crux of the issue here.

The contents are not purely random. For example, I can predict with 100% certainty that any given byte in the file is not the value zero (ASCII NUL).

Try compressing the output of /dev/urandom on your nearest convenient UNIX-like system. If you figure out a way to reliably and significantly compress that, please report back.

My point was never that it was possible 100% of the time. Simply that it was possible some of the time.

On average, you will expand files, not compress them.

Also, the odds of finding even a single encrypted file of that length which can be compressed are lower than the odds of winning the lottery three times in a row. Trust me, you didn't just happen to stumble across a working example. You screwed up the test.

There are no real patterns in encrypted data. Only tiny exponentially rare streaks. This means that compressing is still useless, as originally claimed. You're arguing against something that wasn't said.

Using a random text generator http://pasted.co/encrypted.txt ( 10,002 bytes )

Then compressed it with winrar http://pasted.co/compressed.rar ( 7,743 bytes )

Roughly 20% compression isn't meaningless - so I fail to see why you are giving a flat 'no' when it's obvious what you are saying is untrue.

You aren't performing your test properly. Encryption doesn't output random text, it outputs bytes of data (0-255). It should be obvious that compressing text is possible.

    $ dd if=/dev/urandom of=test bs=1k count=10
    $ cat test | gzip -9 > test.gz
    $ cat test | bzip2 -9 > test.bz2
    $ ls -al test*
    -rw-r--r-- 1 r1ch r1ch 10240 Jun 28 16:30 test
    -rw-r--r-- 1 r1ch r1ch 10739 Jun 28 16:31 test.bz2
    -rw-r--r-- 1 r1ch r1ch 10263 Jun 28 16:31 test.gz

Random data is not compressible.

    $ dd if=/dev/zero of=test bs=1k count=10
    $ openssl enc -in test -aes-256-ctr -out test.encrypted
    $ cat test.encrypted | gzip -9 > test.encrypted.gz
    $ cat test.encrypted | bzip2 -9 > test.encrypted.bz2
    $ ls -al test*
    -rw-r--r-- 1 r1ch r1ch 10240 Jun 28 16:32 test
    -rw-r--r-- 1 r1ch r1ch 10256 Jun 28 16:33 test.encrypted
    -rw-r--r-- 1 r1ch r1ch 10737 Jun 28 16:34 test.encrypted.bz2
    -rw-r--r-- 1 r1ch r1ch 10279 Jun 28 16:33 test.encrypted.gz

Encrypted zeroes are not compressible.

That's a base64 encoding of random data, which by definition has only 6 bits of entropy per byte. You should be able to compress it 25% but you only managed 22.6%

Your random text isn't uniformly random - any byte that isn't a letter or number never shows up.

Here it is encrypted with AES256 - Pass is wasteoftime

http://pasted.co/wasteoftime.txt 4,732 bytes

http://pasted.co/wasteoftime.rar 3,753 bytes

Also, if there are issues with using the 'wrong' encryption, I feel that's kind of a straw-man argument. Please feel free to upload a file over 5 MB which can't be reduced in size through any of the various compression tools.

Also, keep in mind that I never said it would be a huge benefit at all- I only said that SOME compression was possible SOME of the time.

That's not encrypted with AES256, that's encrypted with AES256 and then encoded with base64.

Do you understand the difference between base64 and binary data? Base64 only uses six bits per byte. It's designed to allow data to transit cleanly through places which only allow text, such as JSON. Binary data is eight bits per byte. Binary data is what encryption and compression algorithms output. Naturally, if you take data which only uses six bits per byte, and run it through a compression algorithm which is able to use eight bits per byte, you can achieve good compression. But this is illusory, because you expanded the original by 33% when you applied base64 in the first place! All your attempts at compression can be beaten by simply decoding the base64 into the original binary data.

It doesn't matter what magnitude of compression you specified. Random data cannot be compressed at all on average. This is a simple mathematical certainty with a straightforward proof. Encrypted data looks and acts like random data. If you find a way to reliably compress the output of a modern, secure encryption algorithm (not the base64 encoding, but the original binary) then you'll have found a massive security hole in it which will make you famous.

And as a note: it's fine to use base64 if you want. But then you also need to output the compressed file as base64. And you'll see that 3753 octets equate to 5004 base64 characters, and the file is actually significantly larger than it was before compression.
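The arithmetic is easy to check (os.urandom stands in for real ciphertext bytes):

```python
import base64
import os
import zlib

raw = os.urandom(3000)            # stand-in for ciphertext bytes
b64 = base64.b64encode(raw)

# base64 turns every 3 bytes into 4 characters: a fixed 33% expansion.
assert len(b64) == 4000

# Decoding recovers the original exactly; any compressor applied to the
# base64 text can at best claw back part of that self-inflicted overhead.
assert base64.b64decode(b64) == raw
assert len(zlib.compress(b64)) > len(raw)
```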

> Also, keep in mind that I never said it would be a huge benefit at all- I only said that SOME compression was possible SOME of the time.

This is a common misconception. The pigeonhole principle here tells us that any scheme that ever achieves some compression also achieves some expansion. The only reason compression algorithms work at all is because they are more likely to compress than they are to expand, because we know something about the probability distribution of the plaintexts.

The pigeonhole argument treats compression as a black box with input and output, and makes no assumption about how the compression works.

Your idea - to only compress some inputs and not others - doesn't change the fact that your algorithm can be treated as a black box with inputs and outputs. So the pigeonhole principle still applies. Many people have made this argument before, so we are very familiar with it, and with the reason why it is wrong.

FURTHERMORE, compression, by its very nature, works very poorly on data which is apparently uniformly distributed, as encrypted data is.

This is why you are wrong.

This difference is because your random text values are not random binary values.

you are trying to bring empirical evidence to an information theoretic fight :(

No. If your encryption is not busted, then the entropy under /any/ model tends toward perfect. If a model is able to predict with /any/ success, then your crypto is probably broken.

So yes, it is random, and 100 X's could appear in a row. But if your model can effectively compress that, then the model has to be wasting space for all the other sequences.

Let's just imagine that you are doing something like basic run-length coding (gif! Or jif if you're wrong :) )

So 100 X's turns into (100, X), and you've compressed that part of the stream. Unfortunately everything else is likely (1, a), etc., so every other symbol/byte increases in size. If we look at the stream probabilities we get something like P(nX) = P(X)^n -- very rough, I haven't been at uni for a long time :). You can do a bunch of reasonably simple maths to verify it, but you eventually end up with an average symbol size of sum{x in X} -log(P(x))/|X|, where X is your set of possible symbols (n, x : X).

For a given stream that is not encrypted you may save some space representing the input bytes, but each symbol (n,X) has at /least/ 1 extra bit of information. Overall you can save space.

But let's look at the encrypted data. Each input symbol is independent, so the probability of any sequence in the input is equal. You will typically have a run length of 1 (P=0.5), 2 (P=0.25), ...

So 50% of your output symbols will have at least one extra bit of information /added/. This means your output by necessity is bigger than the input.

I used RLE as an example here, but it's true for all compression schemes. This is a necessary property for compression to work: for any data that compresses by n% under a given model there must be some input that increases in size by n% (or some such, again, long time since uni). I believe Wikipedia has an article on the pigeonhole principle that probably explains this better than I can.
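
A toy run-length coder (purely illustrative, not any real format) makes the trade-off above concrete: a long run collapses to two bytes, but the run-of-1 case that dominates random data roughly doubles in size:

```python
import itertools
import os

def rle_encode(data: bytes) -> bytes:
    # Every run becomes a (count, byte) pair costing two bytes;
    # runs longer than 255 are split into multiple pairs.
    out = bytearray()
    for byte, group in itertools.groupby(data):
        n = len(list(group))
        while n > 0:
            out += bytes([min(n, 255), byte])
            n -= 255
    return bytes(out)

print(len(rle_encode(b"X" * 100)))          # 2: the run compresses
print(len(rle_encode(os.urandom(10_000))))  # ~20,000: random data doubles
```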

> A encrypted file can very well have something like 100 X's in a row

Any decent encryption algorithm has output indistinguishable from random noise; the odds of 100 X's in a row in random data are 1 in 2^800; there are only 10^80 particles in the universe (~2^266), which is 534 base-2 orders of magnitude smaller than 2^800. Yes, every particle in the universe could be a universe of fundamental particles and you'd still never see 100 X's in a row in a decently-encrypted message.
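
The arithmetic there checks out:

```python
import math

particles = math.log2(10 ** 80)  # particle count of the universe, in base 2
print(round(particles))          # 266
print(round(800 - particles))    # 534 binary orders of magnitude short of 2^800
```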

Feel free to upload a 5MB file under your narrow restrictions and I will see if I can reduce it with one of the 20 different compression systems available to me.

Here you go: https://mikeash.com/tmp/randomfile.bin

If you manage to get that under 5MB by applying any tool which allows recovering the original data, I will be most interested to know how you did it.

Success: compress.sh:

    #!/bin/bash
    echo "#!/bin/bash"
    echo curl $1

    ./compress.sh https://mikeash.com/tmp/randomfile.bin > randomfile.compressedzomg


/me hides

You're lucky I'm too lazy to edit a byte on the server and then ask you if it still matches.

It would. You would need to change the location of the file.

Fine, fine, you're^Wthey're lucky I'm too lazy to make it return new random data each time.

\o/ :D

The larger the random file, the _less_ likely you are to successfully compress it.

Everyone here is telling you that you are wrong, and have even clearly demonstrated this through both mathematical reasoning and empirical tests. Quit digging :-)

> Quit digging :-)

On that note, sometimes we all get too attached to our own hard-won mistakes. We all need to remember that sometimes we should set aside pride and just say, "Whoops, my reasoning was unclear or I was misinformed, thanks for correcting my misstatement or misunderstanding."

Ooh, the encryption comment I just made reminded me of an example of compression potentially working on encrypted content.

AES-ECB mode will actually compress fairly well for some inputs (the canonical example being a bitmap image). The reason the compression can work? Because the crypto is broken. In the image case you can actually just visually (no maths or anything) see a large amount of the details of the source image.

If there is structure in an encrypted file then the encryption is broken. It is mathematically indistinguishable from true random data. No compression algorithm can compress it.

An additional point you need to consider: you can select different models and compression programs (and some programs do do this), but you have to record which one you use in your output. This again increases the size of the output - even a no-op compression algorithm (say cat or cp :D ) results in output that is bigger (you need an initial byte to say you haven't tried to compress).
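
A sketch of that bookkeeping cost: a hypothetical "maybe compress" wrapper must spend one marker byte recording its choice, so every input it leaves alone still grows by a byte:

```python
import os
import zlib

def maybe_compress(data: bytes) -> bytes:
    # One marker byte records whether compression was applied.
    c = zlib.compress(data)
    if len(c) < len(data):
        return b"\x01" + c
    return b"\x00" + data  # stored untouched, yet one byte bigger

print(len(maybe_compress(b"A" * 1024)))       # tiny: compressible input wins
print(len(maybe_compress(os.urandom(1024))))  # 1025: incompressible input loses
```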

I think the better approach to what you're claiming is for you to provide an example that is compressed. Just provide the input, the key, the encryption algorithm (AES-GCM for preference, but seriously AES-CBC would work too), and the compression algorithm.

I don't have a handy place to stash 5MB of data publicly, but you can run the following in SBCL (or the equivalent in whatever language you prefer):

    (in-package :cl-user)
    (use-package :ironclad)
    (let ((data (make-array (* 5 1024 1024)
                            :element-type '(unsigned-byte 8)
                            :initial-element (char-code #\X))))
      (encrypt-in-place
       (make-cipher :aes
                    :key (sb-ext:string-to-octets "YELLOW SUBMARINE")
                    :mode :ctr
                    :initialization-vector (make-array 16 :element-type '(unsigned-byte 8)))
       data)
      (with-open-file (tmp "/tmp/data" :direction :output
                           :element-type '(unsigned-byte 8))
        (write-sequence data tmp)))
That's a file of 5,242,880 'X' characters, encrypted with the key 'YELLOW SUBMARINE' and an all-zero initialization vector. You will not be able to compress it.

> It wont be a huge difference in file size, but it will be smaller.

It may be smaller; not all outputs will be smaller. This is true of all lossless compression algorithms.


This is literally all I was saying and people went on a downvote frenzy. People are literally saying that an encrypted output can NEVER be compressed. I was saying that it can (although not all the time!) and the gains would be minimal. I've corrected the first post with the word 'may'.

You're answering a question that no one is interested in. The problem isn't "can any singular instance of an encrypted output be compressed". That's a trivial conclusion -- as it's theoretically completely random bits, of course some combinations of those bits are going to be compressible, maybe even highly compressible. It might be that the encrypted output is entirely a single repeated byte, in which case a 4-gigabyte message could be compressed to 4 bytes (being the integer representing how many times to repeat the byte).

The real question is, can you reliably compress encrypted messages in general? And, given that the encryption is not broken, the answer is no.

In practice, with any real, widely used compression algorithm, encrypted output will never be compressed. You can generate 100,000 random binary files and run them through all the major compression tools and not one of them will compress by even a single bit.

In theory, some random outputs will be compressible by common tools (for example, a file that's all zeroes). However, the probability of encountering a random file that compresses more than the overhead from any common compression tool is vanishingly small.
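
A scaled-down version of that experiment, assuming gzip as the representative tool:

```python
import gzip
import os

# Try 1,000 random 1 KiB "files" and count how many gzip actually shrinks.
shrunk = sum(
    1 for _ in range(1000)
    if len(gzip.compress(os.urandom(1024))) < 1024
)
print(shrunk)  # 0: gzip's ~18 bytes of framing swamp any lucky savings
```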

> People are literally saying that a encrypted output can NEVER be compressed.

No they're not. They're saying that as a class, you cannot compress the data. That is a statement about what happens in aggregate. And the aggregate result of trying to compress securely-encrypted data is that you don't save bytes. Even just having the option of compression does not save bytes, because you have to store the choice.

Monkeys hitting random keys on a typewriter can write novels (although not all the time!)

Any one given input can be losslessly compressed down to 1 bit with the exact right compression algorithm. I think everyone here understands that. The argument is just that that's not really useful in the real world.

But, (also by the same ideas), most encrypted texts will not be able to be compressed by more than a single bit.

A more interesting question is whether to compress or sign first.

There's an interesting article on that topic by Ted Unangst:

"preauthenticated decryption considered harmful"


EDIT: Although the article talks about encrypt+sign versus sign+encrypt, the same argument goes for compress+sign versus sign+compress. You shouldn't do anything with untrusted data before having checked the signature - neither uncompress nor decrypt nor anything else.

I think you mean whether to encrypt or sign first.

Good catch!

> Although the article talks about encrypt+sign versus sign+encrypt, the same argument goes for compress+sign versus sign+compress.

Why is the debate about "compress/encrypt then sign" vs "sign then compress/encrypt"?

Is there a non obvious problem with sign then compress/encrypt then sign again? (overcomplicated or unnecessary?)

It's pointless. If you sign the encrypted data, then once the signature is verified in the receiver, you know that the decrypted data is also good. Repeating the signature just wastes time and space.

After some thought: the only advantage of signing both before and after that I can think of is that without it, if the sender only signs the compressed/encrypted version, you are left with the (theoretical?) problem of not knowing whether the output of your implementation/version of the decrypt/decompress utility is identical to the sender's input to their implementation.

Of course, if the compression/encryption method has some way of checking the integrity of the output (that's of equal strength to the signature) then signing first would be completely redundant.

EDIT: so in many scenarios, signing first and last would have no advantage. For example if you get to decide what implementation will be used by the sender and the recipient (most package managers?)

I can definitely see a use for checking the integrity of the decrypted/decompressed data. Of course, implementation bugs or hardware defects could wreck your data after that stage, but increasing coverage can still help.

Verifying data integrity isn't exactly the same as authenticating a message, so I think you'd probably want to use a simpler scheme for that. For example, a basic CRC would suffice for most use cases there.
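
For instance, with Python's zlib CRC-32 as the integrity check (an illustration, not a prescription):

```python
import zlib

payload = b"the fully decrypted, decompressed message"
checksum = zlib.crc32(payload)  # sender computes this over the plaintext

# Receiver, after MAC verification, decryption and decompression:
assert zlib.crc32(payload) == checksum  # catches corruption, not tampering
```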

Compress, encrypt and then MAC. If MAC fails, don't decrypt. If decryption fails, don't (and can't) decompress. If decompression fails, throw away data, because compression-to-encryption time corruption occurred.
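
A sketch of that pipeline in Python. The keystream here is a toy SHA-256 counter construction standing in for a real cipher (in practice use AES-CTR or, better, an AEAD); the function names are made up, and the point is only the ordering: check the MAC before decrypting, decrypt before decompressing.

```python
import hashlib
import hmac
import os
import zlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy stand-in for a real stream cipher: SHA-256 in counter mode.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def seal(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    compressed = zlib.compress(plaintext)                         # 1. compress
    nonce = os.urandom(16)
    ks = keystream(enc_key, nonce, len(compressed))
    ct = bytes(a ^ b for a, b in zip(compressed, ks))             # 2. encrypt
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()  # 3. MAC
    return nonce + ct + tag

def unseal(enc_key: bytes, mac_key: bytes, blob: bytes) -> bytes:
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    # If the MAC fails, don't decrypt.
    expected = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("MAC failure: refusing to decrypt")
    ks = keystream(enc_key, nonce, len(ct))
    compressed = bytes(a ^ b for a, b in zip(ct, ks))
    return zlib.decompress(compressed)  # decompress last
```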

Except what about the case where someone can spoof that an encrypted message came from them? In that case, you want the signature somewhere inaccessible, so that they can't selectively strip it off.

I don't understand what you mean. Your signature scheme is either secure or it isn't. If an attacker can spoof or strip the signature and have you accept it, then it isn't secure and it doesn't matter where you put it. If they can't, then you can have it be in the outermost layer.

Well the scenario I'm envisioning is that sign-then-encrypt guarantees that no signatures were stripped off in transit, that's all. It prevents people from saying that an encrypted message (which they were unable to read) came from them, even though they didn't encrypt it.

I think it's probably most important in trust on first use scenarios, where an MITM in position during trust on first use can strip off a plaintext signature and re-sign the encrypted message with their own key - if there's a signature inside the ciphertext that matches the signature outside, you can detect something like that.

Not sure that it really comes up much, though.

Encryption doesn't guarantee that the message can't be tampered with, though. In practice I believe it would be quite hard to tamper with the inner signature if you didn't have the encryption key, but relying on encryption to save you from tampering is using the wrong tool for the job. That's the whole reason you have signatures in the first place, otherwise you could just encrypt the plaintext and send it and have the fact that you knew the right key provide authentication.

Note that in the context of encryption, signatures for authentication don't mean public/private signature schemes, but just a plain MAC using a shared secret. Authentication in the sense of "Do I know I'm talking to who I think I'm talking to?" is handled at a separate level as part of the initial key exchange. For an authenticated encryption scheme, you'd exchange both the encryption key and the MAC key as part of that exchange. There's no sensible scenario where an attacker would know one of those keys and not the other, because they're generated and stored together. In fact, as far as anyone knows there's nothing wrong with using the same key for both, it's just that nobody is completely sure that's safe. Since it's easy to generate and use two keys instead, that's recommended.

There are DoS and Trojan attacks against decompression libraries. You definitely want to verify a signature before extracting the archive, but even a 2-pass verification process can leave you with a payload that expands to a terabyte. If your hash algorithm is running at 50MB/s, that can take a while.

All that really matters is that the outermost layer is signed, and that the signature is required to be properly verified by the recipient before any other processing is done.

Regardless of whether you're encrypting or compressing or signing twice or what order those are in.

From what I've read it is "encrypt then sign, unless signatures are optional and could be stripped from the message in which case things are complicated"

Where everyone seems to be getting confused is handling a live flow versus handling a finalized flow (a file).

* Always pad to combat plain-text attacks, padding in theory shouldn't compress well so there's no point making the compression less effective by processing it.

* Always compress a 'file' first to reduce entropy.

* Always pad-up a live stream, maybe this data is useful in some other way, but you want interactive messages to be of similar size.

* At some place in the above also include a recipient identifier; this should be counted as part of the overhead not part of the padding.

* The signature should be on everything above here (recipients, pad, compressed message, extra pad).

* It might be useful to include the recipients in the un-encrypted portion of the message, but there are also contexts where someone might choose otherwise; an interactive flow would assume both parties knew a key to communicate with each other on and is one such case.

* The pad, message, extra-pad, and signature /must/ be encrypted. The recipients /may/ be encrypted.

I did have to look up the sign / encrypt first question as I didn't have reason to think about it before. In general I've looked to experts in this field for existing solutions, such as OpenPGP (GnuPG being the main implementation). Getting this stuff right is DIFFICULT.

This is why military voice encryption sends at a constant bitrate even when you're not talking. For serious security applications where fixed links are used, data is transmitted at a constant rate 24/7, even if the link is mostly idle.

Wow, what a trainwreck. So many comments in here talking about whether it would be possible to compress data which looks like uniformly random data for all the tests you would throw at it. Spoiler alert: you can't compress encrypted data. This isn't a question of whether we know it's possible; rather, it's a fact that we know it's impossible.

In fact, if you successfully compress data after encryption, then the only logical conclusion is that you've found a flaw in the encryption algorithm.

Also interesting is which compression algorithm you're using. HPACK Header compression in HTTP 2.0 is an attempt to mitigate this problem:


The paper cited in this article (Phonotactic Reconstruction of Encrypted VoIP Conversations) really deserves to be highlighted, so I submitted it separately:



I don't understand... Why couldn't you do CRIME with no compression as well? Assuming you can control (parts of) the plaintext, surely plaintext+encrypt gives you more information than plaintext+compress+encrypt?

CRIME relies on compression --- the "CR" stands for "Compression Ratio".

The idea is that the DEFLATE compression algorithm, used in the TLS compression mode that CRIME attacks, will build an index of repeated strings and compress by replacing them with keys into that index.

Here's a beautiful demonstration of another similar compression algorithm: http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-equal...

So, if you control some subset of the plaintext you can make guesses as to what the secret you're trying to get at is, and if the size changes after compression you know that you got two hits to that bucket in the index, so your guess is right. You can use this technique to guess a string character by character -- reducing your search space to n*m instead of n^m for a string of length m with a character set of size n.
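
A minimal sketch of that length oracle using zlib (the secret value is hypothetical, and the constant encryption overhead is omitted since it cancels out):

```python
import zlib

SECRET = b"sessionid=a3f8c1d9e2b7"  # hypothetical cookie in the stream

def observed_len(attacker_data: bytes) -> int:
    # The attacker can't read the ciphertext, but with a stream cipher
    # its length equals the compressed length.
    return len(zlib.compress(SECRET + attacker_data))

right = observed_len(b"sessionid=a3f8c1d9e2b7")  # full repeat: one back-reference
wrong = observed_len(b"sessionid=q9w4e6r8t0y2")  # mismatched bytes cost literals
print(right, wrong)  # the shorter observed length betrays the correct guess
```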

Excellent explanation, thanks!

If you have a blob of encrypted data, that should normally tell you nothing about the plaintext, except its approximate length (assuming it wasn't padded). Normally, the length of the plaintext should tell you very little about its content; it doesn't do you much good to know "the message is about 4096 bytes". However, if you compress first, then you know the approximate length of the compressed data, which means you know something about the content of the plaintext.

As one of several possible attacks: imagine the attacker could supply data that will get fed back to them in the encrypted blob, along with your own data, and all compressed together. The blob will compress better if your data matches their data. So, repeatedly feed in your data and get back blobs, watching the total size of the compressed-then-encrypted blob, and you can predict enough about their data to guess it.

If you control parts of the plaintext, a well-designed cipher doesn't tell you anything about the rest of the plaintext.

But if you control parts of the plaintext, a compression algorithm will absolutely tell you information about the rest of the plaintext. That's its job—identifying similarities between different parts of the plaintext.

And it will leak that information in the form of the size of its output, and the standard definition of "well-designed cipher" doesn't do anything to mask sizes; the size of the encrypted data is the same as the size of the data passed to the cipher. So now both the data and its size are sensitive, and nobody's encrypting the size.

(You could sort of work around this by padding, but then you've basically removed the point of compression.)

Actually, in the context of security, compressing the data you're about to encrypt still matters. The problem exposed by the CRIME exploit (and anything similar) is that the size of the payload can also be an indication of the data within it. To combat this, systems which are exchanging data interactively (in a stream) should further pad messages /up to/ a target size (which might be random per message).

As I mentioned elsewhere there's no reason that data can't also be potentially useful (though it does become difficult to ensure that this isn't also a source of attack).

That reduces, but doesn't eliminate, the amount of information you're leaking. If you pad the data to a multiple of some fixed block size, you'll still learn something across the boundary between two block sizes. An attacker can do a CRIME-style attack that includes a wrong password guess plus some of their own padding, and vary the amount of padding until it is just big enough to take n + 1 blocks instead of n. Then they can vary the password until it goes back to taking n blocks.

Randomizing the padding limit also reduces the leaked data, but doesn't eliminate it: your random numbers come from some distribution, so the attacker just has to repeat each inbound message many times, and do some stats. If a correct password guess gives them, say, 1000 bytes with standard deviation 500, and an incorrect one gives them 1001 with standard deviation 500, they simply need to issue a ton of requests.

(If you're padding the data to a fixed absolute size, period, and you know no message is smaller then that, then sure, please pad the message, but there's also not a whole lot of point in compressing it at all. Leave it uncompressed, and pad that.)

> That reduces, but doesn't eliminate, the amount of information you're leaking.

All cryptographic systems save the one time pad do leak some information, in the sense that they are not information theoretically secure. For instance we know that the output of standard block or stream ciphers (with fixed plaintext) is distributed over an exponentially small fraction of the possible output space.

So really the question here is whether one can pad to the point where it is computationally infeasible to launch this kind of attack, and whether this padding amount is so large as to defeat the compression entirely.

For example, distinguishing two normal distributions with variance 1 and means ~2^(-k) away from each other can require ~2^(2k) trials for a constant probability of hypothesis test success (say 1/3).

Has there been any work analyzing this rigorously, taking the length leakage as side information available to the attacker?

You still want to compress in the first place as a means of removing entropy and some known plaintext if possible. Most systems that give you the option of compressing or not do so because you may have already used some other method of compressing the data (e.g. it's a video file or already a compressed transport archive).

The addition of padding, even worthless (hopefully at least pseudo-random) garbage, up to some reasonable minimum message size, is a form of making analysis of message /size/ useless.

I've heard this advice a few times, but I've never seen a rigorous analysis of what "removing entropy" is supposed to mean. Any half-reasonable encryption system will deal just fine with low-entropy inputs, and produce equally random output. (ECB mode is not a half-reasonable encryption system, but even, say, 1DES-CTR satisfies this requirement.)

Compressing inputs that are completely not controlled by an attacker is fine. For instance, gzipping static files on your CDN is totally fine.

Padding to a minimum message size does not make message size useless. It only makes message sizes below the threshold useless. If an attacker can control part of the input (which is the threat model for things like CRIME, where an attacker-controlled URL and a non-attacker-controlled cookie header are part of the same HTTP request), they can just provide their own padding to get the input past the minimum size. Setting a fixed and unchangeable size for messages works (... provided there's nothing secret in the number of messages!).

Nope, secure ciphers are resistant against attacker-controlled plaintext attacks.

The reason that plaintext -> compress -> encrypt has issues is because the length of the ciphertext gives away the plaintext if the attacker can control parts of the plaintext.

No, because the length of the plaintext+encrypt depends only on the length of the plaintext and not on the contents of the plaintext; whereas the length of the plaintext+compress+encrypt depends on the length of the compressed plaintext, and the length of the compressed plaintext depends on the contents of the plaintext.

So, different plaintext of the same lengths, which would've resulted in plaintext+encrypt of the same length, instead results in compressed plaintext of different lengths which results in plaintext+compress+encrypt of different lengths, leaking additional information!

Not at all - with compression thrown in, you get a measure of the entropy of the original plaintext from the length of the compressed message.

I picked up on the reference to Stockfighter, but does anyone know if the walking machine learning game mentioned at the end of the article exists? Sounds like a fun game.

Heh, I was referring to OpenAI Gym (https://gym.openai.com/), specifically https://gym.openai.com/envs#mujoco

Would adding some tiny random size help? Based on my poor understanding: if after compressing, but before encrypting, we add 0 to 16 random bytes, or random padding of up to 1% of the size, could that defeat quite a lot of attacks (like CRIME)?

That'll make an attack significantly more time-consuming, but won't prevent it. Instead of instant feedback on whether they guessed correctly, an attacker would instead need to send a bunch of requests to determine if the average request size has decreased.

What if the number of padding bytes is a function of the contents?

That would add noise to the information the attacker gets, but ultimately the "output" length has to be correlated with the input length, so you can't ever get the amount of information the attacker gets down to zero, which is what's needed to make a cipher secure.

'ultimately the "output" length has to be correlated with the input length'

But perhaps not relative to the size of the secret data, right?

(Note that I'm not saying, "Oh, obviously we should just do this to avoid that attack" - I'm hoping to learn by carefully understanding where it breaks down).

CRIME works because the compression/encryption algorithms aren't aware of which data is secret and which isn't. They treat the whole HTTP stream as a stream of text and operate on that. To them, text within the page is just the same as cookie data in the corresponding header field or (for example) the URI in the header.

If such a distinction were possible, the system could just not compress secret data. But if that were the case, nonsecret data wouldn't even need to be encrypted; it's nonsecret after all.

You can achieve this effect by having the compressed content have the same length regardless of attacker input. i.e. don't include any mutable information in your compression.

It's really going to depend on what kind of data you're dealing with.

For VoIP, if you're smart you can even fool the attacker into guessing a wrong answer of your choosing. See my comment upthread for a link to our NDSS 2009 paper on traffic morphing.

I'm not familiar enough with the CRIME attack to say how well such a defense might work. Maybe someone who knows more can chime in.

Despite the question being flawed, the correct answer is a series of questions: Who is the attacker? What are you guarding? What assumptions are there about the operating environment? What invariants (regulations, compliance, etc) exist?

There may be compensating controls that invalidate the perceived needs for encryption or compression, for example. i.e. don't design in the dark.

Of course, the interviewer may just want a canned scripted answer - but the interview is your chance to shine, showing how you can discuss all the angles.

That was a fun read. Do I detect a nod to tptacek's "If You’re Typing the Letters A-E-S Into Your Code You’re Doing It Wrong"?


If you're compressing audio, the simple solution is to compress using constant bitrate.

Unfortunately, one way to define variable bit rate is "it's compressed CBR".

Rather than the protocol level defining "insert comfort noise here", at the compression level you get the bitstream-level "I dunno what this stream is at a higher level, but replace the next 1000 bits with zeros".

That's if you do simple sampling. I dunno about weird higher-level vocoders. I think you could create a constant bit rate vocoder that really is constant. But it'll likely be incompressible if it's a good one, because a vocoder basically is a compressor that's specifically very smart about human speech input. If your vocoder output is compressible, it's not a good one.

I think if you replaced your compression step with running it through a constant-rate vocoder you'd get what you're looking for. Probably.

No, he's saying you compress CBR, then encrypt. Not compress CBR -> gzip -> encrypt or something silly like that.

CBR audio codec, then encrypt gives you a constant-bitrate stream indistinguishable from randomness. That's pretty much the gold standard.

(Of course, that still only encrypts content, not metadata. You can encrypt a phone call in such a way that a watcher gets mathematically zero information about what's being said, but the watcher still sees who is calling whom, when, and for how long. Hiding that is much harder.)

Would be great if Apple understood this and compressed IPA contents before encrypting.

Instead, when you submit something to the AppStore, you end up with a much bigger app than the one you uploaded.

To add insult to injury, if you ask Apple about this fuck up you get an esoteric support email about removing "contiguous zeros." As in, "make your app less compressible so it won't be obvious we're doing this wrong."

If your compression can compress your encrypted data, you should change your encryption mechanism to something that actually works...

What if you compress and then only send data at regular periods and regular packet sizes? That way no information can be gleaned. E.g. after compressing you pad the data if it is unusually short, or you include other compressed data too, or you only use constant bit-rate compression algorithm.

That quoted VoIP paper isn't actually as damaging as it sounds. IIRC that 0.6 rating was for less than half of the words, so if you're trying to listen to a conversation to get something meaningful, it's probably not going to happen.

That's a good point, and the .6 is an optimistic score (I think it was from running a subset of their model)

I got the sense that the authors felt that this proves an attack of this sort was possible and viable, but that their model wasn't quite there yet.

OTOH this should be enough to acknowledge that the voice encryption/compression scheme they are attacking is not secure.

Fair enough. The larger point, however, is that this stuff was supposed to be encrypted. The adversary shouldn't be able to learn even one bit about the plaintext.

After all, nobody would buy a new encryption module that advertised it could protect you 40% of the time.

Has there been any research into compression that's generally safe to use before encryption? E.g., matching only common substrings longer than the key length would (I think?) defeat CRIME at the cost of compression ratio.

I'm working on (finishing the paper for) an algorithm compatible with deflate/gzip that is safe to use before encryption (i.e. it is guaranteed not to leak random secret cookies). It's a bit more complex than your suggestion - matching common substrings longer than the key length would still be vulnerable, as a substring boundary may still fall inside the secret.

Cool! Please post a Show HN with a link to your ePrint when it's done.

Maybe we need encryption that also plays with the length of the message, or randomly pads our data before encryption? I am however no expert, so I have no clue how feasible, or full of holes, this method would be.

Yes, this is one option that we proposed back in 2009 after we first pointed out the VoIP traffic analysis attacks in 07 and 08.

If you want more information, see our paper on traffic morphing from NDSS 2009 [1]. It works pretty well for VoIP; not as great for trying to counter similar attacks on web browsing traffic.

[1] http://www.internetsociety.org/doc/traffic-morphing-efficien...

I am always thinking: if the compression scheme is known, you would need some good nonce to avoid known plaintext (for example, the compression format's header is always the same), and you are also exposed to CRIME, which recovers the dictionary of the compression.

I think it is best to use the built-in encryption scheme of the compression program, as those often take these issues into account (and the header is not leaked, since only the content is encrypted).

Can't you just add some random-length data at the end? You are defeating compression a little bit, but you are also making the length non-deterministic. I thought PGP did that.

So long as you do not compress the random data you add, it will work. Compressing it will not help the situation at all, since the compression ratio for the random data will be constant. The problem is that you then have to largely negate the effects of compression.

Consider compressing "silence" in a telephone call. You can't compress it well if you also need the non-silence elements to be indistinguishable from it. You must add enough random noise to cover up any compression differential, otherwise statistical artifacts still persist. The amount of random noise needed scales with the maximum compression ratio you can achieve.
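The random-padding idea upthread can be sketched in a few lines. This is purely illustrative, not any standard scheme: append a random amount of random bytes (before encrypting) so ciphertext length stops being a deterministic function of plaintext length. MAX_PAD and the framing are arbitrary choices.

```python
import os
import struct

MAX_PAD = 64  # illustrative upper bound on padding length

def pad_random(plaintext: bytes) -> bytes:
    pad_len = os.urandom(1)[0] % (MAX_PAD + 1)
    # Length-prefix the real data so the receiver can strip the padding
    return struct.pack(">I", len(plaintext)) + plaintext + os.urandom(pad_len)

def unpad(padded: bytes) -> bytes:
    (n,) = struct.unpack(">I", padded[:4])
    return padded[4:4 + n]

msg = b"attack at dawn"
assert unpad(pad_random(msg)) == msg  # roundtrips; length varies per call
```

Note this only blurs the length by up to MAX_PAD bytes; an attacker who can average over many queries (as in CRIME) can still recover the underlying length statistically.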

So what does this mean if I am using an encrypted SSL connection that is correctly configured?

Is this kind of problem not already dealt with for me by the secure transport layer? It would be a shame if the abstraction were leaky. My understanding of the contract is that whatever bits I supply will be securely transported within the limits of the configuration I have selected.

If I pick a bad configuration then yes shame on me, but a good configuration won't care if I compress right?

Logically speaking, an encrypted file should have a high-entropy set of bits within it. Compressing it would yield little return, but higher security, since the input file contained more "random" bits.

Compressing the source material first will yield smaller results but will be more predictable, as the file will always contain ZIP headers and other metadata that could make decryption of your file much easier.

But the metadata is a form of structured data and, as you state, is semi-known plaintext; i.e. the structure is known, and to a lesser extent so is the data (some fields have a known or limited range of possibilities).

Aside from the headers, the compressed data itself often has structure; take the Lempel-Ziv class of dictionary encoders, which rely on repeating data to compress. The very fact that the data repeats means you can guess, with much higher probability than normal, what certain bytes will _not_ be (because the longest words that match the dictionary are chosen for tokenisation, to maximise compression); i.e. a byte that _doesn't_ match a word suffix in the dictionary restarts the search for a new matching word/token pair.

Having said that, the plaintext itself is almost never random; the key question is whether the attacker has any crib that might give a good first guess and reduce the amount of work required.

So which is more guessable: the semi-structure, at the byte level, of the compressed data, or the possible semi-structure of the original plaintext? If you are dealing with a protocol that specifies compression (and in particular which compression method), you may have given away part of the game.

One way is to "bump up" the entropy and add "chaff" to the compressed stream before encryption; i.e. add some entropy, but less than the amount saved through compression. My gut feeling is that the efficacy of this would vary depending on the plaintext, compression method, and encryption method. You also run the risk of side-channel analysis via CPU, RAM, power usage, etc.

The input 'plaintext' could be binary data or other information that the attacker cannot predict.

What they can predict, if they find some code interfacing with this encrypted file, is the way it has been stored. It's not much of a long shot to say that if you can identify that it's just a plain zipped file, your job will be much easier when it comes to reverse engineering it.

That being said, it's still a huge pain in the ass to work with that stuff. I mean, the US government hires some of the world's best crypto people, and they still get sent for a run sometimes.

It's not just the headers. You could strip those off and count on the other side to put them back on.

It's that the act of decompressing arbitrary data can leak very important information to the attacker.

Most (good) encryption schemes have the property that knowing part of the message, or some other information like message length, will not help you decrypt the message.

> Compressing the source material will yield smaller results but will be more predictable

This depends on the susceptibility of the ciphers to known-plaintext attack; I'm not sure if today there are scenarios where using such ciphers makes sense.

If I compress each component (ie: attacker-influenced vs secret) separately, concatenate the results (with message lengths of course), then encrypt the whole message, is that secure?

It seems like it should be, but I'm not an encryption expert. The compression should be pretty good, though.
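The scheme proposed above can be sketched like this (the framing format is made up for illustration): compress the attacker-influenced part and the secret part independently, length-prefix both, then encrypt the whole frame. Because the two parts never share a compression dictionary, a guess matching the secret can't shrink the secret's contribution to the length.

```python
import struct
import zlib

def frame(attacker_part: bytes, secret_part: bytes) -> bytes:
    a = zlib.compress(attacker_part)
    s = zlib.compress(secret_part)
    # Length-prefix both pieces; the whole frame would then be encrypted
    return struct.pack(">II", len(a), len(s)) + a + s

secret = b"token=abcdef"
right = frame(b"token=abcdef", secret)  # attacker guess matches the secret
wrong = frame(b"token=zzzzzz", secret)  # attacker guess does not match
# The secret's compressed size is identical either way, so total length
# reveals nothing about whether the guess was correct
assert right[4:8] == wrong[4:8]
```

The residual leak is the compressed size of the secret itself, which is static, so a CRIME-style adaptive attack gets no signal from varying the attacker-controlled input.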

> The paper Phonotactic Reconstruction of Encrypted VoIP Conversations gives a technique for reconstructing speech from an encrypted VoIP call.

The technique for reconstructing speech clearly had its limitations.

This blog is an interesting way to advertise to their target market: us.

By eliciting damnfool responses from people that haven't read the article?

That part was… unexpected.

This question was in one of my quizzes. Expected answer: compress first.

So if the length of the resulting message is leaking information, salt it by adding some extra random bits to the end to increase the length by a random amount.

Which may be useful, unless you can use a padding oracle attack or timing attack, or you're using something stupid like ECB mode, or you aren't authenticating your ciphertext.

In general, it is safe to assume that whatever countermeasure you are thinking of has already been defeated by an attacker, unless you have researched it for a really long time and found no possible attack.

None of this seems to apply to documents you generate to supply to someone else you trust. Compress and encrypt seems perfectly fine.

If you're encrypting it, it is to hide information from some sort of attacker, not the trusted recipient of the document. If there is literally no possibility at all of someone else intercepting the document, then why are you bothering to encrypt in the first place?

Because the wire is unsafe.

In which case the considerations in the featured article apply entirely to any attacker with access to the wire/transit.

Your original comment makes no sense.

No. The considerations in the article apply to certain kinds of input under certain conditions. Those caveats mean it doesn't apply to self-generated text documents encrypted all at once, in general.

Given the real risk of CRIME attacks, are there "compression aware" encryption algorithms?

I don't think it is possible - it's hard to imagine an encryption algorithm that doesn't preserve content length.


A lot of comments here suggesting that encryption increases entropy. While true, it only adds the key's entropy to the plaintext's entropy. In most real-world cases, len(m) >> len(k), so this is usually an insignificant increase of entropy. Compression also adds a trivial amount of entropy (specifically, the information encoding the algorithm used to compress, even if that information is out of band).

I believe you are confusing entropy with Kolmogorov complexity.

I thought this was common sense. Compress then encrypt. Encryption leads to higher entropy, therefore less effective compression.

The article is about why that can be wrong.

The article agrees, as should anyone who understands encryption and compression: encrypted data can't be compressed, so encrypt-then-compress is pointless.

The article also covers why compress-then-encrypt is dangerous. But it's not a dichotomy. Those aren't your only two choices, you can also just encrypt and not compress.
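The "encrypted data can't be compressed" point is easy to check empirically. In this sketch, os.urandom stands in for ciphertext from any decent cipher (indistinguishable from random bytes): DEFLATE finds no redundancy in it and only adds framing overhead.

```python
import os
import zlib

plaintext = b"the quick brown fox jumps over the lazy dog " * 100
ciphertext_like = os.urandom(len(plaintext))  # proxy for real ciphertext

# Redundant plaintext shrinks; random-looking "ciphertext" does not
assert len(zlib.compress(plaintext)) < len(plaintext)
assert len(zlib.compress(ciphertext_like)) >= len(ciphertext_like)
```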

But in almost all cases it's right.

"Don't compress at all" is the better default answer.

Noting that "compression after encryption is stupid."

If security is the top concern, making all encrypted messages the same length would be ideal as far as I can tell. That way, all you are giving away is an upper bound on the message size. Padding with random noise to a uniform length (with the payload either compressed or not) and then encrypting should be the most secure option.
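A minimal sketch of that fixed-length idea, with an arbitrary bucket size (not any standard): pad every message up to the bucket before encrypting, so all ciphertexts come out the same length and only an upper bound on message size leaks.

```python
import os

BUCKET = 256  # arbitrary fixed size; real systems might use several buckets

def pad_to_bucket(plaintext: bytes, size: int = BUCKET) -> bytes:
    if len(plaintext) > size - 4:
        raise ValueError("message too large for bucket")
    header = len(plaintext).to_bytes(4, "big")
    return header + plaintext + os.urandom(size - 4 - len(plaintext))

def unpad_bucket(padded: bytes) -> bytes:
    n = int.from_bytes(padded[:4], "big")
    return padded[4:4 + n]

# Every padded message has identical length, regardless of content
assert len(pad_to_bucket(b"hi")) == len(pad_to_bucket(b"a longer message"))
assert unpad_bucket(pad_to_bucket(b"hi")) == b"hi"
```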

What's the point of compressing if you're just going to pad it anyway? If you're worried about security just encrypt and don't compress.

The point in that case is to combat known-plaintext (known value) attacks.

Precisely how the padding / extra padding is distributed within the data stream to be encrypted is also an issue. The goal is to make it very difficult to guess where data will be represented, even if you do happen to know the plaintext.

yeah, but what about compress, encrypt, compress, encrypt .. compress?


You've just created an amplified DoS vulnerability :)

The correct default answer is encrypt after compress.

"Don't compress at all" doesn't help you if you need to reduce bandwidth.

What good is a secure channel if no one uses it because of its high bandwidth requirements?

A lot, possibly a majority, of the major breaks in crypto systems (certainly the interesting ones) in the past decade have been because of compressing before encrypting. If someone wants to compress first, demand that they justify the reduced bandwidth usage.

The interview in the article is for a security position. Your answer is incorrect in that situation.

Did you read the article?

Read the article smarty pants ;-)

Compressing then encrypting gives you more effective compression and (somewhat) less effective encryption. Encrypting then compressing gives you more effective encryption and less effective (almost ineffective) compression. So, depends which one you want more. :)

Compression after encryption is likely to be worse than ineffective - it's almost certain to use more bandwidth than no compression at all.

Yeah, running an encrypted bitstream through a lossy encoder like CELP is going to sound pretty awful (i.e., the content will be completely unrecoverable.) The whole idea behind a speech codec is to simplify the frequency-domain properties of the data in one way or another.

People are missing the real takeaway from the article, which is that VBR speech compression has serious vulnerabilities that CBR codecs won't share. That part wasn't obvious.

Are you sure? Is that your final answer?
