
“Should you encrypt or compress first?” - phillmv
http://blog.appcanary.com/2016/encrypt-or-compress.html
======
nightcracker
There's no compress or encrypt _first_.

It's just compress or not, before encrypting. If security is important, the
answer to that is no, unless you're an expert and familiar with CRIME and
related attacks.

Compression after encryption is useless, as there should be NO recognizable
patterns to exploit after the encryption.

~~~
lisper
> If security is important, the answer to that is no

It's a little more nuanced than that. Compression may cause information leaks
or it can prevent them depending on the circumstances. If you're encrypting an
audio stream, then compressing it first can cause leaks. If you're encrypting
a document, then compressing it first may prevent leaks.

~~~
admax88q
You can't compress after encryption.

Encrypted content should be indistinguishable from random data, so encrypt-
then-compress shouldn't be able to yield any meaningful compression.
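
A quick way to convince yourself (a minimal sketch in Python, using
os.urandom as a stand-in for ciphertext, since good ciphertext should be
indistinguishable from it):

    import os
    import zlib

    # Stand-in for ciphertext: 1 MB of uniformly random bytes.
    ciphertext = os.urandom(1_000_000)

    compressed = zlib.compress(ciphertext, 9)

    # DEFLATE framing adds overhead, so the "compressed" output is
    # typically slightly *larger* than the input.
    print(len(ciphertext), len(compressed))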

~~~
choosername
With truly random data, patterns should randomly appear.

~~~
jahewson
Encrypted data is "indistinguishable from random data" but it is not actually
random data. No patterns will appear.

~~~
qu4z-2
If patterns appear in random data, and no patterns appear in encrypted data,
then it's trivially distinguishable from random data: look for patterns.

~~~
qu4z-2
To be clear, I'm not claiming that encrypted data is compressible. I'm just
pointing out that any detectable "patterns" that appear in random data but not
encrypted data are inconsistent with the claim that random and encrypted data
are indistinguishable. Also if you find such a "pattern" that's probably a
good starting point for breaking whatever cipher you're using.

------
vog
A more interesting question is whether to compress or _sign_ first.

There's an interesting article on that topic by Ted Unangst:

"preauthenticated decryption considered harmful"

[http://www.tedunangst.com/flak/post/preauthenticated-decryption-considered-harmful](http://www.tedunangst.com/flak/post/preauthenticated-decryption-considered-harmful)

EDIT: Although the article talks about encrypt+sign versus sign+encrypt, the
same argument goes for compress+sign versus sign+compress. You shouldn't do
anything with untrusted data before having checked the signature - neither
uncompress nor decrypt nor anything else.

~~~
sirk390
I think, you mean whether to encrypt or sign first.

~~~
vog
Good catch!

Although the article talks about encrypt+sign versus sign+encrypt, the same
argument goes for compress+sign versus sign+compress.

~~~
b101010
Why is the debate about "compress/encrypt then sign" vs "sign then
compress/encrypt"?

Is there a non obvious problem with sign then compress/encrypt then sign
again? (overcomplicated or unnecessary?)

~~~
mikeash
It's pointless. If you sign the encrypted data, then once the signature is
verified in the receiver, you know that the decrypted data is also good.
Repeating the signature just wastes time and space.
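
A minimal sketch of that ordering in Python (HMAC stands in for the
signature here, and the encrypt/decrypt step is assumed to exist
elsewhere):

    import hashlib
    import hmac

    TAG_LEN = 32  # SHA-256 output size

    def seal(ciphertext: bytes, mac_key: bytes) -> bytes:
        # Sign the *encrypted* data; one signature is enough.
        tag = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
        return ciphertext + tag

    def open_sealed(blob: bytes, mac_key: bytes) -> bytes:
        ciphertext, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
        expected = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
        # Verify before decrypting or decompressing anything.
        if not hmac.compare_digest(tag, expected):
            raise ValueError("bad signature")
        return ciphertext  # now safe to hand to decrypt()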

~~~
b101010
After some thought, the only advantage of signing both before and after that
I can think of is that, without it, you're left with the (theoretical?)
problem of not knowing whether the output of your implementation/version of
the decryption/decompression utility is identical to the sender's input to
their implementation, if the sender only signs the compressed/encrypted
version.

Of course, if the compression/encryption method has some way of checking the
integrity of the output (of equal strength to the signature), then signing
first would be completely redundant.

EDIT: So in many scenarios, signing first and last would have no advantage -
for example, if you get to decide which implementation will be used by the
sender and the recipient (most package managers?).

~~~
mikeash
I can definitely see a use for checking the integrity of the
decrypted/decompressed data. Of course, implementation bugs or hardware
defects could wreck your data _after_ that stage, but increasing coverage can
still help.

Verifying data integrity isn't exactly the same as authenticating a message,
so I think you'd probably want to use a simpler scheme for that. For example,
a basic CRC would suffice for most use cases there.
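
For example (a minimal sketch with zlib.crc32; note that a CRC catches
corruption, not tampering -- the outer signature handles that):

    import zlib

    def add_crc(payload: bytes) -> bytes:
        # Append a 4-byte CRC32 before compressing/encrypting.
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def check_crc(blob: bytes) -> bytes:
        payload, crc = blob[:-4], blob[-4:]
        if zlib.crc32(payload).to_bytes(4, "big") != crc:
            raise ValueError("payload corrupted in transit")
        return payload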

------
mjevans
Where everyone seems to be getting confused is handling a live flow versus
handling a finalized flow (a file).

* Always pad to combat plaintext attacks; padding in theory shouldn't compress well, so there's no point making the compression less effective by running the padding through the compressor.

* Always compress a 'file' first to remove redundancy.

* Always pad up a live stream; maybe this data is useful in some other way, but you want interactive messages to be of similar size.

* At some place in the above, also include a recipient identifier; this should be counted as part of the overhead, not part of the padding.

* The signature should be on everything above here (recipients, pad, compressed message, extra pad).

It might be useful to include the recipients in the unencrypted portion of
the message, but there are also contexts where someone might choose
otherwise; an interactive flow, where both parties already know a key to
communicate with each other on, is one such case.

* The pad, message, extra pad, and signature /must/ be encrypted. The recipients /may/ be encrypted.

I did have to look up the sign / encrypt first question as I didn't have
reason to think about it before. In general I've looked to experts in this
field for existing solutions, such as OpenPGP (GnuPG being the main
implementation). Getting this stuff right is DIFFICULT.
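
A rough sketch of that ordering for a 'file' (my own toy layout, not the
OpenPGP format; HMAC stands in for the signature, and the cryptography
package's Fernet stands in for the cipher):

    import hashlib
    import hmac
    import os
    import zlib

    from cryptography.fernet import Fernet  # pip install cryptography

    def seal_file(message: bytes, recipient_id: bytes,
                  mac_key: bytes, fernet_key: bytes) -> bytes:
        # 1. Compress the file first.
        compressed = zlib.compress(message)
        # 2. Pad up to a 256-byte boundary with random bytes, recording
        #    the pad length so the receiver can strip it.
        pad_len = -len(compressed) % 256
        padded = compressed + os.urandom(pad_len) + pad_len.to_bytes(2, "big")
        # 3. Prepend the recipient identifier (counted as overhead).
        body = recipient_id + padded
        # 4. Sign everything above.
        tag = hmac.new(mac_key, body, hashlib.sha256).digest()
        # 5. Encrypt the lot, signature included.
        return Fernet(fernet_key).encrypt(body + tag)

(fernet_key here must be a key produced by Fernet.generate_key().)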

------
Animats
This is why military voice encryption sends at a constant bitrate even when
you're not talking. For serious security applications where fixed links are
used, data is transmitted at a constant rate 24/7, even if the link is mostly
idle.

------
dietrichepp
Wow, what a trainwreck. So many comments in here talking about whether it
would be possible to compress data which looks, for all the tests you could
throw at it, like uniformly random data. Spoiler alert: you can't compress
encrypted data. This isn't an open question about whether it's possible;
it's a fact that we know it's impossible.

In fact, if you successfully compress data after encryption, then the only
logical conclusion is that you've found a flaw in the encryption algorithm.

------
kinofcain
Also interesting is _which_ compression algorithm you're using. HPACK Header
compression in HTTP 2.0 is an attempt to mitigate this problem:

[https://http2.github.io/http2-spec/compression.html#Security](https://http2.github.io/http2-spec/compression.html#Security)

------
js2
The paper cited in this article ( _Phonotactic Reconstruction of Encrypted
VoIP Conversations_ ) really deserves to be highlighted, so I submitted it
separately:

[https://news.ycombinator.com/item?id=11995298](https://news.ycombinator.com/item?id=11995298)

[http://www.cs.unc.edu/~fabian/papers/foniks-oak11.pdf](http://www.cs.unc.edu/~fabian/papers/foniks-oak11.pdf)

------
tomp
I don't understand... Why couldn't you do CRIME with no compression as well?
Assuming you can control (parts of) the plaintext, surely plaintext+encrypt
gives you more information than plaintext+compress+encrypt?

~~~
ontoillogical
CRIME relies on compression --- the "CR" stands for "Compression Ratio".

The idea is that DEFLATE, the compression algorithm used in the TLS
compression mode that CRIME attacks, builds an index of repeated strings and
compresses by emitting references into that index.

Here's a beautiful demonstration of another, similar compression algorithm:
[http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-equals-awesome/](http://jvns.ca/blog/2013/10/24/day-16-gzip-plus-poetry-equals-awesome/)

So, if you control some subset of the plaintext, you can make guesses at the
secret you're trying to get, and if the size changes after compression you
know your guess and the secret landed in the same index entry, so your guess
is right. You can use this technique to guess a string character by character
-- reducing your search space to n*m instead of n^m for a string of length m
over a character set of size n.
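
You can see the length side channel with plain zlib (a toy sketch; SECRET
stands in for a cookie the attacker is after, and the attacker only ever
observes lengths, since a stream cipher preserves length):

    import zlib

    SECRET = "session=s3cretvalue"

    def observed_length(attacker_text: str) -> int:
        body = attacker_text + "\n" + SECRET
        return len(zlib.compress(body.encode()))

    # A correct prefix guess gets folded into the same back-reference as
    # the secret and often compresses a byte or two shorter.
    print(observed_length("session=s"))  # usually shorter
    print(observed_length("session=x"))  # usually longer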

~~~
vog
Excellent explanation, thanks!

------
arknave
I picked up on the reference to Stockfighter, but does anyone know if the
walking machine learning game mentioned at the end of the article exists?
Sounds like a fun game.

~~~
ontoillogical
Heh, I was referring to OpenAI Gym
([https://gym.openai.com/](https://gym.openai.com/)), specifically
[https://gym.openai.com/envs#mujoco](https://gym.openai.com/envs#mujoco)

------
jakozaur
Would adding some tiny random-sized padding help? Based on my poor
understanding, if we added 0 to 16 random bytes (or 1% of the size) after
compressing but before encrypting, that could defeat quite a lot of attacks
(like CRIME).
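
Something like this, presumably (a sketch of the suggestion only -- see the
reply below for why it just slows the attack down):

    import secrets
    import zlib

    def compress_then_pad(plaintext: bytes) -> bytes:
        compressed = zlib.compress(plaintext)
        pad_len = secrets.randbelow(17)        # 0 to 16 random bytes
        pad = secrets.token_bytes(pad_len)
        # Length-prefix the pad so the receiver can strip it after
        # decrypting; the whole thing gets encrypted next.
        return bytes([pad_len]) + pad + compressed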

~~~
phlo
That'll make an attack significantly more time-consuming, but won't prevent
it. Instead of instant feedback on whether they guessed correctly, an
attacker would need to send a bunch of requests and determine whether the
average request size has decreased.

~~~
dllthomas
What if the number of padding bytes is a function of the contents?

~~~
lmm
That would add noise to the information the attacker gets, but ultimately the
"output" length has to be correlated with the input length, so you can't ever
get the information the attacker receives down to zero, which is what's
needed to make a cipher secure.

~~~
dllthomas
_' ultimately the "output" length has to be correlated with the input length'_

But perhaps not relative to the size of the secret data, right?

(Note that I'm not saying, "Oh, obviously we should just do this to avoid
that attack" - I'm hoping to learn by carefully understanding where it breaks
down.)

~~~
phlo
CRIME works because the compression/encryption algorithms aren't aware of
which data is secret and which isn't. They treat the whole HTTP stream as a
stream of text and operate on that. To them, text within the page is just the
same as cookie data in the corresponding header field or (for example) the URI
in the header.

If such a distinction were possible, the system could just not compress secret
data. But if that were the case, nonsecret data wouldn't even need to be
encrypted; it's nonsecret after all.

------
IncRnd
The question itself is flawed. The correct answer is a series of questions:
Who is the attacker? What are you guarding? What assumptions are there about
the operating environment? What invariants (regulations, compliance, etc.)
exist?

There may be compensating controls that invalidate the perceived need for
encryption or compression, for example. In other words, don't design in the
dark.

Of course, the interviewer may just want a canned scripted answer - but the
interview is your chance to shine, showing how you can discuss all the angles.

------
spatulon
That was a fun read. Do I detect a nod to tptacek's "If You’re Typing the
Letters A-E-S Into Your Code You’re Doing It Wrong"?

[https://www.nccgroup.trust/us/about-us/newsroom-and-events/blog/2009/july/if-youre-typing-the-letters-a-e-s-into-your-code-youre-doing-it-wrong/](https://www.nccgroup.trust/us/about-us/newsroom-and-events/blog/2009/july/if-youre-typing-the-letters-a-e-s-into-your-code-youre-doing-it-wrong/)

------
biokoda
If you're compressing audio, the simple solution is to compress using constant
bitrate.

~~~
VLM
Unfortunately, one way to define variable bit rate is "compressed CBR".

Rather than defining "insert comfort noise here" at the protocol level, at
the compression level you get the bitstream-level "I dunno what this stream
is at a higher level, but replace the next 1000 bits with zeros".

That's if you do simple sampling. I dunno about weird higher-level vocoders.
I think you could create a constant-bit-rate vocoder that really is constant.
But it'll likely be incompressible if it's a good one, because a vocoder is
basically a compressor that's specifically very smart about human speech
input. If your vocoder output is compressible, it's not a good one.

I think if you replaced your compression step with "run it through a
constant-rate vocoder" you'd get what you're looking for. Probably.

~~~
dcposch
No, he's saying you compress CBR, then encrypt. Not compress CBR -> gzip ->
encrypt or something silly like that.

CBR audio codec, then encrypt gives you a constant-bitrate stream
indistinguishable from randomness. That's pretty much the gold standard.

(Of course, that still only encrypts content, not metadata. You can encrypt a
phone call in such a way that a watcher gets mathematically zero information
about what's being said, but the watcher still sees who is calling whom, when,
and for how long. Hiding that is much harder.)

------
jayd16
Would be great if Apple understood this and compressed IPA contents before
encrypting.

Instead, when you submit something to the AppStore, you end up with a much
bigger app than the one you uploaded.

To add insult to injury, if you ask Apple about this fuck up you get an
esoteric support email about removing "contiguous zeros." As in, "make your
app less compressible so it won't be obvious we're doing this wrong."

------
poelzi
If your compression can compress your encrypted data, you should change your
encryption mechanism to something that actually works...

------
em3rgent0rdr
What if you compress and then only send data at regular intervals and in
regular packet sizes? That way no information can be gleaned. E.g. after
compressing, you pad the data if it is unusually short, or you include other
compressed data too, or you only use a constant-bit-rate compression
algorithm.
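
A sketch of the padding variant (pad each compressed record up to the next
fixed-size bucket before encrypting, so many different input sizes produce
the same output size; the bucket size is an arbitrary choice here):

    import zlib

    BUCKET = 512

    def compress_to_bucket(plaintext: bytes) -> bytes:
        compressed = zlib.compress(plaintext)
        framed = len(compressed).to_bytes(4, "big") + compressed
        # Round up to a multiple of BUCKET before encrypting.
        padded_len = -(-len(framed) // BUCKET) * BUCKET
        return framed.ljust(padded_len, b"\0")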

------
hueving
That quoted voip paper isn't actually as damaging as it sounds. IIRC that 0.6
rating was for less than half of the words so if you're trying to listen to a
conversation to get something meaningful, it's probably not going to happen.

~~~
ontoillogical
That's a good point, and the 0.6 is an optimistic score (I think it was from
running a subset of their model).

I got the sense that the authors felt this proved an attack of this sort was
possible and viable, but that their model wasn't quite there yet.

OTOH this should be enough to acknowledge that the voice
encryption/compression scheme they are attacking is not secure.

------
panic
Has there been any research into compression that's generally safe to use
before encryption? E.g., matching only common substrings longer than the key
length would (I think?) defeat CRIME at the cost of compression ratio.

~~~
zielmicha
I'm working on (currently finishing the paper for) an algorithm compatible
with deflate/gzip that is safe to use before encryption (i.e. it is
guaranteed not to leak random secrets such as cookies). It's a bit more
complex than your suggestion - matching only common substrings longer than
the key length would still be vulnerable, as a substring boundary may still
fall inside the secret.

~~~
cvwright
Cool! Please post a Show HN with a link to your ePrint when it's done.

------
Qantourisc
Maybe we need encryption that also plays with the length of the message, or
that randomly pads our data before encryption? I am, however, no expert, so
I have no clue how feasible, or how full of holes, this method would be.

~~~
cvwright
Yes, this is one option that we proposed back in 2009 after we first pointed
out the VoIP traffic analysis attacks in 07 and 08.

If you want more information, see our paper on traffic morphing from NDSS 2009
[1]. It works pretty well for VoIP; not as great for trying to counter similar
attacks on web browsing traffic.

[1] [http://www.internetsociety.org/doc/traffic-morphing-efficient-defense-against-statistical-traffic-analysis](http://www.internetsociety.org/doc/traffic-morphing-efficient-defense-against-statistical-traffic-analysis)

------
itsnotvalid
I always think that, if the compression scheme is known, you would need a
good nonce to avoid known plaintext (for example, a compression format's
header is always the same), and you're also exposed to CRIME, which recovers
the dictionary of the compression.

I think it is best to use the compression program's built-in encryption
scheme, as those often take these issues into account (and the header is not
leaked, since only the content is encrypted).

------
cm2187
Can't you just add some random-length data at the end? You defeat compression
a little bit, but you also make the length non-deterministic. I thought PGP
did that.

~~~
Murk
So long as you do not compress the random data you add, it will work.
Compressing it will not help the situation at all, since the compression
ratio will be constant for the random data. The problem is that you then have
to seriously negate the effects of compression.

Consider compressing "silence" in a telephone call. You can't compress it
well if you also need the non-silence elements to be indistinguishable from
it by adding random noise. You must add enough random noise to cover up any
compression differential, otherwise statistical artefacts still persist. That
amount of random noise will be up to the maximum compression ratio you can
achieve.

------
arielweisberg
So what does this mean if I am using an encrypted SSL connection that is
correctly configured?

Is this kind of problem not already dealt with for me by the secure transport
layer? It would be a shame if the abstraction were leaky. My understanding of
the contract is that whatever bits I supply will be securely transported
within the limits of the configuration I have selected.

If I pick a bad configuration then yes shame on me, but a good configuration
won't care if I compress right?

------
gravypod
Logically speaking, an encrypted file should have a high-entropy set of bits
within it. Compressing it would be low return, but you'd get higher security,
since the input to the encryption contained more "random" bits.

Compressing the source material will yield smaller results, but the result
will be more predictable, as the file will always contain ZIP headers and
other metadata that could make decryption of your file much easier.

~~~
tankenmate
But the metadata is a form of structured data and, as you state, is
semi-known plaintext; i.e. the structure is known, and to a lesser extent so
is the data (some fields have a known or limited range of possibilities).

Aside from the headers, the compressed data itself often has structure; take
the Lempel-Ziv class of dictionary encoders, which rely on repeating data to
compress. The very fact that it is repeating means you can guess, with much
higher probability than normal, what certain bytes will _not_ be (because the
longest words that match the dictionary are chosen to be tokenised, to
maximise compression); i.e. bytes that _don't_ match a word suffix in the
dictionary restart the search for a new matching word/token pair.

Having said that, the plaintext itself is almost never random; the key
question is whether the attacker has any crib that might be used to make a
good first guess and reduce the amount of work required.

So which is more guessable: the semi-structure, at the byte level, of
compressed data, or the possible semi-structure of the original plaintext? If
you are dealing with a protocol that specifies compression (and in particular
which compression method), you may have given away part of the game.

One option is to "bump up" the entropy and add "chaff" to the compressed
stream before encryption; i.e. add some entropy, but less than the amount
saved through compression. My gut feeling is that the efficacy of this would
vary depending on plaintext, compression method, and encryption method. You
also run the risk of side-channel analysis via CPU, RAM, power usage, etc.

~~~
gravypod
The input 'plaintext' could be binary data or other information that the
attacker cannot predict.

What they can predict, if they find some code interfacing with this encrypted
file, is the way it has been stored. It's not much of a long shot to say that
if you can identify that it's just a plain zipped file, then your job will be
much easier when it comes to reverse engineering it.

That being said, it's still a huge pain in the ass to work with that stuff. I
mean, the US government hires some of the world's best crypto people, and
they still are sent for a run sometimes.

------
jtolmar
If I compress each component (i.e. attacker-influenced vs. secret)
separately, concatenate the results (with message lengths, of course), and
then encrypt the whole message, is that secure?

It seems like it should be, but I'm not an encryption expert. The compression
should be pretty good, though.
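
A sketch of the idea (each part gets its own compression context, so
attacker-controlled text can never back-reference the secret; the
concatenation would then be encrypted as a whole):

    import zlib

    def compress_separately(attacker_part: bytes,
                            secret_part: bytes) -> bytes:
        a = zlib.compress(attacker_part)
        s = zlib.compress(secret_part)
        # Length-prefix each piece so the receiver can split them.
        return (len(a).to_bytes(4, "big") + a +
                len(s).to_bytes(4, "big") + s)

This is loosely the same spirit as HPACK's mitigation: limit what compression
state can be shared across trust boundaries.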

------
khc
> The paper Phonotactic Reconstruction of Encrypted VoIP Conversations gives
> a technique for reconstructing speech from an encrypted VoIP call.

The technique for reconstructing speech clearly had its limitations.

------
draugadrotten
This blog is an interesting way to advertise to their target market: us.

~~~
BrainInAJar
By eliciting damnfool responses from people that haven't read the article?

~~~
phillmv
That part was… unexpected.

------
gameofdrones
The OP should take
[https://www.coursera.org/learn/crypto](https://www.coursera.org/learn/crypto)

~~~
boriselec
This question was in one of the quizzes. Expected answer: compress first.

------
kstenerud
So if the length of the resulting message is leaking information, salt it by
adding some extra random bits to the end to increase the length by a random
amount.

~~~
peterwwillis
Which may be useful, unless you can use a padding oracle attack or timing
attack, or you're using something stupid like ECB mode, or you aren't
authenticating your ciphertext.

In general, it is safe to assume that whatever countermeasure you are thinking
of has already been defeated by an attacker, unless you have researched for a
really long time and found no possible alternative.

------
arjie
None of this seems to apply to documents you generate to supply to someone
else you trust. Compress and encrypt seems perfectly fine.

~~~
rpearl
If you're encrypting it, it is to hide information from some sort of attacker,
not the trusted recipient of the document. If there is literally no
possibility at all of someone else intercepting the document, then why are you
bothering to encrypt in the first place?

~~~
arjie
Because the wire is unsafe.

~~~
rpearl
In which case the considerations in the featured article apply entirely to any
attacker with access to the wire/transit.

Your original comment makes no sense.

~~~
arjie
No. The considerations in the article apply to certain kinds of input under
certain conditions. Those caveats mean it doesn't apply to self-generated text
documents encrypted all at once, in general.

------
FuturePromise
Given the real risk of CRIME attacks, are there "compression aware" encryption
algorithms?

~~~
zielmicha
I don't think it is possible - it's hard to imagine an encryption algorithm
that doesn't preserve content length.

------
justinzollars
tl;dr

------
vox_mollis
A lot of comments here suggesting that encryption increases entropy. While
true, it only adds the key's entropy to the plaintext's entropy. In most real-
world cases, len(m) >> len(k), so this is usually an insignificant increase of
entropy. Compression _also_ adds a trivial amount of entropy (specifically,
the information encoding the algorithm used to compress, even if that
information is out of band).
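
In symbols (roughly, treating the ciphertext C as a deterministic function of
the message M and key K):

    H(C) \le H(M, K) \le H(M) + H(K)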

~~~
srgb
I believe you're confusing entropy with Kolmogorov complexity.

------
usloth_wandows
I thought this was common sense. Compress then encrypt. Encryption leads to
higher entropy, therefore less effective compression.

~~~
danielweber
The article is about why that can be wrong.

~~~
mozumder
But in almost all cases it's right.

~~~
danielweber
"Don't compress at all" is the better default answer.

Noting that "compression after encryption is stupid."

~~~
mozumder
The correct default answer is encrypt after compress.

"Don't compress at all" doesn't help you if you need to reduce bandwidth.

What good is a secure channel if no one uses it because of its high bandwidth
requirements?

~~~
danielweber
A lot, possibly a majority, of the major breaks in crypto systems (certainly
the interesting ones) in the past decade have been because of compressing
before encrypting. If someone wants to compress first, demand that they
justify the reduced bandwidth usage.

