
The visual description of the colliding files, at http://shattered.io/static/pdf_format.png, is not very helpful in understanding how they produced the PDFs, so I took apart the PDFs and worked it out.

Basically, each PDF contains a single large (421,385-byte) JPG image, followed by a few PDF commands to display the JPG. The collision lives entirely in the JPG data; the PDF format is merely incidental here. Extracting the two images yields two JPG files with different contents (and different SHA-1 hashes, since the necessary prefix is missing). Each PDF consists of a common prefix (which contains the PDF header, JPG stream descriptor, and some JPG headers), a short differing region (the collision blocks), and a common suffix (containing image data and PDF display commands).

The header of each JPG contains a comment field, aligned so that the 16-bit length value of the field lies in the collision zone. Thus, when the collision is generated, one of the PDFs ends up with a longer comment field than the other. After that, they concatenate two complete JPG image streams with different image content: File 1 sees the first image stream and File 2 sees the second. This is achieved by using the misalignment of the comment fields to make the first image stream appear as a comment in File 2 (more precisely, as a sequence of comments, to avoid overflowing the 16-bit comment length field). Since JPG decoding terminates at the end-of-image (FFD9) marker, the second image stream isn't even examined in File 1 (whereas in File 2 that marker falls inside a comment).

tl;dr: the two "PDFs" are just wrappers around JPGs, which each contain two independent image streams, switched by way of a variable-length comment field.
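The comment-field switcheroo can be sketched with a toy parser. This is illustrative only: real JPEG parsing is more involved, and the byte strings below are stand-ins, not the actual colliding data.

```python
def com(payload: bytes) -> bytes:
    # A JPEG COM (comment) segment: FF FE marker, then a 16-bit
    # big-endian length that counts the length field itself plus payload.
    return b"\xff\xfe" + (len(payload) + 2).to_bytes(2, "big") + payload

def visible_image(data: bytes) -> bytes:
    # Toy decoder: skip leading COM segments, then read up to the
    # first FFD9 end-of-image marker.
    i = 0
    while data[i:i + 2] == b"\xff\xfe":
        i += 2 + int.from_bytes(data[i + 2:i + 4], "big")
    end = data.index(b"\xff\xd9", i) + 2
    return data[i:end]

# Stand-ins for two complete image streams, each ending in FFD9.
stream1 = b"<image stream 1>\xff\xd9"
stream2 = b"<image stream 2>\xff\xd9"

suffix = stream1 + stream2  # identical in both files

# File 1: an empty comment; the decoder then sees stream1 and stops
# at its FFD9 marker, never examining stream2.
file1 = com(b"") + suffix

# File 2: the same bytes, except the comment length is enlarged so the
# whole of stream1 is swallowed as comment payload; the decoder sees
# only stream2. Only the two length bytes differ between the files --
# in the real attack, that difference lands in the collision blocks.
file2 = b"\xff\xfe" + (2 + len(stream1)).to_bytes(2, "big") + suffix
```

In the real files the first stream is far too long for a single 16-bit length, so it is wrapped in a chain of comments, but the switching mechanism is the same.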

Wow, nice. Thanks. It seems everyone was assuming the secret sauce had something to do with embedding appropriate junk in the PDF format itself.

I was cynically thinking that Google might, as an aside, be recruiting with this disclosure; looking for people (like you) who actually solved the "how they did it" puzzle.

So if one were hashing data or a document format that also contained a bit of self-referential integrity data, e.g. the end of the data has a CRC of the rest of the block, or maybe a cryptographic signature using some other hash, wouldn't that further constrain the search space? Then not only would one need to back out the random bits needed to complete the SHA-1 collision, the result would also have to satisfy some other constraint within the other data in the block.

The best thing would be if one could prove a mathematically incompatible set of counter functions, where data colliding in the hash of one function would prevent the other function from validating correctly.

> The best thing would be if one could prove a mathematically incompatible set of counter functions, where data colliding in the hash of one function would prevent the other function from validating correctly.

I'm a rank amateur, so this is completely outside of my wheelhouse, but this sounds suspect. I don't think you can make hash collision impossible, even with multiple functions, unless the combined hashes contain as much data as the originals or the combined functions have restricted inputs. The point of hashes is making collisions hard, not mathematically impossible - the latter is, I believe, itself impossible.

Like I said, I'm totally unqualified to speak on this, so I may well be missing something here.

You are correct: by the pigeonhole principle, if the sum of the lengths of all the hashes/checksums is less than the length of the checksummed data, then collisions exist. (A detail: If checksums are included into the input of other checksums, their length doesn't count towards the amount of checksummed data, because they are not free to vary.)

Wouldn't this imply that all hash functions (other than one-to-one mappings) must have collisions?

Why does the pigeonhole principle hold?

Yes, this is correct. It's a really simple principle, and I think an explanation can help you understand :)

Suppose you have 3 holes, and 4 pigeons, and you stuff the pigeons into holes. There must be 1 hole with at least 2 pigeons, right?

The same is true with hash functions. If you're hashing data down to a fixed length, say 256 bits with SHA-256, and the inputs can be longer than 256 bits, there are more possible inputs than outputs, so there must be a collision somewhere.
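To make that concrete (a toy demonstration, not an attack on SHA-256 itself): truncate a real hash to 16 bits and brute-force a collision, which the pigeonhole argument guarantees must exist, and which the birthday bound says should appear after only a few hundred tries.

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> bytes:
    # SHA-256 truncated to 16 bits: only 65,536 possible outputs.
    return hashlib.sha256(data).digest()[:2]

seen = {}
for i in count():
    msg = str(i).encode()  # every input is distinct
    h = tiny_hash(msg)
    if h in seen:
        collision = (seen[h], msg)  # two distinct inputs, same hash
        break
    seen[h] = msg
```

The pigeonhole principle caps the loop at 2^16 + 1 iterations; in practice the birthday effect finds a collision after roughly 2^8 tries.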

> Wouldn't this imply that all hash functions (other than one-to-one mappings) must have collisions?

Yes, they do. Finding them is the hard part.

Yes; by definition using something with X possible values to represent something with Y possible values will always have collisions if X < Y.

Right. But there are properties that can be proven about a given hash function that give us more faith that no collision can be efficiently found: https://en.m.wikipedia.org/wiki/Security_of_cryptographic_ha...

I'm an amateur in this area too, but I'm not suggesting we can avoid collisions; I'm suggesting adding a validation function for the hashed data, so that anyone generating an intentional collision would still have to contend with producing one that also validates.

For Git, Linus basically says the validation function is a prepended type/length. https://news.ycombinator.com/item?id=13719368

The issue with that is that we already saw it defeated during the MD5/certificate collision era: the final cert, as delivered, would by definition contain additional data beyond the CSR (the signer reference and the start/end dates), but because that information was predictable, the collision could be generated against the expected emitted cert rather than the input data. The same would apply to Git: if you were building the submitted data block, you would know what type and length it was going to be, so you could build it with that in mind while colliding.

The composition of the result of the original hash function and the result of that validation function can be taken together to be some larger function. Call that an uberhash. Such an uberhash is created by putting some number of bits in and getting some smaller number of bits out. There will unfortunately still be collisions. That trick is an improvement, and newer hashing algorithms contain similarly useful improvements to make creating collisions difficult.

It's a good thought, but if the functions have bounded-length outputs (as with all major hash functions), then your combined hash value is still a bounded length. A collision must therefore exist because there are many more potential messages than potential hash values.

If your functions produce unbounded-length hashes, then you may as well just be compressing your input to guarantee there are no collisions in your "hash".

True - I suppose the combined hash just ends up being the composition of, say, an interior CRC and some exterior hash function. That gives us low to zero chance of something mathematically provable. But perhaps some arrangement of functions forces an attacker to scramble the inputs in more drastic ways; if the original data had to be altered by more than a handful of bytes, it could raise the difficulty on a practical basis.

But to make a strong cryptographic hash, you don't want predictability based on inputs. If you end up with some kind of pattern where getting the same output values requires modifying huge ranges of bytes, that lets attackers reason about inputs that generated a hash.

Why not use SHA-1 recursively, e.g., append SHA1(text), then SHA1(everything)?

Of course, it would probably make more sense just to use one of the recommended replacements in the first place, e.g. SHA-256.
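The nested construction is easy to write down (a sketch using Python's hashlib), but note that it doesn't rescue SHA-1: because SHA-1 is a Merkle-Damgard hash, two equal-length colliding messages leave the hash in an identical internal state, so appending the (identical) inner digest to each still produces the same outer hash. The shattered PDFs would collide under this scheme too.

```python
import hashlib

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def recursive_sha1(text: bytes) -> bytes:
    # "append SHA1(text), then SHA1(everything)"
    return sha1(text + sha1(text))

# If sha1(a) == sha1(b) and len(a) == len(b), then a + sha1(a) and
# b + sha1(b) extend the collision with an identical suffix, so
# recursive_sha1(a) == recursive_sha1(b) as well.
```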

It's instructive to diff the hexdumps of the two files; the difference is surprisingly small.

The differences are confined to two SHA-1 input blocks of 64 bytes each. The first block introduces a carefully-crafted set of bit differences in the intermediate hash value (IHV), which are then eliminated in the second block. After the IHVs synchronize, all future blocks can be identical because SHA-1 can be trivially length-extended.

It is small indeed.

    $ diff -u <(xxd -g1 shattered-1.pdf) <(xxd -g1 shattered-2.pdf)
    --- /dev/fd/63	2017-02-24 12:25:17.311939732 +0530
    +++ /dev/fd/62	2017-02-24 12:25:17.311939732 +0530
    @@ -10,14 +10,14 @@
     0000090: 72 65 61 6d 0a ff d8 ff fe 00 24 53 48 41 2d 31  ream......$SHA-1
     00000a0: 20 69 73 20 64 65 61 64 21 21 21 21 21 85 2f ec   is dead!!!!!./.
     00000b0: 09 23 39 75 9c 39 b1 a1 c6 3c 4c 97 e1 ff fe 01  .#9u.9...<L.....
    -00000c0: 73 46 dc 91 66 b6 7e 11 8f 02 9a b6 21 b2 56 0f  sF..f.~.....!.V.
    -00000d0: f9 ca 67 cc a8 c7 f8 5b a8 4c 79 03 0c 2b 3d e2  ..g....[.Ly..+=.
    -00000e0: 18 f8 6d b3 a9 09 01 d5 df 45 c1 4f 26 fe df b3  ..m......E.O&...
    -00000f0: dc 38 e9 6a c2 2f e7 bd 72 8f 0e 45 bc e0 46 d2  .8.j./..r..E..F.
    -0000100: 3c 57 0f eb 14 13 98 bb 55 2e f5 a0 a8 2b e3 31  <W......U....+.1
    -0000110: fe a4 80 37 b8 b5 d7 1f 0e 33 2e df 93 ac 35 00  ...7.....3....5.
    -0000120: eb 4d dc 0d ec c1 a8 64 79 0c 78 2c 76 21 56 60  .M.....dy.x,v!V`
    -0000130: dd 30 97 91 d0 6b d0 af 3f 98 cd a4 bc 46 29 b1  .0...k..?....F).
    +00000c0: 7f 46 dc 93 a6 b6 7e 01 3b 02 9a aa 1d b2 56 0b  .F....~.;.....V.
    +00000d0: 45 ca 67 d6 88 c7 f8 4b 8c 4c 79 1f e0 2b 3d f6  E.g....K.Ly..+=.
    +00000e0: 14 f8 6d b1 69 09 01 c5 6b 45 c1 53 0a fe df b7  ..m.i...kE.S....
    +00000f0: 60 38 e9 72 72 2f e7 ad 72 8f 0e 49 04 e0 46 c2  `8.rr/..r..I..F.
    +0000100: 30 57 0f e9 d4 13 98 ab e1 2e f5 bc 94 2b e3 35  0W...........+.5
    +0000110: 42 a4 80 2d 98 b5 d7 0f 2a 33 2e c3 7f ac 35 14  B..-....*3....5.
    +0000120: e7 4d dc 0f 2c c1 a8 74 cd 0c 78 30 5a 21 56 64  .M..,..t..x0Z!Vd
    +0000130: 61 30 97 89 60 6b d0 bf 3f 98 cd a8 04 46 29 a1  a0..`k..?....F).
     0000140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0000150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     0000160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
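The "two 64-byte blocks" claim can be checked mechanically with a short helper (hypothetical code; the commented lines assume you've downloaded the two shattered PDFs yourself):

```python
def differing_blocks(a: bytes, b: bytes, block_size: int = 64) -> set:
    """Indices of the fixed-size blocks in which a and b differ."""
    assert len(a) == len(b)
    return {i // block_size for i, (x, y) in enumerate(zip(a, b)) if x != y}

# With the real files (not included here), the result should be {3, 4}:
# the 64-byte blocks starting at offsets 0xC0 and 0x100, matching the
# hexdump diff above.
# pdf1 = open("shattered-1.pdf", "rb").read()
# pdf2 = open("shattered-2.pdf", "rb").read()
# print(differing_blocks(pdf1, pdf2))
```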

Great analysis. So why did Google choose to use PDFs instead of JPGs directly?

"Document security" is a common concern in business and legal contexts. PDFs are a standard representation of documents, whereas JPGs are less common (TIFF might count, since it's a somewhat common scanned-document format, but I digress).

Subverting the most popular document format, one which even provides built-in support for digital signatures, is probably much more interesting than subverting the JPG format.

Open challenge: build two signed PDFs with identical signatures and different contents - it's not necessarily hard, and it would definitely drive the point home.

I wonder if this is the 'standard' thing to do when you have a broken hash function. I took an introductory course in computer security a year ago and we had to create MD5 hash-colliding documents for homework[0].

[0] https://courses.engr.illinois.edu/cs461/sp2015/static/proj1.... (see part 4)

I'm interested in taking a look at this assignment for my own learning, but it looks like there are a bunch of files you need. It would be cool if you put them in a git repo!

Because many automated systems/image hosting websites (imgur, etc.) will process JPEGs to optimize them, which would presumably destroy the collision.

On the other hand, PDFs are almost universally sent as-is, increasing the risk of being able to 'fool' someone.

Perhaps to be taken more seriously by management. All things being equal, the content of PDFs (auditor reports) is more important than that of JPGs (cat memes).

Maybe because JPG bytes are more likely to be manipulated when sent around the web?

Thanks! What tools did you use to analyse the JPEG?

I used Hachoir (https://github.com/haypo/hachoir3), a Python library that I've contributed to. Hachoir disassembles files using a library of parsers, with the intent of describing the function of every single bit in the file. You can see the resulting disassemblies (rendered with the hachoir-wx GUI) here: http://imgur.com/a/F1cnV

And here I'm refactoring a PHP application feeling like I've come so far. I love that I always have more to learn.

What an incredible tool... I have a webapp that has as a small component the analysis of JPEG metadata, this will be very useful for future development.

I've just started disassembling a report file format. In python. This is going to be so useful. Thanks!!

Woah, that's a fascinating tool! What other sorts of things could Hachoir be used for?

I use Hachoir for a lot of tasks - digital forensics (examining file tables on a filesystem, discovering slack space taken up by hidden data, etc.); for metadata extraction (for which the hachoir-metadata subproject is designed); for file carving (either with hachoir-subfile or manually); and just to geek out about file formats (did you know .doc/.ppt is secretly a FAT filesystem?).

Funny, I had no idea it could do all that. I've been using it in production for years to extract width/height/duration of media files at compile time so my loading placeholders are perfectly sized at runtime in a basic web app.

Sweet tool recommendation! Bookmarking that :-) Thx
