Hacker News new | comments | show | ask | jobs | submit login
SHA-1 collider: Make your own colliding PDFs (alf.nu)
299 points by ascorbic 119 days ago | hide | past | web | 98 comments | favorite



I think the coolest proof was that the Bitcoin SHA1 collision bounty was claimed:

https://bitcointalk.org/index.php?topic=293382.0

The OP_2DUP OP_EQUAL OP_NOT OP_VERIFY OP_SHA1 OP_SWAP OP_SHA1 OP_EQUAL script was making sure that only a person who finds 2 SHA1 colliders and publishes it can get the 2.5 BTC bounty.



This was a good explanation of what's happening here from a previous thread: https://news.ycombinator.com/item?id=13715761

The key is that essentially all of the data for both images are in both PDFs, so the PDFs are almost identical except for a ~128 byte block that "selects" the image and provides the necessary bytes to cause a collision.

Here's an diff of the 2 PDFs from when I tried it earlier: https://imgur.com/a/8O58Q

Not to say that there isn't still something exploitable here, but I don't think it means that you can just create collisions from arbitrary PDFs.

edit: Here's a diff of shattered-1.pdf released by Google vs. one of the PDFs from this tool. The first ~550 bytes are identical.

https://imgur.com/a/vVrrQ


I didn't get a chance to make this point in that other thread, because the thread [1] of its follow-ups quickly morphed from promising [2] to meandering, but the combination of lax formats (PDF and JPEG in this instance) makes this style of collision particularly reductive, and in a sense, a cheapshot, if still devastatingly damaging given both PDF's and JPEG's ubiquity -- both separately and together -- in document storage and archival.

This shows the importance of techniques like canonicalization and determinism, which ensure that given a particular knowledge set, that result could only have been arrived at given exactly one input. For general-purpose programming languages like PostScript, of which PDF is a subset, this is essentially an unfulfillable requirement, as any number of input "source code" can produce observationally "same" results. Constrained formats, and formats where the set of 'essential information' can be canonicalized into a particular representation should be the norm, rather than the exotic exception, especially in situations where minute inessential differences can be cascaded to drastically alter the result.

[1] https://news.ycombinator.com/item?id=13715761 [2] https://news.ycombinator.com/item?id=13718772


> Constrained formats, and formats where the set of 'essential information' can be canonicalized into a particular representation should be the norm, rather than the exotic exception, especially in situations where minute inessential differences can be cascaded to drastically alter the result.

That might be very challenging in practice, because a more expressive language directly allows a more compressed/efficient encoding of the same information, but at the cost of being more difficult (or impossible) to create a canonical representation.

Also, data formats that are purposefully redundant for error tolerance all basically have the property that readers should be tolerant of non-canonical forms. If we want to redundantly represent some bytes redundantly in case of data loss, then there must be multiple representations of those bytes that are all acceptable for the reader for this to work.

Video and image formats use multiple encodings to give encoders the room to make time-space trade-offs.


I agree, for anything that a human is supposed to see with the eyes, there are always different representations that look the "same" enough to be indistinguishable.

People should be aware of it, not believe in a non-existing world where it isn't so.


> PDF and JPEG

Add ELF [1] and Zip [2] to the list. Many common file formats have areas where you can insert an arbitrary chunk of data without significant side effects.

[1] ELF allows for a very flexible layout, and is almost certainly vulnerable to this length-extension-based attack.

[2] Zip allows a comment at the end of the central directory. Since the central directory is at the end of the file, I don't know if it's vulnerable to this exact attack.


How about PE format? You can basically write after the ImageBase+ImageLength and have no functional difference.


As the paper [http://shattered.io/static/shattered.pdf] says:

> This is an identical-prefix collision attack, where a given prefix P is extended with two distinct near-collision block pairs such that they collide for any suffix S

The near-collision block pairs is the difference everyone can see in the image. Whoever created the PDFs did everyone the courtesy of keeping the suffix the same. There's numerous examples already of different PDFs with the same hash.


Shameless plug: I built my own version of this to collide arbitrary PDFs: https://github.com/nneonneo/sha1collider

My version is similar to this, but removes the 64kb JPEG limit and allows for colliding multi-page PDFs.


Excellent! This ought to be the top comment; have you submitted it separately? I was considering implemented the same thing, since there wasn't any reason for those limitations. The Google researchers purposely designed this collision to be highly reusable, probably to encourage everyone to generate lots of fun examples which will be spread widely and break systems everywhere :)

I'm surprised that some PDF renderers have problems with JPEG streams that contain comments and restarts. Actually, glancing at the JPEG spec, I didn't even realise that restarts would be needed, I thought this was just done with comments. Is this really "bending" the spec, or is GhostScript buggy, or is the problem that it's assuming that restarts don't occur inside comments without escaping?


I did submit it, but I don't think anyone saw it: https://news.ycombinator.com/item?id=13729920. I was trying to finish it in time to catch daytime folks but work got in the way :)

Comments are limited to 65536 bytes, and the JPEG spec doesn't offer any way to break an image stream up except for progressive mode or restart intervals (otherwise, the image stream must be a single intact entity). The trouble is that it's probably not _technically_ legal to stick comments before restart interval markers (putting comments _after_ the markers just breaks all the readers since presumably they are expecting image data). So, GhostScript's JPEG reader gets confused when it fails to see a restart marker immediately following an image data chunk, and aborts decoding.


Wow, it works! I thought this was supposed to require Google-level resources and months on processing time. Did the initial collision somehow enable more similar ones?


As Deregibus explains elsewhere, this is almost certainly using the same hash collision.

https://news.ycombinator.com/item?id=13728116


Yeah, I think that's pretty much the case. The first 320 bytes of the two PDFs released by Google result in the same SHA-1 state. Once you're at that point as long as you append identical data to each of the files you're going to get identical hashes. This is just taking those same 320 bytes and appending the combined images of your choice.

edit: as versteegen points out it's 320 bytes, not 304.


I had a whole discussion in another thread about this:

https://news.ycombinator.com/item?id=13716581

I learned a lot from it. One thing is that this property is true of any Merkle-Damgård-type hash if the hash internal state is the same size as the hash digest. This is true of SHA-1 and of several other famous and widely-used hashes, but not true of every hash, including some of the most recent designs like several SHA-3 candidates and SHA-3 itself. In a hash without this property, you can have a collision condition H(X)=H(Y) (and len(X)=len(Y)) yet typically H(X+a)≠H(Y+a).

Edit: len(X)=len(Y) is also necessary because Merkle-Damgård hashes encode the message length into internal padding, so if you happened to have two colliding inputs that were different lengths, they will generally not produce a collision when the same string is added to each.


This is really good to be aware of, even if there were no collisions. I could imagine someone making for example a signed cookie scheme that is value,SHA1(secret,value). Someone could then change it to value+foo,SHA1(secret,value+foo) without knowing the secret, and it would verify as a valid signed cookie.


Yes, that's called a length extension attack: https://en.wikipedia.org/wiki/Length_extension_attack

It's why you don't use a bare hash as authentication, but instead use a HMAC.


People sometimes overstate the impact of length extension attacks. If your format has a length prefix (really common) then you may well be "vulnerable" in the sense that appending arbitrary data is "valid", but a canonical form without the appended data is trivial to construct; and indeed most software would likely completely ignore that extra data.

HMAC is a neat trick to avoid length extension attacks (and other issues) in a generalized fashion, but that doesn't mean those risks actually apply in practice. (Some googling finds e.g. this paper: https://www.iacr.org/archive/fse2009/56650374/56650374.pdf which proposes an attack on length-and-key prefixed messages, using some sha1 weaknesses and merely over 2^84 memory and 2^154 queries - color me impressed, but not scared). Edit: just to be clear, I'm not suggesting anyone actally use LPMAC-sha1 given the current state of sha1.

For another example; in general it's unsafe to truncate a "secure" hash - hashes that satisfy most security requirements can be constructed that are not safe when truncated (e.g. sha3 prepended by zeros is still safe, but obviously not if truncate the sha3-provided bits off). But I don't know of any mainstream hash where this theoretical risk actually applies (e.g. no merkle-damgard hash suffers from such a risk); nobody constructs hashes intentionally with more bits than entropy.

It's probably still wise to stick with known-good constructions, but the risks seem overstated, and the difficulty is also overstated - assuming the primitives used aren't too flawed. Sure, it's cool that HMAC can use even flawed things like MD5 and retain safety, but typically nobody is forcing you to stick with md5. I guess the world is more complicated if you need to pick a protocol and then you're unable to change it, but most applications can (with some effort) be changed. You need something safe now, not for all eternity.

So, I think the rule is simpler: this has little to do with crypto per se; just don't be unnecessarily clever, in general. Crypto makes the consequences particularly nasty, often. But that's about it.


Indeed, it comes back to the usual rule of not rolling your own crypto.


This was good meme that served its function well when it was needed - early enthusiasm for reusable cryptographic primitives and a failure to recognise the foot-shooting potential lead to many easily broken schemes.

Now, however, "don't roll your own crypto" is dogma, and if anything we have the opposite problem of monoculture and slow progress. I think a more nuanced view is required, one that encourages experimentation when the stakes are low and more competing implementations when the stakes are high (or perhaps we should call them "complementing" - a standard ought to have multiple implementations).

As Wikipedia puts it, "Mathematical analysis of [security] protocols is, at the time of this writing, not mature... Protocol design is an art requiring deep knowledge and much practice; even then mistakes are common." How are programmers to practice, if they are not allowed to fail?


And SHA(-2)-512/X (The truncation to X forms) are also good for avoiding length extension


Yep, I have a long list in the other thread (transcribed from Wikipedia). It's nice to finally understand why the truncation forms exist!


320 bytes.


That's not quite what he's saying. He's saying it's using the same research but you can't reuse a hash collision like that since the contents of the two images you give it are unknown before hand.


You can reuse a collision by appending identical data to both pieces of colliding data and the result still collides. The collision google found exists in the PDF before the JPEGs, and both JPEGs are appended after the collision -- the difference in the earlier data is where the differing instructions exist that say either to display the first or the second JPEG.


I see, i had misunderstood how they had done their attack. Makes a lot of sense.


You can reuse the PDF from Google because it's not particularly reliant on the contents of the images. You can replace the images that Google used with your own (indeed that's what the service does) and it will output two files that themselves have the same hash (but that hash will be different than the original provided PDFs)


Interesting. I need to read more into this because I didn't think that kind of thing was possible. I thought you'd have to recalculate it from scratch because of the two images being different but apparently I didn't understand how this worked.


(very)psuedocode:

  goto showImage1;
  showImage1:
    renderImage1();
    exit
  showImage2:
    renderImage2();
    exit
If an attacker can hash a block that means "goto showImage1" and "goto showImage2" with the same hash, then you can see that the contents of those images doesn't matter, as long as the data for those images occurs after the logic for choosing them.

(and no that's not what this attack does in reality, but just a logical framework for understanding why the images don't matter)


Both images are in both files. The only difference is the part of the file (where Google put the collision) which switches the image shown.


yea, this.. wow.

Is this the first hash function which went from "secure" to collision-as-a-service in a matter of days? Was sha1 particularly weak, or the published research particularly strong? or maybe something else?


It was known for a while to move away from SHA-1. Any TLS certificates using SHA-1 were supposed to become invalid on January 1st.

But I wasn't expecting that google's 110 GPU-year work meant that we could create colliding PDFs on demand.


Absolutely, but knowing something smells a little off vs having a documented collision feels like the point where we we go from "No longer advised for use" to "It's broken".

Usually its when we hit "It's broken" that average Joe developer/operator/etc starts to form their migration plan, and usually they have some time before the attack becomes reasonable to execute outside of pure research / "nation state" attacks.. It seems that block of time was just a few days (a day?!) for SHA1.

And it's that, even if we're talking about an easy to abuse format like PDFs apparently are, that makes me question what just happened? Was sha1 secretly terrible? was the research just that good? or was there some other factor that allowed this to happen?

Also, sure, the major browsers have a date for SHA1 as a TLS signature hash retirement.. But SHA1 is used for a whole pile of other stuff with absolutely no transition plan! January 1st was absolutely never going to be the last day people used SHA1.


Look into the details. This isn't what you think it is.


As other people explained, and just to summarize, this service is a way of reusing Google's specific collision (as a prefix to new collisions), and isn't a way of making arbitrary colliding files or file formats. You can't, for example, use this to make something other than PDFs that collide; for that, you'll have to redo a computation on the scale of Google's!


I thought I heard that some file formats have "headers" that go at the end of the file. I think a demo of this was a file that was both a zip and a PNG or something. If I remembered right, you should be able to make a similar hack here.


If the only headers are at the end, then yes, that's a really neat idea. I'm not sure of the constraints for zip files. Maybe it would be interesting to brainstorm with some people with a lot of file format knowledge to find additional such formats. But if the formats have any constraints at all on the first few hundred bytes, those generally won't be satisfied by the prefix here.


Why wouldn't you be able to? Most file formats have ways of including comment data, including JPEG, GIF, HTML, C code, and most others. You could potentially create a piece of noble C code and a piece of malicious C code that collide. (However, creating a piece of malicious code that matches an existing codebase that doesn't contain a comment would be hard, if I understand correctly.)


This reuses the prefix that Google calculated, which is a PDF header. Therefore you can only generate PDFs with this technique.

If you want to dedicate a few hundred GPU-years to it, you could generate similar colliding prefixes for other formats by doing what Google did.


Ah, I see, thanks.


This works the same way with any Merkle-Damgard-constructed hash function; basically it's just a length extension attack.


How do you length extension when both outputs are the same length?


It's a length extension on the beginning part of the PDF. There are 2 headers that have the same hash. As long as you append the same suffix to both of those headers, the hash will be the same. In this case the headers happen to contain a switch that select between one of two images. So the length extension is adding both images to both of the headers. Since the headers have the same hash and the suffix has the same hash, the overall document has the same hash. But because of the switch in the header you see two completely different contents.

So basically H1 and H2 have the same SHA1 hash. By adding suffix I1I2 to both you get H1I1I2 and H2I1I2. That's the length extension.


If you got two different messages to get into the same internal state with SHA-3, the same thing would apply, wouldn't it? Would that be a length extension attack on SHA-3?

This is really a terminology question. I had a clear understanding of "length extension attacks" but it seems on this comment page people are using something else now. I've been looking over crypto.stackexchance and twitter to see if I missed the memo but this looks like a new usage.


It's a different usage of the term than the normal case, but it takes advantage of the same vulnerability to length extension. It doesn't necessarily apply to SHA-3 and other sponge functions because there is a difference there between the internal state of the hash function and the actual hash itself. The internal state is much larger than the hash output for a sponge function like SHA-3, which means you can get two messages that have the same hash without having the same internal state, and therefore appending a suffix would mostly likely change that internal state enough to no longer have a hash collision.


> It's a different usage of the term than the normal case

Is it? How? It's a simple case of length extension, just that here, since we have two independent starting points sharing the same state, we start with a collision and we extend to a collision.

In other words, these are two length extensions on independent prefixes. It just happens that these prefixes share the same state / hash, hence the surprising result (on a first glance).


Normally when talking about length extension attacks the original plaintext is unknown, but we can compute the hash of the plaintext plus an extension if we know the hash of the plaintext. In this case we know what the plaintext is, and we happen to have two different texts that produce the same hash, which we can extend to generate many collisions. It's the same property but it's a different scenario than what is commonly referred to as length extension.


Both SHA-3 and BLAKE2 are not susceptible to length extension. Keyed BLAKE2 is actually just a prefix MAC (which would be completely unsecure with SHA-1/2 / classic Merkle-Damgard).


SHA1 has been deprecated for a while. It certainly wasn't considered "secure" on the day the attack was announced. For example CAs have been moving away from SHA1 signatures for a few years.


Day, not days.


Faster than the zip quine.


I just constructed a little POC for bittorrent: https://github.com/michaf/shattered-torrent-poc

Installer.py and Installer-evil.py are both valid seed data for Installer.torrent ...


Nice, you're able to line it up so the colliding data starts right at the beginning of one of the torrent blocks.


According to [the Shattered paper](http://shattered.io/static/shattered.pdf), the reason why the proof-of-concepts are PDFs is because we are looking at a

> identical-prefix collision attack, where a given prefix P is extended with two distinct near-collision block pairs such that they collide for any suffix S

They have already precomputed the prefix (the PDF header) and the blocks (which I'm guessing is the part that tells the PDF reader to show one image or the other), and all you have to do is to populate the rest of the suffix with identical data (both images)


Yep. With any hash function that takes input as blocks, if you were to ever get two messages to generate the same EDIT internal state, you could add the same arbitrary data to both and get those new messages to have the same hash.


Is this length extending [1] the already existing Google attack?

[1] https://en.wikipedia.org/wiki/Length_extension_attack

Edit: yes, looks like it is.

As sp332 and JoachimSchipper mentioned, the novelty here is that it contains specially crafted code in order to conditionally display either picture based on previous data (the diff). I can't grok PDF so I still can't find the condition though. Can PDFs reference byte offsets? This is really clever.

Edit #2: I misunderstood the original Google attack. This is just an extension of it.


Yes, it's a length extension. Both images are in both outputs. Both outputs also contain a conditional switch to choose which image to show based on the previous data, where the collision lives.

Edit to your edit: This is more of a JPEG hack than a PDF hack. https://news.ycombinator.com/item?id=13715761


Yes; you extend with a carefully-chosen "if" to generate distinct-looking files.


It seems so. I can add the same arbitrary data at the end of two pdfs generated by this tool, and they are still a collision. I didn't know SHA-1 is so susceptible to length extension. Is there no internal state in the algorithm that would be different even if the hash output is identical?


If you were to somehow get two messages with the same SHA-3 hash, you could keep on appending the same data to both and they would keep the same SHA-3. But SHA-3 is explicitly not vulnerable to length extension attacks.


No they wouldn't, since its internal state is different than the output.

Same goes for SHA-224 and SHA-384.


Damn, right, you have to get them with the same internal state.


No. The messages differ in their internals.

The length extension attack leverages the weakness that people think HASH(secret + message) is a signature only they can create as long as only they know "secret".


I just tested Backblaze and found that its deduplication is broken by this. If you let it backup the two files generated by this tool, then restore them, it gives you back two identical files instead of two different ones.


I have never been particularly comfortable with the idea that you can simply assume that if the hashes are the same, then the data must be the same.

The fact that collisions happen so rarely (and sometimes only after the a hash function has become compromised) makes this even more dangerous.

It's like a couple of decades of strong hash functions has made people forget what hashing actually is.


I don't understand the whole collision thing. I mean a sha1 is 160bits so if you are hashing information longer then that collision is a fact, being able to forge a piece of information with constraints is the challenge and even that with enough power you end up being able to try all the combinations. What I understand from that collision reported is that they use PDF format which can have as many data inserted to it as comment /junk as you want so all you need is enough processing power to find the right padding/junk to insert to get the collision. Am I missing something here ?


I would imagine that a lot of old data is secured by SHA1, which may be available for attack.

Does anyone have any idea about a broad risk-assessment of systems worldwide that might be vulnerable as SHA1 becomes easier and easier to beat?


If by "secured by SHA1" you mean "someone generated a hash using SHA-1 and we use the validity of that hash to guarantee we have the right document," that's still okay. We're a long way from being able to make documents with a given SHA-1.

(Edit: Any newly signed documents, or documents signed recently, are not safe, because an nasty person could have made two, one to get signed by the system, another to do evil with.)

SHA-1 is officially deprecated for certificates, because of the example that OP shows. You can create two certificates, have the decent one get signed by a CA, and then use the evil one to intercept traffic.


Thanks for the info. Good point, I suppose anyone relying on SHA1 in 2017 has had ample warning about its weaknesses.

It seems that there is also a very strong incentive for anyone receiving anything whose authenticity is verified by SHA1 to request an improved hashing algorithm.


Practical question: does this generate a "harmful" (harmful to a repo system like SVN) PDF if the flaw in the hashing is enough to crash/corrupt the system?


To answer my own question looks like Gmail has flagged PDFs generated with this specific hash as harmful.


Doesn't work for me. One of PDFs always says "Insufficient data for an image" (sometimes for the same image that worked before).


I found that you have to reload the page if there's an error or it'll stick. In my case it was too big of a file.


Damn that didn't take long to go from $100K to carry out this attack to a single day to offer a website for SHA1 collision as a service...


They aren't equivalent.


Just wait until you get the invoice


Is the invoice in a PDF? What's the SHA-1 hash of it?


This is using that same $100K.


Yup! At this rate we will have AWS SHA1 Collision as a Web Service by tomorrow morning.


What's the smallest file you can make collide? Could you make two files collide that are actually smaller than their hashes?


Question: What does git use to hash blobs? Is it SHA-1?

Would that be a problem? Ramifications are unclear to me.


Ah, note that apparently it is SHA-1, and this question was common enough that Linus himself has addressed it:

https://news.ycombinator.com/item?id=13733481


Just tried it and it really got me the same sha1... damn...


I wanted to try this tool, but the upload requirements were so stringent (must be jpeg, must be same aspect ratio, must be less than 64KB) that I gave up. Would be nice if sample jpegs were provided on the page.


Here's what I got: http://i.imgur.com/d0lzPv9.jpg http://i.imgur.com/NBIKKRX.jpg

The tool gives me two pdfs that both hash to b8d36cccd01e5a4569ba01f9d15b54efedcd5d9f. It works... scary!


If you want to do a simple test, you can just generate some images with ImageMagick:

  $ convert -size 512x512 canvas:red canvas_red.jpg
  $ convert -size 512x512 canvas:blue canvas_blue.jpg


... are you joking?


(not OP) I understand why the requirements are strict so I can see why s/he was being downvoted, but I do agree that samples would be nice. Like, do I have to go search for some images with these requirements just to see if it really works (since it was supposed to take $100k)? By now someone mentioned it works in the comments though, so I guess I'll trust them.


Is it really that hard to take a screenshot of a couple windows on your desktop and run:

$ convert screenshot.png -resize '500x500!' bar.jpg

or

$ sips -s format jpeg -Z 500x500 screenshot.png --out bar.jpg

Generating some workable source images is trivial. You're not interested in making a 'perfect hack pdf' for some nefarious purpose, just seeing if the service does what it says it does.


> Generating some workable source images is trivial

Everything is trivial if you know how to do it off the top of your head. Generating white noise samples is trivial. UV-mapping a texture to a mesh is trivial. Soldering resisters to a PCB is trivial. Generating an ARM toolchain with Yocto Project is trivial.

Is it a crime that I didn't feel like researching a bunch of CLI tools I've never heard of to try to use an app I was only mildly curious about?


This overstates the difficulty of creating a JPEG. If you have an image editor on hand that isn't MS Paint (or you're using macOS, which has Preview), you merely need to export a JPEG and choose a low quality setting.

For a Windows user with no decent image editor or viewer installed, though, it could certainly be a hassle.


None of the things I listed are difficult. They just require the right tool and know how. For someone like me with Xubuntu, I had neither the tools nor know how for creating small jpegs. I didn't fell like wasting 30 minutes researching and I didn't feel like walking over to a different computer that has MSPaint.


Most people wouldn't have such small JPGs lying around on their computer, and even less likely that they have 2 of them with the same aspect ratio. So, it's likely that they'll have to go and create them, which is quite a hassle. I know because I went to the site and then after mucking around in some of my directories, gave up.


You don't need to have such JPGs. You just crop/resize them in Gimp or Paint.


Well, a sample pair of PDFs would be nice, so we didn't have to go looking for or creating the JPGs...


Ya it was annoying - here is my output tho: https://drive.google.com/drive/folders/0BzPYSqARS6D6SUZUYS1p...


I just did an arbitrary google image search specifying the image size to be 512x512, and they worked fine.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: