The OP_2DUP OP_EQUAL OP_NOT OP_VERIFY OP_SHA1 OP_SWAP OP_SHA1 OP_EQUAL script was making sure that only someone who finds two distinct inputs with the same SHA-1 hash and publishes them can claim the 2.5 BTC bounty.
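Roughly, in Python (the function name is mine, not anything from Bitcoin):

import hashlib

def bounty_script_accepts(a: bytes, b: bytes) -> bool:
    # OP_2DUP OP_EQUAL OP_NOT OP_VERIFY: the two stack items must differ
    if a == b:
        return False
    # OP_SHA1 OP_SWAP OP_SHA1 OP_EQUAL: their SHA-1 digests must match
    return hashlib.sha1(a).digest() == hashlib.sha1(b).digest()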
The key is that essentially all of the data for both images is in both PDFs, so the PDFs are almost identical except for a ~128-byte block that "selects" the image and provides the necessary bytes to cause a collision.
Here's a diff of the 2 PDFs from when I tried it earlier: https://imgur.com/a/8O58Q
Not to say that there isn't still something exploitable here, but I don't think it means that you can just create collisions from arbitrary PDFs.
Here's a diff of shattered-1.pdf released by Google vs. one of the PDFs from this tool. The first ~550 bytes are identical.
This shows the importance of techniques like canonicalization and determinism, which ensure that, given a particular set of essential information, a result could only have been arrived at from exactly one input. For general-purpose programming languages like PostScript, from which PDF is derived, this is essentially an unfulfillable requirement, since any number of different input "source codes" can produce observationally "same" results. Constrained formats, and formats where the set of essential information can be canonicalized into a single representation, should be the norm rather than the exotic exception, especially in situations where minute inessential differences can cascade to drastically alter the result.
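A tiny illustration of the idea, using Python's json module purely as an example: with sorted keys and fixed separators, semantically equal values get exactly one byte representation, so their hashes agree.

import hashlib, json

def canonical_digest(obj) -> str:
    # One canonical encoding per value: sorted keys, no optional whitespace.
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(data).hexdigest()

assert canonical_digest({"a": 1, "b": 2}) == canonical_digest({"b": 2, "a": 1})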
That might be very challenging in practice: a more expressive language directly allows a more compressed/efficient encoding of the same information, but at the cost of making a canonical representation more difficult (or impossible) to construct.
Also, data formats that are purposefully redundant for error tolerance all basically have the property that readers should be tolerant of non-canonical forms. If we want to represent some bytes redundantly in case of data loss, then there must be multiple representations of those bytes that are all acceptable to the reader, or the scheme doesn't work.
Video and image formats use multiple encodings to give encoders the room to make time-space trade-offs.
People should be aware of it, not believe in a nonexistent world where it isn't so.
Add ELF and Zip to the list. Many common file formats have areas where you can insert an arbitrary chunk of data without significant side effects.
ELF allows for a very flexible layout, and is almost certainly vulnerable to this length-extension-based attack.
Zip allows a comment at the end of the central directory. Since the central directory is at the end of the file, I don't know if it's vulnerable to this exact attack.
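For what it's worth, the Zip comment field is even exposed by Python's standard library, so stuffing arbitrary bytes into an archive without affecting its contents is a one-liner (the archive name here is made up):

import zipfile

with zipfile.ZipFile("archive.zip", "a") as zf:
    zf.comment = b"arbitrary bytes, up to 65535 of them"  # EOCD comment field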
> This is an identical-prefix collision attack, where a given prefix P is extended with two distinct near-collision block pairs such that they collide for any suffix S
The near-collision block pairs are the difference everyone can see in the image. Whoever created the PDFs did everyone the courtesy of keeping the suffix the same. There are numerous examples already of different PDFs with the same hash.
My version is similar to this, but removes the 64 KB JPEG comment limit and allows for colliding multi-page PDFs.
I'm surprised that some PDF renderers have problems with JPEG streams that contain comments and restarts. Actually, glancing at the JPEG spec, I didn't even realise that restarts would be needed, I thought this was just done with comments. Is this really "bending" the spec, or is GhostScript buggy, or is the problem that it's assuming that restarts don't occur inside comments without escaping?
Comments are limited to just under 64 KB each (the segment's two-byte length field counts itself), and the JPEG spec doesn't offer any way to break an image stream up except for progressive mode or restart intervals (otherwise, the image stream must be a single intact entity). The trouble is that it's probably not _technically_ legal to stick comments before restart interval markers (putting comments _after_ the markers just breaks all the readers, since presumably they are expecting image data). So GhostScript's JPEG reader gets confused when it fails to see a restart marker immediately following an image data chunk, and aborts decoding.
edit: as versteegen points out it's 320 bytes, not 304.
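For reference, a COM segment is just a marker plus a big-endian length that counts itself; here's a sketch of building one by hand, based on my reading of the spec:

import struct

def com_segment(payload: bytes) -> bytes:
    # COM marker 0xFFFE, then a 2-byte length covering itself + the payload.
    assert len(payload) <= 0xFFFF - 2
    return b"\xff\xfe" + struct.pack(">H", len(payload) + 2) + payload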
I learned a lot from it. One thing is that this property holds for any Merkle-Damgård-type hash whose internal state is the same size as its digest. That's true of SHA-1 and of several other famous and widely used hashes, but not of every hash; some of the most recent designs, including several SHA-3 candidates and SHA-3 itself, avoid it. In a hash without this property, you can have a collision H(X)=H(Y) (with len(X)=len(Y)) yet typically H(X+a)≠H(Y+a).
Edit: len(X)=len(Y) is also necessary because Merkle-Damgård hashes encode the message length into the internal padding, so two colliding inputs of different lengths will generally not produce a collision when the same string is appended to each.
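Easy to check with the published files (names assumed to be the shattered-1.pdf / shattered-2.pdf from shattered.io): because they collide in SHA-1's internal chaining state, any common suffix preserves the collision.

import hashlib

x = open("shattered-1.pdf", "rb").read()
y = open("shattered-2.pdf", "rb").read()
suffix = b"any common suffix at all"

assert x != y and len(x) == len(y)
assert hashlib.sha1(x).digest() == hashlib.sha1(y).digest()
# The internal state collides, so the collision survives any extension:
assert hashlib.sha1(x + suffix).digest() == hashlib.sha1(y + suffix).digest()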
It's why you don't use a bare hash as authentication, but instead use an HMAC.
HMAC is a neat trick to avoid length extension attacks (and other issues) in a generalized fashion, but that doesn't mean those risks actually apply in practice. (Some googling finds e.g. this paper: https://www.iacr.org/archive/fse2009/56650374/56650374.pdf which proposes an attack on length-and-key-prefixed messages, using some SHA-1 weaknesses and merely over 2^84 memory and 2^154 queries - color me impressed, but not scared.) Edit: just to be clear, I'm not suggesting anyone actually use LPMAC-SHA1 given the current state of SHA-1.
For another example: in general it's unsafe to truncate a "secure" hash - hashes can be constructed that satisfy most security requirements yet are not safe when truncated (e.g. SHA-3 prepended with zeros is still safe, but obviously not if you truncate off the SHA-3-provided bits). But I don't know of any mainstream hash where this theoretical risk actually applies (e.g. no Merkle-Damgård hash suffers from such a risk); nobody intentionally constructs hashes with more bits than entropy.
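The contrived construction from that parenthetical, sketched in Python: H below is still collision resistant (any collision in it is a collision in SHA3-256), but its truncation is worthless.

import hashlib

def H(m: bytes) -> bytes:
    # "Secure" but padded with known bits: more output bits than entropy.
    return b"\x00" * 16 + hashlib.sha3_256(m).digest()

def H_truncated(m: bytes) -> bytes:
    return H(m)[:16]  # always sixteen zero bytes: every message "collides"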
It's probably still wise to stick with known-good constructions, but the risks seem overstated, and the difficulty is also overstated - assuming the primitives used aren't too flawed. Sure, it's cool that HMAC can use even flawed things like MD5 and retain safety, but typically nobody is forcing you to stick with md5. I guess the world is more complicated if you need to pick a protocol and then you're unable to change it, but most applications can (with some effort) be changed. You need something safe now, not for all eternity.
So, I think the rule is simpler: this has little to do with crypto per se; just don't be unnecessarily clever, in general. Crypto makes the consequences particularly nasty, often. But that's about it.
Now, however, "don't roll your own crypto" is dogma, and if anything we have the opposite problem of monoculture and slow progress. I think a more nuanced view is required, one that encourages experimentation when the stakes are low and more competing implementations when the stakes are high (or perhaps we should call them "complementing" - a standard ought to have multiple implementations).
As Wikipedia puts it, "Mathematical analysis of [security] protocols is, at the time of this writing, not mature... Protocol design is an art requiring deep knowledge and much practice; even then mistakes are common." How are programmers to practice, if they are not allowed to fail?
(and no that's not what this attack does in reality, but just a logical framework for understanding why the images don't matter)
Is this the first hash function which went from "secure" to collision-as-a-service in a matter of days? Was sha1 particularly weak, or the published research particularly strong? or maybe something else?
But I wasn't expecting that Google's 110 GPU-years of work meant that we could create colliding PDFs on demand.
Usually it's when we hit "it's broken" that the average Joe developer/operator/etc. starts to form a migration plan, and usually they have some time before the attack becomes reasonable to execute outside of pure research or "nation state" attacks. It seems that block of time was just a few days (a day?!) for SHA1.
And it's that, even if we're talking about an easy-to-abuse format like PDFs apparently are, that makes me question what just happened. Was SHA1 secretly terrible? Was the research just that good? Or was there some other factor that allowed this to happen?
Also, sure, the major browsers have a retirement date for SHA1 as a TLS signature hash. But SHA1 is used for a whole pile of other stuff with absolutely no transition plan! January 1st was absolutely never going to be the last day people used SHA1.
If you want to dedicate a few hundred GPU-years to it, you could generate similar colliding prefixes for other formats by doing what Google did.
So basically H1 and H2 have the same SHA1 hash. By adding the suffix I1I2 to both, you get H1I1I2 and H2I1I2. That's the length extension.
This is really a terminology question. I had a clear understanding of "length extension attacks", but it seems people on this comment page are now using something else. I've been looking over crypto.stackexchange and Twitter to see if I missed the memo, but this looks like a new usage.
Is it? How? It's a simple case of length extension, just that here, since we have two independent starting points sharing the same state, we start with a collision and we extend to a collision.
In other words, these are two length extensions on independent prefixes. It just happens that these prefixes share the same state / hash, hence the surprising result (on a first glance).
Installer.py and Installer-evil.py are both valid seed data for Installer.torrent ...
> identical-prefix collision attack, where a given prefix P is extended with two distinct near-collision block pairs such that they collide for any suffix S
They have already precomputed the prefix (the PDF header) and the blocks (which I'm guessing are the part that tells the PDF reader to show one image or the other), and all you have to do is populate the rest of the suffix with identical data (both images).
Edit: yes, looks like it is.
As sp332 and JoachimSchipper mentioned, the novelty here is that it contains specially crafted code to conditionally display either picture based on the preceding data (the diff). I can't grok PDF, so I still can't find the condition. Can PDFs reference byte offsets? This is really clever.
Edit #2: I misunderstood the original Google attack. This is just an extension of it.
Edit to your edit: This is more of a JPEG hack than a PDF hack. https://news.ycombinator.com/item?id=13715761
Same goes for SHA-224 and SHA-384.
The length extension attack leverages the weakness that people think HASH(secret + message) is a signature that only they can create as long as only they know "secret".
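A sketch of the difference (the key and message here are made up): the first tag can be extended without knowing the secret, e.g. with a tool like hashpump, while the HMAC can't.

import hashlib, hmac

secret = b"server-side-key"
message = b"user=alice&role=user"

# Vulnerable: SHA-1's output IS its final internal state, so an attacker
# holding this tag (and knowing or guessing len(secret)) can resume the
# hash and produce a valid tag for message + padding + b"&role=admin"
# without ever learning the secret.
naive_tag = hashlib.sha1(secret + message).hexdigest()

# HMAC hashes twice with derived inner/outer keys, so no length extension.
safe_tag = hmac.new(secret, message, hashlib.sha1).hexdigest()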
The fact that collisions happen so rarely (and sometimes only after a hash function has become compromised) makes this even more dangerous.
It's like a couple of decades of strong hash functions has made people forget what hashing actually is.
Does anyone have any idea about a broad risk-assessment of systems worldwide that might be vulnerable as SHA1 becomes easier and easier to beat?
(Edit: Any newly signed documents, or documents signed recently, are not safe, because a nasty person could have made two: one to get signed by the system, another to do evil with.)
SHA-1 is officially deprecated for certificates, because of the example that OP shows. You can create two certificates, have the decent one get signed by a CA, and then use the evil one to intercept traffic.
It seems that there is also a very strong incentive for anyone receiving anything whose authenticity is verified by SHA1 to request an improved hashing algorithm.
Would that be a problem? Ramifications are unclear to me.
The tool gives me two pdfs that both hash to b8d36cccd01e5a4569ba01f9d15b54efedcd5d9f. It works... scary!
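You can verify your own output with a couple of lines of Python (file names assumed to be whatever the tool gave you):

import hashlib

for name in ("collision1.pdf", "collision2.pdf"):
    print(name, hashlib.sha1(open(name, "rb").read()).hexdigest())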
$ convert -size 512x512 canvas:red canvas_red.jpg
$ convert -size 512x512 canvas:blue canvas_blue.jpg
$ convert screenshot.png -resize '500x500!' bar.jpg
$ sips -s format jpeg -Z 500x500 screenshot.png --out bar.jpg
Generating some workable source images is trivial. You're not interested in making a 'perfect hack pdf' for some nefarious purpose, just seeing if the service does what it says it does.
Everything is trivial if you know how to do it off the top of your head. Generating white noise samples is trivial. UV-mapping a texture to a mesh is trivial. Soldering resistors to a PCB is trivial. Generating an ARM toolchain with Yocto Project is trivial.
Is it a crime that I didn't feel like researching a bunch of CLI tools I've never heard of to try to use an app I was only mildly curious about?
For a Windows user with no decent image editor or viewer installed, though, it could certainly be a hassle.