
SHA-1 collider: Make your own colliding PDFs - ascorbic
https://alf.nu/SHA1
======
xiphias
I think the coolest proof was that the Bitcoin SHA1 collision bounty was
claimed:

[https://bitcointalk.org/index.php?topic=293382.0](https://bitcointalk.org/index.php?topic=293382.0)

The OP_2DUP OP_EQUAL OP_NOT OP_VERIFY OP_SHA1 OP_SWAP OP_SHA1 OP_EQUAL script
ensures that only a person who finds two distinct SHA-1-colliding inputs and
publishes them can claim the 2.5 BTC bounty.
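
The script's check can be sketched in Python (a toy model of the stack
operations, not a real Bitcoin Script interpreter):

```python
import hashlib

def bounty_script(preimage_a: bytes, preimage_b: bytes) -> bool:
    """Toy model of OP_2DUP OP_EQUAL OP_NOT OP_VERIFY OP_SHA1 OP_SWAP OP_SHA1 OP_EQUAL."""
    # OP_2DUP OP_EQUAL OP_NOT OP_VERIFY: the two inputs must differ
    if preimage_a == preimage_b:
        return False
    # OP_SHA1 OP_SWAP OP_SHA1 OP_EQUAL: their SHA-1 digests must match
    return hashlib.sha1(preimage_a).digest() == hashlib.sha1(preimage_b).digest()
```

The shattered.io PDF pair is exactly a satisfying input: two different byte
strings with the same SHA-1.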

~~~
j_s
HN discussion:
[https://news.ycombinator.com/item?id=13714987](https://news.ycombinator.com/item?id=13714987)

------
Deregibus
This was a good explanation of what's happening here from a previous thread:
[https://news.ycombinator.com/item?id=13715761](https://news.ycombinator.com/item?id=13715761)

The key is that essentially all of the data for both images are in both PDFs,
so the PDFs are almost identical except for a ~128 byte block that "selects"
the image and provides the necessary bytes to cause a collision.

Here's a diff of the two PDFs from when I tried it earlier:
[https://imgur.com/a/8O58Q](https://imgur.com/a/8O58Q)

Not to say that there isn't still something exploitable here, but I don't
think it means that you can just create collisions from arbitrary PDFs.

edit: Here's a diff of shattered-1.pdf released by Google vs. one of the PDFs
from this tool. The first ~550 bytes are identical.

[https://imgur.com/a/vVrrQ](https://imgur.com/a/vVrrQ)
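
A byte-level comparison like the ones in those screenshots is easy to
reproduce; here's a minimal sketch (a hypothetical helper, not the tool the
diffs above came from):

```python
def differing_ranges(a: bytes, b: bytes):
    """Yield (start, end) offsets where two equal-length blobs differ."""
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = i
        elif start is not None:
            yield (start, i)
            start = None
    if start is not None:
        yield (start, len(a))
```

Run on the two generated PDFs, this should report only a single small region
near the start of the files (the near-collision blocks).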

~~~
niftich
I didn't get a chance to make this point in that other thread, because the
thread [1] of its follow-ups quickly morphed from promising [2] to meandering.
The combination of lax formats (PDF and JPEG in this instance) makes this
style of collision something of a cheap shot, though still devastatingly
damaging given the ubiquity of both PDF and JPEG -- separately and together --
in document storage and archival.

This shows the importance of techniques like canonicalization and determinism,
which ensure that a given result could only have been arrived at from exactly
one input. For general-purpose programming languages like PostScript, from
which PDF descends, this is essentially an unfulfillable requirement, as any
number of input "source code" variants can produce
observationally "same" results. Constrained formats, and formats where the set
of 'essential information' can be canonicalized into a particular
representation should be the norm, rather than the exotic exception,
especially in situations where minute inessential differences can be cascaded
to drastically alter the result.

[1]
[https://news.ycombinator.com/item?id=13715761](https://news.ycombinator.com/item?id=13715761)
[2]
[https://news.ycombinator.com/item?id=13718772](https://news.ycombinator.com/item?id=13718772)
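
As a concrete (if far simpler) illustration of canonicalization: hash a
canonical serialization of the data rather than whatever bytes happen to
arrive. A sketch using JSON:

```python
import hashlib
import json

def canonical_digest(obj) -> str:
    # Sorted keys + fixed separators: logically equal objects always
    # produce byte-identical input to the hash function.
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()

# Two different in-memory orderings hash identically:
assert canonical_digest({"a": 1, "b": 2}) == canonical_digest({"b": 2, "a": 1})
```

PDF offers no such canonical form, which is what makes "two visually
different documents, one hash" so easy to hide.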

~~~
smallnamespace
> Constrained formats, and formats where the set of 'essential information'
> can be canonicalized into a particular representation should be the norm,
> rather than the exotic exception, especially in situations where minute
> inessential differences can be cascaded to drastically alter the result.

That might be very challenging in practice, because a more expressive language
directly allows a more compressed/efficient encoding of the same information,
but at the cost of being more difficult (or impossible) to create a canonical
representation.

Also, data formats that are purposely redundant for error tolerance basically
all have the property that readers should be tolerant of non-canonical forms.
If we want to represent some bytes redundantly in case of data loss, then
there _must_ be multiple representations of those bytes that are all
acceptable to the reader for this to work.

Video and image formats use multiple encodings to give encoders the room to
make time-space trade-offs.

~~~
acqq
I agree: for anything a human is supposed to see with their eyes, there are
always different representations that look "same" enough to be
indistinguishable.

People should be aware of it, and not believe in a non-existent world where it
isn't so.

------
nneonneo
Shameless plug: I built my own version of this to collide arbitrary PDFs:
[https://github.com/nneonneo/sha1collider](https://github.com/nneonneo/sha1collider)

My version is similar to this one, but removes the 64 kB JPEG limit and
allows for colliding multi-page PDFs.

~~~
versteegen
Excellent! This ought to be the top comment; have you submitted it separately?
I was considering implementing the same thing, since there wasn't any reason
for those limitations. The Google researchers purposely designed this
collision to be highly reusable, probably to encourage everyone to generate
lots of fun examples which will be spread widely and break systems everywhere
:)

I'm surprised that some PDF renderers have problems with JPEG streams that
contain comments and restarts. Actually, glancing at the JPEG spec, I hadn't
even realised that restarts would be needed; I thought this was just done with
comments. Is this really "bending" the spec, or is GhostScript buggy, or is
the problem that it's assuming that restarts don't occur inside comments
without escaping?

~~~
nneonneo
I did submit it, but I don't think anyone saw it:
[https://news.ycombinator.com/item?id=13729920](https://news.ycombinator.com/item?id=13729920).
I was trying to finish it in time to catch daytime folks but work got in the
way :)

Comment segments are capped just under 64 kB (the 16-bit length field counts
itself), and the JPEG spec doesn't offer any way to break an image stream up
except for progressive mode or restart intervals (otherwise, the image stream
must be a single intact entity). The trouble is
that it's probably not _technically_ legal to stick comments before restart
interval markers (putting comments _after_ the markers just breaks all the
readers since presumably they are expecting image data). So, GhostScript's
JPEG reader gets confused when it fails to see a restart marker immediately
following an image data chunk, and aborts decoding.
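
The ~64 kB ceiling comes from the COM marker's 16-bit length field. A toy
scanner for COM segments makes this concrete (it ignores entropy-coded data,
which a real parser must track):

```python
def com_segments(data: bytes):
    """Return (offset, payload_len) for each COM (0xFF 0xFE) segment found.

    The 16-bit length field counts its own two bytes, so a single JPEG
    comment can carry at most 65533 payload bytes -- hence the need to
    split a large image stream across multiple segments.
    """
    out, i = [], 0
    while i + 4 <= len(data):
        if data[i] == 0xFF and data[i + 1] == 0xFE:
            length = int.from_bytes(data[i + 2:i + 4], "big")
            out.append((i, length - 2))
            i += 2 + length
        else:
            i += 1
    return out
```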

------
TorKlingberg
Wow, it works! I thought this was supposed to require Google-level resources
and months of processing time. Did the initial collision somehow enable more
similar ones?

~~~
kiallmacinnes
yea, this.. wow.

Is this the first hash function which went from "secure" to collision-as-a-
service in a matter of days? Was SHA-1 particularly weak, or the published
research particularly strong? Or maybe something else?

~~~
schoen
As other people explained, and just to summarize, this service is a way of
reusing Google's specific collision (as a prefix to new collisions), and isn't
a way of making arbitrary colliding files or file formats. You can't, for
example, use this to make something other than PDFs that collide; for that,
you'll have to redo a computation on the scale of Google's!

~~~
orblivion
I thought I heard that some file formats have "headers" that go at the end of
the file. I think a demo of this was a file that was both a zip and a PNG or
something. If I remember right, you should be able to make a similar hack
here.

~~~
schoen
If the _only_ headers are at the end, then yes, that's a really neat idea. I'm
not sure of the constraints for zip files. Maybe it would be interesting to
brainstorm with some people with a lot of file format knowledge to find
additional such formats. But if the formats have any constraints at all on the
first few hundred bytes, those generally won't be satisfied by the prefix
here.
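
ZIP is exactly such a format: readers locate the central directory by
scanning backwards from the end of the file, so arbitrary leading bytes are
tolerated. A quick demonstration with Python's zipfile:

```python
import io
import zipfile

# Build a tiny zip archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("hello.txt", "hi")

# Prepend junk bytes (a stand-in for some other format's data). The archive
# still opens, because the end-of-central-directory record is found by
# scanning backwards from the tail of the file.
polyglot = b"arbitrary prefix bytes" + buf.getvalue()
with zipfile.ZipFile(io.BytesIO(polyglot)) as z:
    assert z.read("hello.txt") == b"hi"
```

This is the same property that self-extracting archives rely on, and why a
collision-block prefix could in principle coexist with a valid zip.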

------
michaf
I just constructed a little POC for bittorrent:
[https://github.com/michaf/shattered-torrent-
poc](https://github.com/michaf/shattered-torrent-poc)

Installer.py and Installer-evil.py are both valid seed data for
Installer.torrent ...

~~~
Buge
Nice, you're able to line it up so the colliding data starts right at the
beginning of one of the torrent blocks.
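
That alignment matters because BitTorrent verifies each fixed-size piece with
an independent SHA-1: if every piece of both files hashes identically, both
validate against the same .torrent. A sketch (toy piece length; real torrents
use something like 256 KiB):

```python
import hashlib

def piece_hashes(data: bytes, piece_len: int = 4) -> list:
    # BitTorrent stores one SHA-1 digest per fixed-size piece in the
    # .torrent file; a client accepts any data whose pieces reproduce
    # those digests.
    return [hashlib.sha1(data[i:i + piece_len]).digest()
            for i in range(0, len(data), piece_len)]

# Two files that differ only within a piece whose SHA-1 collides would
# yield identical piece-hash lists, so both would seed the same torrent.
```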

------
lxe
According to [the Shattered
paper](http://shattered.io/static/shattered.pdf), the reason the
proofs-of-concept are PDFs is that we are looking at an

> identical-prefix collision attack, where a given prefix P is extended with
> two distinct near-collision block pairs such that they collide for any
> suffix S

They have already precomputed the prefix (the PDF header) and the blocks
(which I'm guessing are the part that tells the PDF reader to show one image
or the other), and all you have to do is populate the rest of the suffix with
identical data (both images).

~~~
danielweber
Yep. With any hash function that takes input as blocks, if you were to ever
get two messages to generate the same internal state, you could add the same
arbitrary data to both and get those new messages to have the same hash.
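
hashlib makes the "same internal state ⇒ same digest after any common suffix"
property easy to see by copying a hash object's state (here the copies stand
in for two *different* colliding prefixes that reach the same state):

```python
import hashlib

# Two hash objects with identical internal state. In the SHAttered case,
# two different 320-byte prefixes reach the same state; here we just copy.
h = hashlib.sha1(b"common prefix")
h1, h2 = h.copy(), h.copy()

# Appending the same suffix to both keeps the digests equal forever.
h1.update(b"any shared suffix")
h2.update(b"any shared suffix")
assert h1.hexdigest() == h2.hexdigest()
```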

------
fivesigma
Is this length extending [1] the already existing Google attack?

[1]
[https://en.wikipedia.org/wiki/Length_extension_attack](https://en.wikipedia.org/wiki/Length_extension_attack)

Edit: yes, looks like it is.

As sp332 and JoachimSchipper mentioned, the novelty here is that it contains
specially crafted code in order to conditionally display either picture based
on previous data (the diff). I can't grok PDF so I still can't find the
condition though. Can PDFs reference byte offsets? This is really clever.

Edit #2: I misunderstood the original Google attack. This is just an extension
of it.

~~~
TorKlingberg
It seems so. I can add the same arbitrary data at the end of two pdfs
generated by this tool, and they are still a collision. I didn't know SHA-1 is
so susceptible to length extension. Is there no internal state in the
algorithm that would be different even if the hash output is identical?

~~~
danielweber
If you were to somehow get two messages with the same SHA-3 hash, you could
keep on appending the same data to both and they would keep the same SHA-3.
But SHA-3 is explicitly not vulnerable to length extension attacks.

~~~
fivesigma
No they wouldn't, since its internal state is different from the output.

Same goes for SHA-224 and SHA-384.

~~~
danielweber
Damn, right, you have to get them with the same _internal state_.

------
averagewall
I just tested Backblaze and found that its deduplication is broken by this.
If you let it back up the two files generated by this tool, then restore
them, it gives you back two identical files instead of two different ones.

~~~
lokedhs
I have never been particularly comfortable with the idea that you can simply
assume that if the hashes are the same, then the data must be the same.

The fact that collisions happen so rarely (and sometimes only after a hash
function has become compromised) makes this even more dangerous.

It's like a couple of decades of strong hash functions has made people forget
what hashing actually is.

------
neo2006
I don't understand the whole collision thing. I mean, SHA-1 is 160 bits, so
if you are hashing information longer than that, collisions are a fact of
life; being able to forge a piece of information under constraints is the
challenge, and even that falls with enough power, since you end up being able
to try all the combinations. What I understand from the reported collision is
that they use the PDF format, which can have as much comment/junk data
inserted into it as you want, so all you need is enough processing power to
find the right padding/junk to insert to get the collision. Am I missing
something here?
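
The missing piece is scale: a generic birthday attack on any 160-bit hash
needs about 2^80 evaluations, while the SHAttered attack reportedly cost
about 2^63 SHA-1 computations. The gap is cryptanalysis of SHA-1's
compression function, not just padding plus brute force:

```python
# Generic birthday bound for a 160-bit hash vs. the reported SHAttered cost.
generic_collision = 2 ** 80   # ~1.2e24 hash evaluations, pure brute force
shattered_cost = 2 ** 63      # ~9.2e18, per the SHAttered paper

# The structural attack is about 2^17 (~130,000x) cheaper than brute force,
# and brute force alone remains far outside practical reach.
assert generic_collision // shattered_cost == 2 ** 17
```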

------
grandalf
I would imagine that a lot of old data is secured by SHA1, which may be
available for attack.

Does anyone have any idea about a broad risk-assessment of systems worldwide
that might be vulnerable as SHA1 becomes easier and easier to beat?

~~~
danielweber
If by "secured by SHA1" you mean "someone generated a hash using SHA-1 and we
use the validity of that hash to guarantee we have the right document," that's
still okay. We're a long way from being able to make documents with a given
SHA-1.

(Edit: Any _newly_ signed documents, or documents signed recently, are not
safe, because a nasty person could have made two: one to get signed by the
system, another to do evil with.)

SHA-1 is officially deprecated for certificates, because of the example that
OP shows. You can create two certificates, have the decent one get signed by a
CA, and then use the evil one to intercept traffic.

~~~
grandalf
Thanks for the info. Good point, I suppose anyone relying on SHA1 in 2017 has
had ample warning about its weaknesses.

It seems that there is also a very strong incentive for anyone receiving
anything whose authenticity is verified by SHA1 to request an improved hashing
algorithm.

------
reiichiroh
Practical question: does this generate a "harmful" PDF (harmful to a repo
system like SVN)? Is the flaw in the hashing enough to crash/corrupt the
system?

~~~
reiichiroh
To answer my own question: it looks like Gmail has flagged PDFs generated
with _this_ specific hash as harmful.

------
Grue3
Doesn't work for me. One of PDFs always says "Insufficient data for an image"
(sometimes for the same image that worked before).

~~~
averagewall
I found that you have to reload the page if there's an error or it'll stick.
In my case it was too big of a file.

------
Globz
Damn, that didn't take long: from ~$100K to carry out this attack to a
website offering SHA-1 collisions as a service in a single day...

~~~
daenz
Just wait until you get the invoice

~~~
kalleboo
Is the invoice in a PDF? What's the SHA-1 hash of it?

------
odbol_
What's the smallest file you can make collide? Could you make two files
collide that are actually smaller than their hashes?

------
FabHK
Question: What does git use to hash blobs? Is it SHA-1?

Would that be a problem? Ramifications are unclear to me.

~~~
FabHK
Ah, note that apparently it is SHA-1, and this question was common enough that
Linus himself has addressed it:

[https://news.ycombinator.com/item?id=13733481](https://news.ycombinator.com/item?id=13733481)

------
b1gtuna
Just tried it and it really got me the same sha1... damn...

------
ythn
I wanted to try this tool, but the upload requirements were so stringent (must
be jpeg, must be same aspect ratio, must be less than 64KB) that I gave up.
Would be nice if sample jpegs were provided on the page.

~~~
lsh
... are you joking?

~~~
lucb1e
(not OP) I understand why the requirements are strict, so I can see why s/he
was being downvoted, but I do agree that samples would be nice. Do I have to
go search for images meeting these requirements just to see if it really
works (since it was supposed to take $100k)? Someone in the comments has
mentioned it works by now, though, so I guess I'll trust them.

~~~
shaymcrasherson
Is it really that hard to take a screenshot of a couple windows on your
desktop and run:

$ convert screenshot.png -resize '500x500!' bar.jpg

or

$ sips -s format jpeg -Z 500x500 screenshot.png --out bar.jpg

Generating some workable source images is trivial. You're not interested in
making a 'perfect hack pdf' for some nefarious purpose, just seeing if the
service does what it says it does.

~~~
ythn
> Generating some workable source images is trivial

Everything is trivial if you know how to do it off the top of your head.
Generating white noise samples is trivial. UV-mapping a texture to a mesh is
trivial. Soldering resistors to a PCB is trivial. Generating an ARM toolchain
with the Yocto Project is trivial.

Is it a crime that I didn't feel like researching a bunch of CLI tools I've
never heard of to try to use an app I was only mildly curious about?

~~~
TazeTSchnitzel
This overstates the difficulty of creating a JPEG. If you have an image editor
on hand that isn't MS Paint (or you're using macOS, which has Preview), you
merely need to export a JPEG and choose a low quality setting.

For a Windows user with no decent image editor or viewer installed, though, it
could certainly be a hassle.

~~~
ythn
None of the things I listed are difficult. They just require the right tools
and know-how. For someone like me on Xubuntu, I had neither the tools nor the
know-how for creating small JPEGs. I didn't feel like wasting 30 minutes
researching, and I didn't feel like walking over to a different computer that
has MS Paint.

