Some more info on why the 16-way collision isn't hard:
This is a Nostradamus attack. For any Merkle-Damgard hash function (MD5, SHA-1, SHA-2; this structure is a crucial part of why the recent SHA-1 collision can be used to create arbitrary colliding PDFs), if hash(A) = hash(B), then hash(A+C) = hash(B+C), assuming A and B end at block-size boundaries (pad them if not). So you can always append the same suffix to colliding strings and get new colliding strings.
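A small illustration of the iterated structure behind this (it only shows incremental hashing, not a collision: Python's hashlib exposes the running state via .copy(), and the state after a prefix fully determines how any suffix is absorbed):

```python
import hashlib

# Snapshot the internal MD5 state after a prefix, then continue with a suffix.
# A real collision leaves two *different* prefixes in the same internal state,
# so extending both with the same suffix (at block boundaries) preserves equality.
h = hashlib.md5(b"some prefix")
h2 = h.copy()              # copy of the internal state after the prefix
h2.update(b"same suffix")  # continue hashing from that state
assert h2.digest() == hashlib.md5(b"some prefixsame suffix").digest()
```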
Now, a (second-)preimage attack is beyond your means: given a string, it's hard to find another input that hashes to the same thing.
But, given two strings, it's possible to mix crap with both of them till they give you the same hash. This is reasonably fast for MD5, and expensive (but still within the means of folks like Google) for SHA1.
So what you do is you first create images for each digit. Pair them up.
Now, take each pair and append crap to both members until you find a collision; now you have 8 colliding pairs. Applying the suffix property to hash(digit0 + crap0) = hash(digit1 + crap1): hash(digit0 + crap0 + crap01) = hash(digit1 + crap1 + crap01). Appending the same suffix (crap01) to both will still get you a collision.
Now, pair up each pair. Let's say we take the digit0/digit1 pair and pair it with digit2/digit3. Find a collision between digit0+crap0 and digit2+crap2. Let's say that hash(digit0 + crap0 + morecrap0) = hash(digit2 + crap2 + morecrap2). Realize that this is also equal to hash(digit1 + crap1 + morecrap0) and hash(digit3 + crap3 + morecrap2) (since we can add the morecrap suffix to the already-colliding digit+crap combinations to get new collisions).
Now we have a four-way collision. Repeat with the other pairs and you have 4 of these. Do the same process again to get two 8-way collisions, then once more to get one 16-way collision. Ultimately you'll have a binary tree of pairwise collisions.
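The pairing scheme above can be sketched end to end with a toy iterated hash (my own 24-bit stand-in for MD5, chosen so brute-force collision search is cheap; like any Merkle-Damgard hash, equal states after two prefixes stay equal when the same suffix is appended):

```python
from itertools import count

def toyhash(data):
    # A miniature iterated hash with a 24-bit state. The final state IS the
    # hash, so hash(A) == hash(B) implies hash(A+C) == hash(B+C) exactly.
    h = 0xABCDEF
    for byte in data:
        h = ((h * 0x01000193) ^ byte) & 0xFFFFFF  # FNV-style mixing step
    return h

def find_collision(a, b):
    # "Mix crap with both until they give the same hash": append counters to
    # each message until some extension of a collides with some extension of b.
    # Birthday bound: ~2**12 attempts for a 24-bit hash.
    seen = {}
    for i in count():
        ea = a + b"/%d" % i
        seen[toyhash(ea)] = ea
        eb = b + b"\\%d" % i
        if toyhash(eb) in seen:
            return seen[toyhash(eb)], eb

def merge(ga, gb):
    # Collide one representative of each already-colliding group, then append
    # each side's "crap" suffix to every member of that group; the suffix
    # property keeps the within-group collisions intact.
    ea, eb = find_collision(ga[0], gb[0])
    return [m + ea[len(ga[0]):] for m in ga] + [m + eb[len(gb[0]):] for m in gb]

groups = [[b"digit%X" % d] for d in range(16)]
while len(groups) > 1:  # 16 -> 8 -> 4 -> 2 -> 1 groups
    groups = [merge(groups[i], groups[i + 1]) for i in range(0, len(groups), 2)]

msgs = groups[0]
assert len(set(msgs)) == 16            # 16 distinct messages...
assert len({toyhash(m) for m in msgs}) == 1  # ...all with the same hash
```

The real attack is the same tree of 15 pairwise collisions, just with MD5 collision blocks (e.g. from fastcoll) instead of brute force.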
Thank you for this explanation! Can you tell me more about why it is easy to "mix crap with both until they give you the same hash?" Specifically in the context of MD5.
For the full details I think you'd have to dive quite a bit into MD5's internals; this isn't something that was supposed to be easy, it's a flaw in the algorithm.
This is HN, surely someone can point me in the right direction to learn more about the specific part of MD5 that makes it weak to this, even if it is only conceptual.
It's moderately expensive even with MD5. Ideally you'd like to use the same prefix and use the collision itself to choose between the two. This is easy with code, trickier with images, and even trickier with text-based PDFs but still possible: https://www.makomk.com/~aidan/selfmd5-lastfix.pdf
Ah darn, for some reason I thought there was a practical preimage attack on MD5. I must have been thinking of CRC32 (which is just useless for this sort of thing.) My mistake.
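For contrast, here's roughly why CRC32 is useless for this: with only 2**32 possible values, a birthday search finds a collision after ~80k attempts, in well under a second (and CRC32 is linear besides, so targeted forgeries are even cheaper than this brute force):

```python
import zlib
from itertools import count

# Hash short messages until two of them share a CRC32 value.
seen = {}
for i in count():
    m = b"message-%d" % i
    c = zlib.crc32(m)
    if c in seen:
        a, b = seen[c], m
        break
    seen[c] = m

assert a != b and zlib.crc32(a) == zlib.crc32(b)
```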
More closely, a baby being born with a birthmark of its DNA sequence on its skin—where the birthmark is genetic such that any clones (with the same DNA) would have the same birthmark.
Reminds me of a fun problem I encountered as a younger lad:
This sentence has three as, one b, two cs, two ds, thirty-six es, three fs, three gs, eleven hs, nine is, one j, one k, three ls, one m, eighteen ns, twelve os, one p, one q, eight rs, twenty-six ss, twenty ts, two us, five vs, seven ws, three xs, four ys and one z.
I eventually found the solution by iteratively updating the approximate distribution of each letter and finally sampling [1] - not sure if there is a better way!
You used a gradient descent approach, but there's no way to guarantee anything like convexity, right?
I bet you can construct a language where a solution exists, but all of its neighbors' errors (using the topology implied by your gd function) are local maxima.
Right, it is definitely not convex in any way and gradient descent is basically useless (and not very interesting because all it really means here is try perturbing each count by one and look for improvement). I left that there as an initial attempt at solving the problem but I don't use it in my solution (except for the call to `gd(iter(gd(v)))` where I explore the neighbouring solutions to each sample point, which is probably not necessary).
The basic idea instead is to treat each letter count as a random variable, and ask what the distribution of each count is. In particular, each letter count can be expressed as a sum of (a mapping of) the other letter counts, so if you know the distribution of all other letter counts you can improve the distribution of the letter of interest. Initially assume uniform for all counts, then iterate until convergence. The images in that repo show the distribution of some letters at various stages. After doing this for about 10 iterations you stop seeing any improvement (the letter distributions are as 'peaked' as they are going to get), at which point I drew samples until I found a solution.
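To make the fixed-point view concrete, here's a minimal sketch of my own (a hypothetical template, not the parent's code): render a count vector as a sentence, re-count the letters, and a solution is exactly a vector the update map sends to itself:

```python
import string

ONES = ("one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

def words(n):
    # English number words for 1..99, enough for sentence-sized counts.
    if n <= 19:
        return ONES[n - 1]
    t, r = divmod(n, 10)
    return TENS[t - 2] + ("-" + ONES[r - 1] if r else "")

def render(counts):
    # Hypothetical template mirroring the puzzle sentence (zero counts omitted).
    parts = [f"{words(n)} {c}s" for c, n in sorted(counts.items()) if n]
    return "this sentence has " + ", ".join(parts)

def f(counts):
    # One update step: the *actual* letter counts of the rendered sentence.
    s = render(counts)
    return {c: s.count(c) for c in string.ascii_lowercase}

# A self-descriptive sentence is exactly a fixed point: f(v) == v.
v = {c: 1 for c in string.ascii_lowercase}
assert words(36) == "thirty-six"
assert f(v)["q"] == 1  # the only q in the rendering comes from the "one qs" claim
```

Treating each count as a random variable, as described above, amounts to iterating a smoothed version of this f on distributions instead of integers.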
Alright, I admit that I skimmed, came across a bunch of GD code at the beginning and assumed that you just got lucky with a very greedy approach, sorry :P.
The update function you wrote on the distribution space is continuous, and that space is compact and convex (since there's no way to have more than N of any letter), so by Brouwer's fixed-point theorem there is necessarily at least one fixed point.
Fixed points could still be sources though, right? It clearly wasn't, but I'm curious to know if you got lucky, or if you merely didn't get unlucky.
I wonder how effective your code would be on the following self-descriptive sentences (base j).
For instance when j=2, "100 1,11 0" has 4 ones and 3 zeros as described.
Does your code consistently find solutions as you increase j? At what point does the computation become unfeasible?
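For what it's worth, checking a candidate in the base-j variant is straightforward. A quick sketch, assuming the comma-separated "<count> <digit>" format from the j=2 example:

```python
def is_self_descriptive(sentence, base):
    # Parse claims of the form "<count in given base> <digit>", comma-separated,
    # then verify each digit really occurs that many times in the whole sentence.
    actual = {d: sentence.count(d) for d in "0123456789"[:base]}
    claimed = {}
    for part in sentence.split(","):
        count_str, digit = part.split()
        claimed[digit] = int(count_str, base)
    return claimed == actual

assert is_self_descriptive("100 1,11 0", base=2)      # four 1s, three 0s: checks out
assert not is_self_descriptive("10 1,11 0", base=2)   # claims two 1s but contains four
```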
Nice! I admit I don't have a good understanding of the analysis, but if you mean "does repeatedly applying f to some starting point v, like f(f(f(...(v)))) eventually lead to a solution" the answer is definitely not - you can be right next to a solution and not get there by GD or by applying f. My function `iter` applies f over and over, but this was another failed attempt at a solution.
I will endeavour to try your problem, is it related in some way to a well-known problem?
It's awesome how it can maintain the image layout in the source code through those 100 transformations.
I wonder whether it reads its own source code file to detect where to put the spaces for the image, or is it all encoded as data in the ruby program itself? I suppose it's the latter (otherwise it wouldn't qualify as a quine), but I can't figure out how it is encoded.
The answer there doesn't explain chosen-prefix attacks, which are more powerful than building a collision out of a single prefix. They require much less flexibility in the file format; building the PDF for shattered.io wasn't trivial exactly because the SHA-1 break didn't allow choosing two prefixes.
The output format of md5sum and other checksum programs is well-defined, which lets you use the -c switch to verify. This is the format you see in checksum files when downloading source tarballs, distro ISOs, packages, etc.
$ echo hello > fileone
$ echo there > filetwo
$ md5sum fileone filetwo | tee /tmp/ck
b1946ac92492d2347c6235b4d2611184  fileone
c4ff45bb1fab99f9164b7fec14b2292a  filetwo
$ md5sum -c /tmp/ck
fileone: OK
filetwo: OK
$ echo oops > filetwo
$ md5sum -c /tmp/ck
fileone: OK
filetwo: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
I think this uses one of the available GIF extensions (e.g. plain text or application data blocks), since GIF is a block-based binary format. The animation blocks go in one part of the file; in the other blocks you can put random data (which the GIF parser ignores) until you find the MD5 collision.
Surely it does: you have many more degrees of freedom in an animated GIF than a static one, so it's easier to make invisible changes to the file to get the MD5 hash displayed on the final GIF frame.
Just a wild guess, but I imagine you would start by designing the gif animation to depict a plausible MD5 hash, then modifying the file in such a way that you make the file MD5 hash match the one depicted in the animation.
No, this almost certainly isn't a preimage attack – it is more likely to be caused by the fact that you can generate MD5 collisions cheaply. Getting the gif animation to show the right frames based on the colliding block is a neat trick though! Looking forward to the writeup.
This is called a fixed point, i.e. an X such that md5(X) == X. A quick Google for "md5 fixed point" turns up claims that the probability one exists is about 63% (https://news.ycombinator.com/item?id=614027), but no examples of a known one.
Also, a "real" fixed point wouldn't necessarily be ASCII.
So you could have two flavours: one where the hash matches the raw input bytes, and another where the input is a hex string that matches the hex representation of its own hash.
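A quick sketch of the raw-bytes flavour, plus the heuristic behind the ~63% figure (modelling MD5 on 16-byte inputs as a random function, an assumption, not a proof):

```python
import hashlib, math, os

def is_fixed_point(x):
    # md5 maps 16 bytes to 16 bytes; a fixed point is an x with md5(x) == x.
    return hashlib.md5(x).digest() == x

# Heuristic: for a random function on a set of size N = 2**128, the chance
# that no element maps to itself is (1 - 1/N)**N ~ 1/e, so a fixed point
# exists with probability ~ 1 - 1/e.
print(f"{1 - math.exp(-1):.3f}")  # ~0.632

# Searching is hopeless (expected ~2**128 trials), but checking is trivial:
assert not any(is_fixed_point(os.urandom(16)) for _ in range(1000))
```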
There is http://www.olegkikin.com/md5game/ on which I wasted some GPU cycles many years ago. It uses a very silly metric for progress though. Finding a longer substring that matches doesn't bring you closer to finding MD5(x) = x. It was a fun problem to try out some crypto bruteforcing on the GPU though.
I don't think any of the known weaknesses of MD5 would help you in finding MD5(x) = x.
1. Generate a gif for each possible digit in the first column
2. Append collision blocks to each GIF to make a 16-way collision
3. Repeat for each digit
4. Hash the final product
5. Replace each digit with the correct digit
From https://www.reddit.com/r/programming/comments/5y03g9/animate...