543 points by svenfaw on March 8, 2017 | hide | past | favorite | 94 comments

 Here is the explanation:1. Generate a gif for each possible digit in the first column2. Append collision blocks to each gif to make a 16 way collision3. Repeat for each digit4. Hash the final product5. Replace each digit with the correct digit
 Some more info on why the 16 way collision isn't hard:This is a Nostradamus attack. For any Merkle-Damgard hash function (MD5, SHA1, SHA2 -- this is a crucial part of why the recent SHA1 collision can be used to create arbitrary colliding pdfs), if hash(A) = hash(B), hash(A+C) = hash(B+C) assuming A and B are at block size boundaries (pad them if not). So you can always add the same suffix to colliding strings and get new colliding strings.Now, a preimage attack is beyond your means. Given a string, it's hard to find something that hashes to the same thing.But, given two strings, it's possible to mix crap with both of them till they give you the same hash. This is reasonably fast for MD5, and expensive (but still within the means of folks like Google) for SHA1.So what you do is you first create images for each digit. Pair them up.Now, take each pair, and append crap to both instances till you find a collision. Now you have 8 colliding pairs. Now, applying the hash(digit0 + crap0) = hash(digit1 + crap1), hash(digit0 + crap0 + crap01) = hash(digit1 + crap1 + crap01). Appending the same suffix (crap01) to both will still get you a collision.Now, pair up each pair. Let's say we take the digit0/digit1 pair and pair it with digit2/digit3. Find a collision between digit0+crap0 and digit2+crap2. Let's say that hash(digit0 + crap0 + morecrap0) = hash(digit2 + crap2 + morecrap2). Realize that this is also equal to hash(digit1 + crap1 + morecrap0) and hash(digit3 + crap3 + morecrap2) (since we can add the morecrap suffix to the already-colliding digit+crap combinations to get new collisions).Now, we have a four-way collision. Repeat with the other pairs and you have 4 of these. Now do the same process, and get 2 8 way collisions. Repeat to get 1 16 way collision. Ultimately you'll have something like- hash(digit0 + crap0 + crap01 + crap0123 + crap01234567)- hash(digit1 + crap1 + crap01 + crap0123 + crap01234567)- hash(digit2 + crap2 + crap23 + crap0123 + crap01234567)- ...- hash(digit6 + crap6 + crap67 + crap4567 + crap01234567)- ...- hash(digitF + crapF + crapEF + crapCDEF + crap9ABCDEF)which is a 16 way collision, of 16 gifs of different digits, followed by some crap for each one.This only requires 15 collision attacks, which isn't hard with MD5. Needs more than a million dollars for SHA1, though.
 Thank you for this explanation! Can you tell me more about why it is easy to "mix crap with both until they give you the same hash?" Specifically in the context of MD5.
 For the full details I thing you would have to dive quite a bit into MD5 as this isn't something that was supposed to be easy, but a problem with the algorithm.
 This is HN, surely someone can point me in the right direction to learn more about the specific part of MD5 that makes it weak to this, even if it is only conceptual.
 eriknstr on March 9, 2017 >this isn't something that was supposed to be easyParent to the comment you are replying to said that it wasn't hard. If it's not hard then it ought to be quite easy.
 What I meant is that when md5 was designed, it was supposed to make this thing difficult.However flaws in the design have made this easy.
 Ah, I see :)
 makomk on March 9, 2017 It's moderately expensive even with MD5. Ideally you'd like to use the same prefix and use the collision itself to choose between the two. This is easy with code, trickier with images, and even trickier with text-based PDFs but still possible: https://www.makomk.com/~aidan/selfmd5-lastfix.pdf
 Could you expand on this? I'm not quite sure what you mean here.Edit: Oh, I see what you mean.
 Why would you do it that way, rather than:1. Pick a target MD52. Make whatever cutesy animation you want showing that number3. Append collision blocks to the resulting gif to make it match the target MD5
 Because this attack is not feasible against MD5 with current state of the art cryptanalysis.Instead collisions between files are demonstrated by appending bytes to two files until their hashes match.
 Ah darn, for some reason I thought there was a practical preimage attack on MD5. I must have been thinking of CRC32 (which is just useless for this sort of thing.) My mistake.
 CRCs are linear, all 'attacks' are efficient on them.
 corndoge on March 9, 2017 Got a code snippet?
 Create your own MD5 collisions – https://natmchugh.blogspot.cz/2015/02/create-your-own-md5-co...(2015, on HN here, but without any significant traction: https://news.ycombinator.com/item?id=9003429)The software used is "HashClash", seems it supports SHA1, too: https://marc-stevens.nl/p/hashclash/index.php
 Oh man. Thanks! That makes so much more sense with the explanation.
 Took me a couple of minutes looking at this to realize why this was interesting.It's like a baby being born holding its completed birth certificate.
 More closely, a baby being born with a birthmark of its DNA sequence on its skin—where the birthmark is genetic such that any clones (with the same DNA) would have the same birthmark.
 I'm not sure that's closer, DNA->organism is roughly a continuous mapping.
 One of my friends had a birthday cake which had a QR code on it of the URL photo album containing photos of the birthday.Of course this is trivial to set up (create album before, upload photos after), but I found it amusing nonetheless.
 This is a very creative and practical idea. Well, I haven't seen anyone (seriously, 0 people) scanning a QR code before so there is that...
 alanh on March 10, 2017 Or have the QR code go to a redirection URL that you can set to the album URL after the fact. E.g., easy with my link shortener software, 'Lessn More'
 This sentence has forty-five (45) characters.
 Lxr on March 9, 2017 Reminds me of a fun problem I encountered as a younger lad:This sentence has three as, one b, two cs, two ds, thirty-six es, three fs, three gs, eleven hs, nine is, one j, one k, three ls, one m, eighteen ns, twelve os, one p, one q, eight rs, twenty-six ss, twenty ts, two us, five vs, seven ws, three xs, four ys and one z.I eventually found the solution by iteratively updating the approximate distribution of each letter and finally sampling [1] - not sure if there is a better way![1] solve_2 in tsh.py at https://bitbucket.org/akxlr/tsh/src
 you use a gradient descent approach, but there's no way to guarantee anything like convexity, right?I bet you can construct a language where a solution exists, but all of its neighbors' errors (using the topology implied by your gd function) are local maxima.
 Right, it is definitely not convex in any way and gradient descent is basically useless (and not very interesting because all it really means here is try perturbing each count by one and look for improvement). I left that there as an initial attempt at solving the problem but I don't use it in my solution (except for the call to `gd(iter(gd(v)))` where I explore the neighbouring solutions to each sample point, which is probably not necessary).The basic idea instead is to treat each letter count as a random variable, and ask what the distribution of each count is. In particular, each letter count can be expressed as a sum of (a mapping of) the other letter counts, so if you know the distribution of all other letter counts you can improve the distribution of the letter of interest. Initially assume uniform for all counts, then iterate until convergence. The images in that repo show the distribution of some letters at various stages. After doing this for about 10 iterations you stop seeing any improvement (the letter distributions are as 'peaked' as they are going to get), at which point I drew samples until I found a solution.
 Alright, I admit that I skimmed, came across a bunch of GD code at the beginning and assumed that you just got lucky with a very greedy approach, sorry :P.The update function you wrote on the distribution space is continuous, and distribution space is compact (since there's no way to have more than N of any letter), so there is necessarily at least one fixed point.Fixed points could still be sources though, right? It clearly wasn't, but I'm curious to know if you got lucky, or if you merely didn't get unlucky.I wonder how effective your code would be on the following self descriptive sentences (base j).For instance when j=2, "100 1,11 0" has 4 ones and 3 zeros as described.Does your code consistently find solutions as you increase j? At what point does the computation become unfeasible?
 Nice! I admit I don't have a good understanding of the analysis, but if you mean "does repeatedly applying f to some starting point v, like f(f(f(...(v)))) eventually lead to a solution" the answer is definitely not - you can be right next to a solution and not get there by GD or by applying f. My function `iter` applies f over and over, but this was another failed attempt at a solution.I will endeavour to try your problem, is it related in some way to a well-known problem?
 JadeNB on March 8, 2017 Or the good old self-referential aptitude test. http://www.maa.org/press/periodicals/math-horizons/self-refe...
 calt on March 8, 2017 Or "(This Song's Just) Six Words Long" ... even though he's clearly not singing it as a contraction.https://www.youtube.com/watch?v=JWi5jdgTUJs
 That's pretty catchy. Weird Al is a really talented musician. I almost think he'd do greater things if he wasn't such a goofball all the time.
 Remaining a goofball all the time despite his talent and success is the great thing he's done.
 The guy's career has spanned decades. Much longer than most of the "stars" he's parodied. He must be doing something right.
 bdamm on March 9, 2017 Not quite, because it is computationally infeasible to brute-force a pre-image on MD5.
 glitchdout on March 10, 2017 Slightly related, I think you'll like this puzzle: https://www.youtube.com/watch?v=x1THOPm0qTw
 It's like a quine almost. My favorite quine is...http://aem1k.com/world/ (view source)
 That is outrageously wonderful. Thank you!
 Have you seen this one? It's insane:https://github.com/mame/quine-relay
 It's awesome how it can maintain the image layout in the source code through those 100 transformations.I wonder whether it reads its own source code file to detect where to put the spaces for the image, or is it all encoded as data in the ruby program itself? I suppose it's the latter (otherwise it wouldn't qualify as a quine), but I can't figure out how it is encoded.
 It appears all the languages' code is encoded in the `code_gen.rb` file:
 power78 on March 9, 2017 How is that possible?
 There's an explanation of how this 'attack' was composed here: http://crypto.stackexchange.com/questions/44463/how-does-the...
 The answer there doesn't explain chosen-prefix attacks, which are more powerful than building a collision out of a single prefix. They require much less flexibility in the fire format; building the PDF for shattered.io wasn't trivial exactly because the SHA-1 break didn't allow choosing two prefixes.
 In similar spirit, a Zip archive holding itself, by Russ Cox: https://research.swtch.com/zipInfinite recursion!
 Title checks out.`````` > md5sum md5.gif f5ca4f935d44b85c431a8bf788c0eaca md5.gif``````
 Never knew that md5sum took a file argument. I've always `md5sum
 The output format of md5sum and other checksum programs is well-defined, which lets you use the -c switch to verify. This is the format you see in checksum files when downloading source tarballs, distro ISOs, packages, etc.`````` \$ echo hello > fileone \$ echo there > filetwo \$ md5sum fileone filetwo | tee /tmp/ck b1946ac92492d2347c6235b4d2611184 fileone c4ff45bb1fab99f9164b7fec14b2292a filetwo \$ md5sum -c /tmp/ck fileone: OK filetwo: OK \$ echo oops > filetwo \$ md5sum -c /tmp/ck fileone: OK filetwo: FAILED md5sum: WARNING: 1 computed checksum did NOT match `````` Edit: Of course, nothing wrong with using stdin:`````` \$ wget -q -O- https://shells.aachen.ccc.de/~spq/md5.gif | md5sum f5ca4f935d44b85c431a8bf788c0eaca -``````
 What's the -O- for?
 Output to stdout instead of a file on disk.
 cobbzilla on March 9, 2017 -O is the file to write output. plain - means stdout. so, run wget and print results to stdout. tip: add -S to include response http headers.
 kevincox on March 9, 2017 It's nice to hash a couple files and have the hashes displayed right next to each other :)
 soheil heard, and your request was granted (https://news.ycombinator.com/item?id=13824445).
 I was a bit disappointed when I found this out about it:"The formula is a general-purpose method of decoding a bitmap stored in the constant k, and it could actually be used to draw any other image."That makes it less of a self-referential formula in my opinion, and more of a cool formula that can print any bitmap.
 Yeah, the title is misleading. There's nothing self-referential about this formula.Now, if the formula was also capable of drawing the constant k, THAT would be exciting.
 Jakub Travnik has got your back. http://jtra.cz/stuff/essays/math-self-reference/index.html
 omaranto on March 9, 2017 Jakub Travnik has a better version where the image includes the necessary constant. http://jtra.cz/stuff/essays/math-self-reference/index.html
 Not as impressive but similar idea: https://github.com/ziqbal/jpegfit/blob/master/61081.jpg
 Would love to know how this was made, if anybody knows
 This was posted to reddit yesterday, here's the discussion:https://www.reddit.com/r/programming/comments/5y03g9/animate...
 I think this uses one of the available GIF extensions (e.g plain-text, custom data, or other) since gif is a block based binary file. Animation blocks in one part of the file, then for the other blocks you could just put random data (which is ignored by the GIF parser) until you find the MD5 collision.
 I wonder if it being an animated GIF as opposed to just an image has anything to do with it.
 Surely it does: you have much more combinations in animated GIF than static GIF so it is easier to make "unvisible" changes to the file to get the MD5 hash displayed on final GIF slide.
 I can't imagine how you'd begin to do this.
 Just a wild guess, but I imagine you would start by designing the gif animation to depict a plausible MD5 hash, then modifying the file in such a way that you make the file MD5 hash match the one depicted in the animation.Very impressive!
 No, this almost certainly isn't a preimage attack – it is more likely to be caused by the fact that you can generate MD5 collisions cheaply. Getting the gif animation to show the right frames based on the colliding block is a neat trick though! Looking forward to the writeup.Here's how a similar thing was done for PostScript: https://twitter.com/teh_gerg/status/838422647193157632
 MichaelGG on March 8, 2017 But any 128 bit value is a plausible MD5 hash. Modifying the file to make that value is akin to doing a preimage attack right?
 Yes, that's probably the point of the maker. To prove that they can do a preimage attack.
 kmm on March 8, 2017 No, it's just some fancy tricks with collisions. A preimage attack is still far beyond our current technologies
 It's important to note that technologies here means not only our hardware technologies but also our cryptographic knowledge.There is a chance that there are unknown flaws in the MD5 algorithm that make preimage attacks far easier to compute.Our compute power is also growing cheeper, which would let us solve harder preimage problems.
 _vya7 on March 8, 2017 Are MD5 hashes that predictable that you can just make a few adjustments to a binary until it's the one you want?
 MD5 as a cryptographic hash is thoroughly shredded, yes.It's still fine for non-cryptographic purposes, except that it's far slower than e.g. CRC.
 That would be news if that were the case. The preimage resistance for MD5 was pretty strong last time I looked: http://crypto.stackexchange.com/questions/13303/is-md5-secon...
 This is incredible, it's like winning the lottery except that you can play incredibly fast with almost no cost for each try.
 That makes me wonder if it's also easy to create a string that is its own MD5.
 This is called a fixed point, that is an X such that md5(X)==X. A quick google for "md5 fixed point" shows some claims that the probability that one exists is about 63% (https://news.ycombinator.com/item?id=614027) but no examples of a known one
 Also the "real" fixed point wouldn't be necessarily ASCIISo you could have two types: one where the hash matches the input bytes, another where an hex string is inputed and that matches the hex representation of the hash
 Also EBCDIC and UTF-16
 jix on March 9, 2017 There is http://www.olegkikin.com/md5game/ on which I wasted some GPU cycles many years ago. It uses a very silly metric for progress though. Finding a longer substring that matches doesn't bring you closer to finding MD5(x) = x. It was a fun problem to try out some crypto bruteforcing on the GPU though.I don't think any of the known weaknesses of MD5 would help you in finding MD5(x) = x.
 I bet this has something to do with the fact (or so I'm told) that MD5 is broken.
 Writeup by @angealbertini coming soon
 Dubbed HashQuines; here are PDF and PostScript too: https://twitter.com/i/moments/838685002703466497
 How did you do that!