Some more info on why the 16-way collision isn't hard:
This is a Nostradamus attack. For any Merkle-Damgard hash function (MD5, SHA-1, SHA-2; this structure is a crucial part of why the recent SHA-1 collision can be used to create arbitrary colliding PDFs), if hash(A) = hash(B), then hash(A+C) = hash(B+C), assuming A and B end at block-size boundaries (pad them if not). So you can always append the same suffix to colliding strings and get new colliding strings.
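A small illustration of the iterated structure behind this (it only shows incremental hashing, not a collision: Python's hashlib exposes the running state via .copy(), and the state after a prefix fully determines how any suffix is absorbed):

```python
import hashlib

# Snapshot the internal MD5 state after a prefix, then continue with a suffix.
# A real collision leaves two *different* prefixes in the same internal state,
# so extending both with the same suffix (at block boundaries) preserves equality.
h = hashlib.md5(b"some prefix")
h2 = h.copy()              # copy of the internal state after the prefix
h2.update(b"same suffix")  # continue hashing from that state
assert h2.digest() == hashlib.md5(b"some prefixsame suffix").digest()
```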
Now, a (second-)preimage attack is beyond your means: given a string, it's hard to find another input that hashes to the same thing.
But, given two strings, it's possible to mix crap with both of them till they give you the same hash. This is reasonably fast for MD5, and expensive (but still within the means of folks like Google) for SHA1.
So what you do is you first create images for each digit. Pair them up.
Now, take each pair and append crap to both members until you find a collision; now you have 8 colliding pairs. Applying the suffix property to hash(digit0 + crap0) = hash(digit1 + crap1): hash(digit0 + crap0 + crap01) = hash(digit1 + crap1 + crap01). Appending the same suffix (crap01) to both will still get you a collision.
Now, pair up each pair. Let's say we take the digit0/digit1 pair and pair it with digit2/digit3. Find a collision between digit0+crap0 and digit2+crap2. Let's say that hash(digit0 + crap0 + morecrap0) = hash(digit2 + crap2 + morecrap2). Realize that this is also equal to hash(digit1 + crap1 + morecrap0) and hash(digit3 + crap3 + morecrap2) (since we can add the morecrap suffix to the already-colliding digit+crap combinations to get new collisions).
Now we have a four-way collision. Repeat with the other pairs and you have 4 of these. Do the same process again to get two 8-way collisions, then once more to get one 16-way collision. Ultimately you'll have a binary tree of pairwise collisions.
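The pairing scheme above can be sketched end to end with a toy iterated hash (my own 24-bit stand-in for MD5, chosen so brute-force collision search is cheap; like any Merkle-Damgard hash, equal states after two prefixes stay equal when the same suffix is appended):

```python
from itertools import count

def toyhash(data):
    # A miniature iterated hash with a 24-bit state. The final state IS the
    # hash, so hash(A) == hash(B) implies hash(A+C) == hash(B+C) exactly.
    h = 0xABCDEF
    for byte in data:
        h = ((h * 0x01000193) ^ byte) & 0xFFFFFF  # FNV-style mixing step
    return h

def find_collision(a, b):
    # "Mix crap with both until they give the same hash": append counters to
    # each message until some extension of a collides with some extension of b.
    # Birthday bound: ~2**12 attempts for a 24-bit hash.
    seen = {}
    for i in count():
        ea = a + b"/%d" % i
        seen[toyhash(ea)] = ea
        eb = b + b"\\%d" % i
        if toyhash(eb) in seen:
            return seen[toyhash(eb)], eb

def merge(ga, gb):
    # Collide one representative of each already-colliding group, then append
    # each side's "crap" suffix to every member of that group; the suffix
    # property keeps the within-group collisions intact.
    ea, eb = find_collision(ga[0], gb[0])
    return [m + ea[len(ga[0]):] for m in ga] + [m + eb[len(gb[0]):] for m in gb]

groups = [[b"digit%X" % d] for d in range(16)]
while len(groups) > 1:  # 16 -> 8 -> 4 -> 2 -> 1 groups
    groups = [merge(groups[i], groups[i + 1]) for i in range(0, len(groups), 2)]

msgs = groups[0]
assert len(set(msgs)) == 16            # 16 distinct messages...
assert len({toyhash(m) for m in msgs}) == 1  # ...all with the same hash
```

The real attack is the same tree of 15 pairwise collisions, just with MD5 collision blocks (e.g. from fastcoll) instead of brute force.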
Thank you for this explanation! Can you tell me more about why it is easy to "mix crap with both until they give you the same hash?" Specifically in the context of MD5.
For the full details I think you'd have to dive quite a bit into MD5's internals; this isn't something that was supposed to be easy, it's a flaw in the algorithm.
This is HN, surely someone can point me in the right direction to learn more about the specific part of MD5 that makes it weak to this, even if it is only conceptual.
It's moderately expensive even with MD5. Ideally you'd like to use the same prefix and use the collision itself to choose between the two. This is easy with code, trickier with images, and even trickier with text-based PDFs but still possible: https://www.makomk.com/~aidan/selfmd5-lastfix.pdf
Ah darn, for some reason I thought there was a practical preimage attack on MD5. I must have been thinking of CRC32 (which is just useless for this sort of thing.) My mistake.
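For contrast, here's roughly why CRC32 is useless for this: with only 2**32 possible values, a birthday search finds a collision after ~80k attempts, in well under a second (and CRC32 is linear besides, so targeted forgeries are even cheaper than this brute force):

```python
import zlib
from itertools import count

# Hash short messages until two of them share a CRC32 value.
seen = {}
for i in count():
    m = b"message-%d" % i
    c = zlib.crc32(m)
    if c in seen:
        a, b = seen[c], m
        break
    seen[c] = m

assert a != b and zlib.crc32(a) == zlib.crc32(b)
```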
More closely, a baby being born with a birthmark of its DNA sequence on its skin—where the birthmark is genetic such that any clones (with the same DNA) would have the same birthmark.
Reminds me of a fun problem I encountered as a younger lad:
This sentence has three as, one b, two cs, two ds, thirty-six es, three fs, three gs, eleven hs, nine is, one j, one k, three ls, one m, eighteen ns, twelve os, one p, one q, eight rs, twenty-six ss, twenty ts, two us, five vs, seven ws, three xs, four ys and one z.
I eventually found the solution by iteratively updating the approximate distribution of each letter and finally sampling [1] - not sure if there is a better way!
You used a gradient descent approach, but there's no way to guarantee anything like convexity, right?
I bet you can construct a language where a solution exists, but all of its neighbors' errors (using the topology implied by your gd function) are local maxima.
Right, it is definitely not convex in any way and gradient descent is basically useless (and not very interesting because all it really means here is try perturbing each count by one and look for improvement). I left that there as an initial attempt at solving the problem but I don't use it in my solution (except for the call to `gd(iter(gd(v)))` where I explore the neighbouring solutions to each sample point, which is probably not necessary).
The basic idea instead is to treat each letter count as a random variable, and ask what the distribution of each count is. In particular, each letter count can be expressed as a sum of (a mapping of) the other letter counts, so if you know the distribution of all other letter counts you can improve the distribution of the letter of interest. Initially assume uniform for all counts, then iterate until convergence. The images in that repo show the distribution of some letters at various stages. After doing this for about 10 iterations you stop seeing any improvement (the letter distributions are as 'peaked' as they are going to get), at which point I drew samples until I found a solution.
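To make the fixed-point view concrete, here's a minimal sketch of my own (a hypothetical template, not the parent's code): render a count vector as a sentence, re-count the letters, and a solution is exactly a vector the update map sends to itself:

```python
import string

ONES = ("one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

def words(n):
    # English number words for 1..99, enough for sentence-sized counts.
    if n <= 19:
        return ONES[n - 1]
    t, r = divmod(n, 10)
    return TENS[t - 2] + ("-" + ONES[r - 1] if r else "")

def render(counts):
    # Hypothetical template mirroring the puzzle sentence (zero counts omitted).
    parts = [f"{words(n)} {c}s" for c, n in sorted(counts.items()) if n]
    return "this sentence has " + ", ".join(parts)

def f(counts):
    # One update step: the *actual* letter counts of the rendered sentence.
    s = render(counts)
    return {c: s.count(c) for c in string.ascii_lowercase}

# A self-descriptive sentence is exactly a fixed point: f(v) == v.
v = {c: 1 for c in string.ascii_lowercase}
assert words(36) == "thirty-six"
assert f(v)["q"] == 1  # the only q in the rendering comes from the "one qs" claim
```

Treating each count as a random variable, as described above, amounts to iterating a smoothed version of this f on distributions instead of integers.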
Alright, I admit that I skimmed, came across a bunch of GD code at the beginning and assumed that you just got lucky with a very greedy approach, sorry :P.
The update function you wrote on the distribution space is continuous, and that space is compact and convex (since there's no way to have more than N of any letter), so by Brouwer's fixed-point theorem there is necessarily at least one fixed point.
Fixed points could still be sources though, right? It clearly wasn't, but I'm curious to know if you got lucky, or if you merely didn't get unlucky.
I wonder how effective your code would be on the following self-descriptive sentences (base j).
For instance when j=2, "100 1,11 0" has 4 ones and 3 zeros as described.
Does your code consistently find solutions as you increase j? At what point does the computation become unfeasible?
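For what it's worth, checking a candidate in the base-j variant is straightforward. A quick sketch, assuming the comma-separated "<count> <digit>" format from the j=2 example:

```python
def is_self_descriptive(sentence, base):
    # Parse claims of the form "<count in given base> <digit>", comma-separated,
    # then verify each digit really occurs that many times in the whole sentence.
    actual = {d: sentence.count(d) for d in "0123456789"[:base]}
    claimed = {}
    for part in sentence.split(","):
        count_str, digit = part.split()
        claimed[digit] = int(count_str, base)
    return claimed == actual

assert is_self_descriptive("100 1,11 0", base=2)      # four 1s, three 0s: checks out
assert not is_self_descriptive("10 1,11 0", base=2)   # claims two 1s but contains four
```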
Nice! I admit I don't have a good understanding of the analysis, but if you mean "does repeatedly applying f to some starting point v, like f(f(f(...(v)))) eventually lead to a solution" the answer is definitely not - you can be right next to a solution and not get there by GD or by applying f. My function `iter` applies f over and over, but this was another failed attempt at a solution.
I will endeavour to try your problem, is it related in some way to a well-known problem?
It's awesome how it can maintain the image layout in the source code through those 100 transformations.
I wonder whether it reads its own source code file to detect where to put the spaces for the image, or is it all encoded as data in the ruby program itself? I suppose it's the latter (otherwise it wouldn't qualify as a quine), but I can't figure out how it is encoded.
The answer there doesn't explain chosen-prefix attacks, which are more powerful than building a collision out of a single prefix. They require much less flexibility in the file format; building the PDF for shattered.io wasn't trivial exactly because the SHA-1 break didn't allow choosing two prefixes.
The output format of md5sum and other checksum programs is well-defined, which lets you use the -c switch to verify. This is the format you see in checksum files when downloading source tarballs, distro ISOs, packages, etc.
$ echo hello > fileone
$ echo there > filetwo
$ md5sum fileone filetwo | tee /tmp/ck
b1946ac92492d2347c6235b4d2611184  fileone
c4ff45bb1fab99f9164b7fec14b2292a  filetwo
$ md5sum -c /tmp/ck
fileone: OK
filetwo: OK
$ echo oops > filetwo
$ md5sum -c /tmp/ck
fileone: OK
filetwo: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
I think this uses one of the available GIF extensions (e.g. plain text or application data blocks), since GIF is a block-based binary format. The animation blocks go in one part of the file; in the other blocks you can put random data (which the GIF parser ignores) until you find the MD5 collision.
Surely it does: you have many more degrees of freedom in an animated GIF than a static one, so it's easier to make invisible changes to the file to get the MD5 hash displayed on the final GIF frame.
Just a wild guess, but I imagine you would start by designing the gif animation to depict a plausible MD5 hash, then modifying the file in such a way that you make the file MD5 hash match the one depicted in the animation.
No, this almost certainly isn't a preimage attack – it is more likely to be caused by the fact that you can generate MD5 collisions cheaply. Getting the gif animation to show the right frames based on the colliding block is a neat trick though! Looking forward to the writeup.
This is called a fixed point, i.e. an X such that md5(X) == X. A quick Google for "md5 fixed point" turns up claims that the probability one exists is about 63% (https://news.ycombinator.com/item?id=614027), but no examples of a known one.
Also, a "real" fixed point wouldn't necessarily be ASCII.
So you could have two flavours: one where the hash matches the raw input bytes, and another where the input is a hex string that matches the hex representation of its own hash.
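A quick sketch of the raw-bytes flavour, plus the heuristic behind the ~63% figure (modelling MD5 on 16-byte inputs as a random function, an assumption, not a proof):

```python
import hashlib, math, os

def is_fixed_point(x):
    # md5 maps 16 bytes to 16 bytes; a fixed point is an x with md5(x) == x.
    return hashlib.md5(x).digest() == x

# Heuristic: for a random function on a set of size N = 2**128, the chance
# that no element maps to itself is (1 - 1/N)**N ~ 1/e, so a fixed point
# exists with probability ~ 1 - 1/e.
print(f"{1 - math.exp(-1):.3f}")  # ~0.632

# Searching is hopeless (expected ~2**128 trials), but checking is trivial:
assert not any(is_fixed_point(os.urandom(16)) for _ in range(1000))
```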
There is http://www.olegkikin.com/md5game/ on which I wasted some GPU cycles many years ago. It uses a very silly metric for progress though. Finding a longer substring that matches doesn't bring you closer to finding MD5(x) = x. It was a fun problem to try out some crypto bruteforcing on the GPU though.
I don't think any of the known weaknesses of MD5 would help you in finding MD5(x) = x.
1. Generate a gif for each possible digit in the first column
2. Append collision blocks to each GIF to make a 16-way collision
3. Repeat for each digit
4. Hash the final product
5. Replace each digit with the correct digit
From https://www.reddit.com/r/programming/comments/5y03g9/animate...