The pleroma instance linked in the OP is hosted on a very tiny VPS with no CDN, I fear it may fall over - if it does, consider swapping to the twitter URL.
> Since there are 36 pairs of blocks, there are 2^36 possible combinations, for each message. If we enumerate them all and check the resulting CRC, we'd have a reasonable chance of stumbling across the target CRC (0xdeadbeef) by chance - since there are only 2^32 possible CRC32 results.
> This is a feasible computation, but it would be a bit slow, probably on the order of minutes - infeasible as part of the inner-loop of a larger attack (such as a PNG hashquine!). Fortunately, there's a trick we can use to speed this up. [Meet in the middle.]
Wait a sec, CRC32 (and Adler32 IIRC) are linear — they are the “sum”, for all positions i in the input, of f(i, message[i]). (The actual details of f are irrelevant here.). For CRC32, “sum” is xor and, for Adler32, I think it’s the sum of pairs of u16 values. (No serious Galois field arithmetic needed — the Galois sum in the representation used by CRC32 is xor.)
So those 36 pairs each give a pair of vectors in the 32-dimensions space of checksums. Start with the message with all the a blocks. One at a time, switch a single block to its b value and calculate the change in the checksum, which can be done from the command line :). Each of those changes is a vector in checksum space, and they should form a basis with respectable probability m. You know the change you want to produce in the checksum from the all-a state to the correct checksum, and you can solve your underdetermined system of 32 equations with 36 unknowns using your favorite technique.
(Hmm. I haven’t written this out for real. I’m fairly confident it will work for CRC, since CRC’s checksum space has order 2 — solving the linear system will give 0 or 1 for each variable (i.e. colliding block), and you can read off the answer. For Adler32, the space is order 2^16, but you need a solution that only applies each available perturbation zero or one times. Off the top of my head, there are probably some good algorithms for this, but it’s maybe not trivial.)
> you can solve your underdetermined system of 32 equations with 36 unknowns using your favorite technique
What's the time complexity of your favorite technique?
(I'm a bit out-of-the-loop on the specifics of equation solving techniques - I let Z3 do this kind of thing for me, and take an educated guess on whether I'll finish during my lifetime or not)
You’re missing the constant factors. Gaussian elimination on a matrix containing 32-bit elements using xor involves operations that likely take a single cycle. The brute-force and meet-in-the-middle operations involve CRC-ing an entire message, which is vastly slower. (Or it involves optimizing the CRC using essentially the same linear algebra technique, at which would it would be pretty silly to solve a linear system by exhaustive search instead of algebra…)
I would expect the linear algebra approach to win by orders or magnitude.
Well to make the playing field equal it would be 2^19 times the time to check a solution, which is another factor of 36 times a fairly large constant C. So really you're comparing 2^19*C to 32^2.
> you need a solution that only applies each available perturbation zero or one times
I'd usually go to wagner's k-tree algorithm for a modular subset sum problem.
But these are pretty small instances and you shouldn't need to use a particularly memory efficient algorithm. So for example, splitting the space of all choices in two, enumerating all combinations on each side and searching for a meet-in-the middle should also work.
This is wildly impressive. Thank you for doing and sharing this, and the 1337 on either end is just icing on the cake. It's also a beautiful example of just how broken MD5 is. I wish I'd had this to show my students 3 weeks ago when teaching about collision attacks.
Stuff like that is turning into one of our running gags at work. You have all these machine learning processes and our legacy java applications gobbling up memory by the dozen of gigabytes, or our newer java apps being a bit less greedy but still gigabyte sized - and it's never enough.
And then we have something like our zabbix proxies, and when they passed a couple tens of thousands of items, we had to increase it's cache memory... from 16MB to a glorious 64MB. Such a splurge. And the server is using a whole 128MB for its write caches. Or our Grafanas are using a total of about 500MB of memory server side total to chew through oodles of data.
It's still ridiculous that we ever thought giving Java processes so much memory was normal. I'm very happy golang and rust are getting more main stream.
Indeed. In some cases, I find the memory need to be plausible. For example, our logging ES has nodes running with 32GB of heap, but these nodes are indexing and searching ~900GB of logs each. That's, per node, actually in line with our larger postgres instances, running at 32 - 64GB or memory on 600GB - 800GB of dataset. Or our logstash needs 4-8G go go through 2k messages per second.
But then I have other stuff running on huge node and struggling to process 10 messages per second and falling over whenever load increases by 10%.
I keep imagining the fantasy scenario where some day I stumble on a software download or repo or something. Nothing special, no fanfare, just a random piece of code to do some random task.
Inside the repo or zip, is a simple text manifest file. The file has some bog standard readme stuff and then lists all the files and their hashes, you know, so you can check nothing's corrupted whatnot.
But in that list of files, is a line item for the manifest file itself and along with its own hash! Something that on the surface looks completely innocuous but becomes profoundly impressive as you ponder it.
In reading that page, I was super taken in and immediately wanted to get a copy of the collection; trying to follow the links left me sad to realize it's a fictional story, but it's an amazing piece of fiction none the less.
One kind of that was when DRM rolled out... my memory is fuzzy on what this is as it's a vague recollection of an article once read and I hope someone will recall it and link it.
But in the cat and mouse between games trying to introduce DRM and this being instantly defeated each time because the DRM code was being removed by gamers/hackers... a company started making patches, and at the end of the patch was a bunch of what looked like garbage code, binpacked to create a certain size patch. It look erroneous. But patch arrived, then weeks and months later another patch, and so it went all the while the cat and mouse continued. Then finally DRM was enabled, and the question was how? How was this done when it hadn't been caught before? The answer was that all those fragments of code weren't innocuous or meaningless, it wasn't binpacking at all... the DRM code had been split into many fragments, shipped disguised as binpacking over a lot of patches, and by the time this was figured out the whole package had been delivered.
I'm not for copyright or DRM, but I did think that an elegant move.
op actually edited the comment like 4 time to try and make the spoiler tag work. So op, for what it's worth, ROT13 (google) is a good idea for spoilers in the future :)
> But in that list of files, is a line item for the manifest file itself and along with its own hash! Something that on the surface looks completely innocuous but becomes profoundly impressive as you ponder it.
Agree this would be cool. But there would be one last realisation after contemplating it for just a little longer. The sheer number CPU cycles that would have to have been committed to achieving it would be awful. Like Bitcoin mining but orders of magnitude even more wasteful.
Trying to find that hash by iterating on the manifest hash (as opposed to e.g. manipulating whitespace etc) would eventually succeed approximately 1/e of the time. Better get cracking!
If you define HASH(data) = SHA3(data) xor SHA3(manifest), then the manifest will always have a zero hash. This trick can be useful if you need to create a cycle in a hash-addressed list (e.g., an object-oriented class hierarchy where inheritance is by hash of the class definition—the definition of the base class would have a zero hash and could inherit from itself).
To me this can be done with a simple collision attack (assuming you can fiddle with some bytes inside the manifet file while freezing everything else in advance), which can be found under a second for MD5 with a laptop, and a few hundred thousands dollars of cloud resources for SHA1.
No, that takes a preimage attack, which doesn't exist (yet? I'm not too optimistic, personally).
It could maaaaaaaybe be done using multiple collisions that exploit the structure of a DEFLATE-compressed stream, so that you can control the extracted zip contents on a byte-by-byte basis - but I haven't figured that out just yet. Watch this space!
I don’t understand how this is a pre-image attack, as the manifest file only references itself (and not the zip file) and you can fiddle with the manifest file to your liking. To me this is the same theoretical problem as this self-referencing PNG file.
This is indeed a preimage attack if the manifest content (besides its own self-referenced hash) is fixed. However this is not the case in practice: to pull off this trick you could just append some random bytes at the end of the manifest, disguised as ASCII art or something like that. The manifest would still be human readable and correct, but this would become a collision attack.
Again, to me this is the exact same problem as this self-referential PNG file, which is a very cool trick but which can be (demonstrably) computed with limited compute resources.
One last comment though: I didn’t realize you were the author of the post (great work!!). This let me think you know your stuff, and you know something that I don’t and I need to think of all of that more carefully. So it is very probable you are right and I am wrong. Thanks for the discussion!
Again, the idea is not to find a specific hash value, but any $hash for which the property md5($manifest_content, $hash, $random_bytes) = $hash is true. You don’t need to match a specific hash value.
And you never answered how this manifest is somehow different than the self-referential png.
It seems we do not understand each other (unfortunately HN comments are not the best avenue for deep discussions) so this will be my last post on this thread as we both have better things to do than talking past each other.
In hindsight, I really regret not including the adler32 and crc32 in the image itself too, since I knew them ahead of time. If you inspect them with hexeditor, you'll find they have non-random values :)
However, it appears no one has actually discovered it yet, if it exists.
A more tractable question might be to find a cycle in the MD5 hash space, like a->b->c->d->a. So one might ask, what is the shortest MD5 cycle found so far?
Not even that many. With MD5 and at least through SHA-2 you can build the search tree based on a branching structure, where you clone the in-process hash as you go, so that for instance all of the hashes that differ only by the last byte share one partial hash and then feed in the last digit.
As long as you can keep 31 partial hashes in memory at a time, which is trivial, you don't have to rerun the work so far to increment the last digit by 1. I think I may have written the code for this at one point. I'll have to look around and see if I can find it.
> Is there an MD5 hash string that hashes to itself?
That’s calculable in cpu hours, not GPU years. If the answer is one then you’re done. If it’s not then you test N=2 in 4 GPU years, rather than 8. As N increases the answer becomes less interesting, and the cost goes quadratic.
You'd need to run many parallel processes like MD5(MD5(MD5(thing1))), MD5(MD5(MD5(thing2)))...
I think the math still works if you do that. But the cycle would have to be contained entirely within one of the individual runs. So that could potentially take longer.
The tricky part would be combing through the 160 exabytes of output to find the cycle.
So I’m getting much bigger than 160 exabytes. Can you show your math?
MD5 is 128 bits so that’s 2^128 keys. And it’s 32 bytes for the output, so that’s really 2^133 bytes. Which is 10^38 which we don’t even have a unit for yet. Apparently it’s 100 million queccabytes.
You can get that down to two million queccabytes if you use a Bloom filter.
Probably not but, would love to be proven wrong. To get this stuff to work you have to add a ton of garbage data to the end of the file from what I understand which you can't really do with a string.
There's not _that_ many encodings that feel natural enough. Maybe: bytes, ascii/utf-8/latin1/etc. (all equiv.), utf-16, utf-32, maybe ebcdic.
I guess it should be pretty likely to exist if you try them all, but the search is likely very computationally difficult unless I'm forgetting some particular weakness in md5 (quite possible).
There are more options if you consider whitespace. You can separate the digit pairs with single or double spaces (or in groups of 4, etc. etc.), and end the line with nothing, or \n, or \r\n
True, but you are mapping X items onto X space. The odds of a hit, if the algorithm allows it is already reasonably high, add four more encodings and I'd put money on one hitting.
You can also mine Bitcoin with SAT solvers. And if the difficulty is high enough, it will be more efficient than the current brute force methods methods
> A very intriguing, and perhaps unintuitive property of the algorithm proposed is that with increasing bitcoin difficulty, or equally lower target, the search could become more efficient, at least in theory. This is because we can assume more about the structure of a valid hash -- a lower target means more leading zeros which are assumed to be zero in the SAT-based algorithm.
> If this is true and is a substantial effect then this is an important issue since the rate of the money supply is regulated with the difficulty. The higher the difficulty, the less likely it is that an individual block has a valid nonce. However, conventional mining algorithms always have to perform the same amount of work (i.e. try 4 billion nonce values) to reject a block. On the other hand, "Satcoin" miners will get progressively faster which could lead to imbalances in the controlled supply. These are all just speculations and depend on many factors, first and foremost how well the SAT-based approach can be improved and whether the probability of finding a valid nonce does not dwarf the efficiency gain of the algorithm.
The relevance/connection of the following now escapes me but I already wrote it so here it is.
I want to detect cheating in high level chess by estimating the humanness of chess moves with AI. The idea is simple. Take a huge database of GM matches, run stockfish on each position and train a classifier to map position+move to human/fish/both. Hopefully this can be done with less compute than coming up with the moves themselves
The problem is that if "fishy" moves can be identified, they can also be avoided. This could be done with adversarial training or just by using the above as a filter on stockfish output.
Are there any quines where an image or something contains not only one representation of MD5 of itself but also a MD5 of same image but without a quine (2 hashes)?
If not, is that any harder considering how badly MD5 is broken? For example, BMP image format seems perfect for this kind of quines because user can check one predetermined hash, that edit image in any simple image editor and check another predetermined hash.
> For example, BMP image format seems perfect for this kind of quines because user can check one predetermined hash, that edit image in any simple image editor and check another predetermined hash.
That is assuming there is only one BMP version, which is not true.
Yup - it's possible to do with any hash algorithm. You can also do it with any hash value. A key point here is that there are a ton of pixels in this image whose value can be changed slightly without disturbing the image. The # of variants possible quickly starts to dwarf the # of valid hash values possible, to the point where you can find the hash eventually, if you just keep looking.
Given that an md5 hash is 128 bit, by simple brute force, you would need to try 2^128 variants (in expectation) to find the matching hash. Thats rougly 10^38 combinations. That's a huge number, I can't imagine that the author simply hashed variants until one happened to match, there must habe been some other technique involved.
Neat. This reminds me (somewhat obliquely) of a shareware library in the 80's that allowed a c-compiled binary on windows to be aware of its own CRC-32. The idea was to prevent hex-tampering.
Developer writes some code and publishes it. It's big, so he puts it on an untrusted CDN, and also publishes an MD5 hash of the code (not via the CDN).
User downloads the code from the CDN, and verifies the published hash matches.
A malicious CDN couldn't make an evil file with a matching hash, based on known attacks against MD5, unless they could influence the Developer to get certain data into the original file.
Then the CDN, by definition, would control the data that the end-user (downloader) hashes.
But I understand the confusion: londons_explore meant to write "there are no known (practical) preimage attacks" against MD5, which is true, since the only theoretical preimage has a complexity 2^123 or so.
Trying to make md5 safe is annoying enough, and has few enough benefits, that it's basically a waste of time.
I _think_ that would do it though, if your salt is private and secure enough and you apply it the right way. I easily could be missing an attack though, so take with a large grain of salt (heh).
Trivially broken. Ignoring the adler32 and crc32 checksums, I can retroactively adjust the hex digits in the image without affecting the resultant hash, even with a suffix - as long as I know what the final hash is.
I was talking about basic string hashing (not images), so I guess that would be even simpler.
> I can retroactively adjust the hex digits in the image without affecting the resultant hash
My point is: I give you a hash H, can you generate a collision by finding a string X, so that when I append an (unknown to you) salt S then MD5(X + S) = H.
EDIT: To make it clear, the only feedback you get when you try X is whether the final hash matches, you don't get the resulting hash each time.
MD5 is not broken in the way you think it is. Even without salting if you give a hash H there is basically no way to recover a string that hashes to it other than randomly trying strings 2^128 times.
What you can do trivially is find 2 strings X1 and X2 such that md5(X1) == md5(X2). In this case seeding the way you described won't help because md5(X1+S) will equal md5(X2+S) due to the way MD5 works
Assuming that you are able to easily trigger the hashing with salt enough times for you to adjust the image... right? Any idea (ballpark) how many such test iterations would you need?
MD5 is vulnerable to both length extension and chosen prefix collisions, so I think an attacker can generate a collision even without knowing the prefix, just by observing a single output. That said, I'm also pretty sure HMAC-MD5 is still unbroken in practice.
I disabled replies to save server resources, it would probably be down right now, otherwise. If you have access to another fediverse instance, you should be able to view them that way.
Good job, can we just try to make sure this is the last MD5-insecurity demonstration effort ever? Anyone who is not convinced is never going to be convinced. At the same time, MD-5 or even CRC-32 unintentional collisions are too unlikely to worry about in most practical scenarios. Future computational efforts would do well to improve tampering resistance or unintentional corruption detection rather than re-litigating the obvious.
The pleroma instance linked in the OP is hosted on a very tiny VPS with no CDN, I fear it may fall over - if it does, consider swapping to the twitter URL.
Direct links to the image itself:
https://retr0.id/media/a13f403f-fff5-4f40-b9a2-13cce355f61b/...
https://pbs.twimg.com/media/FdUxWg-XkAE5FBx?format=png&name=...