Hacker News new | past | comments | ask | show | jobs | submit login
The image in this post displays its own MD5 hash (retr0.id)
766 points by kstrauser on Sept 23, 2022 | hide | past | favorite | 130 comments



Hi everyone - I go into slightly more detail on my twitter thread on the same topic: https://twitter.com/David3141593/status/1573218394358386688 (Yes, the PNG also survives being uploaded to twitter)

The pleroma instance linked in the OP is hosted on a very tiny VPS with no CDN, I fear it may fall over - if it does, consider swapping to the twitter URL.

Direct links to the image itself:

https://retr0.id/media/a13f403f-fff5-4f40-b9a2-13cce355f61b/...

https://pbs.twimg.com/media/FdUxWg-XkAE5FBx?format=png&name=...


Neat. Following some links:

> Since there are 36 pairs of blocks, there are 2^36 possible combinations, for each message. If we enumerate them all and check the resulting CRC, we'd have a reasonable chance of stumbling across the target CRC (0xdeadbeef) by chance - since there are only 2^32 possible CRC32 results.

> This is a feasible computation, but it would be a bit slow, probably on the order of minutes - infeasible as part of the inner-loop of a larger attack (such as a PNG hashquine!). Fortunately, there's a trick we can use to speed this up. [Meet in the middle.]

Wait a sec, CRC32 (and Adler32 IIRC) are linear — they are the “sum”, for all positions i in the input, of f(i, message[i]). (The actual details of f are irrelevant here.). For CRC32, “sum” is xor and, for Adler32, I think it’s the sum of pairs of u16 values. (No serious Galois field arithmetic needed — the Galois sum in the representation used by CRC32 is xor.)

So those 36 pairs each give a pair of vectors in the 32-dimensions space of checksums. Start with the message with all the a blocks. One at a time, switch a single block to its b value and calculate the change in the checksum, which can be done from the command line :). Each of those changes is a vector in checksum space, and they should form a basis with respectable probability m. You know the change you want to produce in the checksum from the all-a state to the correct checksum, and you can solve your underdetermined system of 32 equations with 36 unknowns using your favorite technique.

(Hmm. I haven’t written this out for real. I’m fairly confident it will work for CRC, since CRC’s checksum space has order 2 — solving the linear system will give 0 or 1 for each variable (i.e. colliding block), and you can read off the answer. For Adler32, the space is order 2^16, but you need a solution that only applies each available perturbation zero or one times. Off the top of my head, there are probably some good algorithms for this, but it’s maybe not trivial.)


> you can solve your underdetermined system of 32 equations with 36 unknowns using your favorite technique

What's the time complexity of your favorite technique?

(I'm a bit out-of-the-loop on the specifics of equation solving techniques - I let Z3 do this kind of thing for me, and take an educated guess on whether I'll finish during my lifetime or not)


Just naive Gaussian Elimination works in any field.


With what time complexity?


O(n^2 * w) where n is the number of equations and w is the number of elements in an equation.


I see. So in this instance, it's O(32^2 * 36)? That is better than 2^19, depending on implementation efficiency.


You’re missing the constant factors. Gaussian elimination on a matrix containing 32-bit elements using xor involves operations that likely take a single cycle. The brute-force and meet-in-the-middle operations involve CRC-ing an entire message, which is vastly slower. (Or it involves optimizing the CRC using essentially the same linear algebra technique, at which would it would be pretty silly to solve a linear system by exhaustive search instead of algebra…)

I would expect the linear algebra approach to win by orders or magnitude.


I did my meet-in-the-middle using a search tree, so you don't need to re-CRC the whole message at once.


Well to make the playing field equal it would be 2^19 times the time to check a solution, which is another factor of 36 times a fairly large constant C. So really you're comparing 2^19*C to 32^2.


> you need a solution that only applies each available perturbation zero or one times

I'd usually go to wagner's k-tree algorithm for a modular subset sum problem.

But these are pretty small instances and you shouldn't need to use a particularly memory efficient algorithm. So for example, splitting the space of all choices in two, enumerating all combinations on each side and searching for a meet-in-the middle should also work.


This is wildly impressive. Thank you for doing and sharing this, and the 1337 on either end is just icing on the cake. It's also a beautiful example of just how broken MD5 is. I wish I'd had this to show my students 3 weeks ago when teaching about collision attacks.

Looking forward to your next project: SHA3-512 :D


Was the 1337 at either end of the hash intentional?


Yes


Sorry if I overwhelmed your VPS! But seriously, this was super impressive. Well done!


It's still alive, just barely!

  CPU[*****************************************************************************100.0%]   Tasks: 49, 30 thr; 1 running
  Mem[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||****641M/768M]   Load average: 2.51 2.12 2.01
  Swp[||||||||||||||||||||||||                                                  215M/768M]   Uptime: 166 days(!), 19:40:07


My Mastodon instance cries in how little RAM you're able to run that on. I'm envious.


Stuff like that is turning into one of our running gags at work. You have all these machine learning processes and our legacy java applications gobbling up memory by the dozen of gigabytes, or our newer java apps being a bit less greedy but still gigabyte sized - and it's never enough.

And then we have something like our zabbix proxies, and when they passed a couple tens of thousands of items, we had to increase it's cache memory... from 16MB to a glorious 64MB. Such a splurge. And the server is using a whole 128MB for its write caches. Or our Grafanas are using a total of about 500MB of memory server side total to chew through oodles of data.


It's still ridiculous that we ever thought giving Java processes so much memory was normal. I'm very happy golang and rust are getting more main stream.


Indeed. In some cases, I find the memory need to be plausible. For example, our logging ES has nodes running with 32GB of heap, but these nodes are indexing and searching ~900GB of logs each. That's, per node, actually in line with our larger postgres instances, running at 32 - 64GB or memory on 600GB - 800GB of dataset. Or our logstash needs 4-8G go go through 2k messages per second.

But then I have other stuff running on huge node and struggling to process 10 messages per second and falling over whenever load increases by 10%.


I'm legitimately curious whether zram would help or make things worse given this specific configuration.


> 768M

(╯°□°)╯︵ ┻━┻


I keep imagining the fantasy scenario where some day I stumble on a software download or repo or something. Nothing special, no fanfare, just a random piece of code to do some random task.

Inside the repo or zip, is a simple text manifest file. The file has some bog standard readme stuff and then lists all the files and their hashes, you know, so you can check nothing's corrupted whatnot.

But in that list of files, is a line item for the manifest file itself and along with its own hash! Something that on the surface looks completely innocuous but becomes profoundly impressive as you ponder it.

Kind of a low-key [fridge brilliance](https://tvtropes.org/pmwiki/pmwiki.php/Main/FridgeBrilliance) kind of flex.

[Bonus] I'm also reminded of this paper: https://vision.cornell.edu/se3/wp-content/uploads/2014/09/ge...


Sounds like the basilisk collection: https://suricrasia.online/unfiction/basilisk/


In reading that page, I was super taken in and immediately wanted to get a copy of the collection; trying to follow the links left me sad to realize it's a fictional story, but it's an amazing piece of fiction none the less.


Yeah, it reminded me of the Three-Body Problem... "hard" science fiction that's believable as an extension of what we currently know


That would be a great, and terrifying at the same time, first contact response message to us.

proof-of-superiority if you will.


One kind of that was when DRM rolled out... my memory is fuzzy on what this is as it's a vague recollection of an article once read and I hope someone will recall it and link it.

But in the cat and mouse between games trying to introduce DRM and this being instantly defeated each time because the DRM code was being removed by gamers/hackers... a company started making patches, and at the end of the patch was a bunch of what looked like garbage code, binpacked to create a certain size patch. It look erroneous. But patch arrived, then weeks and months later another patch, and so it went all the while the cat and mouse continued. Then finally DRM was enabled, and the question was how? How was this done when it hadn't been caught before? The answer was that all those fragments of code weren't innocuous or meaningless, it wasn't binpacking at all... the DRM code had been split into many fragments, shipped disguised as binpacking over a lot of patches, and by the time this was figured out the whole package had been delivered.

I'm not for copyright or DRM, but I did think that an elegant move.



same category as this. See if you can spot what's so amazing about it. https://gamefaqs.gamespot.com/snes/588741-super-metroid/faqs...


I had to Google what was so special about this guide. Now that I know, I'm trying to find some justification to spoil it in this thread.


Clearly you didn't need to justify spoiling it anyway :)


smooth :)


I didn't Google it, but when it hit me, so did your comment!

It's like the realization arrived at the same time from two different directions. Thank you!

(One of the FAQ questions in the guide helped)


If it took you a lot of time to notice it like I did: <spoiler> it's the paragraph justification


Even with this spoiler I didn’t understand it, but then I read https://twitter.com/mikko/status/1383390503635324928?lang=en and then I understood.


What's the point of a <spoiler> tag when it's impossible to not read the spoiler at the same time.

Rot13 it or something at least (unicode upside-down, etc).


op actually edited the comment like 4 time to try and make the spoiler tag work. So op, for what it's worth, ROT13 (google) is a good idea for spoilers in the future :)


Aw, the other spoilers here were at least fun.


Wow. First I thought it was only the disclaimer, but it's all the way down. How much work must that have been!?


All the way down, but sadly there appears to be an off-by-one line in the Credits and Thanks section.

As a bonus challenge: fix the spelling of "missile" (mis-spelled as "missle" throughout, in 200+ occurrences), while maintaining the layout.


And it predates any kind of AI


> But in that list of files, is a line item for the manifest file itself and along with its own hash! Something that on the surface looks completely innocuous but becomes profoundly impressive as you ponder it.

Agree this would be cool. But there would be one last realisation after contemplating it for just a little longer. The sheer number CPU cycles that would have to have been committed to achieving it would be awful. Like Bitcoin mining but orders of magnitude even more wasteful.


Or they "just" broke the hash algorithm.


Trying to find that hash by iterating on the manifest hash (as opposed to e.g. manipulating whitespace etc) would eventually succeed approximately 1/e of the time. Better get cracking!


If you define HASH(data) = SHA3(data) xor SHA3(manifest), then the manifest will always have a zero hash. This trick can be useful if you need to create a cycle in a hash-addressed list (e.g., an object-oriented class hierarchy where inheritance is by hash of the class definition—the definition of the base class would have a zero hash and could inherit from itself).


To me this can be done with a simple collision attack (assuming you can fiddle with some bytes inside the manifet file while freezing everything else in advance), which can be found under a second for MD5 with a laptop, and a few hundred thousands dollars of cloud resources for SHA1.


No, that takes a preimage attack, which doesn't exist (yet? I'm not too optimistic, personally).

It could maaaaaaaybe be done using multiple collisions that exploit the structure of a DEFLATE-compressed stream, so that you can control the extracted zip contents on a byte-by-byte basis - but I haven't figured that out just yet. Watch this space!


I don’t understand how this is a pre-image attack, as the manifest file only references itself (and not the zip file) and you can fiddle with the manifest file to your liking. To me this is the same theoretical problem as this self-referencing PNG file.


How do you fiddle with the manifest without changing the MD5?


You do want to change the MD5 of the manifest. This is what makes it a collision attack instead of a preimage attack.


Change it to what? A specific pre-determined value? That's a preimage.


This is indeed a preimage attack if the manifest content (besides its own self-referenced hash) is fixed. However this is not the case in practice: to pull off this trick you could just append some random bytes at the end of the manifest, disguised as ASCII art or something like that. The manifest would still be human readable and correct, but this would become a collision attack.

Again, to me this is the exact same problem as this self-referential PNG file, which is a very cool trick but which can be (demonstrably) computed with limited compute resources.


Appending a suffix to try to meet a specific hash value is equivalent to preimage (and is not currently possible)


One last comment though: I didn’t realize you were the author of the post (great work!!). This let me think you know your stuff, and you know something that I don’t and I need to think of all of that more carefully. So it is very probable you are right and I am wrong. Thanks for the discussion!


Again, the idea is not to find a specific hash value, but any $hash for which the property md5($manifest_content, $hash, $random_bytes) = $hash is true. You don’t need to match a specific hash value.

And you never answered how this manifest is somehow different than the self-referential png.

It seems we do not understand each other (unfortunately HN comments are not the best avenue for deep discussions) so this will be my last post on this thread as we both have better things to do than talking past each other.


Thank you for the bonus! This is the best kind of humor :)


Related, from earlier today: "MD5 Collision with CRC32 Preimage" https://news.ycombinator.com/item?id=32956235


From the discussion:

> This was particularly tricky to make work because the image data in a PNG needs to have a valid adler32 checksum, and a valid crc32 checksum.


In hindsight, I really regret not including the adler32 and crc32 in the image itself too, since I knew them ahead of time. If you inspect them with hexeditor, you'll find they have non-random values :)


But to make up for it you started and ended with 1337 so all is forgiven


Does this mean you could in theory make the adler32 and crc32 part of the md5 hash as well?


I could, but I'd have to set adler32==crc32 (doable), otherwise the md5 bruteforce would be too difficult.

In fact, I could do this fairly quickly without needing to recompute much at all...


Is there an MD5 hash string that hashes to itself?


Neat question. The best answer seems to be an HN thread from 13 years ago (!) which posits that the chance of one existing is ~63%:

https://news.ycombinator.com/item?id=614079

However, it appears no one has actually discovered it yet, if it exists.

A more tractable question might be to find a cycle in the MD5 hash space, like a->b->c->d->a. So one might ask, what is the shortest MD5 cycle found so far?


Funny timing, I was joking about this percentage mere days ago - https://retr0.id/notice/ANnYouw6w2di8XvcNE


Are there any other examples of this?


I think it's mostly a joke because the negation of a predicate is a predicate too, and they can't both have the same probability unless it's 50%.


Follow-up -- If you repeatedly do MD5(MD5(MD5(something))), how long will it take to find a repeat value?

Assuming MD5 is random, this is the birthday problem. Since MD5 is 128 bits, you need about 10^19 hashes to get a repeat. [1]

The MD5 hashrate on an RTX 3090 is about 65 billion hashes per second. [2]

Dividing that out, about 4 GPU-years would be needed to find a repeat in this brute-force way.

[1] https://en.wikipedia.org/wiki/Birthday_attack#Mathematics

[2] https://gist.github.com/Chick3nman/e4fcee00cb6d82874dace7210...


You could use a cycle-detection algorithm like the tortoise and hare, to avoid needing to keep a list of previous results etc. https://en.wikipedia.org/wiki/Cycle_detection#Floyd's_tortoi...


Not even that many. With MD5 and at least through SHA-2 you can build the search tree based on a branching structure, where you clone the in-process hash as you go, so that for instance all of the hashes that differ only by the last byte share one partial hash and then feed in the last digit.

As long as you can keep 31 partial hashes in memory at a time, which is trivial, you don't have to rerun the work so far to increment the last digit by 1. I think I may have written the code for this at one point. I'll have to look around and see if I can find it.


That doesn't apply to this situation.


> Is there an MD5 hash string that hashes to itself?

That’s calculable in cpu hours, not GPU years. If the answer is one then you’re done. If it’s not then you test N=2 in 4 GPU years, rather than 8. As N increases the answer becomes less interesting, and the cost goes quadratic.


I'm not following. Let's assume N=1, what's the pseudocode to find it in cpu-hours?


Isn't the GPU doing that in parallel? If you want to feed the output of the hash into itself, I think it'll be much slower.


You'd need to run many parallel processes like MD5(MD5(MD5(thing1))), MD5(MD5(MD5(thing2)))...

I think the math still works if you do that. But the cycle would have to be contained entirely within one of the individual runs. So that could potentially take longer.

The tricky part would be combing through the 160 exabytes of output to find the cycle.


So I’m getting much bigger than 160 exabytes. Can you show your math?

MD5 is 128 bits so that’s 2^128 keys. And it’s 32 bytes for the output, so that’s really 2^133 bytes. Which is 10^38 which we don’t even have a unit for yet. Apparently it’s 100 million queccabytes.

You can get that down to two million queccabytes if you use a Bloom filter.


You don't have to generate every possible md5. Just ~10^19 keys is enough to probably have a cycle.


> The tricky part would be combing through the 160 exabytes of output to find the cycle

I don't think you get to leave that out of the O(?) of this problem


You can, since there are cycle detection algorithms that do not increase the big-O complexity at all (constant factor).


Sounds a bit like proving there are no cycles in the 3n+1 equation. https://en.m.wikipedia.org/wiki/Collatz_conjecture. Veritasium has a good video about it.


Probably not but, would love to be proven wrong. To get this stuff to work you have to add a ton of garbage data to the end of the file from what I understand which you can't really do with a string.


It is interesting that the OP used “hash string” instead of MD5 value. I’d wager that if you dove into string encodings that you would find some.


There's not _that_ many encodings that feel natural enough. Maybe: bytes, ascii/utf-8/latin1/etc. (all equiv.), utf-16, utf-32, maybe ebcdic.

I guess it should be pretty likely to exist if you try them all, but the search is likely very computationally difficult unless I'm forgetting some particular weakness in md5 (quite possible).


There are more options if you consider whitespace. You can separate the digit pairs with single or double spaces (or in groups of 4, etc. etc.), and end the line with nothing, or \n, or \r\n


Good call, \n and \r\n feel like good ideas. Separating digits feels like it'd look a little cheap to me, but I'd still be impressed.


True, but you are mapping X items onto X space. The odds of a hit, if the algorithm allows it is already reasonably high, add four more encodings and I'd put money on one hitting.


Yeah I think it's essentially independent dice rolls, that's gotta put it well into the 90%s


I did a ELI5 writeup on this. Hope someone finds it useful: https://tongwing.woon.sg/blog/image-which-displays-its-own-m...


I found it useful. Thank you so much.


Thank you!


Someone did a GIF version before: https://github.com/Rogdham/gif-md5-hashquine



Related: Inverting hash functions using SAT and SMT solvers [0]

[0] https://blog.lse.epita.fr/2012/07/31/using-sat-and-smt-to-de...


You can also mine Bitcoin with SAT solvers. And if the difficulty is high enough, it will be more efficient than the current brute force methods methods

https://jheusser.github.io/2013/02/03/satcoin.html

> Bitcoin Difficulty and Assumptions

> A very intriguing, and perhaps unintuitive property of the algorithm proposed is that with increasing bitcoin difficulty, or equally lower target, the search could become more efficient, at least in theory. This is because we can assume more about the structure of a valid hash -- a lower target means more leading zeros which are assumed to be zero in the SAT-based algorithm.

> If this is true and is a substantial effect then this is an important issue since the rate of the money supply is regulated with the difficulty. The higher the difficulty, the less likely it is that an individual block has a valid nonce. However, conventional mining algorithms always have to perform the same amount of work (i.e. try 4 billion nonce values) to reject a block. On the other hand, "Satcoin" miners will get progressively faster which could lead to imbalances in the controlled supply. These are all just speculations and depend on many factors, first and foremost how well the SAT-based approach can be improved and whether the probability of finding a valid nonce does not dwarf the efficiency gain of the algorithm.

https://www.quora.com/Are-SAT-solvers-faster-than-linear-sea...

https://github.com/jheusser/satcoin an implementation


The relevance/connection of the following now escapes me but I already wrote it so here it is.

I want to detect cheating in high level chess by estimating the humanness of chess moves with AI. The idea is simple. Take a huge database of GM matches, run stockfish on each position and train a classifier to map position+move to human/fish/both. Hopefully this can be done with less compute than coming up with the moves themselves

The problem is that if "fishy" moves can be identified, they can also be avoided. This could be done with adversarial training or just by using the above as a filter on stockfish output.


Impressive! This is the true “hacking” spirit!


What's the estimated compute cost currently to do this for SHA-1?


This definitely gets extra points, made me snicker... twice. 1337!


Are there any quines where an image or something contains not only one representation of MD5 of itself but also a MD5 of same image but without a quine (2 hashes)?

If not, is that any harder considering how badly MD5 is broken? For example, BMP image format seems perfect for this kind of quines because user can check one predetermined hash, that edit image in any simple image editor and check another predetermined hash.


> For example, BMP image format seems perfect for this kind of quines because user can check one predetermined hash, that edit image in any simple image editor and check another predetermined hash.

That is assuming there is only one BMP version, which is not true.


Is this possible only because md5 is weak? Is it possible to do the same with sha2 or sha1?


Yup - it's possible to do with any hash algorithm. You can also do it with any hash value. A key point here is that there are a ton of pixels in this image whose value can be changed slightly without disturbing the image. The # of variants possible quickly starts to dwarf the # of valid hash values possible, to the point where you can find the hash eventually, if you just keep looking.


Given that an md5 hash is 128 bit, by simple brute force, you would need to try 2^128 variants (in expectation) to find the matching hash. Thats rougly 10^38 combinations. That's a huge number, I can't imagine that the author simply hashed variants until one happened to match, there must habe been some other technique involved.


Yes. Not yet.


Neat. This reminds me (somewhat obliquely) of a shareware library in the 80's that allowed a c-compiled binary on windows to be aware of its own CRC-32. The idea was to prevent hex-tampering.


Originally I through blockchain is a linked list that points to the hash of next block.

(The actual blockchain block points to it's previous block)


This looks impressive.

Is an MD5 hash still "safe" if you use a salt? Can an attacker generate a collision having the MD5 hash without knowing the salt?


There are no known attacks against MD5 as long as the data you hash is not controllable by the attacker.

You should still use a different hash algorithm though.


That's like saying "There are no known attacks against MD5 as long as you don't try to attack it"


Usecase where this property is useful:

Developer writes some code and publishes it. It's big, so he puts it on an untrusted CDN, and also publishes an MD5 hash of the code (not via the CDN).

User downloads the code from the CDN, and verifies the published hash matches.

A malicious CDN couldn't make an evil file with a matching hash, based on known attacks against MD5, unless they could influence the Developer to get certain data into the original file.


"malicious CDN couldn't make an evil file"

Then the CDN, by definition, would control the data that the end-user (downloader) hashes.

But I understand the confusion: londons_explore meant to write "there are no known (practical) preimage attacks" against MD5, which is true, since the only theoretical preimage has a complexity 2^123 or so.


Not really. This actually provides useful properties in practice, though you should still probably not rely on that.


Trying to make md5 safe is annoying enough, and has few enough benefits, that it's basically a waste of time.

I _think_ that would do it though, if your salt is private and secure enough and you apply it the right way. I easily could be missing an attack though, so take with a large grain of salt (heh).


> Can an attacker generate a collision having the MD5 hash without knowing the salt?

Depending on how the salt is applied, yes.


Let's say I append 128 random bytes to the input string.


Trivially broken. Ignoring the adler32 and crc32 checksums, I can retroactively adjust the hex digits in the image without affecting the resultant hash, even with a suffix - as long as I know what the final hash is.


I was talking about basic string hashing (not images), so I guess that would be even simpler.

> I can retroactively adjust the hex digits in the image without affecting the resultant hash

My point is: I give you a hash H, can you generate a collision by finding a string X, so that when I append an (unknown to you) salt S then MD5(X + S) = H.

EDIT: To make it clear, the only feedback you get when you try X is whether the final hash matches, you don't get the resulting hash each time.


MD5 is not broken in the way you think it is. Even without salting if you give a hash H there is basically no way to recover a string that hashes to it other than randomly trying strings 2^128 times.

What you can do trivially is find 2 strings X1 and X2 such that md5(X1) == md5(X2). In this case seeding the way you described won't help because md5(X1+S) will equal md5(X2+S) due to the way MD5 works


Assuming that you are able to easily trigger the hashing with salt enough times for you to adjust the image... right? Any idea (ballpark) how many such test iterations would you need?

Great work btw!


Approximately 1


MD5 is vulnerable to both length extension and chosen prefix collisions, so I think an attacker can generate a collision even without knowing the prefix, just by observing a single output. That said, I'm also pretty sure HMAC-MD5 is still unbroken in practice.


Can someone explain why the image displays a Dragon that eats itself? Mere coincidence?


It's a reference to quine. A program which has it's own source as output. A fun brain teaser that I recommend trying


Great, thanks. Now this all makes sense.


All the reply links other than #2 don't work. Any idea?


I disabled replies to save server resources, it would probably be down right now, otherwise. If you have access to another fediverse instance, you should be able to view them that way.


This is beautiful. Douglas Hofstadter would be proud. 1337 at the end got me.


Hy


Good job, can we just try to make sure this is the last MD5-insecurity demonstration effort ever? Anyone who is not convinced is never going to be convinced. At the same time, MD-5 or even CRC-32 unintentional collisions are too unlikely to worry about in most practical scenarios. Future computational efforts would do well to improve tampering resistance or unintentional corruption detection rather than re-litigating the obvious.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: