The Curious Case of MD5 (katelynsills.com)
230 points by w4lker on Jan 3, 2024 | 164 comments



The unsatisfying answer to this is probably that it just doesn't matter. It's not as if evidence chain of custody is assured cryptographically; it's assured by rules and regulations and an adversarial system. If you tried to submit as evidence a forged document vouched for by a colliding MD5 hash, you'd be putting your own freedom at risk, because the forgery will be straightforwardly detectable (the real document won't have hash-collision artifacts in it).

None of this is to say that the legal profession shouldn't move to SHA2; it should.


You say to use SHA2, but TFA says to use SHA3 or Blake. I think your recommendation is the better one, but I feel like teasing out why because it's interesting.

Firstly ... the NIST recommendation TFA links doesn't just recommend SHA-3; it actually says "Federal agencies should use SHA-2 or SHA-3 as an alternative to SHA-1." SHA-2 and SHA-3 are both valid hash functions recommended by NIST. And while 3 is higher than 2, and SHA-3 is newer, in this case that doesn't mean "better". Being based on Keccak and a sponge construction, SHA-3 provides "diversity" more than "improvement".

Secondly ... SHA2 is widely implemented in existing hardware, and it's currently just more efficient (and likely to remain so). So why waste power, especially on something you'll be doing in bulk?

O.k., so that's why SHA-2 and not SHA-3. But Blake is worth avoiding IMO ... because FISMA says that for Federal work, you have to use one of the functions NIST recommends in FIPS. Obviously the legal profession needs to be able to practice in Federal courts (Article III and administrative) ... so if you're going to pick a new standard, pick one of those (but not SHA-3!).

Lastly, and this is really an aside ... it's not uncommon for some folks to think SHA3 and SHA384 are the same thing, but they are not. SHA384 is just a variant of SHA-2 with a 384-bit digest length and a correspondingly improved security margin. Other Federal standards, like CNSA, separately recommend SHA384 as a good minimum ... so it can be confusing, and I think it's understandable why some people assume SHA3 is just short for it.


BLAKE3 is faster than hardware accelerated SHA-2 because the tree mode used in BLAKE3 allows hashing parts of a single message in parallel (with SHA-2, parts of a single message have to be hashed one after another, and parallelism is only used in workloads where you process multiple messages at the same time).

https://github.com/minio/sha256-simd

https://github.com/BLAKE3-team/BLAKE3
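
For a rough illustration of the difference (a sketch only; it assumes the third-party blake3 package from PyPI, and whether the hashing is actually multithreaded depends on the package version and arguments):

    # Minimal sketch: hash one large message with SHA-256 (hashlib) and with
    # BLAKE3 (third-party `blake3` package). BLAKE3's tree structure is what
    # lets implementations split a single message across threads; some versions
    # of the package expose this via a max_threads argument.
    import hashlib
    import time
    from blake3 import blake3

    data = b"\x00" * (256 * 1024 * 1024)  # one 256 MiB message

    t0 = time.perf_counter()
    sha_hex = hashlib.sha256(data).hexdigest()
    t1 = time.perf_counter()
    b3_hex = blake3(data).hexdigest()
    t2 = time.perf_counter()

    print(f"sha256: {t1 - t0:.3f} s  {sha_hex[:16]}...")
    print(f"blake3: {t2 - t1:.3f} s  {b3_hex[:16]}...")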


SHA-3 (and also BLAKE2 or BLAKE3) is definitely more secure than SHA-2 for very large documents.

The security (i.e. the difficulty in finding collisions) decreases with the length of the document for SHA-2 and it stays constant for SHA-3.

Nevertheless, it is unlikely that typical legal documents are big enough for this to matter, except when the hashes are computed over e.g. entire seized HDDs or SSDs, so SHA-2 is an acceptable choice for replacing MD5.

The only reason why SHA-3 has not become widespread is that, like AES, it requires hardware support for good performance, but for some reason Intel has not added an SHA-3 instruction to the x86 ISA. Arrow Lake S, expected to be launched at the end of this year, will add support for SHA-512 and for the Chinese standard hashes, but there is still no intention to add SHA-3 (as Arm already has).

Neither AES nor SHA-3 is recommendable on CPUs without dedicated instructions, due to low performance, but with hardware support they become faster than any alternatives. The difference between them is that nowadays only the cheaper microcontrollers lack AES support, while SHA-3 support is still seldom encountered.


> The security (i.e. the difficulty in finding collisions) decreases with the length of the document for SHA-2

Could you spell this part out for me?


Twenty years ago this paper was considered surprising, and, together with a handful of other attacks and the concrete attacks against MD5 and SHA-1 achieved by some Chinese researchers, it prompted the organization of the SHA-3 competition.

John Kelsey and Bruce Schneier: "Second Preimages on n-bit Hash Functions for Much Less than 2^n Work"

https://eprint.iacr.org/2004/304

The abstract at this link provides the essential results.

"We provide a second preimage attack on all n-bit iterated hash functions with Damgaard-Merkle strengthening and n-bit intermediate states, allowing a second preimage to be found for a 2^k-message-block message with about k * 2^(n/2+1) + 2^(n-k+1) work. Using SHA-1 as an example, our attack can find a second preimage for a 2^60 byte message in 2^106 work, rather than the previously expected 2^160 work."

Besides this result, there is also the previous result obtained by Joux for multi-collisions, which also become easier for longer input data (Antoine Joux: "Multicollisions in Iterated Hash Functions. Application to Cascaded Constructions").


Thanks! I've added a note about this here: https://github.com/oconnor663/bao/issues/41#issuecomment-119.... Does that sound like an accurate summary to you?


It is OK, but the problems demonstrated by these attacks and others similar to them are specific to the linear chaining of a great number of blocks, where the same hashing function is applied to all blocks.

Somehow incorporating a block counter in the block hashing function, or using a sponge structure, is equivalent to using a different hashing function for each block, which stops the attacks.

Tree hashing in which the final branches, i.e. the chunks, are short, so that only a small number of blocks are hashed in cascade with the same function, behaves differently from linear block chaining, because the hashing function must include additional inputs anyway, so the blocks in a chunk are hashed differently from the blocks that merge tree branches or from the root block.

Whether any of the attacks that exist for simple Merkle-Damgaard cascading like SHA-2 are applicable to a tree hash depends on how the blocks are encoded and on the structure of the tree. In any case, the probabilities of success will be different in the case of a tree hash.

I have never analyzed whether the BLAKE2 or BLAKE3 tree hashes could be secure enough without block counters. In any case, the block counters, which are absolutely necessary for linear block chaining, also increase the strength of tree hashing against any attacks and their added overhead is minimal, so there would be no reason to remove them.

A hash computed with a sponge structure, like SHA-3 or KangarooTwelve, is equivalent to a CBC-MAC where the key is updated at each block, instead of being constant. Because each block is hashed with a different key, there is no need for an additional counter input to differentiate the hashing functions. A sponge structure is much simpler than the structure used by BLAKE, but it needs a wider invertible mixing transformation, e.g. for a similar strength BLAKE2b uses a 1024-bit mixing transformation, while SHA-3 uses a 1600-bit mixing transformation. (BLAKE3 trades off strength for speed, so it is not comparable with SHA-3, but only with KangarooTwelve.)


It feels like you're just hunting for complexity a bit here. Any of SHA2, SHA3, or Blake are viable options.

I'd only let NIST and FIPS drive my crypto choices if I was being actively forced to do so by a government use case, and even so I'd be praying for the day that FIPS joins us in the modern era and stops tethering us to acronyms they could enumerate a decade ago.


He's hunting for complexity because it's interesting to talk about!


Lawyer here. This is right. Unlike some of the typical software use cases, md5 is usually not the only thing stopping you from falsifying evidence.

The records being produced almost always exist, and the hash is being used to differentiate which is which and whether the copies you got are correct.

If evil litigant produces fake documents instead of real ones, and you can't get access to the system they came from, no hash will save you - they can produce any hash they want and you can't verify against an original.

If you can verify against the original, forensics isn't going to look at a hash and say "hash matches my job is done".

Most of the time what matters is whether the evidence exists or not.

The tobacco companies lied and claimed they never had any evidence in the first place.

Assume instead they wanted to falsify the data to say that, as far as they could tell, smoking did not cause cancer.

This is not a hashing problem. This is a problem of making up fake studies that look real.


This is more or less what I learned when I worked on forensics software, the kind that was supposed to maintain this kind of chain of custody/integrity. Like most things that touch the legal system, the presumption is that dishonesty or unsoundness in the chain of custody is fundamentally a legal problem with legal recourses, not something that can be solved with math.


I quite like the licensing trick Nintendo used on the Game Boy as an example of this. [0]

Essentially, the Game Boy expected a bitmap of the Nintendo logo to be present on the cartridge ROM, which was shown on screen at boot. It had to match a version stored on the Game Boy itself or else the game wouldn’t start.

The thinking (which I’m not sure was ever tested) was that someone producing a game that tried to trick consumers into thinking it was an official Nintendo product would be liable for damages in a trademark lawsuit. Since the game would never start without an official Nintendo logo, the hope was to make the legal system enforce Nintendo’s licensing scheme.

[0] https://catskull.net/gameboy-boot-screen-logo.html


The thinking was tested (in U.S. jurisdiction) in Sega v. Accolade. https://en.wikipedia.org/wiki/Sega_v._Accolade

The court sensibly ruled that using technical means to force competitors to display your trademark against their will doesn’t mean you can then claim they’re infringing that trademark.


Apple tried this too with Dont Steal Mac OS X.kext, which uses a haiku with a copyright message as the key to decrypt certain executables like the Finder. I don't think it had any real-world impact.


Pretty rich considering their history with sosumi.


Also reminds me of MODULE_LICENSE and EXPORT_SYMBOL_GPL in the Linux kernel.

https://www.kernel.org/doc/html/latest/process/license-rules...


Right. If you don't have access to the original system (whether it's the thing that stored the emails or a computer with evidence on it), just an image, the hash doesn't matter anyway. They can just put fake data in the evidence vault system to begin with.

If you do have access to the original system, md5 is usually not the thing in the way of falsifying evidence.

Most of the time they will claim the evidence does not exist at all, rather than try to falsify it.

Falsification requires making up lots of things that make sense historically, and humans to swear to them.

You also often have to falsify more than one system in a consistent way.

And you have to do all of these things in a way that won't make the forensic specialist think everything looks really weird.


Exactly. Both sides will have copies of all pertinent evidence, and if their copies conflict in any material way, it will be noticed and investigated and the party that manipulated their copy is going to have a very bad time.

The examples given in the article rely on an attacker manipulating an image prior to it being hashed, or compromising their opponent's database. If you can (and are willing to) do these things, no hashing algorithm will help. They also assume that the hash is literally the only thing anyone will rely on in identifying a document, which is not how it will work in practice. (No witness has ever said "yes there was a letter, I don't remember what it looked like or what it said but I remember its MD5 hash was [...]".)


> the real document won't have hash colliding artifacts in it

I don't think that hash colliding artifacts would necessarily be obvious. They could be in part of a file that ends up being ignored by parsers of the file format. Or it could be some low level noise in pixel values in scanned documents.


Both of those are things computer forensics experts are used to looking for and I'm certain even a beginner could find them.


Are court documents routinely analyzed by computer forensic experts, or only when there is suspicion of tampering?


It certainly does not matter. If you wanted to present a fake document as evidence, would sha-whatever help? Nope. If you wanted to withhold key evidence and pretend it didn’t exist, would it help? Nope.

And do you think we trust md5, or do you think we have another tool that can compare documents? We do have such tools.

What prevents chicanery in discovery and disclosure is how annoying it was to go to law school and pass the bar compared to how fun it would be to work at Wendy’s for the rest of our careers.

Should we use a better hash function: probably. Is this a problem of not enough technology literacy in the profession? Well, not really; there are plenty of tech literate lawyers (hi!) we just have questionable tools just like you do in IT and we make do.


What if both documents have collision artifacts? That's the most realistic attack in e.g. a contract dispute.


That's right.

Same reason why blockchain is (almost) never the answer: that's not how human systems work.

Trust is centralized. Always has been.


I would say the system understands that MD5 is broken, but treats it as more of a usability issue; the system is leaving a backdoor to keep itself safe. It comes to mind how the system got threatened about Epstein and went over and under in order to keep him free. Society should be worried about the legal system having backdoors.


If you could just throw anyone who forged a digital signature in prison, you'd keep using MD5, too.

The reason why people like us keep changing everything for security is specifically because we have no access to justice. Computer crimes are international and difficult to prosecute, so you might as well drop an algorithm like a hot potato if anyone - even just nation state actors - could break it. We build our rules out of code because we do not have access to the material they make laws out of.

That being said, continuing to use MD5 is utterly inexcusable.


Yes, this is a thing. My arguments have been shot down with handwaving several times. "But that would be a crime, so then we call our lawyers." Feels like it would be cheaper to just use something secure than to pay a lawyer :-)


If someone commits a crime, it's not the victim of the crime that has to pay for the lawyer to prosecute them.


Kim Jong Un is not going to make a bank transfer out of North Korea to pay for your lawyers after they hacked you.


Exactly


Would have upvoted except for that last sentence. There is no such thing as a perfectly good airplane, at least as long as "perfect" means flawless rather than good enough. Granted, there is a whole dance for accepting the risk from known defects (that's the whole point).


The article mentions the key detail: MD5 is broken for cryptography (collisions) but not for second preimage attacks. I was hoping there would be some discussion of just how much more difficult the latter is. It is extremely difficult.

Let’s ignore that no second preimage attack is currently known for MD5. The software the author links to has a FAQ that links to a paper that lays out the second preimage complexity for MD4:

https://who.paris.inria.fr/Gaetan.Leurent/files/MD4_FSE08.pd...

It takes 2^102 hashes to brute force this for MD4, which is weaker than MD5. A bitcoin Antminer K7 will set you back $2,000, and it gets 58 TH/s for sha256, which is slower than MD5 or MD4. Let’s ignore that MD5 is more complex than MD4, and let’s say conservatively that similar hardware might be twice as fast for MD5 (SHA256 is really only 20-30% slower on a cpu). It’ll take 2^102/58e12/2/60/60/24/365, or about 1.4 billion years to do a second preimage attack with current hardware. So you could do that 3 times before the sun dies.

If you want to reduce that to 1.4 years, you could maybe buy a billion K7’s for $2 trillion. And each requires 2.8kW so you’ll need to find 2.8 terawatts somewhere. That’s 34 trillion kWh for 1.4 years. US yearly energy consumption is 4 trillion kWh.
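
For anyone who wants to check the napkin math (all inputs are the figures assumed above, not independently verified):

    # 2^102 hashes at 58 TH/s, doubled for an assumed ~2x MD5 speedup
    hashes_needed = 2 ** 102
    rate = 58e12 * 2                                 # hashes per second, one device
    years_one_device = hashes_needed / rate / (60 * 60 * 24 * 365)
    print(f"{years_one_device:.2e} years")           # ~1.4e9 years

    devices = 10 ** 9                                # "a billion K7's"
    years_parallel = years_one_device / devices
    print(f"{years_parallel:.2f} years")             # ~1.4 years
    kwh = devices * 2.8 * years_parallel * 24 * 365  # 2.8 kW per device
    print(f"{kwh:.2e} kWh")                          # ~3.4e13 kWh (~34 trillion)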

It will be a while, probably decades or more, before there’s a tractable second preimage attack here.

Yes, there are stronger hashes out there than MD5, but for file verification (which is what it’s being used for) it’s fine. Safe, even. The legal folks should probably switch someday, and it’ll probably be convenient to do so since many crypto libraries won’t even let you use MD5 unless you pass a “not for security” argument.

But there’s no crisis. They can take their time.


> The article mentions the key detail: MD5 is broken for cryptography (collisions) but not for second preimage attacks.

The problem with this argument is that people often don't properly understand the security requirements of systems. I can't count the number of times I've seen people say "md5 is fine for use case xyz" where in some counterintuitive way it wasn't fine.

And tbh, I don't understand the urge of people to defend broken hash functions. Just use a safe one, even if you think you "don't need it". Choosing a secure hash function doesn't have any downsides, and it's far easier to do that than to actually show that you "don't need it" (instead of just having a feeling you don't need it).

In the unlikely event that you think performance matters (which is unlikely, as cryptographic hash functions are so fast that it's really hard to build anything where the difference between md5 and sha256 matters), even that's covered: blake3 is faster than md5.


> I can't count the number of times I've seen people say "md5 is fine for use case xyz" where in some counterintuitive way it wasn't fine.

I can count many more times that people told me that md5 was "broken" for file verification when, in fact, it never has been.

My main gripe with the article is that it portrays the entire legal profession as "backwards" and "deeply negligent" when they're not actually doing anything unsafe -- or even likely to be unsafe. And "tech" apparently knows better. Much of tech, it would seem, has no idea about the use cases and why one might be safe or not. They just know something's "broken" -- so, clearly, we should update immediately or risk... something.

> Just use a safe one, even if you think you "don't need it".

Here's me switching 5,700 or so hashes from md5 to sha256 in 2019: https://github.com/spack/spack/pull/13185

Did I need it? No. Am I "compliant"? Yes.

Really, though, the main tangible benefit was that it saved me having to respond to questions and uninformed criticism from people unnecessarily worried about md5 checksums.


>And "tech" apparently knows better.

The tech community has a massive problem with Dunning-Kruger, and has for basically ever. Hell, two decades ago, when I was a young guy working in the field, so did I.

I'm not sure if it's because the field is basically a young man's game and that's inherent with relative youth, or if there's something deeper going on, but it's hard to ignore once you notice it.

That said, the idea that you have a better handle on what's going on in the legal system and the needs/uses legal professionals have than actual people in the legal profession and academics in the legal field is a pretty big leap even with those priors.


> I can't count the number of times I've seen people say "md5 is fine for use case xyz" where in some counterintuitive way it wasn't fine.

Help us out by describing a time when this happened. MD5's weaknesses are easily described, and importantly, it is still (second) preimage resistant.

I agree that upgrade is likely your best bet. But I've found the other direction of bad reasoning is a more pernicious trap to fall into. "My system uses bcrypt somewhere so therefore it is secure" and the like is often used as a full substitute for thinking about the entirety of the system.


> MD5's weaknesses are easily described, and importantly, it is still (second) preimage resistant

Most devs have no idea what that means, but most devs still need to use hash functions. They need to use primitives that match their mental model of a hash function. Said model is https://en.m.wikipedia.org/wiki/Random_oracle

The usual answer here is "don't roll your own crypto", but in practice abstinence-only cryptography education doesn't work.


> Help us out by describing a time when this happened.

Linus Torvalds saying that SHA-1 is okay for git, while it is used for Git signatures as well. Signatures are a classic "you need collision resistance to have safe signatures, but people are often confused about it" case.


I might be mistaken, but wouldn't a git signature already be signing trusted things (i.e. the person making the original signature is trusted), making any attack enabled by the input hash function a second preimage attack (i.e. an attacker only knows the trusted input, not anything private like the signing key)?

Hash collisions mean you can't trust signatures from _untrusted_ sources, but git signatures don't seem to fit that situation.


As you pointed out, signatures make content trusted, but only to the degree of the algorithm's attack resistance. I think it's also important to define trust; for our purposes this means: authenticity (the signer deliberately signed the input) and integrity (the input wasn't tampered with).

If an algorithm is collision resistant a signature guarantees both authenticity and integrity. If it's just second preimage resistant, signing may only guarantee authenticity.

Now, the issue with Git using SHA-1 is that an attacker may submit a new patch to a project, rather than attack an existing commit. In that case they are in control of both halves of the collision, and they just need for the benign half to be useful enough to get merged.

Any future commits with the file untouched would allow the attacker to swap it for their malicious half, while claiming integrity thanks to the maintainers' signatures. They could do this by either breaching the official server or setting up a mirror.

One interesting thing to note though: in the case of human readable input such as source code, this attack breaks down as soon as you verify the repo contents. Therefore it's only feasible longer term when using binary or obfuscated formats.


> And tbh, I don't understand the urge of people to defend broken hash functions. Just use a safe one, even if you think you "don't need it".

The ideal discourse would not imply a binary sense of "safety" at all, much less for a function evaluated outside the context and needs of its usage....


The thing is: We have a binary definition of safety for cryptographic hash functions, and it works well.

You can add a non-binary sense of safety to cryptographic hash functions, but it makes stuff a lot more complicated for no good reason. If you use the "preimage-safe-but-not-collision-safe" ones, you need to do a lot more analysis to show safety of your whole construction. You could do that, but it gives you no advantage.


Second preimage attacks aren't the only threat in a forensics environment.

Also, hand-wavy extrapolations from Bitcoin miners aren't a reliable estimate of how fast & energy-efficient dedicated MD5 hardware could become.


Which part was hand-wavy/unreasonable? Do you think that dedicated MD5 hardware could become billions or even millions of times more efficient within a decade? If so, why?


MD5 is already not "fine" or "safe, even" against malicious actors who might pre-prepare collisions, or pre-seed their documents with the special constructs that make MD5 manipulable to collision-attacks.

Even if your extrapolative method was sound, you've already got several factors wrong. The best SHA256 Bitcoin miners are today more than twice your estimate in hashrate, and on plain CPUs SHA256 is more like 4x slower than MD5. (Your smaller estimate of MD5's speed advantage is likely derived from benchmarks where there's special hardware support for SHA256, but not MD5, as common in modern processors.)

But it's also categorically wrong to think the CPU ratio is a good guide to how hardware optimizations would fare for MD5. The leading Bitcoin miners already use a (patented!) extra 'ASICBoost' optimization to eke out extra parallelized SHA256 tests, for that use-case, based on the internals of the algorithm. As a smaller, simpler algorithm – also with various extra weaknesses! – there's no telling how many times faster dedicated MD5 hardware, either for generically calculating hashes or with special adaptations for collision-search – might run, with similar at-the-gates, on-the-die cleverness.

Further, attacks only get better & theory breakthroughs continue. Since MD5 is already discredited amongst academics & serious-value-at-risk applications – and has been since 1994, when expert cryptographers began recommending against its use in new work – there's not much above-ground scholarly/commercial activity refining attacks further. The glory & gold has mostly moved elsewhere.

But taking solace in the illusory lack-of-attacks from that situation is foolhardy, as is pronouncing, without reasoning, that it's "probably decades or more" before second-preimage attacks are practical. Many thought that with regard to collision attacks versus SHA1 – but then the 1st collision arrived in 2017 & now they're cheap.

You can't linear-extrapolate the lifetime of a failed, long-disrecommended cryptographic hash that's already failed in many of its original design goals. Like a bridge built with faulty math or tainted steel, it might collapse tomorrow, or 20 years from now. Groups in secret may already have practical attacks – this sort of info has large private value! – waiting for the right time to exploit, or currently only exploiting in ways that don't reveal their private capability.

You are right that there's no present 'crisis'. But it could arrive tomorrow, causing a chaotic mad-dash to fix, putting all sorts of legal cases/convictions/judgements in doubt. Evidentiary systems should be providing robust authentication/provenance continuity across decades, as that's how long cases continue, or even centuries, for related historical/policy/law issues to play out.

Good engineers won't wait for a crisis to fix a longstanding fragility in socially-important systems, or deploy motivated-reasoning wishful-thinking napkin-estimates to rationalize indefinite inaction.


If I understand this correctly, the paper only shows a particular attack of complexity 2^102. Someone may find a different attack with much lower complexity. That's the usual way how cryptography gets broken -- people find better and better attacks, and suddenly the latest attack has low enough complexity to be practical.


Another unfortunate place where MD5 is widely used: pirate libraries such as Library Genesis and Anna's Archive. While content is distributed at large in torrents with SHA1-summed shards, and Anna's Archive at least offers some structured metadata which would allow slowly migrating away from MD5, files are still indexed using MD5 as the primary key, and any other kind of file hash is nowhere to be found.

Pirate libraries are particularly important for preserving our cultural heritage in a transparent and trustworthy way. That is a role that traditional libraries sadly cannot fulfill due to draconian copyright laws, especially around digital books, with archive.org as a notable exception.


I do not need to do a hash collision to upload malware to Library Genesis. I could just upload malware with a slightly different name than a popular book and claim it is a different release, like book_high_quality. To securely view content downloaded from such sites, update your software and sandbox the application.


It should be noted that md5 is probably still secure for this use case (maybe you could do a bait and switch with a specifically prepared file, but you can't force a collision with a non-evil file).

Still, they should switch. SHA1 is not good either.


I actually picked SHA256 for the path-prefix feature in https://jacob.jkrall.net/benfords-law for “NIST compliance.” That is, I didn’t ever want to answer “yes” to a potential customer’s CISO security surveys question like “does your application use any non-NIST-approved hashing functions?”

It’s frankly broken that evidence-handling doesn’t have to follow the government's advice about hash function selection!


Funnily enough, I had an interesting discussion with a client's lawyer (who, to their credit, is reasonably tech-savvy) before the holidays. I had redlined "FIPS 140-2" from their contract language. I'll omit the context, because it's too nuanced to be discussed here, but the long and short of it was that she wanted to know why I did that.

I informed her that since FIPS 140-2 is about physical properties of key creation and management, all the relevant layers in a cloud-only solution are simply in the wrong scope. And I added that I am allergic to the string "FIPS" in general. Even having it present in official contract language makes people leap into weird assumptions about supported and allowed algorithms.

Her response? "Oh, that makes sense."


May we all be so lucky to have such an enlightened client('s lawyer)!


I still use MD5 as a 128-bit checksum algorithm that is fast and universally supported and compatible everywhere. In this role it's still useful, just don't expect it to be a cryptographic hash anymore.


When you don't have a need for a cryptographic digest, it's important to think of the channel's bit error distribution in selecting a checksum algorithm.

Different checksum algorithms can provide better error detection for specific channel error models (potentially even with fewer bits). Non-cryptographic checksums are typically designed for various failure models like a burst of corrupt bits, trading off what they do/don't detect to better match detection of corruption in the data they will protect.

For example, if you know that there will be at most one bit flip in your message, a single bit checksum (parity check) is sufficient to identify that an error occurred, regardless of your message size. (Note that this is an illustrative example only, since, typically, messages have a certain number of errors for a certain number of message bits -- the expected number of errors depends on the size of the message.)
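
As a toy illustration of that parity example (Python, purely to show the idea):

    # A single parity bit detects any odd number of flipped bits (including the
    # "at most one flip" case above), but misses any even number of flips.
    def parity_bit(data: bytes) -> int:
        return sum(bin(b).count("1") for b in data) % 2

    msg = b"hello"
    corrupted = bytes([msg[0] ^ 0x01]) + msg[1:]      # flip one bit in the first byte
    assert parity_bit(corrupted) != parity_bit(msg)   # the single flip is detected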


"When you don't have a need for a cryptographic digest, it's important to think of the channel's bit error distribution in selecting a checksum algorithm."

Important real-life-facts.

There was no "give-me-an-appropriate-hash" function.

There was:

md5sum yourfile.txt

Nobody wants to think about "channel's bit error distribution" in a non-security-critical context. In fact, it's irrelevant, and possibly a usability issue.


While I see the point, what starts as a checksum can easily become relied upon for security over time; after all, whether bits have been modified accidentally or on purpose is a subtle distinction in many systems.

SHA256 is also near universally supported and doesn’t have this drawback. The only cases where MD5 would be available and SHA256 wouldn’t, is systems that are out of security support anyway, where there are bigger problems to contend with.


SHA256 is something like 30 percent slower than MD5.

I'd suggest using Adler (what zlib does) for a simple and fast checksum. Then that should, one hopes, be painfully obvious to be a bad fit for anything security related.


On a modern CPU (i.e. 64-bit Arm since 2012, Intel Atom since 2016, AMD Zen since 2017, Intel Core since 2019) SHA-256 is twice as fast as MD5.

The difference between the raw hashing speeds is actually even greater; double the speed for a long file is what you get when the execution time includes parts that are identical for the two hashes, i.e. launching md5sum/sha256sum and reading the file.

Older versions of the binary coreutils package may mask this speed difference by having executables compiled only for very old CPUs.

Recent coreutils versions normally use the OpenSSL library for hashing, if found, and OpenSSL uses the hardware instructions where available.

Where a bad sha256sum is installed, "openssl dgst -r -sha256" should work instead.


In Python the Adler library returns a 32-bit checksum. It works pretty well when you're comparing one file to another file. It doesn't work so well if you want to, for example, create a quick fingerprint that (tries to) uniquely identify tens of thousands of files.

On StackOverflow I saw someone say that they got hash collisions in MD5 (128 bit) after hashing around 20k files.

When I tried making something similar, I figured that if I added the size of the file in bytes to the hash, that would decrease the number of hash collisions, since you would need a permutation of bytes within a set of bytes of the same size to generate the same MD5 hash and get a collision. Still feels random and unavoidable in the greater scheme of things, though.


When making a dupe detector some decades ago, I kept the filesize outside of the hash.

No need to compute the hash until there's at least two files with the same size.
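
A minimal sketch of that size-first approach (a hypothetical helper; any hash would do for the second stage, and a final byte-for-byte comparison could still confirm true duplicates):

    # Group files by size first; only hash the groups where sizes collide.
    import hashlib
    import os
    from collections import defaultdict

    def candidate_dupes(root):
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

        dupes = defaultdict(list)
        for size, paths in by_size.items():
            if len(paths) < 2:
                continue  # unique size: no hashing needed at all
            for path in paths:
                with open(path, "rb") as f:
                    dupes[(size, hashlib.md5(f.read()).hexdigest())].append(path)
        return {key: paths for key, paths in dupes.items() if len(paths) > 1}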


Conventional analysis of non-contrived files would suggest only a 50% chance of an MD5 collision among 2^64 (18 quintillion) files.

So any SO account of "hash collisions in MD5 (128 bit) after hashing around 20k files" may be indicative of some other bug, misreporting, or having a set of files that includes pairs intentionally-contrived to include MD5 collisions. (That's now trivial as long as you're not targeting a specific MD5 value, just intending two files to match, and have some ranges in the files where you can stuff the right result-aligning binary patterns.)


I'm not well-versed in this, but I think you're wrong. An MD5 hash has 2^128 possible values, so it's like the birthday problem: you'll get a 50% chance of any two files in a set having the same hash with far fewer than 2^64 files.

Besides, we're talking about probabilities. It's possible for 2 files, one after the other, to have the exact same hash just because of chance. In fact, I think there was a bitcoin bug due to this, something about making the wrong assumption that the hash of the entire blockchain and the hash of the entire blockchain + 1 block would never be the same hash, if I remember correctly.

Hashes are simply not a reliable method of uniquely identifying things. They are a convenient method, but if you implement any hash-based system for identification you should also implement an escape hatch for when two different things have the same hash but must be treated differently.


Nope. 2^64 is the number of tries needed for a 50% chance of a collision between two of the tries. You don't have to stay poorly read on this, and groundlessly allege wrongness:

https://auth0.com/blog/amp/birthday-attacks-collisions-and-p...
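
The standard approximation makes both points easy to check (a quick sketch):

    # Birthday bound: P(collision) ≈ 1 - exp(-n^2 / (2 * 2^bits)) for n inputs.
    import math

    def p_collision(n, bits):
        return 1 - math.exp(-n * n / (2 * 2 ** bits))

    print(p_collision(20_000, 128))   # ~5.9e-31: 20k files should never collide
    print(p_collision(2 ** 64, 128))  # ~0.39: ~50% takes a bit over 2^64 files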


This happened to me. Users initially couldn't directly control the content being hashed, because it contained a random element (via UUID). Later, the API surface expanded.

Luckily, my personal rule is to default to a cryptographic hash unless I can convince myself that cryptographic robustness will never matter and performance definitely will matter, rather than the other way around.

In this specific case all users were internal to the company, so it wouldn't have really mattered if it was vulnerable. But it could just as easily have been an external user-facing thing.


Except MD5 is slower than SHA2 on modern PCs / servers.


You may as well use CRC32 or Adler32.


Those are suitable for error detection for relatively short files, in the kilobyte range. They can be used for error detection in big files only if you compute one per page, e.g. one for each 4kB page.

They are not useful as file identifiers. I have found multiple CRC32 collisions even in a single directory (a big one, with around ten thousand files).

For error detection in a big file, you need at least some 64-bit CRC, though SipHash is likely to be a better choice than a CRC.

For identifying uniquely a file in a multi-TB file system, which may have many millions of files, even a 64-bit hash is not good enough, a hash of 128 bits or more is needed to make negligible the probability of collisions. I have verified this experimentally, finding several 64-bit file hash collisions in my file systems.


There's a difference between finding a collision and finding a second pre-image. While I agree you shouldn't use MD5, and absolutely don't use a signature algorithm which uses it, finding a second pre-image is harder than finding an arbitrary collision with MD5.

An "arbitrary collision" here means you can find two inputs (pre-images) which hash to the same thing. Like you ran some code and discovered that "SDFKLHKLJxchjasdfgklhjaskdhjlf9" hashed to the same thing as "klhkasdfhjkl899078790". Finding a second pre-image means you start with one message, like "ALL QUIET. REMAIN CALM." and figured out that "ATTACK AT DAWN 051928" hashes to the same MIC.

I can't believe I'm defending using MD5. But... finding second pre-images is still hard. Sasaki & Aoki say it's got a complexity of around 2^116.9 and requires 11 * 2^45 words of memory (though 1400Tb isn't THAT outlandish these days).

Still... statements like "finding a second pre-image is hard" don't age well and will guarantee a tractable second pre-image attack will be published tomorrow.

But... if you have a bunch of docs and you're not signing them or asking people to trust the hash of each doc, you can (reasonably) quickly de-dup by sorting by MD5 hash and then looking for dups. Which is how many people use MD5. And they continue using MD5 because multiple organizations have similar lists and if you wanted to change it, you would need to get everyone to move to a different algorithm.

But yeah... at this point we should assume someone will publish a tractable second pre-image attack "any day now" and get to work migrating from MD5 to MD5 : Next Generation. But good luck getting more than 2 people to agree to what the next preferred hash algorithm should be.


So if the document is evidence, then it's probably created by the attacker. This seems like a setup where collision is more relevant than 2nd preimage.


How so?


Second preimage attacks are relevant for the documents that you create and give to others.

Keeping a hash of the document ensures that you can prove that any altered document shown by someone else is not the original.

Collision attacks are relevant for the documents created by others, which you receive.

If you have a hash of the document that is collision-resistant, you can trust that the creator does not have other variants of the document with the same hash.

If the hash is not collision resistant, i.e. it is MD5 or SHA-1, you cannot know whether the creator of the document has also created another variant of the document, with the same hash, besides the one handed to you.

That is why a digital signature on a document received from others is meaningful only if it is based on a collision-resistant hash.

If you sign and verify your own documents, for detecting modifications, a second preimage attack resistant hash would be enough.


Again. A "collision" means you have two pre-images which hash to the same value, but you did not pick either of the two pre-images. So if someone gives you a doc that says "SDKLFHJSDJKLGHJKLb9iyasdfkghjasdf97897asdfg798789asd" and then gives you another doc that says "klhjasdfhjklasdfhjkl97879087908789sdfga" and they have the same hash, then... what has the attacker achieved other than proving they've found a collision.

A "second pre image" means they can give you a document like '{"status":"calm","launch_missiles":false}' and then later come up with another document like '{"status":"angry","launch_missiles":true,"whatever":"a9d7s8gh283g7d7"}' and both would have the same hash.

A critical part of using a hash function is understanding how it can be used. So if I was expecting a message to parse to a JSON blob and you gave me "SDKLFHJSDJKLGHJKLb9iyasdfkghjasdf97897asdfg798789asd", it doesn't matter what the second message's hash is, because you've given me two messages which can't be used.

In the use case you've given, it turns out courts don't look at hashes of messages, they look at messages. So a collision is of limited use for forensics.

In password hashing systems, if you could force someone to use the password "SDKLFHJSDJKLGHJKLb9iyasdfkghjasdf97897asdfg798789asd", you could come in later and use "klhjasdfhjklasdfhjkl97879087908789sdfga" to log in. But you should be pilloried for not using something like a PKCS#5 PBEKDF. If you used PBEKDF2 for instance, you would now be looking for a second pre-image of the salt prepended with the password. And again, second pre-images are harder than finding a collision.

I absolutely agree that a digital signature is only meaningful if it uses a collision and second-pre image resistant hash function. But that's not what we were talking about.

I'm also very happy that the knee-jerk response to MD5 is now "STOP IT BEFORE IT GETS TO THE CHILDREN." A decade ago I had a senior architect say it was okay to use MD5 in new systems because Bruce Schneier's 1996 "Applied Cryptography" said it was okay. I spent the next year moving that app from auth using straight MD5 of the password to an SRP based system.


> A "collision" means you have two pre-images which hash to the same value, but you did not pick either of the two pre-images.

I think the use of the word "you" is ambiguous here (do you mean the attacker? verifier?).

In an attack scenario for a collision attack, you would have an attacker prepare two documents that have the same hash but a different message. Attacker uses the innocent message initially, and then later swaps it to the evil message pretending it was that all along (or vice versa).

The way I could see it happening in a court setting (this is super far-fetched, and there are a bunch of reasons why this wouldn't work in practice):

Attacker, knowing they might end up in court, creates two payloads, one evil, one innocent with same md5 hash.

Attacker uses the evil payload to attack some target

Attacker gets arrested

In court, they put the payload the attacker used into evidence, indexed by its md5 hash

Attacker claims in court that it is all a misunderstanding, all they sent to the server was the innocent payload that just so happens to have the same hash as the evil one.

There's a bunch of (social) reasons why this probably wouldn't work, but this seems just as viable as the 2nd pre-image attack, and unlike the 2nd pre-image attack, actually is viable with md5.


We still see heavy use of MD5 in genomics as well. It's effectively used to generate a single identifier that can be used to reference a specific genome assembly. There have been discussions and attempts to move to other, more secure algorithms, but the community and its tooling is too deeply entrenched in using the MD5 for the reference that it would take a herculean effort to change.

I'm personally of the opinion that it doesn't matter. MD5 is fine for genomics. The chances of valid genome files colliding is still extremely low, and there's not really any relevant attack space. Replacing one assembly file with another will just break someone's analysis pipeline, and most likely in a very clear obvious way.


> there's not really any relevant attack space.

Then why use a cryptographic hash at all? There are much better hashes out there that only strive for distribution/avalanche.

https://en.wikipedia.org/wiki/Non-cryptographic_hash_functio...


MD5 has/had a well-known "media" surface - lawyers/genomics folks had heard of it. Libraries had it as an accessible function (command line utilities, even).

Sure, there are better non-cryptographic hashes, but, again the concern of lawyers and genomics folk is neither security nor efficiency - simplicity and "works most of the time" are the two metrics at stake.

If either lawyers or genomics folks cared about document forgery of this nature (spoiler, they don't), they would move to something like SHA3. If they had a need for high-scalability hash algorithms (spoiler, they don't), they would switch to another, faster algorithm.

This is a concept that I find security folks struggle to understand - sometimes we _just don't care_. And we never should.

Maybe here's something a struggling security enthusiast could understand - a video game.

If you implement e.g. a Caesar cipher, you can have a fun, accessible puzzle. Implementing AES in your game as a puzzle, while much harder, fails desperately at the "accessibility" metric. In your single-player game, if you want to show some "identifying hash", an md5 one is enough. No, you should not worry about people forging documents for your ad-hoc identification system if you don't have people attempting to forge in-game items. Maybe it's even a feature that you need to forge such a hash, as a way to solve a puzzle.


Because they’re known to be collision resistant (it’s a primary requirement), whereas non-cryptographic hashes are not, so now you need to evaluate each function individually for this property, which is a hassle. And an unnecessary one; I doubt the computation of the hash is what genomics is bound on.


But what's the relevance of collision resistance, without a meaningful attack surface?


Ensuring every sequence is uniquely identified.

Although they still want to avoid those, non-cryptographic hash functions often care a lot less about collision resistance, which is a problem when fingerprinting, which is the use case here.

The alternative to a CHF in this case is not a non-cryptographic hash function, it's a dedicated fingerprinting scheme (like Rabin fingerprints). But a CHF is a perfectly good fingerprinting function if you don't have more specialised needs (like rolling hashes).


Those properties are not a direct result of a function being collision-resistant, which is a property that only makes sense in adversarial contexts. If nobody is trying to produce collisions, it doesn't matter if they're easy or hard to find.

You might care that the output hashes are well-distributed for your closely-related input data, but as the comment you replied to above points out, there are non-cryptographic functions with good avalanche properties which would satisfy that need without being collision-resistant.


> Those properties are not a direct result of a function being collision-resistant

It kind of is though.

> as the comment you replied to above points out, there are non-cryptographic functions with good avalanche properties which would satisfy that need without being collision-resistant.

No comment I replied to points anything near that. Your comment has basically no content, and the comment before that only asserts such existence without providing any guidance or evidence, linking to a page about the general concept of non-cryptographic hash function, which is utterly useless.

Not only that but avalanche properties do not matter at all for the use case: the hash is just a label for the sequence, it's fine if two similar sequences get similar hashes as long as those hashes are different. Some identifiers (like geninfo) are just centrally assigned integers.


It's true that being collision-resistant is a strong enough property to make collisions unlikely, but it doesn't hold that collision resistance is a requirement for such a hash function.

What is the relevance of collision resistance in this case? Why do you say it's a primary requirement of a hash function here? Why isn't uniformity with a large enough image space enough? Given that there is no adversary trying to produce collisions of generated identifiers, why does it matter that collisions are hard to deliberately create, rather than simply unlikely to occur?


Yeah that's fair - it doesn't need to be cryptographic. But someone back in the day decided MD5 was what they wanted and it stuck. It always raises alarms with pen testers and security scans at work, and each time we have to explain that the cryptographic security is irrelevant; it's just some unfortunate genomics standard we need to support.


Sigh, been having this conversation in a related codebase. MD5 is just as fine as any other generic hash function if it's being used as a non-unique key; in many cases, replacing it with one of the more "secure" alternatives does nothing except make the resulting hashes longer, thereby further reducing the statistical chance of an accidental collision. For something like a document store, deduplication system, etc., simply taking the extra step of doing a binary comparison against the text associated with the hash ensures that accidental (or intentional) collisions are handled. With the bonus that you probably get to either publish a paper or detect someone trying to attack the system should the text comparison fail.
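
A sketch of that pattern (illustrative only; the hash is just a bucket key and equality is always settled byte-for-byte):

    import hashlib
    from collections import defaultdict

    class DocStore:
        def __init__(self):
            self._buckets = defaultdict(list)  # hex digest -> list of documents

        def add(self, doc: bytes) -> bool:
            """Store doc; return False if an identical document is already present."""
            key = hashlib.md5(doc).hexdigest()
            for existing in self._buckets[key]:
                if existing == doc:            # byte comparison, not hash trust
                    return False
            if self._buckets[key]:
                print(f"hash collision detected in bucket {key}")  # paper time!
            self._buckets[key].append(doc)
            return True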

And given the history of cryptographic hashes, I'm even more convinced that anyone depending on sha3/whatever being better than md5/etc over the next 10-20 years is fooling themselves.

Now would I use it in a secure boot chain/etc as a stamp of uniqueness? Probably not.


> Md5 is just as fine as any other generic hash function if its being used as a non-unique key

MD5 brings the feature that you'll forever be explaining why you chose a function that had already been broken for 30 years when other options were readily available.


Often, it's less about picking it for a new project vs having the discussion about how to update an existing one, often with data at rest that needs converting. In the latter case the hash often needs to fit into the existing 128b field size, so one is throwing a good number of the SHA bits away anyway.


This is not a justification for using a broken function.

Truncating a fixed number of bits does not make a good secure function less secure, other than the implications that the shorter length has on brute-force and collision strength.

In some cases it can even make it more secure, e.g. SHA-512/256.


Uh, truncation is a huge no-no for security. But that isn't what I was talking about. I was talking about deduplication/hash key/etc. kinds of functions where a checksum works (or where some other algorithm with a high chance of collisions is implemented for unit testing, to assure collision paths are handled correctly).

This is part of the entire discussion: it's possible to use algorithms originally intended for cryptographically secure purposes in places where the reversibility/collision properties aren't considered problems. And likely this will only happen more frequently as algorithms are picked because they have hardware acceleration and are fast, rather than for the underlying security properties.

BTW: There is a fun game that I've played; see how much you can truncate a modern cryptographic hash before it becomes trivial to find collisions.


The chances of having an accidental hash collision are really small.

I have built data warehouses with md5 as the hashing algorithm to generate keys from natural keys. Did some back-of-the-envelope calculations back then and found that the chance of a hash collision was minute. Don't remember the exact numbers, but somewhere in the 100s of years if I was generating keys every second.

This could btw very well be a thing with large volumes of data, but in many systems this is absolutely not a worry.


The chance of a random collision is minute, but if someone is actually building collisions the system is broken. DVC uses the MD5 of a file for deduplication, for example, and when you purposely inject files with the same MD5 (which take seconds to build) the result is data loss.


md5 is faster due to being older and made for older hardware, so I guess that is why it's in use for things like that. All deduplicating tools I have used first check the file length before they even try to do a checksum, so I guess that would take care of some problems. It's harder to find a collision if you have to keep the filesize the same.


MD5 was faster on old CPUs.

All modern enough CPUs have dedicated SHA-256 and SHA-1 instructions and even SHA-256 is much faster than MD5.

So performance cannot be invoked for continuing to use MD5.

The cheapest Intel Atom CPUs have added SHA-256 support in 2016 and AMD Zen 1 has added support in early 2017. The Intel Core CPUs have added support later, due to the delayed 10-nm transition, in 2019.

64-bit Arm has had support since the beginning, in 2012, which was what forced Intel to add SHA support to the Atom CPUs first, in order to not lose in the Geekbench benchmarks, where Atom CPUs were compared with Arm CPUs.


It's not harder to find a collision if you keep the file size the same, if you control both files at least.


Supposing I did want to use the file hash as a unique key and I really don't want to do a byte for byte comparison... And I care about speed but not so much about bad actors, what should I use?


Didn’t think about it much, but file size should be a good indicator if the hash isn’t horrible. md5 + file size comparison could work for your use-case.


One of the inputs for MD5 is the length of the message, so I'm at least wrong in the case of MD5. Don't know about the general case and although I'm interested in the answer I can't spend time on it right now. But if anyone has a pointer to a useful resource please reply.

In general, just use a good hash function.


> MD5 should be considered broken and unsuitable for further use.

Ya know... It's 2024 and Azure's blob storage ONLY supports MD5 for integrity checks when writing blobs. There are no other hash functions supported there. The default cloud storage solution implemented by one of the largest cloud providers out there ONLY uses MD5.

I really want to use something else, but whenever I have to interact with them I must fall back to MD5. It's not up to me as a dev to use something better if I need to interact with Azure. Yes, I can use other hashes alongside MD5, but if I want integrity checks with the storage provider I can't completely abandon MD5.


WordPress still uses MD5 for database passwords to this very day with no immediate plans to change it.

That said, they apparently use eight passes of MD5 hashing along with salting, which they claim is a sufficiently secure combo.

WordPress's core and default themes are known to be fairly secure, so I'd like to believe they know what they're talking about, but if nothing else it feels icky.


I'm confused: if they're going through the effort to make something known-bad (MD5) secure, then why not just use something secure in the first place (e.g. SHA3)?


The history of this makes it hard to convince people to supersede hashes based on the fact that they can be collided. If the legal community had switched to SHA-1 at the point that MD5 was found to be weak for collisions they would have had to consider switching over to SHA-2 10 years later. From their perspective they dodged a bullet.

There ends up being a usability issue here. An MD5 hash is only 128 bits long. So 32 hex digits. A SHA-2 hash is going to be 256 bits. Or 64 hex digits. Manually comparing 64 hex digits is in practice much harder than twice as hard as comparing 32 hex digits. People get lost in the middle. If you chop down your 256 bit hash to 128 bits then due to birthday collisions you can probably brute force a collision anyway (you end up only having to do something like 2^64 operations). So there ends up being a usability argument for specifying that your system has to be able to be secure in the face of collisions. At that point you could then further argue that you will just stick with MD5.


A truncated SHA-256 is both more secure and also faster to compute on any modern CPU than MD5, and a visual comparison would work identically.
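
For example (a sketch with Python's hashlib; the truncated digest is the same 32 hex characters as MD5's):

    import hashlib

    data = b"some court exhibit"
    md5_hex = hashlib.md5(data).hexdigest()              # 32 hex characters
    sha256_128 = hashlib.sha256(data).hexdigest()[:32]   # also 32 hex characters
    print(md5_hex)
    print(sha256_128)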

If a visual comparison is believed necessary, it should better be made easier, e.g. by overlaying the two hash values using text of different colors.

Otherwise, even a bash script, or even just one bash command line can easily compare the output of two sha256sum executions and print an appropriate message.


If you have some sort of information processing system available to compare hashes, then you would be better off comparing the data directly. The hashes when used in this context are primarily to make things like manual comparisons possible. Usability is the point.


> There ends up being a usability issue here. An MD5 hash is only 128 bits long. So 32 hex digits. A SHA-2 hash is going to be 256 bits. Or 64 hex digits. Manually comparing 64 hex digits is in practice much harder than twice as hard as comparing 32 hex digits. People get lost in the middle.

Comparing MD5 and SHA-2 for visual human diffing is like comparing a stick of dynamite to a landmine when trying to pop a pimple; any potential safety differences are trivial once you start using something in a fundamentally unsafe way.


Running 2^64 SHA1 ops on a GPU takes 15 years, so I think finding a reasonable collision for that half using SHA2/3 is not as trivial as you suggest:

https://crypto.stackexchange.com/questions/84520/how-long-wo...


A chosen-prefix (i.e. a demonstration how to modify a given legal document to obtain two different documents that have the same hash) SHA-1 collision has already been computed:

"We have successfully run the computation during two months last summer, using 900 GPUs (Nvidia GTX 1060)."

Such resources can be easily rented from a cloud.

So for anyone willing to spend up to 100k USD, it is trivial to find SHA-1 collisions.

https://sha-mbles.github.io/


Since that post was published, the 4090 came out, which can (according to this hashcat benchmark [1]) do 50,638.7 million SHA1 hashes per second, so now it would only take a single 4090 GPU 11.55 years. Or you could buy 12 of them and do it in a year, etc. So it's definitely not cheap but 15 years is definitely an overestimate (and presumably GPUs will keep getting faster...).

SHA2-256 is "only" 21975.5 MH/s so you'd have to double the number of GPUs or amount of time.

[1] https://gist.github.com/Chick3nman/32e662a5bb63bc4f51b847bb4...


Generating 2^64 hashes isn't guaranteed to produce a collision, and even if a collision did exist in that set, you're not going to find it by getting a bunch of GPUs to compute 2^64 hashes. There's a huge difference between a haystack that maybe contains a needle, and a needle that's been pulled from the haystack and presented to you. To actually find and identify the collisions you'll need to hook those GPUs up to some sort of storage/retrieval system. Just to store 2^64 128-bit hashes would take 295.1 exabytes. That's an order of magnitude more storage than NSA's utah datacenter[1].

[1] https://en.wikipedia.org/wiki/Utah_Data_Center


You can generate pairs of hashes for random inputs and check for collision without storing all of the outputs, no?


True, which is why 256-bit or bigger hashes are recommended: the compute time needed to find a collision is by itself enough to make this infeasible.

Finding a collision for a 160-bit hash by brute force, on a big cloud or supercomputer, is about the maximum that someone with unlimited financial resources could do.


That requires way more hashes to be computed though.


I don't think that's correct? It's probabilistic, yes, but in expectation you would still need ~2^64 hashes to find a collision for a 128-bit hash (birthday paradox).


See my above comment. There's a huge difference between having a collision somewhere in a pile of 2^64 hashes, and actually having the colliding hashes ready to present.


Are you trying to get at the distinction between second preimage and collision resistance? If all you need is a collision, "try random pairs until you find a collision" method works fine and is arbitrarily parallelizable with no storage.


>"try random pairs until you find a collision" method works fine and is arbitrarily parallelizable with no storage.

That requires you to do 2^128 checks on average, not 2^64. Again, the problem is that the birthday problem only exists if you have all the hashes available to compare. If you're just doing two hashes at a time and comparing the two, the chances of you getting a match for each try is 1 in 2^128, and this is independent for each attempt. To get a 50% chance you need 2^128 attempts.

If you can't see why that's the case, you can empirically test this, by using a 16-bit "hash" rather than a 128-bit hash: https://pastebin.com/7xW04Jgg

The average I got during one invocation was 58161, which is 2^15.8, not 2^8 as you'd expect. However, if you modify the code to include a storage/retrieval system: https://pastebin.com/AxF9S0q5, the average drops to near 2^8. For my last invocation, I got 309.7 which is 2^8.2
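I haven't seen the linked pastebins, but a rough re-creation of the experiment (a 16-bit truncated hash, comparing fresh pairs vs. remembering every digest seen) shows the same gap:

  import hashlib, os
  def h16(data):
      # 16-bit "hash": the first two bytes of MD5
      return hashlib.md5(data).digest()[:2]
  def pairwise():
      # compare two fresh random inputs per attempt, storing nothing
      n = 0
      while True:
          n += 1
          if h16(os.urandom(8)) == h16(os.urandom(8)):
              return n
  def birthday():
      # remember every digest seen so far (the storage/retrieval system)
      seen, n = set(), 0
      while True:
          n += 1
          d = h16(os.urandom(8))
          if d in seen:
              return n
          seen.add(d)
  print(sum(pairwise() for _ in range(10)) / 10)    # around 2^16 attempts
  print(sum(birthday() for _ in range(200)) / 200)  # around 2^8 attempts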


This is the same legal system that still uses polygraphs as “lie detectors” and known-junk DNA matching tests as fact, so this isn't exactly shocking.


What court has admitted polygraph test results into evidence? Surely, none in the US.


I suspect we have different definitions of “legal system” in mind. You are correct that such things cannot be admitted as evidence into a court case, but law enforcement agencies still use the machines and do their best to lie to unknowing victims that it will be admitted to court.


They are often used in pre-employment screening for sensitive government jobs. Yes, it would be illegal for a private employer to do this.


I can see the lawyers’ point. Any old checksum is good enough for spotting random data corruption. If you do happen to have a crooked lawyer who is submitting tampered evidence then you would hope that there are better systems in place to weed out this ethical corruption.

For instance: law school training to strengthen ethics, well known punishments to act as a deterrent, hiring processes to filter out unscrupulous actors, and whistleblower protections to encourage and reward vigilance.

It’s not an opinion that adheres to the cynical zeitgeist, but in my experience most members of this profession are extremely trustworthy. I’m sure they dislike the stereotype of lawyer=rotter just as much as hackers are tired of being typecast as Newman… sorry, Dennis from Jurassic Park!


“A cryptographic hash should uniquely identify a file.”

This is not true. It’s not possible to guarantee this. One can be certain that two files DON’T match if they have different hashes, but one cannot be certain that two files DO match based ONLY on the fact that they have the same hash.


MD5 is incredibly broken. PoC||GTFO 0x14 (https://dl.packetstormsecurity.net/mag/pocgtfo/pocorgtfo14.p..., 42 MB) is a PDF file that can also be run in a NES emulator, and it will display its own MD5 hash. The MD5 hash is also shown in the PDF document itself. (Don't download it from archive.org; their copy is altered.)

The fact that any document can contain its own MD5 hash embedded in there should be hugely concerning enough.

The hash also happens to start with 5EAF00D.


There's a GIF MD5-quine here: https://news.ycombinator.com/item?id=13823704

And a PNG version too: https://news.ycombinator.com/item?id=32956964

But no one has made an exclusively plaintext (ASCII) MD5-quine yet, and I suspect doing so may be impossible given the characteristics of collision blocks.


How is it impossible? I would think an MD5 quine exists with probability approaching 1 as the size of the document grows to infinity. Think about the reduced problem:

1. a document containing "1", whose hash begins with "1"

2. a document containing "12", whose hash begins with "12"

3. a document containing "123", whose hash begins with "123"

#1 is certain to exist. #2 exists, but would take 16x as long to brute force. #3 would take 16x longer again. If this pattern doesn't continue until 2^128, where would it stop, and why?

All hashes can be brute forced this way, even secure ones like SHA-2. Their security relies on the fact that the earth doesn't contain enough computing power to execute a brute force attack within the universe's lifetime.
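A throwaway sketch of that reduced problem (the message format and function name here are made up); each extra digit multiplies the expected work by roughly 16:

  import hashlib
  def prefix_quine(n):
      # find a document whose text contains the first n hex digits of its own MD5
      nonce = 0
      while True:
          for cand in range(16 ** n):
              prefix = format(cand, f"0{n}x")
              doc = f"my md5 starts with {prefix} (nonce {nonce})".encode()
              if hashlib.md5(doc).hexdigest().startswith(prefix):
                  return doc.decode()
          nonce += 1
  print(prefix_quine(1))
  print(prefix_quine(4))  # roughly 16^4 = 65536 hashes on average
Pushing this to all 32 digits would be the 2^128 brute force; as I understand it, the existing GIF/PDF quines avoid that by building the file out of many precomputed collision blocks rather than brute-forcing the digest.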


> as the size of the document grows to infinity.

Therein lies the problem.

Also the fact that it would need to be constrained to 7-bit ASCII only, and on top of that be "valid" in its natural language. It's a neat trick to make two documents look completely different with the same hash, but looking at the techniques which are required, they all rely on a binary file format and copious amounts of data which are effectively "hidden" --- all of which do not apply to a text file.


Maybe try to have a look at it as permutations: the mapping "hex of the hash" → "its actual hash" is a (presumably random) permutation. And it's quite probable that such permutation has a fixed point: http://laurentmazare.github.io/2014/09/27/fixed-points-of-ra...

The problem is that we currently don't know how to find it more efficiently than with exhaustive search, AFAIK.

Edit: previously on HN: https://news.ycombinator.com/item?id=614079
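A quick sanity check of the linked result (a uniformly random permutation has at least one fixed point with probability approaching 1 - 1/e, about 0.632):

  import random
  def has_fixed_point(n):
      perm = list(range(n))
      random.shuffle(perm)
      return any(p == i for i, p in enumerate(perm))
  n, trials = 1000, 20_000
  print(sum(has_fixed_point(n) for _ in range(trials)) / trials)  # ~0.632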


> a PDF file that can be also run in a NES emulator, and will display its own MD5 hash. The MD5 hash is also shown in the pdf document itself.

By that logic, SHA 256 is also broken:

  $ cat >sha256.py
  from hashlib import sha256
  s = 'from hashlib import sha256\ns = %r\nprint sha256(s%%s).hexdigest()\n'
  print sha256(s%s).hexdigest()
  $ sha256sum sha256.py
  14cc85c420ced317fdb73e9403ac3f6e1d96d19c70ae0dce8da9b8d96fa0b4d3  sha256.py
  $ python sha256.py
  14cc85c420ced317fdb73e9403ac3f6e1d96d19c70ae0dce8da9b8d96fa0b4d3
(Yes, PDF is turing complete. Yes, that's terrible. No, it doesn't have anything to do with hash function deficiencies; it's turing complete on (malicious) purpose, just like webpages with javascript.)


Don't be silly; the MD5 is clearly in the plaintext here, and the NES ROM is only the first 40k of the file. It is not able to scan itself and print out a hash that way.


> the MD5 is clearly in the plaintext here

Alright, I'll bite: at what byte offset in the binary file contents does a trivial encoding[0] of the MD5 hash occur?

> the NES ROM is only the first 40k of the file. It is not able to scan itself and print out a hash that way.

It is possible to encode the effects of multiple blocks of arbitrary[1] data on a hash function's internal state (independently of what state you start in) in much less space than that data actually takes up, although I'll grant that actually doing so is somewhat impressive in its own right, so I don't have a trivial translation to SHA 256 immediately ready to post.

Edit: tracked down my saved version:

  $ md5sum pocorgtfo14.pdf
  5eaf00d25c14232555a51a50b126746c  pocorgtfo14.pdf
  $ grep -aoi 5eaf00d pocorgtfo14.pdf || echo not found
  not found
  $ # using ...b126746c because 5eaf00... has a nul
  $ grep -aoF $(printf '\xb1\x26\x74\x6c') pocorgtfo14.pdf || echo not found
  not found
The MD5 is definitely not clearly in the plaintext here, though it could be only mildly unclear.

0: Eg, I'd accept 31 34 63 63 38 35 63 34 ... as an encoding of 14cc85c4... from above.

1: Including random/incompressible data.


I stand corrected; I ran it in a debugger, and the NES ROM appears to be doing significant computation between each hex digit being displayed. I haven't actually read the trace log yet to confirm whether it is performing MD5.

NES ROM still doesn't have any access to the rest of the file though.


I don't know about that specific file, but PDF usually compresses (deflates) its component parts, so I wouldn't expect naive grepping to work.


PDF being Turing complete doesn't mean that the program embedded within can access the bytes of the document.


It might be the case that the more complicated the artifact you're trying to forge is, the easier the MD5 forgery gets; the challenge with doing hash forgeries is that you lose control of some of the values and placement of bytes, which is more noticeable/disruptive in a short document than in a file that is a bunch of (seekable, error-recovering) formats at once.

(I've never tried to build one of these, so I could be just totally wrong here.)


>The fact that any document can contain its own MD5 hash embedded in there should be hugely concerning enough.

I think that's pretty amazing, to be honest.


> Yes, they say, MD5 is broken for encryption, but since they’re not doing encryption, it’s fine for them to use it.

Unless I missed it, this article seems to not refute the most fundamental point: MD5 was never broken for encryption. Hashing is not encryption.


While hashing is not encryption, any secure hashing function can be used for encryption, even when treated as a black box (e.g. by building an unpredictable PRNG/keystream from it).

Moreover, MD5, SHA-1 and SHA-2 contain a block cipher function used in the Davies-Meyer mode of operation.

The internal block cipher function can be extracted and used in any other mode of operation possible for block cipher functions.

Because of these possibilities, many older laws that have existed in various places, prohibiting the inclusion of encryption in software products, but allowing secure hashing functions, have been completely misguided.
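A toy sketch of that black-box point (not a vetted cipher, just an illustration of why the legal distinction was meaningless): run the hash over key, nonce and a counter to produce a keystream, then XOR it with the data.

  import hashlib
  def hash_stream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
      # derive a keystream by hashing key || nonce || counter, then XOR it in
      stream = bytearray()
      counter = 0
      while len(stream) < len(data):
          stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
          counter += 1
      return bytes(b ^ k for b, k in zip(data, stream))
  ct = hash_stream_xor(b"secret key", b"nonce-0001", b"attack at dawn")
  print(hash_stream_xor(b"secret key", b"nonce-0001", ct))  # round-trips to the plaintext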


It's broken for hashing too.

The point is that MD5 is no good if there's any way an adversary might want to subvert it. It's fine if you just want to use it for hashing your own documents, but as soon as there's an incentive for someone to substitute one document for another, MD5 is problematic.

That's certainly the case for encryption, but it's also the case for these legal document records.


I don't disagree, but I think my point still stands.


> only broken for encryption

It's broken in an adversarial situation: given the hash of evidence-file A, it's possible to construct a file B that gives the same hash.

But it would be a different matter entirely to construct a file B that actually looked like a file of evidence relevant to the case. I don't know how lawyers use these hashes, but unless they're being used to detect malicious tampering, I don't see what's wrong with MD5. And since the files to be hashed are evidence, they're in the custody of a court; things have got quite bad if court officials might be tampering with evidence.


> It's broken in an adversarial situation: given the hash of evidence-file A, it's possible to construct a file B that gives the same hash.

No, that's a second preimage attack. MD5 is safe against preimage & second preimage attacks.

What MD5 is not safe against, is a collision attack: you can create two messages/files with different content, that end up having the same hash.


Yeah, sorry. TFA made that clear.

So to exploit the vulnerability, you have to be able to manipulate file A, the original piece of evidence, to construct a file B that has a matching hash. I still fail to see how this impacts files submitted to a court in evidence.


I've been wondering, is there a term for a type of attack like this:

Given a message M, length function L(), and MD5 hash function H(); is there an attack which can generate message M', such that H(M)==H(M') _and_ L(M)==L(M')?

In other words: Two different messages, both of the same length, with the same hash?

It's almost like a chosen prefix collision attack, but with no prefix (so P is empty) and a given message (M is known, M' is up to the attacker).

I ask because I frequently use GridFTP for data transfer, and it uses both the file length and the MD5 hash to verify that files were transferred correctly.


I don't know anything about GridFTP - but there's a huge difference between verifying if files were "transferred correctly" and verifying that files were transferred without being tampered with by a malicious party.

MD5 is fine for the first task, and totally unacceptable for the second.


Indeed, which is why I didn’t mention third-party tampering. For that, the transfer can be sent inside of a TLS-enabled connection.


That is still an attack on the second-preimage or collision-resistance properties of the hash function. Most collisions do work this way; for example, see [1].

[1] https://github.com/corkami/collisions


That makes sense, but is there a specific name for this type of collision?


Despite official statements about MD5, the leading e-discovery software provider uses sha256: https://help.relativity.com/9.0/Content/Relativity/Processin....

And it is unclear if that is in any way unusual.


MD5 hashes are half the length of the recommended hashing algorithms. This convenience (and the switching cost) is worth more than the theoretical security considerations.


> (and the switching cost)

Yes, I must say I smiled when I saw the author's assertion that moving the entire legal industry to a new hashing algorithm is "trivial".


In other industries, an MD5 collision is the least of their problems: it just works, and God knows when or how they would ever hit a collision, so they just use it.


I've always wondered what would happen if some black hat made all their payloads have the same hash in the hopes that if it ever went to court the confusion would hinder the proceedings.

Like I get that MD5 is essentially a unique identifier and not meant to protect against malicious interference, but if all the exhibits had the same identifier, surely that would confuse people.


I worked for years as the tech guy for a document imaging company. We worked a few gigs for the Serious Fraud Office in the UK. So, while I'm not a lawyer, I bumped into this stuff a fair bit.

The point is that evidence is an agreement between the two sides in a case, and it's not an absolute thing.

If you have the original document that was signed by both parties, great. If you have a scan (using a lossless compression format) of the document and proof that the original was destroyed, great. If you have a scan but no proof of destruction, still great. If you have a photograph of the document and no proof, still great. If you have a vague recollection of what was in the document, still great. All of these are "great" if the other side accepts that they are accurate depictions of the original. If they don't accept that, then there's an argument about what the original document contained and the provenance of the evidence, and only then does the actual quality matter. Original document with wet signature is hard to argue with (but not impossible - wet signatures can be forged). The further away from that, the easier it is to argue that the document presented is not accurate and should not be accepted as evidence.

Knowing that it's possible to use collisions to create false evidence doesn't matter if no-one contests that the evidence is false. It only becomes significant if one side says that the document has been tampered with, and that's not that common. The side claiming it was tampered with would have to present their version of the document, and their version of events that allowed the document to be tampered with, and so on. The judge would make a ruling about which version of the document was considered the "real" one and the case would continue. Obviously there are edge cases where the whole trial verdict hinges on which version of the document is the correct one, but they're edge cases. And in those cases you could re-hash the documents involved and double-check which one was right, etc.

In the OP's example, where a letter of recommendation has the same hash as an authorisation letter, this is only going to matter if one side says the accused was authorised and the other says they weren't. The authorisation letter will be produced by one side, and the recommendation letter produced by the other, and there'll be an argument about which was the original document. The fact that they have the same hash isn't really relevant. It's a minor point of interest given that these are two clearly different documents saying different things.

In the specific cases for the SFO that I worked on, the SFO descended on the accused's offices like locusts, sweeping every single document into carefully numbered bags. We scanned the documents in secure facilities, stored the originals in secure facilities, stored the resulting images in secure storage, and deleted any cache or copies. My professional opinion is that it would be impossible for anyone to create two documents prior to the SFO's investigation that would create an intentional MD5 collision in the evidence used in court. And, even if they somehow did, it wouldn't matter because both documents would be in evidence bags in storage and could be recovered to be examined by the court.

Obviously, from a black/white technical point of view, using a better hash algorithm would be better. But I can see why the legal profession is reluctant to adopt the new thing; it's a hassle and it will only affect a tiny amount of cases, if any.


Interestingly, use of MD5 has been found in court to make evidence less convincing. https://www.schneier.com/blog/archives/2005/08/the_md5_defen...

To the article, what tptacek said.


This is actually something I discussed with my legal team and was prepared to bring up at trial with our own forensic expert. My team was (correctly, IMO) not expecting much to come of it, because of the amount of precedent that exists for MD5 being used as a way to say these documents haven't been swapped or tampered with.


It's worth noting that there are no known attacks against MD5 HMACs, which look identical to MD5 hashes.
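For what it's worth, both outputs really are indistinguishable to the eye: 128 bits, 32 hex characters. Only the keyed construction differs:

  import hashlib, hmac
  msg = b"exhibit 42 contents"
  print(hashlib.md5(msg).hexdigest())                           # bare MD5
  print(hmac.new(b"secret key", msg, hashlib.md5).hexdigest())  # HMAC-MD5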


Quantum computers will severely break MD5 and SHA-1, so they'd be broken even if they are used with HMAC. Use SHA2-256 unless you need quantum-resistant collision resistance, in which case you should use SHA2-384. Use HMAC-SHA2-* with a 256-bit key if you want to prevent length extension attacks.


"Severely break" is a bit of an overstatement.

It provides a speedup, but it's not like Shor's algorithm - you'd need a really powerful quantum computer before MD5 comes under threat that way.

But to be clear: MD5 is broken; do not use it.


Interestingly, DBT is still using MD5 as the default algorithm for generating row identifiers.
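Roughly what that pattern looks like (a generic illustration, not dbt's actual macro): hash the concatenated key columns with MD5 to get a stable row identifier.

  import hashlib
  def surrogate_key(*cols):
      # concatenate the natural-key columns with a separator and hash them
      joined = "||".join("" if c is None else str(c) for c in cols)
      return hashlib.md5(joined.encode()).hexdigest()
  print(surrogate_key("2024-01-03", "customer-17", 42.0))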


I read through this hoping to see a reasonable discussion of the difference between preimage attacks (see https://en.m.wikipedia.org/wiki/Preimage_attack) and collision attacks, and was disappointed when I did not see the topic mentioned once. :(

It is much more computationally feasible to create two inputs from scratch that hash to the same value than to forge an existing document's hash (the threat model I'm assuming they're discussing in relation to the law).

I am not aware of a demonstrated second preimage attack on MD5. Not saying to keep using it, just trying not to spread FUD.

Edit: I do see second preimage is mentioned about 3/4 of the way through the article. I confess that I did stop reading and started skimming before then.


The article mentions the fact that a successful second preimage attack is not always necessary.

A successful second preimage attack is needed if you want to make a second variant with the same hash as an already existing legal document.

However, when the original document is not yet in the possession of others (or there might be a way to destroy or replace their older copies), you can make more or less invisible modifications to it, so that a second, different document will have the same hash as it. Then the altered original document can be handed to other parties, who will not notice changes from whatever had been agreed, while keeping an alternative document that can be shown later as having the same hash.

While opportunities for such a forgery should happen less often, it is much better to use a collision-resistant hash to completely remove this possibility.


Yes, if you haven't already noticed, crypto is just a religion at this point, propagated by the "experts" who don't actually think and actively silence dissent.


I think it's simply that the blog author and commentators have an unrealistic threat model when it comes to how the legal profession uses MD5s.

After the first high-profile case where authenticity of evidence gets called into question because a seized electronic document was deliberately doctored to allow for a hash collision (if that ever happens), there will be a will to change to something new.


I doubt it; legal types won't see this as a math problem[1], but as a legal problem (forging documents)

[1] unless I'm missing something, this boils down to: "given f(x: string) => y, how can I minimize the odds that you can generate an X for a desired Y"


[flagged]


This would be a better comment without that last line. There's a fair bit of dogma in your comment, too!



