This was fun, but my gut reaction to the title was “it doesn’t have to be easy, just possible” and that holds. Unless, I guess, you want to collide a sha1 for some reason.
I found the AWS vs GPU cost comparison to be fun. 10x the cost of the hardware to run the collisions in the same time window. Crazy.
> I found the AWS vs GPU cost comparison to be fun. 10x the cost of the hardware to run the collisions in the same time window. Crazy.
The real benefit of the AWS solution is that you could rent a large number of AWS nodes and run the calculations to completion right now if you wanted to. As in, literally tonight. And you wouldn’t have to source the hardware, assemble machines, install software, power it all, cool it, figure out all of the power complexities and so on.
> There is no runtime available right now. Please change the compute type or try again later.
Recently, that's not always true.
It's always funny to me when you run into a wall of "no more instances" in AWS. It doesn't happen often unless you're at a massive scale, but it does happen surprisingly often with the GPU instance types.
Is it surprising that relatively niche configurations (especially ones which clearly need specific hardware per compute) are finite in AWS? But imagine AWS running out of m4.xlarge or something. That would be interesting. What scale would you need to run out of the most common instance type in a top availability zone? 10k? 100k?
Happens with Google cloud too, sometimes for weeks at a time.
It's pretty sad how Google/Amazon can't at least temporarily move their own workloads to free up resources for their public clouds while they install new hardware.
> free up resources (...) while they install new hardware
That's likely not the reason. It's more likely caused by inherent spike-y ness of the traffic, given that most GPU instances are probably used for batch numerical computations (like training neural networks). The pricing incentivizes using a lot of nodes for a shorter amount of time, much like in the SHA example.
They are never going to significantly prioritise "public" cloud resources over internal needs, beyond their own needs having quotas to make sure they don't run amok if something gets unexpectedly out of shape.
If their own public services slowed down in high demand times, it wouldn't look good. The sales channels could try to spin it as “giving you priority over us” but who would genuinely believe them.
They could reduce resources available to low priority internal stuff during spikes of other activity, but that would not have a large effect considering the combined size of activity on the public cloud resources.
That in itself doesn't explain the 10x though. You can insta-deploy your VM on Vast.ai for 1/8th the cost. Literally tonight. You won't have to source the hardware, assemble the machines, etc.
People also overestimate how much effort it is to manage hardware, especially for raw-compute operations (i.e. where a 99% SLA is fine).
The cooling and power is handled by your datacentre. You don't need to deal with that.
I've built a 200 GPU mining operation in one workweek, pay me $170k?
I think the 99% SLA vs a few more nines can be the deal breaker. And you probably didn't even get a full 99% - with all due respect, as you likely didn't need it in the first place for a mining rig. I know plenty of serious workloads that would accept as little as 90% availability (~1 full day down every ~10 days), and plenty that need a large amount of 9s (seconds, maybe a minute of downtime per month on a normal basis, everything more is an outage).
I don't know of any HPC facility that aims for/achieves anything beyond 99% uptime, most well-run ones are probably 98%-ish.
And that is just raw uptime; as an end user you also have to live with your jobs waiting for the scheduler to fit them into the queue. A day or two of waiting time is pretty common for larger jobs.
The real benefit of the in-house solution is that marginal compute costs are zero. And while this current example is just for fun, I do want my researchers and engineers not worried about the compute cost of every experiment or line of inquiry.
I dig the cloud for production, but for R&D I like my people to have fixed costs and close to zero marginal costs.
But you are right that the marginal costs will be mostly made up of opportunity costs.
Unless you have only one researcher, it doesn't matter for an individual researcher whether their marginal costs are opportunity costs of their co-workers or AWS bills.
(And if you don't have any mechanism for your individual researchers to feel the opportunity costs of their co-workers, that is the same thing as saying that one researcher can hog all the compute time.
Social pressure is also a way to make them feel opportunity costs.)
They don't ban it, but new accounts or even semi-old accounts will have a hard time getting access to more than a few instances. Most of the GPU machines are behind lock and key that goes toward AWS's bigger clients and clients with a proven track record. They don't want to be stuck with an unpaid EC2 bill for a multi-thousand-dollar-a-month VM.
You still have to make sure that the money you got was clean.
Legally, even money you have in your account now might not stay there, if there was fraud involved. But the mined coins will be gone (and the compute time, too).
It’s a magnet for credit card fraud: take a stolen card and create an account, run the mining until the owner gets the bill and does the chargeback. You are untraceable and Amazon is left holding the bag.
> This was fun, but my gut reaction to the title was “it doesn’t have to be easy, just possible” and that holds.
It really depends why you're hashing and what the consequences of a collision are. Using a hashing algorithm to protect sensitive data (should I hash passwords with SHA-1?) should weigh the value of this answer significantly differently than using it as a convenient lookup optimization (should I use SHA-1 to identify git references?). And there's a spectrum between those, where that convenience might be an attack surface (can I impersonate someone's git commit?) or might be worth the risk (does it matter if I can easily find a collision?).
Unless I’ve missed some new proof, it should be assumed the answer to “is it possible?” will likely always be yes. The question then becomes “has the known wisdom on the following questions changed since I last asked?”, followed in no specific order by, “how plausible is the collision? how much do I care if it happens? how motivated are people to find it? are there other plausible mitigations in place? is the known wisdom on these questions likely to change in a way meaningful to me? how soon?”
> Using a hashing algorithm to protect sensitive data (should I hash passwords with SHA-1?)
You're almost certainly fine! A collision oracle doesn't actually make it any easier to conduct a preimage attack.
We normally recommend against using hashing algorithms with known collision attacks for password hashing because (a) these hashes aren't well-suited to password hashing in the first place, (b) having a collision attack implies structural weakness and you don't really want to be caught relying on that hash if a preimage attack does turn up, and (c) as a fashion statement.
> using it as a convenient lookup optimization (should I use SHA-1 to identify git references?)
Now this is a bad idea, because having collisions in the wild means that someone can swap out one for the other without the hash catching it, so now you need to build in extra complexity to mitigate that.
(Of course, collisions always exist in theory, so you want to design around the possibility. The difference having collisions in the wild makes is that as long as they're only theoretical, you can get away with e.g. screaming loudly about running into an impossible situation and guarantee that you don't do something actively dangerous, or falling back to something horrendously slow but always correct; once collisions exist in the wild, that's a denial-of-service.)
One concrete example is using hashes as identifiers for objects.
It is possible that this is a bad idea because of collisions. However, if you have, for example, a federated system, there is no central instance issuing identifiers for objects, so you create a hash. In practice it doesn't matter much whether you use SHA-1, MD5 or even a random identifier: as long as you have enough bits, a collision becomes unlikely. (Though again, if you have many objects, consider the birthday paradox.)
The interesting thing here is that this is an issue in any case: even with an extremely secure hash algorithm, in the end you have fewer bits than the original data, so collisions must occur, even if they are immensely improbable.
If you then decide to ignore collisions because, after considering the birthday paradox, they won't occur in practice, it can still be worth thinking about what would happen if such a collision did occur.
Let's say your application executes a database update and has a hash collision. Then another object would be overwritten.
This is a case for not mixing object creation and object update. If you program carefully, so that updating your object can never overwrite another object, you'll get an internal error when a hash collision occurs. That's better than losing unrelated data.
However this is difficult to test. And I am sure that other bugs are a lot more probable.
But this thinking shows me that it is useful to program defensively.
If you have assumptions, encode them in assertions. So even if you don't have tests for this rare problem, your chance of being lucky is higher.
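To make that concrete, here's a minimal Python sketch of what I mean, with a made-up in-memory store (nothing here is from a real system); the point is only that create and update are separate paths, so a collision surfaces as an internal error instead of silently overwriting unrelated data:

    import hashlib


    class ObjectStore:
        """Toy in-memory store keyed by the SHA-1 of an object's content."""

        def __init__(self):
            self._objects = {}  # hex digest -> content bytes

        def create(self, data: bytes) -> str:
            key = hashlib.sha1(data).hexdigest()
            existing = self._objects.get(key)
            # The assumption "no collisions" made explicit: creating must never
            # silently overwrite *different* data that happens to share the hash.
            if existing is not None and existing != data:
                raise RuntimeError(f"internal error: hash collision on {key}")
            self._objects[key] = data
            return key

        def update(self, key: str, data: bytes) -> None:
            # An update may only touch an object that was explicitly created before.
            assert key in self._objects, "update() must not create objects"
            self._objects[key] = data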
> However, if you have, for example, a federated system, there is no central instance issuing identifiers for objects, so you create a hash. In practice it doesn't matter much whether you use SHA-1, MD5 or even a random identifier.
But it <i>does</i> matter. Usually it's only a big problem in edge cases, but sometimes those edge cases are <i>extremely</i> important. TLS certificates use a hash as their identifier in a distributed/federated model like you describe. By generating collisions, one can submit a normal signing request to a trusted public CA, but then use the resulting signature with a second certificate that has malicious properties (being for a domain name belonging to someone else, having additional properties like it being trusted to sign other certificates, etc.).[1] This is why you can't get new MD5 or SHA1 certificates signed by public authorities anymore.
As you say, there is always going to be at least an infinitesimally small chance that two realistic inputs to the same hashing function will produce the same output, but IMO virtually any system that uses hashes as identifiers will have its security degraded in some way if it's practical to generate a collision.
> If you program carefully, so that updating your object can never overwrite another object, you'll get an internal error when a hash collision occurs.
It sounds challenging to handle this gracefully in a distributed system without some fairly complex logic. Alice uploads a PDF of sensitive data that hashes to value A on node 1. Node 1 grants Alice full access to the file with that hash. Bob uploads some meeting notes that also hash to value A, but on node 2. Node 2 grants Bob full access to the file with that hash. When the nodes are synchronized an hour later, the system detects a collision. What does it do?
Somebody better tell Linus? I kid. But yeah the answer is you use it as part of your lookup and then mitigate collisions with more expensive comparisons.
It's a non-issue he's addressed many times. Even after you collide a single hash, you still need a useful commit that isn't going to destroy the ability of the code to run AND that does something useful to you, etc. etc. Also, what happens after the next commit, when all the hashes are recalculated? It's possible for it to be an issue, but highly improbable. More likely your creds get hacked, or the fact that no one signs their commits finally bites us on a wide scale.
Well, it's been well known basically since git was published, and it was determined not to be a big deal; work has been under way ever since an actual collision was published.
Anyway, practically you're going to write a collision fallback anyway, the difference is that as long as collisions aren't likely, you can be pretty sure that collision handling is a very cold code path that's never going to have more than a handful of items in a bucket.
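Something like this toy lookup table (a sketch, not anything from git itself), where the digest only narrows things down and an exact comparison settles it; the bucket loop is the cold path that almost never sees more than one entry:

    import hashlib
    from collections import defaultdict

    # Hypothetical cache keyed by SHA-1, keeping the full content alongside the
    # value so the (normally cold) collision path can fall back to an exact compare.
    _buckets = defaultdict(list)  # hex digest -> list of (content, value) pairs

    def put(content: bytes, value) -> None:
        bucket = _buckets[hashlib.sha1(content).hexdigest()]
        for i, (existing, _) in enumerate(bucket):
            if existing == content:      # same bytes: replace the stored value
                bucket[i] = (content, value)
                return
        bucket.append((content, value))  # different bytes, same digest: extra slot

    def get(content: bytes):
        for existing, value in _buckets.get(hashlib.sha1(content).hexdigest(), []):
            if existing == content:      # exact comparison resolves any collision
                return value
        return None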
The point I'm trying to make, though, is that while practical collision attacks do have practical implications, they aren't relevant to password hashing. People routinely assume that collision attacks make a hash unsuitable for password hashing. They're not exactly wrong, but it's kind of a non sequitur. I felt that this was a point that had to be made because the framing of parent ("what the consequences of a collision are") implied that the expected answers were exactly the opposite.
The interesting thing about your comment is that you view password hashing as needing more security but git commit hashes to be less of an issue. I couldn't disagree more.
First, for a password hash function to matter, the hashed password must be compromised. If you don't have the hash it doesn't really matter all that much. Yes, hashed passwords do get compromised but many don't.
Second, I can't think of many bigger security threats than intentionally creating a SHA-1 collision for a git repository. Poisoning a Linux git repo could be truly disastrous. Remember too that if you can poison a git repo, you can also make the binary output have the same size and hash. I mean, that's much more difficult but not impossible, and it wouldn't surprise me if state actors have dedicated a lot of computing power to this.
And "is it possible?" is by definition always "yes".
The issue is that your commit still has to do something useful and have the proper colliding hash, which dramatically increases the effective collision complexity for something that is already nearly impossible to occur. Sure you could have some garbage commit with a colliding hash, but all that means is an inconvenience while they remove that commit.
Well, you can just comment out the garbage you create to match the hash so that's not a big problem in the case of source code. Creating a correct binary file may or may not be a bigger problem, depending on the file format.
> The interesting thing about your comment is that you view password hashing as needing more security but git commit hashes to be less of an issue.
I do? I should have been more clear that I think this is another case which very much depends on the other considerations I addressed and probably many more.
> And "is it possible?" is by definition always "yes".
You’re not the only person to say so, and… I know. I left the possibility open in my response mainly because I wanted to focus on the other considerations without dismissing the parent comment. Even if the answer was less definitive I believe considering level and degree of risk should be a bigger factor in these conversations than what’s theoretically possible.
My point wasn’t to address whether any hashing technique might be collision-free, but that assessing that fact is not a good metric for the fitness of a given hashing technique.
Collisions are possible in any hashing algo. You're taking infinite inputs and mapping them to a finite number of outputs. There are literally infinitely many collisions in any hashing function.
The thing is, collision resistance should scale with hash length. For an ideal 160-bit hash, a brute-force birthday search would need about 2^80 hash computations to find a collision. Researchers got that number for SHA-1 down to roughly 2^61. That's the difference between needing a state actor's level of resources and being almost possible with consumer-grade hardware.
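The back-of-the-envelope math, if you want to play with it (pure birthday-bound arithmetic, nothing SHA-1-specific):

    import math

    # Birthday-bound arithmetic for an ideal 160-bit hash: roughly how many random
    # hashes you need for a ~50% chance of at least one collision.
    bits = 160
    n_half = math.sqrt(2 * math.log(2) * 2**bits)
    print(f"~2^{math.log2(n_half):.1f} hashes for a 50% collision chance")  # ~2^80.2

    # The published SHA-1 collision attacks need ~2^61 to 2^63 work instead,
    # while a brute-force preimage would still be on the order of 2^160.
    print(f"attack speedup vs. brute force: ~2^{bits / 2 - 61:.0f}")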
> Security isn't measured in time, it's measured in money.
That's not how cryptography works though: many schemes (for sufficiently large keys) are, in our current understanding of mathematics, totally unbreakable over a period of time that stretches from now until the heat death of the universe.
That's measured in time and no amount of money is going to change anything about it right?
Security is measured in resources. Money is the best one, but it's often hard to put a $ figure on things, especially when you are making a standard for everyone, so people will often use CPU time instead.
If you can make the cost to break effectively "infinite", that's great, as it's really easy to analyze.
The assumed time requirement simply establishes an infinite present value lower bound on the cost in money. This becomes clear once you admit the possibility of an incorrect current understanding of the mathematics.
(Admitting the possibility of an incorrect current understanding of the mathematics is, like, a major reason to go with the enormous overkill on key sizes we do.)
Whenever you say how long something takes, you need to specify how much computing power is used to achieve it in that duration (FLOPS or whatever appropriate unit). Then you see what it costs to manufacture and operate the device that operates at that speed. Then you spend double to have 2 such devices and parallelize the job, completing it in half the time. Repeat that last part until the time is reasonable (TFA is all about this). It doesn't break down until you exceed the available natural resources and our ability to put them to work.
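As a toy illustration of that money-for-time trade (the hash rate and GPU price below are made-up placeholders, not figures from the article):

    # Made-up placeholder numbers: the point is only that, for perfectly parallel
    # work, doubling the hardware halves the time.
    total_work = 2**61        # rough work factor of the known SHA-1 collision attack
    rate_per_gpu = 5e9        # hypothetical hashes per second per GPU
    price_per_gpu = 1_000     # hypothetical dollars per GPU

    for n_gpus in (1, 10, 100, 1_000):
        seconds = total_work / (rate_per_gpu * n_gpus)
        years = seconds / (3600 * 24 * 365)
        print(f"{n_gpus:>5} GPUs: ~{years:8.1f} years for ~${n_gpus * price_per_gpu:,} of hardware")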
Brute-forcing 2^256 is impossible at any price; if you could convert all the matter in the Sun to energy and capture all the energy and drive the lowest power computers physics allows, you wouldn't have a billionth of the required energy:
Typically we calculate assuming the total energy needed at the Landauer limit at 3K. It's usually going to be multiples of the entire sun's energy output for billions of years, more if not using a quantum computer. Money isn't the relevant metric at the scales of modern cryptography.
But the Landauer limit, even at 300K, for a first preimage of even a completely secure hash is essentially zero. Define:
f(s, n, h) = 1 if there exists an n-bit preimage of h that starts with s, else 0
Computing f by brute force costs one bit erasure to prepare room for the answer. (And takes 2^(n - len(s)) operations, but we’re imagining a civilization that can build Dyson spheres and has quantum, and hence at least somewhat reversible, computers. We’re talking about the Landauer limit in particular.)
Now choose an appropriate n (slightly greater than the length of the hash) and compute f for s=[0] and s=[1]. After at most two tries, you’ll get a match, assuming a preimage exists. Call the first match s_1. Now repeat for s_1||0 and s_1||1. Call the first match s_2. Repeat this up to s_n. Now s_n is the lexicographically first n-bit preimage!
This costs O(n) energy at the Landauer limit. All crypto is broken! Never mind that the actual computation involved exceeds 2^n.
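If you want to see the prefix-extension trick in miniature, here's a toy version in Python with a deliberately tiny hash (the first 16 bits of SHA-256 over 16-bit messages, both numbers picked just so it runs instantly). The oracle is plain brute force here, which is exactly the work the thought experiment refuses to count:

    import hashlib
    from itertools import product

    N_BITS = 16      # toy message space: all 16-bit inputs
    HASH_BITS = 16   # toy "hash": the first 16 bits of SHA-256

    def toy_hash(bits: str) -> str:
        msg = int(bits, 2).to_bytes(N_BITS // 8, "big")
        digest = hashlib.sha256(msg).digest()
        return format(int.from_bytes(digest[: HASH_BITS // 8], "big"), f"0{HASH_BITS}b")

    def oracle(prefix: str, target: str) -> bool:
        """f(s, n, h): is there an N_BITS-bit preimage of `target` starting with `prefix`?
        Implemented by brute force; the argument only counts its single bit of output."""
        remaining = N_BITS - len(prefix)
        return any(toy_hash(prefix + "".join(tail)) == target
                   for tail in product("01", repeat=remaining))

    def first_preimage(target: str):
        prefix = ""
        if not oracle(prefix, target):
            return None
        for _ in range(N_BITS):
            # Extend the known-good prefix one bit at a time: if no preimage
            # starts with prefix+"0", one must start with prefix+"1".
            prefix += "0" if oracle(prefix + "0", target) else "1"
        return prefix  # the lexicographically first preimage

    target = toy_hash("0101010101010101")
    found = first_preimage(target)
    print(found, toy_hash(found) == target)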
With reversible computing, yes. I strongly suspect that'll never be practically realized, or in the event that they are that the slowdown will be large enough that other fundamental limits on the speed (rather than the energy) of computing will instead dominate.
With little money you need a long time, with a lot of money a shorter time is enough.
The security is good enough when the attackers need either more money than they have to break the cryptographic scheme in a useful time, or when using the amount of money that they afford results in breaking the cryptographic scheme after a time that is too long to be useful.
For cryptographic problems which are ideally parallelizable, the product of money by time is constant.
For most real problems the amount of required money increases much faster than the solving time is reduced (except that there are some thresholds where slightly more money can produce a large decrease in the solving time, e.g. when you can afford to design and make a dedicated ASIC for the problem; however after such a threshold and the corresponding step in the required time, the quantity of money still increases faster than the time decreases).
Agreed with the first sentence. Let's say you have a table full of SHA-1 salted + hashed passwords; is it even remotely worth the effort to spend even $1k per user / per password? Especially when many (though obviously not all) may be using unique passwords (such as myself)? Even if they use the same passwords for all sites, that is an incredibly large hill to climb...
Oh uh, to be clear, you should NOT be using SHA-1 right now. If you are reading this in order to justify not moving to a stronger hashing algorithm, none of us give you permission to do this, and if you get caught, any of us involved will vote for hanging you...with a noose...in a good old western style hanging...or something.
Seriously. Don't use SHA-1. Move on. If you are a manager and a developer tells you that you can't move on, please reach out. I've (regretfully) moved many from SHA1 -> SHA-XXX and I've provided migration paths. Migration paths with accessibility mind you...though that apparently doesn't matter to many of you cough Citibank cough.
Password hashing is better done with a function that is slow to compute. Delaying login by e.g. 0.001s (say 1000 times slower than normal) is unnoticeable, and logins are rare enough that the total compute impact is quite low. But for a brute-force attack a 1000x slowdown is horrible. Even if specialized hardware reduces that to only 10x, it might still be effective.
KDFs try to be slow, and they try to be hard to speed up with specialized hardware, all to prevent brute forcing.
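A quick way to feel the difference, using PBKDF2 from the Python standard library (the password, salt and iteration count are illustrative; tune the count to your own latency budget):

    import hashlib
    import os
    import time

    password = b"correct horse battery staple"
    salt = os.urandom(16)

    t0 = time.perf_counter()
    fast = hashlib.sha256(salt + password).digest()                # what NOT to do for passwords
    t1 = time.perf_counter()
    slow = hashlib.pbkdf2_hmac("sha256", password, salt, 600_000)  # deliberately slow KDF
    t2 = time.perf_counter()

    print(f"single SHA-256:          {t1 - t0:.6f} s")
    print(f"PBKDF2, 600k iterations: {t2 - t1:.3f} s")  # a login barely notices; brute force does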
There is an interesting parallel with PoW systems. Many cryptos have switched PoW algorithms to remove the advantage of specialized hardware. This is supposed to prevent centralization.
> There is an interesting parallel with PoW systems. Many cryptos have switched PoW algorithms to remove the advantage of specialized hardware. This is supposed to prevent centralization.
Unfortunately, this tends to just prolong the development of ASICs, and often keeps the first working versions hidden from the public with the sole purpose of being able to mine with vastly increased efficiency compared to others.
I think it worked for Ethash, given the number of GPUs using it.
Monero's RandomX also seems successful, but that has a higher bar (wanting to be CPU-only) and a smaller footprint, so ASICs might be easier to hide.
As a bonus for KDFs, there is much less economic pressure to 'break' them. So trying to develop ASICs / FPGAs / GPU implementation is pure cost, rather than an investment.
They're underestimating GPU costs substantially by looking at the MSRP, even for AMD cards, versus AWS pricing which is based in part on Amazon having to actually buy hardware.
For the 6800 XT, $900 is more realistic than $650 right now, so it'd be fair to bump the hardware price by about 40%.
Can you really get MSRP on a $20,000 order though? That's small enough that people could coordinate group buys instead of going through retailers if it were easily doable.
Probably not at MSRP, especially given the card they're talking about is the 6800XT which is very in demand.
Based on personal experience of being involved with orders of similar magnitude (~$30k-$40k), I'd guess you could get within 25% of MSRP rather easily, and within 10%-15% if you got lucky (and are good at negotiating).
> That's small enough that people could coordinate group buys instead of going through retailers if it were easily doable.
You'd have to trust somebody with several hundred dollars to a grand of your money. There's very few people I would do that for and I don't think I could easily put together a group big enough. Also there's other ways to get near/at-MSRP GPUs as an individual (ex: weekly AMD.com drop).
> You'd have to trust somebody with several hundred dollars to a grand of your money. There's very few people I would do that for and I don't think I could easily put together a group big enough. Also there's other ways to get near/at-MSRP GPUs as an individual (ex: weekly AMD.com drop).
Hmm if you know enough video game players at the office you could coordinate this through work. Might try it for the next generation.
I don't know for sure. If I had to guess, it has good performance per watt for mining, and miners generally care more about that than performance per dollar.
> And is it worth selling mine to buy a 6900XT or an nvidia?
If you're able to sell yours and buy a 6900XT (or Nvidia GPU better than the 6800XT, basically 3080 and up), then I don't see why not.
If you can sell yours and upgrade for a small cost - how much is the performance worth to you? You're probably only getting 10%-20% more performance (depending on GPU), is that worth $50, $100, etc?
Prices have been coming down quickly over the past month or so though. Lately I've even seen brand-new 3090s and 3080 Tis available at MSRP or even slightly below it on Amazon for hours at a time.
Although the pricing for AMD's lineup does seem to be in a weird spot right now, 6800XT for $950 but 6900XT for MSRP@$999.
Reminds me of multithreading. If a race condition is possible, you have to assume it will happen, especially because you can never predict what weird edge cases will trigger it to happen more than you'd expect.
Can someone ELI5 why collisions are so serious?
As far as I understand what a collision is, it means you have two source files that produce the same hash output, but why is this so bad? How do you have any control over what the contents of those two files are?
There's some theoretical background on how these functions work. Generally, the assumption is that hash functions used in cryptography meet the following:
- Each input can be quickly converted to a digest (hash).
- Getting from a digest to the input should be infeasible.
- It should be infeasible to produce two inputs such that they map to the same digest (although strictly mathematically, people tend to prove it is inefficient to do so).
and a few more assumptions that are irrelevant for your question.
In a nutshell, the idea is that hashing functions are deterministic. One input maps to one output, and by knowing the output I cannot find the input, nor can I produce two inputs that map to the same output.
Now, those attacks basically violate the principle that one input maps to one output. Suddenly, many inputs map to the same output.
What's the risk though?
You want to download a Linux distro with a hash digest of value ABC. I'm somewhat interested in people using that Linux distro, so I tamper with the distro to introduce a tracking mechanism, in such a way that the result also produces ABC when hashed. You download the distro, verify that the hash is indeed ABC, and install it.
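For concreteness, the verification step on your side looks roughly like this (a sketch; the filename and expected digest are placeholders):

    import hashlib

    def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
        """Stream a file through SHA-1 (illustration of the verification step only)."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    expected = "abc123..."  # placeholder for the digest published by the distro
    if sha1_of_file("distro.iso") != expected:
        raise SystemExit("hash mismatch: refusing to install")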
A more structured answer to your question is available here https://github.com/corkami/collisions (he also put together a presentation that I can't find right now).
Hope this helps :)
> One input maps to one output and by knowing the output I cannot find the input nor I can produce two inputs that map to the same output.
> Now, those attacks basically violate the principle that one input maps to one output
That's not really… the principle. That's an impossible desire. You simply cannot fit a larger set in a smaller set. Any function that transforms an arbitrary set of data into a smaller set of data is definitionally going to have collisions.
A flawless hashing function that transforms an arbitrary amount of data into a fixed amount of data is subject to an infinite number of collisions for every given output.
The problem is not that there are collisions, it’s mathematically impossible for there not to be. The problem is the speed at which they can be found. If you can find them quickly enough, the potential exists in finding malicious inputs in the mix.
That's not what this means, and in fact they might not be. It is possible for a hashing function to have no input at all that leads to a particular hash (it would still be a hashing function).
If you use the hash (which we assume you acquired over a secure channel) to verify the download of, e.g., an installer for some sensitive software, and someone manages to give you a backdoored installer with the same hash instead, that would be rather bad.
I think the question is how it's useful if you happen to have two hashes that match; it's very unlikely that they will.
The problem comes when you can generate a matching 128-bit hash with your own content without having to do 2^128 different hashes (or get ridiculously lucky, lottery-winner-struck-by-lightning lucky, and only need to do 2^100 hashes).
Lets say you have an anti-malware service that sees file1, analyzes it, and decides the file is benign. The service then updates a database saying 'hash xyz has been analyzed and is benign'.
Then you send file2 which is actually malicious, and the service decides it has already analyzed this file and lets it go.
This would work on a lot of big AV appliances and software. A surprising amount of these are still using md5 for de-duplication.
This specific attack is not possible against SHA-1 (yet).
The known collision weakness only allows generating two colliding pieces ahead of time with a hash value that you don't control. It does not give you ability to collide with a pre-existing file/pre-existing hash, so you can't attack any existing SHA-1-verified file. You can only attack files that you've specifically crafted/tricked to contain your colliding block.
Because people use it as a unique identifier for data.
Let's say you have a file on dropbox and they use SHA1 for data de-duplication. Someone else creates a file with the same SHA1. Dropbox will recognize the SHA1 and do one of two things:
- sync their file on top of yours, or
- sync your file to their file system
(not a real example, and an actual implementation will be less naive, but this is what can happen)
I would expect a tiered approach when it comes to deleting data. E.g. before deleting you actually check byte by byte, but because that is too expensive to run on every file, you use the hash to narrow down what files you are testing against each other (and maybe even a cheaper hash on top of that one that determines whether you will spend CPU cycles on doing an expensive hash).
Perhaps that's how it's actually done, I'm not sure.
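Something like this, roughly (a sketch of the tiered idea, with CRC32 standing in as the cheap first pass; none of it reflects any real provider's implementation):

    import hashlib
    import zlib

    def same_content(path_a: str, path_b: str) -> bool:
        """Tiered duplicate check: cheap checksum, then a strong hash, then bytes."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            a, b = fa.read(), fb.read()

        if zlib.crc32(a) != zlib.crc32(b):        # tier 1: cheap, rules out most pairs
            return False
        if hashlib.sha256(a).digest() != hashlib.sha256(b).digest():   # tier 2
            return False
        return a == b                             # tier 3: authoritative byte comparison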
Here's a good example of why SHA-1 collisions are so serious:
Let's say there's a cloud service where you configure SAML by setting the SHA-1 thumbprint of your identity provider's response signing certificate instead of uploading the IdP's XML metadata or its signing certificate directly.
If an attacker can create their own SAML response signing certificate that hashes to the same value as your identity provider's, the attacker can forge seemingly valid SAML responses that would give them access to the cloud service. This attack would work because in SAML, the following things are true:
1. The public half of the key used to sign a SAML response is embedded in the response alongside the signature.
2. SAML does not use PKIX's hierarchy of certificate authorities, only the X.509 public key certificate format.
Trust relationships between SAML identity providers (a/k/a credential service providers) and SAML service providers (a/k/a relying parties) are normally established by directly exchanging entity metadata that includes at a minimum the identity provider's signing certificate or by using metadata aggregates signed by a trusted third party (e.g., all multilateral trust federations—eduGAIN, JISC, InCommon, INFED, etc.) But in this thought experiment, an attacker would be able to generate their own signing key-pair and append whatever junk to the user-defined fields of the X.509 public key certificate is needed to cause a SHA-1 collision.
That's basically how the Shattered proof of concept works, only using junk JPEG data embedded in a PDF:
Cryptography is disastrous to use incorrectly, and "make sure a prankster can't add suffixes to anything" is exactly the kind of property that's easy to screw up.
For example, have you ever run "git fetch" on a pull request? You just hashed files you didn't make, and stored them under their hash. Whoops.
Have you ever used nix? A distributed content addressing system? Validated something is safe then signed a hash of that thing? Anything where a file and its hash are assumed to be interchangeable for identification? If you just go looking, the world abounds with chances for mischief.
Follow-up question: wouldn't another use case be that if passwords from a database were leaked, but hashed, you could still log in as a user if you found a collision? But I think salting would prevent you from doing that, as the collision you found would be of the salted password, which will be salted again before being hashed and compared, correct?
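For concreteness, the flow I'm asking about is roughly this (a sketch only, with made-up names; nobody should actually use SHA-1 for passwords):

    import hashlib
    import hmac
    import os

    def register(password: str):
        salt = os.urandom(16)
        digest = hashlib.sha1(salt + password.encode()).hexdigest()
        return salt, digest          # this (salt, digest) pair is what a breach leaks

    def login(password: str, salt: bytes, stored_digest: str) -> bool:
        candidate = hashlib.sha1(salt + password.encode()).hexdigest()
        return hmac.compare_digest(candidate, stored_digest)

    salt, digest = register("hunter2")
    print(login("hunter2", salt, digest))   # True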
Not (at all) an expert, but I think what you are referring to is a harder problem: given a SHA-1 hash, finding something that maps to it. Getting a collision is quite a bit simpler: just create two inputs that map to the same thing, or (more realistically in attacks) you know the original content (and therefore its SHA-1 hash) and you want to create other content that has the same SHA-1. If, on top of that, you can also decide (to some degree) what to put in, you can add some "invisible" malevolent pieces. As far as I understand, this is much easier than reversing an arbitrary SHA-1 hash (with or without "chosen" additional content) when you don't know any input for it, because the collisions found so far were always rather big files sharing the vast majority of their contents. An ideal hash would give totally different results if you changed even a single byte of the contents, but apparently this is not the case for SHA-1 (but again, I'm just a casual reader about these things!).
Sorry for the tangential post here, but the cost difference between hosting the GPUs and using AWS is jarring to me. The whole value proposition is to do things cheaper. Now I can't understand how this can be correct anymore. Has it ever been the case at all?
One of the reasons the cost looks so radically different is also because it is not comparing the same things.
AWS is not just GPU + electricity. It's GPU + a server + a rack + electricity + network + a physical data center + real estate + labor + reserve capacity + automation, etc...
AWS GPUs (or ones offered by any of the big cloud providers for that matter) are not really all that cost effective. They also have to use datacenter GPUs, which are a lot more expensive than consumer ones.
For cost effectiveness, one route (mostly in the ML community) is to rent people's machines via services like vast.ai, who can get around NVIDIA preventing use of consumer cards in data centers by making it so they're only selling/buying time on people's computers.
These end up being a lot more affordable (IIRC ~$20/day for a 2x 3090 system) but in exchange have the unreliability that comes with renting time on someone else's computer, like tasks randomly getting killed (which is pretty frustrating to wake up to, as you get charged for time rather than usage), some people having poor upload/download speeds or other gpu performance issues.
If I recall correctly Nvidia does not allow you to put consumer cards into data centers. They force you to buy a different line of cards which are like 10x more expensive. That may be a driver of the AWS price.
> The whole value proposition is to do things cheaper.
I don't think this was ever the case, at least, not in the short term or at smaller scales.
I always thought the value proposition for AWS was the fact that you could scale up and down nearly instantly and that you didn't have to run your own data center, which was a monster of an expense.
AWS made it so a single person could be a startup. With just a couple clicks, you've got a server running, and you only pay for the time you use.
Now, maybe you're asking about GPUs specifically...in that case, yeah, I don't know why people would use AWS unless they really only need a couple hours of GPU time per month or something. GPU instances are so expensive that the ROI for buying them yourself hits pretty quickly. You have to REALLY need the rest of the AWS infrastructure to be directly connected to your GPU to make it worthwhile.
That's not the case. They're saying you can find a collision easily (i.e. I can easily give you two files that have the same md5sum).
However, "crack an md5 hash" isn't what that means. First, you can't really "crack" a hash (there are infinitely many inputs that have the same hash), but even just going from a hash to any input with that hash is much harder.
Yeah, this is an important point. The known published attacks on MD5 are pretty narrow, so there are many circumstances where you can rely on MD5 hashes if you have to.
It was said a bit in jest, but in seriousness: only if you publish the hash and dox yourself, or the attacker gets your hash from a DB, ties it to some identifier, and finds you. That was the joke; in this case the comment does this. (I have no idea if or how much; I didn't check.)
Given I generated the hash with something like "head -c 1000 /dev/urandom | md5sum", I also don't really think a wrench attack would work here, even if you could find me. I don't know any input that produces that md5sum, no matter how many times you hit me with a wrench.
SHA256 and SHA512 alone are not awful, but they're slower than BLAKE3 and vulnerable to length extension.
You can add extra tricks to protect them from length extension, like a length prefix or like doing SHA512/256, but it's better not to add tricks to your crypto.
I think this is only relevant if you're using a hash as a MAC, in which case you should be using HMAC-SHA256 instead which is not vulnerable to length extension attacks
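For example (a sketch; the key and message are placeholders):

    import hashlib
    import hmac

    key = b"shared-secret-key"            # placeholder
    message = b"amount=100&to=alice"      # placeholder

    # Length-extension-prone construction (don't do this): SHA-256(key || message).
    naive_tag = hashlib.sha256(key + message).hexdigest()

    # HMAC wraps the hash so length extension doesn't apply.
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()

    def verify(msg: bytes, received_tag: str) -> bool:
        expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, received_tag)

    print(verify(message, tag))  # True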
bcrypt is a password hash. It's optimized to defend against brute-force attacks on a short secret string -- which means it's slow (tens to hundreds of milliseconds), and it doesn't support inputs longer than 72 bytes at all.
SHA256 and the BLAKE family are designed for hashing larger bodies of data. They're fast (~5 CPU cycles per byte) and can handle an arbitrarily long input.
Pick your poison. Here are the latest results from SMHasher, listing various hashing options, many of them even faster than BLAKE3. Find one that works for you and has a stable library for your language/platform.
I don't know how that benchmark figure was measured, and I think these are more informative: https://bench.cr.yp.to/results-hash.html. The specific CPU and the input length both matter a lot.
How long is a string? That question is just as easy to answer as the one you posed.
Your question is best answered by asking more questions. What are you doing with the hash is probably the first one. The answer will go a long way toward picking an algorithm that is most appropriate when weighing security against the expense/time to hash.
Does using and verifying multiple hash functions help? If something verifies with SHA1 and MD5 is it better than just one? Of course you’d add SHA256 and BLAKE3 or whatever.
>Does using and verifying multiple hash functions help?
Technically yes, but it takes extra time to hash something twice and extra storage to store two hashes. This trade-off is not really worth it compared to just using a single good hash function.
Earlier versions of TLS use MD5 + SHA1 not repeated, but applied side-by-side to produce a 288-bit hash; this presumably has a multiplicative effect on the strength, since now you have to generate a collision which satisfies both functions simultaneously.
> If F and G are good iterated hash functions with no attack better than the generic birthday paradox attack, we claim that the hash function F||G obtained by concatenating F and G is not really more secure than F or G by itself.
I can't generalise for f(g(h(foo))), but as referred to in sibling comment (dannyw), the specific example
sha1(md5(sha1(foo)))
decreases security because the output of MD5 is 128 bits, compared with 160 bits for SHA-1. The arbitrarily long input foo is reduced to an input of 128 bits at the final step, and you only need to find that input for a collision (independently of any MD5-specific weakness you may also have added to the mix).
It's not that simple. Finding collisions for md5(f(x)) could be harder than for md5(x) if f is a half-decent hash function. For the former, likely the best you could do is a (brute-force) birthday attack, with complexity ~2^64. OTOH, the best known collision attack on SHA-1 has complexity < 2^63.
The "when done correctly" is a huge caveat, given that what you're replying to does it incorrectly. If X and Y collide, so that SHA-1(X) = SHA-1(Y), then f(SHA-1(X)) = f(SHA-1(Y)) no matter what f is -- so you get a collision.
Concatenating the outputs of the hash functions fixes that particular issue. But concatenating makes preimage attacks easier.
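You can see both effects with a toy "broken" hash (a truncated MD5 standing in for SHA-1, so a collision can actually be found in a few thousand tries; the real attacks obviously work differently):

    import hashlib
    from itertools import count

    def weak(data: bytes) -> bytes:
        """Stand-in for a broken hash: only the first 3 bytes (24 bits) of MD5,
        so a random collision turns up after a few thousand inputs."""
        return hashlib.md5(data).digest()[:3]

    # Birthday-search a collision in the weak hash.
    seen = {}
    for i in count():
        msg = str(i).encode()
        d = weak(msg)
        if d in seen:
            a, b = seen[d], msg
            break
        seen[d] = msg

    print(a != b and weak(a) == weak(b))   # True: a genuine collision in the weak hash

    # Nesting: any function applied on top of a colliding value still collides.
    print(hashlib.sha256(weak(a)).digest() == hashlib.sha256(weak(b)).digest())   # True

    # Concatenation: the strong component still tells the two inputs apart.
    print((weak(a) + hashlib.sha256(a).digest()) ==
          (weak(b) + hashlib.sha256(b).digest()))   # False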
If you run it against a database of thousands or millions of passwords, you can trickle out a steady flow though. So each GPU can crank out a few cracked accounts a day.