It's not a particularly interesting email; I don't think it brings anything new to the table. The title also doesn't have anything to do with the contents (I understand that might be because of HN submission rules, but it's very misleading in this case).
It's a bit hard to take the author seriously when he complains about "headless chickens" "considerably exaggerat[ing]", then goes on to say that you need "a nation-state's worth of resources" to find collisions. If anything, the SHAttered proof of concept showed that it's actually a lot easier than that, with an estimated cost of around $110k IIRC.
I'm also sure that SHA-1 remains pervasive in many codebases, although as long as pre-image attacks are impractical it might be hard to exploit those vulnerabilities.
>I'm also sure that SHA-1 remains pervasive in many codebases
I've done a lot of security reviews of backup solutions over the years, and nearly universally they use SHA1 for de-duplication. Security professionals have been warning about this practice, but when you are dealing with, say, a cloud backup solution, speed and low compute requirements are a must, and it's hard to win the argument when someone tosses "theoretical" onto the table. Well, we are now past "theoretical", and that's what matters.
I just tried out the SHA1 collider at https://alf.nu/SHA1 and created two PDFs containing entirely different bits of text in JPGs. It took seconds, and the output files were indeed SHA1-identical. This is no longer nation-state-level stuff. Computing colliding documents and uploading them to cloud backup providers who de-dupe across the whole file system, rather than only within the boundaries of each account, could make it possible to wipe out or change other people's data. At a minimum that would be an annoyance; at worst, it could result in the intentional alteration of someone's content to cause significant financial or physical harm.
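For anyone who wants to check a pair like that locally, a quick sanity check is to hash both files with SHA-1 and with something else. A minimal sketch, assuming the two PDFs are saved locally (the file names below are placeholders, e.g. the shattered.io sample PDFs or whatever the collider produced):

    import hashlib

    # Placeholder file names: e.g. the shattered-1.pdf / shattered-2.pdf
    # samples from shattered.io, or the pair the alf.nu collider gave you.
    for name in ("shattered-1.pdf", "shattered-2.pdf"):
        data = open(name, "rb").read()
        print(name)
        print("  sha1:  ", hashlib.sha1(data).hexdigest())
        print("  sha256:", hashlib.sha256(data).hexdigest())
    # The SHA-1 digests match; the SHA-256 digests (and the contents) do not.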
The two PDFs used the same (colliding) prefix and included both of the files you supplied. Once you've reached a hash collision, any number of identical blocks after that will hash the same.
If you want proof, submit a different image: the hashes will be identical to each other, but different from your first attempt.
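A quick sketch of that point, assuming you have such a colliding pair on disk (file names are placeholders): because the Shattered-style collision is a collision of SHA-1's internal state, appending the same bytes to both files keeps the digests equal.

    import hashlib

    # Placeholder names for an equal-length colliding pair, e.g. the two PDFs
    # produced by the collider above.
    a = open("collision-a.pdf", "rb").read()
    b = open("collision-b.pdf", "rb").read()
    suffix = b"any identical trailing data"

    assert hashlib.sha1(a).digest() == hashlib.sha1(b).digest()
    # The collision happens in SHA-1's internal state, so a shared suffix
    # preserves it:
    assert hashlib.sha1(a + suffix).digest() == hashlib.sha1(b + suffix).digest()
    print("still colliding after appending a shared suffix")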
Can you explain the difference between a collision attack and a preimage attack? How can you get two PDFs with arbitrary contents having the same SHA1 without a preimage attack?
The PDFs don't have arbitrary content; they contain very specific headers (crafted by Google's researchers). In one file the header selects one image, in the other it selects the other image. These two headers are SHA1-colliding, and the remainder of the files is identical.
I think that's actually a great example of how it may not be considerably exaggerated.
SVN is probably not the only piece of software where you can create a mess solely with the already-released collision. It's more like a DoS, and less like actually injecting a malicious payload, but potentially still destructive.
Edit: Perhaps I'm missing the context of "considerably exaggerated"? Are there some examples of people saying the sky is falling?
Peter suggests that everyone using SHA-1 should move to SHA-256. That's a reasonable suggestion, but I say: as long as you're making hashing changes, why not move to SHA-512?
Remember, it's also in the FIPS SHA-2 standard and faster on 64-bit CPUs than SHA-256. Its output is only 64 bytes long; surely that's not too much to handle.
Edit: Google also suggests SHA-256, so perhaps Peter was simply seconding the recommendation. I suggest SHA-512 is the better recommendation.
Instead of just migrating everything to a new hashing function whenever one goes bad, why not migrate to something that is future-proof and easy to switch out? Multihash[0] is one solution: each hash carries information about which hash function was used to generate it, so you can have the same input with multiple, self-describing hashes.
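To make the idea concrete, here is a minimal sketch of a multihash-style self-describing digest. The function codes are from the multihash table as I understand it; treat the encoding as illustrative (a real implementation uses varints and the official code table).

    import hashlib

    # Multihash-style encoding: <function code><digest length><digest>.
    CODES = {"sha1": 0x11, "sha2-256": 0x12, "sha2-512": 0x13}

    def multihash(name: str, data: bytes) -> bytes:
        digest = hashlib.new(name.replace("sha2-", "sha"), data).digest()
        return bytes([CODES[name], len(digest)]) + digest

    data = b"same input, multiple self-describing hashes"
    for name in ("sha1", "sha2-256", "sha2-512"):
        print(name, multihash(name, data).hex()[:24], "...")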
Multihash is nice, but it solves exactly two out of many problems with migrating to a new hash.
First, it carries the information of which hash function was used to produce the digest "in-band", so that this information does not need to be obtained out-of-band, or from a different input, or inferred from context.
Second, because it carries this information inside the same data item, it essentially domain-separates the digest space by hash function, which ensures that two digests generated by two different hash functions never collide into the same value. While this property seems very useful, in truth such a cross-function collision could only happen if the two hash functions in question generated digests of the same length in the first place, and because the hash functions are different you can't mount an ordinary collision attack anyway; you would have to work backwards from one of the outputs to try to break the other [1].
This then shows that the hard part about "migrating hashes" usually isn't the data structures, but rather: the policies and practices for how you change the access to, or treatment of, data items identified or checksummed by the now-insecure hash; whether you have enough data and knowledge within the closed system that seamless internal migration is possible; how you communicate changes in content-addressing to users so that they can evaluate the trust and risks (similar to "seamlessly" upgrading HTTP to HTTPS); and so on.
[1] I'm not a crypto expert, so I'll avoid commenting in precise technical terms [2] on exactly which attack this would entail.
This makes the output length rather odd (not even divisible by 4 bytes), although that wouldn't matter much in non-performance-critical applications. Also keep in mind: when using hashes as a key for something (file name, hash table), you'll need to take bits from the end, not the beginning, when splitting them up (e.g. for multiple directory levels).
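For example, a sketch of what splitting from the end might look like (hypothetical layout; the point is just that the constant algorithm/length prefix at the front would otherwise put everything in the same bucket):

    # Hypothetical on-disk layout for content addressed by a multihash-style
    # hex string: shard on the *end* of the digest, since the front starts
    # with a constant algorithm/length prefix.
    def storage_path(hex_digest: str) -> str:
        return f"objects/{hex_digest[-2:]}/{hex_digest[-4:-2]}/{hex_digest}"

    print(storage_path("1220af1349b9f5f9a1a6a0404dea36dcc949"))
    # -> objects/49/c9/1220af1349b9f5f9a1a6a0404dea36dcc949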
I didn't know about this spec/format (I know they call it a protocol; I'm not so sure it can accurately be called that). It seems cool until you realize that you must support all of the hashes, or at least a handful of them, in your app. That's convenient from the developer's POV but wasteful from the database administrator's, because you must either future-proof that column size or drop performance altogether and start using a "document DB" for authentication, which is kind of odd.
IMHO you should not "future-proof" password hashes; collision attacks on passwords are more difficult to pull off than many people realize. What you should do as a developer is stop using the hash and migrate only the systems that actually require it, on a case-by-case basis.
Depending on the application, the overhead of 512 bits vs 256 bits is significant. In those applications SHA-512/256 and BLAKE2b-256 are my go-to recommendations. I don't recommend SHA-3; its design is obviously considered very solid, but its performance characteristics make it uninteresting for software compared to both SHA-2 and BLAKE2.
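A quick sketch of what 256-bit output from a wide hash looks like in practice. Note that plain truncation of SHA-512 is shown only as an illustration: the standardized SHA-512/256 uses different initial values, so its output differs, but both avoid length-extension issues.

    import hashlib

    data = b"example input"

    # BLAKE2b with a 256-bit (32-byte) digest:
    blake2b_256 = hashlib.blake2b(data, digest_size=32).hexdigest()

    # Illustrative truncation of SHA-512 to 256 bits. This is NOT the
    # standardized SHA-512/256 (which uses different initial values), but
    # like it, a truncated wide hash resists length-extension attacks.
    sha512_trunc_256 = hashlib.sha512(data).hexdigest()[:64]

    print("blake2b-256: ", blake2b_256)
    print("sha512[:256]:", sha512_trunc_256)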
I wouldn't recommend using SHA-2-512 for added security. Choosing SHA-3-256 would make more sense, as more knowledge has gone into the construction of SHA-3 than SHA-2, and the double bit length doesn't buy you anything if the fundamental construction is broken.
The primary design purpose of SHA-3 was not to replace SHA-2 with something 'better'; it was to provide an alternate approach to hashing, so that if one approach is compromised there is a completely different approach that is still viable.
There are good reasons to use SHA-3, but one should not choose SHA-3 on the grounds that it is known to be a better hashing algorithm than SHA-2. It is not. In fact, SHA-2 enjoys a much greater level of confidence due to its age.
Probably not. There's SHA-3 (though reputedly slow), and my favourite, Blake2b.
I've implemented both SHA-256 and SHA-512, and studied the more modern Blake2 in detail: Blake2 is not more complex, it's fast, and for now it's solid. I expect it to stay secure for a long time, given its ChaCha heritage.
I meant that you definitely want the truncation, not whether you pick SHA2 or not. There are other choices for sure that are also immune to length-extension.
SO BIG. For git, most people don't realize that in most cases, if the repo isn't giant, you can use just a short abbreviation like 96a as a shortcut. I imagine people would be turned off by the sheer size.
If I were eyeballing two hashes to make sure they were equal, I'd worry about my ability to see near collisions (similar to [0]) the longer the hashes were. My eyes just start skipping ahead at some point, and would probably miss different-but-similar-looking characters in the middle.
But I think SHA-512 is resistant to near-collisions (sorry, I don't have a citation), and besides, one would probably best compare hashes using a script one understands.
All good hash functions have an even distribution, and are therefore resistant to near-collisions. Hash tables for example require this property to work properly.
There is a nice clean implementation in tweetnacl - while we are at it, commits could be signed with eddsa (or xeddsa/vxeddsa, though I don't know if those have compact libraries?)
No, one of the characteristics of a cryptographic hash function is that output should be completely unpredictable based on input. So the probability that the low order bits will change vs the high order bits should be exactly the same given any arbitrary inputs.
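As a toy illustration of that unpredictability (the avalanche effect): flipping a single input bit changes roughly half of the output bits, scattered across the whole digest. SHA-256 is used below purely as an example of a modern hash.

    import hashlib

    def differing_bits(a: bytes, b: bytes) -> int:
        return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

    msg = bytearray(b"the quick brown fox jumps over the lazy dog")
    h1 = hashlib.sha256(bytes(msg)).digest()
    msg[0] ^= 0x01                      # flip one input bit
    h2 = hashlib.sha256(bytes(msg)).digest()

    # Expect roughly 128 of the 256 output bits to differ, spread evenly.
    print(differing_bits(h1, h2), "of", len(h1) * 8, "bits changed")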
There is the "hardened sha-1" implementation that was released along with the sample collisions. It supposedly detects this type of attack and returns unique SHA-1 hashes. Not sure that's really a fix, but perhaps something someone would want to use short term.
The description seems curious; I'm not sure what the need for it would be. Retro-compatibility with existing stores/caches (but only in one direction), plus mitigation against some collision attacks? Hmm, I don't see the point. If you switch the hash function, in the vast majority of cases you can switch it completely to something known to be genuinely better, not a weird hack. It must fit a purpose in some Google software as a better-than-nothing, quick-and-dirty temporary hack, and I would not advise doing it in other contexts unless you know for sure you have no other choice.
The sha-1 attack is only a collision attack. It's not a pre-image attack. So if you clone a git repo and want to sneakily replace an existing object, you will not be able to do this because one of the inputs has already been fixed.
In order to perform that kind of attack, there would need to be a second pre-image attack, which does not exist right now.
Even md5 still has second pre-image resistance with a search space only slightly below the entire output space.[1]
So add a new object to the git repo (open source projects usually allow contributions) for which you already have a malicious SHA1-colliding object in your pocket. If your change is widely distributed, you now have a hash in the wild that matches your malicious data.
Right, assuming it goes through no revisions for code review and doesn't get rebased, that could potentially work. You would need to figure out where to have some random data in the commit that you're altering to search for the collision. Maybe some "test data" or something and hope nobody asks you to remove it.
As a total neophyte on these kinds of things, the article seems to be talking about Google's SHA vulnerability as if it's a preimage attack rather than a collision one. Anyone more knowledgeable care to chime in?
Some protocols are sensitive to mere collisions. Also, this suggests actual preimage attacks could be found later. Not today, but still. Finally an accurate explanation is longer and has less punch.
It's safer to join the "headless chickens", consider SHA-1 officially broken, and start thinking about alternatives.
IIRC, collision attacks may be considered precursors to preimage attacks. Some preimage attacks look like "brute force" Monte Carlo simulations of collision attacks, running enough collision attacks in parallel until one of them produces the preimage.
I'd rather join the headless chickens, actually. True, it took a year's worth of computation on essentially stock hardware this time. But that's only going to get cheaper.
By the way, nation-states won't use GPUs. They'll use ASICs, which tend to be 4-6 orders of magnitude more energy efficient than GPUs for this kind of thing (at least they are for Bitcoin mining). I just hope nobody has succeeded in the business of selling MD5 colliders; it would mean the same could work with SHA-1.
The GPU computations didn't take long for Google; it was the CPU computation that took a long time: 110 GPU-years vs 6,500 CPU-years. I didn't read into the technical details, but given that the expensive phase wasn't done on GPUs, I'd guess it wouldn't be easily done or improved with ASICs either in this particular case.
Hmm, we'll need to wait for the source code and a detailed analysis of the exploit, but 6,500 years of CPU time suggests a high degree of parallelism right off the bat.
Depending on the demands of the algorithm, an ASIC could outmatch an x86 farm —possibly by even more orders of magnitude than they do GPUs.
Possible hurdles for an ASIC are memory hardness (memory costs the same no matter the architecture), branching, and complex operations such as multiplications. Those could destroy any advantage an ASIC has.
Can anyone answer why exactly this makes Git vulnerable?
I was under the impression the use of SHA1 was only for hashing and not for a security signature. And what would be the benefit of intentionally causing a collision? Would it cause some sort of DoS like someone else here mentioned?
> I was under the impression the use of SHA1 was only for hashing and not for a security signature.
This is mostly correct. The commands git commit -S and git tag -s both sign commits and tags, respectively, using GPG. The signature covers only the commit object: the tree's SHA1, the commit/author data, and the commit message, not the entirety of the data.
git's objects' SHA1 is computed, for "file" objects, as "blob " + the ASCII decimal size of the blob + NUL + the data in the blob. The two PDFs in Shattered are the same size, and thus have the same git object header. However, naïvely prefixing the two Shattered PDFs with their git header results in different hashes; I presume this is because the internal state of SHA1 differs from what the constructed colliding data expects. You can see this yourself:
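Something along these lines (a rough equivalent of git hash-object, with placeholder file names for the two Shattered PDFs):

    import hashlib

    # git's blob object id: sha1("blob " + <decimal size> + "\0" + <contents>).
    # File names are placeholders for the two Shattered PDFs.
    for name in ("shattered-1.pdf", "shattered-2.pdf"):
        data = open(name, "rb").read()
        header = b"blob %d\0" % len(data)
        print(name, hashlib.sha1(header + data).hexdigest())
    # The two object ids printed here differ, even though the raw files'
    # SHA-1s are identical, because the prepended header changes SHA-1's
    # internal state before the colliding blocks are processed.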
If you commit the two PDFs to git, those are the hashes they will have. They are different. Now, if you take the header into account when computing the collision, you can create one. The paper seems to say that the attack takes a known (and controllable) prefix P, and finds two sets of two 512-bit (64-byte) blocks (different for the two files), M_1^(1) and M_2^(1) for the first file, and M_1^(2) and M_2^(2) for the second file, that cause the internal state of the hash to collide; after that, any (also controllable) suffix S (the same for both files) can be appended.
This is why the diff[1] is as long as it is: each side of the diff is 128 bytes, i.e. the two M blocks.
Thus, if you computed a prefix P that started with the git blob header, then some data for your file, and ran this attack, you should be able to create two files that, when committed to git, collide. The two pieces of data in the header don't cause any trouble: the paper's method allows you to control the output size, mostly, so we can mostly choose any size we want; the type is always "blob" and is thus effectively constant.
Now, from what I can gather from the paper, there doesn't seem to be real control over the portion that differs; that's really what we need to take advantage of this beyond just creating collisions. This is a collision, but it's not what's called a chosen-prefix collision. A chosen-prefix collision lets me choose two different prefixes (which gives the attacker much more control over the differences in content, thus it becomes much easier to craft a "good" and a "bad" version); this attack requires both files to have the same prefix.
Now, here is the worst way I can think of as to how I could get a colliding object to you:
Imagine that I can convince someone you trust to sign a commit; this commit contains, either directly or indirectly via an ancestor commit, an object whose hash we will collide.
Now suppose I can later get you to download that commit and all its parents, except that I substitute one object's data for another's. The signature is still good: I've not changed the commit object in any way; it references objects by SHA1 hash, and the hash hasn't changed, "only" the data.
Here's another scenario, and you don't need signatures in this one: I push a commit to master with the "good" version of the object, but before you pull it, I push you a branch that contains the bad object, so git writes the bad object to your objects folder under the colliding hash. You now pull master, but git doesn't fetch the "good" version of the object, because it already has an object with that hash. Your master is now different; I've effectively poisoned your repo with the bad object.
Now, whether you can pull off this stunt or not, IDK. My point is that a) git's signatures don't cover the entirety of the repository, only a now-very-weak cryptographic hash, and b) git is (I believe) subject to object collision from this. But presently I'm not seeing how it can be maliciously taken advantage of. But then, people are really clever.
It probably wouldn't be too hard to create a fork of Git that replaces SHA-1 with SHA-256 (or Blake2, if that's your thing).
At the same time, you could remove the hash prefixes (blobs are prefixed with "blob") so that the hashes would be identical to those generated by other software.
I'm sorry, but this mail is totally stupid.
It assumes it takes one year to complete the exploit.
I don't think there is anything preventing someone from doing such an exploit in one month, or even less.
With the various CaaS providers the total cost even remains the same!
The cost model also assumes non-criminal behavior. But for most cases where somebody would be motivated to find collisions for breaking a cryptographic scheme, they're unlikely to be restrained by the normal boundaries of the law.
IOW, for a criminal organization (or just a single criminal) with a huge botnet the direct cost might be closer to $0. Of course there's opportunity cost--botnets are often rented like a cloud service--but that's the case for everything.
Also, this is why I avoid bikeshedding technology like Argon2. If your password database is stolen, a 10,000 or 100,000 strong botnet isn't going to have much trouble cracking a substantial fraction of user passwords in that database. Memory-hard algorithms like scrypt and Argon2 are designed to thwart specialized FPGA- and ASIC-based solutions. But cloud services and botnets use general purpose hardware. While specialized hardware will always be more efficient, the scale you can achieve with general purpose hardware is mind boggling.
If people spent half the effort they spend bikeshedding password authentication on supporting hardware security tokens instead--both client- and server-side (i.e. with an HSM hashing client passwords using a secret key)--we'd all be in a better place.
I remember when I first heard of git I wondered why it didn't use a member of the SHA-2 family (which had been out for several years by then). Even in 2005 I think that it was fast enough.
Or, with a sufficient budget (not even completely unreasonable for a nation-state that presumably could use a very large cluster for other purposes), generate a collision in an hour or two. That would be an interesting exercise: how much hardware and how many kWh would it take to generate a SHA-1 collision in 60 minutes?
The numbers are available: 65,000 CPU years & 110 GPU years.
Assuming that the massive numbers of cores necessary don't cause any extra difficulties (seems like a stretch), we get 68,328,000,000 CPU cores to do the CPU portion in 30 minutes, and 115,632,000 GPUs to do the GPU part in 30 minutes.
Assuming 1U servers each with 2 64-core CPUs, that's at least 533,812,500 physical servers, which is at least 12,709,822 42U server racks. That would take up ~85,000,000 sqft, or 3.04 square miles.
That's just for the CPUs. You could probably fit 1GPU per server, but I assume this isn't the typical config for cloud GPUs. In any case, it could probably be done in, let's call it 4 square miles of server floor space.
Note: these are pretty conservative estimates at every step. It doesn't take into account any practicalities, like having server aisles wide enough for golf carts (because, really, 3 miles on a side is a bit of a hike on foot). Or how power is delivered. Or how to communicate all of that parallel work. Or the colossal amount of maintenance staff that would be required. Or keeping that many servers cool enough to work continuously... etc.
After you figure all that out, now you can get to the small task of figuring out how to continuously deliver 120-ish[1] gigawatts of power to those servers.
[1] 90 Watts per 64-core CPU (2 of those per server), and 180 Watts per GPU (lifted from fryguy's sibling comment). Turns out the CPU part is a lot more expensive in just about every way than the GPU part. Accordingly, you would probably adjust this so that you spend 55 minutes on the CPU and 5 on the GPU portion, or something like that.
[Edit: added the footnote, and refined my estimate of the power requirements, also clarity edits]
I get different numbers. First, it was 6,500 CPU core-years comparable to a relatively recent Xeon that uses about 10.5 watts per core. So, 57 million cores, 600 megawatts, and about 11,000 racks (if you can fit 128 * 42 equivalent cores per rack). Wholesale electricity is about $50/MWh, so $30,000 for the CPU part.
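For concreteness, a rough sketch of that arithmetic in code (the 10.5 W/core, $50/MWh, and 128 * 42 cores-per-rack figures are the assumptions above, not measurements):

    # Back-of-the-envelope for squeezing 6,500 CPU core-years into one hour.
    CORE_YEARS = 6_500
    HOURS_PER_YEAR = 8_766            # 365.25 days
    WATTS_PER_CORE = 10.5
    USD_PER_MWH = 50
    CORES_PER_RACK = 128 * 42

    core_hours = CORE_YEARS * HOURS_PER_YEAR          # ~57 million core-hours
    cores_needed = core_hours / 1                     # all done in one hour
    power_mw = cores_needed * WATTS_PER_CORE / 1e6    # ~600 MW
    racks = cores_needed / CORES_PER_RACK             # ~11,000 racks
    cost_usd = power_mw * 1 * USD_PER_MWH             # ~$30,000 of electricity

    print(f"{cores_needed:,.0f} cores, {power_mw:,.0f} MW, "
          f"{racks:,.0f} racks, ${cost_usd:,.0f} in power")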
The numbers they give for the GPU part are more confusing, and despite the time difference, they say it was the more expensive phase of the attack. However, it appears to involve considerably less physical hardware.
Still pretty impractical for what's probably limited value, but not out of reach.
Eh, what's an order of magnitude here or there or at the beginning of your chain of huge multipliers? heh.. Yeah, no wonder the result was so enormous.
Well, if you use the Bitcoin network as a metric, there's roughly 3 billion GH/s (which is really two chained SHA-256 operations in hardware), and realtimebitcoin.info claims this is ~2000 MW. If you compare that to the 9 billion GH that the Shattered article claims are needed, it would take a network equivalent in size to the Bitcoin network ~3 seconds and ~1'600 kWh. There's no indication of how "lucky" a 9-billion-GH collision is, so perhaps it would take longer or shorter depending on the statistics.
Looking at it from the other direction, they claim 110 GPU-years. A GeForce GTX 1080 is claimed to draw 180 W; that's 175'000 kWh. If you assume that dedicated ASICs are 100x more power efficient than the card I cited, that lands in a similar order of magnitude to the Bitcoin-network estimate. To do it in an hour would take about a million graphics cards and ~200 MW.
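The same back-of-the-envelope in code (the 180 W per GPU and the Bitcoin-network figures are the assumptions above):

    # Sanity-checking the GPU-side and Bitcoin-network estimates above.
    GPU_YEARS = 110
    HOURS_PER_YEAR = 8_766
    GPU_WATTS = 180

    gpu_hours = GPU_YEARS * HOURS_PER_YEAR               # ~964,000 GPU-hours
    energy_kwh = gpu_hours * GPU_WATTS / 1_000           # ~175,000 kWh
    gpus_for_one_hour = gpu_hours / 1                    # ~1 million GPUs
    power_mw = gpus_for_one_hour * GPU_WATTS / 1e6       # ~175 MW

    # Bitcoin-network comparison: ~9e9 GH of work at ~3e9 GH/s and ~2000 MW.
    seconds = 9e9 / 3e9                                  # ~3 s
    btc_energy_kwh = 2_000_000 * seconds / 3_600         # ~1,700 kWh

    print(f"{energy_kwh:,.0f} kWh on GPUs; {gpus_for_one_hour:,.0f} GPUs / "
          f"{power_mw:,.0f} MW for one hour; Bitcoin network: ~{seconds:.0f} s, "
          f"~{btc_energy_kwh:,.0f} kWh")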
I guess the submission's title was changed to the email's subject, rather than Mr. Gutmann's one-line summary (which is how I submitted it):
"Reports of SHA-1's demise are considerably exaggerated"
OK, we've switched it back to that from “SHA1 collisions make Git vulnerable to attacks by third-parties”. We detached this subthread from https://news.ycombinator.com/item?id=13725850 and marked it off-topic.
The title is incorrect. HN does have a rule that says click-bait or clearly incorrect titles can be modified to be more representative of the actual article.
And here we have an example of this very rule kicking in, though in the wrong direction: Article comes in. <h1> doesn't match submitted title. Correct title. Oops.
I assume either a bot or a tired moderator was involved.
That's kind of rude to label a reasonable opinion as "clearly incorrect".
And the title given to the specific email by its author is definitely more representative of its comments than the title of the first email in the chain.
As far as click-bait, I rate both of them equally click-baity.
There was little reason to change the submission title.
Well, to be fair, I labeled the title "Incorrect" (though I'd agree with someone who suggested "Clearly", mostly because the content doesn't even mention Git)...