This is a great essay on why you should never use a hash function for message authentication.
Except not for the reason the author thinks.
There are several problems here.
First, with SHA-1 for example, you have 64 bytes per chunk. That means you basically get a free ride on this problem for anything < 64 bytes. A lot of "application state" fits pretty well in 64 bytes.
Secondly, unless a message ends right on the 64-byte boundary, it is not nearly that simple. You have a bit of a problem, because the hash is padded, and when you add extra characters to your original string, that padding gets replaced with those values. So, it's no longer simple to just "keep going" from where you stopped.
Still, you can see how that leaves a distinct subset of cases where you'd be exposed. SHA-1, along with most secure hash functions, appends the length of the message to the end of the source text before performing the hash function. That means that if you add even one byte to the string, you have now changed the last 8 bytes that were fed into the "original" hash function. Oh, and your extra byte goes in before those bytes, so not only did you change those 8 bytes, but you shifted them down a byte.
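To make that concrete, here's a sketch (my own illustration, not from the article) of SHA-1's padding rule, showing how one extra message byte rewrites the zero-padding and the trailing 8-byte length field:

```python
import struct

def sha1_padding(message_len):
    """Return the bytes SHA-1 appends to a message of message_len bytes:
    a 0x80 byte, zeros up to 56 mod 64, then the bit length as a
    big-endian 64-bit integer."""
    return (b"\x80"
            + b"\x00" * ((55 - message_len) % 64)
            + struct.pack(">Q", message_len * 8))

# The padded message is always a multiple of the 64-byte block size.
for n in (0, 20, 55, 56, 64):
    assert (n + len(sha1_padding(n))) % 64 == 0

print(sha1_padding(20)[-8:])  # length field encodes 160 bits
print(sha1_padding(21)[-8:])  # one more byte: now it encodes 168 bits
```

So extending a message means the original padding (including that length field) gets re-derived for the new, longer length, which is exactly why naive "keep going" extension isn't quite that simple.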
So, no, it isn't nearly that easy to crack a SHA-1 based authentication, and yet, it is easy enough that you should totally NOT use them for authentication and instead use HMAC; they are vulnerable to extension attacks, it's just not nearly as easy as this article suggests, and conclusions one might draw from this article (like you can solve this problem by feeding all source text into the hash algorithm backwards) are likely ill-founded.
It just turns out that cryptography is way more complicated, and even in terms of understanding weaknesses that arise from doing things wrong, you are going to get it wrong. Trust the experts, when they say it is a bad idea, but don't assume why it is a bad idea can be explained in a short blog article like this.
UPDATED: Added an explanation as to why it might be dangerous to just take this article at its word.
In practice, attackers just guess the bit length of the message by writing a Ruby script that generates 100 candidate messages at different expected bit lengths and feeding them to their fuzzer. The bit length issue is not a real one in practice.
In practice, length-extendable SHA-1 MAC functions are mechanically exploitable; they'll be discovered quickly by competent attackers who don't even have access to your source code, because attempts to break them require very little code and even less time.
It is a very, very bad idea to put off fixing a problem like this because you've been led to believe by someone on a message board that MD padding is going to be an obstacle to attackers. It isn't. The Flickr MAC break required MD padding, too; Thai and Juliano broke it just the same.
There seems to be some kind of impedance mismatch here. I'm not suggesting you can avoid using an HMAC and get away with just a hash.
Let's be clear about what an HMAC is. It's pretty much what this article describes, but executed with a specific protocol that avoids exploitable weaknesses in naive approaches.
It's that similar. The biggest non-security consequence is that you have to call your hash function twice. If that is a problem, you have bigger problems. If you are using a secure hash to authenticate a message, you should always use HMAC.
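For the curious, those two hash calls look like this: a from-scratch sketch of HMAC-SHA1 (the RFC 2104 construction), checked against Python's hmac module:

```python
import hashlib
import hmac

BLOCK = 64  # SHA-1 block size in bytes

def hmac_sha1(key, msg):
    """HMAC(K, m) = H((K xor opad) || H((K xor ipad) || m)) -- two hash calls."""
    if len(key) > BLOCK:
        key = hashlib.sha1(key).digest()     # long keys are hashed first
    key = key.ljust(BLOCK, b"\x00")          # then padded to a full block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha1(ipad + msg).digest()   # first hash call
    return hashlib.sha1(opad + inner).digest()  # second hash call

key, msg = b"secret key", b"some application state"
assert hmac_sha1(key, msg) == hmac.new(key, msg, hashlib.sha1).digest()
```

Because the outer call hashes the inner digest under a key the attacker doesn't know, appending data to the message gets you nothing, which is what defeats the extension attack.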
> In practice, attackers just guess the bit length of the message by writing a Ruby script that generates 100 candidate messages at different expected bit lengths and feeding them to their fuzzer. The bit length issue is not a real one in practice.
You're closer to addressing the real complexity involved in cracking it than the original article. I agree that in principle it isn't hard to break a naive hash-based digest authentication. I'm merely pointing out that it isn't simple.
Also... not sure why they'd have to use Ruby or why that is relevant...
> It is a very, very bad idea to put off fixing a problem like this because you've been led to believe by someone on a message board that MD padding is going to be an obstacle to attackers.
No. It's a very, very bad idea to ever go down this road in the first place. Start with a MAC. If you think you can be more clever than the collective expertise of the field, study cryptography and prove it, but until you've finished studying it, do what they say. History is littered with people who thought they knew just enough about cryptography that they could do things their own way without consequences (WEP anyone?).
In general I'm all for encouraging amateurs to compete with the pros, but cryptography is definitely one of those fields where it can appear deceptively simple to achieve your goals, and it totally, totally, totally is not.
It's fine that you want to tell the thread what HMAC was, or point out that the exploit for this problem isn't literally 100 words of blog post long.
But the reality is that this is an exploit that everyone on our team, and I assume all our real competitors' teams, is expected to be able to bust out on demand. It's not hard. It comes up a couple times a year in web apps, we spot it, and we exploit it. It's really not one of the more interesting or involved crypto flaws to exploit.
You wrote:
Still, you can see how that leaves a distinct subset of cases where you'd be exposed. SHA-1, along with most secure hash functions, appends the length of the message to the end of the source text before performing the hash function. That means that if you add even one byte to the string, you have now changed the last 8 bytes that were fed into the "original" hash function. Oh, and your extra byte goes in before those bytes, so not only did you change those 8 bytes, but you shifted them down a byte.
Well, respectfully, no shit. Was it your understanding that the Flickr team screwed their MAC up so badly that you didn't even have to guess glue padding to pull the attack off? You wrote this paragraph without context; you even followed it with a sentence about how you might be safer if you could fit your application state in a SHA1 message block because I forget why that could possibly matter?
Next time, when you want to point out how deceptive the apparent simplicity of a crypto concept is, choose one of the defender's problems. The attacker's problems are mostly deceptive in the opposite direction.
I think I was pretty clear that extension attacks work. I was clear that they work despite features of a hash algorithm that would appear to address the issues highlighted in the article. I was also very clear that it is the defender's problems that are deceptively hard.
In a couple of cases I probably should have said "simple" instead of "easy". "Easy" implies more about effort than complexity. At some points I meant "little effort" and at others I meant "little complexity", but I used the same word for both, so that's bad.
Still, I don't get how you inferred the above from this, which to me reads as "any efforts you might make to address this problem your own way will almost certainly fail miserably":
It's just not nearly as easy as this article suggests, and conclusions one might draw from this article (like you can solve this problem by feeding all source text into the hash algorithm backwards) are likely ill-founded.
"Never use a hash function for message authentication" is such a simplistic view. The author takes a common hash function design (Merkle-Damgard), and somehow extrapolates that hash functions should be simply ruled out for authentication.
First, this may send the wrong message to the less-focused reader: "what, I should use block ciphers instead?". Luckily, HMAC is eventually brought up, which is a fine solution.
HMAC requires 2 calls to H, our favorite hash function. Certain applications may find the overhead to be prohibitively high. With non-broken hash functions (e.g., any of the SHA-3 finalists), we can use the so-called envelope authenticator: A = H(K||M||K), with some padding to separate K from M to keep security proofs happy. This is significantly faster for short messages, and short is the most common size out there.
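As a sketch only (SHA-256 stands in for H since no SHA-3 finalist is in the standard library, and the exact padding demanded by the security proofs is glossed over here):

```python
import hashlib

BLOCK = 64  # block size of the underlying hash, in bytes

def envelope_mac(key, msg):
    """Envelope authenticator A = H(K || pad || M || K).
    Padding the leading key to a full block separates K from M; the
    precise padding required by the security proofs is simplified here."""
    padded_key = key.ljust(BLOCK, b"\x00")
    return hashlib.sha256(padded_key + msg + key).digest()

tag = envelope_mac(b"0123456789abcdef", b"message")
```

One call to the compression loop over roughly one extra block, versus HMAC's two full hash invocations; hence the speedup on short messages.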
> Certain applications may find the overhead to be prohibitively high. With non-broken hash functions (e.g., any of the SHA-3 finalists), we can use the so-called envelope authenticator: A = H(K||M||K), with some padding to separate K from M to keep security proofs happy.
I think it is safe to say that you are suggesting a very narrow context. If invoking a hash twice is prohibitive and you are looking at choosing amongst SHA-3 finalists as an alternative (particularly given that many of them are slower than SHA-2, and I believe they are all slower than SHA-1 on low-cost CPUs where you might expect this constraint), you are already talking about a very narrow performance window (literally one iteration of Moore's Law covers the gap). You then throw in that people are making a refined selection of hash algorithms, taking into account various security nuances, and I think you are dealing not only with a narrow context, but with a degree of cryptographic expertise where you don't have to worry that anyone solving it would be consulting this blog or Hacker News in general. ;-)
> First, this may send the wrong message to the less-focused reader: "what, I should use block ciphers instead?". Luckily, HMAC is eventually brought up, which is a fine solution.
If the reader can't be bothered to read the article to the end, I hardly think it reflects on the author. Whilst it might indeed be a more concise article if it just said "don't use a hash function for message authentication, use HMAC", it would still miss the important final point about timing attacks, not to mention the journey of explanation about why you shouldn't just use a hash function.
Your point about padding and length bytes is spot-on. I actually left this out of the explanation, although it's interesting, because I felt it a useless diversion: something that's safe except in an easily describable subset of cases is still not fit for purpose when specially designed tools without these problems exist.
Point being, although hash functions might work most of the time, their general construction does not make them safe for this purpose. You can't say something is 'mostly secure' because it works 'most of the time'. The times it doesn't work will affect someone, and for them the system is not secure at all.
Plus, implementations with weird edge cases make for horrible debugging, especially when it comes to security.
> Your point about padding and length bytes is spot-on. I actually left this out of the explanation, although it's interesting, because I felt it a useless diversion: something that's safe except in an easily describable subset of cases is still not fit for purpose when specially designed tools without these problems exist.
I'm with you except for: "a useless diversion".
From the description in your article, I might think that if I have really short messages, I'm okay. I might think that if I reversed my strings, so any additional state is prepended instead of appended, I'd be okay. I might think that if I truncated my hashes (say, used the first 256 bits of SHA-512), I'd be okay. I'd be really, really wrong.
> Point being, although hash functions might work most of the time, their general construction does not make them safe for this purpose. You can't say something is 'mostly secure' because it works 'most of the time'. The times it doesn't work will affect someone, and for them the system is not secure at all.
That's the wrong point though. It's not that hash functions "might work most of the time". Uncompromised secure hash functions work all of the time for their intended purpose. For this other purpose, they work exactly none of the time.
The aspects I described that might make you think SHA-1 actually has this covered except for a few corner cases are trivial challenges for an attacker to overcome, as are all of the above "remedies" I mentioned (all of which resemble idiot moves that have in fact been made by naive developers who thought they could be clever and save themselves... I don't know... maybe some CPU time?).
I do think the post was some great writing that illustrated an idea quite clearly. I think it could be quite informative for a lot of people. I just think what you wrote would benefit from being framed appropriately. I'd suggest a preface... something like, "HMACs were invented to address the fact that you can't just use a hash for authentication. Somehow people miss this point, possibly because HMACs seem like a trivial layer on top of a secure hash. In practice, there are a surprising variety of issues. To give you an idea of the kinds of problems you might run into, and hopefully a sense that you can't just naively tweak your use of a hash to address them, I'm going to simplify a common problem with using secure hashes for authentication that has probably never occurred to you. I assure you that this problem is actually more complex, and there are many more problems."
I e-mailed Visa about something similar with their upcoming V.me service. They suggest that you use MD5 to generate the tag, which is known to be weaker than SHA-1. I was a little surprised that a company like Visa would mess up on crypto and not know to use HMAC instead of just a simple hash. I never heard a response from them either.
Here's what their documentation says:
    Language | Standard Syntax for Generating MD5 Hash
    ---------+----------------------------------------
    Java     | import org.apache.commons.codec.digest.*;
             | hash = DigestUtils.md5Hex(string1+string2+string3...);
    PHP      | $hash = md5($string1.$string2.$string3...);
    Ruby     | require 'digest/md5'
             | hash = Digest::MD5.hexdigest(string1+string2+string3...)
    Python   | import md5
             | hash = md5.new(string1+string2+string3...)
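For contrast, a keyed alternative in the same spirit (a sketch; the key, the separator, and the use of SHA-256 are my choices, not anything Visa documents):

```python
import hashlib
import hmac

SECRET_KEY = b"shared-secret-provisioned-out-of-band"  # hypothetical key

def make_tag(*fields):
    # Join fields with an unambiguous separator before MACing, rather
    # than md5(string1+string2+string3...) with no key at all: bare
    # concatenation can't tell ("a","bc") apart from ("ab","c").
    msg = b"\x00".join(f.encode() for f in fields)
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

tag = make_tag("string1", "string2", "string3")
```

Note the separator fixes a second bug lurking in those snippets: without it, differently-split field lists can concatenate to the same string and thus the same tag.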
> Finally, you should make sure your application does not exit early if the tag is invalid. You should do all the data processing you would normally do, just short of modifying the database, and check the tag last. If you return early you risk another timing attack.
What kind of timing attack is that? In order for there to be a timing attack, there has to be a difference in the timings.
1. You can either process the data, check the authentication code, then commit.
2. Or you can check the authentication code, process the data, then commit.
I don't see any attacks on #2 that couldn't also work on #1.
The timing attack is if you check the tag, that fails, and then you don't do any further request processing. This shortens the request time. It depends quite a lot on what you're actually doing with the message, but in general you want to leak as little info as possible about what's happening during any crypto-related process.
If your authentication and processing steps are distinct and independent, then you're not doing any good by not returning early (immediately after your authentication process). The only thing the attacker can learn from the timing of the response is whether authentication was successful or not, which is something any useful API should convey anyhow.
The only good thing that "authenticating last" does is that it prevents the attacker from issuing lots of requests sequentially, thus brute-forcing your authentication, but this should be solved in another way, without slowing down legitimate users and overloading your servers.
But it only tells them that it failed, not that they got closer, like with the string-comparison based one. What does that gain an attacker if you already provide other feedback about the request being invalid? And not doing so would result in a bad user experience if you have users run into a bug somewhere and it fails silently so they don't know that nothing was actually done.
In some cases, not letting them know they've failed is nice. But the most common case to look out for is, if your auth/crypto process involves multiple steps, don't return early from it, or do anything that alters its runtime significantly. This leaks useful information in many situations. The course at http://crypto-class.org is better than me at explaining this stuff, and starts on Monday.
The leak that you're ostensibly timing is that in order to figure out how much of the candidate MAC string was valid, the target had to compare more bytes, which takes more time, which adds observable lag to the error response.
Success: check authentication, process data, commit, return 2xx status.
Failure: check authentication, return 4xx status.
The reason that you don't get any information from the difference in timing between the two branches is because you already get that information from the status code.
Now on the other hand, if you use naive string comparison, different failure branches will take different amounts of time. That is a security hole, and it's not what we're talking about here.
I've heard this said many times before and I agree. However, using crypto libraries does not solve the problem of vulnerabilities through cryptography misuse. For example, keys must be stored correctly, algorithms often need initialising in the correct mode for your specific application, IVs must not be repeated, and, as shown in this article, hashes should be used in specific ways to work correctly.
There are many ways to fail with cryptography and avoiding them all takes considerable expertise. Using crypto libraries does not solve this problem.
This is the entire point of high-level crypto libraries, like Gutmann's cryptlib and Google's Keyczar. So, yeah, don't use OpenSSL or javax::crypto or whatever .NET calls it; but, do consider using something like Keyczar, or, better yet, just use PGP/GPG to store data at rest, TLS for data in motion, and be done with it.
Those still require key management. There is no way a developer can abdicate all responsibility for this stuff, no matter how high level (at least, not until we have good, common, trusted security as a service).
High-level libraries are definitely the way to go for a variety of reasons, but they don't replace understanding. Developers still need to understand what exactly is and is not guaranteed by the cryptography involved, because that's the stuff protocols are built on. The root article is a great illustration of this.
This is classic developerthink, and it's a good thing, but it doesn't serve you very well with crypto. The problem with crypto is that a partial understanding of the problems is actually worse than no understanding. You can be worse off learning crypto material than you were before you learned it.
If you're interested in picking up crypto knowledge, my advice is to do so in the context of breaking systems, not building them. I spend a lot of time doing crypto stuff, and I don't feel qualified to build them. But needing to figure out how to break all the random systems that end up on my desk has taught me a lot about crypto.
Agreed -- you can take an intro course in crypto and still find many ways of getting it wrong. My intro crypto professor reiterated this point almost every class and noted that even seemingly small changes to a protocol can render it vulnerable to attack.
The title is sort of linkbait; what it should really say is "Never use hash functions vulnerable to extension attacks"... (And most common ones are.) With that said, this stuff is pretty cool, and after reading that the author learned all this in the Coursera Cryptography class I decided to sign up for it. (Starts June 11th.)
Although really, the maxim in the title is true. You should never use hash functions for authentication, you should use authentication codes.
As an analogy, you should never use a hammer to put in screws. That's not linkbait just because you have a tool that's a hammer on one end and a screwdriver on the other end.
I think the title is fairly accurate if you read it as "never use a hash when you need a MAC". MACs have nice provable security properties that you get for free, so you don't get surprises like extension attacks.
People that build crypto after reading Applied Cryptography are doing a fine job paying for my kids' college education, so I agree with you and encourage everyone to do likewise.
If you don't happen to like my kids, well, first, screw you, and secondly: buy _Practical Cryptography_ or _Cryptography Engineering_ (really the same book) and burn your copy of "Applied".
There's nothing wrong with Applied Cryptography so long as you _understand_ it. If you blindly apply outdated algorithms, yes, you lose. Everyone should read both Applied Cryptography _and_ the other books you mentioned, and keep up on the literature besides.
No, there's a lot wrong with _Applied Cryptography_, and those things have very little to do with the fact that AC writes about IDEA and not AES.
If you read _Practical Cryptography_, you don't need to read _Applied Cryptography_. AC is a book full of trivia, and of encyclopedia-style descriptions of random block ciphers with minimal attention given to the actual real-world attacks on implementations of those ciphers.
I strongly advise that you not waste time reading AC. If you're lucky, you can read it and just lose time; if you're unlucky --- and a lot of my clients have been --- you can find yourself having learned stuff you'll later need to unlearn.
>> This fact means that an attacker can determine the first correct character of the tag by submitting requests to a signed URL with a different first character in the tag each time, and stopping when the request takes a little longer than usual. After guessing the first character they can move onto the second, and so on until they’ve guessed the whole correct tag.
Wouldn't this attack be eliminated by using iptables rate limiting to reduce the attack window of opportunity?
Wouldn't this attack be better eliminated by fixing the timing leak that is potentially allowing people to guess valid MACs on packets?
The reality is that you probably can dick around with things in your deployment and your app to make timing attacks prohibitively expensive/annoying; if you understand that you're not eliminating the timing leak, but rather masking it, you can take advantage of the additional measurements required to unmask the leak to give your MAC enough of a buffer to last for its whole useful lifetime.
But when you do this, you're really playing on the razor's edge of what we currently know about side channel attacks on crypto, and you're probably going to end up putting more effort into your workaround than you would in just fixing the underlying bug.
The idea of trying to detect people employing timing attacks on your cryptography and block them individually by IP address is so obviously retarded that the comment I'm replying to is indistinguishable from trolling.
I'm not an IT guy, so no, I wasn't trolling. Why exactly is it "retarded" to build your system to reject (or at least flag) access patterns that are unlikely to be due to legitimate activity?
I'd recommend using the word "retarded" with a bit more circumspection. Obviously the incoming IP address doesn't uniquely identify a client who's likely to be on the other side of a NAT gateway. But the idea that a system should just sit there silently and carry on business as usual while any one address or class-C block generates large numbers of failed access attempts seems like a good application of the word in question.
You're not an IT guy, but you are a programmer, and you know that leaving a vulnerability in your code, hoping the devops team catches attempts to exploit it, is a fucking retarded idea. I think you're just trolling.
Who said anything about leaving a vulnerability in the code? If your security model depends on a suboptimal implementation of strcmp(), you have bigger problems than timing attacks.
I'm talking about iptables rate limiting (on Linux; I assume other OSes' firewalls can implement this). Fixing bugs in isolated code is part of the scope we face, but preventing the business and its customers from suffering due to a bug is also part of what we get paid for. If you are still in school or work in the scientific area then perhaps you have not come across rate limiting?
Fixing the leak today is admirable, but challenging, as the many comments have shown. The problem is that at any time regressions can and will happen. There was a problem on Ubuntu with SSH a couple of years ago: SSH had been fine, then someone made a change (I don't recall the details) that went unnoticed for, I think, 2 or 3 years. That change made SSH vulnerable to, I believe, timing attacks, and it could have been easily prevented. This was in SSH. SSH.
No, the change you're thinking of is when Debian's maintainers managed to comment out the "secure" part of its cryptographically secure random number generator, thus ensuring that SSH would only generate keys from a trivially small range of possible values. That change had nothing to do with the fundamental difficulty of generating random numbers.
The fix for not leaking timing from your comparison is trivial. Either double-hash, or use a timing-independent comparison function like the accumulator XOR function upthread.
It does nobody any favors to spread drama and FUD over what is in fact a simple and easily fixable problem.
Possibly, although it depends on the lifetime of the link and the size of the tag (e.g. SHA-1 is 40 chars, each case requires mean of 8 guesses, so whole thing only requires 320 requests).
Better to have resistance as baked into your crypto as you can, rather than relying on a firewall further up the stack. If the data itself is resilient, no implementation will be able to defeat it efficiently.
For the string comparison, could you really use that in a timing attack? Wouldn't the difference from the comparison taking one char longer be measured in nanoseconds, while the overall network lag would be milliseconds?
I don't see string comparisons on tiny strings as feasible from either the Crosby/Wallach paper (which suggests you need hundreds of nanoseconds of difference in execution time to be detectable) or the Brumley/Boneh paper (which talks about a specific vulnerability in modular exponentiation that ends up revealing key details, not string comparison).
This is just a guess on how you would do it... but it would involve making many, many requests until you could establish a mean at a given confidence level. Then you could compare that mean for each starting character (once again, many, many requests per character). If the mean for any of them (or heck, even the distribution of the timings) differs from the others, then you know you hit the right mark.
This assumes that 1: you can reliably measure at that small of a time difference and 2: you can submit enough requests without being detected.
Yes, I don't see that being detectable for things like web-based APIs. For these string lengths (e.g., HMAC with SHA-256), the network lag would add sufficient randomness that you would not be able to measure any difference in timings in the string comparison.
It turns out that it is possible, over a WAN, on some frameworks. The rule of thumb is, if the comparison is effectively being done by libc's memcmp(), the timing attack is very difficult to do even across a single switch; however, some platforms don't drop to memcmp and instead compare each byte explicitly. These are timeable.
If you're wondering, "how do I detect nanosecond differences over a network when my measurement will be swamped by other things happening on the target, the network, and my host", the answers boil down to:
* You're going to move your measurement code as close to the drivers as possible, and fix interrupt handling so that interrupts don't confound your measurements.
* You're going to get yourself on the same hosting provider as your target; for instance, a good chunk of all target apps can be attacked via Amazon EC2 for not very much money.
* You're going to take lots and lots and lots and lots of measurements and then use high school statistics to process the results.
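A toy simulation (synthetic numbers, not a real measurement) of that last point -- averaging pulls a small timing difference out of jitter several times larger:

```python
import random
import statistics

random.seed(0)  # deterministic demo

def sample(extra_ns, jitter_ns=500, n=50_000):
    """Simulate n response-time measurements: Gaussian jitter around a
    1000 ns baseline, plus a small per-request leak of extra_ns."""
    return [random.gauss(1000 + extra_ns, jitter_ns) for _ in range(n)]

baseline = statistics.mean(sample(0))
leaky = statistics.mean(sample(100))  # 100 ns leak, 5x smaller than jitter

print(round(leaky - baseline))  # close to 100: the leak resurfaces
```

The standard error of each mean shrinks with the square root of the sample count, so the attacker's only real cost is request volume: exactly the "lots and lots of measurements plus high school statistics" recipe above.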
> The easiest way to defeat this attack is, instead of directly comparing two strings, compare their mappings under a collision-resistant hash function.
Is this really the best way to compare strings without giving away timing info?
No. It's A way to do it, but not the fastest or the simplest. An easier, faster way to do it is what Rails does (after Nate Lawson via Coda Hale told them how): accumulate the XORs of each offset of both strings and then verify that they add up to zero.
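In code, that accumulator looks something like this (a Python sketch; the stdlib's hmac.compare_digest does the same job in C):

```python
def constant_time_eq(a: bytes, b: bytes) -> bool:
    """Compare two byte strings without an early exit: XOR each pair of
    bytes and OR the results into an accumulator, so every byte is
    always examined regardless of where the first mismatch is."""
    if len(a) != len(b):
        return False
    acc = 0
    for x, y in zip(a, b):
        acc |= x ^ y
    return acc == 0

assert constant_time_eq(b"deadbeef", b"deadbeef")
assert not constant_time_eq(b"deadbeef", b"deadbeee")
```

The running time depends only on the length of the inputs, not on how many leading bytes happen to match.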
> accumulate the XORs of each offset of both strings and then verify that they add up to zero
If you do that in a language with an optimizing compiler or JIT runtime you'd better look at the generated code and/or do your own timing measurements. A sufficiently-smart optimizer could recognize that the first difference encountered allows for early loop termination.
What's wrong with simply doing a traditional compare, but not exiting early when it's found the two strings don't match?
I see how the xor method works, but not why it's superior.
Also, it would seem that you'd be leaking information about the size of the other string with either method (whether you exit early or not when you run past the length of one of the strings) - not a problem when comparing hashes presumably, but is it never a concern?
I don't think you'd be leaking that information if you hashed both strings and compared the hash, because it wouldn't stop getting faster when the shorter string gets shorter.
Compiler optimizations, CPU pipeline and branch prediction would be my guess, although I'm sure someone else can jump in with more detail. XOR is basically immune to optimized tricks.
Not the best, but most string comparison methods are designed to fail as soon as possible, which is not appropriate for this use case. Compare the entire input, or use some misdirection to destroy the relationship between correctness and time.
This is just one of the easier ways. It's effective and easy to code.
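The hash-then-compare trick, sketched (my own minimal version): compare digests instead of the tags themselves, so even a naive, early-exiting == leaks nothing useful:

```python
import hashlib

def double_hash_eq(a: bytes, b: bytes) -> bool:
    """Any early exit in == now reveals where the *digests* first
    differ, which tells an attacker nothing about which byte of the
    underlying tag was wrong."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()

assert double_hash_eq(b"expected-tag", b"expected-tag")
assert not double_hash_eq(b"expected-tag", b"guessed-tag!")
```

The hashing destroys the byte-position-to-timing relationship, and because both digests have the same fixed length, the comparison also stops leaking the length of the shorter input.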
Stupid question: Whenever I've used GPG to sign an email, it includes a line saying "Hash: SHA1". Does this imply PGP-signed messages are vulnerable to this, or does PGP/GPG do something different/smarter?
"This sequence is then folded using a compression function h(). The details of the h() depend on the hashing function, but the only thing that concerns us here is that the compression function takes two message blocks and returns another block of the same size."
So if the chunks in a SHA-1 hash are 512 bits each then surely the output of the hash function would be 512 bits rather than the 160 bit digest?
Edit: the IV is 160 bits, so "another block of the same size" means each derivative block is the size of the IV, not of the actual data.
You guess the length of the last block (or iterate over possible lengths with trials of the attack). When you know it, you can easily predict the MD padding at the end of the hash to fill in the block. That fake padding (our code calls it "glue padding") actually ends up in the forged message; for instance, if you're signing URLs, you'll see it as gibberish in the middle of the URL. In practice, most code does not care about the gibberish "glue" bytes.
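To see the glue padding end up inside the forged message, here's a self-contained sketch: a minimal pure-Python SHA-1 whose internal state can be seeded from an existing digest, used to extend a secret-prefix MAC without knowing the secret (the secret, message, and guessed length are all made up for the demo):

```python
import hashlib
import struct

def _rol(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def sha1(data, state=None, total_len=None):
    """Minimal SHA-1. Pass state/total_len to resume from a known
    digest, which is exactly what a length-extension attacker does."""
    h = list(state) if state else [0x67452301, 0xEFCDAB89,
                                   0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    if total_len is None:
        total_len = len(data)
    # Merkle-Damgard padding, based on the *total* message length.
    data += (b"\x80" + b"\x00" * ((55 - total_len) % 64)
             + struct.pack(">Q", total_len * 8))
    for i in range(0, len(data), 64):
        w = list(struct.unpack(">16I", data[i:i + 64]))
        for t in range(16, 80):
            w.append(_rol(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
        a, b, c, d, e = h
        for t in range(80):
            if t < 20:   f, k = (b & c) | (~b & d), 0x5A827999
            elif t < 40: f, k = b ^ c ^ d, 0x6ED9EBA1
            elif t < 60: f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:        f, k = b ^ c ^ d, 0xCA62C1D6
            a, b, c, d, e = ((_rol(a, 5) + f + e + k + w[t]) & 0xFFFFFFFF,
                             a, _rol(b, 30), c, d)
        h = [(x + y) & 0xFFFFFFFF for x, y in zip(h, (a, b, c, d, e))]
    return struct.pack(">5I", *h)

# --- the attack, against a toy secret-prefix MAC ---
secret = b"16-byte-secret!!"            # unknown to the attacker
msg = b"user=bob"
mac = sha1(secret + msg)                # the tag the attacker observed

guessed_len = len(secret) + len(msg)    # attacker guesses (or iterates) this
glue = (b"\x80" + b"\x00" * ((55 - guessed_len) % 64)
        + struct.pack(">Q", guessed_len * 8))
ext = b"&admin=1"

forged_msg = msg + glue + ext           # the glue gibberish is visible here
forged_mac = sha1(ext, state=struct.unpack(">5I", mac),
                  total_len=guessed_len + len(glue) + len(ext))

# The server, which knows the secret, accepts the forgery:
assert forged_mac == hashlib.sha1(secret + forged_msg).digest()
```

Note the attacker never touches `secret`: the observed tag is the full internal state of the hash after the first block, so extending is just resuming the compression loop with the right total length.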
Is the timing attack hardening suggested in the blog post a standard approach?
If I was trying to attack a system and knew loosely that they did what he suggested (hashing then comparing vs. comparing with timing exposed), my untrained instinct would be that this is the weakest part. In other words, I think this just makes the timing attack a little more difficult, but still possible, by producing specific hashes that carry out the timing attack.
When I've needed to harden comparisons against timing attacks, I've always just used constant time comparison functions, such as these ->
Also, HMAC is just one MAC (message authentication code). OMAC is another good one; it has the interesting property of being built on top of a block cipher instead of a hash function, which can reduce the number of "moving parts" in a system if you're using a block cipher of some sort anyway.
I was wondering about that timing attack. Is that really possible? How many requests would you have to make until you can get reliable statistics over the timing of a string comparison, when you have network delays, other requests and all kinds of stuff that influence timing?
It's very difficult, but possible. It's a plausible enough threat, especially if you're cloud hosted now or ever might be, that you should take steps to avoid it.
Jitter and confounding are problems that can be addressed simply by repeated measurements.
The rule-of-thumb from Crosby & Wallach's paper on remote timing attacks is, assume tens-to-hundreds of nanoseconds precision if you can colocate the attacker at the same provider, and tens of microseconds if you have to do it over the Internet.
The high-level answer is that construction is vulnerable to practical attacks based on hash function weaknesses, but that it's significantly harder to attack than secret-prefix MAC functions. But once you understand how secret-prefix MACs are broken, you quickly see the logic behind HMAC.
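To make that logic concrete: HMAC (RFC 2104) nests two hash calls, HMAC(K, m) = H((K' ^ opad) | H((K' ^ ipad) | m)), where K' is the key zero-padded to the hash's block size. The outer hash over a secret-derived key is what defeats length extension: an attacker who extends the inner hash still can't produce the outer one. A short sketch with Python's stdlib (illustration key and message):

```python
import hashlib
import hmac

key = b"secret-key"   # hypothetical key, shorter than the block size
msg = b"user=bob"

# Standard library HMAC-SHA-256:
tag = hmac.new(key, msg, hashlib.sha256).hexdigest()

# The same construction spelled out. SHA-256's block size is 64 bytes;
# keys no longer than that are zero-padded (longer keys are hashed first).
block = 64
k = key.ljust(block, b"\x00")
inner = hashlib.sha256(bytes(b ^ 0x36 for b in k) + msg).digest()
outer = hashlib.sha256(bytes(b ^ 0x5C for b in k) + inner).hexdigest()
assert outer == tag  # the nested outer hash is what blocks extension attacks
```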
For example, if you already happen to know a collision hash(m1)=hash(m2), where m1 and m2 have full block size, then you also get a collision hash(m1|key)=hash(m2|key), just as explained in the article. So, one could forge messages, which should clearly be counted as a weakness, even if it assumes knowledge of a collision.
Well, if you have an internal collision hash(m1)=hash(m2) and both messages m1 and m2 are of the same size, then it seems that one would also get hash(m1|key|size) = hash(m2|key|size). So, I cannot really see how appending the size will help.
(All subject to optimistic assumptions about block sizes, etc.)
Please substantiate. An attacker knowing an internal collision of the hash algorithm for m1 and m2 (of the same size...) can construct HMAC(m2,key) from HMAC(m1,key) without knowing the key?
Relying on the way you happen to combine data, instead of using a function that's designed for authentication and has baked-in a safe way to combine the inputs, is a bad idea. "What if $EDGE_CASE_OF_INAPPROPRIATE_CRYPTO_FUNCTION" is never a good question to ask. Just use the right tools in the first place.
The first reason is because the questioner is looking for an excuse to use something different. In this case, your answer is the right answer.
The second reason is because the questioner wants to learn more about how cryptography works, for educational value. In this case, your answer is not helpful.
This seems to me like a variant of "do not trust the client." Good info though. I have learned a lot more about how hash algorithms work. I do wonder though if fixed-param hashes are relatively safe due to the inability to add suffixes.
"Never trust the client" is, like "always validate input", one of those timeless bits of security strategy that is worth pretty much nothing in the real world. It's about as useful as "pretty much you should always make sure you're secure, pretty much."
I disagree with you here. One should always be sure input is validated somewhere before anything important is done with it. And not trusting the client is one element of what I call a "push security back" strategy: basically, don't do any security enforcement in your application that you can't make a component further back do just as well. The reasoning here is that components like operating systems and web servers will always have more review and more eyes than your web app, so if they can work without trusting your app, then all the better. Work from least privilege, and design so that components further back can handle things like authentication and authorization, to the extent this is practical. This creates a narrower security perimeter and greater depth in defence. Of course, as in all things, security is a matter of perpetual tradeoffs.
Except not for the reason the author thinks.
There are several problems here.
First, with SHA-1 for example, you have 64 bytes per chunk. That means you basically get a free ride on this problem for anything < 64 bytes. A lot of "application state" fits pretty well in 64 bytes.
Secondly, unless a message ends right on the 64 byte boundary, it is not nearly that simple. You have a bit of a problem, because the hash is padded, and when you add extra characters to your original string, that padding gets replaced with those values. So, it's no longer simple to just "keep going" from where you stopped.
Still, you can see how that leaves a distinct subset of cases where you'd be exposed. SHA-1, along with most secure hash functions, appends the length of the message to the end of the source text before performing the hash function. That means that if you add even one byte to the string, you have now changed the last 8 bytes that were fed into the "original" hash function. Oh, and your extra byte goes in before those bytes, so not only did you change those 8 bytes, you also shifted them down a byte.
So, no, it isn't nearly that easy to crack SHA-1-based authentication, and yet it is easy enough that you should totally NOT use plain hashes for authentication and should use HMAC instead. They are vulnerable to extension attacks; it's just not nearly as easy as this article suggests, and conclusions one might draw from this article (like the idea that you can solve this problem by feeding all source text into the hash algorithm backwards) are likely ill-founded.
It just turns out that cryptography is way more complicated than this, and even in terms of understanding the weaknesses that arise from doing things wrong, you are going to get it wrong. Trust the experts when they say something is a bad idea, but don't assume the reason it is a bad idea can be fully explained in a short blog article like this.
UPDATED: Added an explanation as to why it might be dangerous to just take this article at its word.