Getting confused between hash functions and password-based key derivation functions, on the other hand, is absolutely inexcusable; that very confusion is why there are so many people making the mistake of using SHA256(salt . password) as a key derivation function. People hear that MD5 is broken but SHA256 isn't; not realizing that MD5's breakage as a hash function does not impact its security as a PBKDF, and not realizing that MD5 and SHA256, being not designed as PBKDFs are both entirely inappropriate for use in that context.
Also, for reference: MD5(12 printable ASCII characters) ~ bcrypt(8 printable ASCII characters) ~ scrypt(6 printable ASCII characters) in terms of cost-to-crack on sophisticated (ASIC/FPGA/GPU) hardware.
The problem here isn't that cryptographers are inflicting a terrible name at you, it's that security people have repurposed a function intended for one purpose (generating crypto keys) for another (storing passwords).
Also: don't blame us, blame the PKCS standards group.
The MD5 vulnerabilities that this post talked about don't change the security of PBKDF-MD5 (or any iterated MD5 password hash), because password hashes aren't used to authenticate data. Similarly (this is a little mindbending): HMAC-MD5 has no known viable attacks, so many crypto protocols that use MD5 don't have viable attacks.
Because they all seem like acronyms, people get MD5 and SHA256 (which are core hash functions, "primitives" in a cryptosystem) with PBKDF, which is a standard for turning passwords into crypto keys.
(I know you know both, I'm just trying to restate).
They don't all work this way but this is one of those disagreements where you're more likely to be right than me.
The best reference for this analysis is the scrypt paper: http://www.tarsnap.com/scrypt/scrypt.pdf
For something designed to defeat specialized hardware, see scrypt.
 Just noticed who I was replying to. Now that's funny.
That's why I recommend PBKDF2 with the PRF set to the authentication tag from FEAL4 in GCM mode. Nobody's got that in a GPU! Arrrrr!
> "Hashes are designed to be tamper-proof". Wrong. Cryptographically secure hash functions are.
> "Hashes, when used for security, need to be slow." Wrong. Password hashes needs this; not SHA1/MD5 etc.
I agree there are certainly other uses for hashes, just trying to distinguish between hashes and checksums.
Re-reading what I wrote, I open with "Hashes are a bit like fingerprints for data. A given hash uniquely represents a file, or any arbitrary collection of data" so the context of this article is hash functions that are able to uniquely identify something in a reliable, trustworthy way -- either checksums (fast, no need for security) or hashing (slower, more reliable, possibly "secure" for some definition of secure), but always in the context of "can I trust this value to tell me that the data is really what I think it is?"
Of course there is a tradeoff with speed; a person's name can be good enough identifier (checksum) in some circumstances, but maybe other circumstances require more reliability like DNA or fingerprints (secure-ish hash), at the cost of being far slower to collect and validate and way more onerous.
A hash (function) is any function that maps a big variable-length data structure to a smaller (fixed-length) data structure.
A checksum is a special hash (function) which has the purpose of detecting accidental errors during transmission or storage. (This makes them different from hash functions used for example in hash tables. For example relevant question: minimum how many badly transmitted bits are needed to break the error detection in the case of a specific checksum?)
A cryptographic hash function is another kind of special hash function which also has a very good definition on wikipedia.
(Basically these are the definitions on Wikipedia, and I think they are fine. If the terminology you use would be canonical we would use the name 'checksum table' instead of 'hash table', and it would be quite illogical also, because the term 'checksum' really includes purpose in its name 'checking something').
Why do they need to be fast? They're used everywhere. Opening up notepad.exe, you'll verify a handful of digital signatures, each of which will compute a hash of the binary as first order of business. HMAC is used in virtually every secure network protocol worth using --- it uses an underlying hash as building block. Need a fast (determistic) pseudorandom generator? Hashes it is.
Password hashing (and key derivation) is about the single use of hashes that requires computational hardness. It should be treated separately.
I do understand that password related hashes may have arbitrary computational delays added just to make them tougher to brute force, which would be bad in many other (all other?) situations?
Is that right?
Probably the best way to phrase it is that non-cryptographic hash functions cover a continuum with checksums at one extreme and cryptographic hash functions at the opposite extreme.
They go the other way round and set a "security parameter" that should hold for a particular instance (all the parameters fixed) of their algorithm. That "security parameter" determines how hard it should be for an arbitrary adversary to break the scheme by brute force, and in addition the scheme is only thought to be secure if there is no polynomial time algorithm that can break it in time significantly less than the brute force algorithm would take.
Given such a security parameter, the art is then typically to design the fastest primitive that upholds this same security margin. For example, if two (unbroken) algorithms would take O(2^80) for a brute force Birthday Attack, then you'd conceive the faster of the two as the "better" algorithm. ("Fast" of course can be subjective, e.g. fast in HW or fast in SW etc.)
For one thing, that's simply just because making an algorithm slower is always possible (PBKDF2), but in practice there are a lot of applications (probably more) where a hash should be as fast as possible rather than as slow as possible, as in digital signatures for example.
The first written mention of artificially slowing down key derivation was (AFAIK) here: https://www.schneier.com/paper-low-entropy.html The rationale was that, given a key (resp. password) that gives you s bits of security, you apply some function that gives you extra t bits of security, by making bruteforcing it 2^t times more expensive.
Also note that some applications of hash functions don't care about collisions. MD5 is still OK-ish for e.g. HMAC, where what matters is second-preimage resistance.
Imagine someone trying to attack a BitTorrent tracker by injecting false data packets with "true" hashes. This is a case where you do want the properties of a hash, not a checksum (changing a single bit greatly changes the output; it's hard to figure out an input that will create a matching output) but you don't want the properties of a cryptographically-secure hash (slow to calculate.)
Cityhash, spookyhash, and murmurhash are three modern non-cryptographic hash functions which explicitly don't call themselves "checksums".
Indeed, we say "hashtable", not "checksumtable".
> the hashes used in SSL and the like are designed to be as fast as possible without sacrificing too much security.
See also: map and set hashes. There, you're looking for speed and good distribution across the buckets, a slower hash function is not something to look forward to.
I feel like the line between "this is a checksum" and "this is a secure hash" is kind of illusory, other than for pure performance reasons.
Sufficient. all hashes (cryptographically secure or not) will fail to detect some changes, that's a hash collision and that's an intrinsic and non-negotiable property of m:n mappings where len(m) > len(n). "Checksums" assume there is no attack being performed, as in nobody is trying to craft a collision. In "normal operations", even with pretty shitty hash functions (e.g. CRCs) hash collisions between subsequent file changes (or whatevere) are unlikely enough.
It really bothers me when people misuse "in theory" like that. "In theory" means "we have a model that makes good predictions in some circumstances, but there are cases where it may fall short". But a model in which hashes uniquely represent arbitrary collections of data is a model which allows for infinite compressibility of strings. That is not a theory that merely falls short in some edge cases; it is a theory which is hilariously and catastrophically wrong.
Nope. Say you have a hash function `h` which guarantees a unique, 128-bit output for any input. Then `h` is a function which compress any string into 128-bits:
If `y = h(x)` then it is trivial for me to write a program that will reconstruct `x` from `y`. I will simply iterate `x` through the possible input strings (which I can do because the set of strings is countable) until I find one that satisfies `h(x) == y`. Impractical, yes, but allowed by the theory, and that means the theory is invalid.
If I let "some circumstances" be "non critical use with no attacker" and the model be "very very very very very very very very very very very very very low probability of collision when applied on pre-existing files", I don't see any problem.
The problem is that you were commenting the article using an absurdly geek point of view, interpreting each sentence as it had a very strict mathematical meaning.
In the real world, peoples sometime do misuse the language a little with the resulting sentences being easier to understand for everybody. That is generally not a call for a smart people to point out that when translated word by word (i.e. when poorly translated) using quantifiers, the statement is mathematically false. Especially when everybody on earth understand that use of "in theory" in the first place.
If Jeff wanted to make a mathematical article, he probably had written mathematical statements in the first place.
If every time you see the world "theory" you are feeling you should correct the one using it by telling him about corner cases he very probably already knows about their existence in the first place, you will loose a lot of time for a lot of non-constructive comments.
This only applies to cryptographic hash functions, like SHA-1 and Skein. There are non-cryptogaphic hash functions which are not designed to be secure, like Dan Bernstein's DJB-family of hashes (DJBX33A and DJBX33X), that are used e.g. for hash tables with string keys.
That's ridiculous. After the fifth attempt to enter my 20-character password correctly (which was chosen randomly by my Password manager), I gave up and entered a 8-character password.
So the problem is not only the user, but also ridiculous application requirements.
You can then paste in the password field.
It should be reasonably possible to make a bookmarklet that goes through all form elements on a page and does this automatically to all that have onpaste handlers, so you don't have to bust open the inspector and dick around every time you want to log in.
"No. Why stop the user from copy-pasting their own password?
Whenever you're looking at a security protection like this, it's important to ask yourself: Exactly what kind of attacks are am I trying to protect against? In this case, even if you prevent copy-paste, the user can just retype it if they really want to, after all. And if you're worried about Evil Spyware, that stuff can just install a browser extension and look at the password in the DOM directly, or install a keylogger and capture it as it's being typed.
Indeed, this can even reduce security. Consider if the user's using a password management program that can either put the password into the clipboard, or display it for retyping. If you prevent paste, that means the user must display the password on screen for any shoulder surfers to see."
You cannot "mimic" another person's fingerprint with MD5. Currently you can only "create" two people with the same fingerprints.
This might sound the same but consider the case where you want to replace someone's notepad.exe with an evil notepad.exe (and make sure that it has the same MD5 so they don't notice). Currently, no attack exists to do that.
If you store a salted hashed password, the authenticating party needs to send you the actual password.
Sure, they could send you the password over an encrypted connection. But that implies that a security context (and shared secret) has already been established. There's a chicken/egg problem here. The most desirable approach is to use a protocol which performs mutual authentication as part of the initial setup of the security context.
There are protocols which can perform password-based authentication without actually sending the plaintext password. It's called a zero-knowledge password proof. In addition, some of these allow a "verifier" to be stored rather than the password itself.
In this scenario, not only do you not store a secret which would allow impersonation of a user, but that secret is also never even sent to you.
We need support for this in browsers :)
At login, it requires a bit of brute-forcing on the server to check the hash since we have to go trough all possible pepper strings. This adds a few ms (e.g. with a random string of length 4).
Now, on the attacker's side the picture looks drastically different. The amount of time required to brute-force through the already salted hashes grows exponentially with the length of the pepper string. If it takes a few days to crack the whole database without pepper, it might take a few years to do it with pepper. There is absolutely nothing the attacker can do about it. No access to any part of your system will help him or her.
Second, the actual usefulness of any work factor has to be proven first. Do we know that the iterative approaches offer "real" work factors? What if subsequent iterations are easier to compute by exploiting the structure of the input (i.e. password + result of previous iteration). We have to prove that this is not the case. The same argument holds for peppering, which is also based on computing hashes with a certain structure.
I'd like to hear your arguments as to why certain work factors are more "real" than others, especially peppering.
The usefulness of work factors has been proven. You can start here: http://www.bsdcan.org/2009/schedule/attachments/87_scrypt.pd...
That isn't drastically different. The amount of time required scales exponentially with the pepper string for both the attacker and the legitimate authentication server. Not really different from increasing the work factor with bcrypt; and not as cool as increasing the circuit size with scrypt.
Peppering is applicable to all hashing algorithms unlike the specific parameters you mentioned.
Excellent article about hashes: http://home.comcast.net/~bretm/hash/
However, secure passwords are still a crutch. We should be fixing the problem of asking for passwords. We already have solutions for this, but they are not widely used. StartSSL(http://www.startssl.com/) gives you a secure certificate which can be used to login to their site. I would love to see the browser makers make it easy for users to use client side certificates.
More problematic, the chance of any collisions is 50% when you have filled just sqrt(space) - known as the birthday paradox.
Edit: Oh dear, there's a followup post:
I use two salts to hash a password:
sha1(SALT . SALT2 . $password);
the second salt is unique for every user and stored in a database.
Why is it not secure?
bcrypt seems to lack any research verification. Its a modified blowfish, but who of us can tell what modifications are safe without making the scheme ineffective and gauge its validity?
I'm not sure what kind of research verification you're hoping to get for a password hash. These aren't the kinds of constructions where the possibility of spending 48 hours of cluster compute time to generate a collision will actually hurt you.
Sometimes, "research verification" is just a fig leaf for "too many Rails developers got uppity about using bcrypt and now I need to signal that I'm not one of them". Password hashes don't get a lot of research attention because they're pretty basic, cryptographically speaking.
Perhaps the faster the hash is easier implement in hardware / less power for embedded devices?
The complexity of a hardware implementation is also a factor.
It's not always the case that a hash that is faster in software is easier or cheaper to implement in hardware. For instance Skein is very fast on x86-64 but apparently is less competitive in low power hardware implementations.
This is due to the confusion about the general purpose of a hash function versus the specific purpose of a "password hash"/PBKDF, which Jeff's article helps contribute to with it's loose terminology.
A general hash function should be fast as it needs to be invoked often.
I run several sites, and I'm not a security expert. It'd be great to have some sort of checklist of things I should and shouldn't do.
The reason why longer passwords take long is not a property of the GPU or computation, it is because there are many more possible passwords as the length increases and you have to try sizable portion of them.
For example, there are (to make the math simple) 10 000 four number pins (0000, 0001, ... 9999). However, if you append 4 to every pin, you have made them longer, but there are still 10 000 of them, because the 4 is fixed. However if you use 5 number pins, there are 100 000 of possible ones.
Random string would work only if you can store it in a way that attacker can't get to them. However since the application that uses them needs them to verify the password, this is not very realistic assumption.
However, I see that my exact reasoning in the post above is not solid. Having longer strings as the final result of password comparison is irrelevant, because they are not part of the brute-force computation in the first place.
Developers -> use bcrypt
Users -> use a service like Lastpass with a long passphrase as master password; use unique 12+ character (including specials!) passwords through Lastpass
$pw_hash = sprintf('%s|%s|%s|%s|%s',
SALT1, $user_name, SALT2, $pw, SALT3);
It doesn't really matter how much encryption and hashing we throw at anything if everything is stored on the same server.
What's the point of salt if salt is in plain sight?
What's the point of asking users to verify hash of downloadable files when hash is stored along side the file itself?
What's the point of cryptography when code points straight to all that's necessary to dispell the protection?
Ultimate shame is that we still lack the necessary infrastructure for minimum level of security despite all the cloud-related hype, leaving each server to stand-alone which is no security at all.
To prevent rainbow tables and to increase attack complexity. And to prevent someone who sees a hash from being able to drop it into a search engine and get the password (works for unsalted common passwords).
The point of salt has never been secrecy.
> What's the point of asking users to verify hash of downloadable files when hash is stored along side the file itself?
To verify data integrity. It isn't an question of security, or they would be signed properly.
> What's the point of cryptography when code points straight to all that's necessary to dispell the protection?
I have no idea what you're talking about.
> Ultimate shame is that we still lack the necessary infrastructure for minimum level of security despite all the cloud-related hype, leaving each server to stand-alone which is no security at all.
Still no idea what you're talking about.
They do not make sense for servers with high security needs.
As for the download hash, this has nothing to do with security at all, as I already said.
Re download hash, I think we disagree on what data integrity means. By data integrity, I mean trustworthiness which implies security, meaning file content hasn't been modified in-storage or in-transit. I'm guessing you are using those words to mean protection against transport-errors or typos (credit card entry) for which checksum is commonly used.
Given that HTTP is still the most common transport for file download and it's a reliable transport, file-hashing for transport-errors detection is I think meaningless. I know some browsers failed to report errors but comparing length is typically suffice to catch this.
That leaves only file-hashing as protection against tempering in-storage or in-transit which is why I am saying it's security-related. Keyed-hashing can mediate but, unfortunately, is not commonly used as it requires PKI.
Data integrity is not trustworthiness. Data integrity means I got what you said I should get, not that you are trustworthy (and therefore I can trust what I got from you). If you want secure assurance that you got what I really sent, then you need secure assurance that I am who I say I am, and no hash will get that. You need a signed file (or hash). A hash by itself is indeed nothing but a robust a checksum in this case.
There's no point in arguing about the md5 for downloads. You're assuming that it's for security, and it's really not. It's primarily for people who are paranoid about corruption. It is also useful for validating when downloading from a mirror, though.
I do, however, appreciate your cordial and elaborate responses for which I thank you.
If you by "plain sight" mean "the same place as the hash" you're misunderstanding what the salt is for. If you want to access X servers in order to verify a password, you don't need a salt; you just split the hash in X parts and store each on different servers.
Salts are for preventing rainbow tables.
Salting has another benefit too: it makes it virtually useless to find a mere password collision.
Say, for example that I use two sites, Spacelook and Squitter, which both use a popular hash called SPARTACUS for their passwords. My password on both sites is "abracadabraSpanishFly_92@prosperity;". An attacker gets into Spacelook's database and does a cross reference with their SPARTACUS rainbow table. I happen to be very unlucky and my nice strong password has a SPARTACUS collision with the very short "a2e34". Now the attacker goes to Squitter, puts in my username and "a2e34" and logs in with no problem even though that's not my password.
Now, suppose that Spacelook and Squitter had each used a different per-site salt (not good enough, but better) when they hashed my password, and this time suppose the attacker builds a custom rainbow table for Spacelook's salt and finds that my salted password collides with salted "fr-_9x32;". If they go to Squitter and try to log in with "fr-_9x32;", Squitter's use of a different salt will cause it to hash differently than my password, and so only my Spacelook account will have been compromised.
Besides, the point of my original comment was the problem of placing everything in the same basket which includes the code used to hash passwords and files. How would you protect them?
That's what we call bruteforce…