Speed Hashing (codinghorror.com)
169 points by csomar 1183 days ago | 125 comments



Using "hash" to mean "strongly collision-free function" is a valid terminological choice; in a wide-readership piece like this it should ideally be pointed out in order to avoid confusion, but there are many fields where that definition would be assumed without statement.

Getting confused between hash functions and password-based key derivation functions, on the other hand, is absolutely inexcusable; that very confusion is why so many people make the mistake of using SHA256(salt . password) as a key derivation function. People hear that MD5 is broken but SHA256 isn't, not realizing that MD5's breakage as a hash function does not impact its security as a PBKDF, and that MD5 and SHA256, not being designed as PBKDFs, are both entirely inappropriate for use in that context.
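To make the distinction concrete, here's a minimal Python sketch contrasting the anti-pattern with a purpose-built KDF from the standard library (the password, salt handling, and iteration count are illustrative, not recommendations):

```python
import hashlib, os

password = b"correct horse battery staple"
salt = os.urandom(16)

# The anti-pattern under discussion: one SHA-256 invocation costs
# nanoseconds, so offline brute force against it is cheap.
weak = hashlib.sha256(salt + password).hexdigest()

# A purpose-built KDF: PBKDF2-HMAC-SHA256 iterates the hash many times,
# making each guess proportionally more expensive for an attacker.
strong = hashlib.pbkdf2_hmac("sha256", password, salt, 200_000)
```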

Also, for reference: MD5(12 printable ASCII characters) ~ bcrypt(8 printable ASCII characters) ~ scrypt(6 printable ASCII characters) in terms of cost-to-crack on sophisticated (ASIC/FPGA/GPU) hardware.

-----


Not to quip, but surely some of the blame for that confusion has to lie on the cryptography experts who insist on using lung-spasm-inducing hairballs like "PBDKF" (edit: typo there was, I swear, unintentional!) to represent concepts better explained in English as "expensive to compute".

-----


"Password based key derivation function".

The problem here isn't that cryptographers are inflicting a terrible name on you; it's that security people have repurposed a function intended for one purpose (generating crypto keys) for another (storing passwords).

Also: don't blame us, blame the PKCS standards group.

-----


Sure sure. What I'm really saying is that the core concepts here are "collision resistance" and "susceptibility to brute force", and that those are comparatively easy to grok for a typical developer. But they get scared to death when they read stuff like (verbatim from cperciva above) "MD5's breakage as a hash function does not impact its security as a PBKDF, and not realizing that MD5 and SHA256, being not designed as PBKDFs" which makes it sound more subtle than it is.

-----


To clarify: the clear English way to write that would be something like "Colliding passwords aren't a problem, because it's cheaper to search the password space anyway, so you can use MD5 there securely. But MD5 and SHA256 were designed to be cheap to compute, so they're easy to brute force. Use a password function designed to be expensive instead." Charging off down the jargon road can only complicate things. And doing so in a post about people being confused (!) about the subject is frankly hurting, not helping.

-----


But what he's saying is very straightforward:

The MD5 vulnerabilities that this post talked about don't change the security of PBKDF-MD5 (or any iterated MD5 password hash), because password hashes aren't used to authenticate data. Similarly (this is a little mindbending): HMAC-MD5 has no known viable attacks, so many crypto protocols that use MD5 don't have viable attacks.

and

Because they all seem like acronyms, people mix up MD5 and SHA256 (which are core hash functions, "primitives" in a cryptosystem) with PBKDF, which is a standard for turning passwords into crypto keys.

(I know you know both, I'm just trying to restate).
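The HMAC-MD5 point can be checked against a published test vector; this sketch uses Python's stdlib and the RFC 2202 test data:

```python
import hmac, hashlib

# MD5's collision attacks don't translate into HMAC-MD5 forgeries: the
# attacker never sees or controls the keyed inner/outer hash states.
tag = hmac.new(b"Jefe", b"what do ya want for nothing?", hashlib.md5).hexdigest()
print(tag)  # 750c783e6ab0b503eaa86e310a5db738 (RFC 2202 test case 2)
```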

-----


I don't see that it was repurposed; "key derivation function" != "encryption key derivation function", and I prefer the term "login key" over "password hash".

-----


Since there are good password hashes that are also bad KDFs, I'm not sure I agree with the equivalence.

-----


There are?

-----


... I think so?

They don't all work this way but this is one of those disagreements where you're more likely to be right than me.

-----


Good point. Isn't the only additional PBKDF requirement an artificial, technically unnecessary computational delay? I mean far beyond what is necessary to actually produce a strongly collision-free function.

-----


The key requirement for a password-based key derivation function is that performing a brute force search is expensive. The PBKDF2 and bcrypt functions approximate this by scaling the amount of computation required; but this is imperfect since they can both be implemented on small circuits, making highly parallel attacks using GPUs/FPGAs/ASICs feasible. The scrypt function scales the amount of RAM required -- and thus the circuit size -- as well, making it far more expensive to attack since you can't use the same sort of cheap highly-parallel crunching.

The best reference for this analysis is the scrypt paper: http://www.tarsnap.com/scrypt/scrypt.pdf
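As a rough sketch of the knobs involved, Python's stdlib exposes scrypt directly (the parameters here are illustrative, not tuned recommendations):

```python
import hashlib, os

salt = os.urandom(16)
# scrypt's cost parameters: n (CPU/memory cost), r (block size), p (parallelism).
# Memory use is roughly 128 * r * n bytes -- about 16 MiB here -- so an
# attacker can't cheaply pack thousands of cracking units onto one chip.
key = hashlib.scrypt(b"hunter2", salt=salt, n=2**14, r=8, p=1, dklen=32)
```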

-----


Also, PBKDF2 is not designed to be hard to crack on a GPU. PBKDF2 is actually notoriously easy to implement on a GPU or FPGA due to its small, fixed amount of RAM. It's probably not a huge deal, since taking a large fraction of a second on a CPU to validate a login would translate into taking a large fraction of 1/150 of a second per password crack attempt on a GPU, so it's still decent security.

For something designed to defeat specialized hardware, see scrypt.

[edit] Just noticed who I was replying to. Now that's funny.

-----


A bigger reason PBKDF2 is easy to implement on GPUs is that it's (almost always) built out of primitives that are already widely available in GPU implementations, like SHA1 and SHA256.

That's why I recommend PBKDF2 with the PRF set to the authentication tag from FEAL4 in GCM mode. Nobody's got that in a GPU! Arrrrr!

-----


Nitpicks. It seems that he's talking about password hashing:

> "Hashes are designed to be tamper-proof". Wrong. Cryptographically secure hash functions are.

> "Hashes, when used for security, need to be slow." Wrong. Password hashes needs this; not SHA1/MD5 etc.

-----


Well, I'd say a hash that has no need whatsoever to be tamper-proof (no attackers, ever) and cares only about speed is a checksum -- so maybe a terminology issue.

I agree there are certainly other uses for hashes, just trying to distinguish between hashes and checksums.

Re-reading what I wrote, I open with "Hashes are a bit like fingerprints for data. A given hash uniquely represents a file, or any arbitrary collection of data" so the context of this article is hash functions that are able to uniquely identify something in a reliable, trustworthy way -- either checksums (fast, no need for security) or hashing (slower, more reliable, possibly "secure" for some definition of secure), but always in the context of "can I trust this value to tell me that the data is really what I think it is?"

Of course there is a tradeoff with speed; a person's name can be a good enough identifier (checksum) in some circumstances, but other circumstances may require more reliability, like DNA or fingerprints (secure-ish hash), at the cost of being far slower to collect and validate and way more onerous.

-----


I am afraid the terminology you use is not canonical.

A hash (function) is any function that maps a big variable-length data structure to a smaller (fixed-length) data structure.

A checksum is a special hash (function) which has the purpose of detecting accidental errors during transmission or storage. (This makes them different from hash functions used for example in hash tables. For example relevant question: minimum how many badly transmitted bits are needed to break the error detection in the case of a specific checksum?)

A cryptographic hash function is another kind of special hash function which also has a very good definition on wikipedia.

(Basically these are the definitions on Wikipedia, and I think they are fine. If the terminology you use were canonical, we would use the name 'checksum table' instead of 'hash table', which would also be quite illogical, because the term 'checksum' includes its purpose in its name: checking something.)

-----


I have hash functions that work on fewer bytes than the length of the hash they produce.

-----


Won't that be fairly common? Java's hashCode and the .Net GetHashCode both return 4 byte ints and there are presumably a lot of HashMap and Dictionary objects out there using short strings as keys.

-----


It's very common; the norm even. My point was that "A hash (function) is any function that maps a big variable-length data structure to a smaller (fixed-length) data structure." is incorrect; the source size isn't relevant.

-----


The source size is extremely relevant, in the sense that the function must be able to work with variable-sized input, including large input. The definition has its purpose embedded in it: the reason we have such cryptographically secure hash functions corresponds to the need to digest large data.

-----


Secure hashes need to be fast. If you look at the ongoing SHA-3 competition, you will see an enormous focus on speed, in both software and hardware.

Why do they need to be fast? They're used everywhere. Opening up notepad.exe, you'll verify a handful of digital signatures, each of which will compute a hash of the binary as its first order of business. HMAC is used in virtually every secure network protocol worth using --- it uses an underlying hash as a building block. Need a fast (deterministic) pseudorandom generator? Hashes it is.
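That last use is easy to sketch: a counter-mode generator built from SHA-256 (a toy illustration of the pattern, not a vetted DRBG construction):

```python
import hashlib

def hash_prg(seed: bytes, nbytes: int) -> bytes:
    # Hash seed || counter repeatedly; every block of output costs one
    # hash call, which is why the underlying hash must be fast.
    out = bytearray()
    counter = 0
    while len(out) < nbytes:
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:nbytes])
```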

Password hashing (and key derivation) is about the only use of hashes that requires computational hardness. It should be treated separately.

-----


I guess what I don't fully understand is just how computationally intensive (slow) a hash needs to be to produce a reliable fingerprint, that is, one that is as collision free as possible and extremely sensitive to any tiny change in the source data.

I do understand that password related hashes may have arbitrary computational delays added just to make them tougher to brute force, which would be bad in many other (all other?) situations?

-----


Strongly collision-free hash functions are generally in the range of 5-20 clock cycles per byte of data on modern hardware. It's significantly slower than memcpy, but probably significantly faster than your network.

-----


Correct me if I'm wrong, but my understanding is that cryptographic and non-cryptographic hash functions have similar avalanche properties (and therefore similar probability of accidental collision); the difference is just that cryptographic hash functions make it difficult to intentionally create a collision.

Is that right?

-----


Good non-cryptographic hash functions have "good" avalanching behaviour, but the definition of "good" depends on how randomly you expect your inputs to behave and how much a collision costs you. In hash tables, for instance, people often use quite weak -- i.e., poorly avalanching -- hashes since the cost of a collision is quite low.

Probably the best way to phrase it is that non-cryptographic hash functions cover a continuum with checksums at one extreme and cryptographic hash functions at the opposite extreme.
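Avalanche behaviour is easy to measure; this sketch flips one input bit and counts how many of MD5's 128 output bits change (roughly half, for any decently avalanching hash):

```python
import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    # Hamming distance between two equal-length digests.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# "d" (0x64) vs "e" (0x65): the two inputs differ in exactly one bit.
d1 = hashlib.md5(b"hello world").digest()
d2 = hashlib.md5(b"hello worle").digest()
print(bit_diff(d1, d2))  # typically close to 64 of 128 bits
```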

-----


Maybe the question is asked from the wrong perspective - it's not that designers of cryptographic hash functions ask themselves "Hmm, so how fast/slow do we want this thing to be?".

They go the other way round and set a "security parameter" that should hold for a particular instance (all the parameters fixed) of their algorithm. That "security parameter" determines how hard it should be for an arbitrary adversary to break the scheme by brute force, and in addition the scheme is only thought to be secure if there is no polynomial time algorithm that can break it in time significantly less than the brute force algorithm would take.

Given such a security parameter, the art is then typically to design the fastest primitive that upholds this same security margin. For example, if two (unbroken) algorithms would take O(2^80) for a brute force Birthday Attack, then you'd conceive the faster of the two as the "better" algorithm. ("Fast" of course can be subjective, e.g. fast in HW or fast in SW etc.)

For one thing, that's simply because making an algorithm slower is always possible (see PBKDF2), but in practice there are a lot of applications (probably more) where a hash should be as fast as possible rather than as slow as possible, as in digital signatures for example.

-----


The first question is an open one. Not even theoretical cryptography people understand collision resistance very well. From a practical standpoint, though, we know that it can be pulled off reasonably fast; SHA-256 is still standing, for example, and it "only" takes 14 cycles per byte hashed. Not great, but not bad either.

The first written mention of artificially slowing down key derivation was (AFAIK) here: https://www.schneier.com/paper-low-entropy.html The rationale was that, given a key (resp. password) that gives you s bits of security, you apply some function that adds an extra t bits of security by making brute-forcing it 2^t times more expensive.

Also note that some applications of hash functions don't care about collisions. MD5 is still OK-ish for e.g. HMAC, where what matters is second-preimage resistance.

-----


Morris and Thompson beat Schneier et al. by about two decades; see Password security: a case history, CACM 1979.

-----


> Well, I'd say a hash that has no need whatsoever to be tamper-proof (no attackers, ever) and cares only about speed is a checksum -- so maybe a terminology issue.

Imagine someone trying to attack a BitTorrent tracker by injecting false data packets with "true" hashes. This is a case where you do want the properties of a hash, not a checksum (changing a single bit greatly changes the output; it's hard to figure out an input that will create a matching output) but you don't want the properties of a cryptographically-secure hash (slow to calculate.)

-----


> Well, I'd say a hash that has no need whatsoever to be tamper-proof (no attackers, ever) and cares only about speed is a checksum -- so maybe a terminology issue.

Cityhash, spookyhash, and murmurhash are three modern non-cryptographic hash functions which explicitly don't call themselves "checksums".

Indeed, we say "hashtable", not "checksumtable".

-----


Hash tables have no correlation to cryptography, and the "hash" function in most languages is more closely related to a checksum---in fact, the `hash` function in Python for ints is the identity. I would never assume that a hash is intended to be secure in any sense, just that it's a function that produces an integer from an arbitrary data source.

-----


I would not call "a hash that has no need whatsoever to be tamper-proof and cares only about speed" a checksum. You don't use checksums to create a hash table. Now some checksum would be suitable for that use, but not all. So let's continue to call a hash a hash.

-----


Hashes are designed to be fast. Password hashes (MD5crypt / SHA1crypt / bcrypt / scrypt / PBKDF2) are designed to be slow to make dictionary attacks harder, but the hashes used in SSL and the like are designed to be as fast as possible without sacrificing too much security.

-----


As often, he plays fast and loose with terminology and uses "checksums" for "hash functions" and "hashes" for "cryptographic hash functions".

> the hashes used in SSL and the like are designed to be as fast as possible without sacrificing too much security.

See also: map and set hashes. There, you're looking for speed and good distribution across the buckets, a slower hash function is not something to look forward to.

-----


I guess, but what's the value of a "checksum" that fails to detect certain "cryptographically secure" changes in a file? MD5 is clearly just a checksum now, since it's not secure, but it was designed to be originally. And would you ever want to use a checksum that others could manipulate at will? Curious.

I feel like the line between "this is a checksum" and "this is a secure hash" is kind of illusory, other than for pure performance reasons.

-----


> I guess, but what's the value of a "checksum" that fails to detect certain "cryptographically secure" changes in a file?

Sufficient: all hashes (cryptographically secure or not) will fail to detect some changes; that's a hash collision, and it's an intrinsic, non-negotiable property of m:n mappings where len(m) > len(n). "Checksums" assume there is no attack being performed, as in nobody is trying to craft a collision. In "normal operations", even with pretty shitty hash functions (e.g. CRCs), hash collisions between subsequent file changes (or whatever) are unlikely enough.

-----


E.g. ZFS uses non-cryptographic hashes to detect bitrot. TCP and IPv4 use a non-cryptographic hash to detect packet corruption. Etc.

-----


terminology issue, I guess, I'd call those checksums -- where speed is the overriding concern and you're not worried about attackers changing the data underneath you. (And you actually need to uniquely identify the data in a reliable way..)

-----


Sure, it's a terminology issue, but as several others have pointed out, you're the only one using your terminology. So expect people to be confused when you try to explain these things using your definitions.

-----


In the case of TCP/IP, everyone calls the checksum a checksum, and nobody calls it a hash.

-----


>A given hash uniquely represents a file, or any arbitrary collection of data. At least in theory.

It really bothers me when people misuse "in theory" like that. "In theory" means "we have a model that makes good predictions in some circumstances, but there are cases where it may fall short". But a model in which hashes uniquely represent arbitrary collections of data is a model which allows for infinite compressibility of strings. That is not a theory that merely falls short in some edge cases; it is a theory which is hilariously and catastrophically wrong.

-----


It bothers me too when people reverse "in theory" and "in practice", which are really quite different things! Clearly in this case he actually meant "in practice".

-----


I disagree. The theory is not that collisions don't exist. The (probabilistic) theory is that you will never find one. In practice hash functions tend to have flaws that are discovered over time, ruining this theory.

-----


It does not allow infinite string compression, because you obviously have to store the original data. For hashes of good length and quality, the theory stands. I would even use MD5 and say the theory stands for non-critical every day uses given there is no attacker.

-----


> It does not allow infinite string compression, because you obviously have to store the original data.

Nope. Say you have a hash function `h` which guarantees a unique, 128-bit output for any input. Then `h` is a function which compresses any string into 128 bits:

If `y = h(x)` then it is trivial for me to write a program that will reconstruct `x` from `y`. I will simply iterate `x` through the possible input strings (which I can do because the set of strings is countable) until I find one that satisfies `h(x) == y`. Impractical, yes, but allowed by the theory, and that means the theory is invalid.
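The argument can be made concrete with a toy hash (SHA-256 truncated to one byte, purely so the search is fast); the enumeration "decompressor" does terminate, but because collisions exist it returns *a* preimage, not necessarily the original:

```python
import hashlib
from itertools import product

def toy_hash(s: bytes) -> int:
    # SHA-256 truncated to one byte: only 256 possible outputs, so the
    # pigeonhole collisions are easy to observe.
    return hashlib.sha256(s).digest()[0]

def invert(y: int) -> bytes:
    # Enumerate strings in order until one hashes to y -- the
    # hypothetical "decompressor" from the argument above.
    alphabet = b"abcdefghijklmnopqrstuvwxyz"
    for length in range(1, 4):
        for tup in product(alphabet, repeat=length):
            s = bytes(tup)
            if toy_hash(s) == y:
                return s
    raise ValueError("no preimage found in search space")

recovered = invert(toy_hash(b"zz"))
# recovered hashes to the same value as b"zz", but with 18,278 candidate
# strings mapping into 256 buckets it is almost certainly a shorter collision.
```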

-----


By your very own definition, "In theory" means "we have a model that makes good predictions in some circumstances."

If I let "some circumstances" be "non critical use with no attacker" and the model be "very very very very very very very very very very very very very low probability of collision when applied on pre-existing files", I don't see any problem.

The problem is that you were commenting on the article from an absurdly geeky point of view, interpreting each sentence as if it had a very strict mathematical meaning.

In the real world, people sometimes misuse the language a little, with the resulting sentences being easier for everybody to understand. That is generally not a call for smart people to point out that, when translated word by word (i.e. when poorly translated) using quantifiers, the statement is mathematically false. Especially when everybody on earth understands that use of "in theory" in the first place.

If Jeff had wanted to write a mathematical article, he probably would have written mathematical statements in the first place.

If every time you see the word "theory" you feel you should correct the person using it by telling him about corner cases he very probably already knows exist, you will lose a lot of time on a lot of non-constructive comments.

-----


That's false. Nobody ever says that h(x) is unique, only that the probability that h(x) = y, for any given y, should be about 1/(size of the output space).

-----


"A given hash uniquely represents a file, or any arbitrary collection of data. At least in theory."

-----


If the theory doesn't stand in all cases, then it doesn't stand. It should be "in practice".

-----


Nitpick: this statement is not exactly true: "Hashes are designed to be tamper-proof".

This only applies to cryptographic hash functions, like SHA-1 and Skein. There are non-cryptographic hash functions which are not designed to be secure, like Dan Bernstein's DJB family of hashes (DJBX33A and DJBX33X), that are used e.g. for hash tables with string keys.

-----


I suppose I was thinking of Java and .NET where you don't generally get the chance to "pick" a hash function unless you're in the Crypto namespaces. There are certainly hash-based data structures like HashTables and the like, but the underlying hash algorithms are not exposed in any way, they're just magic.

-----


Not to pile on with the nitpicking, but for just the cases everybody here is talking about (eg non-crypto hashing) you get to "pick" exactly the hash function. You're forced, actually, to implement hashCode() in Java if you want to be able to put the object in a HashMap, HashSet, etc.

-----


You're right, but that's still a very small subset of possible hashing functions, because you must produce a 4-byte hash, which is then hashed again (to the length of the storage array). Of course the requirement isn't to avoid collisions, just to distribute as equally as possible.

-----


You can override the default hash producing methods in both Java (hashCode) and C# (GetHashCode) - so while the standard hash generators might not be exposed directly they can easily be replaced.

-----


Using chosen hash functions in your applications isn't so rare even outside the cryptography field. For example, I often use a hash for storage addressing (à la git). For this kind of thing you need something fast and with practically no collisions (which isn't a requirement for a common hash table).

-----


While long and random passwords are a good thing, there are still applications that make that very difficult to use.

I'm looking at you, Apple App Store: making me choose a new password after only the second wrongly entered attempt, disallowing copy & paste on that website when entering the new password, requiring JavaScript on that website to enforce the no-copy-&-paste rule, and then not showing me the characters I enter.

That's ridiculous. After the fifth attempt to enter my 20-character password correctly (which was chosen randomly by my password manager), I gave up and entered an 8-character password.

So the problem is not only the user, but also ridiculous application requirements.

-----


That is indeed annoying. As a work around, if you use a browser or browser extension that lets you modify the DOM on the fly, you can disable the paste blocking. For instance, in Safari, use "Inspect Element" on the password field, find the onpaste="..." attribute, double click the onpaste and change it to something that isn't a valid event name.

You can then paste in the password field.

It should be reasonably possible to make a bookmarklet that goes through all form elements on a page and does this automatically to all that have onpaste handlers, so you don't have to bust open the inspector and dick around every time you want to log in.

-----


That's good to know, I'll try it next time!

-----


There is a good motivation to prevent copy-and-paste - they might be storing the password in plaintext somewhere.

-----


No, there isn't. From Stack Overflow (http://stackoverflow.com/a/4760170/):

"No. Why stop the user from copy-pasting their own password?

Whenever you're looking at a security protection like this, it's important to ask yourself: Exactly what kind of attacks am I trying to protect against? In this case, even if you prevent copy-paste, the user can just retype it if they really want to, after all. And if you're worried about Evil Spyware, that stuff can just install a browser extension and look at the password in the DOM directly, or install a keylogger and capture it as it's being typed.

Indeed, this can even reduce security. Consider if the user's using a password management program that can either put the password into the clipboard, or display it for retyping. If you prevent paste, that means the user must display the password on screen for any shoulder surfers to see."

-----


Agreed. If you stop me from copy and pasting, there is always pen and paper. Security auditors have fun making companies look dumb when they find passwords to critical systems on post-it notes.

-----


If you are using long passwords, you don't need to use purely random text, you can use a phrase.

-----


Unfortunately, my password manager is bad at generating phrases and I'm bad at generating random phrases.

-----


> If you could mimic another person's fingerprint or DNA at will, you could do some seriously evil stuff. MD5 is clearly compromised, and SHA-1 is not looking too great these days.

You cannot "mimic" another person's fingerprint with MD5. Currently you can only "create" two people with the same fingerprints.

This might sound the same but consider the case where you want to replace someone's notepad.exe with an evil notepad.exe (and make sure that it has the same MD5 so they don't notice). Currently, no attack exists to do that.

-----


I think it's also worth looking at the protocol used to perform the authentication. This is closely tied to what kind of verifier you can/can't store.

If you store a salted hashed password, the authenticating party needs to send you the actual password.

Sure, they could send you the password over an encrypted connection. But that implies that a security context (and shared secret) has already been established. There's a chicken/egg problem here. The most desirable approach is to use a protocol which performs mutual authentication as part of the initial setup of the security context.

There are protocols which can perform password-based authentication without actually sending the plaintext password. It's called a zero-knowledge password proof. In addition, some of these allow a "verifier" to be stored rather than the password itself.

In this scenario, not only do you not store a secret which would allow impersonation of a user, but that secret is also never even sent to you.

We need support for this in browsers :)

-----


Nobody has mentioned pepper yet. Not the "static salt" variant which you might find while googling; that is just more security by obscurity. I'm talking about adding a random fixed-length string of characters to the (salted) password that is not saved anywhere.

At login, this requires a bit of brute-forcing on the server to check the hash, since we have to go through all possible pepper strings. This adds a few ms (e.g. with a random string of length 4).

Now, on the attacker's side the picture looks drastically different. The amount of time required to brute-force through the already salted hashes grows exponentially with the length of the pepper string. If it takes a few days to crack the whole database without pepper, it might take a few years to do it with pepper. There is absolutely nothing the attacker can do about it. No access to any part of your system will help him or her.
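A minimal sketch of the scheme (plain SHA-256 for brevity and a two-character pepper so the demo stays fast; in practice you'd pepper a slow password hash and use a longer pepper):

```python
import hashlib, itertools, os, secrets, string

ALPHABET = string.ascii_lowercase
PEPPER_LEN = 2  # illustrative; the comment above suggests around 4

def hash_pw(password: bytes, salt: bytes, pepper: str) -> bytes:
    return hashlib.sha256(salt + password + pepper.encode()).digest()

def store(password: bytes):
    salt = os.urandom(16)
    pepper = "".join(secrets.choice(ALPHABET) for _ in range(PEPPER_LEN))
    return salt, hash_pw(password, salt, pepper)  # the pepper is discarded

def verify(password: bytes, salt: bytes, stored: bytes) -> bool:
    # The server brute-forces the pepper on every login; an attacker must
    # repeat this whole loop for every candidate password in the dictionary.
    for tup in itertools.product(ALPHABET, repeat=PEPPER_LEN):
        if hash_pw(password, salt, "".join(tup)) == stored:
            return True
    return False
```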

-----


That's because it's a silly idea which is inferior to cryptographic adaptive hashing, as is done by bcrypt, scrypt, and PBKDF2. If you want to provide a work factor, use a real one.

-----


First of all, you come off as condescending when you say "it's a silly idea" and say that the work factors you mentioned are "real" ones, thereby implying that adding more combinations is not a real work factor.

Second, the actual usefulness of any work factor has to be proven first. Do we know that the iterative approaches offer "real" work factors? What if subsequent iterations are easier to compute by exploiting the structure of the input (i.e. password + result of previous iteration). We have to prove that this is not the case. The same argument holds for peppering, which is also based on computing hashes with a certain structure.

I'd like to hear your arguments as to why certain work factors are more "real" than others, especially peppering.

-----


I am absolutely intending condescension towards the idea. Since you're an anonymous user with almost no comments in your history, I do not concede the idea that it's even possible to condescend to you: I have no idea who you are.

The usefulness of work factors has been proven. You can start here: http://www.bsdcan.org/2009/schedule/attachments/87_scrypt.pd...

-----


> Now, on the attacker's side the picture looks drastically different. The amount of time required to brute-force through the already salted hashes grows exponentially with the length of the pepper string.

That isn't drastically different. The amount of time required scales exponentially with the pepper string for both the attacker and the legitimate authentication server. Not really different from increasing the work factor with bcrypt; and not as cool as increasing the circuit size with scrypt.

-----


You are right, I should clarify this. For the authentication server the time required jumps from close to nothing to a few ms. For the attacker, the time jumps from a few hours or days to a few years or decades. This asymmetry is what I meant when I said that it's a drastically different picture.

Peppering is applicable to all hashing algorithms unlike the specific parameters you mentioned.

-----


I haven't heard of this before, so I might be wrong, but I don't think it makes finding collisions any harder. It introduces exponentially many collisions in the length of the pepper. For example, take xyzw and abcd such that h(xyzw) = h(abcd); now if you take a peppered hash with a pepper of length one, you get hp(yzw) = hp(bcd).

-----


There is a reason why there are hashes, cryptographic hashes, and CRCs. Some hashes are fast and some are slow. The same hash algorithm isn't the right choice for every usage.

Excellent article about hashes: http://home.comcast.net/~bretm/hash/

-----


People should be using something like PasswordMakerPro (https://chrome.google.com/webstore/detail/ocjkdaaapapjpmipmh...) or similar to generate big, ugly passwords (using the whole character set).

However, secure passwords are still a crutch. We should be fixing the problem of asking for passwords at all. We already have solutions for this, but they are not widely used. StartSSL (http://www.startssl.com/) gives you a secure certificate which can be used to log in to their site. I would love to see the browser makers make it easy for users to use client-side certificates.

-----


Until that lovely site goes offline and neither you nor anyone else knows any of your passwords. Not for me.

-----


With PasswordMaker Pro, the passwords are not really 'stored' anywhere. It just computes a hash using the domain name, a master password, and some other predefined data, so it's not dependent on any site. It uses a different password for each domain (because the domain is an input to the hash). This way, even if a site you are using is compromised and the passwords leaked, the effect is local to that site.

-----


In reality the usable space is substantially less; you can start seeing significant collisions once you've filled half the space.

More problematic: the chance of any collision reaches roughly 50% when you have filled just ~sqrt(space) of it - this is the birthday paradox.
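To make the birthday bound concrete, the standard approximation P ~ 1 - exp(-n^2 / 2N) gives about a 39% collision chance once you've inserted sqrt(N) items, crossing 50% at roughly 1.18 * sqrt(N):

```python
import math

def collision_probability(n_items: int, space: int) -> float:
    # Birthday-bound approximation: P(collision) ~ 1 - exp(-n^2 / (2N)).
    return 1.0 - math.exp(-(n_items ** 2) / (2.0 * space))

space = 2 ** 32
p = collision_probability(int(math.sqrt(space)), space)  # about 0.39
```

So even a "large" hash space starts yielding collisions far earlier than intuition suggests.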

-----


It looks like he just misread this article, which is kind of a disaster itself -- it recommends storing MD5 hashes of passwords, maybe with a salt if you want to be extra secure.

http://www.skrenta.com/2007/08/md5_tutorial.html

Edit: Oh dear, there's a followup post:

http://www.skrenta.com/2007/08/crypto_vs_the_working_coder.h...

-----


"salts alone can no longer save you"

I use two salts to hash a password: sha1(SALT . SALT2 . $password); the second salt is unique for every user and stored in a database. Why is it not secure?

-----


In reality you aren't using 2 salts; you are using one unique salt per user, where each user's salt starts with the same few bytes.

-----


Yes, I put the second salt in the database and the first salt under the www-root. A hacker who hacks only the database will not know the first salt.

-----


I think you have to assume worst case: if they have access to your database, they have access to your web root. It might not be the case, but you should assume that.

-----


Because it's stored in a database. If an attacker has access to the database with the hashed passwords she will likely have access to the database with the salt too.

-----


but only the second salt is stored in a database, the first salt is stored under WWW-Root.

-----


It's still there on the server, if the attacker has her hands on the database you have to assume that the entire server might be compromised

-----


What is the best way to secure passwords?

-----


You need to assume that the attacker will have access to anything on the server. So the first thing is clearly no plain-text passwords, only hashes. The second thing is to make it as hard as possible for the attacker to reverse the hash. One salt helps prevent the use of rainbow tables, but more salt is useless since the attacker has it. So you are left with choosing an algorithm that is hard to crack, and currently the best one is bcrypt, which is already implemented in most programming languages for you.
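For illustration, here is roughly what a slow, salted password hash looks like, sketched with the Python stdlib's PBKDF2 (a stand-in for demonstration; the parent specifically recommends bcrypt, and the function names here are mine):

```python
import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 200_000) -> bytes:
    # A fresh random salt per user defeats precomputed tables; a high
    # iteration count makes each brute-force guess expensive.
    salt = os.urandom(16)
    dk = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt + dk  # store salt alongside the derived key

def verify_password(password: str, stored: bytes, iterations: int = 200_000) -> bool:
    salt, dk = stored[:16], stored[16:]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(candidate, dk)
```

A bcrypt library wraps the same pattern (random salt, tunable work factor) behind a two-function API, which is why it's the usual recommendation.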

-----


Because sha1 is still super fast on a GPU. Why aren't you using bcrypt?

-----


I thought that hashing the password with two salts (one of them unique for every user) stored in two different places was secure enough.

-----


You thought wrong.

-----


Salts don't slow a GPU down: http://codahale.com/how-to-safely-store-a-password/

-----


Here is my problem with all of this. PBKDF really just means "take this hash function and apply it 1000 times in succession". That seems like a very butchery approach to cryptography to me.

bcrypt seems to lack any research verification. It's a modified Blowfish, but who of us can tell which modifications are safe without rendering the scheme ineffective, and how would we gauge its validity?

-----


PBKDF2 is more complicated than just iterating a hash 1000 times, but for most applications, iterating SHA1 many thousands of times is just fine.
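The naive "iterate a hash" construction the grandparent describes could be sketched like this (an illustration only; real PBKDF2 additionally uses HMAC with the password as key and XORs the intermediate values together):

```python
import hashlib

def iterated_sha1(password: bytes, salt: bytes, rounds: int = 10_000) -> bytes:
    # Repeatedly re-hash the running digest so each verification (and
    # each brute-force guess) costs `rounds` hash invocations.
    digest = salt + password
    for _ in range(rounds):
        digest = hashlib.sha1(digest).digest()
    return digest
```

Even this crude version captures the essential property: the per-guess cost scales linearly with the round count.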

I'm not sure what kind of research verification you're hoping to get for a password hash. These aren't the kinds of constructions where the possibility of spending 48 hours of cluster compute time to generate a collision will actually hurt you.

Sometimes, "research verification" is just a fig leaf for "too many Rails developers got uppity about using bcrypt and now I need to signal that I'm not one of them". Password hashes don't get a lot of research attention because they're pretty basic, cryptographically speaking.

-----


One very curious thing to me is that for the upcoming SHA-3 standard, Wikipedia lists the cycle timings for each hash method. I would have thought that slower hashing speed would be a good thing, but the faster the candidate algorithm, the better it appears to rank.

Perhaps a faster hash is easier to implement in hardware and needs less power for embedded devices?

-----


Imagine an implementation of git fsck, which checks that all the files in the repository match their sha1 based file names. The less time it takes to hash each file, the faster overall checking time, so it is better to use a hash which is designed to be fast.
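Concretely, git names a blob by the SHA-1 of a short header plus the file content, so an fsck-style check just re-hashes every object and compares against its name (a simplified sketch of the blob case only):

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    # Git hashes "blob <size>\0" followed by the raw content; the hex
    # digest is the object's name, so verification is a re-hash + compare.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()
```

With millions of objects to check, every nanosecond per byte of hashing speed translates directly into fsck wall-clock time, which is why a fast hash is desirable here.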

-----


Writing hash functions is always a trade-off between security and speed. The SHA-3 contest has separate requirements for security, and they naturally want the fastest hash function that fulfills those requirements.

The complexity of a hardware implementation is also a factor.

-----


He is referring to password hashing where brute forcing is an issue. For a cryptographic hash function, the goal is to be as fast as possible while retaining security. There is currently an intensive competition between the SHA-3 teams to produce the fastest implementations. Check out http://bench.cr.yp.to/results-sha3.html for more performance data on a range of machines.

It's not always the case that a hash that is faster in software is easier or cheaper to implement in hardware. For instance Skein is very fast on x86-64 but apparently is less competitive in low power hardware implementations.

-----


I would have thought that slower hashing speed would be a good thing, but the faster the candidate algorithm the better it appears.

This is due to the confusion about the general purpose of a hash function versus the specific purpose of a "password hash"/PBKDF, which Jeff's article helps contribute to with its loose terminology.

A general hash function should be fast as it needs to be invoked often.

-----


I'd be interested in hearing about what stack-specific measures website owners should take (outside of just pass phrases or using bcrypt) to secure their systems.

I run several sites, and I'm not a security expert. It'd be great to have some sort of checklist of things I should and shouldn't do.

-----


What about appending a significantly long random string to all passwords? Would that slow down re-computation of hashes significantly enough to thwart brute-force attacks? I don't know enough about GPU architecture to tell for sure.

-----


Not really.

The reason longer passwords take longer to crack is not a property of the GPU or the computation; it is that there are many more possible passwords as the length increases, and you have to try a sizable portion of them.

For example, there are (to make the math simple) 10 000 four-digit PINs (0000, 0001, ... 9999). If you append a 4 to every PIN, you have made them longer, but there are still only 10 000 of them, because the 4 is fixed. However, if you use five-digit PINs, there are 100 000 possible ones.
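The PIN arithmetic above, as a quick sanity check: appending a fixed suffix changes the strings but not the size of the search space the attacker must cover.

```python
# 10 000 four-digit PINs.
pins = [f"{i:04d}" for i in range(10_000)]

# Appending a fixed "4" makes every candidate longer, but the attacker
# knows the suffix and the search space is still 10 000.
suffixed = [p + "4" for p in pins]

# Only an extra *unknown* digit actually grows the space tenfold.
five_digit = [f"{i:05d}" for i in range(100_000)]
```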

A random string would work only if you can store it in a way that the attacker can't get to it. However, since the application needs it to verify the password, this is not a very realistic assumption.

-----


I was thinking more along the lines of somehow forcing certain space complexity on every hash computation so it can't be as easily parallelized.

However, I see that my exact reasoning in the post above is not solid. Having longer strings as the final result of password comparison is irrelevant, because they are not part of the brute-force computation in the first place.

-----


Password security quick basics:

Developers -> use bcrypt

Users -> use a service like Lastpass with a long passphrase as master password; use unique 12+ character (including specials!) passwords through Lastpass

-----


For long, random passwords that can be easily remembered, try SHA1_Pass. It's free, with source code and runs on Windows, Linux and Macs.

-----


Why would you use MD5 as a checksum if different documents with identical hashes can be produced?

-----


Presumably because it is faster than many of the secure hashes (particularly SHA256), and still functions adequately as a checksum (bit errors are not going to produce meaningful collisions).

-----


Because it's very unlikely to produce an identical checksum without trying to. So unless you're worried about malicious users, MD5 is fine.

-----


Because it is fast, ubiquitous, and you can't produce a collision by accident. That means it's perfectly fine to use where no malicious behavior is possible. In fact, it should be used in those instances due to the speed and availability on all platforms. Where security is a concern, use SHA-256 or SHA-512.

-----


That's why you shouldn't use MD5 as a checksum.

-----


And to slightly improve salting:

    $pw_hash = sprintf('%s|%s|%s|%s|%s',
        SALT1, $user_name, SALT2, $pw, SALT3);
In fact, my salts usually include non-printables, including the NULL char, plus they are very long.

-----


How does this improve salting? The point of salting is to avoid rainbow tables; as long as the salt is random you should be safe.

-----


Where to store the salt is the real problem.

It doesn't really matter how much encryption and hashing we throw at anything if everything is stored on the same server.

What's the point of salt if salt is in plain sight?

What's the point of asking users to verify hash of downloadable files when hash is stored along side the file itself?

What's the point of cryptography when the code points straight to everything necessary to dispel the protection?

The ultimate shame is that we still lack the necessary infrastructure for a minimum level of security despite all the cloud-related hype, leaving each server to stand alone, which is no security at all.

-----


> What's the point of salt if salt is in plain sight?

To prevent rainbow tables and to increase attack complexity. And to prevent someone who sees a hash from being able to drop it into a search engine and get the password (works for unsalted common passwords).

The point of salt has never been secrecy.
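A tiny sketch of that point (names hypothetical): the salt can be completely public and still defeat precomputation, because a table built for one salt is useless against any other.

```python
import hashlib
import os

def salted_hash(password: str, salt: bytes) -> str:
    # The salt is stored in the clear next to the hash; its job is to
    # make every user's hash input unique, not to be secret.
    return hashlib.sha256(salt + password.encode()).hexdigest()

alice_salt, bob_salt = os.urandom(16), os.urandom(16)
# Even if Alice and Bob share a password, their stored hashes differ,
# so one precomputed table cannot cover both of them.
```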

> What's the point of asking users to verify hash of downloadable files when hash is stored along side the file itself?

To verify data integrity. It isn't a question of security, or they would be signed properly.

> What's the point of cryptography when code points straight to all that's necessary to dispell the protection?

I have no idea what you're talking about.

> Ultimate shame is that we still lack the necessary infrastructure for minimum level of security despite all the cloud-related hype, leaving each server to stand-alone which is no security at all.

Still no idea what you're talking about.

-----


Your answers make sense in the context of servers for websites with low security needs, where privacy is of higher concern than security.

They do not make sense for servers with high security needs.

-----


I'm sorry, but you are simply incorrect. Salt is not a piece of secret data. Period. That was never the intention of salt, and it still is not. Secrecy of the salt does not improve its usefulness.

As for the download hash, this has nothing to do with security at all, as I already said.

-----


Re salt, please see project-rainbowcrack.com as an example of CUDA-driven hash cracking.

Re the download hash, I think we disagree on what data integrity means. By data integrity, I mean trustworthiness, which implies security: the file content hasn't been modified in storage or in transit. I'm guessing you are using those words to mean protection against transport errors or typos (credit card entry), for which a checksum is commonly used.

Given that HTTP is still the most common transport for file downloads and it is a reliable transport, file hashing for transport-error detection is, I think, meaningless. I know some browsers have failed to report errors, but comparing lengths typically suffices to catch this.

That leaves file hashing only as protection against tampering in storage or in transit, which is why I am saying it's security-related. Keyed hashing could mitigate this but, unfortunately, is not commonly used as it requires a PKI.

-----


How does pointing to rainbow tables indicate that salt should be secret? That's one of the things that salt prevents, and it does it without being secret. A 32-char random salt could be publicly published for each of your users and it would still do its job. Secrecy doesn't matter.

Data integrity is not trustworthiness. Data integrity means I got what you said I should get, not that you are trustworthy (and therefore I can trust what I got from you). If you want secure assurance that you got what I really sent, then you need secure assurance that I am who I say I am, and no hash will get that. You need a signed file (or a signed hash). A hash by itself is indeed nothing but a robust checksum in this case.

There's no point in arguing about the md5 for downloads. You're assuming that it's for security, and it's really not. It's primarily for people who are paranoid about corruption. It is also useful for validating when downloading from a mirror, though.

-----


Let's just stop here. Given that we can't even agree on meaning of words, I think continuing this discussion will produce only further disagreements or misunderstandings.

I do, however, appreciate your cordial and elaborate responses for which I thank you.

-----


> What's the point of salt if salt is in plain sight?

If you by "plain sight" mean "the same place as the hash" you're misunderstanding what the salt is for. If you want to access X servers in order to verify a password, you don't need a salt; you just split the hash in X parts and store each on different servers.

Salts are for preventing rainbow tables.

-----


It doesn't matter if the hash is split into X parts and stored on X servers. We are not talking about Humpty Dumpty here, and salt-specific rainbow tables can be built on demand as long as the reward justifies the time and expense.

-----


Of course they can, but the point is that the time and expense is massively increased. Building a rainbow table to run against each separately salted account is way more work than building a single rainbow table and cross referencing it with a giant list of hashes to find some collisions. In general "as long as the reward justifies the time and expense" can be said about any security measure.

Salting has another benefit too: it makes it virtually useless to find a mere password collision.

Say, for example that I use two sites, Spacelook and Squitter, which both use a popular hash called SPARTACUS for their passwords. My password on both sites is "abracadabraSpanishFly_92@prosperity;". An attacker gets into Spacelook's database and does a cross reference with their SPARTACUS rainbow table. I happen to be very unlucky and my nice strong password has a SPARTACUS collision with the very short "a2e34". Now the attacker goes to Squitter, puts in my username and "a2e34" and logs in with no problem even though that's not my password.

Now, suppose that Spacelook and Squitter had each used a different per-site salt (not good enough, but better) when they hashed my password, and this time suppose the attacker builds a custom rainbow table for Spacelook's salt and finds that my salted password collides with salted "fr-_9x32;". If they go to Squitter and try to log in with "fr-_9x32;", Squitter's use of a different salt will cause it to hash differently than my password, and so only my Spacelook account will have been compromised.

-----


Yes, a user-specific salt does raise the cost, but given the usual length and complexity of passwords in use IRL, the increase is not massive enough for high-security servers.

Besides, the point of my original comment was the problem of placing everything in the same basket which includes the code used to hash passwords and files. How would you protect them?

-----


Salt should be user-specific, requiring the construction of many rainbow tables and rendering this kind of attack ineffective.

-----


> rainbow tables can be built on-demand

That's what we call bruteforce…

-----



