Hacker News new | past | comments | ask | show | jobs | submit login
A Better Way to Store Password Hashes? (opine.me)
47 points by zaroth on July 9, 2012 | hide | past | favorite | 46 comments

Or just use bcrypt/scrypt/PBKDF2 in the typical manner. You're much more likely to screw yourself by mis-implementing this than you are to actually gain any real security benefit from it.

This won't make an appreciable difference if your entire db is captured by the attacker (ala LinkedIn). You are just asking them to buy a box (rent a virtual server) with more RAM because the equals operation just became memory intensive. A machine with 128GB of RAM will still give a near-constant result to "does my test hash equal the user's password?"

"Just add more fake hashes", you say. To get around the above, you have to use billions upon billions of fake hashes. The database abuse will eventually catch up to you.

This is trying to accomplish what bcrypt and scrypt do using tunable parameters, which is force the attacker to spend significantly more to break passwords than it costs you to test passwords (because they are doing it billions of times, you are doing it once). Personally, I trust the math more.

This is absolutely only to be used with bcrypt or scrypt, not instead of.

Facebook has 100PB of data, but all the password salts and hashes for their 800m active users fits on a single USB stick. One maligned employee and that data can walk right out the door.

The theft of all your users' passwords should make one heck of a racket on its way out of your data center. A thief should need the equivalent of a 747 to fly them out of there.

With this technique, Facebook could fairly easily make their salts and hashes fill 1PB of drives. You're not sneaking out of the data center with that tucked in your sock!

The problem is, even with the salted hash, it’s still too easy for an attacker to run dictionary attacks if they are able to retrieve a single user’s salt and hash. When your attacker can do 1 billion SHA1 hashes/second with cheap hardware, even slowing it down 10,000 times and they’re still cracking a lot of your users’ passwords.

False. If you use Colin Percival's scrypt to hash your passwords, then it's impossible for an attacker to crack them via brute force.

See previous discussion: http://news.ycombinator.com/item?id=4076197

I am the author of the article. Thanks for pointing out scrypt, I have heard of it before. PBKDF2, bcrypt, scrypt... they all share one thing in common:

They MUST be configured to run fast enough on your server so that your users can login even under high system load.

Let's say you need to support being able to test 1,000 passwords per second on your server (to support your projected peak login traffic). In that case, relatively weak passwords can be cracked, and by relatively weak I mean 95th percentile entropy bits for less than $1,000 on EC2.

This is why I'm trying to think differently about the problem. If we hide the hash in a haystack, the attacker will no longer be able to target an individual user, and theft of the entire hash database should be more easily detectable and preventable.

_Nobody_ is hashing thousands of passwords per second as part of a typical webapp load. Even if somebody was, they would need only ~50 cores for hashing. I'm sure anybody with that much traffic could easily shell out a few thousand bucks for the extra processors without blinking.

Also, an 8 character password hashed with scrypt with a very low (64ms) cost takes on average $4.8 million to brute force[1].

1: http://www.tarsnap.com/scrypt/scrypt.pdf

Thanks for citing your source! You rock.

They are claiming 6.57 * 8 = 52.5 bits of entropy would cost $4.8m to crack a single password in a year.

They also state 4.7 * 8 = 37.6 bits of entropy would cost $150 to crack a single password in a year.

Their own citation of 'A large scale study of web password habbits' http://research.microsoft.com/pubs/74164/www2007.pdf found that the average password has only 40.5 bits of entropy. I'm afraid even scrypt will not save you from 40 bit entropy passwords.

The point is that nothing will "save" a single user for whom there is a dedicated attacker. But you can make it uneconomical to attack the whole database. That's what bcrypt, scrypt et al do.

It's certainly not impossible. scrypt provides reasonable defense against using cheap highly parallel approaches to brute force, but still can obviously be cracked by brute force.

For passwords, current typical attacker does not care about cracking all of them, but cares about finding some (mostly weak) passwords that he can try with other services used by same user.

Ultimately, OP's strategy is just increasing the cost of brute forcing. Why not use the cost increasing features built into the hashing algorithms themselves?

Current algorithms allow you to increase cost in terms of CPU and RAM only. I want to increase cost on as many axis as possible, in this case, by requiring you to steal 1TB of mostly meaningless data, and not just 100 odd bytes.

[Edit] sillysaurus - It's a great image, but a little hard to parse since it doesn't actually show you the bits of entropy for each choice. If they did show a column for 40.5bits (average strength of a user's password) you would see the result is a lot closer to $150 for 'scrypt 64ms' than it is to the very sexy looking $4.8m.

Of course the technique could work even better with scrypt behind it (find/replace 20 bytes with 32 bytes).

It's worse than the average strength would suggest, too. You can't do an arithmetic average and get a meaningful figure because cracking difficulty doubles with every bit.

Say passwords were distributed evenly at 35, 40.5 and 46 bits. You may have a lot of difficulty cracking the 46-bit passwords, but the 35-bit ones will be easy. Now, you can't tell which users have easy passwords, but it's easy enough to try the "top 10,000 passwords" (from previous dumps) on all of the hashes you get and see if any match.

Without the scheme outlined in TFA many users' passwords can be compromised this way. At 64ms each you can run those top 10,000 passwords on each user in less than 11 minutes, and you're almost guaranteed to get some hits. With this new technique, though, you won't, because you'll be lucky to even pick a "real user" in the first place.

EDIT: Of course, if your system is going to remain at all practical to use you can probably filter out the fake passwords with a simple join. You won't have user-data for all of your fake users, and you don't want the username space taken up by trillions of realistic-sounding dummy names preventing bona fide users from registering.

You can't do a join because the hash table contains only hashes - nothing to relate them back to the user record.

This idea uncouples the password from the username and salt, which seems a good idea. But assuming you have access to the database, the additional work required is an indexed lookup instead of a simple equality - not actually a huge deal.

Having said that, when it comes to security I'll defer every time to someone with real chops in this area. Wake me up when Bruce Schneier comments on this.

Does it really matter that the hacker doesn't know which hash belongs to the user? He will still be able to do a dictionary attack using the same method you use to login.

Wouldn't this just make dictionary attacks easier? Now the hacker doesn't have to find one exact password but has the option to match any of his dictionary passwords to any of the password hashes.

I know that there are hardly any collisions and that in practise this wouldn't really change a thing. But in theory the dictionary attack would be faster this way.

1. The point of hashing passwords is to protect the password itself (the plaintext), so that users who use the same password over and over again (which is most of them) don't see all their accounts opened if one of the services they use has a security breach.

2. Collisions are not actually very likely (understatemeeeent)

3. > He will still be able to do a dictionary attack using the same method you use to login.

Sure, but that's not the point. The point is that validating a hash now requires a lookup into terabytes of data, meaning it's much harder to use ASICs or GPUs to brute-force the site, and the validation may even require hitting disk which is extremely expensive compared to even expensive hashings.

4. It also makes retrieving the data that much harder: a users table is not usually big and noticeable (especially just 3 columns thereof), a terabyte+ of data going out might show up on the network stats.

Note that I'm no cryptographer and do not recommend TFA's scheme as I can't judge one way or the other, but your objections don't hold as far as I can see.

Side-note (and weakness) for 4: on the other hand the retrieval is trivially shardable and parallelizable, so at the end of the day you probably don't gain much: the data from the GPU/ASIC hash-computer is fed into a sharded db server for matching against the hash data, it will have a cost impact but depending on the cost of the hashing function itself it may not even increase the overall operation time.

Doesn't this add a new vulnerability? If I can see the salt for a user, I can easily add a new password for that user _without anyone possibly being aware of it until it is used_ by just adding the hash of the desired password with the existing salt to the database.

That's true: in a classic system, the original user will notice its password had been overriden. With 'security through obesity' the original user AND the hacker will be able to login (each using their own password).

Seems like it'd paint you into a bit of a corner if you ever needed to change anything about your password hashing.

Say some mythical unbreakable hash algorithm comes along that's O(1) and you decide you want to move to it, you're forever going to have to drag along these terabytes of junk hashes as you're entire userbase wont log in (giving you the opertunity to change how their password is stored) and since you can't tell the noise from the signal you'll forever have that enormous table of hashes from that time you tried to be extra clever.

I'll leave crypto to the experts :)

This sounds interesting. But, in implementing this, what's the purpose of storing password hashes in your own database? Wouldn't it be safer (and maybe even faster) for a centralized service to handle checking the existence of password hashes and storing them?

Compromising this central database would be meaningless because it would only store password hashes. If an intruder accessed it, they'd just have billions of long strings of random characters -- they wouldn't know where it came from/who uses it.

I think the biggest downside would be that the larger number of password hashes, the more likely duplicates exist. The answer to that, I think, is more randomness. But, I don't know how soon/if that would be a problem.

Another potential downside is performance. Would making a request to this other server make logging in slower? I think the answer to this question depends on the implementation, but there's a guranteed latency of another server to contact before logging in/registering. Beyond that, I think, would be getting too much into the implementation details.

Note -- I know very little about cryptography, but I'd love to learn more.

I am skeptical that this method adds any protection to a brute force attack relative to a standard implementation of PBKDF2 hashes stored with the users.

With proper salts, we have no advantage cracking a password in a database of 10 users or 10 million users. LinkedIn didn't have any salts.

The effectiveness of brute forcing a password is determined entirely by how many passwords they can test for a particular user (because they are salted) in a given period of time. Sure you can add tens of gigabytes of extra data that they have to search through, but your system has to search through it too so it's no different than adding extra iterations to PBKDF2.

It's a question of sizing up what resources the attacker has, versus what they will need to succeed in an attack.

We want to make password cracking extremely difficult. It's a problem a big site should want to throw some money at.

Today, we make it hard by requiring lots of CPU. Another commenter said 'scrypt' -- yes, we want to make it hard by requiring lots of RAM too.

Here's another way to make it hard -- now you'll need to make a clean getaway with multiple TBs of data off my server too, just to hopefully steal ONE user's password.

There is a benefit here, I think.

> We want to make password cracking extremely difficult. It's a problem a big site should want to throw some money at.

Maybe we should start heading back in the other direction, like making our stacks secure and not expecting a data leak in the first place. Sure be prepared with your time/memory hard hash constructs, it's not a solution to writing insecure code in the first place.

Ok, requiring the attacker to download several terabytes could be an interesting advantage. Could we also achieve that by inserting billions of fake users in a traditional [user, salt, hash] setup?

You could, but I don't really see any difference between this and just slapping many tb of unrelated data in the database - the smart attacker will realise that a bunch of accounts are fake and just join to a table that holds only data for real users before extracting the info.

That's part of the elegance (well, it's something other than elegant because it uses a whole mass of data) of the original solution - all of the data could be significant in checking a single password so none can be discarded.

As the OP, I'd say it's as elegant as securing your bike by locking a cinder block to it! ;-)

Implemented correctly (with using e.g. scrypt as the hashing component, and making sure the hashes are large enough so that the chances are neglegible of an attacker finding a match to a different hash than that was originally generated from the users password), this scheme would be no less secure than the traditional way of storing one scrypt hash per user.

The only effective difference would be that the entire database would become a single unit instead of a collection of separate hashes. Both an attacker and your webapp need to carry this extra weight of a monolithic blob of un-dividable data. It probably won't really slow down an attacker trying to brute force it if he has the data, but it may be more difficult to get the data in the first place.

But if an attacker has access to, say 10% of the hashes, he'll still be able to brute force 10% of the user accounts with weak passwords.

A different way to get a similar result of requiring a huge amount of data to be able to start cracking, would be to treat the database like a huge bloom filter: treat the database as a huge bit array of (say) a petabyte, hash the user's password with a hundred different hash functions (but with scrypt-like slowness), and use those 100 hashes as 100 indexes into the array to set the corresponding 100 bits. To verify a password, create those same 100 hashes and check if all 100 bits are set. Now, if an attacker has access to a part of the database, he won't be able to determine with certainty of any of his guesses at the user's passwords are correct.

Yet a third way to accomplish the same goal: pre-generate a petabyte of random data. To hash a user's password, apply a standard scrypt, then based on the resulting hash, generate a 100 pseudorandom offsets into the petabyte of data. At each of those 100 offsets, read a few (say, 16) bytes from our petabyte of random data, and finally store a hash of (scrypt_result + huge_data[offset1] + huge_data[offset2] + ... + huge_data[offset100]). You'd still have one hash per user, but to check a hash you also need access to a huge block of random data. The block of data functions in a way as an additional system-wide salt.

Anyway, there are more ways to get to a similar result as the OP's proposal. I'm not sure if it buys any additional security or if it's just more of a hassle for the webapp implementing this, but at least it's fun to think about.

I think you need to make sure that each user has a unique salt.

If you happen to assign me the same salt as another user then either my or their password will unlock either account and the attacker only needs to guess the weaker of the passwords.

Absolutely correct. A 'salt' by definition, is always random.

For example, you could use a 32-byte salt with scrypt and you would get back a 32-byte hash. Both are equally unlikely to ever collide (see numbers in the article) even with trillions of entries in the table.

With this scheme random isn't quite enough though, it needs to be unique as well doesn't it?

Chances of generating two random 32-byte strings that collide are so tiny that you can as well say that they are unique.

Remember that assumption is the mother of all fuck-ups, although in this case the risk as you say is minimal.

Making an assumption that will fail approximately one time in the entire history of the universe is safe.

It's not an assumption, it's a fact.

For some reason I read that as 32 bits to start with. 32 bytes is a hell of a lot of salt.

One thing that stands out is, it seems like you can never purge this table. If it ever grows so large that it becomes a concern, you'd need to force your users to validate in another way and set a new pw using a fresh db

And what about using a pepper in addition to a salt? Then a potential attacker requires access to application code as well as the database tables.

I can also imagine some kind of DDOS against the database using the linked system.

Wouldn't this setup mean that I can hack into your system with almost any password? As there is no direct connection AND you fill up with dummy hashes - doesn't that effectively mean you drive up the chances of a hack? The more fake hashes, the higher the probability that any random password I brute force may result in a match against the set of hashes. And if that happens - boom, I am in. Or am I missing something?

Definitely not. The back-end takes a user's password and hashes it with that user's specific salt. Then the result must be in the table. Another user's password will not work.

What you are worried about is a result existing in the table, even with the wrong password. That can only happen if you have a hash collision. The article on collisions I cited (http://preshing.com/20110504/hash-collision-probabilities) says that even with 171 trillion hashes already IN the database, the likelihood of a collision is still only 1 in 10^18. And that's using SHA1 -- SHA-256 would be even better.

[Edit] Ooops, I forgot a zero. It's actually 1,710 trillion hashes in the database to make the risk of collision 1 in 10^18. OK, that's 1.71 quadrillion.

Ah OK. So effectively the salt becomes the foreign key ;-)

Actually not a bad idea. You could also hash the salt and use it everywhere instead of the user's ID - anonymize the user from their own data. (If you used the password hash, you would just have to remember to update it everywhere on password changes - and of course use a hash of the hash so you didn't give away which hash was used.)

For instance, let's say I was UserID 123 which had a FK to a table of bookmarks or history and normally it would be easy to link that user to personable data such as Cancer, Job searches, Pr0n, etc. Now instead, you have a lot of these bookmarks pointing to a hash that was used in the initial user login and not directly linked in the database.

Typically in highly sensitive databases you hash out a new "ID" entirely and reference that. Then you provide a different service and database entirely that correlates two different identities when you need identification. This is similar to how PCI requirements for credit cards store the actual numbers elsewhere and use a token against the system for consumption.

NO! This is a bad idea:

"When a user logs in, you retrieve the salt for the given user, re-compute the hash as you normally would, and simply check if the resulting value EXISTS in the Hashes table. If it does, you consider the login as successful."

This will cause false positives. Let's say that of the trillions of hashes, one is the one we want... But what if one brute forces the system? What if one of the brute attempts matches ANY of those trillions? Bam. Access.

This is a very very bad idea.

Do not roll your own security unless you know what the F&#@ you're doing. You are trading a compromised security issue (your attacker already has access enough to get the hashes/users) for a uncompromised security issue (your attacker does not have access to the hashes/users, and is just trying to get in via brute force).

You are making it easier for the non-secure access attacker by making it harder on the secure access hacker. You are making the most common threat bigger to make the least common threat smaller.

Let me explain it using the pigeonhole principle.

Say you have 2 pigeons (passwords) and 1 million pigeon-holes (hashes aka possible passwords). Assuming all the pigeons are in a hole, and you reach into a random hole, what is the chances of pulling out a pigeon (password/hash collision)?

Now, let's say you have one hundred thousand pigeons (What OP is suggesting) and the same million holes... What are your chances of pulling a pigeon out of a random hole this time?

I wish I could use font-size 1000px right now.

Do not do this. Do not do this. This is a compromised system, right off the bat. If you are securing people's private data or possessions in this manner, you are doing them harm.

As more users sign up, the chances of brute forcing actually increases (more hashes to possibly match!). Let me repeat that in a different way: EACH USER MAKES THE SYSTEM LESS SECURE.

Besides all that, let's say you have to do an EXISTS check on a million records. That means you will check each one until you find a match. That is a linear complexity, that is O(n), unless you index the column, which is expensive. Could take a while, depending, but most certainly will take longer than searching for something with a known key that is indexed efficiently (being an index on a primary key), which should give you something like O(log n) which is much faster.

I don't mean to be an ass, but please downvote this so that there is less chance of people implementing this.

OP; you are suggesting a very harmful idea. Remove it if you have any amount of goodwill towards the community.

I work in web security, and used to work for a credit card merchant processor. At one point, I had access to and had to secure millions of accounts with authorization information for their credit cards; this being information one could use to empty a bank, no questions asked.

I Really Know what I'm talking about. THIS IS A BAD IDEA.

Such a long reply repeating the same thing is very obnoxious. Don't do it. If I could downvote you I would.

Do you really know what you're talking about, huh? Your pigeon-hole example is pretty wrong because you can't use intuition (with completely wrong numbers!) to judge, you use math. Let's say there's a trillion pigeons and the hash we're using is 256 bits. That 2^256 ~ 10^77. Divide that by a trillion and you still have 10^65. Vastly more than enough to be secure all the while solving two problems. First that with a specific user's password one can often access other services. Two, and most importantly, that the password database blows up and isn't feasible to steal any more, either over internet or in real life.

This would increase the complexity of the hash verification by log2(N) times, where N is the number of user records. So even if you do have a billion hashes, you only slowed the attackers 30 times. You might as well increase the BCRYPT complexity and skip this scheme.

This should be used with scrypt or bcrypt set to the highest difficulty your application can support.

The point is to increase difficulty of targeting an individual user, and increase difficulty of stealing the data in the first place. This is separate and distinct protection from what scrypt/bcrypt gives you.

[Edit] In a typical GPU accelerated brute force attack, the target hash and salt are both known as fixed 20 or 32-byte values, and you iterate through candidate passwords until you find a match.

In the proposed technique, the target hash is NOT known! Instead, what you do know is a salt and a list of a few billion hashes, ONE of which can be obtained by combining the salt with some unknown password. This is a very different equation, and I believe, much harder to solve, since your target value can't just be kept in a register or L1 cache for comparison purposes.

Actually, I was wrong about O(log(N)), the lookup time is O(1), you just have to put all known hashes in a map structure, or even better, a rainbow hash.

So your technique only adds some constant lookup time.

Instead of

    if (hash == target_hash) ...
you will do

    if (map[hash]) ....
Sorry, I think you are wrong.

eh, this is amortized O(1) and you are almost guaranteed slow lookups for a large number of requests.

don't confuse true O(1) with amortized ;)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact