"Just add more fake hashes", you say. To get around the above, you'd have to use billions upon billions of fake hashes, and the database bloat will eventually catch up with you.
This is trying to accomplish what bcrypt and scrypt do using tunable parameters, which is force the attacker to spend significantly more to break passwords than it costs you to test passwords (because they are doing it billions of times, you are doing it once). Personally, I trust the math more.
Facebook has 100PB of data, but the password salts and hashes for all of their 800m active users fit on a single USB stick. One malicious employee and that data can walk right out the door.
The theft of all your users' passwords should make one heck of a racket on its way out of your data center. A thief should need the equivalent of a 747 to fly them out of there.
With this technique, Facebook could fairly easily make their salts and hashes fill 1PB of drives. You're not sneaking out of the data center with that tucked in your sock!
False. If you use Colin Percival's scrypt to hash your passwords, then it's computationally infeasible for an attacker to crack them via brute force.
See previous discussion: http://news.ycombinator.com/item?id=4076197
They MUST be configured to run fast enough on your server so that your users can login even under high system load.
Let's say you need to support testing 1,000 passwords per second on your server (to handle your projected peak login traffic). In that case, relatively weak passwords can be cracked, and by relatively weak I mean passwords at the 95th percentile of entropy, for less than $1,000 on EC2.
This is why I'm trying to think differently about the problem. If we hide the hash in a haystack, the attacker will no longer be able to target an individual user, and theft of the entire hash database should be more easily detectable and preventable.
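A minimal sketch of the haystack idea as I understand it (all names are mine; SHA-256 stands in for whatever slow KDF you'd actually use, and a Python set stands in for the Hashes table):

```python
import hashlib
import os

salts = {}      # username -> salt: the only per-user record kept
hashes = set()  # the "Hashes" table: real hashes mixed with decoys

def register(username: str, password: str) -> None:
    salt = os.urandom(16)
    salts[username] = salt
    hashes.add(hashlib.sha256(salt + password.encode()).digest())

def add_decoys(n: int) -> None:
    # Random fake hashes are indistinguishable from real ones.
    for _ in range(n):
        hashes.add(os.urandom(32))

def login(username: str, password: str) -> bool:
    salt = salts[username]
    h = hashlib.sha256(salt + password.encode()).digest()
    return h in hashes  # existence check only; no hash is tied to a user
```

The key property is in the last line: the table answers only "does this hash exist somewhere", so an attacker holding the table can't tell which entries belong to which (or any) user.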
Also, an 8-character password hashed with scrypt at a very low (64ms) cost takes on average $4.8 million to brute force.
They are claiming 6.57 * 8 = 52.5 bits of entropy would cost $4.8m to crack a single password in a year.
They also state 4.7 * 8 = 37.6 bits of entropy would cost $150 to crack a single password in a year.
Their own citation of 'A Large-Scale Study of Web Password Habits' http://research.microsoft.com/pubs/74164/www2007.pdf found that the average password has only 40.5 bits of entropy. I'm afraid even scrypt will not save you from 40-bit-entropy passwords.
The typical attacker today does not care about cracking all of the passwords; he cares about finding some (mostly weak) passwords that he can try against other services used by the same users.
[Edit] sillysaurus - It's a great image, but a little hard to parse since it doesn't actually show you the bits of entropy for each choice. If they did show a column for 40.5 bits (the average strength of a user's password), you would see the result is a lot closer to $150 for 'scrypt 64ms' than it is to the very sexy-looking $4.8m.
Of course the technique could work even better with scrypt behind it (find/replace 20 bytes with 32 bytes).
Say passwords were distributed evenly at 35, 40.5 and 46 bits. You may have a lot of difficulty cracking the 46-bit passwords, but the 35-bit ones will be easy. Now, you can't tell which users have easy passwords, but it's easy enough to try the "top 10,000 passwords" (from previous dumps) on all of the hashes you get and see if any match.
Without the scheme outlined in TFA, many users' passwords can be compromised this way. At 64ms each, you can run those top 10,000 passwords against each user in less than 11 minutes, and you're almost guaranteed to get some hits. With this new technique, though, you won't, because you'll be lucky to even pick a "real user" in the first place.
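The 11-minute figure is easy to check, assuming the 64ms scrypt cost discussed elsewhere in the thread:

```python
# Time to try the top 10,000 passwords against one user's salt,
# at an assumed 64 ms per hash evaluation, on a single core.
GUESSES = 10_000
SECONDS_PER_HASH = 0.064

total_seconds = GUESSES * SECONDS_PER_HASH
print(total_seconds / 60)  # ~10.7 minutes per user
```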
EDIT: Of course, if your system is going to remain at all practical to use, you can probably filter out the fake passwords with a simple join. You won't have user data for all of your fake users, and you don't want the username space taken up by trillions of realistic-sounding dummy names preventing bona fide users from registering.
This idea uncouples the password from the username and salt, which seems a good idea. But assuming you have access to the database, the additional work required is an indexed lookup instead of a simple equality - not actually a huge deal.
Having said that, when it comes to security I'll defer every time to someone with real chops in this area. Wake me up when Bruce Schneier comments on this.
Wouldn't this just make dictionary attacks easier? Now the hacker doesn't have to find one exact password but has the option to match any of his dictionary passwords to any of the password hashes.
I know that there are hardly any collisions and that in practice this wouldn't really change a thing. But in theory the dictionary attack would be faster this way.
2. Collisions are not actually very likely (understatemeeeent)
3. > He will still be able to do a dictionary attack using the same method you use to login.
Sure, but that's not the point. The point is that validating a hash now requires a lookup into terabytes of data, meaning it's much harder to use ASICs or GPUs to brute-force the site, and the validation may even require hitting disk, which is extremely expensive compared to even a costly hash function.
4. It also makes retrieving the data that much harder: a users table is not usually big and noticeable (especially just 3 columns thereof), a terabyte+ of data going out might show up on the network stats.
Note that I'm no cryptographer and do not recommend TFA's scheme as I can't judge one way or the other, but your objections don't hold as far as I can see.
Side-note (and weakness) for 4: on the other hand, the retrieval is trivially shardable and parallelizable, so at the end of the day you probably don't gain much. The data from the GPU/ASIC hash-computer is fed into a sharded DB server for matching against the hash data; it will have a cost impact, but depending on the cost of the hashing function itself it may not even increase the overall operation time.
Say some mythical unbreakable hash algorithm comes along that's O(1) and you decide you want to move to it. You're forever going to have to drag along these terabytes of junk hashes, because your entire userbase won't all log in (logging in being your opportunity to change how their password is stored), and since you can't tell the noise from the signal you'll forever keep that enormous table of hashes from the time you tried to be extra clever.
I'll leave crypto to the experts :)
Compromising this central database would be meaningless because it would only store password hashes. If an intruder accessed it, they'd just have billions of long strings of random characters; they wouldn't know where they came from or who uses them.
I think the biggest downside would be that the larger the number of password hashes, the more likely duplicates are to exist. The answer to that, I think, is more randomness. But I don't know how soon, or if, that would become a problem.
Another potential downside is performance. Would making a request to this other server make logging in slower? I think the answer depends on the implementation, but there's a guaranteed latency of contacting another server before logging in or registering. Beyond that, I think, would be getting too much into the implementation details.
Note -- I know very little about cryptography, but I'd love to learn more.
With proper salts, we have no advantage cracking a password in a database of 10 users or 10 million users. LinkedIn didn't have any salts.
The effectiveness of brute forcing a password is determined entirely by how many passwords they can test for a particular user (because they are salted) in a given period of time. Sure you can add tens of gigabytes of extra data that they have to search through, but your system has to search through it too so it's no different than adding extra iterations to PBKDF2.
We want to make password cracking extremely difficult. It's a problem a big site should want to throw some money at.
Today, we make it hard by requiring lots of CPU. Another commenter said 'scrypt' -- yes, we want to make it hard by requiring lots of RAM too.
Here's another way to make it hard -- now you'll need to make a clean getaway with multiple TBs of data off my server too, just to hopefully steal ONE user's password.
There is a benefit here, I think.
Maybe we should start heading back in the other direction, like making our stacks secure and not expecting a data leak in the first place. Sure, be prepared with your time/memory-hard hash constructs, but they're not a solution to writing insecure code in the first place.
That's part of the elegance (well, it's something other than elegant because it uses a whole mass of data) of the original solution - all of the data could be significant in checking a single password so none can be discarded.
The only effective difference would be that the entire database becomes a single unit instead of a collection of separate hashes. Both an attacker and your webapp need to carry the extra weight of a monolithic blob of indivisible data. It probably won't really slow down an attacker trying to brute force it if he has the data, but it may be more difficult to get the data in the first place.
But if an attacker has access to, say 10% of the hashes, he'll still be able to brute force 10% of the user accounts with weak passwords.
A different way to get a similar result of requiring a huge amount of data before cracking can even start would be to treat the database like a huge Bloom filter: treat the database as a huge bit array of (say) a petabyte, hash the user's password with a hundred different hash functions (but with scrypt-like slowness), and use those 100 hashes as 100 indexes into the array, setting the corresponding 100 bits. To verify a password, create those same 100 hashes and check whether all 100 bits are set. Now, if an attacker has access to only part of the database, he won't be able to determine with certainty whether any of his guesses at the user's password are correct.
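A toy sketch of that Bloom-filter variant, at megabyte scale instead of a petabyte, with a fast domain-separated SHA-256 standing in for 100 scrypt-slow hash functions (so this is illustrative only, not a secure construction):

```python
import hashlib

BITS = 2**20      # toy size; the proposal above is ~a petabyte of bits
NUM_HASHES = 100  # 100 independent (ideally memory-hard) hash functions

bitmap = bytearray(BITS // 8)

def _indexes(password: str, salt: bytes):
    # Stand-in for 100 slow hash functions: domain-separate one fast
    # hash with an index byte. A real deployment would use scrypt here.
    for i in range(NUM_HASHES):
        h = hashlib.sha256(salt + bytes([i]) + password.encode()).digest()
        yield int.from_bytes(h[:8], "big") % BITS

def set_password(password: str, salt: bytes) -> None:
    for idx in _indexes(password, salt):
        bitmap[idx // 8] |= 1 << (idx % 8)

def check_password(password: str, salt: bytes) -> bool:
    # A password verifies only if all 100 of its bits are set.
    return all(bitmap[idx // 8] & (1 << (idx % 8))
               for idx in _indexes(password, salt))
```

An attacker holding only a fraction of the bit array sees some of a guess's bits missing and can't tell "wrong guess" from "bit is in the part I don't have".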
Yet a third way to accomplish the same goal: pre-generate a petabyte of random data. To hash a user's password, apply standard scrypt, then, based on the resulting hash, generate 100 pseudorandom offsets into the petabyte of data. At each of those 100 offsets, read a few (say, 16) bytes from our petabyte of random data, and finally store a hash of (scrypt_result + huge_data[offset1] + huge_data[offset2] + ... + huge_data[offset100]). You'd still have one hash per user, but to check a hash you'd also need access to a huge block of random data. The block of data functions, in a way, as an additional system-wide salt.
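And a toy sketch of this third variant, with a megabyte of random data in place of a petabyte and SHA-256 in place of scrypt (names and sizes are mine, purely for illustration):

```python
import hashlib
import os

# Stand-in for "a petabyte of random data"; regenerated here per run,
# but in the real scheme it would be fixed once, system-wide.
HUGE = os.urandom(2**20)
OFFSETS = 100
CHUNK = 16

def hash_password(password: str, salt: bytes) -> bytes:
    # Stand-in for scrypt(salt, password).
    base = hashlib.sha256(salt + password.encode()).digest()
    parts = [base]
    for i in range(OFFSETS):
        # Derive 100 pseudorandom offsets from the base hash.
        o = int.from_bytes(hashlib.sha256(base + bytes([i])).digest()[:8],
                           "big") % (len(HUGE) - CHUNK)
        parts.append(HUGE[o:o + CHUNK])
    # One stored hash per user, but verifying it needs HUGE.
    return hashlib.sha256(b"".join(parts)).digest()
```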
Anyway, there are more ways to get to a similar result as the OP's proposal. I'm not sure if it buys any additional security or if it's just more of a hassle for the webapp implementing this, but at least it's fun to think about.
If you happen to assign me the same salt as another user then either my or their password will unlock either account and the attacker only needs to guess the weaker of the passwords.
For example, you could use a 32-byte salt with scrypt and you would get back a 32-byte hash. Both are equally unlikely to ever collide (see numbers in the article) even with trillions of entries in the table.
I can also imagine some kind of DDoS against the database using the linked system.
What you are worried about is a result existing in the table, even with the wrong password. That can only happen if you have a hash collision. The article on collisions I cited (http://preshing.com/20110504/hash-collision-probabilities) says that even with 171 trillion hashes already IN the database, the likelihood of a collision is still only 1 in 10^18. And that's using SHA1 -- SHA-256 would be even better.
[Edit] Oops, I forgot a zero. It's actually 1,710 trillion hashes in the database to make the risk of collision 1 in 10^18. OK, that's 1.71 quadrillion.
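That figure is easy to verify with the standard birthday-bound approximation p ≈ n²/(2·2^b):

```python
# Birthday-bound collision probability for n hashes in a b-bit space.
n = 1.71e15  # 1.71 quadrillion hashes already in the table
b = 160      # SHA-1 output size in bits
p = n * n / (2 * 2**b)
print(p)     # ~1e-18, matching the 1-in-10^18 figure
```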
For instance, let's say I'm UserID 123, which has an FK to a table of bookmarks or history; normally it would be easy to link that user to personal data such as cancer, job searches, pr0n, etc. Now, instead, a lot of those bookmarks point to a hash that was used in the initial user login and is not directly linked in the database.
Typically in highly sensitive databases you hash out a new "ID" entirely and reference that. Then you provide a different service and database entirely that correlates two different identities when you need identification. This is similar to how PCI requirements for credit cards store the actual numbers elsewhere and use a token against the system for consumption.
"When a user logs in, you retrieve the salt for the given user, re-compute the hash as you normally would, and simply check if the resulting value EXISTS in the Hashes table. If it does, you consider the login as successful."
This will cause false positives. Let's say that of the trillions of hashes, one is the one we want... But what if someone brute forces the system? What if one of the brute-force attempts matches ANY of those trillions? Bam. Access.
This is a very very bad idea.
Do not roll your own security unless you know what the F&#@ you're doing. You are trading a compromised security issue (your attacker already has enough access to get the hashes/users) for an uncompromised security issue (your attacker does not have access to the hashes/users, and is just trying to get in via brute force).
You are making it easier for the non-secure access attacker by making it harder on the secure access hacker. You are making the most common threat bigger to make the least common threat smaller.
Let me explain it using the pigeonhole principle.
Say you have 2 pigeons (passwords) and 1 million pigeonholes (hashes, a.k.a. possible passwords). Assuming all the pigeons are in a hole, and you reach into a random hole, what are the chances of pulling out a pigeon (password/hash collision)?
Now, let's say you have one hundred thousand pigeons (What OP is suggesting) and the same million holes... What are your chances of pulling a pigeon out of a random hole this time?
I wish I could use font-size 1000px right now.
Do not do this. Do not do this. This is a compromised system, right off the bat. If you are securing people's private data or possessions in this manner, you are doing them harm.
As more users sign up, the chances of brute forcing actually increases (more hashes to possibly match!). Let me repeat that in a different way: EACH USER MAKES THE SYSTEM LESS SECURE.
Besides all that, let's say you have to do an EXISTS check on a million records. That means you check each one until you find a match: linear complexity, O(n), unless you index the column, which is expensive. It could take a while, depending, but it will most certainly take longer than searching for something with a known key that is indexed efficiently (an index on a primary key), which should give you something like O(log n) and is much faster.
I don't mean to be an ass, but please downvote this so that there is less chance of people implementing this.
OP; you are suggesting a very harmful idea. Remove it if you have any amount of goodwill towards the community.
I work in web security, and used to work for a credit card merchant processor. At one point, I had access to and had to secure millions of accounts with authorization information for their credit cards; this being information one could use to empty a bank, no questions asked.
I Really Know what I'm talking about. THIS IS A BAD IDEA.
Do you really know what you're talking about, huh? Your pigeonhole example is pretty wrong, because you can't use intuition (with completely wrong numbers!) to judge this; you use math. Let's say there's a trillion pigeons and the hash we're using is 256 bits. That's 2^256 ~ 10^77. Divide that by a trillion and you still have 10^65: vastly more than enough to be secure, all the while solving two problems. First, that with a specific user's password one can often access other services. Second, and most importantly, that the password database blows up in size and isn't feasible to steal any more, either over the internet or in real life.
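The per-guess false-positive odds are just (hashes in table) / (hash space), which makes the contrast with the pigeonhole numbers explicit:

```python
# Probability that one random guess's hash matches SOME table entry.
toy = 100_000 / 1_000_000  # the pigeonhole example: 1e5 pigeons, 1e6 holes
real = 10**12 / 2**256     # a trillion hashes in a 256-bit space
print(toy)   # 0.1
print(real)  # ~8.6e-66, i.e. negligible
```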
The point is to increase difficulty of targeting an individual user, and increase difficulty of stealing the data in the first place. This is separate and distinct protection from what scrypt/bcrypt gives you.
[Edit] In a typical GPU accelerated brute force attack, the target hash and salt are both known as fixed 20 or 32-byte values, and you iterate through candidate passwords until you find a match.
In the proposed technique, the target hash is NOT known! Instead, what you do know is a salt and a list of a few billion hashes, ONE of which can be obtained by combining the salt with some unknown password. This is a very different equation and, I believe, much harder to solve, since your target value can't just be kept in a register or L1 cache for comparison purposes.
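The difference in the attacker's inner loop, sketched (a hypothetical `kdf` callable stands in for scrypt-with-salt; names are mine):

```python
# Conventional attack: the single known target hash fits in a
# register, so the loop is pure compute.
def crack_known(target, salt, candidates, kdf):
    for pw in candidates:
        if kdf(salt, pw) == target:
            return pw
    return None

# Proposed scheme: no known target. Every candidate needs a membership
# probe into a billions-entry (terabytes-scale) structure, which won't
# fit in GPU registers or L1 cache.
def crack_haystack(hashes, salt, candidates, kdf):
    for pw in candidates:
        if kdf(salt, pw) in hashes:
            return pw
    return None
```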
So your technique only adds some constant lookup time.
if (hash == target_hash) ...  // known target: a single register compare
if (map[hash]) ...            // haystack: an amortized-O(1) map lookup
don't confuse true O(1) with amortized ;)