"When a user logs in, you retrieve the salt for the given user, re-compute the hash as you normally would, and simply check if the resulting value EXISTS in the Hashes table. If it does, you consider the login as successful."
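The quoted scheme can be sketched like this (a minimal illustration with made-up names, not the article's actual code; plain SHA-256 stands in for whatever slow hash the real system would use):

```python
import hashlib
import os

salts = {}      # username -> per-user salt, stored with the user record
hashes = set()  # global, anonymous pool of hashes, not linked to any user

def register(username, password):
    salt = os.urandom(16)
    salts[username] = salt
    # The hash goes into the shared pool with no pointer back to the user.
    hashes.add(hashlib.sha256(salt + password.encode()).digest())

def login(username, password):
    # Recompute the hash and check mere EXISTENCE in the shared pool.
    salt = salts[username]
    return hashlib.sha256(salt + password.encode()).digest() in hashes

register("alice", "correct horse")
assert login("alice", "correct horse")
assert not login("alice", "wrong guess")
```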
This will cause false positives. Let's say that of the trillions of hashes, one is the one we want... But what if one brute forces the system? What if one of the brute attempts matches ANY of those trillions? Bam. Access.
This is a very very bad idea.
Do not roll your own security unless you know what the F&#@ you're doing. You are trading a compromised security issue (your attacker already has enough access to get the hashes/users) for an uncompromised security issue (your attacker does not have access to the hashes/users, and is just trying to get in via brute force).
You are making it easier for the attacker without access by making things harder for the attacker who already has access. You are making the most common threat bigger to make the least common threat smaller.
Let me explain it using the pigeonhole principle.
Say you have 2 pigeons (passwords) and 1 million pigeon-holes (hashes, a.k.a. possible passwords). Assuming all the pigeons are in holes, and you reach into a random hole, what are the chances of pulling out a pigeon (password/hash collision)?
Now, let's say you have one hundred thousand pigeons (What OP is suggesting) and the same million holes... What are your chances of pulling a pigeon out of a random hole this time?
I wish I could use font-size 1000px right now.
Do not do this. Do not do this. This is a compromised system, right off the bat. If you are securing people's private data or possessions in this manner, you are doing them harm.
As more users sign up, the chances of brute forcing actually increases (more hashes to possibly match!). Let me repeat that in a different way: EACH USER MAKES THE SYSTEM LESS SECURE.
Besides all that, let's say you have to do an EXISTS check on a million records. That means you check each one until you find a match. That's linear complexity, O(n), unless you index the column, which is expensive. It could take a while, but it will most certainly take longer than searching with a known key that is indexed efficiently (such as an index on a primary key), which should give you something like O(log n), which is much faster.
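The comparison being made here (unindexed EXISTS scan versus an indexed key lookup) can be seen in miniature with a linear scan versus a binary search over a sorted list:

```python
import bisect

# A million fake hash values, sorted (the zero-padding keeps sort order sane).
hashes = sorted(f"hash{i:07d}" for i in range(1_000_000))
target = "hash0999999"

# Unindexed EXISTS: worst case walks every row -- O(n).
found_linear = any(h == target for h in hashes)

# Indexed lookup: binary search over the sorted column -- O(log n),
# ~20 comparisons for a million rows instead of up to a million.
i = bisect.bisect_left(hashes, target)
found_indexed = i < len(hashes) and hashes[i] == target

assert found_linear and found_indexed
```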
I don't mean to be an ass, but please downvote this so that there is less chance of people implementing this.
OP, you are suggesting a very harmful idea. Remove it if you have any amount of goodwill towards the community.
I work in web security, and used to work for a credit card merchant processor. At one point, I had access to and had to secure millions of accounts with authorization information for their credit cards; this being information one could use to empty a bank, no questions asked.
I Really Know what I'm talking about. THIS IS A BAD IDEA.
Such a long reply repeating the same thing is very obnoxious. Don't do it. If I could downvote you I would.
Do you really know what you're talking about, huh? Your pigeon-hole example is pretty wrong: you can't judge this by intuition (with completely wrong numbers!), you use math. Let's say there are a trillion pigeons and the hash we're using is 256 bits. That's 2^256 ≈ 10^77. Divide that by a trillion and you still have 10^65. Vastly more than enough to be secure, all while solving two problems. First, that with a specific user's password one can often access other services. Second, and most importantly, that the password database blows up in size and is no longer feasible to steal, either over the internet or in real life.
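The arithmetic here is easy to reproduce (order-of-magnitude only):

```python
import math

keyspace = 2 ** 256   # distinct 256-bit hash values, ~10^77
stored = 10 ** 12     # a trillion hashes in the table

# Fraction of the keyspace left untouched: even with a trillion entries,
# a random hash value has ~1 in 10^65 odds of landing on a stored one.
remaining_magnitude = math.log10(keyspace // stored)
assert 64 < remaining_magnitude < 66  # ~10^65, as the comment says
```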
This won't make an appreciable difference if your entire db is captured by the attacker (ala LinkedIn). You are just asking them to buy a box (rent a virtual server) with more RAM because the equals operation just became memory intensive. A machine with 128GB of RAM will still give a near-constant result to "does my test hash equal the user's password?"
"Just add more fake hashes", you say. To get around the above, you have to use billions upon billions of fake hashes. The database abuse will eventually catch up to you.
This is trying to accomplish what bcrypt and scrypt do using tunable parameters, which is force the attacker to spend significantly more to break passwords than it costs you to test passwords (because they are doing it billions of times, you are doing it once). Personally, I trust the math more.
The problem is, even with the salted hash, it's still too easy for an attacker to run dictionary attacks if they are able to retrieve a single user's salt and hash. When your attacker can do 1 billion SHA1 hashes/second with cheap hardware, even if you slow them down 10,000 times they're still cracking a lot of your users' passwords.
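Putting rough numbers on that claim (figures from the comment, not measurements):

```python
raw_rate = 1_000_000_000  # SHA1 guesses/sec on cheap hardware (comment's figure)
slowdown = 10_000         # e.g. 10,000 hash iterations per guess

slowed_rate = raw_rate // slowdown  # still 100,000 guesses/sec per user

# A 10-million-entry dictionary still falls in under two minutes per user.
dictionary = 10_000_000
seconds = dictionary / slowed_rate
assert seconds == 100.0
```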
False. If you use Colin Percival's scrypt to hash your passwords, then it's impossible for an attacker to crack them via brute force.
I am the author of the article. Thanks for pointing out scrypt, I have heard of it before. PBKDF2, bcrypt, scrypt... they all share one thing in common:
They MUST be configured to run fast enough on your server so that your users can login even under high system load.
Let's say you need to support being able to test 1,000 passwords per second on your server (to support your projected peak login traffic). In that case, relatively weak passwords can be cracked, and by relatively weak I mean 95th percentile entropy bits for less than $1,000 on EC2.
This is why I'm trying to think differently about the problem. If we hide the hash in a haystack, the attacker will no longer be able to target an individual user, and theft of the entire hash database should be more easily detectable and preventable.
_Nobody_ is hashing thousands of passwords per second as part of a typical webapp load. Even if somebody was, they would need only ~50 cores for hashing. I'm sure anybody with that much traffic could easily shell out a few thousand bucks for the extra processors without blinking.
Also, an 8 character password hashed with scrypt with a very low (64ms) cost takes on average $4.8 million to brute force.
Current algorithms allow you to increase cost in terms of CPU and RAM only. I want to increase cost on as many axes as possible, in this case by requiring you to steal 1TB of mostly meaningless data, and not just 100-odd bytes.
[Edit] sillysaurus - It's a great image, but a little hard to parse since it doesn't actually show you the bits of entropy for each choice. If it did show a column for 40.5 bits (the average strength of a user's password), you would see the result is a lot closer to $150 for 'scrypt 64ms' than it is to the very sexy-looking $4.8m.
Of course the technique could work even better with scrypt behind it (find/replace 20 bytes with 32 bytes).
It's worse than the average strength would suggest, too. You can't do an arithmetic average and get a meaningful figure because cracking difficulty doubles with every bit.
Say passwords were distributed evenly at 35, 40.5 and 46 bits. You may have a lot of difficulty cracking the 46-bit passwords, but the 35-bit ones will be easy. Now, you can't tell which users have easy passwords, but it's easy enough to try the "top 10,000 passwords" (from previous dumps) on all of the hashes you get and see if any match.
Without the scheme outlined in TFA many users' passwords can be compromised this way. At 64ms each you can run those top 10,000 passwords on each user in less than 11 minutes, and you're almost guaranteed to get some hits. With this new technique, though, you won't, because you'll be lucky to even pick a "real user" in the first place.
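The timing claim checks out (using the comment's 64 ms-per-guess figure):

```python
top_passwords = 10_000  # "top 10,000 passwords" from previous dumps
cost_per_guess = 0.064  # seconds per guess at the scrypt 64ms setting

seconds_per_user = top_passwords * cost_per_guess  # 640 s
assert seconds_per_user / 60 < 11  # ~10.7 minutes, "less than 11"
```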
EDIT: Of course, if your system is going to remain at all practical to use you can probably filter out the fake passwords with a simple join. You won't have user-data for all of your fake users, and you don't want the username space taken up by trillions of realistic-sounding dummy names preventing bona fide users from registering.
You can't do a join because the hash table contains only hashes - nothing to relate them back to the user record.
This idea uncouples the password from the username and salt, which seems a good idea. But assuming you have access to the database, the additional work required is an indexed lookup instead of a simple equality - not actually a huge deal.
Having said that, when it comes to security I'll defer every time to someone with real chops in this area. Wake me up when Bruce Schneier comments on this.
1. The point of hashing passwords is to protect the password itself (the plaintext), so that users who use the same password over and over again (which is most of them) don't see all their accounts opened if one of the services they use has a security breach.
2. Collisions are not actually very likely (understatemeeeent)
3. > He will still be able to do a dictionary attack using the same method you use to login.
Sure, but that's not the point. The point is that validating a hash now requires a lookup into terabytes of data, meaning it's much harder to use ASICs or GPUs to brute-force the site, and the validation may even require hitting disk which is extremely expensive compared to even expensive hashings.
4. It also makes retrieving the data that much harder: a users table is not usually big and noticeable (especially just 3 columns thereof), a terabyte+ of data going out might show up on the network stats.
Note that I'm no cryptographer and do not recommend TFA's scheme as I can't judge one way or the other, but your objections don't hold as far as I can see.
Side-note (and weakness) for 4: on the other hand, the retrieval is trivially shardable and parallelizable, so at the end of the day you probably don't gain much. The data from the GPU/ASIC hash-computer is fed into a sharded DB server for matching against the hash data; it will have a cost impact, but depending on the cost of the hashing function itself it may not even increase the overall operation time.
Implemented correctly (using e.g. scrypt as the hashing component, and making the hashes large enough that the chance is negligible of an attacker finding a match against a hash other than the one originally generated from the user's password), this scheme would be no less secure than the traditional way of storing one scrypt hash per user.
The only effective difference would be that the entire database would become a single unit instead of a collection of separate hashes. Both an attacker and your webapp need to carry this extra weight of a monolithic, indivisible blob of data. It probably won't really slow down an attacker trying to brute force it if he has the data, but it may be more difficult to get the data in the first place.
But if an attacker has access to, say 10% of the hashes, he'll still be able to brute force 10% of the user accounts with weak passwords.
A different way to get a similar result (requiring a huge amount of data before cracking can even start) would be to treat the database like a huge Bloom filter: treat the database as a huge bit array of (say) a petabyte, hash the user's password with a hundred different hash functions (but with scrypt-like slowness), and use those 100 hashes as 100 indexes into the array, setting the corresponding 100 bits. To verify a password, create those same 100 hashes and check that all 100 bits are set. Now, if an attacker has access to only part of the database, he won't be able to determine with certainty whether any of his guesses at the user's password are correct.
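A toy version of this Bloom-filter idea, with a small bit set and fast SHA-256 hashes standing in for the petabyte array and the scrypt-slow hashes (all names here are made up):

```python
import hashlib

BITS = 2 ** 20  # stand-in for the petabyte-scale bit array
K = 100         # number of independent bit positions per password

def indexes(salt, password):
    # Derive K bit positions; a real system would use K slow, scrypt-like hashes.
    for i in range(K):
        h = hashlib.sha256(salt + i.to_bytes(2, "big") + password.encode()).digest()
        yield int.from_bytes(h, "big") % BITS

bitset = set()  # positions of set bits in the array

def store(salt, password):
    bitset.update(indexes(salt, password))

def check(salt, password):
    # All K bits must be set; with only part of the array, this is undecidable.
    return all(i in bitset for i in indexes(salt, password))

salt = b"per-user-salt"
store(salt, "correct horse")
assert check(salt, "correct horse")
assert not check(salt, "battery staple")
```

With only 100 bits set out of a million, the chance of a wrong password accidentally passing all 100 checks is astronomically small; at real scale, array size and K tune that false-positive rate.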
Yet a third way to accomplish the same goal: pre-generate a petabyte of random data. To hash a user's password, apply a standard scrypt, then based on the resulting hash, generate a 100 pseudorandom offsets into the petabyte of data. At each of those 100 offsets, read a few (say, 16) bytes from our petabyte of random data, and finally store a hash of (scrypt_result + huge_data[offset1] + huge_data[offset2] + ... + huge_data[offset100]). You'd still have one hash per user, but to check a hash you also need access to a huge block of random data. The block of data functions in a way as an additional system-wide salt.
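A sketch of this third variant, with plain SHA-256 standing in for scrypt and a 1 MB buffer standing in for the petabyte of random data (all names are made up):

```python
import hashlib
import os

HUGE = os.urandom(1 << 20)  # stand-in for the petabyte of pre-generated data
K = 100                     # number of offsets to read
CHUNK = 16                  # bytes read at each offset

def hash_password(salt, password):
    # Step 1: a normal salted hash (scrypt in the real scheme).
    inner = hashlib.sha256(salt + password.encode()).digest()
    # Step 2: derive K pseudorandom offsets from that result and read
    # CHUNK bytes of the huge data blob at each one.
    parts = [inner]
    for i in range(K):
        off = int.from_bytes(
            hashlib.sha256(inner + i.to_bytes(2, "big")).digest(), "big"
        ) % (len(HUGE) - CHUNK)
        parts.append(HUGE[off:off + CHUNK])
    # Step 3: the final hash binds the password to the system-wide blob,
    # which acts as an additional system-wide salt.
    return hashlib.sha256(b"".join(parts)).digest()

salt = os.urandom(16)
stored = hash_password(salt, "hunter2")
assert hash_password(salt, "hunter2") == stored
assert hash_password(salt, "hunter3") != stored
```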
Anyway, there are more ways to get to a similar result as the OP's proposal. I'm not sure if it buys any additional security or if it's just more of a hassle for the webapp implementing this, but at least it's fun to think about.
Doesn't this add a new vulnerability? If I can see the salt for a user, I can easily add a new password for that user _without anyone possibly being aware of it until it is used_ by just adding the hash of the desired password with the existing salt to the database.
That's true: in a classic system, the original user will notice their password has been overridden. With 'security through obesity', both the original user AND the hacker will be able to log in (each using their own password).
Seems like it'd paint you into a bit of a corner if you ever needed to change anything about your password hashing.
Say some mythical unbreakable hash algorithm comes along that's O(1) and you decide you want to move to it. You're forever going to have to drag along these terabytes of junk hashes: your entire userbase won't all log in (each login being your opportunity to change how that password is stored), and since you can't tell the noise from the signal, you'll forever keep that enormous table of hashes from that time you tried to be extra clever.
This sounds interesting. But, in implementing this, what's the purpose of storing password hashes in your own database? Wouldn't it be safer (and maybe even faster) for a centralized service to handle checking the existence of password hashes and storing them?
Compromising this central database would be meaningless because it would only store password hashes. If an intruder accessed it, they'd just have billions of long strings of random characters -- they wouldn't know where it came from/who uses it.
I think the biggest downside would be that the larger the number of password hashes, the more likely duplicates become. The answer to that, I think, is more randomness. But I don't know how soon, or if, that would become a problem.
Another potential downside is performance. Would making a request to this other server make logging in slower? I think the answer depends on the implementation, but there's a guaranteed latency of another server to contact before logging in/registering. Beyond that, I think, would be getting too much into the implementation details.
Note -- I know very little about cryptography, but I'd love to learn more.
I am skeptical that this method adds any protection to a brute force attack relative to a standard implementation of PBKDF2 hashes stored with the users.
With proper salts, we have no advantage cracking a password in a database of 10 users or 10 million users. LinkedIn didn't have any salts.
The effectiveness of brute forcing a password is determined entirely by how many passwords they can test for a particular user (because they are salted) in a given period of time. Sure you can add tens of gigabytes of extra data that they have to search through, but your system has to search through it too so it's no different than adding extra iterations to PBKDF2.
> We want to make password cracking extremely difficult. It's a problem a big site should want to throw some money at.
Maybe we should start heading back in the other direction: making our stacks secure and not expecting a data leak in the first place. Sure, be prepared with your time/memory-hard hash constructs, but they're not a solution to writing insecure code in the first place.
You could, but I don't really see any difference between this and just slapping many tb of unrelated data in the database - the smart attacker will realise that a bunch of accounts are fake and just join to a table that holds only data for real users before extracting the info.
That's part of the elegance (well, it's something other than elegant because it uses a whole mass of data) of the original solution - all of the data could be significant in checking a single password so none can be discarded.
Absolutely correct. A 'salt' by definition, is always random.
For example, you could use a 32-byte salt with scrypt and you would get back a 32-byte hash. Both are equally unlikely to ever collide (see numbers in the article) even with trillions of entries in the table.
One thing that stands out: it seems like you can never purge this table. If it ever grows so large that it becomes a concern, you'd need to force your users to validate another way and set a new password using a fresh DB.
Wouldn't this setup mean that I can hack into your system with almost any password? As there is no direct connection AND you fill up with dummy hashes - doesn't that effectively mean you drive up the chances of a hack? The more fake hashes, the higher the probability that any random password I brute force may result in a match against the set of hashes. And if that happens - boom, I am in. Or am I missing something?
Definitely not. The back-end takes a user's password and hashes it with that user's specific salt. Then the result must be in the table. Another user's password will not work.
What you are worried about is a result existing in the table, even with the wrong password. That can only happen if you have a hash collision. The article on collisions I cited (http://preshing.com/20110504/hash-collision-probabilities) says that even with 171 trillion hashes already IN the database, the likelihood of a collision is still only 1 in 10^18. And that's using SHA1 -- SHA-256 would be even better.
[Edit] Ooops, I forgot a zero. It's actually 1,710 trillion hashes in the database to make the risk of collision 1 in 10^18. OK, that's 1.71 quadrillion.
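The corrected figure matches the standard birthday-bound approximation:

```python
n = 1.71e15        # 1.71 quadrillion hashes already in the table
space = 2.0 ** 160 # SHA1's output space

# Birthday approximation: P(at least one collision) ~ n^2 / (2 * space)
p = n * n / (2 * space)
assert 0.9e-18 < p < 1.1e-18  # ~1 in 10^18, matching the cited article
```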
Actually not a bad idea. You could also hash the salt and use it everywhere instead of the user's ID - anonymize the user from their own data. (If you used the password hash, you would just have to remember to update it everywhere on password changes - and of course use a hash of the hash so you didn't give away which hash was used.)
For instance, let's say I was UserID 123 which had a FK to a table of bookmarks or history and normally it would be easy to link that user to personable data such as Cancer, Job searches, Pr0n, etc. Now instead, you have a lot of these bookmarks pointing to a hash that was used in the initial user login and not directly linked in the database.
Typically in highly sensitive databases you hash out a new "ID" entirely and reference that. Then you provide a different service and database entirely that correlates two different identities when you need identification. This is similar to how PCI requirements for credit cards store the actual numbers elsewhere and use a token against the system for consumption.
This would increase the cost of hash verification by a factor of log2(N), where N is the number of records in the hash table. So even if you do have a billion hashes, you've only slowed the attackers down about 30 times. You might as well increase the bcrypt work factor and skip this scheme.
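The 30x figure comes straight from the logarithm (assuming an indexed lookup over the hash table):

```python
import math

n_hashes = 1_000_000_000        # a billion table entries
slowdown = math.log2(n_hashes)  # extra comparisons per indexed lookup
assert 29 < slowdown < 31       # ~30x, as the comment says
```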
This should be used with scrypt or bcrypt set to the highest difficulty your application can support.
The point is to increase difficulty of targeting an individual user, and increase difficulty of stealing the data in the first place. This is separate and distinct protection from what scrypt/bcrypt gives you.
[Edit] In a typical GPU accelerated brute force attack, the target hash and salt are both known as fixed 20 or 32-byte values, and you iterate through candidate passwords until you find a match.
In the proposed technique, the target hash is NOT known! Instead, what you do know is a salt and a list of a few billion hashes, ONE of which can be obtained by combining the salt with some unknown password. This is a very different equation, and I believe, much harder to solve, since your target value can't just be kept in a register or L1 cache for comparison purposes.