If I can get a list of compromised email addresses and passwords, I will be trying those email/password combinations across services, not passwords from the top 100.
> We have some ability at ArenaNet to watch hacking attempts live, and it tells a fascinating story. We watch as hackers use tens of thousands of different IP addresses to scan through millions of attempted account names and passwords, almost all of which are for accounts that don’t even exist in our database, looking for matches. They’re not guessing or brute-forcing passwords; they’re trying a very specific account name and password for each attempt. For example, account name “firstname.lastname@example.org”, password “alligator101”. If they don’t get a match immediately, they may try a variant like “alligator100” or “alligator102”, then they quickly move on to the next entry on their list. And it’s interesting to see that the passwords on these lists are mostly quite good passwords. For every one account on the hackers’ lists with a password like “twilight” (real example, ಠ_ಠ), there are dozens of accounts with good strong passwords. So the world at large clearly knows how to pick good passwords; the reason people are still getting hacked is because they use the same passwords on multiple sites.
It's not to protect against an external brute force, but a leaked database table being able to identify valid passwords, and then using that to access the service or check other services for a shared password.
It's unfortunately far too common that a SQL injection, bad employee, lost backup, or whatever leaks a copy of the account table in the database, and even when using scrypt/bcrypt etc. it's plausible to slowly break accounts beyond the first few thousand common passwords. Depending on work factors, it's about 250ms per guess, which is about 345,600 guesses per CPU core per day.
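The arithmetic checks out; a quick sketch (the 250ms per guess is an assumed work factor, not a benchmark):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
guess_time = 0.25               # assumed scrypt/bcrypt cost per guess, in seconds

guesses_per_day = int(SECONDS_PER_DAY / guess_time)
print(guesses_per_day)          # 345600 guesses per CPU core per day
```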
In general I'd say that most organizations need to get less fancy about password security not more. Most mechanisms that companies employ to make things more secure don't. They make things less secure because they add attack surface, or they fail to account for the behavior of actual users when faced with any additional effort.
For example: https://ai.google/research/pubs/pub43783
Adding a salt and hashing a password is not something someone should be doing in 2018. Use scrypt, or another one of the recommended algorithms, as it is a complete implementation that is well researched and can be simply dropped into most systems.
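As a sketch of what "just use scrypt" looks like in practice, here is Python's stdlib `hashlib.scrypt`; the n/r/p work factors below are illustrative only and should be tuned to your hardware:

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived key); store both, never the password itself."""
    salt = os.urandom(16)
    # Illustrative work factors, not a production recommendation.
    key = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, key

def verify_password(password: str, salt: bytes, key: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, key)  # constant-time comparison
```

Note the library handles the salting and memory-hard key stretching for you; all the application stores is the salt and the derived key.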
And my response was in reference to how many passwords you should blacklist; even with a good algorithm, it's still feasible to brute-force easy-to-guess passwords in an offline attack.
You are virtually guaranteed to get hits with this approach, and it's extremely difficult to block.
Slightly more sophisticated would be to use a botnet to do this, so that you can't block the source.
If the database gets breached, for example, then the attacker can do a lot more iterations, even if the password hashes are salted.
If the database gets breached with write access, a user's password can be bypassed entirely, even if it can never be discovered.
If the database contains all of the information the hacker wants to know, they don't need to hack your account, they can just read the tables directly.
If you're ok with not allowing a lot of valid passwords, then this approach has some merit, but user beware.
You could also use this as a pre-check before querying the real haveibeenpwned API.
Yeah, bloom filters are a fantastic pseudo-caching layer.
A bloom filter will return whether something is in a set with potential false positives but never false negatives.
A cache will return whether something is in a set but with potential false negatives but never false positives (i.e. item is not currently in the cache but exists in the set).
So kind of a caching layer caching every password that has definitely not been breached.
i'd name this sort of thing a filter, rather than a cache.
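The no-false-negative property being discussed can be demonstrated with a toy bloom filter; the sizes below are illustrative, not tuned for a real password corpus:

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: k bit positions per item in a fixed bit array."""

    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Double hashing: derive k indices from one SHA-256 digest.
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

bf = BloomFilter()
for pw in ("password", "123456", "letmein"):
    bf.add(pw)

# Added items always match: no false negatives, ever.
assert all(pw in bf for pw in ("password", "123456", "letmein"))
# A miss is definitive ("definitely not in the set"); a hit is only "maybe".
```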
You provide a username or email and a password, and if you get it right, it will let you know, lock the account, and require a reset the next time the user logs on.
Then someone like haveibeenpwned can run a meta-service to check every service that registers with them.
I haven't thought through the security implications of this but it seems pretty mild on the surface. I mean there's denial of service potential but only when their password has been compromised so... good?
Also, doesn't that just hand over the user's password to the site hosting major.service//password-check? Maybe the user is security conscious already, now their strong password is sitting in plaintext on major.service's misconfigured logs.
This just doesn't seem like a good idea to me.
Edit: Or a hacker gets into linkedin, now he's literally being fed the actual username/password combinations of every single person that signs up to any website anywhere. This just gets worse the more I think about it.
If it's rate limited more aggressively than the normal login endpoint, I don't see how it could be any more dangerous.
Anyone on the internet can already query any site with a guess for your username/password - if they're right, the request will succeed. How is that different?
Either everyone would use the same unsalted hashes (bad practice) or everyone would use the same salts (bad practice). In either case, this requires widespread use of known bad practices.
If you're not familiar, these are considered bad practices because they allow for the construction of rainbow tables - https://en.wikipedia.org/wiki/Rainbow_table
These are bad because they allow hashes to be reversed.
> If it's rate limited more aggressively than the normal login endpoint, I don't see how it could be any more dangerous.
Given a little bit of time, rate limiting isn't actually that big of an obstacle. Especially if you don't need to compromise a given account and can use a botnet to trawl for accounts in general.
> Anyone on the internet can already query any site with a guess for your username/password - if they're right, the request will succeed. How is that different?
This imagines a central service querying N others with credentials to see if each service knows them.
Imagine compromising one service. Then people are sending you usernames and passwords for hundreds of other services. You won't need to query any other service, the credentials will come to you and the user will have no indication!
However, there might be some complexities. This is basically poor password handling dressed up as a well-intentioned service. It means anyone breaching any of the services the hub knows about would get to see a great many username/password pairs being checked. It would make passwords only as safe as the weakest link across all the services being checked. The odds of N services all being absolutely safe all at once aren't as good as those of a single service. This is why people shouldn't re-use passwords, right?
I can see your heart is in the right place. You're looking to protect users from themselves! That's a wonderful thing, and I'm so glad that there are people like you who care. But the above outcome might not be maximally desirable, you know?
See my other comment: https://news.ycombinator.com/item?id=17323068
Why? Bloom filters are designed to maintain a reasonable false positive rate while still allowing you to add items extremely quickly (high insertion rate).
Since the bad password list is presumably fixed (or almost so) you can do checks faster and with a lower FP rate by just truncating the hashes and prefix sorting the list ahead of time.
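A minimal sketch of that approach, using a tiny made-up corpus: truncate each SHA-1 to a few bytes, sort once up front, then binary-search each candidate.

```python
import bisect
import hashlib

# Hypothetical breached-password corpus; the real list has ~500M entries.
breached = ["password", "123456", "letmein", "qwerty"]

PREFIX_BYTES = 8  # keep 64 bits of each hash; collisions become vanishingly rare

# Precompute once: truncated hashes, sorted for binary search.
table = sorted(hashlib.sha1(p.encode()).digest()[:PREFIX_BYTES] for p in breached)

def maybe_breached(password: str) -> bool:
    """True if the truncated hash is in the sorted list (tiny false-positive odds)."""
    h = hashlib.sha1(password.encode()).digest()[:PREFIX_BYTES]
    i = bisect.bisect_left(table, h)
    return i < len(table) and table[i] == h
```

Each lookup is O(log n) over a flat, fixed-size array; the false positive rate is governed entirely by how many bytes of the hash you keep.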
Cuckoo filters and bloom filters differ in how they handle increased load. As a cuckoo filter's load increases, insertions become more likely to fail and increasingly expensive, but the false positive rate stays the same. A bloom filter will keep accepting insertions, at the cost of an ever-rising false positive rate.
Normally, for passwords you would use a secure hash, negating this.
A Bloom filter is great when you need a probabilistic hash table that is fairly efficient and can have records added efficiently.
With a fixed list as I described above, the insertion cost is enormous (avg cost of moving 1/2 of the array elements), but the space/speed efficiency reaches the theoretical limit
Or, flipping this: every time you add a password, you check that it hasn't been used before. You can only set 256 passwords, total, before the space of truncated hashes is filled.
What I'm saying is I had always thought the point of hashes was to compare them in their entirety, not as subsets of the output characters. This tool seems to suggest (to a crypto novice) that this is not a concern.
How would users feel if their exceptional password collided in the bloom filter with a pwned password?
"In application areas where error-free performance is a necessity, these new methods are not applicable." (p. 422)
Imagine that password hashes are six characters, and `abcdef` is the only pwned password we know about. A smaller version of this tool would reject your password if it hashes to anything that starts with `abc` (such as `abc000`, `abcdee`, `abcdef`, etc). With this toy example, we can see that this is deliberately pessimistic -- it only is a "your password might be pwned" indicator.
It's like throwing out anything in your fridge that is past its expiration date, even if you don't know whether it's actually spoiled.
Doesn't that not work very well at all, though? If these "abcXXX" strings are hashes, then just because they share a prefix doesn't mean the unhashed passwords are even remotely similar.
"good-password-123-dog-xxx..." could hash to "abcdeg" and "password" could be the thing that hashed to "abcdef".
Because there is no possibility of a false negative, you know that if the bloom filter does not include the hash you gave it, the password definitely does not exist in the original corpus. In this case, the password is probably at least semi-unique. If you're using this bloom filter in the context of a web application registration process, this would likely be the case for anyone trying to sign up with a strong/unique password, e.g. a random one generated by a password manager.
However, if the bloom filter maybe contains the hash, you would likely fall back to some more expensive strength check, like querying the full HIBP database to see if this exact password definitely exists in there.
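That fallback can use HIBP's k-anonymity range API: send only the first five hex chars of the SHA-1 to https://api.pwnedpasswords.com/range/<prefix>, then scan the returned suffixes locally. A sketch, with the actual network fetch left out:

```python
import hashlib

def split_for_range_query(password: str) -> tuple[str, str]:
    """Split the uppercase SHA-1 hex digest into the 5-char prefix that is
    sent over the wire and the 35-char suffix that never leaves the client."""
    digest = hashlib.sha1(password.encode()).hexdigest().upper()
    return digest[:5], digest[5:]

def breach_count(suffix: str, response_body: str) -> int:
    """Scan a 'SUFFIX:COUNT' response body for our suffix; 0 means not found."""
    for line in response_body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip() == suffix:
            return int(count)
    return 0

prefix, suffix = split_for_range_query("password")
# "password" hashes to 5BAA61E4..., so only "5BAA6" is sent to the API.
```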
- "abc" is in the bloom filter, and so your password _might_ be in the list (depending on what the remaining characters past "abc" are)
- "abc" is not in the bloom filter, so your password _is definitely not_ in the list.
This results in a high rate of false positives (times when a password is incorrectly thought to be part of the list, but it's not) but that's deemed to be an acceptable tradeoff for this algorithm.
Someone filed a PR asking to add a cache for HIBP's responses, and I did some quick math and came up with ~1200 as a birthday-paradox number for the first five digits of a SHA1 hex digest (assuming uniform/random distribution of hex digits).
The smallest is 381 (hash prefixes "E0812" and "E613D")
The largest is 584 (hash prefixes "00000" and "4A4E8")
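The ~1200 figure follows from the birthday-paradox approximation sqrt(pi/2 * N) for a space of N = 16^5 possible five-hex-digit prefixes:

```python
import math

space = 16 ** 5  # 1,048,576 possible 5-hex-digit prefixes
# Expected number of uniform random draws before the first collision:
expected = math.sqrt(math.pi / 2 * space)
print(round(expected))  # 1283, i.e. the "~1200" ballpark above
```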
It's not good to send a user's cleartext password to yet another system
-> send it hashed
But a full hash might still be vulnerable to a rainbow table attack
-> let's use only a part of the hash, AND, reject full hashes.
In summary, I believe they tried to avoid creating more ways in which user passwords could be leaked.
This bit me because I was playing with that file, and to save space and fit it onto a RAM disk (to save time) I converted it to binary. When converting it back, it had a different hash despite being 'exactly the same'; only later did I realize it uses Windows line endings and trailing-space padding (and that I didn't even need to bother with binary, since 21 GB comfortably fits in a RAM disk on a machine with 32 GB RAM, while 30 GB didn't with a browser, Windows, etc. running).
Troy said it's a quirk of the export that might be fixed in the next version of the file, but it doesn't matter for the compressed 7z size (which is true; I checked personally, with 7z going at it for hours, and it really did compress those line endings and spaces down to nothing).
Edit: Also see my other comment about how very overly eager at rejecting passwords this filter might be (unless I've made an error in my reasoning that someone has since pointed out): https://news.ycombinator.com/item?id=17323068
That a password has been used by someone else in the past, for some other reason, and is now in some database does not make that password dangerous. What matters is whether your users are using the same passwords across multiple services, services out of your control. But at the moment no tech out there would allow you to test for this.
If your system cannot detect someone running 500,000,000 passwords against a single account, you have far bigger problems.
I personally prefer the "zxcvbn" method, which is why I wrote a Java library for my company inspired by it, called nbvcxz.
Checking against a list of known passwords from breaches does accomplish this, albeit with a high false positive rate.
I don't think I'm alone in using passwords across multiple sites. Everyone here lives in that glass house.
Yet. That you know of.
> Everyone here lives in that glass house.
No, some of us use password manager software for exactly this reason - so we don't use the same password across multiple accounts and have a smaller blast radius when/if a password is compromised.
Just making the possible passwords 10 chars long, drawn from a set of 62 (lowercase + uppercase + digits), gives me 0.05 precision at a 0.00000001 (0.000001%) FP rate.
(This is reply to w8rbt below, here since HN has decided I cannot post anymore for next who knows how many hours.)
I'm aware of this and wouldn't treat a bloom filter as the final decision-maker if there is a hit, but many people seem to act like it is intended to be one and bring up its FP rate as low enough, e.g.:
1. This comment and several of its replies: https://news.ycombinator.com/item?id=17322692
2. This comment telling me false positives don't matter: https://news.ycombinator.com/item?id=17323438
3. Neither of the linked bloom filters mentions that if there is a hit, you should check with the https://haveibeenpwned.com/ API.
4. (Of course, this is HN after all!) my comments pointing this paradox/pitfall/whatever out being at negative point scores. I've literally lowered my score for saying "hey, 1 in 1000 FP isn't actually super accurate and almost sure hit is a good hit". And there are one or two green named users who seem to be criticizing this bloom filter with freshly made accounts, exactly because of this BS.
"Space/Time Trade-offs in Hash Coding with Allowable Errors" — Burton H. Bloom
If you have a list of passwords you want to exclude, you can load it into the library, or it comes with its own list of common passwords from leaked datasets.
I think the main difference you'd see is that the quality of matches and the scores generated are superior to zxcvbn. I put a lot of work into fixing some of the algorithms and adding more than the original to cover all the corner cases zxcvbn didn't support well.
For example, nbvcxz supports finding separators within a password, so passphrase detection is much, much better. I also added support for Levenshtein-distance calculations for dictionary matches, so common misspellings, or slight alterations of a word, will be caught even if they are not included in the dictionary.
Also, internationalization support is nice for those who need it.
nbvcxz is a Java library (and standalone console program) which is heavily inspired by the work in zxcvbn.
read -s password && curl -4 https://www.bloomingpassword.fun/hashes/sha1/$(echo -n "$password" | shasum | cut -c 1-16)
Especially for some bullshit service that I have no real interest in having an account for in the first place. Love it or hate it, at least OAuth with Google or Facebook or Microsoft cuts down on that nonsense a little bit.
You could skip that second hashing and get better performance, since you are already dealing with hashes.
Check my implementation: https://github.com/daedalus/bloomHIBP
With a 512MB bloom filter I can get 0 FP.
This site is a false positive rate calculator:
I just wonder how one could create such a program that the users will have some guarantees that it will not leak stuff.
You could do what Okta does with its PassProtect browser extension and use the Have I Been Pwned API along with k-anonymity.
PassProtect is a browser plugin that makes it easy for people to see in the moment whether or not their password was exposed in a breach. With a real time, as-you-type notification, PassProtect quickly alerts users of possible "riskier" passwords so they can take action immediately and without compromising privacy. By using k-anonymity, PassProtect ensures that your passwords are never seen, stored, or sent over the network during this checking process.
We’ve also made it easy for developers to add this functionality directly into their app or website. By also surfacing related information and breach details, PassProtect promotes security awareness for users while relieving developers of the burden of tracking breaches and maintaining a homegrown tool.
Rather than trying to verify whether a password is easy to guess, you could offer up strong password suggestions if the user can't create a good one in two tries using basic password heuristics.
Most online accounts are not critical to a user's life if they get hacked; the real danger is in reused passwords. If you generate random passwords for users, there's no way they're going to reuse them. So that might be better for user security than just fighting to get a password that's impossible to guess/brute force.
So my suggestion is to just generate a super difficult password for the user and tell them to save it in their browser's password manager (make sure your site behaves well with browsers' password save feature), and if they forget it, hit that "e-mail me a password reset link" button. After that, implement non-SMS 2FA.
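Generating such a password is a one-liner with a CSPRNG; a sketch using Python's `secrets` module, with an assumed 20-character alphanumeric format:

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols

def generate_password(length: int = 20) -> str:
    """20 chars over 62 symbols gives log2(62^20) ~ 119 bits of entropy;
    the length and alphabet here are assumptions, adjust to taste."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

The key point is using `secrets` (a CSPRNG) rather than `random`, which is not suitable for security-sensitive values.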
This has been tried. It doesn't work. Two factor is the real solution.
I don't know of any, and I suspect there may not have been any, widely used consumer-facing systems running along those lines.
I suspect most of the problems can be mitigated in a scenario where the user gets a new password, rather than getting to create a new password.
But yeah, U2F 2FA is an absolute baseline for serious services these days (pretty much anything else is subject to phishing).
I'm talking about two completely different attacks. One is password reuse, one is not. 2FA does not protect against password reuse, because not every system in the world offers 2FA.
What do you mean "only"? That's the whole point!
2FA is a protection against password reuse only for the sites who use it.
I _almost_ (not quite, but almost) would rather get my investment/bank accounts hacked than this. I use a password manager, and just counting the ones I used in the last couple of months add up to like 150+ passwords. Total, I have hundreds for my active accounts across various sites, services, devices, etc.
If I had to change them every 6 months, I'd be doing almost nothing else. Could be mitigated with some API integrations where every 6 months my password manager could just update the password on its own, but that will never be standard across everything.
E.g., right now, Google keeps looking at my browser string and starts whining and making unwarranted demands when it sees something unexpected.
Oh noes! UR browser string seems goofy!
Time for mandatory 2 factor auth, buddy!
I decide my security stance. I decide what’s dangerous. I decide which risks are acceptable.
Why do you think that most passwords are just in the first 64 bits of a SHA1 hash?
>Why do you think that most passwords are just in the first 64 bits of a SHA1 hash?
I don't understand what you're trying to say. Are you asking why I think most passwords have less than 64 bits of entropy?
>According to one study involving half a million users, the average password entropy was estimated at 40.54 bits.
Hint: the more rules enforced, the more predictable the passwords become. Lazy people will find ways to be just as lazy, in a smaller password space. More passwords start to look the same.
What's the password for the password manager?
What do you do on a shared machine that doesn't have your password manager software installed?
What's the password for the place where you could download your database on a new machine?
Password requirements so complex humans can't remember them are just slang for DIY 2FA, where the factors are a weak password and possession of your user's database. You might as well require actual 2FA at that point and relax the requirements.
For example, let's assume all passwords (including all the ones on the haveibeenpwned list) are exactly 8 chars long and drawn from an alphabet of 50 characters.
This gives this filter a big head start, since in reality possible passwords outnumber the listed ones far more crushingly: lengths vary and are unbounded, and there are 95 printable ASCII chars to use (alphanumerics alone are already 62).
passlen = 8                       # assume every password is exactly 8 chars
passchars = 50                    # ...drawn from an alphabet of 50 symbols
badpasses = 501636842             # size of the haveibeenpwned password list
allpasses = passchars ** passlen  # 50^8, about 3.9e13 possible passwords
falsepos = 0.001                  # the filter's false positive rate
print(badpasses / (badpasses + falsepos * allpasses))  # fraction of hits that are true positives, about 0.013
This apparently comes up with FP rates for cancer screening and antivirus and such (most people don't have cancer, most exes aren't viruses, etc.).
I might be wrong, feel free to point that out. I'm not a statistician or a doctor, I just had a statistics class as part of my CS degree. :)
Edit: I've found the English terms and explanations:
But actually, probably more like 1 in 10 or 1 in 50 users use bad passwords. If we use 1 in 50, we get that 19/20 positives are true positives and only 1/20 are false positives. With 100,000 users we'd expect to trip the filter 2,000 times, only 100 of which are false positives.
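Those numbers check out under the stated assumptions:

```python
p_bad = 1 / 50      # assumed fraction of users picking a listed password
fp_rate = 1 / 1000  # the filter's false positive rate
users = 100_000

true_pos = users * p_bad                    # 2000 correct rejections
false_pos = users * (1 - p_bad) * fp_rate   # 98 good passwords rejected
print(true_pos / (true_pos + false_pos))    # ~0.953, roughly 19 in 20
```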
The number of false positives doesn't matter, and getting a false positive matters almost not at all. 1 in 1000 good passwords are rejected. Good passwords are nearly random and cheap to generate. That means the average person needs to re-pick a password once every, what, 100 years?
"But it's much more likely than running into a true positive" you might say.
Actually not, since the user runs into a true positive exactly when they choose a very common password. It's not random. If they only choose bad passwords, they'll run into true positives every single time.
On the other hand, that there exist more false positives than true ones is actually great, since this means you can't recreate the true set of passwords given the Bloom filter.
That being said, false positives will hopefully outnumber true positives, since not many people should be using the listed passwords.
Rather than a bloom filter, I would use a cuckoo filter. Since its error rate plateaus when it's close to full, you could get the same error rate with a significantly smaller filter.