300M Freely Downloadable Pwned Passwords (troyhunt.com)
318 points by urahara on Aug 3, 2017 | 177 comments



I really would love for us to be done with passwords altogether. We're asking non power users to make their password unique, and then make it complicated, and then remember all of them in their head, not on a post-it. Nobody can do that, not even us who are telling them to do that. And then, we explain to them they're dumb if they didn't do that.

Currently, my way to generate a new password is this : `pwgen | md5sum`. And then, I use "lost password" everywhere (but for my mailbox, obviously), that is, the rare times my browser is not already prefilling the login form.

This makes me wonder why we don't just go with that : generate a random password for the user in registration form, allow the browser to save it. On the login form, check if fields are prefilled. If not, only display an email field and send an auth link as mail. User clicking it (once, and fast enough) is logged in.

You still have to remember your mailbox password, but that's the only one, quite akin to the root password of a server.
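A minimal sketch of the emailed-link login described above, assuming an in-memory token store and a 15-minute lifetime. The names `issue_login_token`, `redeem_login_token` and `TOKEN_TTL_SECONDS` are hypothetical, and actually sending the mail is left out.

```python
import hashlib
import secrets
import time

TOKEN_TTL_SECONDS = 15 * 60  # assumed lifetime; the comment only says "fast enough"

# Maps SHA-256(token) -> (email, expiry timestamp). Only the hash is kept,
# so a leaked store does not reveal usable links.
_pending = {}


def issue_login_token(email, now=None):
    """Create a single-use token to embed in the emailed login link."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    _pending[digest] = (email, now + TOKEN_TTL_SECONDS)
    return token


def redeem_login_token(token, now=None):
    """Return the email if the token is known, unused and unexpired, else None."""
    now = time.time() if now is None else now
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    entry = _pending.pop(digest, None)  # pop() makes the token single-use
    if entry is None:
        return None
    email, expires_at = entry
    return email if now <= expires_at else None
```

A second click on the same link, or a click after the lifetime has passed, returns `None` and the user would simply request a new link.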


I personally use KeePass, I generate 20+ character random passwords and it saves them in an encrypted file. I just wrote down my email password and hid it in case I lose my file.


I also use KeePass to generate long, pseudo-random passwords. The number of different passwords I need is far beyond what I can hold in my head, especially when they have to be secure. I got my closest family to use KeePass too, which led to a significant reduction in lost passwords and hacked accounts.


Or, you know, you can just use a password manager. Using forgot password all over the place is a bad idea, a lot of sites send you a temporary password that people can find in your emails if they ever hack your emails.

Pick a really good password for your password manager and remember that. Doesn't matter too much which password manager. If you're paranoid use KeePass, otherwise I personally use LastPass.


If someone accesses my mailbox, they'll use lost password to access other websites :)


Problems: 1) email is insecure, 2) it requires two logins, 3) it's not dual factor, 4) it's more error-prone. It's less work [and more secure] to just implement Google or Facebook authentication.

In terms of generating a strong password: most browsers have password generators (which I didn't even know about until recently). They aren't all enabled by default and they don't work on all forms, and not all browsers have them. Browsers also have supported user certificates [which are more secure than passwords, sort of] for like 15 years, but nobody ever uses them. The main reason (afaik) is how shit the browser's UX for them was, in combination with a burden of complexity - and of course you can't use them on random public devices.

I think we are close to reaching an authentication nirvana. If U2F came embedded in all new computing devices, and an open method of securely synchronizing all devices and service providers was used, we could effectively skip passwords, and rely almost entirely on backup codes for the few times they were needed. A lot of laptops and phones come with fingerprint scanners now. If those scanners were used as part of a U2F solution we would have a pretty solid authentication mechanism. (fingerprints are not foolproof, but IMHO they are about as secure as a password)


Email address is stupid, we should have randomly generated proxy email addresses.


A former coworker makes liberal use of American Express disposable credit card numbers -- proxy credit card numbers that you can request to give away to less than trustworthy merchants.


Privacy (https://privacy.com/) offers a similar service for those that want something that works with more than American Express cards or other such offerings from other card issuers.


Did not know about this! Checked it out, got really excited... and then discovered it's US only :-(


I use this all the time. It's great.


My Citibank card has the same feature, although the number generation applet requires Flash. On a banking website. In mid-2017.


Email addresses have been working fine for me. What I would love though is a way to easily alias my phone number.


So long as you're hashing the output, go straight to the source. And if you're going to hash, use a longer hash.

    dd if=/dev/urandom bs=2048 count=1 | od -A none -l | sha512sum

Or just generate a long password:

    pwgen 2048 1


Just use pwgen 32 instead


Why pipe to md5sum?


I could just pass an option to make pwgen generate longer passwords, really, but I find the idea of generating the hash of 160 urandom level generated passwords amusing :) Now that's random! Oh, and also, it looks like a hashed password, so special troll points if a cracker finds the password and thinks it's its hash.

(note for people who may not know pwgen : it outputs 20 lines of 8 random passwords)


> (note for people who may not know pwgen : it outputs 20 lines of 8 random passwords)

Unfortunately, it only does this if STDOUT is a TTY. If it's a pipe, then it outputs a single 8 character long password. So I'm afraid you've been generating rather low entropy passwords by piping the output of pwgen through md5sum.

You can confirm this by piping pwgen through cat: pwgen | cat


Oh, you're right. Thanks for pointing that out. Damn, that was unexpected behavior (although I can understand how it's helpful when you want to get a password from a script/program that executes pwgen, but then I guess an option would have been enough).


I simultaneously love and loathe programs that do this. I long for some kind of shell pipe negotiation.

Most annoying trying to use "watch" or "less" and have the color work. "watch --color juju status --color", ahh. :-)


Having used isatty in some of my CLI programs to alter the behavior depending on whether STDOUT is a TTY, I see the appeal. It allows the program to do the best thing depending on how it's being invoked.

However, more recently I've grown skeptical. It leads to surprises, and if a person isn't familiar with the finer details of file descriptors, they may wonder why piping a program changes its behavior. I think this example of someone shooting himself in the foot with pwgen has convinced me never to do this again.
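To make the surprise concrete, here is a small sketch of TTY-dependent defaults. It only mimics the behaviour described in this thread (20 lines of 8 passwords on a terminal, a single password on a pipe) and is not pwgen's actual code; `generate` and `random_password` are hypothetical names.

```python
import io
import secrets
import string
import sys

ALPHABET = string.ascii_letters + string.digits


def random_password(length=8):
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def generate(stream=None, length=8):
    """Many passwords when writing to a terminal, exactly one otherwise."""
    stream = sys.stdout if stream is None else stream
    count = 160 if stream.isatty() else 1  # 20 lines of 8, as described in the thread
    return [random_password(length) for _ in range(count)]


# io.StringIO reports isatty() == False, so it stands in for a pipe here.
piped_output = generate(io.StringIO())
```

Hashing `piped_output` would hash a single 8-character password, which is the low-entropy trap discussed above. An explicit count flag would avoid the hidden mode switch.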


Yeah, that's a hard one. Of course, there are strong arguments for "sensible defaults". But here is how it went :

First, I used pwgen to generate temporary passwords for new unix users. I read the man page at that time, probably saw the mention about different behavior regarding output capabilities and thought : "nevermind, this is not my use case". Years later, when I decided to pipe it, I thought I already knew the program, and certainly did not think : "hey, let's check the man page again to see if the behavior may be altered on a pipe". Especially since I did not think it was a big deal: I was not putting that in a codebase, I just wanted to generate a random password to upload gifs or something.

And that's the interesting thing : should I have wanted to put it in a codebase, I would have read the man page again carefully.

All of this points to one conclusion : consistent defaults for end users are more important than sensible defaults for developers. You can expect developers to pay special attention, so it's OK to alter behavior for them through flags rather than through detection of what stdout is plugged into.


You can use "simpler tools", such as `head -c 20 /dev/urandom | sha1`.

If head behaved differently when piped, you would probably know about that because it's a common thing to do.


md5sum is going to limit your generated passwords to 128-bits because you're limited to 32 hexadecimal digits, which are each represented by 4 bits. [a-z][A-Z][0-9] can be represented by ~6 bits, so 32 characters would allow for 190 bits of entropy.
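The arithmetic behind those two figures, as a quick sketch. It assumes every character is drawn uniformly and independently, so it gives an upper bound; `max_entropy_bits` is a hypothetical helper name.

```python
import math


def max_entropy_bits(alphabet_size, length):
    """Upper bound for a uniformly random string: length * log2(alphabet_size)."""
    return length * math.log2(alphabet_size)


hex_bits = max_entropy_bits(16, 32)    # md5sum output: 32 hexadecimal digits
alnum_bits = max_entropy_bits(62, 32)  # [a-z][A-Z][0-9], 32 characters
```

Note that a hash can never add entropy: if the input to md5sum is a single 8-character password, the output carries no more entropy than that input, whatever its length.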


To format the password I guess? pwgen sounds like a command that might be able to do that, too, though, but I wouldn't know.

Anyway, as far as password creation goes, here's another alternative based on more standard tools:

   head -c NN /dev/urandom | base32 # or base64, or md5sum if you so prefer
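A rough Python equivalent of that pipeline, for cases where the shell tools are not available. The byte count of 20 is an assumption standing in for the `NN` placeholder, and `random_password` is a hypothetical name.

```python
import base64
import secrets


def random_password(num_bytes=20):
    """Read num_bytes from the OS CSPRNG and base32-encode them."""
    raw = secrets.token_bytes(num_bytes)
    # Strip the "=" padding that base32 adds when num_bytes is not a multiple of 5.
    return base64.b32encode(raw).decode("ascii").rstrip("=")
```

With 20 random bytes this yields 32 characters carrying 160 bits of entropy.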


Chrome already does that (but I still stick with KeePass)


Can Troy or someone contact Google (bq-public-data@google.com) and push this to GCP's BigQuery public datasets[1] for hosting, so it's easier to look up your password via SQL in BigQuery rather than on some 3rd party site?

[1] https://cloud.google.com/bigquery/public-data/


I work for Google cloud. Will ping internally about us hosting it.


Thank you.


An interesting element to this is how resistant some people are to using torrents for legitimate purposes, even as a backup mirror.

It's something we've come to embrace in the Linux world. Much faster than a single server and saves bandwidth at individual sites. Surprised this pragmatism hasn't reached the rest of you yet.


Having an http option makes it better for those on restrictive work networks--I'm downloading the file because I want to experiment with it, but I certainly wouldn't want a call from Networking asking why I'm torrenting stuff.


Yes, I understand the reason for keeping a slow legacy option available.

Even in your situation, if you're doing this for work and IT calls you up, you just tell them what you're doing. "I'm downloading a very large file over Bittorrent because it's 50% faster than the HTTP download and I'd like to do some work today. Is that all? Thanks. Bye."

In some places you'd get that call from IT just for downloading a large file. Getting a call doesn't mean you're doing something wrong, they're just checking it's you, not malware, and that it's for work. If they haven't already blocked it, have at it.


I wonder how we force change with individual companies? Today I had to sign up for a UPS account. The password length was set to max 27 characters, and the form had disabled paste in the password field.

Who do we lobby to get them to fail their next PCI-DSS compliance test?


Someone made a Chrome extension to enable password pasting again.

Don't Fuck With Paste: https://chrome.google.com/webstore/detail/dont-fuck-with-pas...


And now we've moved the problem from "I used a weak password for $site because I couldn't paste it from my password manager" to "I've got an extension in my browser from some internet random that manipulates every form field including the password field on every webpage I visit"...

(And yeah, the "internet random" here has a github repo with the code, and the file that does this is an easily auditable 16 lines of javascript, so props to him for that. But it's still got the recently exploited attack vector that he or an attacker who takes over his account could push malicious updates to the extension, like the webdev extension from earlier this week...)



I've used this extension before, just a note that it will break legitimate onpaste events, for example websites that let you paste an image in, like Imgur or Twitter

https://github.com/jswanner/DontFuckWithPaste/issues/14



Thanks... :-) I hate that... I mean, pwd managers and paste are more secure than having to type them in.. not to mention, less prone to mistakes.


PCI-DSS still recommends password expiry, rotation, and complexity rules. If you undergo certification, send a mail to them and let them know that you prefer the NIST guidelines.

Might happen in the next version release.


Even worse than disabled paste, I've noticed a lot of iPhone apps now require you to set a non-pasteable 4-6 digit pin to login (sometimes this is required to use the app at all, sometimes only if you enable touch id).

At least one bank says the pin is "for this device" but then accepts it on new iOS devices too (without ever prompting for the "real" password on the new device).


We've got a few apps where that's been a customer requirement. The thinking is that 1) there are enough people who do not use a pin/fingerprint to secure their phone and 2) lots of people let other people use their phone (especially parents with kids), so sometimes the argument is that securing the PII in the app itself is a reasonable choice.

(I occasionally worry that my phone's bitcoin wallet does not do this. There is occasionally a large enough balance in there that I'd really like to have to touchid to open it and transfer those bitcoin out... Not quite worried enough to investigate whether other wallet options do it, but sometimes I hand my phone to someone to show them a pic and think "Do I _really_ know this person well enough to trust they won't poke around my phone and try to distract me enough to swipe some bitcoin???")


You think that's bad? My damn BANK has the following password policy for online banking:

  The password you create here can be used to access Online, Mobile and Telephone Banking.
  All passwords must be six characters in length. Special characters (eg. *, %, $, etc) will not be accepted.


Banks are some of the worst - though oddly in many cases I've seen them support much more complex usernames than passwords. If you use a password manager you can generate large, random usernames to go with your tiny, weak password, potentially.


get a new bank, then tell them why.


I'm sure they will be heartbroken.


Send it in a certified letter. Address it to the CEO and send certified CC's to the FDIC, CIO, and a reporter for a local or national tech newspaper column.

It may sound archaic but you have to raise the visibility if you're concerned about changing the bank's behavior.


People just tweet @ceo or @company these days, same effect


That is enough characters that if you restrict yourself to the ~95 normal printable characters you can obtain around 175 bits worth of password (as in, more than enough to store a SHA1 sum). Is that really not enough entropy for your use case?


A constraint on password length is often suggestive that the password is not being stored securely.


That's simply not true.


Yes it totally is true. Hashes are a standard length, and you can feed any length passphrase into the hash algorithm. It wouldn't surprise me to see passphrases limited to e.g. 256 chars anyway, but 27 smells very bad. What system limitation leads to this particular number? It smells like a DB column width to me.


Password hashes are specifically designed to be computationally intensive. You can feed any length of password into a hash, but the longer the password is, the more work you have to do.

"No length restriction on passwords" is a common and valid report on HackerOne, because servers that do store passwords securely can be DoSed by someone providing a long password and forcing the server to hash it.


That is an important consideration, but it still sets reasonable password length limits in the hundreds or thousands of characters.
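One way to reconcile both points is to reject oversized input before the expensive hash runs, with a cap generous enough for any real passphrase. The sketch below assumes PBKDF2 from the standard library; the cap of 128 characters and the iteration count are illustrative assumptions, and `hash_password` is a hypothetical name.

```python
import hashlib
import os

MAX_PASSWORD_CHARS = 128     # assumed cap, well above any typical passphrase
PBKDF2_ITERATIONS = 100_000  # assumed work factor; tune for your hardware


def hash_password(password, salt=None):
    """Reject oversized input first, then run the deliberately slow hash."""
    if len(password) > MAX_PASSWORD_CHARS:
        raise ValueError("password too long")
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS
    )
    return salt, digest
```

The stored value has a fixed size regardless of password length, so a cap like this exists only to bound the work per request, not to fit a database column.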


NIST's latest recommendations say "at least 64 characters".

I doubt there's anyone who can make a strong argument that "64 characters isn't enough", and I doubt even intentionally computationally expensive password hashing is going to end up with significant resource usage with 64 or 128 character strings. I wouldn't want my shared hosting WordPress site with a password plugin to need to calculate the bcrypt hash of the entire text of War And Peace, but I suspect the difference between bcrypting "password123" and a random 64 or 128 character string is insignificant enough to be ignored (but I've never tried benchmarking it, so I'm open to changing my mind here if anyone has links to benchmarks that show otherwise...)


I usually see recommendations for a 72 character limit. I doubt there's any particular reason for 72, but as you say, it's enough.

Login attempts (should) get rate limited independently of password length limits, which makes the difference between hashing 8 characters and hashing 72 characters even less meaningful.


I'd guess it's because bcrypt only supports 72-byte inputs (though there are workarounds, like pre-hashing)
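A sketch of the pre-hashing workaround, showing only the pre-hash step with the standard library. Passing the result to an actual bcrypt library is left out, and `prehash_for_bcrypt` is a hypothetical name; the choice of SHA-256 plus base64 is one common variant, not the only one.

```python
import base64
import hashlib

BCRYPT_MAX_BYTES = 72


def prehash_for_bcrypt(password):
    """Map a password of any length to 44 ASCII bytes to feed into bcrypt."""
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    # Base64 avoids NUL bytes, which some bcrypt implementations treat as a terminator.
    return base64.b64encode(digest)
```

Without this step, two passwords that share their first 72 bytes would be indistinguishable to bcrypt; after pre-hashing, they map to different 44-byte inputs.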


Which is why you add a randomized wait/sleep before responding to an invalid login attempt, to hopefully slow down "tries". Having a cache of recent tries by IP would also assist in mitigation. The strength of the hashing is also an issue... for example, pbkdf2 is very compute heavy.

If you're working in an environment that has a FaaS (AWS Lambda, etc), it may be worthwhile to have this as an async function call so as not to block your primary application. Another option is to break authentication into its own application, and have that return a signed auth token to your application.

There are lots of options.


Just because there is a limitation on password length doesn't imply they store the password in clear text.


He never said anything about 'clear text'; you did. And yes, that is usually how it is with this 'tell'. It might be some other weakness. Smells like homebrew security. Nice. They are doing it wrong and the wrong shows. That is definitely going in my initial survey.


It does not mean they store the password in clear text, but the implication is surely there. (It does not mean what you think it means.)


It may not imply it, but it strongly suggests it.


Well, they're not wrong. Companies do store plaintext passwords as a customer service tradeoff. (A bad one, but they do it.) Chopping the length of a password to <32 chars is pretty correlated.


and storing plaintext passwords is unacceptable.


That's not true in every case. If you're just throwing it into a DB, then yes. But if you're encrypting it and storing it on an isolated server with the decryption keys on a separate server, it's not a huge deal.

Look, people on HN make a massive deal about passwords. One of my most shocking discoveries starting as a pentester was that "storing passwords in plaintext" would be a low-severity finding at best. Medium through critical vulns are reserved for findings that can own an app. That's how little password storage matters.

If you're relying on UPS preserving the secrecy of your 21-character master password that you're using across all your websites, you're doing it wrong. Yet the vast majority of users will do exactly that. The way to protect them is for critical services to use 2FA, which they do -- email, phone, insurance, etc all use 2FA or separate 4-digit passcodes now (USAA).

There have been so many password database leaks, yet the world moves forward. What is unacceptable is for Blue Cross to leak all your PII, yet the world moved on from that. CloudFlare leaked a huge amount of sensitive info. All of those matter way more than some password leaks.

If someone is going to target you, here's the most likely method: https://news.ycombinator.com/item?id=14919845


> as a pentester was that "storing passwords in plaintext" would be a low-severity finding

I suspect that's because you're viewing the situation as a pentester not a user. A plaintext password (on its own) doesn't do a pentester much good until they've already gained control of the system. However, once someone has control of the system then plaintext passwords are a threat to users because a lot of people are vulnerable to common password reuse.


But what's the point of even bothering to encrypt a plaintext password at all, let alone "storing it on an isolated server with the decryption keys on a separate server" unless there's an automated way for a human to see the plaintext?


If you're using the outlined method, you're very, very far from storing it in plain-text; much further than the woefully common `sha1(passwordtext) // voila, secure!`.


They're not plaintext, we ROT-13 encode them first!


You should ROT-13 them twice for extra security.


Do a bit shift after that.


Also for integrations with other, often older, systems.


The argument is transitive - why are these other systems limiting the password length?


Because they're older?


I was integrating with an application recently where some of the header file copyright dates are in the '80s. Sometimes there are _very_ old but business critical systems that everybody knows need replacing, but which "just work" and for which the risk and/or opportunity cost of replacing the old system is so high it's not been done.

(This application has had three failed rewrite/replace projects over the last 10 years. They're now onto their fourth one, and they're running 32 months over deadline on a "2 year" project timeline...)


Unless you are using a passphrase, which is becoming more and more common.

With a limitation of 27 characters it's not possible to make my password be "if i forget this i'm in a lot of trouble".


KeePass uses an auto-type feature; wouldn't that simulate individual key presses, and defeat anti-paste mechanisms?


I find that on some sites if I click in the username field and alt tab to keepass or any other function I'm suddenly not in the txt field on the site... thus I can't trigger auto type at times.


It's unfortunate that this workaround is needed, but in KeePass you can set a custom keystroke sequence to define what actions are performed during auto-type. Right click > "Edit/view entry" > "Auto-Type" tab > Override default sequence:

I then entered {DELAY 3000}{PASSWORD}

Now I can log in to a full screen game that doesn't allow pasting (I type in the username by hand first). If I'm alt-tabbed out of the game with KeePass in focus, the three second delay is enough time to go from triggering auto-type to restoring the full screen game window and clicking on the password field to give it focus. I was unable to trigger KeePass autotype with the game already full screen.

If the account you're trying to log in to also has a long random username that needs to be auto-typed, you'll probably want a sequence like "{DELAY 3000}{USERNAME}{TAB}{PASSWORD}". Or, if the form doesn't allow you to tab from the username to the password field, you could use "{DELAY 3000}{USERNAME}{DELAY 3000}{PASSWORD}" and you can click on the password field during the second delay.


Thank you!


Well for that you can probably turn off JavaScript or use the web inspector to enable paste.


I can. My wife, who I've taught to use a password manager, probably can't. And neither of these excuses a 27 character limit that strongly suggests my password is being stored unencrypted somewhere.


Not sure how that suggests that at all. User inputs have some sort of cap (don't want someone using a 2GB string as their password). So naturally there's a conversation at some point about "what's our maximum password size".

If that conversation starts off at 1000 characters you're fine, but more often than not it looks more like:

"Make the requirement 8-12 characters"

"12 is too short"

"fine make it longer, like"

"Ok" [18 char implementation]

"Hey, Bob in accounting says he uses 20 char passwords"

"Fine, bump it by another 50%"

[27 char implementation]

[no further internal complaints]

[Specs never updated or reviewed again]


It may well be encrypted, but not hashed...


use X windows - middle-click-to-paste usually gets around these 'no paste here' thingies. :)


And then the rest of the login stops working (and the character limit might be enforced again serverside, or worse, silently truncated). Plus, go try that on a smartphone.


I don't get it.

>Do not send any password you actively use to a third-party service - even this one.

So I can only test passwords that I am not using (and by extension that I am not going to use in the future).

>oh no - pwned!

>This password has previously appeared in a data breach and should never be used. If you've ever used it anywhere before, change it immediately!

If I cannot (shouldn't) submit any password I am actively using, what does it matter if I used it before? Now I already changed it.


I believe the idea is to ensure no one can use the listing to brute force.


I think gp is complaining that the second you type your password into the form, you've "used it", hence you should change it.

The gp makes a good point, but that's also why you can submit the `sha1($your_password)` instead. The only question is why did Troy allow un-hashed passwords to be submitted.


Maybe - even better - if you could submit only - say - first 8 characters of the SHA1 (and NOT the complete hash) and provide - still say - max 10 "whole" hashes found with that 8 char beginning (if more than 10 ask for a ninth char).

I mean, here is the SHA1 of my password (not really):

d012f68144ed0f121d3cc330a17eec528c2e7d59

This site:

https://hashkiller.co.uk/sha1-decrypter.aspx

>We have a total of just over 312.072 billion unique decrypted SHA1 hashes since August 2007.

Took exactly 221 ms to reverse it to "pippo".
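The prefix-lookup idea from the start of this comment can be sketched as follows. Everything here is hypothetical (`build_index`, `lookup_prefix`, `is_pwned`, the in-memory index); it only illustrates that the client can send 8 hex characters and compare the full hash locally. For simplicity it raises an error instead of asking for a ninth character.

```python
import hashlib
from collections import defaultdict

PREFIX_LEN = 8    # hex characters sent by the client, as proposed above
MAX_RESULTS = 10  # cap on full hashes returned per prefix


def build_index(known_hashes):
    """Server side: group full SHA-1 hex digests by their leading characters."""
    index = defaultdict(list)
    for h in known_hashes:
        index[h[:PREFIX_LEN].lower()].append(h.lower())
    return index


def lookup_prefix(index, prefix):
    """Server side: return every stored hash that starts with the prefix."""
    matches = index.get(prefix.lower(), [])
    if len(matches) > MAX_RESULTS:
        raise ValueError("too many matches; a longer prefix is needed")
    return matches


def is_pwned(index, password):
    """Client side: reveal only the prefix, compare the full hash locally."""
    full = hashlib.sha1(password.encode("utf-8")).hexdigest()
    return full in lookup_prefix(index, full[:PREFIX_LEN])
```

The server learns 32 bits of the hash rather than all 160, so a lookup table like the one linked above cannot reverse the query on its own.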


Going to generate a bloom-filter from this dataset tonight.

Troy mentions some arguments against torrents, but it is better to have an authoritative torrent than none, imo.


A signed minimal perfect hash function may be a better bet. You can get down to around 2.68 bits per key plus w sign bits where the false positive rate is 2^-w.

For false positive rate of 2^-9 (0.00195) that's 447 MB, which is slightly less than an optimal bloom filter for the same number of items, and lookups will be considerably faster.

Construction time and the fact that you can't add to it without rebuilding the whole thing are the downsides. But given the application I don't think they matter much.

http://sux4j.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/Minimal...
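The size estimate in this comment can be reproduced with a few lines of arithmetic. The item count is the 306,259,512 figure quoted elsewhere in this thread, the 2.68 bits per key comes from the comment itself, and the Bloom filter line uses the standard optimum of log2(1/p) / ln 2 bits per key. Function names are hypothetical.

```python
import math

N_ITEMS = 306_259_512     # item count quoted elsewhere in this thread
MPHF_BITS_PER_KEY = 2.68  # figure quoted in the comment above


def signed_mphf_bytes(n, sign_bits):
    """Size of a minimal perfect hash function plus w signature bits per key."""
    return n * (MPHF_BITS_PER_KEY + sign_bits) / 8


def optimal_bloom_bytes(n, false_positive_rate):
    """Size of an optimally tuned Bloom filter for the same false positive rate."""
    return n * math.log2(1 / false_positive_rate) / math.log(2) / 8


mphf_mb = signed_mphf_bytes(N_ITEMS, 9) / 1e6
bloom_mb = optimal_bloom_bytes(N_ITEMS, 2 ** -9) / 1e6
```

With these inputs `mphf_mb` comes out at roughly 447 and `bloom_mb` at a little under 500, in line with the comparison made above.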


Is a bloom filter worth it in this case? With the optimal "k" of 10 hash functions and a "p" error rate of 0.1% (false positives of approximately 1 in 1000), a bloom filter for the 306,259,512 items will take 538 MB. Increasing the error rate to 1% (1 in 100) is still 358 MB. That's a sizeable filter to maintain in memory (then again... RAM is cheap).

I'd probably just shove the passwords into a database, limiting the index prefix to the first X characters to reduce index size.
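For anyone who wants to play with the numbers, here is a sketch of the standard Bloom filter sizing formulas together with a toy filter. It is an illustration under stated assumptions (double hashing derived from one SHA-256 digest, a plain `bytearray` for the bits), not a tuned implementation; `bloom_parameters` and `BloomFilter` are hypothetical names.

```python
import hashlib
import math


def bloom_parameters(n_items, false_positive_rate):
    """Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes."""
    m_bits = math.ceil(-n_items * math.log(false_positive_rate) / math.log(2) ** 2)
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes


class BloomFilter:
    def __init__(self, n_items, false_positive_rate):
        self.m, self.k = bloom_parameters(n_items, false_positive_rate)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

For 306,259,512 items and a 1 in 1000 false positive rate, `bloom_parameters` gives 10 hash functions and roughly 14.4 bits per item, which is in the same ballpark as the sizes quoted above.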


Distributing a 538 MB file (which can be compressed further) is much easier.


What are the actual use cases where this size difference matters?

If distributing to a general audience, 0.5GB and 10GB isn't that much of a difference, and most people are more equipped for handling lists of strings than for handling bloom filters.


>which can be compressed further

Can it? I think of a bloom filter as similar to a lossy compression scheme and wouldn't expect it to be further compressible to any significant extent using a general purpose lossless scheme. Similar to how general purpose compressors generally don't do very well with mp3s or jpgs.


Reducing the size by ~95% in exchange for a 0.1% error rate seems like a pretty nice tradeoff to be able to make for some uses.

The nature of the data means it can never really be "perfect" anyway (there are certainly some password breaches that exist but aren't included in the list), so massively reducing the resources required in exchange for a bit of artificial error seems pretty reasonable to me.


If it doesn't have to be really fast, you can simply binary search fetching no more than 29 lines from the file. Or you can interpolate the expected location of the hash and read a block around that location. This could get you down to reading only one or two disk blocks if the actual position is never more than half a disk block away from the interpolated position. It could be more but given that we are dealing with cryptographic hashes I would expect the interpolated position to never be too far away from the actual position.
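A sketch of the binary search described above, assuming the file is sorted and every line has the same width (40 hex characters plus a line terminator). The record length and the hex case are assumptions about the file layout, and `contains_hash` is a hypothetical name. Since 2^28 < 306,259,512 < 2^29, at most 29 reads are needed for the full list.

```python
import io


def contains_hash(f, target_hex, record_len):
    """Binary-search a binary file of sorted, fixed-width hash lines.

    record_len is the full line length in bytes, including the terminator.
    Returns (found, number_of_reads).
    """
    target = target_hex.upper().encode("ascii")
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell() // record_len
    reads = 0
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * record_len)
        line = f.read(record_len).strip().upper()
        reads += 1
        if line == target:
            return True, reads
        if line < target:
            lo = mid + 1
        else:
            hi = mid
    return False, reads
```

Usage would look like `contains_hash(open(path, "rb"), sha1_hex, 42)` for a file with CRLF line endings, where `path` is wherever the downloaded list lives.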


It's also a bummer there are only sha1's in the file. It would be good to block things within hamming distance 2 of a leaked password (so p@ssw0rd€ would also be blacklisted...)


You could go the other way, and check likely candidates from the new password - e.g. if user enters "hunter2", you check hunter and hunter0 and hunter1 and hunter3 and hunt3r2.
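A sketch of that approach: generate a handful of near neighbours of the new password and check each one's SHA-1 against the leaked set. The substitution table and the variant rules are illustrative assumptions, and `candidate_variants` and `any_variant_pwned` are hypothetical names.

```python
import hashlib

LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"}  # assumed substitutions


def candidate_variants(password):
    """Yield the password plus a few near neighbours worth checking."""
    yield password
    if password and password[-1].isdigit():
        stem = password[:-1]
        yield stem                  # trailing digit dropped
        for d in "0123456789":
            if d != password[-1]:
                yield stem + d      # trailing digit changed
    for i, ch in enumerate(password):
        if ch.lower() in LEET:
            # Single leet-style substitution at position i.
            yield password[:i] + LEET[ch.lower()] + password[i + 1:]


def any_variant_pwned(password, pwned_sha1_hexes):
    """pwned_sha1_hexes: a set of lowercase SHA-1 hex digests."""
    return any(
        hashlib.sha1(v.encode("utf-8")).hexdigest() in pwned_sha1_hexes
        for v in candidate_variants(password)
    )
```

For "hunter2" this produces hunter, hunter0, hunter1, hunter3 and hunt3r2 among others, so it works even though only hashes are available.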



I do agree with Troy that this could be useful to send to relatives and family:

> I'm envisaging more tech-savvy people using this service to demonstrate a point to friends, relatives and co-workers: "you see, this password has been breached before, don't use it!"

But I can't be the only one whose family would be baffled by the term "pwned". I wish it said something like "Your password has been hacked!" which we all know not to be technically correct but would resonate a lot more.


Too alarmist, and in wrong tense: people would start freaking out "by typing my password into this box, my password has been hacked". Resonance is not always a good thing, see Tacoma Narrows bridge ;)


>If a password is not found in the Pwned Passwords set, it'll result in a response like this:

Wait, so I test my password to see if it's "good" and now you have a copy of a password I will be using. Am I just being paranoid?


From the article: "It goes without saying (although I say it anyway on that page), but don't enter a password you currently use into any third-party service like this! I don't explicitly log them and I'm a trustworthy guy but yeah, don't."


You can post the sha1sum instead.

  $ sha1sum
  SooperSekretPassw0rd^D
  SooperSekretPassw0rddc0d3504b259a92dce59b850969601d12c06a75f  -


If you're on Windows, you can calculate it in PowerShell like this:

$password = "foobar"

([Security.Cryptography.SHA1CryptoServiceProvider]::Create().ComputeHash([Text.Encoding]::ASCII.GetBytes($password)) | %{'{0:x2}' -f $_}) -join ""


And people say that PowerShell isn't readable or intuitive.


You can also do it like this:

  "foobar" | Out-File password.txt -Encoding ASCII -NoNewline

  Get-FileHash password.txt -Algorithm SHA1

  del password.txt

That's readable and intuitive, but the downside is it puts your password in a file.


sha1sum is giving different results.

    /tmp$ echo "p@55w0rd"  | sha1sum
    8633c4a8b38a8826132414d8861af7b6a8371976  -

This is a different value from the one given in the blog post: "ce0b2b771f7d468c0141918daea704e0e5ad45db".

The python sha-1 hexdigest comes out right, though:

    In [13]: import hashlib

    In [14]: hashlib.sha1(b'p@55w0rd').hexdigest()
    Out[14]: 'ce0b2b771f7d468c0141918daea704e0e5ad45db'

In case anyone else has passwords they want to check, this will binary-search them: https://gist.github.com/coventry/5df7885fb0d5caeabb39fcd0e2b...


You need to use `echo -n` in order to not have `echo` generate a newline.
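The effect of the trailing newline is easy to reproduce. This small sketch hashes the same password with and without the newline that a plain `echo` appends.

```python
import hashlib

password = "p@55w0rd"

# What `echo "p@55w0rd" | sha1sum` hashes: the password plus a newline.
with_newline = hashlib.sha1((password + "\n").encode("utf-8")).hexdigest()

# What `echo -n "p@55w0rd" | sha1sum` hashes: the password alone.
without_newline = hashlib.sha1(password.encode("utf-8")).hexdigest()
```

The two digests differ completely, which is why the first attempt above did not match the value from the blog post.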


Ah, thanks.


You can also binary search with "look -b". The version included with Debian Unstable doesn't support large files, but it will search multiple files at once, so you can split without breaking lines, eg. with "split -C 1999m", and search with "look -b SHASUM splitfile*".


Thanks, I knew there had to be an easier way, but couldn't find it.


Make sure you clean your .history if you're using echo to pipe the password...


Thanks. I had done so.


This is Troy we're talking about - I strongly doubt he'd do anything like that without full disclosure.


I think the point of gp is to say that once you submit your password, you have no control of where it goes.

Maybe a malicious copy of the website exists at lots of LevenshteinDist=1 domains. Accidentally typo the domain and get pwned, thinking you are submitting it to an ethical security researcher's tool, but actually getting phished.


One really unfortunate aspect of the passwords being hashed is that there's no info available about their lengths. Knowing the lengths could allow you to reduce the size considerably when you enforce a minimum password length.

For example, if I have a site that requires passwords to be at least 10 chars long, I don't need any of the data for breached passwords that are shorter than 10 characters. People can't possibly use them anyway, so that's probably a huge chunk of the data that's completely useless to be storing and checking.


I've cracked* just under 99% of them so far (including the 14 million added in Update 1). Statistics are here:

https://gist.github.com/roycewilliams/b1de2afbfe5cb71bea16c9...

Regardless of composition, the top 12 lengths are:

    8: 32% (102260862)
   10: 14% (45084047)
    9: 13% (41525797)
    7: 10% (33632055)
    6: 06% (20211176)
   11: 05% (18275968)
   12: 04% (14052958)
   15: 02% (8291459)
   13: 02% (8042452)
   14: 01% (6321198)
   16: 01% (4201765)
    5: 00% (3054291)
In other words, requiring a minimum length of 12 would make 80% of the passwords in the corpus inapplicable.
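
If you want to try other minimum lengths, the arithmetic is simple to script. A small sketch using only the twelve counts listed above; because the denominator is just those twelve lengths and not the whole corpus, the result is an approximation. `share_below` is a hypothetical helper name.

```python
# Counts copied from the "top 12 lengths" table above: length -> passwords.
length_counts = {
    8: 102260862, 10: 45084047, 9: 41525797, 7: 33632055,
    6: 20211176, 11: 18275968, 12: 14052958, 15: 8291459,
    13: 8042452, 14: 6321198, 16: 4201765, 5: 3054291,
}

def share_below(min_length, counts):
    """Fraction of the counted passwords shorter than `min_length`."""
    total = sum(counts.values())
    too_short = sum(n for length, n in counts.items() if length < min_length)
    return too_short / total

print(f"{share_below(12, length_counts):.1%}")
```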

... and the top 12 masks are:

  ?l?l?l?l?l?l?l?l,47823614
  ?l?l?l?l?l?l?d?d,7005728
  ?d?d?d?d?d?d?d?d,6212778
  ?l?l?l?l?l?l?l?l?l?l,6023602
  ?l?l?l?l?l?l?l?l?l,5379482
  ?l?l?l?l?l?l?l?d?d,5169013
  ?l?l?l?l?l?l?l?l?d?d,5090400
  ?l?l?l?l?l?l?l,4998896
  ?d?d?d?d?d?d?d,4798329
  ?l?l?l?l?d?d?d?d,4798124
  ?d?d?d?d?d?d?d?d?d?d,4754401
  ?l?l?l?l?l?l?d?d?d?d,4377841
Almost 48 million of them are 8 lower-case characters.

* And to be clear, "cracked" is an overstatement. Many of his sources are public. Simply using those sources as wordlists makes "cracking" these like shooting fish in a barrel.
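
For readers unfamiliar with the mask notation above: in hashcat masks `?l` is a lower-case letter, `?u` an upper-case letter, `?d` a digit and `?s` a space or punctuation character. A minimal sketch of how a plaintext maps to its mask; `to_mask` is a hypothetical helper, and falling back to `?b` for anything outside printable ASCII is a simplification.

```python
import string

def to_mask(password):
    """Map each character to a hashcat-style charset placeholder."""
    parts = []
    for ch in password:
        if ch in string.ascii_lowercase:
            parts.append("?l")
        elif ch in string.ascii_uppercase:
            parts.append("?u")
        elif ch in string.digits:
            parts.append("?d")
        elif ch == " " or ch in string.punctuation:
            parts.append("?s")
        else:
            parts.append("?b")  # simplification: anything outside printable ASCII
    return "".join(parts)

print(to_mask("password"))
print(to_mask("monkey12"))
```

Counting how often each mask occurs over a list of plaintexts gives a table like the one above.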


Oh, very cool, thanks for posting. I had actually written a really basic cracker and started seeing if I could figure out how many of them belonged to shorter passwords, but you're doing a much, much better job of it than I am (I was just brute-force generating short passwords, no wordlists or anything).

Are you planning to make a blog post or anything "final" with the info you find out, or will you just keep updating those gists?


You're welcome! And "¿por que no los dos?" :) I'll keep the gists updated and will also do a blog post, I think.

With a tool like hashcat, a modern GPU or two, and some publicly available wordlists, you can get the vast majority of them without breaking a sweat.

In other words: there is almost no value in hashing them with SHA1.


If I test my passwords, aren't they also now pwned?


Not if you grep locally. How big is this data set? It can't be much bigger than a AAA video game download.


It's 11.9 GB of text (5.3 GB zipped). So smaller than quite a few video game downloads.


Fairly low compression rate for zipped ASCII, but I guess it is mostly a giant pile of nearly random strings.
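
A rough way to sanity-check that intuition: each hex character takes 8 bits on disk but carries only 4 bits of information, so something around half the original size is about what a general-purpose compressor can manage on random hashes. The sketch below is an assumption-laden stand-in, not the real data: it generates random 160-bit values, writes them as sorted hex lines and compresses them with `zlib`.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Stand-in for the real file: sorted, random 160-bit values as uppercase hex.
lines = sorted(f"{rng.getrandbits(160):040X}" for _ in range(20_000))
raw = ("\n".join(lines) + "\n").encode("ascii")

packed = zlib.compress(raw, 9)
ratio = len(packed) / len(raw)
print(f"{len(raw)} bytes -> {len(packed)} bytes ({ratio:.0%} of original)")
```

The 5.3 GB zip of the 11.9 GB file works out to roughly 45% of the original size, which fits that picture.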


In a word: Yes


test the SHA1 of your passwords, not your passwords, and you are safe.


A weak password can still be cracked given the sha1 hash and a dictionary.
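
To make that concrete, here is a minimal sketch of a dictionary attack against an unsalted SHA-1 hash. The four-word list and the function name `dictionary_attack` are made up for illustration; real wordlists contain millions of entries, and dedicated tools are far faster than this loop.

```python
import hashlib

def dictionary_attack(target_hex, wordlist):
    """Return the first candidate whose SHA-1 matches `target_hex`, else None."""
    target_hex = target_hex.lower()
    for candidate in wordlist:
        if hashlib.sha1(candidate.encode("utf-8")).hexdigest() == target_hex:
            return candidate
    return None

# Hypothetical four-word "dictionary".
words = ["123456", "password", "letmein", "p@55w0rd"]
leaked = hashlib.sha1(b"letmein").hexdigest()
print(dictionary_attack(leaked, words))
```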


If you're the type of person to read the article, you aren't likely to submit a basic dictionary password. Either your passwords have more entropy or you know better than to submit them.


Is it safe to test my password on this website? (because I just did)


From the article:

"It goes without saying (although I say it anyway on that page), but don't enter a password you currently use into any third-party service like this! I don't explicitly log them and I'm a trustworthy guy but yeah, don't."

Safe: probably. Good practice: no.



Hehe, I was just about to build something similar. I like the extra touch of it not even being served over https.


Also the option to test another password is nice, and the unsatisfied curiosity you're left with if you go down that route.


worth reading the T&Cs for a chuckle


I should have read that before agreeing to it...

Well, I better get prepped to fight the Estatis Inc. Retaliatory Creature to retain full ownership of my soul. Thanks for the heads up.


If you only have one password to try, I'd say you have more important issues to tackle. Install a password manager, start using generated passwords, and stop using 'your password'.


Well, in that case, “your password” might be the only one you have – the one you use for your password manager.


Well, in that case, it should already be something that's staggeringly improbable to be used by anyone else, and for your use, you should never ever ever send this password through any network interface at all (including this check page).


salt "your password" with https://www.passwordcard.org/en


"Pick a password length. Eight is pretty secure and usually acceptable." Wait, what? Unless you're suggesting that I append this to a password.


you can choose any area of the card, say 5 letters and a few keywords.


Disclosing a password weakens that password, since passwords only work when they are kept private.


You're probably okay, but Troy probably has the query stored somewhere, hence, he doesn't recommend that.

It is much, much safer to download the data and search for your passwords from inside the local copy of the data, and that's exactly what I'm gonna do later tonight.


I used one password for a decade or so in the 90s and early naughts, and though I've since moved on to LastPass and two-factor for everything, this is the first time that password has appeared in one of these databases.

Guessing it was in MySpace..

Ironically I used another password for sites I trusted less and that one isn't in there.


As others have pointed out, the use case for pasting plain text passwords is not quite clear. Maybe it would be a good idea to allow searching for hashes only, or at least hash the password in js on the client.

Also, I'm genuinely curious as to why SHA-1 is used and not SHA-256. Surely the one-time additional cost of using SHA-256 would've been negligible for Troy? If at some point somebody manages to do preimage attacks on SHA-1, I have to assume my password is broken if I've submitted its hash to his API. Although I guess you'd have to actually be able to enumerate preimages, preferably from small to big. Still, I don't understand why Troy doesn't account for the possibility by using a hash function widely considered to be stronger.


I believe the sources for the data breaches were mentioned in the article, so if someone wanted to get those sources anyway, it wouldn't be a big deal.


He stops well short of saying "here's a link to the pastebin dumps!", but yeah - it wouldn't take too much google-fu to build your own version of this - perhaps not a 300M entry one, but I doubt it'd take more than a weekend to get halfway there if you wanted to.


What would be the best data structure for using this in, say, a Python script? I imagine just putting it into a dictionary (hash map etc.) won't work because of the size.


Bloom filter :-)
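
A Bloom filter trades exactness for size: it never misses an entry that was added, but it occasionally reports an entry that was not. Below is a toy pure-Python sketch, not a tuned implementation. The class, its parameters, and the trick of deriving positions from one SHA-256 digest are illustrative choices. As a rule of thumb, a 1% false-positive rate needs roughly 9.6 bits per entry.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(num_bits=10_000, num_hashes=7)
for pw in (b"123456", b"password", b"p@55w0rd"):
    bf.add(hashlib.sha1(pw).hexdigest().encode("ascii"))

print(hashlib.sha1(b"password").hexdigest().encode("ascii") in bf)
```

In practice you would add every line of the downloaded file once, save the bit array, and treat a hit as "probably breached".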


If you want it lossless, a trie can be great. Not sure if one implemented in pure Python would be worth it though; there should be a package that implements it in C.


Someone should apply deep learning to this and check how it compares with brute-forcing passwords. E.g. https://github.com/thoppe/5baa61e4c9b93f3f0682250b6cf8331b7e...


That's the point here. We know 300M passwords leaked, but we can't investigate these passwords and learn from this data. It's sad that researchers have to turn to the black market if they want plain-text passwords. I'm pretty sure everyone who needs these passwords can find them.


Can't, as the passwords aren't available in plain-text. Only as sha-1 hashes.


Many of these passwords are one or two characters in length. I think the 300 million number is inflated for publicity. Who allows a password that only has one character?

Go here and type the character 'a':

https://haveibeenpwned.com/


There are 9120 printable ASCII strings of 1-2 characters in length. Cunning, including that many to bulk out the list :)
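
The 9120 figure checks out: printable ASCII (space through `~`) has 95 characters, which gives 95 one-character strings plus 95 × 95 two-character strings. A quick check in Python:

```python
# Printable ASCII runs from 0x20 (space) to 0x7E (~) inclusive.
printable = [chr(code) for code in range(0x20, 0x7F)]

one_char = len(printable)
two_char = len(printable) ** 2

print(one_char, two_char, one_char + two_char)  # 95 9025 9120
```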


;) OK. But there are a ton of 3 and 4 char passwords too. I mean, what sort of site would allow that? It's just hard to believe these were actually passwords.

BTW, I upvoted your comment. It made me laugh. Point granted.


FWIW - I do that all the time.

If someone's website forces me to set up an account without me already being convinced there's enough benefit to me in return for my personal information, they're likely to get a signup for test@example.com with password "foo". (And if they then respond with "please click the confirmation link in the email we just sent", I'll sign up again with $sitename@$spare-domain-I-own.com and pick up the link from the spam filtered catchall account).

I probably do this at least once a month when there are hints of something useful in a web forum I want to read, but I'm not (yet or ever) convinced I'll ever become a member of that forum's community...


Quick script to binary-search for passwords locally: https://gist.github.com/Freaky/4cb7ce8c107c3da2e4a8210356e8d...


Interesting - "correct horse battery staple"[0] is flagged as not being in the data set. I was sure someone should have used that by now.

[0] https://xkcd.com/936/


It shows up if you remove the spaces.


And with spaces and uppercase first letters.


I'm enjoying typing in things that I've never used as passwords (my own name, for example) to see if anyone else at some point thought these things would make for good passwords.


HIBP provides a REST API to check whether a password has been found in a breach. Is there any disadvantage to using it in applications to stop users from choosing a breached password?


It's not ideal to send every new user's password to a 3rd party service.


Hence the downloadable file, and even a suggestion to use that as an in-house checker. It's in the article.


you can still send the SHA1


Without salt, meaning the majority of passwords can be reversed with brute-forcing or rainbow tables.

The second Google result for rainbow tables lets me download software and tables to efficiently reverse any SHA-1 whose plaintext fits [a-zA-Z0-9]{1,9} or [a-z0-9]{1,10}. That's likely the majority of passwords an attacker would observe.
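
To get a feel for how much ground those two patterns cover, you can count the strings each one matches. A small sketch (`keyspace` is a hypothetical helper name):

```python
def keyspace(alphabet_size, max_length):
    """Number of strings of length 1..max_length over the alphabet."""
    return sum(alphabet_size ** k for k in range(1, max_length + 1))

mixed_alnum = keyspace(26 + 26 + 10, 9)   # [a-zA-Z0-9]{1,9}
lower_alnum = keyspace(26 + 10, 10)       # [a-z0-9]{1,10}

print(f"{mixed_alnum:.3e}")
print(f"{lower_alnum:.3e}")
```

Both come out in the quadrillions, which is why precomputed tables beat brute-forcing each hash from scratch.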


Whoa, wait a second:

> Each of the 306 million passwords is being provided as a SHA1 hash.

That's it? Without any salting? This would make it trivial to recover the plain text using rainbow tables.


Troy explains where the majority of them came from - it'd be even more trivial to Google for the dumps yourself.

It's not as if anyone curious enough to go find out about stuff like this doesn't already have the RockYou dump, and there are easily enough links to start your own list here: https://www.google.com/search?q=password+lists


Would it make sense to host the file on a cheap OVH/Scaleway VPS with unlimited bandwidth? I guess CloudFlare doing it for free beats that though!


"Unlimited" as in, "as soon as you start becoming a problem we drop you".


OVH lets you use the advertised bandwidth 24/7, if you wish to do so. Naturally, this doesn't replace a CDN, and their peering isn't the best, so some routes might be congested and won't be that fast.

I have no experience with Scaleway on this, but based on what I've heard about them in the past, I imagine their policy is roughly the same.


AFAIK these services are operated by ISPs with lopsided peering agreements.


I'm confused. If a website salted their hashes, wouldn't it not matter if the password alone was "pwned"?


If a website salts their hashes, that only helps protect their users if that website is hacked and their password database stolen.

But 306 million passwords have already been exposed by data breaches at other sites. And users have a tendency to reuse passwords across multiple sites. Just because your website wasn't hacked doesn't mean an attacker can't go look up one of your users in someone else's data breach and try the same password on your site.
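
A small sketch of that distinction. It uses PBKDF2 from Python's standard library as a stand-in for whatever salted scheme a site really uses, with made-up helper names. The same reused password produces different stored values under different salts, so one site's leaked hash doesn't match another's. But anyone who already knows the plaintext from another breach can still log in with it.

```python
import hashlib
import hmac
import os

def hash_password(password, salt):
    # Illustrative parameters only; not a recommendation for production.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

def verify(password, salt, stored):
    return hmac.compare_digest(hash_password(password, salt), stored)

reused = "p@55w0rd"

# Two sites, two random salts, same reused password.
salt_a, salt_b = os.urandom(16), os.urandom(16)
stored_a = hash_password(reused, salt_a)
stored_b = hash_password(reused, salt_b)

print(stored_a == stored_b)              # salts make the stored values differ
print(verify(reused, salt_b, stored_b))  # the known plaintext still logs in
```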



