0. This is a file of SHA1 hashes of short strings (i.e. passwords).
1. There are 3,521,180 hashes that begin with 00000. I believe that these represent hashes that the hackers have already broken and they have marked them with 00000 to indicate that fact.
Evidence for this is that the SHA1 hash of 'password' does not appear in the list, but the same hash with the first five characters set to 0 is.
5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8 is not present
000001e4c9b93f3f0682250b6cf8331b7ee68fd8 is present
Same story for 'secret':
e5e9fa1ba31ecd1ae84f75caaa474f3a663f05f4 is not present
00000a1ba31ecd1ae84f75caaa474f3a663f05f4 is present
And for 'linkedin':
7728240c80b6bfd450849405e8500d6d207783b6 is not present
0000040c80b6bfd450849405e8500d6d207783b6 is present
2. There are 2,936,840 hashes that do not start with 00000 that can be attacked with JtR.
3. The implication of #1 is that if checking for your password and you have a simple password then you need to check for the truncated hash.
4. This may well actually be from LinkedIn. Using the partial hashes (above) I find the hashes for passwords linkedin, LinkedIn, L1nked1n, l1nked1n, L1nk3d1n, l1nk3d1n, linkedinsecret, linkedinpassword, ...
5. The file does not contain duplicates. LinkedIn claims a user base of 161m. This file contains 6.4m unique password hashes. That's 25 users per hash. Given the large amount of password reuse and poor password choices it is not improbable that this is the complete password file. Evidence against that thesis is that password of one person that I've asked is not in the list.
I thought 16k entries might be reasonable but that doesn't even last 3 weeks for me. I think there might have been some issue with slow disk seeks so at some point I restricted it to that many.
I guess it probably it would be better to regularly backup the history file to deal with possible some accidental truncations and issues when running multiple shells concurrently, but probably the overall effort to set up such a system would outweight the benefits.
If you want coverage, generate a few hundred thousand SHA1 hashes along with your password.
Actually, running a trickle query of random SHA1 hashes from your box might be a fun exercise, along with a trickle query of random word tuples (bonus points for using Markov chains to generate statistically probable tuples).
If you search for 'sha1 foo', that's being sent across the network to DDG's servers. And sure, if you're using SSL then it's not going across in plain text, but it's decrypted and handled on their servers in plain text; it'll probably even end up in logs and/or tracking databases somewhere. You're giving DDG your password.
At worst you're giving the attacker a hash target to try brunting. He still has to brute it, and that takes time. Select your plaintext from a large enough keyspace and it's astronomical time.
I'll need to review their policy more closely, but DDG claim fairly minimal tracking. At best someone might be able to correlate hash lookup with some IP space. That's a long way from handing over passwords. And as I already indicated, you could cradled the queries to make the search space much larger.
No, no, no. You're 100% completely misunderstanding this.
When you search for 'sha1 foo', that query ("sha1 foo") goes up to the server. They know your password is "foo" and that you're attempting to "sha1" it. They don't have a hash, they take that data and perform the hash, then send that down to you.
[xargs] allows you to pipe the output of one command as an argument to another command. By default it will show up at the tail end of the second command's arg list, but if you want to interleave it you can use -I flag:
I'm surprised at the backlash to what I thought was fun code golfing. No one called me names after I posted a simple Python solution that didn't check the file. For what it's worth I've changed my LI password and I haven't bothered downloading the actual hash file.
node has a neat API for quickly knocking out stuff like this; it's a useful tool for more than just server code. Calling that comment fanboyism is just displaying the opposite of fanboyism, prejudice against hyped-up tools that nevertheless are good tools.
"Which brings us to the most important principle on HN: civility. Since long before the web, the anonymity of online conversation has lured people into being much ruder than they'd dare to be in person. So the principle here is not to say anything you wouldn't say face to face. This doesn't mean you can't disagree. But disagree without calling the other person names. If you're right, your argument will be more convincing without them."
Some people actually do call names to others when face to face.
Personally, while I don't, I do tend to get a little aggressive and then I'm often surprised with the backlash, because I get that way when I'm genuinely enjoying the conversation, not when I'm irritated.
The first one ramps up memory use like crazy (which I was trying to avoid) and the second one is much better with memory, but you need to move the sha1_hex into the BEGIN block or you're recomputing the hash for every line parsed, thrashing your CPU. Interesting use of 'shift' though, I didn't know you could modify the file argument to -n like that.
dups is indeed a little helper of mine. Like uniq it only handles sorted input. Update: I see you edited your answer to include uniq -d. I wasn't aware of the option, thanks. Now I can simplify the implementation of dups. But I find the name valuable, and I think it's perverse to say uniq when you mean its opposite.
I'm not sure what you're suggesting. I'm supposed to echo |cut ...? But I have a whole file, not just one line. So I have to cat ... |cut ... -- which is what I did. So what's your point?
I could keep the file first by saying:
$ < combo_not.txt cut -c7-40 |sort |dups |wc -l
To which I reply, "Yuck!"
Perhaps we should stop here. You seem to have made this account just a few hours ago for the express purpose of poking at people's code fragments in this thread. You're making stylistic nitpicks (they don't affect correctness, do they?) and you're making them in a tone that I'm not sure I would take from Randal Schwartz himself (you actually edited http://news.ycombinator.com/item?id=4076556 to be ruder than the original). It's a drag, man.
I disagree with #5, I had a few of my coworkers check their sha1 against the DB and most of them were not in the dump. I also checked for truncated hashed, none of which were found. I have the feeling this is a subset of the full database
So I have a funny wild theory...remember back when the Gawker database was compromised? And LinkedIn forced a password reset for users who (according to what I read) used email addresses that matched the Gawker leak?
What if they also (or actually) compared password hashes from their database to the ones released in the Gawker breach? In that case, they likely wouldn't have pulled data straight from the database but actually might have pulled passes from the db, output to text files, cut the text files up to parcel out for processing via Hadoop or something? And somehow one of those text files got loose somehow...or someone MiTMed the actual process (I'd vote for a floating text file just because it's been so long; the Gawker breach was in December 2010).
my fairly complex alphanumeric+symbol password IS in the dump, though not prepended truncated with 0's and the other one I found, which my coworker admitted was too short and alpha only, was in the dump with prepended 0's.
This could validate the fact that the truncated hashes are actually already cracked.
My password is in the dump. I use the Forget Passwords Chrome extension , which is based on pwdhash.com, and generate site-specific passwords based on a master password -- i.e. my password is only used on LinkedIn and it's unlikely that I share it with someone else.
I think I have changed to this password during the last year.
Hmm. My truncated password (for my now-deleted account) is not in the list of hashes -- so it's not just a uniq'd full DB. Also, the original forum thread where the file was first posted only managed to break around 600,491 passwords before it went offline ... so 3,521,180 broken passwords could mean that the original hacker has had access to some LinkedIn accounts for more than just a few minutes today.
This yielded success on some known passwords and a bunch of obvious passwords. Not mine, but I assume this dump is a list of the passwords they've cracked so far (i.e., even if your password isn't on this list - change it).
A stock JtR 1.7.9-jumbo5, using the default rules, is finding quite a few of the non-zeroed ones pretty quickly. This surprises me; I would have expected them to have run the list through the JtR mill before passing it on to others.
Likewise, my password (MybXy836YCza), which wasn't used anywhere except my LinkedIn account created 29-Jan-2012, and has been stored securely at my end, wasn't on the list (either as a full SHA1 sum, or as part of the SHA1).
As you probably guessed from the fact that I posted my old password, I changed it just in case the list that was shared is only a partial list of what was obtained.
Do you remember when you first used this password at LinkedIn? It could help narrow the dates of the breach. Especially useful would be the presence of a strong password in the list that was subsequently changed. That might help determine its freshness, if the new password isn't present (although this may be an incomplete list from an ongoing breach).
I couldn't find my password on the list and I've been using the same password for LinkedIn since I registered. I was trying to remember when was that. If someone know how to find out the last time you changed your pass or when you registered for linkedIn please let me know. I'd guess I use linkedIn for over 4 years at least.
A "member since" date is available on the "Account & Settings" page. Choose "settings" in the drop down that appears when you hover over your (account) name in the upper right corner of any LinkedIn page.
I'm not a math person either, but here's some fodder for someone who is.
Mark Burnett's extensive password collection (which he acknowledges is skewed, because it's largely based on cracked passwords, he only harvests passwords between 3 and 30 chars, etc.). Here's how some of his stats shake out:
* Although my list contains about 6 million username/password combos, the list only contains about 1,300,000 unique passwords.
* Of those, approximately 300,000 of those passwords are used by more than one person; about 1,000,000 only appear once (and a good portion of those are obviously generated by a computer).
* The list of the top 20 passwords rarely changes and 1 out of every 50 people uses one of these passwords.
So it's conceivable that 6M unique passwords could cover a very significant portion of a 120M user namespace.