It's indeed weird that "00000" would be the hash prefix with the highest number of entries. I think there must be a hidden variable. Perhaps some sources put an all-zeroed-out hash in the database for testing, for registration errors, or for deleted users, and those show up here.
Great thought, but that doesn't seem to be the case: it's the number of unique suffixes that is unusually large here - in fact, none of the values in the range are simply all zeroes.
I wonder if the hidden variable is something to do with how the passwords are leaked. First, let's suppose that a very commonly used broken password hash is plain SHA-1 (I think that's a valid assumption - unfortunately!). Then, let's figure that among the many data dumps / extracts done by hackers, some of them are only able to extract part of the database, or save part of the database, or whatever... and the rows are fetched / saved / uploaded in lexical order?
Can't think of anything else.
EDIT: Oops. The other thing is that these actually are SHA-1 hashes of real plaintext passwords. So it's definitely not a test row in that sense.
Maybe crypto people who have brute-forced up some typeable passwords that hash to low numbers on the first SHA-1 pass, for a fun-and-games equivalent to a Proof of Work? (It'd only show up in actual DB dumps for backends that use "SHA-1 with no salting" for password hashing, which might also serve as a useful canary value.)
Great idea! I ran a quick hashcat against the range 00000 list on my laptop. In 1 minute I cracked 79 of them, and not many of them look odd - that is, they look more or less like normal cracked passwords.
I'm asking my friend to run a more thorough crack on his dedicated GPU, especially for the entry 000DD7F2A1C68A35673713783CA390C9E93:630, which does stick out to me!
Note that in the description below, I refer to any keyed permutation as a block cipher. One may make a semantic distinction, but any keyed permutation could be used as a block cipher (though, of course, most such permutations would contain trivial cryptographic weaknesses).
SHA-1 is based around a 160-bit unbalanced Feistel block cipher. The input is broken into blocks, where the final block contains padding and a count of the total amount of data processed. A copy of the 160-bit state is made, the state is encrypted using a block of the input as the key, and the saved copy is then added (without carries between 32-bit words) to the result of the encryption. This is repeated for each input block in turn. This is called a Davies-Meyer construction for making a hash function out of a block cipher.
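For illustration, here's a minimal Python sketch of the Davies-Meyer mode. The `toy_encrypt` function is just a stand-in for the real inner cipher (SHACAL-1, in SHA-1's case) with no cryptographic merit; the point is the copy / encrypt / add-back pattern:

    MASK32 = 0xFFFFFFFF

    def toy_encrypt(key_block: bytes, state: list) -> list:
        # Placeholder keyed permutation over five 32-bit words. For a
        # fixed key, each step (rotate, XOR) is invertible, so this is
        # a permutation - but NOT a secure cipher.
        out = list(state)
        for i, b in enumerate(key_block):
            w = out[i % 5]
            w = ((w << 1) | (w >> 31)) & MASK32  # rotate left by 1
            out[i % 5] = w ^ b
        return out

    def davies_meyer(blocks, iv):
        state = list(iv)
        for block in blocks:
            saved = list(state)                 # copy the chaining state
            state = toy_encrypt(block, state)   # message block acts as the key
            state = [(s + c) & MASK32           # feed-forward: add saved copy,
                     for s, c in zip(state, saved)]  # no inter-word carries
        return state

    # Example, using SHA-1's actual initial values as the IV:
    iv = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    print(davies_meyer([b"block one", b"block two"], iv))

Dropping that final addition would make each compression step invertible, which is exactly the weakness discussed below.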
For any Davies-Meyer hash function, the block cipher is invertible and therefore unbiased. The addition is invertible and unbiased. Any bias would therefore have to come from non-zero correlation between addition and encryption. For any moderately complex block cipher, these correlations would be very complex. Real world design of Davies-Meyer hash functions focuses on absolutely minimizing any patterns present, and cryptanalysis focuses on characterizing and approximating any and all minute patterns that escape the design process.
There are some patterns (weaknesses) in SHA-1, but all known weaknesses are way more complex (and minuscule) than could explain the sort of bias seen in this data set, so the bias must be coming from a higher-level source than SHA-1 itself.
On a side note, the addition in Davies-Meyer is there to intentionally make the round function non-invertible. If the round function were invertible, there would be a trivial birthday attack on the intermediate state between rounds that square-roots the strength of the hash function. MD4, MD5, SHA-224, SHA-256, SHA-384, and SHA-512 are all Davies-Meyer constructions using unbalanced Feistel ciphers. RIPEMD-160 is a parallel application of two Davies-Meyer hashes with different initial values, followed by XORing the two outputs to obtain the final output. SHA-3 is the most notable cryptographic hash function that's not a Davies-Meyer construction.
In case you're wondering, one could make a Davies-Meyer-style hash function using an AES-like cipher. One of the designers of AES (together with Paulo Barreto) took the AES design, doubled the word size, doubled the number of words, and fixed a deficiency discovered in the nonlinear byte substitution. The resulting hash function is called Whirlpool, and the underlying block cipher is called W. I'm not aware of any use of W outside of Whirlpool.
The Salsa/ChaCha families of stream ciphers and the Blake family of hash functions are all very similar to each other. They all use a very similar family of (unnamed) block ciphers internally that are twice the size of the desired output. They achieve non-invertibility by breaking the block cipher output into two halves and XORing the two halves together.
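A sketch of that folding step as just described (the function name is mine, not from any of the specs):

    def fold_output(state: list) -> list:
        # Fold a 2n-word cipher output down to n words by XORing the
        # two halves together. Many inputs map to each output, so the
        # result can no longer be inverted back to the cipher output.
        n = len(state) // 2
        return [a ^ b for a, b in zip(state[:n], state[n:])]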
Before MD5 was broken, I did read briefly about an attempt (not by Ron Rivest) to use the inner block cipher from MD5 for encryption, but the performance wasn't competitive. Now we've characterized the hidden patterns in the block cipher well enough to break it relatively easily. I forget the name the authors retroactively gave to Rivest's inner block cipher from MD5.
Thanks for the correction. Also, my original description of the Davies-Meyer addition step got mangled in editing: the saved copy is added to the result of the encryption.
I'm a bit confused - why not distribute a serialized Bloom filter representing these passwords? That would seem to enable a compact representation (low Azure bill) and client-side querying (maximally preserving privacy).
You could do a Bloom filter on each bucket, each of which has about 500 items. This would reduce the size of the response from about 16k to < 1k. But it would be a lot harder to use since all clients would have to use the Bloom filter code correctly.
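For scale, the standard sizing formula m = -n*ln(p)/(ln 2)^2 backs that up. A quick sketch (the 1-in-1000 false-positive target is my assumption, not from the post):

    from math import log

    n, p = 500, 1e-3                    # items per bucket, target FP rate
    m_bits = -n * log(p) / log(2) ** 2  # classic Bloom filter sizing
    print(round(m_bits / 8))            # ~900 bytes, vs ~16k of raw hex text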
A Bloom filter with >500M items, even when allowing for a comparatively high rate of false positives such as 1 in 100, is still in the hundreds of MBs, which would not be that much more accessible than the actual dump files.
The compressed archive here is over 8 GB. An uncompressed 2 GB Bloom filter with 24 hash functions and half a billion entries has a false positive rate of less than 1 in 14 million.
75% space savings, with no decompression necessary for use, and a 1 in 14 million false positive rate is nothing to sneeze at.
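Those numbers check out against the usual false-positive approximation, (1 - e^(-kn/m))^k:

    from math import exp

    m = 2 * 8 * 2**30           # 2 GB filter, in bits
    n = 500_000_000             # entries
    k = 24                      # hash functions
    fpr = (1 - exp(-k * n / m)) ** k
    print(fpr, round(1 / fpr))  # ~6.7e-08 -> about 1 in 14.8 million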
Counting Bloom filters are only marginally more difficult to implement. To increment a key, find the minimum value stored across all of the key's slots, then increment each of the key's slots holding that minimum value. To read, return the minimum across all of the values stored in the key's slots.
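A minimal sketch of that increment/read logic (the hash derivation here is arbitrary, just to get k indexes per key):

    import hashlib

    class CountingBloom:
        def __init__(self, size, num_hashes):
            self.slots = [0] * size
            self.k = num_hashes

        def _indexes(self, key: bytes):
            # Derive k slot indexes from independent-ish hashes of the key.
            for i in range(self.k):
                h = hashlib.sha1(bytes([i]) + key).digest()
                yield int.from_bytes(h[:8], "big") % len(self.slots)

        def increment(self, key: bytes):
            idx = list(self._indexes(key))
            low = min(self.slots[i] for i in idx)
            for i in idx:                  # bump only the minimum slots
                if self.slots[i] == low:
                    self.slots[i] += 1

        def read(self, key: bytes) -> int:
            # Never undercounts; may overcount on hash collisions.
            return min(self.slots[i] for i in self._indexes(key))

    cb = CountingBloom(size=1 << 16, num_hashes=4)
    cb.increment(b"hunter2")
    print(cb.read(b"hunter2"))  # 1 (or more, if collisions inflate it)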
For these purposes, however, you probably instead want to store just separate Bloom filters for counts above different thresholds, since the common use case would be accept/reject decisions based upon a single threshold.
If it doesn't return anything then your password isn't in the list. You should probably start your line with a space so that it isn't recorded in your bash_history.
If someone else can make it better or shorter, be my guest.
Does anyone else get no results when searching for 'asdf', 'hunter2', or 'lauragpe' (which appears in the article) using the shell script provided?
edit: OK, so my `openssl sha1` (version 1.0.x) outputs '(stdin) <hash>', whereas the script expects just '<hash>'. Add ' | cut -f2 -d" "' after the 'openssl sha1' call to fix this if you have the same problem.
You don't submit either a full password or a full hash to the API (which, since Troy produced the hashes from the passwords, would amount to the same thing). A hash is pretty much perfect for k-anonymity: with a prefix like this, passwords are spread evenly across the buckets, so no 5-character prefix comes close to uniquely identifying a password.
As stated in the post, it's a simple solution to help with anonymity.
"The password has been hashed client side and just the first 5 characters passed to the API As mentioned earlier, there are 475 hashes beginning with "21BD1", but only 1 which matches the remainder of the hash for "P@ssw0rd" and that record indicates that the password has previously been seen 47,205 times."
But it's not always that hash - the password you're checking may not be on the list at all. This is just a quick check to see whether the password in question is on the list, in which case it may be a poor choice depending on how often it's been seen.
For example, say I want to check "gSAey27tgGsaEG". That hashes to c2e5dfb023cd42df94751581cba33b24bc011027. https://api.pwnedpasswords.com/range/c2e5d has no entry for fb023cd42df94751581cba33b24bc011027, so it's not even in the list of passwords.
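In code, the whole check is short. A minimal sketch (no error handling; the API returns one 'SUFFIX:COUNT' line per hash in the bucket, in uppercase hex):

    import hashlib
    import urllib.request

    def pwned_count(password: str) -> int:
        digest = hashlib.sha1(password.encode()).hexdigest().upper()
        prefix, suffix = digest[:5], digest[5:]
        url = "https://api.pwnedpasswords.com/range/" + prefix
        with urllib.request.urlopen(url) as resp:       # bucket of ~500 rows
            for line in resp.read().decode().splitlines():
                tail, _, count = line.partition(":")
                if tail == suffix:                      # match locally
                    return int(count)
        return 0  # suffix not in the bucket: password isn't in the list

    print(pwned_count("gSAey27tgGsaEG"))  # 0, per the example above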
Put another way, there are a few hundred hashes per prefix on average, given the total password list size (~500M), but there are 2^140 possible hash suffixes per prefix. There's no point in trying to guess the rest.
Forgive my ignorance, but why is submitting a hash a problem? Because Troy knows which passwords have been checked? Why should I care about that? I get that it's like submitting your password in the clear if it's in the DB, but in that case surely you have bigger problems.
One way that sites can use this service is to check whether a password has been leaked when users sign up. By handing over the SHA-1 hash of the password, you're effectively trusting this service (and anyone who might have compromised it) with all your users' cleartext passwords. Connecting the right password with the right user can be trivial in some circumstances, say because a site has a publicly visible sign-up date on profiles, or even if it just hands out sequential IDs to users.
You cannot access their support in any way without logging in. Trying to contact them via their contact/sales page won't work. They won't respond.
This means that if you lose your phone (2FA) and can't log in, you're royally screwed and will have to go to your registrar to recover access to your domains/DNS.
All of that is a good thing in my book. I've been the victim of the "customer service backdoor" on Amazon multiple times. It's ridiculous that someone can all but authenticate as you without even having to log in. The attackers made off with whatever sensitive data the customer service rep had in front of them, just from chatting on that anonymous support chat widget.
Meanwhile, all you have to do is back up your 2FA secrets. Why not make that part of your regular computer backup routine?
You should never use only 2FA for something you don't want to be locked out of. You need a 3rd authentication method, such as backup codes, to replace the 2nd when you lose it, as well as a 4th one to recover a lost password.
> You should never use only 2FA for something you don't want to be locked out of.
Tell that to... everyone.
> You need a 3rd authentication method, such as backup codes, to replace the 2nd when you lose it, as well as a 4th one to recover a lost password.
That's on Cloudflare. If they don't offer backup codes, what can an end user do about that?
Manually record the seed key when you set up 2FA (usually this is contained in a QR code). Keep it somewhere safe and offline. It can be used to recreate your 2FA setup.
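For the curious, that seed is essentially all a TOTP app stores. A minimal RFC 6238 sketch in Python (the seed below is made up, not a real account):

    import base64, hashlib, hmac, struct, time

    def totp(seed_b32: str, digits: int = 6, step: int = 30) -> str:
        # Recreate the rotating code from the base32 seed shown at setup.
        key = base64.b32decode(seed_b32.upper().replace(" ", ""))
        counter = struct.pack(">Q", int(time.time()) // step)
        mac = hmac.new(key, counter, hashlib.sha1).digest()
        offset = mac[-1] & 0x0F  # dynamic truncation per RFC 4226
        word = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
        return str(word % 10 ** digits).zfill(digits)

    print(totp("JBSWY3DPEHPK3PXP"))  # made-up seed for illustration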
That's not an inherent problem, that's poor implementation.
Procedures like this could work:
Person contacts support requesting a bypass of the 2FA due to whatever reason.
1. Cloudflare sends an email to the person's account notifying them of the request.
2. Person is required to upload photographic proof of two govt-issued IDs.
3. Cloudflare calls the person (phone number on file from 2FA or account setup).
4. A 30-day delay is initiated.
5. 30 days later, Cloudflare emails and calls the person to confirm they requested the 2FA bypass.
6. Access is granted.
With procedures like this, it's no longer about convincing a support rep.
And check out Cloudflare's detailed post too:
https://blog.cloudflare.com/validating-leaked-passwords-with...