It's indeed weird that "00000" would be the hash prefix with the highest number of entries. I think there must be a hidden variable. Perhaps some sources put an all-zeroed-out hash in the database for testing, for registration errors, or for deleted users, and those show up here.
Great thought, but that doesn't seem to be the case: it's the number of unique suffixes that is unusually large here - in fact, none of the values in the range are simply all zeroes.
I wonder if the hidden variable is something to do with how the passwords are leaked. First, let's suppose that a very commonly used broken password hash is plain SHA-1 (I think that's a valid assumption - unfortunately!). Then, let's figure that among the many data dumps / extracts done by hackers, some of them are only able to extract part of the database, or save part of the database, or whatever... and the rows are fetched / saved / uploaded in lexical order?
Can't think of anything else.
EDIT: Oops. The other thing is that these actually are SHA-1 hashes of real plaintext passwords. So it's definitely not a test row in that sense.
Maybe crypto people who have brute-forced up some typeable passwords that hash to low numbers on the first SHA-1 pass, for a fun-and-games equivalent to a Proof of Work? (It'd only show up in actual DB dumps for backends that use "SHA-1 with no salting" for password hashing, which might also serve as a useful canary value.)
Great idea! I ran a quick hashcat against the range 00000 list on my laptop. In 1 minute I cracked 79 of them, and not many of them look odd - that is, they look more or less like normal cracked passwords.
I'm asking my friend to run a more thorough crack on his dedicated GPU, especially for the entry 000DD7F2A1C68A35673713783CA390C9E93:630, which does stick out to me!
Note that in the description below, I refer to any keyed permutation as a block cipher. One may make a semantic distinction, but any keyed permutation could be used as a block cipher (though, of course, most such permutations would contain trivial cryptographic weaknesses).
SHA-1 is based around a 160-bit unbalanced Feistel block cipher. The input is broken into blocks, where the final block contains padding and a count of the total amount of data processed. A copy of the 160-bit state is made, the state is encrypted using a block of the input as the key, and the saved copy is then added (without carries between 32-bit words) to the result of the encryption. This is repeated for each input block in turn. This is called a Davies-Meyer construction for making a hash function out of a block cipher.
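For illustration, here's a minimal Python sketch of the Davies-Meyer mode. The `toy_encrypt` function is just a stand-in for the real inner cipher (SHACAL-1, in SHA-1's case) with no cryptographic merit; the point is the copy / encrypt / add-back pattern:

    MASK32 = 0xFFFFFFFF

    def toy_encrypt(key_block: bytes, state: list) -> list:
        # Placeholder keyed permutation over five 32-bit words. For a
        # fixed key, each step (rotate, XOR) is invertible, so this is
        # a permutation - but NOT a secure cipher.
        out = list(state)
        for i, b in enumerate(key_block):
            w = out[i % 5]
            w = ((w << 1) | (w >> 31)) & MASK32  # rotate left by 1
            out[i % 5] = w ^ b
        return out

    def davies_meyer(blocks, iv):
        state = list(iv)
        for block in blocks:
            saved = list(state)                 # copy the chaining state
            state = toy_encrypt(block, state)   # message block acts as the key
            state = [(s + c) & MASK32           # feed-forward: add saved copy,
                     for s, c in zip(state, saved)]  # no inter-word carries
        return state

    # Example, using SHA-1's actual initial values as the IV:
    iv = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    print(davies_meyer([b"block one", b"block two"], iv))

Dropping that final addition would make each compression step invertible, which is exactly the weakness discussed below.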
For any Davies-Meyer hash function, the block cipher is invertible and therefore unbiased. The addition is invertible and unbiased. Any bias would therefore have to come from non-zero correlation between addition and encryption. For any moderately complex block cipher, these correlations would be very complex. Real world design of Davies-Meyer hash functions focuses on absolutely minimizing any patterns present, and cryptanalysis focuses on characterizing and approximating any and all minute patterns that escape the design process.
There are some patterns (weaknesses) in SHA-1, but all known weaknesses are way more complex (and minuscule) than could explain the sort of bias seen in this data set, so the bias must be coming from a higher-level source than SHA-1 itself.
On a side note, the addition in Davies-Meyer is there to intentionally make the round function non-invertible. If the round function were invertible, there would be a trivial birthday attack on the intermediate state between rounds that square-roots the strength of the hash function. MD4, MD5, SHA-224, SHA-256, SHA-384, and SHA-512 are all Davies-Meyer constructions using unbalanced Feistel ciphers. RIPEMD-160 is a parallel application of two Davies-Meyer hashes with different initial values, followed by XORing the two outputs to obtain the final output. SHA-3 is the most notable cryptographic hash function that's not a Davies-Meyer construction.
In case you're wondering, one could make a Davies-Meyer-style hash function using an AES-like cipher. One of the designers of AES (together with Paulo Barreto) took the AES design, doubled the word size, doubled the number of words, and fixed a deficiency discovered in the nonlinear byte substitution. The resulting hash function is called Whirlpool, and the underlying block cipher is called W. I'm not aware of any use of W outside of Whirlpool.
The Salsa/ChaCha families of stream ciphers and the Blake family of hash functions are all very similar to each other. They all use a very similar family of (unnamed) block ciphers internally that are twice the size of the desired output. They achieve non-invertibility by breaking the block cipher output into two halves and XORing the two halves together.
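A sketch of that folding step as just described (the function name is mine, not from any of the specs):

    def fold_output(state: list) -> list:
        # Fold a 2n-word cipher output down to n words by XORing the
        # two halves together. Many inputs map to each output, so the
        # result can no longer be inverted back to the cipher output.
        n = len(state) // 2
        return [a ^ b for a, b in zip(state[:n], state[n:])]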
Before MD5 was broken, I did read briefly about an attempt (not by Ron Rivest) to use the inner block cipher from MD5 for encryption, but the performance wasn't competitive. Now we've characterized the hidden patterns in the block cipher well enough to break it relatively easily. I forget the name the authors retroactively gave to Rivest's inner block cipher from MD5.
Thanks for the correction. Also, my original description of the Davies-Meyer addition step got mangled in editing: the saved copy is added to the result of the encryption.
I'm a bit confused - why not distribute a serialized Bloom filter representing these passwords? That would seem to enable a compact representation (low Azure bill) and client-side querying (maximally preserving privacy).
You could do a Bloom filter on each bucket, each of which has about 500 items. This would reduce the size of the response from about 16k to < 1k. But it would be a lot harder to use since all clients would have to use the Bloom filter code correctly.
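For scale, the standard sizing formula m = -n*ln(p)/(ln 2)^2 backs that up. A quick sketch (the 1-in-1000 false-positive target is my assumption, not from the post):

    from math import log

    n, p = 500, 1e-3                    # items per bucket, target FP rate
    m_bits = -n * log(p) / log(2) ** 2  # classic Bloom filter sizing
    print(round(m_bits / 8))            # ~900 bytes, vs ~16k of raw hex text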
A Bloom filter with >500M items, even when allowing for a comparatively high rate of false positives such as 1 in 100, is still in the hundreds of MBs, which would not be that much more accessible than the actual dump files.
The compressed archive here is over 8 GB. An uncompressed 2 GB Bloom filter with 24 hash functions and half a billion entries has a false positive rate of less than 1 in 14 million.
75% space savings, with no decompression necessary for use, and a 1 in 14 million false positive rate is nothing to sneeze at.
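Those numbers check out against the usual false-positive approximation, (1 - e^(-kn/m))^k:

    from math import exp

    m = 2 * 8 * 2**30           # 2 GB filter, in bits
    n = 500_000_000             # entries
    k = 24                      # hash functions
    fpr = (1 - exp(-k * n / m)) ** k
    print(fpr, round(1 / fpr))  # ~6.7e-08 -> about 1 in 14.8 million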
Counting Bloom filters are only marginally more difficult to implement. To increment a key, find the minimum value stored across all of the key's slots, then increment each of the key's slots holding that minimum value. To read, return the minimum across all of the values stored in the key's slots.
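A minimal sketch of that increment/read logic (the hash derivation here is arbitrary, just to get k indexes per key):

    import hashlib

    class CountingBloom:
        def __init__(self, size, num_hashes):
            self.slots = [0] * size
            self.k = num_hashes

        def _indexes(self, key: bytes):
            # Derive k slot indexes from independent-ish hashes of the key.
            for i in range(self.k):
                h = hashlib.sha1(bytes([i]) + key).digest()
                yield int.from_bytes(h[:8], "big") % len(self.slots)

        def increment(self, key: bytes):
            idx = list(self._indexes(key))
            low = min(self.slots[i] for i in idx)
            for i in idx:                  # bump only the minimum slots
                if self.slots[i] == low:
                    self.slots[i] += 1

        def read(self, key: bytes) -> int:
            # Never undercounts; may overcount on hash collisions.
            return min(self.slots[i] for i in self._indexes(key))

    cb = CountingBloom(size=1 << 16, num_hashes=4)
    cb.increment(b"hunter2")
    print(cb.read(b"hunter2"))  # 1 (or more, if collisions inflate it)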
For these purposes, however, you probably instead want to store just separate Bloom filters for counts above different thresholds, since the common use case would be accept/reject decisions based upon a single threshold.
If it doesn't return anything then your password isn't in the list. You should probably start your line with a space so that it isn't recorded in your bash_history.
If someone else can make it better or shorter, be my guest.
Does anyone else get no results when searching for 'asdf', 'hunter2', or 'lauragpe' (which appears in the article) using the shell script provided?
edit: OK, so my `openssl sha1` (version 1.0.x) outputs '(stdin) <hash>', whereas the script expects just '<hash>'. Add ' | cut -f2 -d" "' after the 'openssl sha1' call to fix this if you have the same problem.
You don't submit either a full password or a full hash to the API (which, since Troy produced the hashes from the passwords, would amount to the same thing). A hash is pretty much perfect for k-anonymity: with a prefix like this, passwords are spread evenly across the buckets, so no 5-character prefix comes close to uniquely identifying a password.
As stated in the post, it's a simple solution to help with anonymity.
"The password has been hashed client side and just the first 5 characters passed to the API As mentioned earlier, there are 475 hashes beginning with "21BD1", but only 1 which matches the remainder of the hash for "P@ssw0rd" and that record indicates that the password has previously been seen 47,205 times."
But it's not always that hash - the password you're checking may not be on the list at all. This is just a quick check to see whether the password in question is on the list, in which case it may be a poor choice depending on how often it's been seen.
For example, say I want to check "gSAey27tgGsaEG". That hashes to c2e5dfb023cd42df94751581cba33b24bc011027. https://api.pwnedpasswords.com/range/c2e5d has no entry for fb023cd42df94751581cba33b24bc011027, so it's not even in the list of passwords.
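In code, the whole check is short. A minimal sketch (no error handling; the API returns one 'SUFFIX:COUNT' line per hash in the bucket, in uppercase hex):

    import hashlib
    import urllib.request

    def pwned_count(password: str) -> int:
        digest = hashlib.sha1(password.encode()).hexdigest().upper()
        prefix, suffix = digest[:5], digest[5:]
        url = "https://api.pwnedpasswords.com/range/" + prefix
        with urllib.request.urlopen(url) as resp:       # bucket of ~500 rows
            for line in resp.read().decode().splitlines():
                tail, _, count = line.partition(":")
                if tail == suffix:                      # match locally
                    return int(count)
        return 0  # suffix not in the bucket: password isn't in the list

    print(pwned_count("gSAey27tgGsaEG"))  # 0, per the example above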
Put another way, there are a few hundred hashes per prefix on average, given the total password list size (~500M), but there are 2^140 possible hash suffixes per prefix. There's no point in trying to guess the rest.
Forgive my ignorance, but why is submitting a hash a problem? Because Troy knows which passwords have been checked? Why should I care about that? I get that it's like submitting your password in the clear if it's in the DB, but in that case surely you have bigger problems.
One way that sites can use this service is to check whether a password has been leaked when users sign up. By handing over the SHA-1 hash of the password, you're effectively trusting this service (and anyone who might have compromised it) with all your users' cleartext passwords. Connecting the right password with the right user can be trivial in some circumstances, say because a site has a publicly visible sign-up date on profiles, or even if it just hands out sequential IDs to users.
You cannot access their support in any way without logging in. Trying to contact them via their contact/sales page won't work. They won't respond.
This means that if you lose your phone (2FA) and can't log in, you're royally screwed and will have to go to your registrar to recover access to your domains/DNS.
All of that is a good thing in my book. I've been the victim of the "customer service backdoor" on Amazon multiple times. It's ridiculous that someone can all but authenticate as you without even having to log in. The attackers made off with whatever sensitive data the customer service rep had in front of them, just from chatting on that anonymous support chat widget.
Meanwhile, all you have to do is back up your 2FA secrets. Why not make that part of your regular computer backup routine?
You should never use only 2FA for something you don't want to be locked out of. You need a 3rd authentication method, such as backup codes, to replace the 2nd when you lose it, as well as a 4th one to recover a lost password.
> You should never use only 2FA for something you don't want to be locked out of.
Tell that to... everyone.
> You need a 3rd authentication method, such as backup codes, to replace the 2nd when you lose it, as well as a 4th one to recover a lost password.
That's on Cloudflare. If they don't offer backup codes, what can an end user do about that?
Manually record the seed key when you set up 2FA (usually this is contained in a QR code). Keep it somewhere safe and offline. It can be used to recreate your 2FA setup.
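For the curious, that seed is essentially all a TOTP app stores. A minimal RFC 6238 sketch in Python (the seed below is made up, not a real account):

    import base64, hashlib, hmac, struct, time

    def totp(seed_b32: str, digits: int = 6, step: int = 30) -> str:
        # Recreate the rotating code from the base32 seed shown at setup.
        key = base64.b32decode(seed_b32.upper().replace(" ", ""))
        counter = struct.pack(">Q", int(time.time()) // step)
        mac = hmac.new(key, counter, hashlib.sha1).digest()
        offset = mac[-1] & 0x0F  # dynamic truncation per RFC 4226
        word = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
        return str(word % 10 ** digits).zfill(digits)

    print(totp("JBSWY3DPEHPK3PXP"))  # made-up seed for illustration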
That's not an inherent problem, that's poor implementation.
Procedures like this could work:
Person contacts support requesting a bypass of the 2FA due to whatever reason.
1. Cloudflare sends an email to the person's account notifying them of the request.
2. Person is required to upload photographic proof of two govt-issued IDs.
3. Cloudflare calls the person (phone number on file from 2FA or account setup).
4. A 30-day delay is initiated.
5. 30 days later, Cloudflare emails and calls the person to confirm they requested the 2FA bypass.
6. Access is granted.
With procedures like this, it's no longer about convincing a support rep.
And check out Cloudflare's detailed post too:
https://blog.cloudflare.com/validating-leaked-passwords-with...