
Why Searching Through 500M Pwned Passwords Is So Quick - pstadler
https://www.troyhunt.com/i-wanna-go-fast-why-searching-through-500m-pwned-passwords-is-so-quick/
======
zaroth
Ummm... because it’s an O(1) array lookup, not a search at all? Infuriating.

It’s read-only static data. Spending even 60ms on the response is ridiculous.
Reading from files in blob storage... WTF?

Ctrl-F Redis - was disappointed.

Actually, even forget Redis. Pre-generate each of the 1 million possible HTTP
responses and store them in a string array. The 5-character hex prefix is the
index into the array. Write < 100 lines of Go to load the data structure and
serve it. What am I missing?
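
A minimal sketch of that idea in Go, assuming the published dump is a text
file of sorted, uppercase "SHA1HASH:count" lines (the file name and the
/range/ route here are illustrative):

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "net/http"
        "os"
        "strconv"
        "strings"
    )

    // One pre-rendered response body per 5-hex-digit prefix (16^5 = 1,048,576).
    var responses [1 << 20]string

    func main() {
        // Assumes the dump is sorted by hash, one "SHA1HASH:count" line each.
        f, err := os.Open("pwned-passwords-ordered-by-hash.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        cur, body := int64(-1), &strings.Builder{}
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := sc.Text()
            p, _ := strconv.ParseInt(line[:5], 16, 64)
            if p != cur {
                if cur >= 0 {
                    responses[cur] = body.String()
                }
                body.Reset()
                cur = p
            }
            body.WriteString(line[5:]) // store only "SUFFIX:count"
            body.WriteByte('\n')
        }
        if cur >= 0 {
            responses[cur] = body.String()
        }

        http.HandleFunc("/range/", func(w http.ResponseWriter, r *http.Request) {
            prefix := strings.ToUpper(strings.TrimPrefix(r.URL.Path, "/range/"))
            i, err := strconv.ParseInt(prefix, 16, 64)
            if len(prefix) != 5 || err != nil {
                http.Error(w, "need a 5-char hex prefix", http.StatusBadRequest)
                return
            }
            fmt.Fprint(w, responses[i])
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }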

This is like “Hello World” in those HTTP Framework Benchmarks that used to
make the rounds every few months.

~~~
djhworld
The 250MB Redis instance on Azure costs $0.055 an hour [1], so nearly $500
per year.

Going down the route of hosting a 100-line Go program, the cheapest "B1"
instance type with 1 GiB of RAM costs $105.12 a year [2], and if you want the
service to be HA you probably need another instance in a different zone, and
maybe a load balancer as well.

I suspect Azure Functions + Blob Storage costs a lot less. Although he didn't
mention how much CloudFlare was costing him.

[1] [https://azure.microsoft.com/en-us/pricing/details/cache/](https://azure.microsoft.com/en-us/pricing/details/cache/)

[2] [https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)

~~~
dchuk
You could host this whole thing on the cheapest DigitalOcean droplet if it
were written the way the person you're replying to described.

~~~
manigandham
And then it's a single server that has none of the scaling, ease of
maintenance, or high availability of just using functions and table storage.
This is the perfect scenario for serverless; why would you want to run a VM
instead?

The current system works great, most of the hits will be cached, and
optimizing the origin to shave off a few milliseconds isn't worth it at all.

~~~
dchuk
It's actually NOT a perfect scenario for serverless, because it's a read-only
(no writes), simple search "web app". The one thing in your list that you
don't get from a VM is high availability, I'll give you that, but otherwise
this solution is an overengineered monstrosity compared to doing it in a more
traditional way, as many, many other people here are calling out.

If the only time your web app is fast is when its data is cached, you need to
take a long hard look at how your web app is built. It's not that hard to
build a performant query system like this, even with that amount of data.

~~~
manigandham
Read vs. write has nothing to do with serverless; it's about well-encapsulated
logic that doesn't need any server management and has granular pay-per-use
billing.

It doesn't get much simpler than a function that does a key/value lookup or
returns a text file; what exactly is over-engineered about that? The app _is_
fast; it's < 100ms, which is perfectly fine for non-cached access. This isn't
mission-critical real-time, it's a completely free service to look up a
password hash.

Everyone here is caught up in the usual "we can do better" perspective without
actually thinking about _why_ it was done this way. In reality the current
method is cheaper, more reliable, more scalable and has zero maintenance
compared to anything else suggested so far.

------
dx034
Just putting this on a VPS or a cheap dedicated server with 16 GB of RAM
would've led to sub-ms response times at much lower cost (if you don't get
Azure and Cloudflare for free like he does). At those response speeds,
scalability is also not really an issue if you cache aggressively at the edge.

Argo is then nice to have but not really necessary. If the server responds in
<1 ms, those 30% saved on RTT are probably not detectable by the user.

~~~
eropple
I'm not sure that you quite realize that Troy's intention here is to
eventually be soaking a whole lot of zeroes with this. Add to that that, if
people are relying on this as a service, _it had better not go down under any
remotely plausible circumstances_, and some growth-hacker's dedicated server
from Hetzner or whatever is unsuitable. Edge caching only blunts this
sufficiently if everybody's searching for the same passwords, and proving that
that's the case is a burden you have not shouldered.

To that end, "just use a VPS or a cheap dedicated server" sounds a whole lot
like the middlebrow thing we're not supposed to be fans of around here.

~~~
deelowe
> Troy's intention here is to eventually be soaking a whole lot of zeroes this
> this

huh?

~~~
eropple
Sorry, shorthand common in my neck of the woods. Used for any Thing per Your
Favorite Time Unit, measured logarithmically. In this case, requests per
second.

(Edit: ah, typo. s/this this/with this. Clearer?)

~~~
33W
@pbhjpbhj - I believe the meaning of the sentence is that Troy Hunt is
expecting/planning for a few orders of magnitude of growth.

"Soak a few zeros" for example going from 1,000 calls per second to 1,000,000
calls per second.

~~~
optimuspaul
what an odd way to say that.

~~~
stcredzero
"Soak" is often used for "take" in engineering circles. "Zeroes" is often used
for orders of magnitude in engineering circles. I've never before heard these
put together in that fashion, however.

------
skrebbel
Couldn't he just pregenerate all 1,048,576 responses, load them somewhere in
RAM (or just as a bunch of HTML files on an nginx with caching on) and be done
with it? I mean, he writes that a single response, gzipped, averages 10 KB, so
that's only about 10 GB of RAM in total.

Even better: host this on a service like Netlify and don't even have the
30-day cache timeout Troy has here (which means 30-day-old info in case of new
breaches). Just regenerate the entire set on the dev box whenever there's a
new breach (should be fast enough, it's a linear scan & split) and push it to
Netlify; it'll invalidate all edge caches automatically.
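
The regenerate step is a single linear pass; a sketch in Go under the same
assumption of a sorted "SHA1HASH:count" dump (file and directory names are
illustrative):

    package main

    import (
        "bufio"
        "log"
        "os"
    )

    // Split a sorted "SHA1HASH:count" dump into one static file per
    // 5-hex-digit prefix (out/21BD1, ...), each holding that bucket's
    // "SUFFIX:count" lines. Sorted input means one pass and one open file.
    func main() {
        in, err := os.Open("pwned-passwords-ordered-by-hash.txt")
        if err != nil {
            log.Fatal(err)
        }
        defer in.Close()
        if err := os.MkdirAll("out", 0o755); err != nil {
            log.Fatal(err)
        }

        var cur string
        var f *os.File
        var w *bufio.Writer
        flush := func() {
            if f != nil {
                w.Flush()
                f.Close()
            }
        }
        sc := bufio.NewScanner(in)
        for sc.Scan() {
            line := sc.Text()
            if line[:5] != cur {
                flush()
                cur = line[:5]
                f, err = os.Create("out/" + cur)
                if err != nil {
                    log.Fatal(err)
                }
                w = bufio.NewWriter(f)
            }
            w.WriteString(line[5:]) // keep only "SUFFIX:count"
            w.WriteByte('\n')
        }
        flush()
        if err := sc.Err(); err != nil {
            log.Fatal(err)
        }
    }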

~~~
Bedon292
The breach data isn't updated that often. It was 6 months between v1 and v2,
so a long cache is totally fine. And then invalidate the cache if there is an
update.

~~~
skrebbel
That.. sort of supports my argument :-) He could cache it indefinitely instead
of 30 days if he's going to invalidate the cache anyway.

~~~
Bedon292
I do appear to have misinterpreted your point. Sorry about that. Although I
assume the idea was limiting how much is cached at any point on their end. Not
sure it really matters though, it's not that much data.

------
barrkel
When you don't need transactions, don't have a cache invalidation problem, and
are querying read-only data, then this architecture (or really any
architecture that takes advantage of what makes HTTP scalable: mostly
idempotent, cacheable responses to GETs) makes sense.
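
In HTTP terms that's just a header on an idempotent GET; a minimal sketch in
Go (the 30-day max-age mirrors the cache timeout mentioned elsewhere in this
thread, and is illustrative):

    package main

    import (
        "fmt"
        "net/http"
    )

    // A read-only GET response marked publicly cacheable, so edge caches
    // and browsers can answer repeat queries without touching the origin.
    func handler(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Cache-Control", "public, max-age=2592000") // 30 days
        fmt.Fprint(w, "suffix:count data would go here\n")
    }

    func main() {
        http.HandleFunc("/range/", handler)
        http.ListenAndServe(":8080", nil)
    }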

~~~
endorphone
I do think it is a bit disingenuous that the author compares this to a classic
web application and its data needs, when the needs here are so trivial. It is
very well designed for what it does (the security needs of such a checker
almost dictate the design), but it is a model usable by very few applications.

As an aside, the article talks a bit about Brotli, and it's worth noting that
Brotli is essentially LZ77 pre-seeded with a 119 KB static dictionary of
commonly seen web text. It is of course going to be fantastic for compressing
an HTML document, where much of the content is verbose and common, but would
do nothing above gzip for the hash result data. I would be surprised if it
yielded a single byte of savings in that case. Brotli is supported by most
browsers because it was snuck into the WOFF 2.0 standard, so browsers that
support the new web font standard automatically have to support Brotli.

[https://dennisforbes.ca/index.php/2016/01/28/eat-your-brotli-revisiting-why-you-should-use-nginx-in-your-solutions/](https://dennisforbes.ca/index.php/2016/01/28/eat-your-brotli-revisiting-why-you-should-use-nginx-in-your-solutions/)

~~~
tjoff
Depends on whether you compare this to what a classic web application should
be doing versus what it actually does.

------
sleepychu
Surprised Troy is so pro-Cloudflare. I feel like they create a lot of security
headaches.

~~~
tomalpha
Troy is clearly very pro Cloudflare and Azure. I’m only familiar with them at
a high level and it would be great to hear counterpoints to his praise and get
(at least the impression of) a debate that encompasses their pros and cons.

+1 for being interested in expanding this

~~~
Daycrawler
Don't know about Cloudflare, but as a Microsoft Regional Director he's
certainly pro-Azure.

~~~
copper_think
Such a weird title; it makes me think he works for them, but he doesn't.

~~~
ptman
Didn't he work for Microsoft previously?

~~~
sghi
No, he worked for Pfizer - [https://www.troyhunt.com/today-marks-two-important-milestones/](https://www.troyhunt.com/today-marks-two-important-milestones/)

------
darkport
I love the k-Anonymity model. Makes it actually feasible to check passwords
against HIBP when carrying out password audits for clients. Shameless plug,
but I've added it to my Active Directory password audit tool:
[https://github.com/eth0izzle/cracke-dit](https://github.com/eth0izzle/cracke-dit)

------
manigandham
Lots of suggestions in this thread about better architecture but they all seem
to forget that this is designed to be minimal in cost, complexity and
maintenance while delivering 100% availability and great performance.

While Redis or a VM would be faster, that's way more overhead compared to a
few cloud functions and table storage. This whole thing is event-driven and
easy to build with just your browser, along with having cheap and granular
billing. Cloudflare also already caches the responses so there's really no
need for the origin to be perfect.

------
yupyup
Somewhat related (and nitpicky), but there are some spelling errors
(derrivation, seperate...) on the Cloudflare post that explains k-anonymity:

[https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/](https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/)

P.S.: As a non-native speaker, I had to look those words up to check them, as
I trusted the spelling from an official blog post.

~~~
jgrahamc
Thanks. I'll get them corrected.

~~~
jmiserez
“dependended” is another.

~~~
jgrahamc
Fixed

------
StavrosK
Here's a slightly more easily auditable version of the checker, in Python:

[https://www.pastery.net/wwzqua/](https://www.pastery.net/wwzqua/)

The bash one was fine, I just prefer the readability of Python to make sure I
know that only the truncated version of my hash is ever sent.

~~~
jwilk
What bash checker are you referring to?

~~~
StavrosK
The one in this article:

[https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/](https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/)

------
zcam
Wouldn't using a simple Bloom filter make sense in their case? Just build the
thing offline and have your app load it into RAM at startup.

~~~
jgrahamc
Here's a quick table of Bloom filter sizes at various false-positive rates:

    False positives            Size (MB)
    0.1                              285
    0.01                             571
    0.001                            857
    0.0001                          1120
    0.00001                         1390
    0.000001                        1670
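
Those numbers track the standard sizing formula m = -n * ln(p) / (ln 2)^2 bits
for n items at false-positive rate p. A quick check in Go, assuming n is
roughly the 500M hashes in the v2 set (the raw formula lands slightly above
the last three rows, presumably an artifact of whichever implementation
produced the table):

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        const n = 500e6 // ~500M entries in the v2 data set
        for _, p := range []float64{0.1, 0.01, 0.001, 1e-4, 1e-5, 1e-6} {
            bits := -n * math.Log(p) / (math.Ln2 * math.Ln2)
            fmt.Printf("p=%-8g ~%4.0f MB\n", p, bits/8/(1<<20))
        }
    }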

~~~
Ajedi32
That's only if you're sending a bloom filter for the entire hash table though.
If you just use the existing buckets you could reduce the API response sizes
considerably.

The real problem is that the current API returns a count of the number of
times each particular password has appeared, and AFAIK there's no good way to
do that with a bloom filter.

~~~
jgrahamc
I don't understand why everyone's obsessed with using a Bloom Filter for this.
The median response size on the API is 12.2KB with 305 entries. A Bloom Filter
with 305 entries and a 0.000001 false positive rate would be about 1KB. It
seems to me you're introducing a lot of complexity (Bloom Filter vs. simple
string match) and possibility of false positives to save 11KB.

~~~
ericfrederich
A bloom filter can be pushed to clients. Everything is static and could be
served from a CDN. If there is a hit you could then do a secondary request to
perform an actual lookup. For passwords which have not been pwned there'd be a
100% savings on CPU.

~~~
jgrahamc
True, although there's an interesting side effect. With the false positive
rate tuned low any call to the actual API would likely be saying "My password
is one of the ones you already know about" with quite high probability.

------
euroclydon
Anyone know, if we permute all 6-16 character alphanumeric strings, how many
would have their SHA-1 hash match a given five character prefix?

I’m certainly not saying I think this is an issue! I’m just academically
curious about the number and how to go about calculating it.

~~~
dsacco
A SHA-1 digest is a 160-bit value, conventionally rendered as a 40-digit
hexadecimal string. With 16 possible values per digit, that yields a total
search space of 16^40 possible values.

If we splice off the first five digits (which Troy originally used as the
database partition key), we get 16^5 = 1,048,576 possible five-digit prefixes.
Then we continue counting with the sixth position as the new first position in
the string.

The math from here is a straightforward function mapping 16^_n_ -> 16^(_n_ - 5):

* 16 six-digit values match any given five-digit partition key _p_,

* 16^2 = 256 seven-digit values match any _p_,

* 16^3 = 4,096 eight-digit values match any _p_,

* 16^4 = 65,536 nine-digit values match any _p_,

* 16^5 = 1,048,576 ten-digit values match any _p_,

* 16^6 = 16,777,216 11-digit values match any _p_,

* 16^7 = 268,435,456 12-digit values match any _p_,

* 16^8 = 4,294,967,296 13-digit values match any _p_,

* 16^9 = 68,719,476,736 14-digit values match any _p_,

* 16^10 = 1,099,511,627,776 15-digit values match any _p_,

* 16^11 = 17,592,186,044,416 16-digit values match any _p_.

So in general, to calculate how many _n_-digit SHA-1 digests correspond to any
_m_-digit prefix, we simply calculate 16^_n_ / 16^_m_, which yields
16^(_n_ - _m_). Thus we have 16^[6..16] / 16^5 for the five-digit prefix case.
Hopefully that elucidates it for you!

~~~
tzs
Shouldn't the size of the alphanumeric alphabet used for the passwords be in
there somewhere? The question was about how many permutations of all 6-16
character alphanumeric strings map to a given 5 digit SHA-1 prefix.

Your answer seems to be for the case where the alphabet of the input string is
hex digits.

I.e., I think we want Sum[A^i,{i,6,16}]/16^5, where A is the size of the
alphanumeric alphabet.

For A=62, this is 4.6x10^22 or 2^75.3.

For A=95, this is 4.2x10^25 or 2^85.1.
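
Those figures are easy to reproduce with exact big-integer arithmetic; a quick
sketch in Go:

    package main

    import (
        "fmt"
        "math/big"
    )

    // sumPow returns Sum[a^i, {i, lo, hi}] exactly.
    func sumPow(a, lo, hi int64) *big.Int {
        sum := new(big.Int)
        for i := lo; i <= hi; i++ {
            sum.Add(sum, new(big.Int).Exp(big.NewInt(a), big.NewInt(i), nil))
        }
        return sum
    }

    func main() {
        prefixes := new(big.Int).Exp(big.NewInt(16), big.NewInt(5), nil) // 16^5
        for _, a := range []int64{62, 95} {
            per := new(big.Int).Div(sumPow(a, 6, 16), prefixes)
            // A=62 prints ~4.6x10^22, A=95 prints ~4.2x10^25
            fmt.Printf("A=%d: %s strings per 5-digit prefix\n", a, per)
        }
    }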

~~~
KMag
Note that there's no need to actually perform N-M+1 exponentiations and N-M
additions. Two exponentiations, a subtraction, and a division will give you
the same answer.

      Sum[A^i,{i,1,N}] = (A^(N+1) - 1) / (A-1)
      Sum[A^i,{i,M,N}] = (A^(N+1) - A^M) / (A-1)

Consider A=10, M=2, N=6. Sum = (10,000,000 - 100) / 9 = 9,999,900 / 9 =
1,111,100, which is easily checked mentally.

Yes, I've stared at permutations for far too long, but I can't have been the
first to notice the pattern. It must be in plenty of textbooks.

Another cute combinatorics question: using an alphabet of size A, generate the
Nth largest M-character sequence where no two adjacent characters are equal. A
simple count-and-check algorithm takes O(N) time, but there's a simple O(M)
algorithm (that is, constant time if the number of characters is fixed),
assuming constant-time basic arithmetic operations.
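
A sketch of such an O(M) algorithm in Go, reading "Nth" as the 0-indexed rank
in lexicographic order and assuming uint64 arithmetic is big enough:

    package main

    import "fmt"

    // nthNoAdjacentRepeat returns the n-th (0-indexed, lexicographic)
    // length-m string over alphabet in which no two adjacent characters are
    // equal. There are A*(A-1)^(m-1) such strings: the first position has A
    // choices and every later one has A-1, so the rank decomposes like a
    // mixed-radix number. O(m) arithmetic operations.
    func nthNoAdjacentRepeat(alphabet string, m int, n uint64) string {
        a := uint64(len(alphabet))
        block := uint64(1) // valid completions per choice at position 0
        for i := 1; i < m; i++ {
            block *= a - 1
        }
        out := make([]byte, 0, m)
        prev := n / block // first character: a free choices
        n %= block
        out = append(out, alphabet[prev])
        for i := 1; i < m; i++ {
            block /= a - 1
            k := n / block // rank among the a-1 allowed characters
            n %= block
            if k >= prev {
                k++ // skip the character used just before
            }
            out = append(out, alphabet[k])
            prev = k
        }
        return string(out)
    }

    func main() {
        // A=3, m=2 has 3*2 = 6 valid strings: ab ac ba bc ca cb
        for n := uint64(0); n < 6; n++ {
            fmt.Print(nthNoAdjacentRepeat("abc", 2, n), " ")
        }
        fmt.Println()
    }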

------
_pdp_
...or Cloudflare/CloudFront plus a DynamoDB table with a primary key of the
first/last n characters of the hash, with a potential secondary index for
filtering.

Btw, indexing can be done cheaper with Google, I think.

There is also another way (probably better), and that is to use S3. 1 TB can
be stored for as little as $20; the rest is endpoint caching.

Luckily for all of us, it is easier than ever to single-handedly scale to
millions of users at minimal cost.

~~~
always_good
You'd be paying a lot for bandwidth on S3 and CloudFront.

------
TorKlingberg
Minor complaint: start the blog post with a link to the service you are
talking about. I actually had trouble finding it.

~~~
jwilk
It's two clicks away:

* first paragraph links to [https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/](https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/) ;

* first paragraph of that links to [https://haveibeenpwned.com/](https://haveibeenpwned.com/).

~~~
TorKlingberg
That's the wrong link though. I eventually found it at
[https://haveibeenpwned.com/Passwords](https://haveibeenpwned.com/Passwords)

------
jwilk
TL;DR why brotli is HTTPS-only: some middle-boxes mangle responses with
content encodings they don't know.

------
aplorbust
What does he do with the logs of all the passwords submitted in searches?

~~~
reificator
> _What does he do with the logs of all the passwords submitted in searches?_

> _imagine if you wanted to check whether the password "P@ssw0rd" exists in
> the data set. [...] The SHA-1 hash of that string is
> "21BD12DC183F740EE76F27B78EB39C8AD972A757" so what we're going to do is take
> just the first 5 characters, in this case that means "21BD1". That gets sent
> to the Pwned Passwords API and it responds with 475 hash suffixes (that is
> everything after "21BD1") and a count of how many times the original
> password has been seen._
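
For reference, the whole client-side flow fits in a few lines; a sketch in Go
against the public v2 range endpoint (error handling and output are
illustrative):

    package main

    import (
        "bufio"
        "crypto/sha1"
        "fmt"
        "log"
        "net/http"
        "os"
        "strings"
    )

    func main() {
        if len(os.Args) != 2 {
            log.Fatal("usage: checker <password>")
        }
        // Hash locally; only the first 5 hex characters ever leave this machine.
        sum := sha1.Sum([]byte(os.Args[1]))
        hash := strings.ToUpper(fmt.Sprintf("%x", sum))
        prefix, suffix := hash[:5], hash[5:]

        resp, err := http.Get("https://api.pwnedpasswords.com/range/" + prefix)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // The response is one "SUFFIX:count" line per hash sharing the prefix.
        sc := bufio.NewScanner(resp.Body)
        for sc.Scan() {
            parts := strings.SplitN(sc.Text(), ":", 2)
            if len(parts) == 2 && strings.TrimSpace(parts[0]) == suffix {
                fmt.Printf("pwned! seen %s times\n", strings.TrimSpace(parts[1]))
                return
            }
        }
        fmt.Println("not found in the data set")
    }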

~~~
aplorbust
What about the logs from queries submitted via the HIBP website form?

"Another idea I'm toying with is to use the Cloudflare Workers John mentioned
earlier to plug directly into Blob Storage. Content there can be accessed
easily enough _over HTTP_ (that's where you _download the full 500M Pwned
Password list_ from) and it could take out that Azure Function layer
altogether. That's something I'll investigate further a little later on as it
has to potential to bring cost down further whilst pumping up performance."

How to read this? The full list will be downloadable? Users can do queries
locally on the 500M file instead of over the internet? It would be nice to
avoid having to submit queries over an untrusted network (the internet), but I
doubt that is what is being considered in this paragraph.

~~~
manigandham
The form on HIBP uses the same JS client-side hashing; you can check the HTTP
requests yourself in dev tools.

Yes, the whole dataset is available. The first paragraph mentions the release
of the v2 dataset and you can read the full blog post here:
[https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/](https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/)

You can get the 8.8 GB file directly here:
[https://haveibeenpwned.com/Passwords](https://haveibeenpwned.com/Passwords)

~~~
aplorbust
Thanks for the answer.

That page acknowledges the issue, which is all I was curious about:

"Getting back to the online search, being conscious of not wanting to send the
wrong message to people, immediately before the search box I put a very clear,
very bold message: "Do not send any password you actively use to a third-party
service - even this one!""

------
frogpelt
How am I supposed to pronounce 'pwned'?

Can't we find something other than 4chan language to describe this?

~~~
extra88
I'm not a fan of the term but "pwned" predates 4chan.

[http://knowyourmeme.com/memes/owned-pwned](http://knowyourmeme.com/memes/owned-pwned)

------
Quarrelsome
That password header warning is the coolest security improvement I've seen
online.

------
kpennell
Was expecting an Algolia ad but was delightfully surprised.

------
Sir_Cmpwn
Starting to get a little uncomfortable with how hard this is being pushed on
HN right now.

[https://hn.algolia.com/?query=troyhunt.com&sort=byDate&prefi...](https://hn.algolia.com/?query=troyhunt.com&sort=byDate&prefix=false&page=0&dateRange=all&type=story)

~~~
wyldfire
It's good to be skeptical, but everything I've read so far makes me believe
Hunt's motives are good. His efforts are beneficial to HN readers and beyond.

This search shows four results within the last week and then it starts to drop
off. How different are these search results from other popular sites? Ars
Technica, Techcrunch, Anandtech, Bloomberg, LWN? Even if you focus on
individuals, maybe it's not too different from the articles of Bruce Schneier,
ESR, Linus, Theo de Raadt, etc?

~~~
Sir_Cmpwn
>How different are these search results from other popular sites?

Generally the sites you listed have a variety of unrelated articles posted,
each on some distinct topic. The articles posted here in the past week have
all been about the compromised password tool.

------
zxcmx
Cloudflare can read every password submitted through their service, and here
is why that's so great...

It's beautifully elegant, because...

What? This is also the same company that spilled memory all over every cache
everywhere.

~~~
jgrahamc
Nope. 100% incorrect. Troy's service is using Cloudflare, but you don't send
the password to his API: [https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/](https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/)

~~~
zxcmx
My apologies, I was wrong. I was assuming that you were terminating SSL and
were therefore in a position to read the plaintext of user requests.

This would allow you to simply look up whether the password was pwned based on
form submits.

[edit] OK, I understand what you are saying: Troy's service allows some
password privacy due to API design, and happens to use Cloudflare's stuff.
Sorry I was so slow.

I was conflating two unrelated things: a) what happens when you submit a
password through Cloudflare, and b) what happens when you use a specific
password-checking API which uses Cloudflare.

~~~
jgrahamc
Troy's API doesn't work like that. It doesn't need to send the password at all.

~~~
zxcmx
Wonderful, thank you. [edit] Thinking this through... but regardless of how
the API works, is it not possible that you (Cloudflare) could just have the
list and check yourselves, since you know the submitted password?

It could be a value-added service for all your customers.

[Sorry for the late edit, not being evil here.]

~~~
jgrahamc
If you read the details of Troy's API you'll see that at no time does the
password, or even the full hash of the password, leave the computer of the
person using the API.

~~~
zxcmx
Yep, I was sort of on the wrong tack. The API is fine; it just seems like an
unnecessary dance for Cloudflare-integrated services, since you have the
password anyway.

------
ianhawes
This is slightly OT and probably not a popular opinion, but does anyone else
feel that Troy having this massive dataset of emails is unethical?

I definitely believe it is illegal, and I was surprised that the FBI did not
arrest him during his recent visit to the US.

~~~
mi100hael
Simply possessing the information (which is already freely available online)
is not on its own unethical. It depends entirely upon what he does with the
information, which in this case is protecting the owners from malicious
individuals who also have access to it because it's freely available online.

~~~
2close4comfort
It is if it's stolen property... and most of that data belonged to the
companies that were hacked.

~~~
emodendroket
I'm not going to pretend to understand the legal niceties here, but the
analogy to "stolen property" rings false because 1) he's not depriving anybody
of their passwords, and 2) the very fact they've been leaked means they aren't
really of value anymore.

