
A journey to searching Have I Been Pwned database in 49μs - stryku2393
http://stryku.pl/poetry/okon.php
======
simonw
This is a really informative write-up and an excellent learning exercise.

It's worth noting that haveibeenpwned's API has a really clever design for
allowing people to look up their passwords without transmitting them to the
site.

It's explained here: [https://www.troyhunt.com/ive-just-launched-pwned-
passwords-v...](https://www.troyhunt.com/ive-just-launched-pwned-passwords-
version-2/)

The short version is that you can take the first 5 characters of a SHA-1 hash
and hit this endpoint:

[https://api.pwnedpasswords.com/range/21BD1](https://api.pwnedpasswords.com/range/21BD1)

The endpoint returns (right now) a list of 528 hash suffixes (the characters
after the 5-character prefix) along with counts. You can compare the rest of
your calculated SHA-1 hash to that list to see if the password is present in
the dump.

The trick here is called k-Anonymity - I think it's a really elegant solution.
This technique is written up in more detail here:
[https://blog.cloudflare.com/validating-leaked-passwords-
with...](https://blog.cloudflare.com/validating-leaked-passwords-with-k-
anonymity/)
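
For anyone who wants to try it, here's a minimal sketch of a client-side check
in Python (the pwned_count helper is hypothetical, and it assumes the requests
library is available):

    
    
        import hashlib
        import requests
    
        def pwned_count(password):
            # Hash the password and split into a 5-char prefix and 35-char suffix
            digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
            prefix, suffix = digest[:5], digest[5:]
    
            # Only the 5-character prefix ever leaves the machine (k-Anonymity)
            resp = requests.get("https://api.pwnedpasswords.com/range/" + prefix)
            resp.raise_for_status()
    
            # Each response line is "SUFFIX:COUNT" for every hash sharing the prefix
            for line in resp.text.splitlines():
                candidate, _, count = line.partition(":")
                if candidate == suffix:
                    return int(count)
            return 0
    
        print(pwned_count("password"))  # a very large count
    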

~~~
ComputerGuru
Honestly they’re being super mathematical about it but it’s really nothing
fancy at all. SHA is designed to be used like this, CloudFlare hasn’t done
anything remotely ingenious (and I wouldn’t mind except they go out of their
way to talk about how much better their fancy new algorithm is compared to
other multi-set intersection theories).

E.g. 512-bit SHA-2 and SHA-3 may be truncated at 128 or 256 bits if that’s all
the entropy you need (and you don’t need to be compatible with the formal
SHA2/SHA3-512/256 spec). Here, CloudFlare is truncating to an intentionally
low entropy of just 20 bits, not to reduce the security but rather to
intentionally increase the collisions. It’s ultimately just a glorified hash
table and as any CS student can tell you, the bucket size is just a function
of the hash size (v1 of the api: 128-bit hash, v2 of the api: 20-bit hash).

(I don’t like pretension.)

~~~
rmwaite
For someone who claims to not like pretension (sic), you're being pretty
pretentious.

~~~
post_below
After seeing this I reread the post you're replying to. I don't see the
pretension.

Pretentious: adjective. characterized by assumption of dignity or importance,
especially when exaggerated or undeserved: a pretentious, self-important
waiter. making an exaggerated outward show; ostentatious. full of pretense or
pretension; having no factual basis; false.

If you're suggesting that their analysis is false, you should probably point
out why. The self importance part I'm not seeing. Aren't they just stating the
facts as they see them?

~~~
Gravyness
The "really clever design" as noted by simonw was called "nothing fancy at
all" by ComputerGuru, and that haveibeenpwned's API design "hasn’t done
anything remotely ingenious", this is what made the comment pretentious. (in
this case it doesn't matter wether he is or not factually correct)

The reply could very well be interpreted as "You think that Kk-Anonymity is
fancy? How primitive of you. This way is the natural way of doing that thing
and I would know because I'm the smartest person in the room. Oh by the way I
dislike people who are pretentious."

~~~
alias_neo
You're mis-representing, because nowhere did ComputerGuru say what you're
claiming you quoted, of HIBP.

The not-remotely ingenious part was regarding Cloudflare.

~~~
pbhjpbhj
Consider the possibility that those of us who found it pretentious aren't
lying, and that you not seeing it is perhaps more related to how you
personally see things.

~~~
alias_neo
Now you're mis-representing my comment, I didn't anywhere discuss whether or
not it was pretentious, I just pointed out that the comment was factually
incorrect.

~~~
pbhjpbhj
Why did you do that then?

------
jiggawatts
This and other similar solutions seem _awfully_ over-engineered.

a) Hashes are constant size (20 bytes for SHA1)

b) You only care if they're present or not in the database. There's no
associated variable length data.

The simplest yet _very efficient_ format is simply a sorted array of
"byte[20]", with binary-search as the lookup. No headers, no pointers, no
custom format of any type. Literally just 20 x n bytes where 'n' is the number
of hashes. Lookup of the 'n-th' hash is _just_ multiplying 'n' by 20.
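
A minimal sketch of that lookup in Python, assuming the hashes have already
been written out as a flat, sorted file of concatenated 20-byte records (the
file name is made up):

    
    
        import hashlib
    
        RECORD = 20  # a raw SHA-1 digest is exactly 20 bytes
    
        def contains(path, password):
            # Binary search over a flat file of sorted, concatenated 20-byte hashes
            target = hashlib.sha1(password.encode("utf-8")).digest()
            with open(path, "rb") as f:
                f.seek(0, 2)
                lo, hi = 0, f.tell() // RECORD  # number of records
                while lo < hi:
                    mid = (lo + hi) // 2
                    f.seek(mid * RECORD)  # offset of the mid-th hash is just mid * 20
                    candidate = f.read(RECORD)
                    if candidate == target:
                        return True
                    if candidate < target:
                        lo = mid + 1
                    else:
                        hi = mid
            return False
    
        # contains("hibp.bin", "password")  -> True
    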

If you really, _really_ want a B-Tree (why?), just stuff it in _any_ database
engine. Literally anything will handle a single fixed-length key lookup
efficiently for you.

    
    
        CREATE TABLE "HIBP" ( "SHA1" BINARY(20) PRIMARY KEY );
        SELECT 1 FROM "HIBP"
        WHERE "SHA1" = 0x70CCD9007338D6D81DD3B6271621B9CF9A97EA00
    

There. I solved the blogger's problem in literally under 5 minutes without
having to write a custom binary.

You can trivially query databases from both web apps and CLI tools, and you
can do this with batch queries too. E.g.: WHERE "SHA1" IN (... list... )

PS: Text-based tools (such as most shell tools) suck at this type of binary
data. The newline terminated hex representation is 41 bytes per hash, so just
over double the required size. Clever 5-10% prefix compression tricks pale in
comparison to _not doubling_ the data size to begin with.

PPS: A pet peeve of mine is older Java database "enterprise" applications that
use UCS-2 "nvarchar" text columns in databases to store GUID primary keys. The
16 byte GUID ends up taking a whopping 78 bytes to store!

~~~
derefr
> If you really, really want a B-Tree (why?)

Cheap writes, in an OLTP sense?

I had a similar problem recently: discovering "everything" in a DHT (= asking
each node for everything it has), and feeding the resulting documents into a
message queue for processing, without wasting resources processing each
object's thousands of duplicate copies found distributed in the DHT.

These objects had no natural key, so I had to use a content hash for
deduplication. And there was no way to time-bound the deduplication such that
I could bucket objects into generations and only deduplicate within a
generation. I needed a big fat presence-set of all historical hashes; and I
needed to add new hashes to it every time I found a new unique object.

B-Trees have a good balance of read- and write-performance, which is why
they're used for database indices and (usually) as the lower-level persistence
strategy of key-value databases.

> PPS: A pet peeve of mine is older Java database "enterprise" applications
> that use UCS-2 "nvarchar" text columns in databases to store GUID primary
> keys.

Endorsing this point and boosting it: DBMSes really haven't thought out how to
efficiently store large identifiers.

For example, a DBMS _could_ suggest that clients feed it UUIDv1s, and then
break them down into separate hidden {mac:48, timestamp:60, clockseq:14}
columns, enabling per-column compression. (In most installations I've
encountered, there's only one client generating UUIDs for the DBMS anyway, so
these would compress _really_ well, in fact enabling a default packed format
where the timestamp + 5 bits of clockseq are kept in a 1-bit-tagged uint64;
and the full value-size is only needed for exceptions.)
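
As a rough illustration of the decomposition being suggested (not any DBMS's
actual behaviour), Python's uuid module already exposes those fields:

    
    
        import uuid
    
        def split_uuid1(u):
            # Break a UUIDv1 into its natural components so each could be stored
            # (and compressed) as a separate hidden column
            return {
                "node": u.node,            # 48-bit MAC address
                "timestamp": u.time,       # 60-bit timestamp (100 ns ticks since 1582)
                "clock_seq": u.clock_seq,  # 14-bit clock sequence
            }
    
        print(split_uuid1(uuid.uuid1()))
    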

~~~
jiggawatts
> Cheap writes, in an OLTP sense?

Sure, but HIBP is most certainly _not_ an OLTP workload. It is updated
infrequently as a batch process. The official mirror was last modified in July
2019!

Whenever a "new set" of millions of passwords are leaked, the HIBP
maintainer(s) merge it with their existing data set of millions of passwords.

The "update" process is to download the new data set... and that's it. It's
already sorted.

The only step I'm suggesting is to simply convert the pre-sorted HIBP SHA-1
text file to a flat binary file. That is a single linear pass over the file
and requires only a tiny buffer in memory.
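
A sketch of that conversion in Python, streaming line by line (the file names
are made up):

    
    
        import binascii
    
        # Stream-convert the pre-sorted "HASH:COUNT" text file into a flat binary
        # file of concatenated 20-byte hashes, one line at a time
        with open("pwned-passwords-sha1-ordered-by-hash-v5.txt", "r") as src, \
             open("hibp.bin", "wb") as dst:
            for line in src:
                hex_hash = line.split(":", 1)[0]
                dst.write(binascii.unhexlify(hex_hash))
    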

------
marcan_42
This is bad benchmarking. There is no way you're doing a b-tree lookup in
microseconds on an on-disk file... Unless the parts you care about are already
cached.

So either the whole file fits in RAM and you pre-load it (in which case you
have to account for that memory usage), or you have to run benchmarks on
random hashes, which would yield much slower numbers (on the order of 30ms for
an HDD).

Personally, when I implemented this in a web service, I used a bloom filter.
It has some false positives (tunable) and requires a few extra disk reads per
check, but the resulting file is also smaller and the code to generate it and
check it is very, very simple.

[https://gist.github.com/marcan/23e1ec416bf884dcd7f0e635ce5f2...](https://gist.github.com/marcan/23e1ec416bf884dcd7f0e635ce5f2724)

P.S. if you need to sort a huge file, just literally use the UNIX/Linux `sort`
command. No, it does not load it all into RAM. It knows how to do chunked
sorts, dump temp files into /tmp, and then merge them. Old school UNIX tools
are smarter than you think.
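
Roughly what `sort` is doing under the hood, sketched in Python (the chunk
size is arbitrary):

    
    
        import heapq
        import itertools
        import tempfile
    
        def external_sort(src_path, dst_path, chunk_lines=1_000_000):
            # Sort chunks that fit in RAM, spill each sorted run to a temp file,
            # then do a streaming k-way merge of the runs
            runs = []
            with open(src_path) as src:
                while True:
                    chunk = list(itertools.islice(src, chunk_lines))
                    if not chunk:
                        break
                    chunk.sort()
                    run = tempfile.TemporaryFile("w+")
                    run.writelines(chunk)
                    run.seek(0)
                    runs.append(run)
            with open(dst_path, "w") as dst:
                dst.writelines(heapq.merge(*runs))
    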

~~~
ascar
> Personally, when I implemented this in a web service, I used a bloom filter.

When I was working on blocking leaked passwords, I found projects using
bloomfilters for this (e.g. Keycloak). I find false positives completely
unacceptable from a user experience perspective.

Blocking a perfectly fine password is playing with the user's mind. If he
cares even a little, he will wonder what is wrong with his password (was it
leaked???), and if the answer "false positive" isn't available to him, it's
just evil.

A bloom filter is a good tool to filter out the negatives (which should
hopefully be the majority of passwords), but please run all positives against
the full set.

~~~
Deimorz
It depends what purpose you're using this for. If it's something like a
password manager where you're checking the user's existing passwords to see if
any have been breached, then yes, you should make sure not to hit false
positives.

But if you're using it as a way to prevent people from using known-breached
passwords on a site/service, it's really not worth worrying about. False
positives would only be a problem when the user is using bad password
practices anyway. If it's a non-shared, random password like it should be, a
tiny chance of blocking an acceptable one is fine. They can just generate
another one, it's an extremely minor inconvenience at worst.

Just include a note in the error message saying something like, "In very rare
cases, this could be a false positive. Even if it is, you must choose a
different password." The 0.1% of users it impacts (or whatever your error rate
is) will be fine.

~~~
ascar
While I absolutely agree you shouldn't reuse passwords across services, it's a
reality for many users, and I'm convinced it's not acceptable to tell your
user his password might be leaked if there is no indication for it. This is
not a minor inconvenience; it might trigger a major panic, and it's
inconsiderate to ignore that.

Even if you display that there is a minor chance of a false positive, your
user must now think: 0.1% this was a false positive, or 99.9% my password was
leaked! I don't want users to panic if there is no reason to, and I don't want
users to blame false positives if there is reason to panic. I think it's
definitely worth worrying about.

Also, research has shown that forcing users to regularly change passwords
leads to weaker passwords. And as many (most?) users are not using password
managers (yet), expecting a different secure (i.e. long enough,
non-dictionary) password is too far from reality.

~~~
dwaite
You probably shouldn't indicate to them that their password has been leaked.

The database only lists the SHA hash of passwords which have been found in
various datasets and the occurrence count. It does not indicate that the
matching password (one of half a billion) has ever been associated with the
user account.

The fault in knowledge schemes is oversharing the secret. However, there is no
way to know for sure that the user has done this by reusing passwords - you
will have false positives (say, one other person on the internet chose this
password by chance) and false negatives (the user has reused the password all
over the place, but has managed to not be part of a known breach dataset).

------
lalaland1125
You can actually do much better than binary search due to the uniform
distribution of hashes.

[https://en.wikipedia.org/wiki/Interpolation_search](https://en.wikipedia.org/wiki/Interpolation_search)
for example can achieve O(log log n) performance under the uniform
distribution assumption, which is orders of magnitude faster at this scale of
data. Another trick is to start with interpolation search and then switch to
binary search once the remaining window gets small enough.
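
A minimal sketch of interpolation search over an in-memory list of raw 20-byte
digests (an assumption for brevity; the same arithmetic works on file
offsets):

    
    
        def interpolation_search(hashes, target):
            # hashes: a sorted list of 20-byte SHA-1 digests
            lo, hi = 0, len(hashes) - 1
            t = int.from_bytes(target, "big")
            while lo <= hi and hashes[lo] <= target <= hashes[hi]:
                a = int.from_bytes(hashes[lo], "big")
                b = int.from_bytes(hashes[hi], "big")
                # Guess the position, assuming keys are uniformly distributed
                mid = lo if a == b else lo + (t - a) * (hi - lo) // (b - a)
                if hashes[mid] == target:
                    return mid
                if hashes[mid] < target:
                    lo = mid + 1
                else:
                    hi = mid - 1
            return -1
    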

~~~
zamadatix
If you're going to sort the hashes then might as well make a jump table of the
first n bits and a binary search from there.

~~~
thedance
Why do you even need the jump table? If the hash function is working you
should get quite close just by dead reckoning.

------
jonstewart
In digital forensics we often have to do a hash set lookup as in this article.
When the set is constant, you can sort it as the author did, and then perform
a linear scan to determine the maximum error—i.e., how far away a hash value
is from its expected location—and then use a reduced binary
search/interpolation search, where the expected index is used as the midpoint
and the maximum error is used to determine the window. On large hash sets of
this size, the maximum error is still often measured in KB.

It’s probably not the fastest possible algorithm (though likely faster than
what the author obtained), but it plays much better with memory than naive
binary search and the storage format doesn’t have any overhead.
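
A sketch of the idea (and of the expected-location arithmetic), assuming the
sorted hashes are available as a sequence of raw 20-byte digests:

    
    
        def expected_index(digest, n):
            # Expected position of a 20-byte hash among n sorted records,
            # assuming a uniform distribution of hash values
            return int.from_bytes(digest, "big") * n // (1 << 160)
    
        def max_error(hashes):
            # One linear scan to find how far any hash lands from where it "should" be
            n = len(hashes)
            return max(abs(i - expected_index(h, n)) for i, h in enumerate(hashes))
    
        def bounded_lookup(hashes, target, err):
            # Binary search restricted to the window [expected - err, expected + err]
            n = len(hashes)
            guess = expected_index(target, n)
            lo, hi = max(0, guess - err), min(n - 1, guess + err)
            while lo <= hi:
                mid = (lo + hi) // 2
                if hashes[mid] == target:
                    return mid
                if hashes[mid] < target:
                    lo = mid + 1
                else:
                    hi = mid - 1
            return -1
    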

~~~
iomintz
I'm trying to implement this myself. How is the expected location computed? Is
it just hash_as_int / max_sha1_hash * file_size?

------
haberman
There are 555,278,657 passwords in the database. With a Bloom Filter, you
could quickly rule out potential inputs. Even better, there is no need to hash
the input, because... it's already a cryptographic hash.

The input SHA-1 provides 160 bits of hash. If we divide that up into 5 hash
values of 32 bits, we can get a 3% false positive rate with a 483 MiB Bloom
Filter (which will easily fit in memory).

[https://hur.st/bloomfilter/?n=555M&p=0.03&m=&k=](https://hur.st/bloomfilter/?n=555M&p=0.03&m=&k=)

This will be blindingly fast. We're talking 5 random reads from memory. Even
in the worst case of 5 cache misses, we're still well under 1us. This will let
us return "not found" for 97% of inputs that aren't in the database.

If we get a hit there, then we could turn to a larger bloom filter for greater
accuracy, but we'd have to actually hash the key to get more hash bits.

Of course if you get hits for all your bloom filters, you still have to do a
real lookup to positively confirm that the key is in the database.
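
A rough sketch of what that could look like, reusing the SHA-1 bits themselves
as the hash values (the 2^32-bit filter size here is my own assumption, not
the 483 MiB figure above):

    
    
        import hashlib
    
        FILTER_BITS = 1 << 32  # 2^32 bits = a 512 MiB bitmap
        K = 5                  # five 32-bit slices of the 160-bit SHA-1
    
        def _positions(digest):
            # No extra hashing needed: slice the SHA-1 into five 32-bit values
            return [int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % FILTER_BITS
                    for i in range(K)]
    
        def add(bitmap, password):
            digest = hashlib.sha1(password.encode("utf-8")).digest()
            for pos in _positions(digest):
                bitmap[pos >> 3] |= 1 << (pos & 7)
    
        def maybe_pwned(bitmap, password):
            # False means definitely not in the set; True means "probably", so
            # confirm against the real database before reporting anything
            digest = hashlib.sha1(password.encode("utf-8")).digest()
            return all(bitmap[pos >> 3] & (1 << (pos & 7)) for pos in _positions(digest))
    
        # bitmap = bytearray(FILTER_BITS // 8)  # 512 MiB of zeroes
    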

~~~
ascar
> Of course if you get hits for all your bloom filters, you still have to do a
> real lookup to positively confirm that the key is in the database.

As I replied elsewhere, please do not ignore this part, for the sake of good
user experience. I've seen it ignored in open source projects (Keycloak). You
don't wanna block perfectly fine passwords because of false positives, for
reasons unknown to the user. That might cause unwanted reactions on your
user's side ("was my password leaked!?!?").

~~~
ThePhysicist
The false positive rate can be made arbitrarily small. I have an
implementation with a false positive rate of 1 in 1,000,000 and a filter size
of just 1.8 GB
([https://github.com/adewes/have-i-been-bloomed](https://github.com/adewes/have-i-been-bloomed)).
The cost of lowering the rate is logarithmic, so you could go to much lower
values for little cost (1 in 1,000,000,000 would be 2.8 GB), so false
positives are not really a problem in practice. If you really want, you could
still perform an exact check against a DB when the filter returns true to rule
out false positives with certainty, though at one false positive per billion
requests this might be overkill.

------
glangdale
Any good literal search algorithm could do a one-off search for a single long
literal 'needle' way faster than the roughly 1GB/s that the author attained
with grep.

A single string of that length is extremely easy to search for with a range of
different algorithms - I would be surprised if a decent approach couldn't keep
up with memory bandwidth (assuming your 22GB file is already, somehow, in
memory). The mechanics of simply reading such a big file are likely to
dominate in practice.

We implemented some SIMD approaches in
[https://github.com/intel/hyperscan](https://github.com/intel/hyperscan) that
would probably work pretty well (effectively a 2-char search followed by a
quick confirm) for this case.

Of course, that begs the question of whether any kind of whole-text search is
actually the answer here. Assuming that you really _do_ have more than a few
searches to do, the end result of keeping the data in any kind of prebuilt
structure is _way_ superior to an ad hoc literal search.

------
ocfnash
I love this write up, and while the solutions discussed are excellent, I think
the general-purpose FM Index data structure might work even better. I confess
I'd have to read this post more closely to be sure, but I find the FM Index
data structure so appealing, I'm always looking for excuses to promote it!

The FM Index is ideally suited to the problem of repeatedly searching a large
fixed corpus for many different short substrings, and achieves optimal time
complexity: linear in the length of the substring, with excellent constants
independent of the length of the corpus (!).

Some years ago I undertook a very similar exercise to that of the author, except
using the leaked Adobe password data rather than the HIBP data, and found the
FM Index worked well: [http://olivernash.org/2014/01/03/dna-of-a-password-
disaster/...](http://olivernash.org/2014/01/03/dna-of-a-password-
disaster/index.html)

------
tylerchr
I undertook a similar endeavor a while back. My solution[1] rested on the
observation that you don’t need a B-tree to do a binary search; you only need
to be able to calculate the byte offset of the Nth hash. With some
optimizations, this approach produced a 9.9GB file with similarly fast
lookups.

[1]:
[https://github.com/tylerchr/pwnedpass/blob/master/README.md#...](https://github.com/tylerchr/pwnedpass/blob/master/README.md#file-
format)

~~~
fwip
It looks like the link to your blog post is broken (at the bottom of the
README).

~~~
tylerchr
Awkward! Thanks for the tip. I added a copy of that post to the repo and fixed
the link.

------
krackers
If you were to just throw the file into a database, wouldn't the database's
index essentially lead to the same result (a B-tree, compacted using a
bulk-loading procedure)?

~~~
nightfly
I briefly had a Rocket/IRC bot that talked to a postgres instance with the
HIBP DB loaded into it, and yes it worked great.

------
dvasdekis
I respect that the author learnt the underlying concepts, which are not
simple. But is the net result truly that the default Postgres index method was
perfectly suitable for this use case?

~~~
stryku2393
Thanks (: About Postgres: I wanted to create a library and a CLI without
dependencies. I wanted them to be complete tools for doing this one thing,
tools that you can just grab and use without installing anything.

------
emmelaich
Just to remind people of `look`, which does a binary search.

It might be superior to the article's methods if you only want to search for a
few hashes.

~~~
xurukefi
Didn't know about look. Seems good enough (tested on a HDD):

    
    
        $ dd of=pwned-passwords-sha1-ordered-by-hash-v5.txt oflag=nocache conv=notrunc,fdatasync count=0
        0+0 records in
        0+0 records out
        0 bytes (0 B) copied, 9.5864e-05 s, 0.0 kB/s
        $ time look `sha1sum <(echo -n password) | tr [a-z] [A-Z] | cut -d" " -f1` pwned-passwords-sha1-ordered-by-hash-v5.txt
        5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8:3730471
    
        real    0m0.137s
        user    0m0.002s
        sys     0m0.013s
    

dd is used to drop the file from the fs cache, something the author probably
didn't do given the unrealistic 49μs. It's simply not possible to fetch data
from an HDD that fast.

~~~
stryku2393
True, the benchmarks are bad. I'll rewrite them (to drop the cache every time)
and update the results.

------
_wldu
You may also consider using a bloom filter to do this:
[https://github.com/62726164/bp](https://github.com/62726164/bp)

------
chinesempire
Using ETS with Erlang (or Elixir) I get sub-50μs (30μs on average) lookup
times.

Memory usage is quite high, around 95 bytes per element, but I'm sure that by
spending more than the 5 minutes I did on it, one could take it down
considerably.

For reference, this is the code I used.

I converted the SHA-1 hashes to MD5 to save memory, given that we don't care
about collisions (which are very unlikely anyway); we just want to know
whether the password was there or not.

    
    
        defmodule Pwned do
          def load do
            :ets.new(:table, [:named_table, :set])
    
            File.stream!("pwned-passwords-sha1-ordered-by-hash-v5.txt")
            |> Stream.each(fn line ->
              <<hash::binary-size(40), ":", _rest::binary>> = String.trim_trailing(line)
              hash = :crypto.hash(:md5, Base.decode16!(hash))
              :ets.insert(:table, {:binary.copy(hash), true})
            end)
            |> Stream.run()
          end
    
          def lookup_hash(hash) do
            hash = :crypto.hash(:md5, Base.decode16!(hash))
    
            case :ets.lookup(:table, hash) do
              [] -> false
              _ -> true
            end
          end
    
          def lookup_password(password) do
            lookup_hash(:crypto.hash(:sha, password) |> Base.encode16())
          end
        end

~~~
geocar
I agree these performance numbers don't seem great. Using q (another
interpreted language; not compiled) I get 5µsec on my Macbook Air:

Here's my load script:

    
    
        \wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v5.7z
        \7z -so e pwned-passwords-sha1-ordered-by-hash-v5.7z pwned-passwords-sha1-ordered-by-hash-v5.txt | cut -c1-40 | xxd -r -p > hibp.input
        `:hibp 1: `s#0N 20#read1 `:hibp.input
    

I can then shut down this process, and start a new one:

    
    
        q)hibp:get`:hibp; / this mmaps the artefact almost instantly
        q)\t:1000 {x~hibp[hibp bin x]} .Q.sha1 "1234567890"
        5
    

It's so fast I need to run it 1000 times to take just 5msec (5µsec average
lookup time!). I imagine converting to md5 would be substantially faster since
there's a 16-byte scalar type in q I would be able to use.

------
ttt111222333
I did something similar when I wanted to search the HIBP database and if you
are okay with some false positives you can do better than your results, both
in terms of speed and size.

If you are okay with false positives, you can use a bloom filter and tune the
number of false positives you want. I chose a false positive rate of 1 in a
million so my data structure was still very accurate in determining whether a
password was already hacked.

It only took 30 microseconds to determine if a password was in the list, and
for size it was close to the theoretical limit, at 22 bits per element, or
~1.5 GB.

I originally used a plain bloom filter, which made it 2 GB, but given that a
bloom filter is just a sequence of 0s and 1s, I was able to use Golomb coding
to shrink it down to 1.5 GB.

The time to process the original 24 GB, however, is something I could have
improved, but I kinda lost interest once I already had something close to the
theoretical minimum size that could determine whether a password exists within
30 microseconds.

Anyways take a look if you're interested in trying a different approach:
[https://github.com/terencechow/PwnedPasswords](https://github.com/terencechow/PwnedPasswords)

------
bArray
Surely if you know that the hashes will have an ~even distribution you can
quite quickly make some assumptions about roughly where the key will be?

I'm not entirely sure I'm sold on the speed gained by splitting files vs doing
a simple seek operation to an offset [1]. There's probably a bunch of time
lost searching the filesystem through a file/folder structure?

Also the simple act of converting the numbers from ASCII to binary should save
a bunch of disk space too (and make searching quicker)?

Great write-up though, good to see a bunch of solutions tried.

[1]
[http://www.cplusplus.com/reference/cstdio/fseek/](http://www.cplusplus.com/reference/cstdio/fseek/)

~~~
dana321
Thinking about it, you would need a 64-bit value to point to the offset
because of the size of the file.

But an index file of 64-bit offsets could easily be seeked into, using the
first 2 or even 4 bytes of the hash as the position to read the offset from.

Though with 4 bytes, that becomes an index file of 2^32 64-bit entries (32
gigabytes)! But it would probably be much faster, as you only do one seek in
the index file, then another seek into the main file, then search a much
shorter distance to the result!

If the system has enough ram, the operating system will cache the files anyway
and will be pretty fast i think.

(can you tell my first job involved writing ad-hoc database systems?)
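
A sketch of that two-level layout in Python, assuming the flat binary file of
sorted 20-byte hashes discussed elsewhere in the thread and a 2-byte prefix
(65,536 buckets, so the index itself stays tiny):

    
    
        RECORD = 20  # raw SHA-1 size
        PREFIX = 2   # index on the first 2 bytes of the hash
    
        def build_index(path):
            # One sequential pass: count records per prefix, then turn the counts
            # into starting record numbers (65,537 entries, the last is a sentinel)
            counts = [0] * (1 << (8 * PREFIX))
            with open(path, "rb") as f:
                while True:
                    rec = f.read(RECORD)
                    if not rec:
                        break
                    counts[int.from_bytes(rec[:PREFIX], "big")] += 1
            starts, total = [], 0
            for c in counts:
                starts.append(total)
                total += c
            starts.append(total)
            return starts
    
        def lookup(path, starts, target):
            # One seek into the bucket for this prefix, then a short scan:
            # each bucket holds only ~8,500 of the 555M hashes
            p = int.from_bytes(target[:PREFIX], "big")
            with open(path, "rb") as f:
                f.seek(starts[p] * RECORD)
                for _ in range(starts[p + 1] - starts[p]):
                    if f.read(RECORD) == target:
                        return True
            return False
    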

------
viraptor
Another solution for cases where you don't add new entries all the time and
can reindex the whole database every once in a while instead: use cdb. There's
a nice description of the internals and how it works.
[http://www.unixuser.org/~euske/doc/cdbinternals/index.html](http://www.unixuser.org/~euske/doc/cdbinternals/index.html)
It guarantees access in two disk reads.

The original version has the limit of 4gb, but there are 64b versions as well
- for example
[https://github.com/pcarrier/cdb64?files=1](https://github.com/pcarrier/cdb64?files=1)

------
ThePhysicist
Great writeup! Another way to query the DB are probabilistic filters. I wrote
a Bloom filter based query API for this a while ago:

[https://github.com/adewes/have-i-been-
bloomed](https://github.com/adewes/have-i-been-bloomed)

Very fast and highly space efficient as well: 17,000 requests per second on a
conventional laptop, with 1.7 GB of memory required at a false positive rate
of 1 in 1,000,000 (and no dependencies on databases or anything else).

------
saagarjha
> A node is a simple structure of sixteen 32 bit values. The values are
> 'pointers' to next nodes, at given character of the SHA-1 hash. So, one node
> takes 16 * 4B = 64B.

I have often used a dictionary to store tries to prevent this kind of memory
usage–it's auto-resizing, if slightly slow. But hey, you're chasing pointers
anyways, so it's not like going through the tree was going to be fast anyways…

(I'm also curious about the "scumbag Steve" hat on the B-tree, but I digress.)

------
DmitryOlshansky
> Trie structure sucks if you have pretty random words.

Classic uncompressed trie sucks pretty much in all cases.

Now if we go for half-decent implementation of packed variation, it does get
significantly better:

[https://en.wikipedia.org/wiki/Radix_tree](https://en.wikipedia.org/wiki/Radix_tree)

------
QuadrupleA
Cool learning exercise and fun read. Can't help but think SQLite could do this
screamingly fast and very easily, with nice compact storage (blob primary key
with the hashes, without-rowid table to avoid a hidden integer per row).

That said, 49us is very impressive. Hard to beat low level custom coded
solutions.
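
A quick sketch of that table shape using Python's sqlite3 (table and file
names made up):

    
    
        import hashlib
        import sqlite3
    
        con = sqlite3.connect("hibp.db")
        # Raw 20-byte hash as the primary key; WITHOUT ROWID drops the hidden
        # integer rowid, so the table is just a clustered B-tree of blobs
        con.execute("CREATE TABLE IF NOT EXISTS hibp (sha1 BLOB PRIMARY KEY) WITHOUT ROWID")
    
        def pwned(password):
            digest = hashlib.sha1(password.encode("utf-8")).digest()
            row = con.execute("SELECT 1 FROM hibp WHERE sha1 = ?", (digest,)).fetchone()
            return row is not None
    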

~~~
geocar
> Hard to beat low level custom coded solutions.

Challenge accepted!

I used the following in q to download and load the data into a disk object I
could mmap quickly:

    
    
        \wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v5.7z
        \7z -so e pwned-passwords-sha1-ordered-by-hash-v5.7z pwned-passwords-sha1-ordered-by-hash-v5.txt | cut -c1-40 | xxd -r -p > hibp.input
        `:hibp 1: `s#0N 20#read1 `:hibp.input
    

This took about an hour to download, an hour to 7z|cut|xxd, and about 40
minutes to bake. At complete, I have an on-disk artefact in kdb's native
format. I can load it:

    
    
        q)hibp:get`:hibp; / this mmaps the artefact almost instantly
    

and I can try to query it:

    
    
        q)\t:1000 {x~hibp[hibp bin x]} .Q.sha1 "1234567890"
        5
    

Now that's 1000 runs taking sum 5msec, or 5µsec average lookup time! It's
entirely possible my MacBook Air is substantially faster than the authors'
machine, but I think being ten times slower than an "interpreted language"
suggests there's a lot of room to improve!

------
aquadrop
Why was the data sorted by usage count in the first place? The hash of the
password doesn't give you the password, so you just get "something was used x
amount of times". It seems like you always want to look up by hash, so sorting
by hash from the beginning makes more sense.

------
Avamander
There was an app for Android phones that searched for default router passwords
based on SSID using this method. It indeed was very fast, even on very bad
hardware.

------
barrkel
B-trees, with their trade-off between slow disk seeks and fast in-memory
scans, make just as much sense for slow memory accesses and fast in-cache
scans.

------
7532yahoogmail
Nicely done and explained.

