A journey to searching Have I Been Pwned database in 49μs (stryku.pl)
333 points by stryku2393 12 months ago | 123 comments



This is a really informative write-up and an excellent learning exercise.

It's worth noting that haveibeenpwned's API has a really clever design for allowing people to look up their passwords without transmitting them to the site.

It's explained here: https://www.troyhunt.com/ive-just-launched-pwned-passwords-v...

The short version is that you can take the first 5 characters of a SHA-1 hash and hit this endpoint:

https://api.pwnedpasswords.com/range/21BD1

The endpoint returns (right now) a list of 528 full hashes along with counts. You can compare your full calculated SHA-1 hash to that list to see if the password is present in the dump.

The trick here is called k-Anonymity - I think it's a really elegant solution. This technique is written up in more detail here: https://blog.cloudflare.com/validating-leaked-passwords-with...
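
For anyone who wants to poke at it, here's a rough Python sketch of the client side (stdlib only; the endpoint and response format are just what's described above):

    import hashlib
    import urllib.request

    def pwned_count(password):
        # SHA-1 the password locally, then split into a 5-char prefix and 35-char suffix
        digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
        prefix, suffix = digest[:5], digest[5:]
        # Only the 5-char prefix ever leaves your machine
        url = "https://api.pwnedpasswords.com/range/" + prefix
        body = urllib.request.urlopen(url).read().decode("utf-8")
        # Each response line is "SUFFIX:COUNT"; the comparison happens locally
        for line in body.splitlines():
            cand, _, count = line.partition(":")
            if cand == suffix:
                return int(count)
        return 0

    print(pwned_count("password1"))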


Honestly they’re being super mathematical about it but it’s really nothing fancy at all. SHA is designed to be used like this, CloudFlare hasn’t done anything remotely ingenious (and I wouldn’t mind except they go out of their way to talk about how much better their fancy new algorithm is compared to other multi-set intersection theories).

E.g. 512-bit SHA-2 and SHA-3 may be truncated at 128 or 256 bits if that’s all the entropy you need (and you don’t need to be compatible with the formal SHA2/SHA3-512/256 spec). Here, CloudFlare is truncating to an intentionally low entropy of just 20 bits, not to reduce the security but rather to intentionally increase the collisions. It’s ultimately just a glorified hash table and as any CS student can tell you, the bucket size is just a function of the hash size (v1 of the api: 128-bit hash, v2 of the api: 20-bit hash).

(I don’t like pretension.)


It's still pretty clever and non-trivial to come up with.

I bet most people who haven't thought about it beforehand or heard the solution would fail an interview question like: "So, we have a database of hundreds of millions of passwords and we want the user to check if his password is inside, without the user actually sending his password... or his hash, and without us outright revealing all our passwords/hashes to the user."

Their solution is a pretty good tradeoff imho.


This is pretty much the case for probabilistic membership functions. It's kind of a broken bloom filter where you take only the front few bits without hashing. I wouldn't expect someone to come up with that under the interview stress, but if you know a few basic algorithms, this should be pretty simple/familiar.

It's also similar to how we used to split mail folders to avoid huge directories. Or how Debian splits its repo by package prefix: http://ftp.us.debian.org/debian/pool/main/


Exactly, and good examples. I believe the general concept is just sharding, for anyone who wants to google it further. Typically, naive sharding on the first few bytes is useless because it results in uneven distributions, but (and this is what I mean when I said SHA was designed for this) a cryptographic hash should produce a uniform, random-appearing distribution across all its bits, meaning the resulting hash is uniquely suitable for basic sharding by prefix.

But then they wouldn’t have been able to give it a fancy name if they called it what it is.


For someone who claims to not like pretension (sic), you're being pretty pretentious.


After seeing this I reread the post you're replying to. I don't see the pretension.

Pretentious: adjective. characterized by assumption of dignity or importance, especially when exaggerated or undeserved: a pretentious, self-important waiter. making an exaggerated outward show; ostentatious. full of pretense or pretension; having no factual basis; false.

If you're suggesting that their analysis is false, you should probably point out why. The self importance part I'm not seeing. Aren't they just stating the facts as they see them?


The "really clever design" noted by simonw was called "nothing fancy at all" by ComputerGuru, who said haveibeenpwned's API design "hasn't done anything remotely ingenious"; that's what made the comment pretentious. (In this case it doesn't matter whether or not he is factually correct.)

The reply could very well be interpreted as "You think that k-Anonymity is fancy? How primitive of you. This way is the natural way of doing that thing and I would know because I'm the smartest person in the room. Oh by the way I dislike people who are pretentious."


I’m sorry if that’s how you understood it. This is not about the HIBP api but rather the braggadocio in Cloudflare’s post. I meant “they are purposely obscuring how obvious and intuitive their solution is behind all this math so they can seem extra smart, and it’s a shame because it would have been a better and more actually educative post if it just said ‘we did something very obvious and maybe you can just as easily do the same to your own data without being a math geek and it’s no big deal at all that we did, but here’s how we did it anyway.’”


You're misrepresenting him, because nowhere did ComputerGuru say what you claim to be quoting about HIBP.

The not-remotely ingenious part was regarding Cloudflare.


Consider the possibility that those of us who found it pretentious aren't lying, and that you not seeing it is perhaps more related to how you personally see things.


Now you're misrepresenting my comment. I didn't anywhere discuss whether or not it was pretentious; I just pointed out that the comment was factually incorrect.


Why did you do that then?


Phrases like "honestly", "nothing fancy", "glorified", "any cs student" all convey (at least to this native English speaker) that whatever is being discussed is being looked down upon.


Right, so the post could be called "dismissive". It's hard to see it as openly glorifying anything though, so the charge of pretentiousness only works if one adopts some extra context.

If you assume that everyone who is dismissive is actually wrong and just pretending to know stuff, then I guess being dismissive is pretentious.


Sometimes it's the only way to look at things...


The analysis might not be false - but it sure is dismissive. Dismissiveness that jars with the stated dislike of 'pretension' and that makes the poster sound pretentious themselves.


You seem to be saying that one cannot be sure of one's own position and not be pretentious, is that what you mean?


No. What I’m trying to say is that any comment on a public forum does not exist in a vacuum and acquires additional properties based on where and when it is posted.


Uh... then you should probably edit your post, because I find it impossible to read this more general statement into the first post.


That's actually pretty similar to how Google's Safe Browsing (used by Chrome, Firefox and Safari, but not Edge) works. Instead of sending Google your full URL (although that's an option), you can make some transformations, SHA-256 the result and send the first few bytes. The server then replies with the full hashes matching these prefixes. Then you can check whether your URL is on the list. Very similar.


The core trick (and difference) of Safe Browsing is that you don't send stuff most of the time. Safe Browsing clients all download the same summary information which tells them which prefixes might have unsafe hashes. Most sites you visit will not match any unsafe prefix and so your browser doesn't call Google at all.

Pwned Passwords chooses prefixes short enough that any password you wonder about will cause a prefix to be looked up that has lots of Pwned Passwords in it. Was one of them yours? Only you know, this is k-Anonymity.

Safe Browsing chooses prefixes long enough that many sites you look at won't match anything at all. There is still arguably k-Anonymity because the total number of possible URLs is so vast, but that's not their main goal.


I was thinking: with all the recent discussion about the privacy problems with DNS (over HTTPS), would it be possible to use the k-Anonymity approach to build a DNS server where you request DNS details based on a SHA-1 prefix of the domain you are looking up?

I have a suspicion the answer is no, due to the way DNS lookup is recursive and so all DNS servers would have to implement the protocol. Would there be a way to make this work?


I tried https://api.pwnedpasswords.com/range/e38ad

And sure enough:

214943DAAD1D64C102FAEC29DE4AFE9DA3D:2413945

Wow, 2.4 million for "password1". Apparently "123456" is the most common password of all time. Even random phrases and memes I tried were pwned, like "The Lord of the Rings" and "correcthorsebatterystaple".


I know you meant the word colloquially, but those aren't random phrases and that's why they're in the dataset.

"The Lord of the Rings" and "correcthorsebatterystaple" match but similar truly random phrases do not, I picked these without looking:

This Duke and Some Circles

impossiblegoosecheesebinder

Sure enough neither of them was in the dataset.


> Sure enough neither of them was in the dataset.

Give it a week...



Worth noting that Tr0ub4dor&3 wasn't pwned, unlike "correct horse battery staple", according to https://haveibeenpwned.com/Passwords :-)


> It's worth noting that haveibeenpwned's API has a really clever design for allowing people to look up their passwords without transmitting them to the site.

If you use the API to look up one of your passwords and it turns out to be pretty poor, then the service knows you're highly likely to have been looking up the top (by count) result, so it now has the password and your IP address. I appreciate this service provider is well respected, but still, this is also worth noting.


k-Anonymity is not particularly clever. Every time you use a cryptographic hash to look something up you have a tradeoff around the length of the hash.

A long hash will identify your password very precisely.

A very short hash, e.g. down to 1 byte, will have so many matches as to be useless.

Cloudflare chose:

> For example; the Hash Prefix 21BD1 contains 475 seemingly unrelated passwords, including:

...and this allows attackers to easily create a list of short hashes of common passwords and try them against matching accounts, as they point out:

> It is important to note that where a user's password is already breached, an API call for a specific range of breached passwords can reduce the search candidates used in a brute-force attack


It's clever in that if your password does not appear in the dataset, the service doesn't have much information (or really any information that could be used to get your password).


Why don't the hashes I see at this endpoint start with the 5 digits I gave as an argument?


I assume the prefix has been omitted from the output (each hash is 35 characters, but a full one should be 40). This of course reduces the size of the result, conserving bandwidth.


Thank you for these resources. Truly ingenious.


This and other similar solutions seem awfully over-engineered.

a) Hashes are constant size (20 bytes for SHA1)

b) You only care if they're present or not in the database. There's no associated variable length data.

The simplest yet very efficient format is simply a sorted array of "byte[20]", with binary search as the lookup. No headers, no pointers, no custom format of any type. Literally just 20 x n bytes where 'n' is the number of hashes; the byte offset of the i-th hash is just i multiplied by 20.
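
A minimal sketch of that lookup in Python, assuming the raw 20-byte hashes have already been written to a sorted binary file (the "hibp.bin" name is made up):

    import hashlib
    import os

    RECORD = 20  # a raw SHA-1 digest is 20 bytes

    def contains(path, digest):
        # Classic binary search over fixed-width records, seeking as we go
        with open(path, "rb") as f:
            lo, hi = 0, os.path.getsize(path) // RECORD - 1
            while lo <= hi:
                mid = (lo + hi) // 2
                f.seek(mid * RECORD)
                candidate = f.read(RECORD)
                if candidate == digest:
                    return True
                if candidate < digest:  # bytes compare lexicographically
                    lo = mid + 1
                else:
                    hi = mid - 1
        return False

    print(contains("hibp.bin", hashlib.sha1(b"password1").digest()))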

If you really, really want a B-Tree (why?), just stuff it in any database engine. Literally anything will handle a single fixed-length key lookup efficiently for you.

    CREATE TABLE "HIBP" ( "SHA1" BINARY(20) PRIMARY KEY );
    SELECT 1 FROM "HIBP"
    WHERE "SHA1" = 0x70CCD9007338D6D81DD3B6271621B9CF9A97EA00
There. I solved the blogger's problem in literally under 5 minutes without having to write a custom binary.

You can trivially query databases from both web apps and CLI tools, and you can do this with batch queries too. E.g.: WHERE "SHA1" IN (... list... )

PS: Text-based tools (such as most shell tools) suck at this type of binary data. The newline terminated hex representation is 41 bytes per hash, so just over double the required size. Clever 5-10% prefix compression tricks pale in comparison to not doubling the data size to begin with.

PPS: A pet peeve of mine is older Java database "enterprise" applications that use UCS-2 "nvarchar" text columns in databases to store GUID primary keys. The 16 byte GUID ends up taking a whopping 78 bytes to store!


You can do better than binary search here with interpolation search since the data is equidistributed. (You can do it for any data with a known distribution, but it's unlikely to be worth it for any data set that isn't linearly distributed.)

Roughly, instead of picking the 1/2 point every time, you pick a point b/256 of the way in, where b is the current byte value. That gets you log log n search. (You can of course use 2 bytes, or 3, or...)
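
A minimal sketch of interpolation search in Python, over an in-memory sorted list of integers (e.g. hash prefixes); the proportional guess is the only difference from binary search:

    def interpolation_search(keys, target):
        # keys: sorted list of ints drawn from a roughly uniform distribution
        lo, hi = 0, len(keys) - 1
        while lo <= hi and keys[lo] <= target <= keys[hi]:
            if keys[lo] == keys[hi]:
                return lo if keys[lo] == target else -1
            # Guess the position proportionally instead of always halving
            mid = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
            if keys[mid] == target:
                return mid
            if keys[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1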


Yes, and hence a flat array data file would allow notably better performance than this B-Tree solution.

The fastest method would be to treat the 20-byte hash as a floating-point number in the range 0.0 to 1.0. This would let you use Newton's iteration method to very accurately guess the approximate location in the file and then narrow it down from there. By reading small blocks of 0.5-4KB you could probably get to the required hash in just 1-2 reads, which is either optimal or very close to it.

Do clever things to simple data!


As I said in the article, I rejected binary search and similar approaches on purpose.

In general, I rejected all approaches that operate on a sorted file. That's simply because keeping the file sorted is slow when you need to insert more data in the future. With the B-tree approach, you have fast searching as well as fast inserting. That's the reason.


The HIBP database is updated as infrequently as once a year.

Sorting a handful of gigabytes was a solved problem over 20 years ago. Sorting a file using an "online" operation like row-by-row insertion into a B-Tree is necessarily slower than any offline sort. Merge sort is particularly efficient with data that won't fit in memory.

If you do need "online" operation, a real database engine provides this (and more) for practically zero effort. It's almost as if... they were designed for this purpose!


There is no restriction that limits okon to HIBP passwords. Actually, the goal is to handle hashes no matter where they come from. A user may have reasons to update the db frequently.

Yes, you're right that sorting data using a B-tree is slower than other methods. But I don't use it only for sorting. It's not like 'create the B-tree and forget about it'. You create the tree from the initial data, and if you need to insert any more hashes later, you can do it really quickly, without needing to sort everything again.

I agree again. If I'd created okon only for my home usage, I'd probably use a real database. But I didn't want to depend on anything. I wanted to create a library that can be used by any program and just works, without forcing the user to install anything else. That's why I sort the data on my own, and that's why I implemented the B-tree on my own. Everything is in okon's codebase.


How do you use Newton’s method here to go faster than what I described? How do you treat a 20 byte hash as a floating point number? Unless you mean by dividing part of it at a time and interpolating linearly, which sounds like what I suggested :-)


Take the first 8 bytes, treat them as a 64-bit integer, and then convert to a 64-bit floating-point number and divide by 2^64 to get a number in the range 0.0-1.0.

Multiply this by the number 'n' of hashes in your file, and it'll very closely approximate the hash's ordinal in the list, because as you said hashes are equidistributed very well.

Now multiply this ordinal by 20 and load a small slice of the file a few KB to either side of where you're aiming so that there's essentially a 100% chance of finding the hash in one "random read" operation.

If not, you now have a block of hashes that is close, but not quite what you want. You can use the floating point conversion trick to see "how far away" you are, calculate a more accurate estimate and do a second read. It's very unlikely you'd need a third read step.
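
Roughly this, in Python, assuming the file is raw 20-byte hashes sorted ascending (file name and window size are made up; a real version would widen the window or retry on a miss, as described above):

    import os

    RECORD = 20
    WINDOW = 4096  # bytes to read around the estimated position

    def lookup_near_estimate(path, digest):
        n = os.path.getsize(path) // RECORD
        # First 8 bytes as an integer in [0, 2^64) -> fraction of the keyspace
        frac = int.from_bytes(digest[:8], "big") / 2**64
        # Estimated ordinal, then a byte offset aligned to a record boundary
        offset = max(0, int(frac * n) * RECORD - WINDOW // 2)
        offset -= offset % RECORD
        with open(path, "rb") as f:
            f.seek(offset)
            block = f.read(WINDOW + RECORD)
        # One scan of the block; on a miss, re-estimate and read again
        return any(block[i:i + RECORD] == digest
                   for i in range(0, len(block) - RECORD + 1, RECORD))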


Yep! And here's a mini guide I wrote to building a SQLite DB that queries in sub-millisecond time:

https://gist.github.com/timmc/df5fbb6e069fb8c1c4e181a29930ac...

I didn't bother with using binary keys, and just used hex, because I couldn't be arsed to do something more than `.mode csv` and `.import hashes.lst hashes` at the time, but it would be easy to write a tiny Python program to halve the space, as you say.


1. The raw array solution doesn't work if you can't fit the entire thing in memory. You could try mmap'ing the file and letting the kernel do the paging for you, I guess.

2. There's nothing over engineered about a B-Tree. It's a simple, well understood data structure which solves this exact problem.

3. There are meaningful differences in what you can do with a solution that gives you answers in microseconds vs milliseconds, as the blogger mentions.


> The raw array solution doesn't work if you can't fit the entire thing in memory.

Of course it works. You don't have to literally load the whole file into memory as an array. That's not how file I/O works! Just fseek() to the desired multiple-of-20 address and read 20 bytes.

> You could try mmap'ing the file

You'd only try that if you haven't read the documentation for mmap, just like a bunch of Rust programmers did. They started on 64-bit machines and never noticed that mmap is limited to a ~256MB window on most 32-bit architectures. Certainly less than the 11GB needed for this problem!

> There's nothing over engineered about a B-Tree.

Yes there is. This guy hand-rolled 400+ lines of low-level data structure manipulation code just for the B-Tree manipulation. How much do you want to bet that he's got zero memory safety issues or other unsafety in that thing?

Would you link this to your web app? Really, a C++ library?

> That gives you answers in microseconds vs milliseconds

I don't want to call him a liar... but now I have to. At the very least, he's made fatal mistakes in his benchmarking.

There is no way that he can do random password lookups from on-disk structures in under 50 microseconds, unless he's got an Intel Optane SSD, and even then I would be highly suspicious.

He's either cached the entire data structure in memory (at which point why bother with such a complex solution), or he's used the same set of passwords over and over for testing (which will falsely show a much lower latency than you would get with real passwords).

So answer me this: How would this solution work on a typical auto-scale cluster of stateless web server VMs? Do you replicate 5-12GB to every VM every time this database changes? Or do you put one copy on a network share? Congratulations, you now have 1ms latency minimum! That fancy microsecond optimisation has just gone out the window...

To be blunt: If your password-testing is the bottleneck to the point that a 50 microsecond latency is needed, then you're running a service that gets 20,000 password resets per second. At this point you're bigger than Office 365 and G Suite... combined, and you likely have much more important scaling concerns.


That's exactly why I publish my pet projects. To learn from others.

You're right. The benchmarks are bad. I benchmarked the best and the worst case within the scope of the original file, so I look up the first and the last hash.

I totally missed that if I look for the same hash over and over again, I'd end up reading the same parts of the B-tree file, so they can be easily cached. That's probably the reason it seemed so fast.

No lies here, I just missed this (:

I'll rewrite benchmarks and update the results.


> You'd only try that if you haven't read the documentation for mmap, just like a bunch of Rust programmers did. They started on 64-bit machines and never noticed that mmap is limited to a ~256MB window on most 32-bit architectures.

Do you have a reference for that? I've never heard of this and I can't imagine the reason for such a limitation.


Having done a lot with mmap on 32-bit systems, I can't remember such a technical limitation either...

...although in practice it might be hard to find 256MB of contiguous memory in a 4GB virtual memory space due to fragmentation from other allocations and shared libraries, especially with ASLR.


https://stackoverflow.com/questions/5518084/memorymappedfile...

There's the hard limit of 2GB for most versions of 32-bit Windows and 4GB for any operating system.

Couple that with the requirement for a contiguous address space as well as various page table entry (PTE) limits, and you get all sorts of "soft" limits way before 2GB. From what I've heard, 256MB is relatively safe to map, but anything much larger than that is increasingly likely to fail.

Correctly written code should be able to work with moveable "windows" into the file as small as 32MB to be properly robust, especially if the process memory is already fragmented.

Lots of software crashes with large files on 32-bit machines because of this. E.g.: https://www.monetdb.org/pipermail/users-list/2009-January/00...

As a more recent example, ripgrep had issues on 32-bit platforms because of a bug in the way the underlying mmap library worked in Rust.

Even on 64-bit platforms you can run into trouble. For example: https://jira.mongodb.org/browse/SERVER-15070

In that example, Windows Server 2008 R2 has an 8 TB limit. You could hit that if using a tool like ripgrep to do "forensic analysis" of disk images from a SAN, where virtual disks typically have a 16 TB limit. So if you mount a SAN snapshot and open the disk as a file to scan it, you will hit this limit!

Programmers make all sorts of invalid assumptions...


ripgrep doesn't require memory maps, and if they fail to open, it will fall back to a more traditional buffering strategy: https://github.com/BurntSushi/ripgrep/blob/50d2047ae2c0ce2ed...

ripgrep has always had a fast traditional buffering strategy using `read` calls for searching, because I knew that mmap couldn't be used in every case.

Anyway, this has been fixed for a couple years at this point, so if you're still experiencing a problem, then please file a new bug report.

> As a more recent example, ripgrep had issues on 32-bit platforms because of a bug in the way the underlying mmap library worked in Rust.

This is false. The bug you're thinking about is probably https://github.com/BurntSushi/ripgrep/issues/922, which was not caused by an underlying bug in memmap. memmap did have an underlying bug with respect to file offsets, but ripgrep did not use the file offset API. The bug was caused in ripgrep itself, since I made the classic mistake of trying to predict whether an mmap call would fail instead of just trying mmap itself. That bug was fixed on master before the Windows bug was even reported: https://github.com/BurntSushi/ripgrep/commit/93943793c314e05...

> You'd only try that if you haven't read the documentation for mmap, just like a bunch of Rust programmers did.

This isn't exclusive to Rust programmers. C tools make the same mistake all the time. Because memory maps aren't just problematic with large files on 32-bit systems, but they also don't work with virtual files on Linux. Try, for example, `ag MHz /proc/cpuinfo` and see what you get. Crazy how, you know, sometimes humans make mistakes even if they are a C programmer!

And the implication that I (or the author of memmap) never read the docs for `mmap` is just absurd.

If you're going to be snooty about stuff like this, then at least get the story correct. Or better yet, don't be snooty at all.


We've spoken before about this issue, and at the time ripgrep was just erroring on large files on 32-bit platforms; it didn't fall back. You were using the Rust crate "mmap" at the time, you removed it temporarily as a fix, and now you're using the much improved "memmap" crate. Good stuff! I do use your tool occasionally, and it's useful, albeit the CPU fan noise annoys my co-workers.

The specific issue making the "mmap" crate incorrect was that it used a "usize" instead of "u64" for some of the functions, limiting it to 4GB files on 32-bit platforms. I believe it's this line of code: https://github.com/rbranson/rust-mmap/blob/f973ae1969b4b7e80...

Now, I'm not a mind reader, but to me this feels an awful lot like its author made a tacit assumption that mmap() is a "memory operation" that is tied to the architecture's pointer size. In similar conversations (heck, in this very discussion) people were incredulous that a file can be bigger than memory and still be processed.

I absolutely believe that people do not read much past the function declarations, and it might be a "snooty attitude" but experience unfortunately has shown it to be an accurate attitude.

I'm also not accusing you of incorrectly using mmap(), buuuuut... having a quick flip through your current code I see that you still have the attitude that "mmap() takes a filename and makes it into a slice that the kernel magically reads in for me on demand".

This is just not true, not even on 64-bit platforms. On smaller devices with only 2-4GB of memory, it's entirely possible to simply run out of page table entries (PTEs). It's possible the memory space simply gets too fragmented. It's possible the kernel has other limits for processes. It's possible that the file is some virtual device with an enormous reported size. Etc, etc, etc...

The correct usage of mmap() is to use moderately-sized sliding windows of, say, 128MB at a time or whatever.

But, having said that: Your code is now correct in the sense that it won't crash, it won't have unsafety, it'll run on 32-bit just fine, and it will probably work for all practical scenarios that people want to use a grep tool for. I also know that you have specific optimisations for "the whole file fits in a byte slice", so there are benefits to using the simple approach instead of a sliding window.

However, if this was a database engine that required mmap() to work, it would be absolutely incorrect. But it isn't a database engine, so no big deal...


> You were using the Rust crate "mmap" at the time, you removed it temporarily as a fix, and now you're using the much improved "memmap" crate.

I don't understand why you're saying this. Could you point me to the place in the commit history where I used the `mmap` crate? The second commit in ripgrep's history is what introduced memory map support and it used the `memmap` crate: https://github.com/BurntSushi/ripgrep/commit/403bb72a4dd7152...

> albeit the CPU fan noises annoy my co-workers

ripgrep is happy to be told to run more slowly with `-j1`.

> I absolutely believe that people do not read much past the function declarations, and it might be a "snooty attitude" but experience unfortunately has shown it to be an accurate attitude.

This sounds to me like "I'm right so I can be as much of an arse as I want." Just don't be snooty about this. Sometimes I can read a man page thoroughly and still come away with misconceptions. Sometimes the docs are just bad. Sometimes it's just very dense. Sometimes there's a small but important detail that's easy to miss. Or sometimes I'm just not smart enough to comprehend everything. Instead of getting up on your holier-than-thou perch, maybe tone it down a notch next time.

> I'm also not accusing you of incorrectly using mmap(), buuuuut... having a quick flip through your current code I see that you still have the attitude that "mmap() takes a filename and makes it into a slice that the kernel magically reads in for me on demand".

Not really. Especially since ripgrep's man page explicitly calls out memory maps as potential problem areas, and even gives users the option to avoid the issue entirely if they like:

> ripgrep may abort unexpectedly when using default settings if it searches a file that is simultaneously truncated. This behavior can be avoided by passing the --no-mmap flag which will forcefully disable the use of memory maps in all cases.

But, invariably, one of the nice things about memory mapping a file is precisely that "mmap() takes a filename and makes it into a slice that the kernel magically reads in for me on demand." And it generally pretty much works.

> However, if this was a database engine that required mmap() to work, it would be absolutely incorrect. But it isn't a database engine, so no big deal...

It's good enough that SQLite actually provides an option to use memory-mapped I/O (noting pertinent downsides): https://sqlite.org/mmap.html Lucene also provides it as an option: https://lucene.apache.org/core/6_3_0/core/org/apache/lucene/... --- They likely both do the windowing you're talking about, but as the SQLite docs mention, that's not enough to stop it from crashing and burning.

At that level, it's good enough for ripgrep and it sure as hell is good enough for a random fun project like the one the OP posted. Absolutely no reason to get on your soapbox and snub your nose.


> never noticed that mmap is limited to a ~256MB window on most 32-bit architectures

That's interesting, but processing more than 4GB of data on a 32-bit system seems pretty niche these days? Where do you find them outside of industrial embedded applications? I think even my long-discarded smart watch ran 64-bit.

Now, vfat filesystems will bite you (especially for removable media), but that's also fixable with some other fs, like exFAT or ZFS.


> If you really, really want a B-Tree (why?)

Cheap writes, in an OLTP sense?

I had a similar problem recently: discovering "everything" in a DHT (= asking each node for everything it has), and feeding the resulting documents into a message queue for processing, without wasting resources processing each object's thousands of duplicate copies found distributed in the DHT.

These objects had no natural key, so I had to use a content hash for deduplication. And there was no way to time-bound the deduplication such that I could bucket objects into generations and only deduplicate within a generation. I needed a big fat presence-set of all historical hashes; and I needed to add new hashes to it every time I found a new unique object.

B-Trees have a good balance of read- and write-performance, which is why they're used for database indices and (usually) as the lower-level persistence strategy of key-value databases.

> PPS: A pet peeve of mine is older Java database "enterprise" applications that use UCS-2 "nvarchar" text columns in databases to store GUID primary keys.

Endorsing this point and boosting it: DBMSes really haven't thought out how to efficiently store large identifiers.

For example, a DBMS could suggest that clients feed it UUIDv1s, and then break them down into separate hidden {mac:48, timestamp:60, clockseq:14} columns, enabling per-column compression. (In most installations I've encountered, there's only one client generating UUIDs for the DBMS anyway, so these would compress really well, in fact enabling a default packed format where the timestamp + 5 bits of clockseq are kept in a 1-bit-tagged uint64; and the full value-size is only needed for exceptions.)


> Cheap writes, in an OLTP sense?

Sure, but HIBP is most certainly not an OLTP workload. It is updated infrequently as a batch process. The official mirror was last modified in July 2019!

Whenever a "new set" of millions of passwords are leaked, the HIBP maintainer(s) merge it with their existing data set of millions of passwords.

The "update" process is to download the new data set... and that's it. It's already sorted.

The only step I'm suggesting is to simply convert the pre-sorted HIBP SHA-1 text file to a flat binary file. This is a single streaming pass and requires only a tiny buffer in memory.
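
That conversion is a short streaming script, e.g. in Python (file names are placeholders):

    # Each input line is "<40 hex chars>:<count>"; keep only the raw 20 bytes.
    # Since the input is sorted by hash, the output is a sorted flat binary file.
    with open("pwned-passwords-sha1-ordered-by-hash-v5.txt", "r") as src, \
         open("hibp.bin", "wb") as dst:
        for line in src:
            dst.write(bytes.fromhex(line[:40]))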


> DBMSes really haven't thought out how to efficiently store large identifiers.

This 1000x.

I really wish SQL Server had a "hash key" type where the hash is SHA512. You could then use anything as a primary key or an index, irrespective of size.


I don't think that expecting an end user to have installed a whole database toolchain is a good solution.

The purpose of the tool and the library is to be a complete solution for this one thing. The user can grab the binary and just use it, without installing/using any database under the hood.


Not sure if you've ever tried to load such a large dataset into any RDBMS, but it takes a hell of a long time to create the index.


If you stripe a few NVME SSDs together, it'll help quite a bit.


Yet you pay the cost in speed when using a database engine; there is a big overhead.


Meh.

This query is just a key lookup, so the response time is likely to be well under 10ms even if you're not trying very hard to be efficient.

If using a stored procedure or a prepared query with persistent connections, it'll most likely be under 1ms if the data is stored on SSDs or in memory. In this binary format, you need about 6 GB.

For some scenarios such as small cloud web servers, fast storage or lots of ram may not be viable. So if you're storing this on mechanical drives, the random nature of hashes means that for practically all queries the db engine will be walking the B-Tree pages and then the random I/O seek latency will dominate.

Some quick back-of-the-envelope maths: You can fit roughly 300 hashes into a typical 8KB page, as used by MS SQL Server as a random example. That means it'll build a 4-level B-Tree index for the HIBP database.

Unless you have less than 1 GB of memory, the first 2 index levels will become cached, leaving 2 random seeks for each lookup. At a typical 3-5ms, this is about 6-10ms of disk I/O latency, which will likely dwarf all other overheads.
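
For the curious, the arithmetic behind the 4-level figure (using the rough ~300-keys-per-page estimate above and the 555,278,657 password count quoted elsewhere in this thread):

    import math
    print(math.ceil(math.log(555_278_657, 300)))  # -> 4 levels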

Now keep in mind that the OP was trying to optimise a workflow that took 33 seconds originally!

Using a proper database has a ton of other benefits. For example, it becomes trivial to fit into modern asynchronous web application programming frameworks as yet another async call. Literally 1 line of code using Dapper or something along those lines.


This is bad benchmarking. There is no way you're doing a b-tree lookup in microseconds on an on-disk file... Unless the parts you care about are already cached.

So either the whole file fits in RAM and you pre-load it (in which case you have to account for that memory usage), or you have to run benchmarks on random hashes, which would yield much slower numbers (on the order of 30ms for an HDD).

Personally, when I implemented this in a web service, I used a bloom filter. It has some false positives (tunable) and requires a few extra disk reads per check, but the resulting file is also smaller and the code to generate it and check it is very, very simple.

https://gist.github.com/marcan/23e1ec416bf884dcd7f0e635ce5f2...

P.S. if you need to sort a huge file, just literally use the UNIX/Linux `sort` command. No, it does not load it all into RAM. It knows how to do chunked sorts, dump temp files into /tmp, and then merge them. Old school UNIX tools are smarter than you think.


> P.S. if you need to sort a huge file, just literally use the UNIX/Linux `sort` command. No, it does not load it all into RAM. It knows how to do chunked sorts, dump temp files into /tmp, and then merge them. Old school UNIX tools are smarter than you think.

This so much.

I've worked with many devs and admins who don't understand the tools they have at their disposal on their systems. They end up trying to reinvent the wheel, and their solutions usually don't consider all the edge cases.


Note that this does take a while; I let it go for about two hours before killing sort.


     time gsort --parallel=2 -o $HOME/pwned-pass-sorted.txt pwned-passwords-sha1-ordered-by-count-v5.txt
         2890.06 real      1400.86 user       165.54 sys

48mins

`gsort` is GNU sort on coreutils (not the one included on macOS).

This is on a Mac Mini 2011 (5,1) with the 2.3GHz i5.

They really have a lot of things built into these tools :)


> Personally, when I implemented this in a web service, I used a bloom filter.

When I was working on blocking leaked passwords, I found projects using bloom filters for this (e.g. Keycloak). I find false positives completely unacceptable from a user experience perspective.

Blocking a perfectly fine password is playing with the user's mind. If he cares even a little, he will wonder what is wrong with his password (was it leaked???), and if the answer "false positive" isn't available to him, it's just evil.

A bloom filter is a good tool to filter out the negatives (which should hopefully be the majority of passwords), but please run all positives against the full set.


It depends what purpose you're using this for. If it's something like a password manager where you're checking the user's existing passwords to see if any have been breached, then yes, you should make sure not to hit false positives.

But if you're using it as a way to prevent people from using known-breached passwords on a site/service, it's really not worth worrying about. False positives would only be a problem when the user is using bad password practices anyway. If it's a non-shared, random password like it should be, a tiny chance of blocking an acceptable one is fine. They can just generate another one, it's an extremely minor inconvenience at worst.

Just include a note in the error message saying something like, "In very rare cases, this could be a false positive. Even if it is, you must choose a different password." The 0.1% of users it impacts (or whatever your error rate is) will be fine.


While I absolutely agree you shouldn't reuse passwords across services, it's a reality for many users, and I'm convinced it's not acceptable to tell your user his password might be leaked if there is no indication of it. This is not a minor inconvenience; it might trigger a major panic, and it's inconsiderate to ignore that.

Even if you display that there is a minor chance of a false positive, your user must now think "0.1% chance this was a false positive, or 99.9% chance my password was leaked!" I don't want users to panic if there is no reason to, and I don't want users to blame false positives if there is reason to panic. I think it's definitely worth worrying about.

Also, research has shown that forcing users to regularly change passwords leads to weaker passwords. And as many (most?) users are not using password managers (yet), expecting a different secure (i.e. long enough, non-dictionary) password is too far from reality.


You probably shouldn't indicate to them that their password has been leaked.

The database only lists the SHA hash of passwords which have been found in various datasets and the occurrence count. It does not indicate that the matching password (one of half a billion) has ever been associated with the user account.

The fault in knowledge schemes is oversharing the secret. However, there is no way to know for sure that the user has done this by reusing passwords - you will have false positives (say, one other person on the internet chose this password by chance) and false negatives (the user has reused the password all over the place, but has managed to not be part of a known breach dataset).


The false positive rate I tuned for was lower than the total number of users, so chances are nobody ran into that problem.

If this were a bigger service, yes, I would've done an exact check afterwards :-)


Exactly what I thought. Microseconds? On HDD? That would mean the block is in RAM.

HDD latency is composed of seek + rotational latency + processing time; on the best of the best server HDDs a random read might be in the ballpark of 10ms (10,000μs), and on a desktop HDD maybe around 30-40ms (30,000μs). Or ~600 times more than 49μs.


Desktop HDDs aren't that bad, but 100 IOPS is a good ballpark for back-of-the-envelope estimates (10ms). The OP wrote their B-Tree to require 3 lookups, so, assuming those really are single disk reads, that's where I calculated the 30ms from.


Just to confirm, the benchmarks are bad. Almost always, the parts of the B-tree file being read are already cached. I'll rewrite the benchmarks and update the results!


Thanks for pointing this out.

You're right. As I said in some comment up there, the benchmarks are bad. I benchmarked the best and the worst case within the scope of the original file, so I look up the first and the last hash.

I totally missed that if I look for the same hash over and over again, I'd end up reading the same parts of the B-tree file, so they can be easily cached. That's probably the reason it seemed so fast.

I'll rewrite benchmarks and update the results.

About the Bloom filter, I thought about that but I wanted a solution that is 100% correct. The filter doesn't guarantee that.

About the sorting. I wanted to implement a stand-alone library that does everything for you and that is cross-platform. That's why everything is implemented, without dependencies/usage of other tools.


> Personally, when I implemented this in a web service, I used a bloom filter.

I'm not a professional programmer, but I would have supposed a bloom filter is in this case equivalent to just saving the first n bits of every sha1 hash. Is that wrong?


You're right. A bloom filter is a random projection of the input domain. A random projection of a uniformly-populated random input is not useful. A large set of password hashes should be already maximally entropic.


A bloom filter is different here. Imagine we're working with a fixed 512MB of RAM, which we're going to use as a bitset addressing a range of [0 ... 2^32). An over-estimate of the 22.8GB hash file at 40 characters per hash gives that it contains about 2^29 hashes, so if we store the 32-bit prefix of each hash in this table, we expect it to be about (2^29 / 2^32) = 1/8th full.

We can use this set of prefixes to filter the data. Given an input hash, check whether its prefix is in the table. If it's not in the table, the prefix is certainly not in the file. If the prefix is in the table, there is a 1/8 ~ 13% chance of a "false positive" where the query is not actually in the file, but the set thinks it is.

What a Bloom filter does is start using the rest of the hash, by storing the presence of the next 32-bits of the hash in the same set, and so on. By storing the next 32 bits as well, we can drop the false positive rate to 6%. By using the next chunk of 32 bits, it drops again to 4%, and finally when storing 5 32-bit chunks (very conveniently, we have at this point "used up" all 32-bit chunks of our original 160 bits) to around 2.5%. All the while, the bloom filter is fitting into the same 512MB we originally gave it.
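
A sketch of just that first stage (the single 32-bit-prefix bitset) in Python, taking raw 20-byte digests; note it really does allocate the full 512MB up front:

    PREFIX_BITS = 32                        # 2^32 bits = 512 MB of RAM
    bitmap = bytearray(1 << (PREFIX_BITS - 3))

    def add(digest):
        # Record the 32-bit prefix of a raw 20-byte SHA-1 digest
        p = int.from_bytes(digest[:4], "big")
        bitmap[p >> 3] |= 1 << (p & 7)

    def might_contain(digest):
        # False -> definitely absent; True -> present or a ~13% false positive
        p = int.from_bytes(digest[:4], "big")
        return bool(bitmap[p >> 3] & (1 << (p & 7)))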


Thanks. I assumed it was more densely populated than that, but now that I think about it it's obviously a human-scale set of passwords.


No, a bloom filter requires k random projections of the input domain over a domain of m elements. In my smaller sample configuration, k=11 and m~=5000000000. log2(5000000000^11) ~= 354 bits (and you need a few more for padding to get better uniformity). SHA-1 hashes are 160 bits, so you can't use the original hash directly to index into the bloom bitmap. You need to hash again at least twice (or once with something like SHA-512).

But really, hashing is so cheap there's no point in trying to be clever like this for an on-disk implementation. My code is designed to take any strings as input, whether they are HIBP hex hashes already or something else. If this were intended to be a high performance in-memory implementation used for a database or something, the constraints would be different.


Thanks! That's what I was thinking.


It's not just saving a few bits. A bloom filter requires a lot less storage than just chopping the hashes for the same false positive performance, because it uses multiple hashing steps (in my default config, 11, but this is tunable depending on false positive rate and input size), and all of them have to hit bits that are set in the filter for the match to be positive.

Pwned passwords 2.0 was half a billion passwords. That's 29 bits. To get a 0.05% false positive rate you need an additional 11 bits or so. That's 40 bits per hash, or 20 billion bits total, and then you need to binary search it (29 lookups) or make it into a tree and add bookkeeping structures.

The equivalent performance bloom filter only takes 5 billion bits of storage and 11 single bit checks in a bitmap.


I use a Bloom filter for this as well, but in Redis using the RedisBloom module. This was really easy to set up, but it does mean that it requires about 1GB of RAM or so (depending on error rate) dedicated to it, so it wouldn't be ideal if you're constrained for RAM.

I run a separate Redis server specifically for this purpose, which means when I want to update the list, I can build the Bloom filter on my local machine, and then just transfer the RDB file to the server and replace the previous one.

Here's the Python code for the simple CLI tool to initialize and add the hashes from the HIBP files to the Bloom filter if anyone's curious: https://gitlab.com/tildes/tildes/-/blob/master/tildes/script...

To check if a password's in the list, you just SHA-1 hash it and send a Redis command like:

    BF.EXISTS breached_passwords_bloom <sha1>


You can actually do much better than binary search due to the uniform distribution of hashes.

https://en.wikipedia.org/wiki/Interpolation_search for example can achieve O(log log n) performance under the uniform distribution assumption, which is orders of magnitude faster for this scale of data. Another trick is to start with interpolation search and then switch to binary search once the sample size gets small enough.


Much better?

I'm not sure about that, but this seems like a fun problem for exploration. Also, 49µsec sounds like a pretty easy target to beat.

I used the following in q to build a data set I could play with:

    \wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v5.7z
    \7z -so e pwned-passwords-sha1-ordered-by-hash-v5.7z pwned-passwords-sha1-ordered-by-hash-v5.txt | cut -c1-40 | xxd -r -p > hibp.input
    `:hibp 1: `s#0N 20#read1 `:hibp.input
This took about an hour to download, an hour to 7z|cut|xxd, and about 40 minutes to bake. Once complete, I have an on-disk artefact in kdb's native format.

    q)hibp:get`:hibp; / this mmaps the artefact almost instantly
    q)\t:1000 {x~hibp[hibp bin x]} .Q.sha1 "1234567890"
    5
Now that's 1000 runs taking 5msec in total, or a 5µsec average lookup time! It's entirely possible my MacBook Air is substantially faster than the author's machine, but I'm also doing a lot of other shit while this is going on, so whilst I believe a better benchmark is possible, I suspect even better results under better conditions, not worse.

So, if binary search is fast enough, how much faster should an interpolation search be? My understanding is that an interpolation search will make a better initial guess than a naive binary search because it can start "closer" to the correct value, but it only works if the input has an extremely even distribution, so it can guess that initial starting point to reduce the search space. Let's check that first:

    q)count each group hibp[;0]
    00| 2171182
    01| 2171242
    02| 2170869
    03| 2168638
    04| 2171675
    05| 2171500
    06| 2169285
    07| 2171129
    08| 2171463
    09| 2173704
    0a| 2173702
    0b| 2169950
    0c| 2169562
    0d| 2172129
    0e| 2171763
    0f| 2173154
    10| 2170242
    11| 2168806
    12| 2172306
    13| 2171502
    14| 2170208
    15| 2167949
    ..
Ok that looks pretty even to me, so next I partition on the first byte to see if getting 99% closer (1-1/256) on our first guess gets us anything:

    q)t:(0,sums value count each group hibp[;0]) _ hibp
    q)`:t 1: t
    q)t:get`:t / no cheating! back to the disk!
    q)\t:1000 {x~g(g:t[first x]) bin x} .Q.sha1 "1234567890"
    5
And it's still 5µsec for lookup! So I'm finding it difficult to believe there's "much better" in there. Do you have some benchmarks to look at?


If you're going to sort the hashes then you might as well make a jump table of the first n bits and binary search from there.


Why do you even need the jump table? If the hash function is working you should get quite close just by dead reckoning.


In digital forensics we often have to do a hash set lookup as in this article. When the set is constant, you can sort it as the author did, and then perform a linear scan to determine the maximum error—i.e., how far away a hash value is from its expected location—and then use a reduced binary search/interpolation search, where the expected index is used as the midpoint and the maximum error is used to determine the window. On large hash sets of this size, the maximum error is still often measured in KB.

It’s probably not the fastest possible algorithm (though likely faster than what the author obtained), but it plays much better with memory than naive binary search and the storage format doesn’t have any overhead.


I'm trying to implement this myself. How is the expected location computed? Is it just hash_as_int / max_sha1_hash * file_size?


There are 555,278,657 passwords in the database. With a Bloom Filter, you could quickly rule out potential inputs. Even better, there is no need to hash the input, because... it's already a cryptographic hash.

The input SHA-1 provides 160 bits of hash. If we divide that up into 5 hash values of 32 bits, we can get a 3% false positive rate with a 483 MiB Bloom Filter (which will easily fit in memory).

https://hur.st/bloomfilter/?n=555M&p=0.03&m=&k=

This will be blindingly fast. We're talking 5 random reads from memory. Even in the worst case of 5 cache misses, we're still well under 1us. This will let us return "not found" for 97% of inputs that aren't in the database.

If we get a hit there, then we could turn to a larger bloom filter for greater accuracy, but we'd have to actually hash the key to get more hash bits.

Of course if you get hits for all your bloom filters, you still have to do a real lookup to positively confirm that the key is in the database.
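
A minimal sketch of that layout in Python, sized per the calculator link above (it allocates the full ~483 MiB up front):

    M_BITS = 483 * 2**20 * 8              # ~483 MiB bloom filter
    bloom = bytearray(M_BITS // 8)

    def indices(digest):
        # Reuse the SHA-1 itself: its five 32-bit chunks are the k=5 hash values
        return [int.from_bytes(digest[i:i + 4], "big") % M_BITS
                for i in range(0, 20, 4)]

    def add(digest):
        for b in indices(digest):
            bloom[b >> 3] |= 1 << (b & 7)

    def might_contain(digest):
        # True may be a ~3% false positive; False is definitive
        return all(bloom[b >> 3] & (1 << (b & 7)) for b in indices(digest))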


> Of course if you get hits for all your bloom filters, you still have to do a real lookup to positively confirm that the key is in the database.

As I replied elsewhere, please do not ignore this part, for the sake of good user experience. I've seen it ignored in open source projects (Keycloak). You don't wanna block perfectly fine passwords for reasons unknown to the user because of false positives. That might cause unwanted reactions on your user's side ("was my password leaked!?!?").


The false positive rate can be made arbitrarily small. I have an implementation with an fp rate of 1:1,000,000 and a filter size of just 1.8 GB (https://github.com/adewes/have-i-been-bloomed). The cost of lowering the rate is logarithmic, so you could go to much lower values for little cost (1:1,000,000,000 would be 2.8 GB), so false positives are not really a problem in practice. If you really want, you could still perform an exact check against a DB if the filter returns true to rule out false positives with certainty, though at one false positive per one billion requests this might be exaggerated.


That makes sense. I was thinking of this more as a fun algorithms optimization challenge, not something to actually put into a production setting with real users. I agree that for real users you would especially not want to skip the last step.


Any good literal search algorithm could do a one-off search for a single long literal 'needle' way faster than the roughly 1GB/s that the author attained with grep.

A single string of that length is extremely easy to search for with a range of different algorithms - I would be surprised if a decent approach couldn't keep up with memory bandwidth (assuming your 22GB file is already, somehow, in memory). The mechanics of simply reading such a big file are likely to dominate in practice.

We implemented some SIMD approaches in https://github.com/intel/hyperscan that would probably work pretty well (effectively a 2-char search followed by a quick confirm) for this case.

Of course, that begs the question of whether any kind of whole-text search is actually the right answer here. Assuming that you really do have more than a few searches to do, the end result of keeping the data in any kind of prebuilt structure is way superior to an ad hoc literal search.


I love this write up, and while the solutions discussed are excellent, I think the general-purpose FM Index data structure might work even better. I confess I'd have to read this post more closely to be sure, but I find the FM Index data structure so appealing, I'm always looking for excuses to promote it!

The FM Index is ideally suited to the problem of repeatedly searching a large fixed corpus for many different short substrings, and achieves optimal time complexity: linear in the length of the substring, with excellent constants independent of the length of the corpus (!).

Some years ago I undertook a very similar exercise to that of the author, except using the leaked Adobe password data rather than the HIBP data, and found the FM Index worked well: http://olivernash.org/2014/01/03/dna-of-a-password-disaster/...


I undertook a similar endeavor a while back. My solution[1] rested on the observation that you don’t need to have a B-tree to do a binary search; one need only be able to calculate the correct byte offset of the Nth hash. With some optimizations, this approach produced a 9.9GB file with similarly fast lookups.

[1]: https://github.com/tylerchr/pwnedpass/blob/master/README.md#...


It looks like the link to your blog post is broken (at the bottom of the README).


Awkward! Thanks for the tip. I added a copy of that post to the repo and fixed the link.


If you were to just throw the file into a database, wouldn't the database's index essentially lead to the same result (a B-tree, compacted using a bulk-loading procedure)?


I briefly had a Rocket/IRC bot that talked to a postgres instance with the HIBP DB loaded into it, and yes it worked great.


True, but I wanted to create a library and a CLI without dependencies. I don't want to force an end user to install and use a database (even under the hood).


I respect that the author learnt the underlying concepts, which are not simple. But is the net result truly that the default Postgres index method was perfectly suitable for this use case?


Thanks (: About Postgres: I wanted to create a library and a CLI without dependencies. I wanted them to be complete tools for doing this one thing. Tools that you can just grab and use, without installing anything.


Just to remind people of `look`, which does a binary search.

It might be superior to the article's methods if you only want to search for a few.


Didn't know about look. Seems good enough (tested on a HDD):

    $ dd of=pwned-passwords-sha1-ordered-by-hash-v5.txt oflag=nocache conv=notrunc,fdatasync count=0
    0+0 records in
    0+0 records out
    0 bytes (0 B) copied, 9.5864e-05 s, 0.0 kB/s
    $ time look `sha1sum <(echo -n password) | tr [a-z] [A-Z] | cut -d" " -f1` pwned-passwords-sha1-ordered-by-hash-v5.txt
    5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8:3730471

    real    0m0.137s
    user    0m0.002s
    sys     0m0.013s
dd is used to drop the file from the fs cache. Something the author probably didn't do given the unrealistic 49μs. It's simply not possible to fetch data from a HDD that fast.


True, the benchmarks are bad. I'll rewrite them (to drop the cache every time) and update the results.


wow, thanks, yet another tool to remember in the toolbox

  sudo purge
  time look E38AD214943DAAD1D64C102FAEC29DE4AFE9DA3D pwned-passwords-sha1-ordered-by-hash-v5.txt
  E38AD214943DAAD1D64C102FAEC29DE4AFE9DA3D:2413945
          0.01 real         0.00 user         0.00 sys
not 49us, but fast enough for most use cases (purge should clear the fs cache)


Funny how many tools are stowed on my system. I'll never know them all!


You may also consider using a bloom filter to do this: https://github.com/62726164/bp


Using ETS with Erlang (or Elixir) I get sub-50μs (30μs on average) lookup times.

Memory usage is quite high, around 95 bytes per element, but I'm sure that by spending more than the 5 minutes I did on it, one could take it down considerably.

For reference, this is the code I used

I converted the SHA hashes to MD5 to save memory, given we don't care about collisions (which are very unlikely anyway), we just want to know if the password was there or not.

    defmodule Pwned do
      def load do
        :ets.new(:table, [:named_table, :set])

        File.stream!("pwned-passwords-sha1-ordered-by-hash-v5.txt")
        |> Stream.each(fn line ->
          <<hash::binary-size(40), ":", _rest::binary>> = String.trim_trailing(line)
          hash = :crypto.hash(:md5, Base.decode16!(hash))
          :ets.insert(:table, {:binary.copy(hash), true})
        end)
        |> Stream.run()
      end

      def lookup_hash(hash) do
        hash = :crypto.hash(:md5, Base.decode16!(hash))

        case :ets.lookup(:table, hash) do
          [] -> false
          _ -> true
        end
      end

      def lookup_password(password) do
        lookup_hash(:crypto.hash(:sha, password) |> Base.encode16())
      end
    end


I agree these performance numbers don't seem great. Using q (another interpreted language, not a compiled one) I get 5µsec on my MacBook Air:

Here's my load script:

    \wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v5.7z
    \7z -so e pwned-passwords-sha1-ordered-by-hash-v5.7z pwned-passwords-sha1-ordered-by-hash-v5.txt | cut -c1-40 | xxd -r -p > hibp.input
    `:hibp 1: `s#0N 20#read1 `:hibp.input
I can then shut down this process, and start a new one:

    q)hibp:get`:hibp; / this mmaps the artefact almost instantly
    q)\t:1000 {x~hibp[hibp bin x]} .Q.sha1 "1234567890"
    5
It's so fast I need to run it 1000 times to take just 5msec (5µsec average lookup time!). I imagine converting to md5 would be substantially faster since there's a 16-byte scalar type in q I would be able to use.


I did something similar when I wanted to search the HIBP database, and if you are okay with some false positives you can do better than these results, both in terms of speed and size.

With a bloom filter you can tune the false positive rate; I chose 1 in a million, so my data structure was still very accurate in determining whether a password had already been leaked.

It only took 30 microseconds to determine whether a password was in the list, and size-wise it came in around 22 bits per element, or ~1.5 GB.

I originally used a plain bloom filter, which came to 2 GB, but since a bloom filter is just a sequence of 0s and 1s I was able to apply Golomb coding to shrink it down to 1.5 GB.

The time to process the original 24 GB, however, is something I could have improved, but I kinda lost interest once I had something close to the theoretical minimum size that could determine whether a password exists within 30 microseconds.

Anyways take a look if you're interested in trying a different approach: https://github.com/terencechow/PwnedPasswords
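For a sense of the numbers, the standard Bloom filter sizing formulas line up with the figures above (the element count below is an assumption, roughly the size of the v5 dump):

    import math

    n = 555_000_000   # approximate number of hashes in the dump (assumed)
    p = 1e-6          # target false positive rate

    m = -n * math.log(p) / (math.log(2) ** 2)   # total bits for a plain Bloom filter
    k = (m / n) * math.log(2)                   # optimal number of hash functions

    print(f"plain filter: {m / n:.1f} bits/element, {m / 8 / 2**30:.1f} GiB")  # ~28.8 bits, ~1.9 GiB
    print(f"hash functions: {k:.0f}")                                          # ~20
    print(f"lower bound: {math.log2(1 / p):.1f} bits/element")                 # ~19.9; Golomb coding
                                                                               # gets within a couple
                                                                               # of bits of this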


Surely if you know that the hashes have a roughly uniform distribution you can quite quickly make a good guess about where the key will be (see the sketch below)?

I'm not entirely sure I'm sold on the speed gained by splitting files vs doing a simple seek to an offset [1]. There's probably a bunch of time lost walking the filesystem through a file/folder structure?

Also, simply converting the hashes from ASCII hex to binary should save a bunch of disk space (and make searching quicker)?

Great write-up though, good to see a bunch of solutions tried.

[1] http://www.cplusplus.com/reference/cstdio/fseek/
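A rough sketch of that guess (my own illustration; assumes a file hashes.bin of raw 20-byte digests sorted ascending): because SHA-1 output is essentially uniform, the leading bytes of a hash predict its approximate position in the file.

    import os

    RECORD_SIZE = 20  # raw SHA-1 digests, sorted ascending

    def estimated_offset(path, digest):
        n_records = os.path.getsize(path) // RECORD_SIZE
        prefix = int.from_bytes(digest[:8], "big")   # top 64 bits of the hash
        index = prefix * n_records // (1 << 64)      # proportional position estimate
        return index * RECORD_SIZE                   # byte offset to seek to

    # In practice you'd seek here, read a small window, and scan a few records
    # either side of the estimate (or fall back to binary search).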


Thinking about it, you would need a 64-bit value for the offset because of the size of the file.

But an index file of 64-bit offsets could easily be seeked into, reading the offset based on the first 2 or even 4 bytes of the hash.

Though with 4 bytes that becomes a 32-gigabyte index file (2^32 offsets of 8 bytes each)! But it would probably be much faster, as you only do one seek in the index file, then another seek into the main file, then scan a much shorter distance to the result.

If the system has enough RAM, the operating system will cache the files anyway and it will be pretty fast, I think.

(can you tell my first job involved writing ad-hoc database systems?)
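Concretely, a lookup through a 2-byte prefix index (only 65,536 entries, ~512 KB) could look like this sketch; the names and layout are assumptions: hashes.bin holds sorted 20-byte raw digests, index.bin holds 65,537 little-endian uint64 offsets, one per prefix plus a final end-of-file entry.

    import struct

    RECORD_SIZE = 20

    def contains(data_path, index_path, digest):
        prefix = (digest[0] << 8) | digest[1]
        with open(index_path, "rb") as idx:           # one seek into the index file
            idx.seek(prefix * 8)
            start, end = struct.unpack("<QQ", idx.read(16))
        with open(data_path, "rb") as data:           # one seek into the main file
            data.seek(start)
            block = data.read(end - start)
        # scan the short run of records sharing the 2-byte prefix
        return any(block[i:i + RECORD_SIZE] == digest
                   for i in range(0, len(block), RECORD_SIZE))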


Another solution for cases where you don't add new entries all the time and can instead reindex the whole database every once in a while: use cdb. There's a nice description of the internals and how it works: http://www.unixuser.org/~euske/doc/cdbinternals/index.html A lookup normally takes just two disk reads.

The original version has a 4 GB limit, but there are 64-bit versions as well, for example https://github.com/pcarrier/cdb64?files=1


Great writeup! Another way to query the DB is with a probabilistic filter. I wrote a Bloom-filter-based query API for this a while ago:

https://github.com/adewes/have-i-been-bloomed

Very fast and highly space efficient as well: 17,000 requests per second on a conventional laptop, with 1.7 GB of memory required at a false positive rate of 1 in 1,000,000 (and no dependencies on databases or anything else).


> A node is a simple structure of sixteen 32 bit values. The values are 'pointers' to next nodes, at given character of the SHA-1 hash. So, one node takes 16 * 4B = 64B.

I have often used a dictionary to store tries to avoid this kind of memory usage; it resizes automatically, if a bit slowly. But hey, you're chasing pointers anyway, so it's not like walking the tree was going to be fast…
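For illustration, a dict-based trie over hex hashes can be as simple as this sketch (children are allocated lazily, so sparse nodes don't pay for sixteen fixed slots; fixed-length keys mean no end-of-word marker is needed):

    def insert(root, hex_hash):
        node = root
        for ch in hex_hash:
            node = node.setdefault(ch, {})   # create a child only when needed

    def contains(root, hex_hash):
        node = root
        for ch in hex_hash:
            if ch not in node:
                return False
            node = node[ch]
        return True

    root = {}
    insert(root, "5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8")
    print(contains(root, "5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8"))  # True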

(I'm also curious about the "scumbag Steve" hat on the B-tree, but I digress.)


> Trie structure sucks if you have pretty random words.

A classic uncompressed trie sucks in pretty much all cases.

If we go for a half-decent implementation of a packed variant, it gets significantly better:

https://en.wikipedia.org/wiki/Radix_tree
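A quick toy demo of why (my own illustration): with random fixed-length keys, everything below the shared-prefix depth is a long single-child chain, which path compression collapses into one edge.

    import hashlib, random

    def insert(root, key):
        node = root
        for ch in key:
            node = node.setdefault(ch, {})

    def count_nodes(node):
        return 1 + sum(count_nodes(c) for c in node.values())

    def compress(node):
        # merge single-child chains; safe here because all keys have equal length
        out = {}
        for edge, child in node.items():
            child = compress(child)
            while len(child) == 1:
                (k, v), = child.items()
                edge, child = edge + k, v
            out[edge] = child
        return out

    random.seed(0)
    root = {}
    for _ in range(10_000):
        insert(root, hashlib.sha1(random.randbytes(8)).hexdigest())

    print("plain trie nodes:", count_nodes(root))            # several hundred thousand
    print("radix tree nodes:", count_nodes(compress(root)))  # far fewer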


Cool learning exercise and a fun read. I can't help but think SQLite could do this screamingly fast and very easily, with nice compact storage (a blob primary key holding the hashes, and a WITHOUT ROWID table to avoid a hidden integer per row).

That said, 49us is very impressive. Hard to beat low level custom coded solutions.
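Something along these lines, perhaps (a minimal sketch via Python's sqlite3; the table and column names are my own, and the load step is omitted):

    import hashlib, sqlite3

    con = sqlite3.connect("pwned.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS pwned (
            hash  BLOB PRIMARY KEY,   -- 20-byte raw SHA-1 digest
            count INTEGER NOT NULL
        ) WITHOUT ROWID
    """)

    def lookup(password):
        digest = hashlib.sha1(password.encode()).digest()
        row = con.execute("SELECT count FROM pwned WHERE hash = ?",
                          (digest,)).fetchone()
        return row[0] if row else 0   # a single b-tree probe on the primary key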


> Hard to beat low level custom coded solutions.

Challenge accepted!

I used the following in q to download and load the data into a disk object I could mmap quickly:

    \wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v5.7z
    \7z -so e pwned-passwords-sha1-ordered-by-hash-v5.7z pwned-passwords-sha1-ordered-by-hash-v5.txt | cut -c1-40 | xxd -r -p > hibp.input
    `:hibp 1: `s#0N 20#read1 `:hibp.input
This took about an hour to download, an hour to 7z|cut|xxd, and about 40 minutes to bake. Once complete, I have an on-disk artefact in kdb's native format. I can load it:

    q)hibp:get`:hibp; / this mmaps the artefact almost instantly
and I can try to query it:

    q)\t:1000 {x~hibp[hibp bin x]} .Q.sha1 "1234567890"
    5
Now that's 1000 runs taking 5msec in total, or a 5µsec average lookup time! It's entirely possible my MacBook Air is substantially faster than the author's machine, but I think being ten times slower than an "interpreted language" suggests there's a lot of room to improve!


Why was the data sorted by usage count in the first place? The hash of a password doesn't give you the password, so all you get is "something was used x times". It seems like you always want to look up by hash, so sorting by hash from the beginning makes more sense.


There was an app for Android phones that searched for default router passwords based on SSID using this method. It was indeed very fast, even on very bad hardware.


B-trees, with their trade-off between slow disk seeks and fast in-memory scans, make just as much sense for slow memory accesses and fast in-cache scans.


Nicely done and explained.



