Where the pointers are notionally 32+k bits long, but the bottom k bits are always zero, so they can be stored in 32 bits. This would mean that records always have to be aligned to a 2^k-byte boundary. If the data has some natural alignment anyway, this could be arranged so as not to waste any space, and even if it doesn't, for large keys and/or values the amount of wasted space would be relatively small.
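A minimal sketch of that alignment trick (this is not part of cdb itself; the names and the choice of k are made up for illustration): records are aligned to 2^k bytes, so a 32-bit stored pointer can address 2^(32+k) bytes of file.

    /* Sketch of the alignment trick described above. */
    #include <stdint.h>

    #define K 3  /* records aligned to 8-byte boundaries, for example */

    /* Compress a (32+k)-bit file offset into a 32-bit stored pointer.
     * The offset must be a multiple of 1 << K for this to be lossless. */
    static uint32_t encode_ptr(uint64_t offset) {
        return (uint32_t)(offset >> K);
    }

    /* Expand a stored 32-bit pointer back into the real file offset. */
    static uint64_t decode_ptr(uint32_t stored) {
        return (uint64_t)stored << K;
    }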
Alternatively, if there were a way to reliably locate the start of a record when scanning through data, the records could be packed without padding, and the 2^k-aligned pointers could simply be approximate, pointing somewhere in the 2^k bytes before the start of the record. Retrieval would involve following the pointer, then scanning forward to find the actual record. This would be a bit sketchy, but something a bit like this is done in ATM.
CDB doesn't include CRCs. However, you could think up various schemes to identify records based on what you know about the record format, the key you're looking for, and its hash. It's probably not worth the effort!
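A hedged sketch of how the "approximate pointer" idea a couple of comments up could combine with identifying the record by the key you're looking for. The record layout assumed here is cdb-like (4-byte little-endian key length, 4-byte data length, key bytes, data bytes), but this is illustrative only, not how cdb actually does lookups:

    #include <stdint.h>
    #include <string.h>

    #define K 3  /* stored pointers rounded down to 2^K boundaries */

    static uint32_t read_u32(const unsigned char *p) {
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    }

    /* Scan at most 1<<K bytes forward from 'approx', treating each offset as
     * a candidate record start and checking it against the expected key.
     * Returns the offset of the value bytes, or -1 if nothing matches. */
    static long find_record(const unsigned char *file, size_t filelen,
                            uint64_t approx, const char *key, size_t keylen) {
        for (uint64_t off = approx;
             off < approx + (1u << K) && off + 8 <= filelen; off++) {
            uint32_t klen = read_u32(file + off);
            uint32_t dlen = read_u32(file + off + 4);
            if (klen != keylen) continue;
            if (off + 8 + klen + dlen > filelen) continue;
            if (memcmp(file + off + 8, key, keylen) == 0)
                return (long)(off + 8 + klen);
        }
        return -1;
    }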
It's easy to implement and really demonstrates some good system engineering tradeoffs.
I think the theory is mostly that it's really simple and ends up requiring very few disk reads.
I wonder if there's any mileage in using a perfect hash function to build a database like this. It seems suited to the operating model of being slow to build but fast to access.
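As a toy illustration of the perfect-hash idea: for a fixed set of keys you can search for a seed such that every key lands in a distinct slot of a table whose size equals the number of keys. Brute-force seed search like this is only practical for small key sets; real constructions (gperf, CHD, BBHash, ...) are much cleverer. Everything below is made up for illustration:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Simple seeded string hash. */
    static uint32_t hash(const char *s, uint32_t seed) {
        uint32_t h = seed;
        while (*s) h = h * 31 + (unsigned char)*s++;
        return h;
    }

    int main(void) {
        const char *keys[] = { "alice", "bob", "carol", "dave" };
        const int n = sizeof keys / sizeof keys[0];

        /* Try seeds until every key maps to a distinct slot in a table of size n. */
        for (uint32_t seed = 1; seed != 0; seed++) {
            int used[16] = {0};   /* n <= 16 in this toy */
            int ok = 1;
            for (int i = 0; i < n && ok; i++) {
                uint32_t slot = hash(keys[i], seed) % n;
                if (used[slot]) ok = 0; else used[slot] = 1;
            }
            if (ok) {
                printf("seed %u is perfect for these %d keys\n", seed, n);
                return 0;
            }
        }
        return 1;
    }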
A lot of data doesn't change very often, and a lot of data can tolerate propagation delays on change. When you have an extremely high read volume against such data sets, it can be economical to cache the entire data set and distribute it periodically to the fleets of machines that need access, instead of servicing individual reads over a network. This gives lower cost and latency while supporting a higher volume of reads and higher availability, at the expense of engineering cost and propagation delay.
Take sendmail: people would keep their email aliases in an /etc/aliases file, a simple ASCII format with lines of the form "<key>: <values...>". The "newaliases" command would then convert the contents of the ASCII file /etc/aliases into the binary (originally Berkeley db) /etc/aliases.db.
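A rough sketch of what such a converter does, reading "key: values" lines and emitting key/value records. The output here is cdbmake's textual input format ("+klen,dlen:key->data", terminated by an empty line) rather than Berkeley db, just to keep the example close to cdb; the real newaliases wrote a db file directly:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char line[4096];
        while (fgets(line, sizeof line, stdin)) {
            line[strcspn(line, "\n")] = '\0';                 /* strip newline */
            if (line[0] == '\0' || line[0] == '#') continue;  /* blanks, comments */
            char *sep = strchr(line, ':');
            if (!sep) continue;                               /* not "key: values" */
            *sep = '\0';
            char *key = line;
            char *val = sep + 1;
            while (*val == ' ' || *val == '\t') val++;        /* trim leading space */
            printf("+%zu,%zu:%s->%s\n", strlen(key), strlen(val), key, val);
        }
        printf("\n");   /* cdbmake input ends with an empty line */
        return 0;
    }

Piped into cdbmake (something like "./aliases2cdb < /etc/aliases | cdbmake aliases.cdb aliases.tmp"), that gives the same sort of ASCII-to-binary rebuild step.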
As far as I know, CDB was developed initially for qmail, DJB's email daemon, for exactly that purpose.
Another example of this is the "Yellow Pages" (yp) / NIS directory service used to distribute usernames/UIDs/groups/... across a UNIX cluster or network. ASCII files were converted into e.g. /var/yp/DOMAIN/passwd.byname, a .db file whose keys are the usernames from /etc/passwd and whose values are the full ASCII lines from /etc/passwd; the same goes for passwd.byuid, group.byname, and so on. All for fast indexing, so that the network server didn't have to repeatedly parse /etc/passwd.
It's not really the same thing, though. CDB provides a single key -> value mapping in a single file, with updates by replacing the entire file; SQLite is a relational database with updates, locking, transactions, SQL, etc.
Am I the only one who finds it hilarious that the first thing after the "no random limits" heading is a random limitation?
Where's a github mirror?
A google search reveals some entries from the language implementations.
You'll find a ton of life-changing stuff on SourceForge and random FTP sites.
See, for example, the Linux kernel.