
How we store 400M phone numbers with fast lookup - bilalhusain
http://sparktg.com/engineering/
======
idorosen
Their solution seems over-engineered for the scale of data they're concerned
with.

A 10 digit phone number can be encoded in 34 bits. Supposing they used 64 bit
wide fields to store the key and target data, that leaves 30 bits (more than
their 2 bytes) to play with for flags or whatever target data. To store the
whole dataset in a trie or tree-like structure on disk in this model they
would need 8 bytes * 400 million phone numbers < 3 terabytes of data. So, a
single 3TB or 4TB HDD ($130USD/ea) would suffice; this would give them (in the
trie- or tree-like on-disk format) an access/read time in single digit or low
double digit milliseconds with an HDD, or below 1 millisecond for 3 striped
1TB SSDs ($300USD/ea), for example. The disk controller's cache and the host's
file buffers/cache would work in their favor as well since there are probably
plenty of hotspots in the data. mmap could make the job of reading the
file/disk even easier (and faster).

The data just isn't that big by today's standards.

Since false negatives are not possible for a properly implemented Bloom
filter, using a Bloom filter in front of this on-disk approach would make it
even faster for the negative case, since the disk would only be hit for
positive or false positive cases, the latter of which should be relatively
rare.

EDIT: It might be even simpler/better/faster than a trie/tree to just store
the data indexed by some hash function (even just prefix bucketing) and then
sequentially scan the collision list for that prefix hash for matches to the
target number. That would take advantage of the much better sequential read
throughput of most commodity HDDs/SSDs, so only one random read is
necessary.
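
For concreteness, here's a minimal sketch of the 64-bit packing I'm describing
(the field layout is just an example, not what they use): the 10-digit number
fits in the low 34 bits, leaving the rest for flags.

    #include <stdint.h>
    
    /* Example layout: low 34 bits hold the phone number, the bits above hold flags. */
    #define NUMBER_BITS 34
    #define NUMBER_MASK ((1ULL << NUMBER_BITS) - 1)
    
    static uint64_t pack(uint64_t number, uint32_t flags) {
        return (number & NUMBER_MASK) | ((uint64_t)flags << NUMBER_BITS);
    }
    
    static uint64_t record_number(uint64_t rec) { return rec & NUMBER_MASK; }
    static uint32_t record_flags(uint64_t rec)  { return (uint32_t)(rec >> NUMBER_BITS); }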

~~~
Veedrac
> 8 bytes * 400 million phone numbers < 3 terabytes of data

I get 3.2GB, which fits in RAM.

So yes, I guess, but why hit the disk at all? I'd probably just sort the data
and do a binary search.
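
Roughly this (a sketch, assuming each record is a uint64_t with the number in
the low 34 bits and the preference flags above, sorted by number):

    #include <stdint.h>
    #include <stddef.h>
    
    /* Binary search over records sorted by phone number (low 34 bits of each
       uint64_t).  Returns the flags, or -1 if the number isn't present. */
    static int lookup(const uint64_t *recs, size_t n, uint64_t number) {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            uint64_t key = recs[mid] & ((1ULL << 34) - 1);
            if (key < number)      lo = mid + 1;
            else if (key > number) hi = mid;
            else                   return (int)(recs[mid] >> 34);
        }
        return -1;
    }

That's about 29 probes per lookup over 400M entries, and the first dozen or so
levels stay hot in cache.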

~~~
idorosen
I must be sleep deprived. 8 bytes * 400 million is ~3.2 gigabytes, not
terabytes. Wow, then they've _definitely_ over-engineered it, and disk isn't even necessary in the event of
a 10-fold increase in the database size. Tries, search trees, or properly
sized hash tables on phone number prefixes should be plenty fast and the whole
thing can fit in RAM on a modern laptop...

~~~
bilalhusain
Our solution keeps the data in a simpler scheme because keeping it in a
maintainable format is important when the data is ever-changing - every couple
of days millions of people update their preferences and we need to be able to
apply those changes. With a single data structure for all the numbers, we
would have to optimize either for space or for faster inserts/updates. Our
solution is a trade-off between those options.

~~~
Veedrac
A hash table is more than compact enough. It should work fine even at high
load factors (80-90%) given a decent hashing scheme (e.g. Robin Hood hashing),
which means you'd need only ~4GB of memory. It's random access and should
average about one cache miss per lookup. Updating is rather trivial.
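
A sketch of the lookup side, using plain linear probing instead of Robin Hood
to keep it short (one uint64_t slot per entry, number in the low 34 bits,
flags above, and 0 reserved as the empty slot since no valid 10-digit number
is 0):

    #include <stdint.h>
    #include <stddef.h>
    
    #define NUMBER_MASK ((1ULL << 34) - 1)
    
    /* table_size is a power of two, e.g. 1 << 29 slots (~75% load, ~4GB of uint64_t). */
    static int ht_get(const uint64_t *table, size_t table_size, uint64_t number,
                      uint32_t (*hash)(uint64_t)) {
        size_t i = hash(number) & (table_size - 1);
        while (table[i] != 0) {                      /* 0 marks an empty slot */
            if ((table[i] & NUMBER_MASK) == number)
                return (int)(table[i] >> 34);        /* the preference flags */
            i = (i + 1) & (table_size - 1);          /* linear probe */
        }
        return -1;                                   /* not in the list */
    }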

~~~
bilalhusain
Sounds promising. A little bit concerned about the computational cost of the
hash function. Gotta explore this option!

~~~
Veedrac
I wouldn't be worried about hashing, either, given you effectively want to
hash a 34-bit integer to a ⌈log2 4e8⌉ = 29-bit integer.

An easy way is to take a 32 to 32 bit hash (eg.
[http://stackoverflow.com/a/12996028/1763356](http://stackoverflow.com/a/12996028/1763356)),
calculate

    
    
        hash(top 32 bits) ^ hash(bottom 32 bits)
    

and take the bottom 29 bits of that. I wouldn't be surprised if you can do
this in less time than an integer division.
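
In C that's something like this (the 32-bit mixer is the one from the linked
answer, if I remember it right):

    #include <stdint.h>
    
    /* 32-bit integer mixer, as in the linked Stack Overflow answer. */
    static uint32_t mix32(uint32_t x) {
        x = ((x >> 16) ^ x) * 0x45d9f3bu;
        x = ((x >> 16) ^ x) * 0x45d9f3bu;
        x = (x >> 16) ^ x;
        return x;
    }
    
    /* Hash a 34-bit phone number down to 29 bits. */
    static uint32_t hash29(uint64_t number) {
        uint32_t h = mix32((uint32_t)(number >> 32)) ^ mix32((uint32_t)number);
        return h & ((1u << 29) - 1);
    }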

------
MichaelGG
Similar issue in doing LNP lookups in the US. There's about 500M entries, each
mapping to a different LRN. Naively, that's 16 bytes per entry (8 for the
number, 8 for the LRN), so 8GB.

The first optimization is changing the LRNs into a lookup table, as there are
only ~30K uniques. That cuts the data to 5GB. That's OK, but it'd be better if it was
smaller and quicker to load.

I implemented a delta-encoding system with multi-level indexing, ISAM style.
Data's compressed on a "page" basis, like 4KB or so of data at a time. That
way all data is compressed while in memory and lookups can be done on the
compressed data directly. A header on each page provides the range it covers,
so you can quickly skip to the next page if the index wasn't precise enough.

Delta encoding is neat here, because on average there'll only be a gap of
about 18 numbers from one entry to the next (~9B possible numbers / 500M). I
used base-128 encoding, but simple16 or another format (there are
SIMD-optimized ones) would have been even better.

Using this method, the entire dataset only requires about 600-800MB of RAM.
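
For the curious, a sketch of the per-page scheme under the same assumptions
(base-128/varint deltas of sorted numbers; the LRN index bytes and the page
header/index are left out):

    #include <stdint.h>
    #include <stddef.h>
    
    /* Encode sorted numbers as base-128 (varint) deltas from the page's base
       value.  With ~18 numbers between entries on average, most deltas fit in
       a single byte.  Returns the number of bytes written. */
    static size_t encode_page(const uint64_t *nums, size_t n, uint64_t base,
                              uint8_t *out) {
        size_t pos = 0;
        uint64_t prev = base;
        for (size_t i = 0; i < n; i++) {
            uint64_t delta = nums[i] - prev;
            prev = nums[i];
            while (delta >= 0x80) {              /* 7 bits per byte, high bit = "more" */
                out[pos++] = (uint8_t)(delta | 0x80);
                delta >>= 7;
            }
            out[pos++] = (uint8_t)delta;
        }
        return pos;
    }
    
    /* Scan one compressed page for `target`.  Returns 1 if present, 0 if not. */
    static int page_contains(const uint8_t *page, size_t len, uint64_t base,
                             uint64_t target) {
        uint64_t cur = base;
        size_t pos = 0;
        while (pos < len) {
            uint64_t delta = 0;
            int shift = 0;
            uint8_t b;
            do {
                b = page[pos++];
                delta |= (uint64_t)(b & 0x7F) << shift;
                shift += 7;
            } while (b & 0x80);
            cur += delta;
            if (cur == target) return 1;
            if (cur > target)  return 0;         /* entries are sorted */
        }
        return 0;
    }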

------
donarb
Their basic premise is all wrong. The list of people that actually want spam
phone calls would be very short and would be a faster search target.

~~~
20kleagues
If it is opt-out, then won't they have just the list of people who don't want
phone calls?

~~~
maerF0x0
Assuming a saturated number space, 400M numbers is just 4% of all possible #s.
I doubt that the exceptions (911-xxx-xxxx) reduce the number enough to make a
whitelist a better choice.

------
fencepost
The thing that strikes me is that while the data being searched for has been
optimized down to 2 bytes of bitfields, the same hasn't been done with the
phone numbers. Area codes in India appear to be variable-length (2, 3, ?
digits), with that length being part of the 10 digits, and presumably mobile
numbers are handled similarly, either with their own area codes or
overlapping. Breaking the data up by area code would allow eliminating
significant chunks of the total possible address space, and further separating
(as they do) into dense and sparse would further reduce the requirements
(since tries, skip-lists, etc. have their own overhead).

It comes down to this: while the Indian phone system could in theory allow 10
billion phone numbers, there are not in fact 10 billion legally-allocatable
phone numbers and the rules in place likely effectively reduce the legally-
allocatable number to somewhere between 1-3 billion numbers. Dividing based on
those rules may both reduce the problem to something requiring less
engineering time and simplify possible marketing-related factors (e.g.
selling access/systems targeted only to specific area codes).

~~~
fencepost
And after a bit more thinking while doing something mindless, it seems that
other interesting areas would be the distribution of digits within each of the
first 4-5 positions, and perhaps the bit patterns of a run-length or Huffman
encoding of the first 4-5 digits.

And separately, depending on the hit rate and particularly in the sparse
sections, there might be situations where it would make sense to have a
preliminary lookup that simply indicated whether there were _any_ possible
matches within a prefix range - before searching, get an overview of whether
there's anything to search.

Overall this strikes me as something that could demonstrate the importance of
having developers aware of the environment in which something will be used. If
there are going to be 2 of something, throw hardware at it. If there are going
to be 2,000 of "something" instead, an extra $1000 each in "throw hardware at
it" could become a real issue.

------
insaneirish
Relevant: [http://yourdatafitsinram.com](http://yourdatafitsinram.com)

------
fasteo
We are using a constant database [1] for a similar use case (MNP database),
albeit a much smaller one, around 15 million phone numbers. It requires almost
no RAM, and you can update the database as a whole with no downtime.

[1]
[https://en.wikipedia.org/wiki/Cdb_(software)](https://en.wikipedia.org/wiki/Cdb_\(software\))

~~~
vidarh
I'm sure that works great for 15m. (EDIT: To be clear, I love CDB, and I'm not
being sarcastic - for 15m it's probably perfectly fine; it's just space
inefficient for large datasets where the combined key+value lengths are very
low)

But even if we ignore the hash table pointer tables, which can be made
arbitrarily small at the cost of probing a higher number of entries on average
before finding the right key (though in reality you'd want them to be fairly
large), the minimum space used per entry is [1]:

8 bytes for the length of the key and length of the value + 5 bytes for the
key + 2 bytes for the value, so 15 bytes per number, or 6GB (EDIT: fixed
numbers to account for BCD encoding of number). Which means you'd need to
switch to one of the (non-standard) 64-bit CDB-inspired formats, as CDB itself
can only handle 4GB data files.

[1] CDB format: [http://cr.yp.to/cdb/cdb.txt](http://cr.yp.to/cdb/cdb.txt)

------
cybernytrix
This scale is pretty small and can be done on my laptop. If you need to do
such operations at a truly large scale, this is an interesting article:
[http://alexbowe.com/rrr/](http://alexbowe.com/rrr/) C++ implementation:
[https://github.com/simongog/sdsl-
lite/blob/master/README.md](https://github.com/simongog/sdsl-
lite/blob/master/README.md)

------
anilgulecha
Wouldn't some manageable amount of RAM and a hash table be the simplest and
fastest solution?

------
d_e_solomon
Thanks for posting this article. Would this be a good use case for a bloom
filter?

~~~
fleitz
Maybe, but probably not, though. As a completely naive solution to this
problem, a sorted array of 400M elements only takes 2.4 GB of RAM, or about
$50 worth of RAM.

Basically you'd waste more time (and money) than it could possibly be worth.

~~~
bpicolo
I mean, a bloom filter isn't super difficult, but it's also probabilistic
instead of deterministic.

~~~
fleitz
Exactly, so you'd need the array anyway...

Also, typically bloom filters don't come out of the box with the language
you're using, so it's just more potential for bugs.

A lookup on a sorted array should take about 29 comparisons anyway; I bet the
hashing takes longer...

------
_justin
I had interned in a similar company, they had this sorted out way back then,
using bloom filters. This is the github link if anyone is interested:
[https://github.com/mobmewireless/fast-
filter](https://github.com/mobmewireless/fast-filter)

------
timeiscoffee
Article seems down. Here is a google cached version:

[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://sparktg.com/engineering/)

------
bilalhusain
The earlier submission tanked on HN and the link didn't work, so I'm
resubmitting with lookup timings added. Hopefully readers will find it
interesting this time.

~~~
nly
> Unfortunately, 10 digits won't fit in 32 bits and 5 * 400 MB for storing
> number is not a very happy situation and it is NOT readily searchable.

Why is 7 bytes (2b prefs + 5b number) x 400M = ~2.8GB of RAM not good enough?

~~~
tacos
Exactly. My response to the headline was "Um... using ten lines of C and $44
worth of RAM?"

~~~
yason
Because the interesting part of the effort is doing it fast, in anticipation
of translating the optimization tricks to cases where you can't just go buy
more RAM anymore. A lot of problems can be "solved" with current hardware
because accesses or comparisons or whatnot become ridiculously cheap, but
that's not what computer science is about.

~~~
fleitz
This is one of the biggest problems I run into in 'software engineering':
people building way too complex systems because they are 'interesting' rather
than buying $50 of RAM and being done with it, and then poo-pooing systems
that cost $50 and work.

Querying a DNC list is not a problem where you will ever be unable to buy more
RAM, and it's trivially parallel; if for some reason DNC lists ever outpace
Moore's law, just buy another system.

To be fair to the authors, at least they didn't do something ridiculous like
build a 100 node Cassandra cluster.

~~~
wtallis
And realistically, there's nothing you can do with phone numbers that even
needs the full speed of RAM. A fast SSD can do enough random reads to load the
entire database in under 20 minutes, and nobody actually needs to know the
status of all 400M numbers in the same 20 minutes because they can't all be
dialed that quickly.

~~~
fleitz
Exactly:
[https://github.com/twitter/fatcache](https://github.com/twitter/fatcache)

    A single fatcache can do close to 100K set/sec for 100 bytes item sizes.
    A single fatcache can do close to 4.5K get/sec for 100 byte item sizes.
    All the 8 fatcache instances in aggregate do 32K get/sec to a single 600 GB SSD.

------
eva1984
Interesting article.

The idea of a two-layer index actually reminds me of the very similar
compression technique used by Roaring Bitmaps.

------
halayli
My intuition is that a radix trie would make a lot more sense and be more
maintainable than such a hack.

------
daurnimator
Idea: use a hash table with perfect hashing
([https://en.wikipedia.org/wiki/Perfect_hash_function](https://en.wikipedia.org/wiki/Perfect_hash_function),
e.g. library: [https://github.com/wahern/phf](https://github.com/wahern/phf))

------
vidarh
This is what tries [1] are for, as well as skip-lists [2]. Skip lists for a
suitably sized prefix would likely be more space efficient than a trie, but
some trie types, such as radix trees [3] with a sufficiently high radix value,
could also be quite efficient.

The described approach doesn't seem _bad_, though it's a bit amusing to see
this described as if it's not a very well-trodden area of computer science.

[1] [https://en.wikipedia.org/wiki/Trie](https://en.wikipedia.org/wiki/Trie)

[2]
[https://en.wikipedia.org/wiki/Skip_list](https://en.wikipedia.org/wiki/Skip_list)

[3]
[https://en.wikipedia.org/wiki/Radix_tree](https://en.wikipedia.org/wiki/Radix_tree)

~~~
fiatmoney
Depending on exactly how they're implemented, tries can give you unacceptable
numbers of random disk seeks / uncached memory lookups per read.

~~~
vidarh
They can, but they certainly don't have to, especially not for a dataset like
this, which is small enough to re-generate offline with the data suitably
clustered. And there's certainly no reason why it'd be worse than the
described approach, which isn't really all that conceptually different. What
the article describes isn't necessarily _bad_, just unnecessarily ad hoc when
there are plenty of suitable algorithms with well understood tradeoffs and
implementations.

------
gupta1838
Yeah, or you could just, like, use indexes?

Remind me not to hire these guys.

------
alexeyshurygin
Is it even worth mentioning? The straightforward approach of storing the
numbers in an in-memory hash map solves the problem. This may require, say,
40 GB of RAM w/o ANY optimizations.

~~~
GFK_of_xmaspast
Why would you do that when you can just keep a 10B-long bit vector?
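
That is, something like this sketch: one bit per possible 10-digit number,
about 1.25 GB total.

    #include <stdint.h>
    #include <stdlib.h>
    
    #define TOTAL_NUMBERS 10000000000ULL        /* 10^10 possible 10-digit numbers */
    
    static uint8_t *bits;                       /* ~1.25 GB, one bit per number */
    
    static int  bitmap_init(void)       { bits = calloc(TOTAL_NUMBERS / 8 + 1, 1); return bits != NULL; }
    static void bitmap_set(uint64_t n)  { bits[n >> 3] |= (uint8_t)(1u << (n & 7)); }
    static int  bitmap_test(uint64_t n) { return (bits[n >> 3] >> (n & 7)) & 1; }

Membership alone doesn't carry the 2 bytes of per-number preferences, but for
a pure "is this number on the list" check it's as dense as it gets.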

------
meeper16
Why not OpenLDAP with MDB? It's the fastest in terms of lookups ->
[http://www.slideshare.net/ldapcon/whats-new-in-
openldap](http://www.slideshare.net/ldapcon/whats-new-in-openldap)

