
The Difficulty of Private Contact Discovery - matt_xyz
https://whispersystems.org/blog/contact-discovery/
======
junto
The problem isn't that social networks can't implement this, it is that they
don't want to.

Facebook and LinkedIn, for example, want to know who all our contacts are,
because that data is so valuable to them.

This needs to be built into the mobile OS. The OS should only allow third
parties to access some kind of protected contacts, nothing more.

Furthermore, we need fine-grained control in the mobile OS, to optionally
block apps from accessing private information.

Apps should still work if certain privacy features are enabled by choice,
rather than the default today, which appears to be "install our app and accept
our defaults for 'all the things' or don't use our app".

Bravo to Whisper Systems calling this out, and trying to work towards a
technical solution.

~~~
Arnt
You might want to reread the post: "As far as we can determine, practical
privacy preserving contact discovery remains an unsolved problem."

~~~
junto
But there is some degree of success mentioned in the article, even if it
doesn't work for huge networks.

That being said, some degree of privacy is better than none at all. Currently
Facebook et al. upload every single piece of contact information we have on
our devices. That's insane from a privacy perspective.

------
mcculley
Where the article enumerates the reasons one might not want to share one's
entire address book, it leaves out the one most important to me. If I allow an
app access to my entire address book, it can see attributes on an entry that I
feel I should not leak, such as birthdate and physical home address. Address
books need a finer grained way to express such access control. Even when
sharing a contact with a third person and not a service, I might not want to
share the entire address book entry for a person.

~~~
dalbin
In iOS, you can present a contact picker and the app will only get access to
the user's final selection (a contact or a contact property).

[https://developer.apple.com/library/prerelease/ios/documenta...](https://developer.apple.com/library/prerelease/ios/documentation/ContactsUI/Reference/CNContactPickerViewController_Class/index.html#//apple_ref/occ/cl/CNContactPickerViewController)

~~~
mcculley
Interesting. I was aware that this new API exists. I've not seen an app use it
yet. I was not aware it could narrow the selection down to just a single
contact property. The documentation does not make that clear.

Nevertheless, apps like Facebook aren't going to use this as they want the
whole address book and not a single person.

------
Spearchucker
The use cases are interesting to me. I once crossed a border where my phone
was taken from me, into another room. It doesn't take a rocket scientist to
work out that an image was lifted from it before it was returned to me.
Sending text messages from that same country had a typical latency of between
one and four hours before arriving back home. There are not many people I want
to communicate with privately, so I wrote my own secure app to send encrypted
text to a web site located in a country I trust, from where it's forwarded as
required. I know my contacts, and don't want those contacts known to others.
There aren't many I feel the need to exchange messages with securely, and
therefore don't want my contacts "discovered".

So it would be nice to find something some day that lets me confidently opt
out of ANY kind of contact discovery. As it is I don't install any apps on my
phone that require permission to access my address book. I realize that this
probably puts me into a tiny subset of user profiles, so not holding my
breath. My own solution is clunky, but it appears to remain my best option for
the foreseeable future.

~~~
smartbit
What do you think of Wire.com? It doesn't need access to the address book
(just as WhatsApp didn't until a few years ago). See the discussion elsewhere
on this page for a few links.

------
StavrosK
Phil Zimmermann did come up with a protocol that avoids sending everything,
which I implemented here:

[https://github.com/SilentCircle/contact-discovery](https://github.com/SilentCircle/contact-discovery)

Moxie said it wasn't "meaningfully private", if I recall correctly, without
elaborating further.

~~~
oceanofsolaris
As far as I can see this is quite similar to the 'sharded bloom filter'
discussed in the submitted link.

This approach is dismissed in the document as

> In the end, this strategy seems to provide very little at the cost of fairly
> substantial complexity.

Since it seems you use it for Silent Circle, how does it work there? How many
'longer hashes' do clients get on average per submitted 'short hash'? How
large is the 'false positive' rate you get from the long hashes? Since the
argument against the 'sharded bloom filter' in the submitted link boils down
to 'the tradeoffs don't work', I would be very interested to know how these
tradeoffs work out for Silent Circle.

As far as I understand, you would still need to have some rate limiting in
order to stop someone from collecting all registered phone numbers, though,
wouldn't you?

~~~
StavrosK
> As far as I can see this is quite similar to the 'sharded bloom filter'
> discussed in the submitted link.

It's pretty similar, but not exactly like that. The main difference is that
the server doesn't need to transmit a bloom filter; it transmits partial
hashes. But yes, there's a tradeoff.

> How many 'longer hashes' do clients get on average per submitted 'short
> hash'? How large is the 'false positive' rate you get from the long hashes?

I'm afraid I don't know the answer to these questions, as we don't keep
logs... The false positive rate of the long hashes should be astronomically
low (since they are full 256-bit hashes), but it doesn't much matter because
the client only needs to be reasonably sure that the user exists so it can
perform an actual query with the plaintext identifier.

The problem isn't really "we shouldn't know any of your contacts", it's "we
shouldn't know any of your contacts that aren't using our service", since
clients have to divulge _some_ of their contacts in order to set up calls with
them or send them messages. That means the client only has to know which of
their contacts are probably on the server so they can request more
information, which will, at that point, tell them with certainty whether a
contact is using the service.
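A minimal sketch of the partial-hash flow as described (the prefix length,
hash choice, and numbers are my own illustrative assumptions, not Silent
Circle's actual parameters):

```python
import hashlib

def full_hash(number: str) -> str:
    """Full 256-bit hash of an identifier (details are illustrative)."""
    return hashlib.sha256(number.encode()).hexdigest()

PREFIX_LEN = 4  # hex characters of the short hash sent to the server (assumed)

# Server side: registered users, indexed by short hash prefix.
registered = {"+15551230001", "+15551230002", "+15559990009"}
index = {}
for n in registered:
    index.setdefault(full_hash(n)[:PREFIX_LEN], set()).add(full_hash(n))

def server_lookup(short_hash: str) -> set:
    """The server sees only a few bits of each queried contact."""
    return index.get(short_hash, set())

# Client side: check which contacts are probably registered.
contacts = ["+15551230001", "+15550000000"]
probably_registered = [
    c for c in contacts
    if full_hash(c) in server_lookup(full_hash(c)[:PREFIX_LEN])
]
# The client then confirms each candidate with a plaintext query, as
# described above, which rules out any residual false positives.
```

The privacy comes from the short prefix: the server learns only a few bits of
each queried contact, at the cost of shipping back every registered hash that
shares the prefix.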

P.S. My opinions are my own, I'm not speaking for SC, etc.

~~~
oceanofsolaris
I have to say that this sounds very reasonable (and certainly a lot better
than sending your plain text contacts).

Does anyone know what kind of solution Signal is using nowadays? The linked
article is not very recent and they might have changed their method in the
meantime. I had a quick look at their source but was not successful in finding
the relevant part for uploading the contacts.

------
noja
From 2014. Previous discussion:
[https://news.ycombinator.com/item?id=7007554](https://news.ycombinator.com/item?id=7007554)

------
sgentle
I was recently reading about some of the interesting work with invertible
bloom filters/invertible bloom lookup tables, eg [0] [1].

There might be either a clever way to exploit IBLTs to make a low-
information query of the data (like "Oblivious Selection from a Table" in
[1]), or there may be a way to use them to more efficiently implement the
bloom filter/sharded bloom filter options.

My information theory/coding theory is not strong enough to really know, but
it might be worth looking into!

[0] [https://www.ics.uci.edu/~eppstein/pubs/EppGooUye-SIGCOMM-11....](https://www.ics.uci.edu/~eppstein/pubs/EppGooUye-SIGCOMM-11.pdf)
[1] [http://arxiv.org/pdf/1101.2245.pdf](http://arxiv.org/pdf/1101.2245.pdf)

------
caf
_The average Android user has approximately 5000 contacts synced to their
device._

This seems staggeringly large.

~~~
marssaxman
Yes, that number startled me as well - it is so far beyond Dunbar's number
that it must represent the result of some automatic collection process.
Perhaps Gmail harvests all the email addresses passing through your inbox and
stuffs them into the contact list? I imagine that most people using Android
phones are less persnickety than I am about allowing Google to manage their
lives, so I can imagine that Google might be routinely shuffling large numbers
of "contacts" around, which the user of the device neither knows nor cares
about.

~~~
mahyarm
I think it's the gmail "you've emailed them so now they are a contact" and
other social services syncing their address books into your address book.

------
newman314
Well, here's another UX nit to pick with Signal. On Signal, I have multiple
contacts that have been uploaded/updated somehow. They are grayed out in the
address book, so I'm fairly sure they are not on Signal; presumably they were
uploaded by someone else who knows them.

What's not clear to me is why/how this is being determined.

On a side note, this is one reason why I would contend that phone numbers (or
a SHA256 equivalent) are not a good selector to use for Signal. It could easily
be a privacy preserving custom field (that is then hashed) without
accidentally sharing PII. This would also have the benefit of being able to
reset the field without too much pain. Having to reset and get a new number is
much more painful.

One last thing, has anyone else noticed that Signal seems to use an inordinate
amount of data when _not_ using it? Once again, not clear to me why this much
data transfer is needed. I just checked my phone and 359MB of data has been
sent in the last 30 days. I use Signal sparingly (mostly because most of my
contacts are not on it) and never a voice call over LTE. So I have no idea
what is causing that much data to be transferred.

~~~
zmanian
Signal for iOS desperately needs a new maintainer to replace Fred.

Here is the tracking issue for data usage.

[https://github.com/WhisperSystems/Signal-iOS/issues/866](https://github.com/WhisperSystems/Signal-iOS/issues/866)

~~~
mahyarm
They don't pay that well relative to the local market, insist you be in SF,
and only want full-time. They could solve the problem by allowing remote
work, taking on a few part-timers, or paying the SF-standard salary. And if
you try to submit a simple GitHub pull request on the repo, the code & design
review takes a very long time, which discourages potential maintainers from
contributing.

It's the old choice triangle: fast, cheap, good: pick two.

~~~
zmanian
Totally agree with you.

------
wtbob
As Moxie notes, this problem is known as private set intersection: how can two
parties P1 & P2 discover which elements they have in common, without revealing
all their elements to each other or to a third party? His error is in
describing scenarios where P1 is Open Whisper Systems and P2 is one of its
users: yes, it would be insane to use a protocol where OWS must transmit its
entire user database (hashed, encrypted or permuted) due to bandwidth reasons.
But that's crazy anyway: OWS can see who is talking to whom on its servers
already. Let's not do that!

The Right Thing is for individual users of Signal to privately compute set
intersections between their own contacts. If Alice knows Bob, Charlie & Dan,
and Bob knows Alice, Charlie & Eve, then when Alice & Bob add each other then
they could each execute a protocol which would reveal that they both know
Charlie.

In the last discussion of this, pbsd posted a link [1] to a great Microsoft
paper covering private set intersection [2]. It works its way through a number
of different protocols, but what it ends up with is this:

\- Alice generates three sets of dummy data D1, D2 & D3 (where the lengths of
all three sets are equal to a security parameter t, and the items in the sets
are invalid phone numbers, perhaps elements of the form
'invalid:$RANDOM_DATA') and sends them to Bob securely. These dummy sets will
be used to detect whether the server is cheating.

\- Bob checks that they are well-formed (i.e., of the right length and invalid
phone numbers).

\- Alice & Bob then generate a shared key K1 using a simulated coin-tossing
protocol.

\- Bob then generates a shared key K2 with the server (in this case, the OWS
server) using the simulated coin-tossing protocol.

\- Alice then sends to the server SHUFFLE{HMAC(K1, Charlie:1), HMAC(K1,
Charlie:2), … HMAC(K1, Charlie:n), HMAC(K1, Dan:1), HMAC(K1, Dan:2), …
HMAC(K1, Dan:n), HMAC(K1, D1_1), HMAC(K1, D1_2) … HMAC(K1, D1_t), HMAC(K1,
D2_1), HMAC(K1, D2_2), HMAC(K1, D2_t)}, where n is another security parameter,
Charlie is Charlie's phone number and Dan is Dan's phone number. Since the
server doesn't know K1 and it doesn't know how the set was shuffled, it has no
idea which of the hashes is which phone number, nor which is a dummy.

\- The server then generates a random SEED and sends to Alice
KEYED_SHUFFLE(SEED, {HMAC(K2, HMAC(K1, Charlie:1)), HMAC(K2, HMAC(K1,
Charlie:2)), … HMAC(K2, HMAC(K1, D2_t))}). Since Alice doesn't know K2, and
since she doesn't know how the set was shuffled, she learns nothing yet.

\- Bob sends SHUFFLE{HMAC(K2, HMAC(K1, Charlie:1)), HMAC(K2, HMAC(K1,
Charlie:2)), … HMAC(K2, HMAC(K1, Charlie:n)), HMAC(K2, HMAC(K1, Eve:1)),
HMAC(K2, HMAC(K1, Eve:2)), … HMAC(K2, HMAC(K1, Eve:n)), HMAC(K2, HMAC(K1,
D2_1)), HMAC(K2, HMAC(K1, D2_2)), … HMAC(K2, HMAC(K1, D2_t)), HMAC(K2,
HMAC(K1, D3_1)), HMAC(K2, HMAC(K1, D3_2)), … HMAC(K2, HMAC(K1, D3_t))} to
Alice.

\- Alice is now able to compute the intersection of the message she received
from the server and the message she received from Bob. She does so, and sends
the intersection {HMAC(K2, HMAC(K1, Charlie:1)), HMAC(K2, HMAC(K1,
Charlie:2)), … HMAC(K2, HMAC(K1, Charlie:n)), HMAC(K2, HMAC(K1, D2_1)), …
HMAC(K2, HMAC(K1, D2_t))} to Bob.

\- Bob knows K1 & K2, and can calculate HMAC(K2, HMAC(K1, X)) for any X he
knows (including the members of the three dummy sets D1, D2 & D3). He can now
check that neither Alice nor the server has cheated. First, he finds the
original X for each HMAC(K2, HMAC(K1, X)) in the intersection. Then, if any
member of D2 is missing from the result, if any member of D1 or D3 is in the
result, or if he doesn't see all of Charlie:1, Charlie:2 … Charlie:n, then
someone has cheated. Otherwise, Bob knows that neither Alice nor the server
cheated, and he now knows that they share Charlie in common.

\- Bob tells the server 'continue' and the server sends Alice the random seed
used in the KEYED_SHUFFLE; Alice uses this to determine which item she sent
the server corresponds to which item the server sent to her, and from that
which item in the intersection she sent to Bob corresponds to Charlie:1,
Charlie:2, … Charlie:n, Dan:1, Dan:2, … Dan:n and so forth. Alice, too, can
see if either Bob or the server cheated.

At the end of this protocol, Alice & Bob both know that they share a contact
in common with the same phone number. In place of phone numbers, of course,
one might use pseudonyms, public keys &c. They've sent relatively small
messages (just a few times the size of their contact lists), and the server
doesn't know how many they have in common, or whether they have anyone in
common.

So, problem solved.
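Stripped of the anti-cheating machinery (the dummy sets D1-D3, the n-fold
replication of each number, and the keyed-shuffle verification), the core
blinded-intersection step might be sketched like this; the keys and
identifiers are placeholders:

```python
import hashlib
import hmac
import random

def H(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

# Keys from the two coin-tossing steps (values here are placeholders).
K1 = b"shared by Alice and Bob"
K2 = b"shared by Bob and the server"

alice_contacts = [b"charlie", b"dan"]
bob_contacts = [b"charlie", b"eve"]

# Alice -> server: contacts blinded under K1 and shuffled.
# Without K1, the server cannot tell which hash is which number.
to_server = [H(K1, c) for c in alice_contacts]
random.shuffle(to_server)

# Server -> Alice: the same items re-blinded under K2 and shuffled.
# Without K2, Alice learns nothing from this yet.
from_server = [H(K2, x) for x in to_server]
random.shuffle(from_server)

# Bob -> Alice: Bob knows both keys, so he double-blinds his own contacts.
from_bob = [H(K2, H(K1, c)) for c in bob_contacts]
random.shuffle(from_bob)

# Alice intersects the two doubly blinded sets and sends the result back
# to Bob, who can invert it because he knows K1 and K2.
intersection = set(from_server) & set(from_bob)
shared = [c for c in bob_contacts if H(K2, H(K1, c)) in intersection]
```

The dummy sets and the seed-reveal step exist precisely because this bare
version lets a cheating server or party bias the intersection undetected.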

Edit: I should note that there's no way for Alice to know that Dan also has an
account if they don't know any other user in common, and of course Bob & Alice
need to exchange identities offline in person first. But, once that exchange
has happened, everything else Just Works™ and, due to the small-world
phenomenon, as more people use it, more users discover one another through it.
Users who know one another can even re-run the protocol every once in a
while, when discovering new contacts.

[1]
[https://news.ycombinator.com/item?id=7009674](https://news.ycombinator.com/item?id=7009674)
[2]
[http://research.microsoft.com/pubs/194141/sapsi2.pdf](http://research.microsoft.com/pubs/194141/sapsi2.pdf)

~~~
smartbit
Despite this excellent explanation by wtbob, despite Moxie's blog posting,
despite EFF, Snowden, Poitras, and Schneier supporting it, I distrust
Whisper Systems for insisting that Signal _must_ have access to the address
book, refusing to work without it. I tried it several times over the years,
hoping they would have reverted this and allowed usage without access to the
address book. Every time I deregistered again.

Today I will delete Signal from my devices, because earlier this month
Wire.com added PFS [0] _without_ requiring access to the address book. I'll
start using Wire.com despite it not being fully open source. Until now I was
using iMessage and FaceTime Audio, as they seem the most secure PFS solutions
to date that don't require the address book to be uploaded. For some contacts
I have no option but WhatsApp, which is quite a hassle without allowing
WhatsApp access to the address book. From today I'll switch to Wire.com, as
it seems more secure than Apple's offering because Wire.com allows signature
verification.

Regretfully, Tom's Hardware [1] doesn't mention the address book in its
comparison table, nor does it mention Apple's offerings. Neither does
Wikipedia [2].

Edit: found a German video blog [3] that _does_ mention the address-book
capabilities of Wire.com.

[0] [http://www.reuters.com/article/us-dataprotection-messaging-w...](http://www.reuters.com/article/us-dataprotection-messaging-wire-idUSKCN0WC2GM)

[1] [http://www.tomshardware.com/news/wire-app-complete-end-to-en...](http://www.tomshardware.com/news/wire-app-complete-end-to-end-encryption,31389.html)

[2]
[https://en.wikipedia.org/wiki/Comparison_of_instant_messagin...](https://en.wikipedia.org/wiki/Comparison_of_instant_messaging_clients#Secure_messengers)

[3]
[https://www.youtube.com/watch?v=9kCsFjBuH0o&t=74](https://www.youtube.com/watch?v=9kCsFjBuH0o&t=74)

~~~
wtbob
> Despite this excellent explanation by wtbob, despite Moxie's blog posting,
> despite EFF, Snowden, Poitras, and Schneier supporting it, I distrust
> Whisper Systems for insisting that Signal _must_ have access to the address
> book, refusing to work without it.

The mechanism I explained would permit Signal to work without uploading one's
address book to the Signal servers. It would, of course, still require access
to the address book itself in order to perform contact discovery. I personally
don't mind allowing access to the address book itself; what I mind is
uploading its contents.

~~~
smartbit
> what I mind is uploading its contents.

You're correct that _uploading its contents_ is what it is all about for me.
And why doesn't Signal use the new CNContactPickerViewController on iOS? But
if I'm not able to verify the code, why does Signal _require_ access? If that
is their business model, that is fine for Moxie et al. but not for me.

@wtbob thanks for the clarification. I wonder what you think of Wire.com?

------
legulere
Why not share the contact list with the people on your contact list? Facebook
demonstrated that most people don't have a problem with that and most people I
want to contact are already contacts of some contacts that I have.

~~~
Arnt
That's the goal. The problem is discovering it. If x has me on the list and I
have x on the list, how can our two Signal apps discover that? Without telling
the Signal server, and without telling random third parties?

~~~
zeveb
> If x has me on the list and I have x on the list, how can our two Signal
> apps discover that? Without telling the Signal server, and without telling
> random third parties?

Take a look at the protocol documented in
[https://news.ycombinator.com/item?id=11289223](https://news.ycombinator.com/item?id=11289223)

It's secure against an honest server and a malicious person, and against a
dishonest server; it's insecure against a colluding server and person.

~~~
Arnt
Fascinating algorithm. I wonder how many Signal users it would discover in the
~500 typical contacts, given one starting contact.

------
blockChainAttak
So, then, once everyone imagines they're safe in a totally private,
encapsulated world of secret hashes, along comes a peer-to-peer initiative to
mount a brute-force attack on the entire keyspace, using strategies not
unlike the Bitcoin blockchain.
------
jsnathan
Can't you just make the hash more expensive?

Also, couldn't you regularly rotate salts?

~~~
geocar
Ten-character alphanumeric rainbow tables can be had for only a few thousand
dollars.

~~~
jsnathan
There cannot be precomputed tables for every custom hash computation.
Especially an expensive one. Even less so if you add a salt.

It could be made prohibitively costly for all but the most determined
adversaries - and those could probably get your social graph much easier from
other sources.

The only problem I can see is that it would take a few minutes after the app
is installed to find your contacts.

~~~
maxerickson
You can't use a salt, the contact you want to match has to be able to compute
the same hash.

You can hash the contact pair instead of one side of it, but even then an
attacker can rent X instances of some hardware that is Y times faster than
your phone. So if you target a difficulty of ~1 second per contact on your
phone, you get a difficulty of 2700 hours / XY for the attacker to recover all
of your contacts.

So if Y is 10 times faster than your phone, you only need X=10 to recover the
contacts for a given number in a day. If the attacker is willing to set
X=1000, Y isn't even relevant.
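A quick sanity check on this arithmetic (note the 2,700-hour figure is
corrected to ~2.7 million hours downthread):

```python
# Attacker cost to hash the full phone-number keyspace, per the reasoning
# above: X rented instances, each Y times faster than a phone that takes
# about one second per hash.
def attack_hours(keyspace=10**10, seconds_per_hash=1.0, x=1, y=1):
    return keyspace * seconds_per_hash / 3600.0 / (x * y)

baseline = attack_hours()            # ~2.8 million hours at phone speed
rented = attack_hours(x=1000, y=10)  # ~278 hours with modest rented hardware
```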

~~~
jsnathan
> You can't use a salt, the contact you want to match has to be able to
> compute the same hash.

I think the misunderstanding here is that I meant one salt for _all_ users at
a time, instead of one salt for each hash.

And as long as stored hashes can be recomputed (by active users), the 'global
salt' can be rotated over time.

We can pre-compute hashes with future salts in case apps are not constantly
online. Then we could rotate salts, say, once every 2 hours.

The point being to make rainbow tables more expensive.
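One way the expensive-hash-plus-rotating-global-salt idea could look, using
the standard library's scrypt; the cost parameters and the 2-hour window are
illustrative assumptions:

```python
import hashlib

# A deliberately expensive, memory-hard hash with a global, periodically
# rotated salt, as proposed above. The scrypt parameters are illustrative;
# they would be tuned so one hash takes on the order of a second on a phone.
def slow_hash(number: str, salt: bytes) -> bytes:
    return hashlib.scrypt(
        number.encode(),
        salt=salt,
        n=2**14, r=8, p=1,  # CPU/memory cost, block size, parallelism
        dklen=32,
    )

# The salt is derived from the current 2-hour window, so every client
# computes the same hash within a window, and rainbow tables go stale.
def epoch_salt(unix_time: int, window_seconds: int = 2 * 3600) -> bytes:
    return str(unix_time // window_seconds).encode()

digest = slow_hash("+15551230001", epoch_salt(1_457_000_000))
```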

> You can hash the contact pair instead of one side of it, but even then an
> attacker can rent X instances of some hardware that is Y times faster than
> your phone, so if you target a difficulty of ~1 second per contact on your
> phone, you get a difficulty of 2700 hours / XY for the attacker to recover
> all of your contacts

> So if Y is 10 times faster than your phone, you only need X=10 to recover
> the contacts for a given number in a day. If the attacker is willing to set
> X=1000, Y isn't even relevant.

I think this should be (10^10/3600 ~ 2.7 million hours) / XY. Also, let's say
we can make it Y seconds (perhaps the hash is memory hard, etc). Then to
compute a single table (per salt) we need 2.7e6 hours, at say $0.01 per hour,
so $27k per table.

Taken together that would mean to reveal the contacts of signups happening
within a 2 hour window, would cost quite a sum.

It's not secure, of course - but it does give some measure of privacy.

If you think users are willing to wait more than an hour, you could 10x the
cost too. But that's probably not practical anymore!

Edit: Oh. I see I made a mistake here! If the server computes a single rainbow
table it will be able to retroactively de-anonymize all users that are already
active at that time! :( Sorry.

~~~
maxerickson
Yeah, somehow I missed a couple of zeros, it would be 2.7 million.

Such a salt wouldn't tighten up the window, though; for contact discovery to
work, all the outputs have to be available to whatever is doing the matching.
~~~
jsnathan
OK hold on. Say the rotating salt thing doesn't work, for whatever reason.

But you yourself made a better suggestion! If we do hash (phone #, contact)
pairs, and every client publishes multiple hashes, one for each of its
contacts, then a cheating server would need to compute (max) (10^10 *
10^10)/3600 ~ 2.7e17 hours worth of hashes.

That would actually be secure, no?

Edit: messed up the math myself this time; I think this is correct now though
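A sketch of the pair-hashing idea, with a canonical ordering so both sides of
a mutual contact derive the same value (the ordering and serialization are my
assumptions; the thread doesn't specify them):

```python
import hashlib

# Hashing a (subscriber, contact) pair instead of a lone number, as
# suggested above: a precomputed table must now cover the product of the
# two keyspaces, not just one of them.
def pair_hash(a: str, b: str) -> bytes:
    # Canonical order so both parties of a mutual pair derive the same value.
    lo, hi = sorted((a, b))
    return hashlib.sha256(f"{lo}|{hi}".encode()).digest()

h1 = pair_hash("+15551230001", "+15559990009")
h2 = pair_hash("+15559990009", "+15551230001")
# h1 == h2: the pair matches regardless of which side computes it.
```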

~~~
maxerickson
For the contact discovery to be private, the phone has to calculate the hash.
So the server cost also depends on the performance differential (many entities
would likely still spend $50,000 to discover the contacts of a high value
target).

Adding information increases the cost, but it also reduces the likelihood of a
match.

You've edited your comment since I wrote the above. Anyway, if the attack is
directed at a single user, there's only one 10^10 involved (and really, the
phone number space isn't that big).

~~~
jsnathan
Really sorry about the edits! Thinking in real time :)

So about the size of the phone number space. Interestingly I guess you could
at least limit it by country. But, unless you already know the mobile phone
numbers of the target's acquaintances, (in which case there is no real danger
of 'leaking' your social graph), you can't really limit it more, because
mobile phone numbers are pretty random within each country's phone number
space (I think).

So say the US has at least 0.5 billion assigned mobile numbers, and you don't
know which are assigned, so the space could easily be 10^10 for just the US.
(I can't find any quick numbers on Google right now).

But even if we go down to, say, 100 million, we would still get (10^8 *
10^8)/3600 ~ 2.8e12 hours worth of hashing to get all contact pairs. Which is
still secure.

Now, as far as limiting one endpoint to a single phone number: that would be
a special case. I'm not sure this isn't moving the goalposts a bit.

As long as the server does not have access to your own phone number, it, or
anyone who has hacked into the server, cannot really use that to limit their
computations. Unless they are already targeting you through other means
(possibly you are a person of interest to someone), and so know your phone #.

In that case yes, they could reveal (part of) your social graph for a (maybe)
affordable cost. But in that very special case, I think we are dealing with
nation-state adversaries.

And in that case, the telcos probably already have a list of your phone #
contacts. So it would probably be easier to simply get it from them.

The only thing left to be revealed would be that you communicated with a
specific acquaintance _using that particular app_, at some particular time.

So a targeted attack could leak that metadata.

It's not perfect.. but I think it might just be good enough for most users.

If we go off the deep end with an attack tailored to a single phone #, we also
have to worry about a whole lot of other attack vectors that are probably more
important.

Lastly, note that the user (if sufficiently paranoid) could simply opt out of
the automatic contact search post-install.

~~~
dfox
As for mobile phone numbers being randomly placed inside the larger space of
all phone numbers: while this makes some sense for the US/NANP, in most of
the rest of the world mobile numbers have semi-well-defined prefixes (as in
"a number that starts with 6 or 7 is certainly a mobile phone; discerning the
operator is slightly non-trivial"), and even for NANP there is probably some
publicly available database that says whether a given number is mobile or not.

------
chrisdew
> This would be a great problem to really solve well. If anyone develops any
> insight into a practical privacy preserving mechanism for achieving this
> type of contact discovery, or notices something we’ve missed, please let us
> know here, on twitter, or on our development mailing list.

> – Moxie Marlinspike, 03 January 2014

This idea is probably full of holes, but there might be something useful
here...

Could you use a layered system of Bloom filters over shards of the
hash(telno) space?

(Figures are made up, but are roughly the right order of magnitude for
millions of service users.)

The filters could be made smaller, at the expense of having more of them.

Layer 0 is a 1MB bloom filter with a 10% error rate, for _all_ hashes of
telnos.

Layer 1 consists of 16 1MB bloom filters with 1% error rates, sharded by the
last nibble of the telno's hash.

Layer 2 consists of 256 1MB bloom filters with 0.1% error rates, sharded by
the last byte of the telno's hash.

Layer 3 consists of 4096 1MB bloom filters with 0.01% error rates, sharded by
the last 12 bits of the telno's hash.

...

A newly installed app uses the following algorithm to determine which of its
contacts are already users of the service, without disclosing the phone's
contact list to the service:

The app makes a copy of the list of contacts' telephone numbers (E164 format).

Request the L0 bloom filter. Using L0, 90% of non-member telnos are removed.

Request only the required L1 shards for the remaining telnos. Using the
necessary L1 shards, more non-member telnos are removed (99% are gone now).

Request only the required L2 shards for the remaining telnos. Using the
necessary L2 shards, more non-member telnos are removed (99.9% are gone now).

Repeat until the required error rate is reached, i.e. the list only contains
telnos of other service users.

The benefit of this method is that bloom filter shards are only requested for
numbers which are _probably_ service users, thus disclosing much less
information to the service provider.

90% of contacts' numbers will never be used to request a Layer 1 bloom
filter, and those that are will only disclose 4 bits of the hash to the
service provider, narrowing them to 6% of the world's telnos.

99% of contacts' numbers will never be used to request a Layer 2 bloom
filter, and those that are will only disclose 8 bits of the hash, narrowing
them to 0.4% of the world's telnos.

99.9% of contacts' numbers will never be used to request a Layer 3 bloom
filter, and those that are will only disclose 12 bits of the hash, narrowing
them to 0.02% of the world's telnos. (About 2.4 million telnos, if there are
10 billion telnos, as per the webpage.)
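The layered scheme above might be sketched as follows; the filters here are
toy-sized, not the 1MB ones described, and only two layers are shown:

```python
import hashlib

class Bloom:
    """Tiny Bloom filter; sizes are toy values, not the 1MB filters above."""
    def __init__(self, bits=8192, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)
    def _positions(self, item):
        # Derive independent bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits
    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def tel_hash(telno):
    return hashlib.sha256(telno.encode()).digest()

users = ["+15551230001", "+15551230002"]

# Layer 0: one filter over all users; Layer 1: 16 filters sharded by the
# last nibble of each hash (deeper layers would continue the pattern).
l0 = Bloom()
l1 = [Bloom() for _ in range(16)]
for u in users:
    h = tel_hash(u)
    l0.add(h)
    l1[h[-1] & 0x0F].add(h)

# Client side: only contacts that survive L0 trigger an L1 shard request,
# and that request reveals just 4 bits of the hash to the server.
contacts = ["+15551230001", "+15559999999"]
survivors = [c for c in contacts if tel_hash(c) in l0]
matches = [c for c in survivors
           if tel_hash(c) in l1[tel_hash(c)[-1] & 0x0F]]
```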

~~~
geocar
The server can cheat.

If I provide an L0 with numbers that begin with `1` and you don't download an
L1, I learn you don't have any numbers that begin with `1`.

Facebook says they get 5 new profiles every second[1], so frequent
re-querying may be justifiable.

[1]: [https://zephoria.com/top-15-valuable-facebook-statistics/](https://zephoria.com/top-15-valuable-facebook-statistics/)

~~~
chrisdew
Good catch. This is a weakness, but it may not be relevant to WhisperSystems'
use case.

If you provide an L0 with numbers whose hashes end with '1' (hex), and I don't
request L1/..1 shard, then you learn that I don't have any numbers whose
hashes end with '1' (hex).

I have 1024 contact telno hashes.

Assuming I request only when the app is installed, and daily after that, it
will take you:

16 days to learn which final nibbles my contact hashes have - probably all of
them.

256 days to learn which final bytes my contact hashes have - probably all of
them.

4096 days to learn which final 12 bits my contact hashes have - probably
around a quarter of them.

Three years in, you know which 900-1024 of the 4096 12 bit combos match the
last twelve bits of one or more of the telno hashes.

~950 * 4096 = 15,200 days to learn the final two bytes of hashes, of which
there will probably be ~1000 unique values.

You now know 1000 buckets of 6 million numbers which contain one or more
numbers from my contacts.

If you update the app, over the air, to query every second, this would take
just a few hours. But then, you could just update the app to grab all my
contacts and send them instantly.

The "server cheating" is a valid attack, but it may not be important, as there
are easier attacks if we assume a malicious service.

I assumed that WhisperSystems were _trying_ to limit the information they had,
as a good actor, rather than ensuring complete app safety vs a malicious
service provider.

i.e., their motivation is to not have the information to turn over, rather
than to be unable to obtain it when compelled to modify their app. This is an
important legal distinction in some jurisdictions.

~~~
cvwright
I think it's a bit worse than that. Suppose the server wants to learn whether
you have a certain number in your contacts.

Then they simply put that number into your lower-level Bloom filters for L0,
L1, L2, ... even if that number is not really a member of the service. To
prevent a final match, they only need to keep the target number out of the
final BF.

Then they passively watch which BF's you retrieve at L1, L2, ... etc. If you
never retrieve the BF for Ln, despite the bits for the target number being
present in the L(n-1) Bloom filter, then they know you don't have the target
as a contact.

Does that make sense?

Of course, this attack and the previous one both go away if we assume the
server is "honest but curious".

~~~
chrisdew
That makes sense, but any malicious service could pretend that a particular
telno was a user of the service, in order to see which users had that contact.

I'm assuming the objective is to minimise the service's unnecessary knowledge
of non-service-user contacts.

~~~
cvwright
> I'm assuming the objective is to minimise the service's unnecessary
> knowledge of non-service-user contacts.

Right, so an honest-but-curious adversary model sounds like it's a good fit
for the server. Then all these attacks where the server lies about something
are out of scope.

Again, I like this idea so I think it's worth thinking about it some more. Not
trying to nitpick here, just trying to help think through what you can do.

What about the clients? Do you want to limit what they can learn?

Given any telno, a client can quickly tell whether that telno belongs to the
service, even if the client doesn't really have that number in its contacts.
Does that matter? I guess we come to a sort of meta-point here. What does it
mean for a number to be in someone's address book? Anything? A fundamental
part of the problem seems to stem from trying to make this work with
identifiers that come from a (cryptographically speaking) very small space.

Extending the attack above, really the client can get a pretty decent
approximation of the entire list of telnos in the service. If he checks all 10
billion telnos against the BF for L0, he gets a list that includes all
subscribers and several false positives -- each number in the list is only 90%
likely to be a subscriber. If he wants to refine his estimate for the
likelihood that any given number is a subscriber, he retrieves the
corresponding BF for L1. So he starts with a decent estimate of the
subscriber list, and with each BF that he downloads, his estimate improves.
How bad is that? I'm really not sure...

~~~
chrisdew
The number of downloads required to get each additional nibble is 16 times
that of the previous nibble.

~~~
cvwright
Also -- I was too sloppy in my earlier analysis, and as a result, my numbers
could be _way_ off. The attacker might learn much less than I initially
thought. It depends on how many users are subscribed to the service.

After the client downloads the BF for L0 and hashes the entire space of 10
billion telnos, he knows 90% of the non-subscribers for certain. All the
telnos that matched in the Bloom filter are either (a) subscribers of the
service or (b) false positives from the BF. Let's say there are n subscribers.
Then the attacker gets a total of n + 0.10 * (10^10 - n) hits on L0. The
likelihood that a matching telno is actually a subscriber is n / (n + 0.10 *
(10^10 - n)).

So he learns something with each level of the Bloom filter, but getting to 90%
certainty may require retrieving several levels of the BF.
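Plugging numbers into that likelihood formula (subscriber counts here are
illustrative):

```python
# Likelihood that a telno matching the L0 filter (10% false-positive rate)
# is actually a subscriber, per the estimate above: n / (n + 0.10 * (10^10 - n)).
def p_subscriber(n, keyspace=10**10, fp=0.10):
    return n / (n + fp * (keyspace - n))

p_small = p_subscriber(10**7)  # ~1%: with 10M subscribers, most L0 hits are FPs
p_large = p_subscriber(10**9)  # ~53%: a larger service leaks far more per hit
```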

~~~
chrisdew
Thanks for your questions and analysis.

