
The Difficulty Of Private Contact Discovery - FredericJ
https://whispersystems.org/blog/contact-discovery/
======
MagicWishMonkey
This is great until you realize the average address book is a disgusting mess
of spelling errors, wrong values in the wrong fields, punctuation where it
doesn't belong (commas in phone number fields), digits in name fields, etc.

The only field that _might_ be fairly consistent is email address, and even
that is no guarantee. Sure, it's possible to clean up a contact before
hashing, but when you consider that the typical phone contains >500 contacts,
it's not something that can be done in a reasonable amount of time on a
modern smartphone. For the last two years I've been working on a product that
involves cleaning/organizing contacts and making them searchable; when I
started I had no idea what a massive undertaking it would be.

So, hash to your heart's content, but if you have 10 different people with
the same contact in their address books, I would be willing to bet that you
will get 10 different hashes, because humans are not good at data entry.

~~~
danielrhodes
This times a million.

Phone numbers in particular can be a horrid mess to deal with. Some decent
libraries have come out to normalize them, but in general it remains
difficult.

For example:

013811234 +88-1-3811234 3811234 0118813811234 008813811234

are all the same number.

To hash them, you could naively just strip off the area codes and country
codes, but that is only if you can know ahead of time they are indeed area or
country codes. Some countries have 7 digit numbers, others have 9, some have
6, or even 5 digits. Sometimes you don't even know the country to apply these
rules to.
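A naive stdlib-only normalization pass illustrates both the cleanup and its
limits. This is a sketch, not real-world code (that should lean on a library
like Google's libphonenumber), and the "88" country code is simply taken from
the numbers above. Note that the bare local number still normalizes
differently, because the missing area code can't be recovered:

```python
import re

def normalize(number: str, country_code: str = "88") -> str:
    # Heuristic sketch only; real code should use a library such as
    # libphonenumber, which encodes per-country numbering rules.
    digits = re.sub(r"\D", "", number)       # drop +, -, spaces, commas
    if digits.startswith("011"):             # NANP international prefix
        digits = digits[3:]
    elif digits.startswith("00"):            # ITU international prefix
        digits = digits[2:]
    elif digits.startswith("0"):             # trunk prefix: assume domestic
        digits = country_code + digits[1:]
    elif not digits.startswith(country_code):
        digits = country_code + digits       # bare local number
    return digits

nums = ["013811234", "+88-1-3811234", "3811234",
        "0118813811234", "008813811234"]
print({normalize(n) for n in nums})
```

Four of the five collapse to one value, but "3811234" does not, because
nothing in the string tells you which area code to prepend.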

~~~
svasan
Just a thought: why not reverse the phone number string, and if the first N
digits of the reversed strings match across all of the above numbers,
classify them as the same phone number? N would vary from country to country,
no doubt. Separator characters like commas, plus signs, hyphens, and periods
could also be stripped out of the string to make the original problem less
complex.
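A minimal sketch of that reversed-suffix idea, with N = 7 as an assumed
per-country value. It does collapse the earlier examples, at the cost of
colliding genuinely different numbers that share a 7-digit suffix:

```python
import re

def suffix_key(number: str, n: int = 7) -> str:
    # Keep only digits, then take the last n of them: the "first N
    # digits of the reversed string". n is a per-country guess here.
    digits = re.sub(r"\D", "", number)
    return digits[-n:]

nums = ["013811234", "+88-1-3811234", "3811234",
        "0118813811234", "008813811234"]
print({suffix_key(n) for n in nums})   # all five map to one key
```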

------
sigil
Interesting problem!

> It’s also possible to compress “updates” to the bloom filter. The server
> just needs to calculate the XOR of the version the client has and the
> updated version, then run that through LZMA (the input will be mostly
> zeros), and transmit the compressed diff to the client.

> Unfortunately, for a reasonably large user base, this strategy doesn’t work
> because the bloom filters themselves are too large to transmit to mobile
> clients. For 10 million TextSecure users, a bloom filter like this would be
> ~40MB, requested from the server 116 times a second if every client
> refreshes it once a day.

Wait, the LZMA-compressed daily diffs would still be ~40MB? If the 40MB is
just the initial bloom filter download, that's not so bad. If 40MB is the
daily diff, I'd be interested in seeing the calculation for that.
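The XOR-then-LZMA update trick from the quote is easy to demonstrate. With
only a day's worth of bit flips, the diff is almost all zero bytes and
compresses to a tiny fraction of the filter (sizes here are toy values, not
the 40MB from the article):

```python
import lzma
import os
import random

SIZE = 1 << 20                     # 1 MiB filter, a toy stand-in for 40 MB
old = bytearray(os.urandom(SIZE))  # the bloom filter the client already has
new = bytearray(old)
for _ in range(100):               # a day's worth of newly set bits
    new[random.randrange(SIZE)] |= 1 << random.randrange(8)

diff = bytes(a ^ b for a, b in zip(old, new))  # mostly zero bytes
compressed = lzma.compress(diff)
print(f"{len(compressed)} bytes shipped instead of {SIZE}")

# The client reconstructs the fresh filter from its old copy:
assert bytes(a ^ b for a, b in zip(old, lzma.decompress(compressed))) == bytes(new)
```

So the daily diff is cheap; it's the initial full download that scales with
the user base.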

~~~
moxie
Our feeling is that 40MB for an initial download is not acceptable on mobile
devices, particularly in many of the areas we'd like to support. Maybe we're
wrong, but it seems like too much on edge networks, particularly given that
it's only going to grow as more users join. Ideally we'd have 100MM or 500MM,
not just 10MM.

We can definitely make lightweight diffs happen, but that initial download is
tough, and is going to get tougher.

~~~
sigil
So make contact discovery optional. My guess is most of your growth will come
from that core of early adopters in the graph, and these folks have nice data
plans / fast wifi and won't mind the 40MB download at all.

In the long term, where you're 10x or 50x where you are today, yes the filter
size is a problem, but it's a "nice problem to have" as they say.

Side question: suppose I'm a less fortunate user of TextSecure, can I manually
exchange keys with someone IRL? How does that work?

~~~
streetnigga
"My guess is most of your growth will come from that core of early adopters in
the graph, and these folks have nice data plans / fast wifi and won't mind the
40MB download at all."

See my comment in this thread about low-resource mobiles, or the article on
BZR being overtaken by GitHub and the complaints about resource use from
low-memory laptop users. Please don't assume technological knowledge or early
adoption implies that the user has plentiful resources.

------
pbsd
A new private-set intersection paper, claiming to scale orders of magnitude
better than predecessors, has just been released [1]. I don't know whether it
solves your problem, as I haven't read anything beyond the abstract, but you
might want to check it out anyway.

[1] [https://research.microsoft.com/en-
us/um/people/senyk/pubs/sa...](https://research.microsoft.com/en-
us/um/people/senyk/pubs/sapsi.pdf)

------
tlb
The problem is that it's easy to upload 1000 randomly generated phone numbers
and get back results. But I think you can distinguish random or sequentially
generated numbers from a real user's actual contact list, by looking at the
distribution of friends among the contacts. A real contact list should have a
high density of friendships between contacts. The system should only return
results if the uploader provides a subgraph that's more highly connected than
a random one would be.

You'd need to look at real data and tune the parameters to make this
effective.
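A rough sketch of that connectedness test, with a toy friendship graph
standing in for the server's real data (all names and thresholds here are
made up for illustration):

```python
import itertools
import random

def edge_density(contacts, friendships):
    """Fraction of pairs among the uploaded contacts that the server
    already knows to be mutual contacts of each other."""
    pairs = list(itertools.combinations(sorted(contacts), 2))
    if not pairs:
        return 0.0
    return sum(p in friendships for p in pairs) / len(pairs)

# Toy server-side state: known mutual-contact edges between user ids.
friends = {(1, 2), (1, 3), (2, 3), (3, 4)}

real_upload = [1, 2, 3, 4]                          # a plausible cluster
scan_upload = random.sample(range(100, 10_000), 4)  # sequential/random scan

print(edge_density(real_upload, friends))   # high: most pairs are friends
print(edge_density(scan_upload, friends))   # ~0: no internal structure
```

As the comment says, the cutoff density would have to be tuned against real
data, and an attacker who already knows part of the graph could still forge a
plausible subgraph.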

------
spullara
I think there is a big difference between private contact discovery and
contact discovery that isn't brute-forceable, as we are seeing with Snapchat
and others. One change that could be made for these very small preimage
spaces would be to ask the target user whether they should be exposed to the
searcher.

1) I upload the hashes of my contacts which enqueues a request to those that
match the hashes

2) The target of the request dequeues those requests and is given the option
to expose themselves to the other user

3) If approved, the original uploader gets access to that user on the service

This would stop the massive discovery issue and would only put a medium
amount of friction into the growth engine. It obviously doesn't cover the
issue that the service itself has access to the contacts, which means they
could potentially be exposed. However, if the service discards hashes without
matching queues and removes matches when dequeued, the risk to the user is
much reduced compared to the full address-book-storage case.

This was written off the top of my head, so forgive me if I have left a big
hole somewhere.
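The three-step flow above might be sketched like this (the class and method
names are mine, not anything the service actually implements):

```python
from collections import defaultdict

class DiscoveryQueue:
    """Sketch of the opt-in flow: matches sit in a queue until the
    target user approves; non-matching hashes are discarded."""

    def __init__(self, registered):
        self.registered = registered          # contact hash -> user id
        self.pending = defaultdict(list)      # target -> [requesters]

    def upload(self, requester, contact_hashes):
        # (1) uploaded hashes enqueue a request for each match
        for h in contact_hashes:
            target = self.registered.get(h)   # unmatched hashes dropped
            if target is not None:
                self.pending[target].append(requester)

    def approve(self, target):
        # (2) the target dequeues requests and exposes themselves;
        # (3) the returned requesters now get access to this user
        return self.pending.pop(target, [])

q = DiscoveryQueue({"h_alice": "alice"})
q.upload("bob", ["h_alice", "h_unknown"])     # "h_unknown" is discarded
print(q.approve("alice"))                     # bob's pending request
```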

~~~
herge
RTFA. He addresses that in the first few paragraphs. Because there are
roughly 10^10 possible phone numbers, it is very easy to build a rainbow
table of all hashed phone numbers and back-track a hash to its phone number.

~~~
MagicWishMonkey
Phone numbers are not unique values; you should never assume that a phone
number alone (or even a phone number plus a last name) is enough information
to uniquely identify an individual.

Case in point: I work with a guy who used to work at his dad's office. He and
his dad have the same name, and they both had the same office number. This
algorithm would incorrectly identify the two of them as the same person when
in fact they are not.

~~~
sln
Phone numbers are unique values when you need to call a phone or send an SMS,
which is what this software does. If you were trying to do your taxes,
sure...but in this case a phone number is exactly enough.

~~~
MagicWishMonkey
Lots of folks still use land lines.

~~~
milkshakes
They probably aren't trying to use TextSecure with their landlines though.

------
jcampbell1
I understand how the encrypted bloom filter they describe keeps my contacts
private from the service, but it seems to leak the entire user list just as
the trivial solution does:

I install millions of copies of the app, all with a different contact list. I
do this signed query process for every possible phone number. I now have the
complete list of users.

~~~
moxie
This post is about building untrusted servers, which is somewhat orthogonal to
"the snapchat problem."

The best a privacy preserving protocol can hope to achieve is to match the
characteristics of its non-privacy preserving equivalent.

Asymmetric PIR (the server transmits an unencrypted bloom filter) falls short
of the non-privacy preserving equivalent, since the user makes offline queries
that can't be rate limited by the server. Symmetric PIR (the server transmits
an encrypted bloom filter) provides identical properties to the non-privacy
preserving equivalent, since the user needs to make queries that can be rate
limited by the server, even though the queries leak no information to the
server.

How that rate limiting (or unlisted numbers, or whatever) works is outside the
scope of the protocol, just as it is in the non-privacy preserving case. For
TextSecure, you'd need to verify 1 million phone numbers in order to perform
your attack. Whether you can do that or not isn't really the point: we simply
want to do at least as well as everyone else, without revealing information to
the server.

------
robrenaud
Is this a real problem?

Is it reasonable for me to trust you to run some arbitrary binary on my phone
that does some magical black-box computation on my contact list and then
phones home to your service, but not to trust that you won't just be a dick
with the raw data?

Stated more simply, if I am paranoid about my privacy, then why am I running
your apps and letting you read my contact list at all?

~~~
goatslacker
To the end user, probably not. But if the encryption is performed on the
client, the client code is open sourced, and you don't auto-update your
application, then you'll know exactly what is happening to your data when you
let it access your contacts.

~~~
Spearchucker
Not realistic. If it's open sourced I can certainly go examine the code, but
it's not a good use of my time (and I can actually read code; what about
someone who has the time but can't read it?). I also have no way of knowing
whether anyone else has looked at the code at all, and if they have, whether
they're skilled enough to determine that the code is kosher. Again, if _I_
struggle with that, how on earth is a non-technically-minded person even
going to begin?

The second question I haven't seen asked here yet is why everything needs to
be social to start with. My social needs are covered by Facebook and Twitter,
and I don't install their apps on my phone either.

Given those two things, if an app (any app!) requires access to my contacts I
don't install it. Period.

~~~
goatslacker
You're right, it isn't realistic, but it's the only way to fully know that
the application is secure. Otherwise you're putting your trust in
something/someone.

------
maxerickson
In a similar discussion, I said something about hashing phone number pairs.
The graph has a lot more edges than nodes.

Still doesn't stop someone from attacking a single number though.

~~~
ash
Interesting idea. But the address book "friendships" must be mutual for this
to work. You would miss those contacts that don't have you in _their_
contacts.

~~~
napoleond
It seems to me like that fits somewhere in the space between "small
compromise" and "bonus feature". This idea seems like a very good practical
solution--I'm very curious what moxie's take on it would be.

~~~
maxerickson
If the design goal is to not trust the server, it is still pretty useless.

Say you make the hash take 1 second on a phone. Without trying to really pin
it down, I think an attacker would be able to do an area code in hours with
minimal investment. They could do the entire phone number space fairly quickly
for a modest investment.

(My attacker there is focusing on the pairs for one number. The area code part
comes in because phone numbers really are lumpy like that.)
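The napkin math behind that, spelled out (the attacker's speedup over a phone
and the 100-machine fleet are assumed round numbers):

```python
# Back-of-envelope numbers for attacking the pairs of a single target number.
phone_rate = 1            # 1 hash/sec: the deliberately expensive client hash
attacker_rate = 1_000     # assumed modest GPU/cluster speedup over a phone

area_code = 10**7         # numbers sharing the target's area code
hours = area_code / attacker_rate / 3600
print(f"{hours:.1f} hours for one area code")

full_space = 10**10       # the whole NANP-style number space
days = full_space / (100 * attacker_rate) / 86400   # with 100 such machines
print(f"{days:.1f} days for the whole number space")
```

So even a deliberately slow hash only buys hours against a focused attacker,
matching the comment's "hours with minimal investment" estimate.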

------
pyrocat
So don't do it. I know I'm probably in the minority here, but I've never
given another app permission to use my Facebook or Twitter account to "Find
Friends", and I personally find the whole practice offensive. Just have an
option to invite people by their email address/number/whatever, but not in
bulk.

~~~
ash
You seem to dislike inviting friends to a service in bulk. But the article
isn't about invites; the problem is securely discovering who's _already_ on
the service. It's not offensive in my opinion.

~~~
jen_h
Not everyone wants to be found, and not everyone wants to find every single
contact in their phone. Especially for something like RedPhone, whose
purported aim is user privacy and security... I'm not sure why they'd want to
even _touch_ 5000+ contacts from a single user (or why someone who used
RedPhone would want them scanning every contact on their phone), even if they
do use some kind of magical unicorn encryption sauce.

------
abentspoon
What about this?

1) Client uploads a bloom filter with all contacts on phone

2) Server responds with a bloom filter with all registered contacts that match
the client's bloom filter

3) Client displays contacts that match server's bloom filter

You can optionally trade contacts back and forth again with a larger
bits/contact ratio to decrease false positives.

I think it works out so that in exchange for 7 bits of information about each
contact from the client, you can reduce the server's response by a factor of
128.
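A toy version of steps (1)-(3), simplified so the server returns a plain list
of matching users rather than a second bloom filter (parameters are arbitrary
demo values, not a tuned design):

```python
import hashlib

def positions(item, m, k):
    """The k bit positions an item maps to in an m-bit filter."""
    return [int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8],
                           "big") % m
            for i in range(k)]

def bloom(items, m, k):
    bits = 0
    for item in items:
        for p in positions(item, m, k):
            bits |= 1 << p
    return bits

def maybe_in(item, bits, m, k):
    return all((bits >> p) & 1 for p in positions(item, m, k))

M, K = 1 << 16, 5
client_contacts = ["15551234", "15559999"]
registered = ["15551234", "15550000"]

# (1) client uploads a coarse filter of its address book;
# (2) server answers with the registered users that match it;
# (3) client keeps the intersection with its actual contacts.
client_filter = bloom(client_contacts, M, K)
server_reply = [u for u in registered if maybe_in(u, client_filter, M, K)]
matches = [c for c in client_contacts if c in server_reply]
print(matches)
```

The local step (3) is what filters out the server's false positives, which is
also why the scheme only leaks a few bits per contact rather than the full
numbers.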

~~~
sigil
The problem with this is that (1) + (2) means the server now knows the user's
address book. They're trying to avoid that with encrypted bloom filters and
blind signature queries.

~~~
abentspoon
First, it looks like this scheme is broken due to CPU constraints. However...

It looked to me as though the encrypted bloom filter was intended to prevent
the client from obtaining the full list of registered users.

With (1) + (2), the server only has a few bits of information about each of
the phone's contacts. It would be analogous to only having the area codes.

~~~
ay
What about a slightly modified idea:

(0) The server informs the client about the parameters of the bloom filter
where it currently keeps all the contacts.

(1) The client builds the bloom filter with the same parameters based on its
address book. After building the bloom filter, the client XORs it with a
fuzzing pattern [a].

(2) The server performs a logical AND of the received bloom filter and (its
own bloom filter XOR another fuzzing pattern [b]), and sends back the result.

(3) The client XORs the result with the fuzzing pattern [a], and uses this
bloom filter to perform the local queries.

(4) The client stores the differences between subsequent results, and each
time the same result is returned, it doubles the query interval [c].

(5) The client sends its own ID with each query, so that as the network grows
and the server inevitably needs to recreate the bloom filter with different
parameters, it can do so gradually, ahead of time, from scratch, based just
on the IDs of the queriers.

[a] I haven't done any napkin calculations of how random it should be, nor
how many bits would need to be set to 1. The idea is to add a certain amount
of false negatives here so that the server is not able to recover the contact
list just by brute-force lookups. To avoid leaking information through
correlation of subsequent requests, this pattern should probably be a
PRF(client_contacts) rather than purely random: if the contact list did not
change, the pattern should not change either, and if the contact list did
change, the pattern should change in a way that somewhat masks the change in
the contact list.

[b] This may not be needed. The idea is to protect against the client sending
an all-1s filter and getting back the entire server-side bloom filter. But
given that [a] and [b] may both introduce false negatives and false
positives, maybe there should be another approach to restricting what is sent
to the client.

[c] Especially given the [a] and [b], chances are quite high that this might
be a no-op.

Obviously this method will introduce false negatives, and if [b] is used, it
can add false positives as well.
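Footnote [a]'s PRF requirement is the most concrete part of this. A sketch of
such a mask (filter size, density, and the SHA-256 construction are all
arbitrary choices of mine):

```python
import hashlib

M = 1 << 12  # filter size in bits, agreed in step (0)

def fuzz_mask(contacts, density=0.02):
    """Fuzzing pattern [a] as a PRF of the contact list: an unchanged
    address book yields an unchanged mask, so repeated uploads don't
    leak extra information through fresh randomness."""
    seed = hashlib.sha256("|".join(sorted(contacts)).encode()).digest()
    bits = 0
    for i in range(int(M * density)):
        h = hashlib.sha256(seed + i.to_bytes(4, "big")).digest()
        bits |= 1 << (int.from_bytes(h[:4], "big") % M)
    return bits

book = ["15551234", "15559999"]
mask = fuzz_mask(book)
sent = 0b1010 ^ mask   # step (1): XOR the real filter with the mask
```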

------
Strilanc
40 MiB for ten million users? I think you should ditch the bloom filter if
the false-positive rate is set so low that it takes that much space.

Assuming you can map phone numbers onto the integers from 0 to ten billion,
trivially compressing the is-a-user/is-not-a-user bit sequence should easily
get it down to ~14 MiB [1].

See also: succinct data structures [2] for accessing data without
decompressing it.

(This is obviously a short term solution. Growth will overcome the savings.)

1:
[http://www.wolframalpha.com/input/?i=%28-p+lg_2%28p%29-%281-...](http://www.wolframalpha.com/input/?i=%28-p+lg_2%28p%29-%281-p%29+lg_2%281-p%29%29+*+ten+billion++%2F+2^23+where++p+%3D+ten+million+%2F+ten+billion)

2:
[http://en.wikipedia.org/wiki/Succinct_data_structure](http://en.wikipedia.org/wiki/Succinct_data_structure)
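The calculation behind [1], redone inline: with p = 10^7 / 10^10, the binary
entropy of the membership bit sequence works out to roughly 13.6 MiB for the
whole number space:

```python
from math import log2

p = 10**7 / 10**10      # fraction of phone numbers that are users
numbers = 10**10        # size of the phone number space

# Shannon entropy per number, times the number of numbers:
entropy_bits = (-p * log2(p) - (1 - p) * log2(1 - p)) * numbers
print(f"{entropy_bits / 8 / 2**20:.1f} MiB")   # ≈ 13.6 MiB
```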

------
fleitz
It's not a technical problem, it's a value-proposition problem. Most users
don't care about privacy....

Fundamentally, there is more value for the network operator in growing the
network than for users in disclosing their details. Email works perfectly
fine without your having to disclose your address book, because no one owns
the network and therefore no one cares how many people are using email.

Alternatively one could just allow users to choose whether to make their info
public and then the client can just search a directory.

------
Illniyar
What about artificially increasing the size of the hash input? For instance:
when checking whether a contact exists, the client sends 10 hashed variants
of the contact (i.e. the contact's number with one extra digit appended, one
variant for each of 0-9), and the server keeps only one randomly chosen
hashed variant to compare against (presumably the client hashes one of them
at random when sending its own contact information).

That way you increase the size of the problem from 10^10 to 10^11 but only
increase your bandwidth tenfold.

It still requires balancing the amount of bandwidth used, but the bandwidth
is no longer determined by the number of users in the system, only by the
increase in problem size and the number of contacts a person has.

Is that a worthwhile solution?
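A sketch of that padding idea (the hash choice and function names are mine,
purely for illustration):

```python
import hashlib
import random

def padded_hashes(number):
    """The ten hashes the client sends: the number with each possible
    padding digit appended. Only the client knows which digit it
    registered under, so an attacker's table must cover all ten."""
    return [hashlib.sha256(f"{number}{d}".encode()).hexdigest()
            for d in range(10)]

def register(number):
    """Server keeps one randomly chosen padded hash per user."""
    return random.choice(padded_hashes(number))

stored = register("15551234")
assert stored in padded_hashes("15551234")   # lookups still match
print(len(padded_hashes("15551234")))        # 10x upload per contact
```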

------
cristiantincu
> Building a social network is not easy. Social networks have value
> proportional to their size, so participants aren’t motivated to join new
> social networks which aren’t already large. It’s a paradox where if people
> haven’t already joined, people aren’t motivated to join.

Value is relative. For stakeholders, value is indeed given by the network’s
size. For users (participants), it’s very debatable whether size equals value.
The author seems to be mistaking the two perspectives for a single one. No
paradox here.

~~~
abdullahkhalids
Not really. I will only want to join a social network if a lot of my friends
are already on it. The fewer of my friends on it, the lower the value I get
out of using the service (ceteris paribus, of course).

------
abdullahkhalids
Is it possible to leverage some sort of peer-assisted discovery, where you
ask some subset of your trusted contacts already on the system to call the
server on your behalf? That way the server is less likely to know whose
contacts are being sent to it.

Of course, this can only be part of the full solution.

Disclaimer: just throwing a wild idea out there; I don't know anything about
the field.

~~~
jcampbell1
If you used Tor for the queries, then it would hide the possibility that I
have drug dealers/prostitutes/bookies in my contact list. I am not sure how
you would then be able to rate-limit it in this case.

------
jahewson
Could you use an enormously expensive hash function so that building a lookup
table is infeasible?

~~~
russellsprouts
Assuming you want to try every phone number in the US, that would be almost
10 billion phone numbers. An above-average user may have about 1,000
contacts. We want this to run on a phone in a reasonable time for 1,000
contacts, but be infeasible for 10 billion on any hardware the attacker might
use.

Bitcoin gives an idea of the relative speeds. I looked up the speeds of some
Android bitcoin miners. All were on the order of 10Khash/s.

The attacker is not constrained to using phones. Custom hardware for Bitcoin
mining can be on the order of 1 Thash/s.

This means the custom hardware can be on the order of 10^8 times faster, but
the work to be done (10 billion hashes vs. 1,000) is only 10^7 times as much.
If the slow hashing algorithm used were anything like Bitcoin's, a determined
attacker could calculate the entire table faster than an honest user could
look up their contacts.

This is not the whole story: this could be mitigated with better hashing
algorithms that are designed not to scale well on better hardware, and the
user only needs to wait maybe a minute, while an attacker could afford to
spend weeks.
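Plugging in those rates makes the asymmetry concrete (the 60-second user
budget is an assumption):

```python
phone_rate = 10_000            # ~10 Khash/s on an Android miner
asic_rate = 10**12             # ~1 Thash/s on custom mining hardware

contacts = 1_000               # the honest user's lookup workload
all_numbers = 10**10           # the full US-ish phone number space

user_time = contacts / phone_rate            # seconds at the base hash cost
attacker_time = all_numbers / asic_rate

# Scale the per-hash cost until the honest user waits a full minute:
slowdown = 60 / user_time
print(f"attacker builds the whole table in about {attacker_time * slowdown:.0f} s")
```

Even with the user pinned at a one-minute wait, the full-table attack
finishes in seconds: the 10^8 hardware gap beats the 10^7 work gap by a
factor of ten, which is the comment's point.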

~~~
jahewson
This is a great answer.

------
amix
I don't really think you need to solve this problem using crypto. A simpler
solution would be to use the "hash it" approach, but restrict how many
lookups each client can do. You can key this on the client's phone number,
IP, or other unique information. That way an attacker would have a very hard
time brute-forcing the database.
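A minimal sliding-window limiter of the kind described (all parameters and
names are made up for illustration, and this only helps if you actually trust
the server to enforce it):

```python
import time
from collections import defaultdict

class RateLimiter:
    """Per-client lookup budget keyed on a verified identifier (a phone
    number here; an IP address would work the same way)."""

    def __init__(self, max_lookups, window_seconds):
        self.max = max_lookups
        self.window = window_seconds
        self.log = defaultdict(list)        # client -> lookup timestamps

    def allow(self, client_id, n=1):
        now = time.monotonic()
        recent = [t for t in self.log[client_id] if now - t < self.window]
        if len(recent) + n > self.max:
            return False                    # over budget: reject the batch
        self.log[client_id] = recent + [now] * n
        return True

rl = RateLimiter(max_lookups=2000, window_seconds=86400)
assert rl.allow("+15551234", n=500)         # a real address book fits
assert not rl.allow("+15551234", n=5000)    # a brute-force batch does not
```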

Additionally, you could use captchas (or other humanity tests, such as SMS
verification) to keep attackers from creating automated accounts in bulk, in
order to fight bots spamming and polluting the network.

~~~
cortesoft
I think the point is to avoid having to trust the server. If you can trust
the server not to do malicious things with the data, then you can use any
number of techniques.

This post is about how to do contact lookup without having to trust the server
to maintain privacy.

~~~
amix
As I see it, the biggest current threat is brute-forcing of this service, at
least based on the recent Snapchat attack. And implementing rate limiting
would solve that issue.

And given that this is a very hard problem to solve, this kind of solution
could be a good band-aid. Anyhow, that's at least my 2 cents on the issue.

