Hacker News new | comments | show | ask | jobs | submit login
The Difficulty Of Private Contact Discovery (whispersystems.org)
99 points by FredericJ 1206 days ago | hide | past | web | 71 comments | favorite

This is great until you realize the average address book is a disgusting mess of spelling errors, wrong values in wrong fields, punctuation in places where punctuation isn't necessary (commas in phone number fields, digits in name fields, etc.)

The only field that might be fairly consistent is email address and even that is no guarantee. Sure, it's possible to clean up a contact before hashing, but when you consider the fact that the typical phone contains >500 contacts it's not something that can be done in a reasonable amount of time on a modern smartphone. For the last two years I've been working on a product that involves cleaning/organizing contacts and making them searchable, when I started I had no idea what a massive undertaking it would be.

So, hash till your hearts content, but if you have 10 different people with the same contact in their address book I would be willing to bet that you will get 10 different hashes because humans are not good at data entry.

This times a million.

Phone numbers in particular can be a horrid mess to deal with. Some decent libraries have come out to normalize them, but in general it remains difficult.

For example:

013811234 +88-1-3811234 3811234 0118813811234 008813811234

are all be the same number.

To hash them, you could naively just strip off the area codes and country codes, but that is only if you can know ahead of time they are indeed area or country codes. Some countries have 7 digit numbers, others have 9, some have 6, or even 5 digits. Sometimes you don't even know the country to apply these rules to.

Just a thought - Why not reverse the phone number string and if there is a match in the first N digits of all the above phone numbers, then classify these as the same phone number? N would vary from country to country, no doubt. The intermediate characters like {,, +, -, .,} could also be stripped out from the phone number string to make the original problem less complex.

Where can I find your app?

It's not public yet (focused on the enterprise space right now), I'll send you a pm to our website.

edit not sure how to do a pm here, if you use reddit send me a private message user/MagicWishMonkey

There similarity hash algorithms that takes a fuzzy approach to hashing slightly different values to the same hash value.

The SHA256 algorithm described in the article is not a fuzzy algorithm.

Interesting problem!

> It’s also possible to compress “updates” to the bloom filter. The server just needs to calculate the XOR of the version the client has and the updated version, then run that through LZMA (the input will be mostly zeros), and transmit the compressed diff to the client.

> Unfortunately, for a reasonably large user base, this strategy doesn’t work because the bloom filters themselves are too large to transmit to mobile clients. For 10 million TextSecure users, a bloom filter like this would be ~40MB, requested from the server 116 times a second if every client refreshes it once a day.

Wait, the LZMA-compressed daily diffs would be still be ~40MB? If the 40MB is just the initial bloom filter download, that's not so bad. If 40MB is the daily diff, I'd be interested in seeing the calculation for that.

Our feeling is that 40MB for an initial download is not acceptable on mobile devices, particularly in many of the areas we'd like to support. Maybe we're wrong, but it seems like too much on edge networks, particularly given that it's only going to grow as more users join. Ideally we'd have 100MM or 500MM, not just 10MM.

We can definitely make lightweight diffs happen, but that initial download is tough, and is going to get tougher.


1) Bundle a snapshot of the global contact list with the APK. Assume that end users will be on 3G or WI-FI when installing the program. I suppose that is is worth looking at the assumption about what is the largest download that we can live with under EDGE. The current TextSecure APK is 2.5MB as far as I can tell. The best case for the APK download is 1.4 minutes (236kbit/s). If the size is essentially zero, then we need to move to idea 2.

2) Split the rate-limiter into an anonymous proxy controlled by a partially trusted third party. Open Whisper Systems controls the database, but not the proxy. Open Whisper Systems doesn't know where the request comes from. The rate limiter doesn't know what is requested.

Edit: forgot to complete a sentence

So make contact discovery optional. My guess is most of your growth will come from that core of early adopters in the graph, and these folks have nice data plans / fast wifi and won't mind the 40MB download at all.

In the long term, where you're 10x or 50x where you are today, yes the filter size is a problem, but it's a "nice problem to have" as they say.

Side question: suppose I'm a less fortunate user of TextSecure, can I manually exchange keys with someone IRL? How does that work?

"My guess is most of your growth will come from that core of early adopters in the graph, and these folks have nice data plans / fast wifi and won't mind the 40MB download at all."

See my comment in this thread about low resource mobiles, or the article on BZR being overtaken by Github and complaints about resource use from low memory laptop users. Please don't assume technological knowledge or early adoption means plentiful resources from the user.

Aren't you essentially just offering the users to download as much data as they want to reduce the probability of 'collateral uploads' to levels that are acceptable for them?

Could that mode be optional? Give the user the option protecting their contacts privacy at the cost of the large download or to trust your servers to delete the unmatched contacts.

My original device was what you may call an edge case. Underpowered, lack of space, pretty crappy carrier plan.

Thanks for thinking of us. :)

Indeed; if I understand it correctly encrypted data goes into a standard bloom filter.

If that's indeed the case then, assuming that users can never be removed, a daily update would just contain the indexes of the additional bits set by the newly registered users, which could never be 40MB.

It seems they also forgot to include lowering the p-value of the bloom filter as a trade-off, since it seems that a false positive means that a contact is sent to the server which shouldn't have been.

If you push that down to p=.95/.98/.99 or so you'd get a much smaller Bloom filter and many people might not object to a few % of 'collateral uploads' so to speak.

Minimizing the amount of information leaked through these various tradeoffs is an interesting problem though, since you'll have to leak something one way or the other.

A new private-set intersection paper, claiming to scale orders of magnitude better than predecessors, has just been released [1]. I don't know whether it solves your problem, as I haven't read anything beyond the abstract, but you might want to check it out anyway.

[1] https://research.microsoft.com/en-us/um/people/senyk/pubs/sa...

The problem is that it's easy to upload 1000 randomly generated phone numbers and get back results. But I think you can distinguish random or sequentially generated numbers from a real user's actual contact list, by looking at the distribution of friends among the contacts. A real contact list should have a high density of friendships between contacts. The system should only return results if the uploader provides a subgraph that's more highly connected than a random one would be.

You'd need to look at real data and tune the parameters to make this effective.

I think there is a big difference between private contact discovery and contact discovery that isn't brute forceable like we are seeing with SnapChat and others. One change that could be made for these very small pre image corpuses would be to ask the target user if they should be exposed to the searcher.

1) I upload the hashes of my contacts which enqueues a request to those that match the hashes

2) The target of the request dequeues those requests and is given the option to expose themselves to the other user

3) If approved, the original uploader gets access to that user on the service

This would stop the massive discovery issue and would only put a medium amount of friction into the growth engine. It obviously doesn't cover the issue that the service itself has access to the contacts which means they could potentially be exposed. However, if the service discards hashes without matching queues and removes matches when dequeued, the risk to the user is much reduced from the full address book storage case.

This was written off the top of my head, so forgive me if I have left a big hole somewhere.

RTFA. He addresses that in the first few paragraphs. Because there are 10^10 possible phone numbers (approximately), it is very easy to build a rainbow table of all hashed phone numbers and back-track a hash to it's phone number.

Well, actually I did read the article and address that in my comment that I don't think that it is important to hide the address from the server but it is important to hide them from the brute force attacker. I believe that my suggestion stops a brute force attacker from pulling the whole database like you can today with many of these systems.

Properly implemented rate limiting would be a better way to achieve that though. That is were the 'blinding signature' part comes in: to give the server the possibility to intelligently rate limit.

Rate limiting is a terrible solution to this sort of problem IMO. It just forces the attacker to be patient, but they will eventually get what they want, which is name/phone number pairs of the entire numeric space of phone numbers. His solution is problematic because it distributes the burdain to the user, and anyone who's gotten massive numbers of friend requests on a chat service can attest to how unfriendly that is, but it's on the right track. It should not be possible for an attacker to get contact information for arbitrary phone numbers through an api. Something should gate that, probably through the mutuality of having each other on each other's contact list.

We're not necessarily talking about sharing contact information; all we're talking about here is an acknowledgement of existence.

Given that context, due to the nature of the problem, rate limiting is by definition the -only- solution to the problem (if I'm understanding the problem correctly.)

You do have a point that perhaps the question we should be asking should be 'how do we detect mutual connections,' but that's another story altogether.

Under what circumstances do I want someone who I have not even enough interest in to put in my phone's dialer to be able to ping my phone number and get my name back?

The question of whether services should even be DOING this kind of open book "does this phone number have a user attached and what is their name?" query is more fundamental than any other question, imo. It's not another story, it's a skipped first story.

Phone numbers are not unique values, you should never assume that a phone number alone (or even a phone number and last name) is enough information to uniquely identify an individual.

Case in point: I work with a guy who used to work at his dads office. He and his dad both have the same name, and they both had the same office number. This algorithm would incorrectly identify both people as being the same person when in fact they are not.

Phone numbers are unique values when you need to call a phone or send an SMS, which is what this software does. If you were trying to do your taxes, sure...but in this case a phone number is exactly enough.

Lots of folks still use land lines.

They probably aren't trying to use TextSecure with their landlines though.

You should never assume that any piece of information is not enough to uniquely identify an individual.

You are thinking from the point of view of a database administrator, whose job it is to make it easy to uniquely identify an entity.

In security it's about making it as hard as possible to deduce even the slightest of information about a person, increasing the k-anonymity as they say.

For all intents and purposes, that a phone number resolves to only two people, and they're also conveniently located geographically near to each other, and are probably very distinct in a whole range of other things, makes it the perfect key for any security adversary.

You make a good point, my job is to make it simple for machines to differentiate between two people, a task a human could do easily.

I understand how the Encrypted Bloom Filter they describe keeps my contacts private from the service, but it seems to leak the entire contact list the same as the trivial solution:

I install millions of copies of the app, all with a different contact list. I do this signed query process for every possible phone number. I now have the complete list of users.

This post is about building untrusted servers, which is somewhat orthogonal to "the snapchat problem."

The best a privacy preserving protocol can hope to achieve is to match the characteristics of its non-privacy preserving equivalent.

Asymmetric PIR (the server transmits an unencrypted bloom filter) falls short of the non-privacy preserving equivalent, since the user makes offline queries that can't be rate limited by the server. Symmetric PIR (the server transmits an encrypted bloom filter) provides identical properties to the non-privacy preserving equivalent, since the user needs to make queries that can be rate limited by the server, even though the queries leak no information to the server.

How that rate limiting (or unlisted numbers, or whatever) works is outside the scope of the protocol, just as it is in the non-privacy preserving case. For TextSecure, you'd need to verify 1 million phone numbers in order to perform your attack. Whether you can do that or not isn't really the point: we simply want to do as least as well as everyone else, without revealing information to the server.

Is this a real problem?

Is it reasonable to trust for me to trust you to you run some arbitrary binary on my phone that does some magical black box computation on my contact list and then phone home to your service, but not trust to that you will just not be a dick with the raw data?

Stated more simply, if I am paranoid about my privacy, then why am I running your apps and letting you read my contact list at all?

Agreed. The whole thing seems to really miss the problem Snapchat exposed, turning an access control problem into a cryptographic problem. At least I assume this is being posted now because of snapchat.

The problem with how snapchat failed isn't so much that they weren't storing the information securely but that they had absolutely no gate on getting the information aside from a global api key. If you're running their app you've already exposed the phone numbers and are trusting them not to be too evil with them.

My expectation as a user of their service (I'm not, but if I were, and this applies to things like facebook as well, of which I am a user) is that they won't tell people I don't want to know who I am and if I'm using their service.

To this end what's needed is that before it gives another user a positive result (especially tied to my name) to my phone number it should have a reasonable expectation that I want them to have it. If they store the information that lets them derive this expectation in a way that keeps my info secure, that is also fantastic, but it's a second tier concern to becoming literally the whitepages by accident.

They are two separate problems.

The problem is that an application requests your contacts, uploads them to their server, and then who knows what they do with that information afterwards; maybe they store it, perhaps later that database is leaked, or they sell the stored information in the future.

If you trust the service you're using fully (you trust the server will not be compromised, the owners of the servers will not cave to demands to store/release data, the owners will not be malicious with your data) then this isn't a problem at all. Transmit the raw data over SSL then perform the checks and do not store the data.

Yes, this is what I'm saying. It is two separate problems but the only one people seem interested in addressing is the one that is probably not the problem for most people.

Like it or not, people do overtrust companies like snapchat. We need to change that too.

But in the mean time, snapchat's biggest and first mistake was to leave their database an open book without even having to compromise their service or servers. The people who did this used nothing that their client application doesn't use every day, and that is not an unsolvable problem at all. It's no different a problem than every single social network already has to solve, whether their users trust them to encrypt or otherwise obscure their data once obtained or not.

Yes this is a real problem. Being "paranoid about privacy" is not a yes/no decision for people to make. Different people have different things they would like to be private about and to different degrees. Some people may not want it to be possible for anyone to ever know anything at all about them. Those people don't use phones of any nature. Maybe some other people only trust phones that they build themselves. Other people are willing to trust open source software and the communities around it. Some people only care about making dragnet surveillance difficult or expensive. Others need perfect assurance of privacy because their lives are in danger. Others don't care if anyone listens to their calls.

But what you are saying is like looking at football players, seeing that they could still get hurt when they wear lots of padding, and deciding either no one should play the game or we should all play inside giant airbags.

They provide the source. Examine it and build it yourself.

To the end user probably not but if the encryption is performed on the client and the client code is open sourced and you don't auto-update your application then you'll know exactly what is happening to your data when you let it access your contacts.

Not realistic. If it's open-sourced I can go examine the code for sure, but it's not a good use of my time (and I can actually read code, but even if someone had the time but can't read code?). I also have no way of knowing if anyone else has looked at the code at all, and if they have, whether they're skilled enough to determine that the code is kosher. Again, if I struggle with that, how on earth is a non-technical-minded person even going to begin?

The second question I haven't seen asked here yet, is why we need everything to be social to start with? My social needs are covered by Facebook and Twitter, and I don't install their apps on my phone either.

Given those two things, if an app (any app!) requires access to my contacts I don't install it. Period.

You're right it isn't realistic, but it's the only way to fully know that the application is secure. Otherwise you're putting your trust in something/someone.

In a similar discussion, I said something about hashing phone number pairs. The graph has a lot more edges than nodes.

Still doesn't stop someone from attacking a single number though.

I you treat a pair as a one-direction edge, and only match when you have an edge in both directions, then you could add text message verification to prevent people from both impersonating others and extracting additional information out of the service. It would be reasonably trivial then to notify clients when a new match comes in.

This approach seems to require that you trust the server.

Interesting idea. But the address book "friends" must be mutual for this to work. You would miss those contacts that don't have you in the their contacts.

It seems to me like that fits somewhere in the space between "small compromise" and "bonus feature". This idea seems like a very good practical solution--I'm very curious what moxie's take on it would be.

If the design goal is to not trust the server, it is still pretty useless.

Say you make the hash take 1 second on a phone. Without trying to really pin it down, I think an attacker would be able to do an area code in hours with minimal investment. They could do the entire phone number space fairly quickly for a modest investment.

(My attacker there is focusing on the pairs for one number. The area code part comes in because phone numbers really are lumpy like that.)

And this is bad because...?

If they don't have you in their contacts it's probably for a reason. A lot of times (not all, but more than enough) it's probably because they've basically stalked you.

So don't do it. I know I'm probably in the minority here but I've never given another app permission to use my facebook or twitter account to "Find Friends" and I personally find the whole practice offensive. Just have an option to invite people by their email address/number/whatever, but not in bulk.

You seem to dislike inviting a bulk of friends to a service. But the article doesn't talk about invites. The problem is securely discovering who's already invited. It's not offensive in my opinion.

Not everyone wants to be found; and not everyone wants to find every single contact in their phone. Especially for something like RedPhone whose purported aim is user privacy and security...I'm not sure why they'd want to even touch 5000+ contacts from a single user (or why someone who used RedPhone would want to have them scan every contact on their phone), even if they do use some kind of magical unicorn encryption sauce.

What about this?

1) Client uploads a bloom filter with all contacts on phone

2) Server responds with a bloom filter with all registered contacts that match the client's bloom filter

3) Client displays contacts that match server's bloom filter

You can optionally trade contacts back and forth again with a larger bits/contact ratio to decrease false positives.

I think it works out so that in exchange for 7 bits of information about each contact from the client, you can reduce the server's response by a factor of 128.

The problem with this is that (1) + (2) means the server now knows the user's address book. They're trying to avoid that with encrypted bloom filters and blind signature queries.

First, it looks like this scheme is broken due to cpu constraints. However...

It had looked more like the encrypted bloom filter was intended to prevent the client from obtaining the list of registered users.

With (1) + (2), the server only has a few bits of information about each of the phone's contacts. It would be analogous to just having the area codes.

What about a slightly modified idea:

(0) The server informs the client about the parameters of the bloom filter where it currently keeps all the contacts.

(1) The client builds the bloom filter with the same parameters based on its address book. After building the bloom filter, the client XORs it with a fuzzing pattern [a].

(2) The server performs a logical AND of the received bloom filter and (its own bloom filter XOR another fuzzing pattern [b]), and sends back the result.

(3) The client XORs the result with the fuzzing pattern [a], and uses this bloom filter to perform the local queries.

(4) The client stores the differences in results between the subsequent results, and each time the same result is returned, it doubles the query interval [c].

(5) The client sends its own ID with each query, so as the network grows and the server inevitably needs to recreate the bloom filter with different parameters, it can do so gradually ahead of time from scratch just based on the IDs of the queriers.

[a] I haven't done any napkin calculations of how random it should be, nor how many bits would need to be set to 1. The idea is to add a certain amount of false negatives here so the server were not able to recover the contact list just by bruteforce lookups. To avoid the leakage of information by correlating the subsequent requests, this pattern should probably be a PRF(client_contacts), rather than a purely random function - thus, if the contact list did not change, it should not change either, and if the contact list did change, it should change in a way to mask the change in the contact list somewhat.

[b] This may not be needed. The idea is to try to protect against the client sending an all-1 list and getting the entire bloom filter. But given that both [a] and [b] may both introduce false negatives and false positives, maybe there should be another approach to restrict what is sent to the client.

[c] Especially given the [a] and [b], chances are quite high that this might be a no-op.

It is obvious this method will introduce the false negatives, and if [b] is used, can add false positives as well.

The biggest problem with that is:

1) Client uploads a bloom filter consisting entirely of ones.

2) Server responds by sending its entire contact database.

How would you do (2) without iterating over all registered contacts on the server?

Good point. Now I understand why they were suggesting bucketing.

I came up with this method for maintaining privacy while retrieving installed apps (to give app recommendations). Sounds like it might not translate across so well.

40 MiB for ten million users? I think you should ditch the bloom filter if the false-positive rate is set so low it takes that much space.

Assuming you can map phone numbers onto the integers from 0 to ten billion, trivial compression of the is-a-user/is-not-a-user bit sequence should easily fit in ~14MiB [1].

See also: succinct data structures [2] for accessing data without decompressing it.

(This is obviously a short term solution. Growth will overcome the savings.)

1: http://www.wolframalpha.com/input/?i=%28-p+lg_2%28p%29-%281-...

2: http://en.wikipedia.org/wiki/Succinct_data_structure

It's not a technical problem it's a value prop problem. Most users don't care about privacy....

Fundamentally there is more value for the network operator to grow the network than for users to disclose their details. Email works perfectly fine without having to disclose your address book because no one owns the network and therefore do not care how many people are using email.

Alternatively one could just allow users to choose whether to make their info public and then the client can just search a directory.

What about artificially increasing the size of the hash input. For instance: when finding out if a a contact exists the client sends 10 iterations of the contact (I.E. the contact number with another digit, each 0-9) hashed, the server keeps only one random hashed one to compare to (presumably the client hashes one of them randomly when sending it's own contact information).

That way you increase the size of the problem from 10^10 to 10^11 but only increase your bandwidth 10 fold.

It still keeps the need to balance the amount of bandwidth used, but the bandwidth is no longer determined by the amount of all the users in the system, only the increase in problem size and the amount of contacts a person has.

Is that a worthwhile solution?

> Building a social network is not easy. Social networks have value proportional to their size, so participants aren’t motivated to join new social networks which aren’t already large. It’s a paradox where if people haven’t already joined, people aren’t motivated to join.

Value is relative. For stakeholders, value is indeed given by the network’s size. For users (participants), it’s very debatable whether size equals value. The author seems to be mistaking the two perspectives for a single one. No paradox here.

Not really. I will only want to join a social network if a lot of my friends are already on it. The fewer of my friends on it, the lower the value I get out of using the service (cetris paribus of course).

Is is possible to leverage some sort of peer-assisted discovery? Where you ask some sub-set of your trusted contacts already on the system, to call the server on your behalf. So that the server is less likely to know whose contacts are being sent to it.

Of course, this can only be part of the full solution.

Disclaimer: just throwing a wild idea out there. don't know anything about the field.

If you used TOR for the queries, then it would hide the possibility that I have drug dealers/prostitutes/bookies in my contact list. I am not sure how you would then be able to rate limit it in this case.

Could you use an enormously expensive hash function so that building a lookup table is infeasible?

Assuming you want to try every phone number in the US, that would be almost 10 billion phone numbers. An above average user may have about 1,000 contacts. We want this to run on phones in a reasonable time for 1000 contacts, but be infeasible for 10 billion using any hardware the attacker uses.

Bitcoin gives an idea of the relative speeds. I looked up the speeds of some Android bitcoin miners. All were on the order of 10Khash/s.

The attacker is not constrained to use phones. Custom hardware for BitCoin mining can be on the order of 1Thash/s.

This means that the custom hardware can reach 10^8 times faster, but the work to be done -- 10 billion hashes vs. 1000, is only 10^7 times as much. If the slow hashing algorithm used was anything like the one for BitCoins, a determined attacker could calculate the entire table faster than the person looking up their contacts.

This is not the whole story -- this could be mitigated using better hashing algorithms that are designed to not scale well with better hardware, but the user will only wait maybe a minute, while an attacker could spend weeks.

This is a great answer.

I don't really think you need to solve this problem using crypto. I think a simpler solution would be to use the "hash it" solution, but restrict how many lookups each client can do. You can either do this by the using the client's phone number, IP or other unique information etc. This way an attacker would have a very hard time brute forcing this.

Additionally you could use captchas (or other humanity tests, such as SMS verification) to limit hackers creating automated accounts (in order to fight automated bots spamming and polluting the network).

I think the point it to avoid having to trust the server. If you can trust the server not to do malicious things with the data, then you can do any number of techniques.

This post is about how to do contact lookup without having to trust the server to maintain privacy.

As I see it, the biggest current threat is bruteforcing of this service, at least based on the recent snapchat attack. And implementing rate limiting would solve this issue.

And given that this is a very hard problem to solve this kind of solution could be a good bandaid. Anyhow, this is at least my 2 cents on this issue.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact