The only field that might be fairly consistent is email address, and even that is no guarantee. Sure, it's possible to clean up a contact before hashing, but when you consider that the typical phone contains >500 contacts, it's not something that can be done in a reasonable amount of time on a modern smartphone. For the last two years I've been working on a product that involves cleaning/organizing contacts and making them searchable; when I started, I had no idea what a massive undertaking it would be.
So, hash to your heart's content, but if you have 10 different people with the same contact in their address books, I would be willing to bet that you will get 10 different hashes, because humans are not good at data entry.
Phone numbers in particular can be a horrid mess to deal with. Some decent libraries have come out to normalize them, but in general it remains difficult.
A number written with a country code, without one, with or without an area code, or with different separators can all be the same number.
To hash them, you could naively just strip off the area codes and country codes, but only if you know ahead of time that they are indeed area or country codes. Some countries have 7-digit numbers, others have 9, some have 6, or even 5 digits. Sometimes you don't even know which country's rules to apply.
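For what it's worth, one of the libraries mentioned above is Google's libphonenumber. A minimal sketch of normalizing to E.164 before hashing, using its Python port (the `default_region` guess and the example numbers are just illustrations of the ambiguity described above):

```python
import hashlib
import phonenumbers  # Python port of Google's libphonenumber

def hash_contact(raw_number, default_region="US"):
    """Normalize a raw contact number to E.164, then hash it.

    default_region is only a guess; numbers entered without a country code
    are ambiguous, which is exactly the problem described above.
    """
    try:
        parsed = phonenumbers.parse(raw_number, default_region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    e164 = phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    return hashlib.sha256(e164.encode()).hexdigest()

# "(212) 555-0100", "212-555-0100" and "+1 212 555 0100" all hash identically;
# the same digits entered with a different default_region may not.
```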
Edit: not sure how to do a PM here; if you use reddit, send me a private message at user/MagicWishMonkey
> It’s also possible to compress “updates” to the bloom filter. The server just needs to calculate the XOR of the version the client has and the updated version, then run that through LZMA (the input will be mostly zeros), and transmit the compressed diff to the client.
> Unfortunately, for a reasonably large user base, this strategy doesn’t work because the bloom filters themselves are too large to transmit to mobile clients. For 10 million TextSecure users, a bloom filter like this would be ~40MB, requested from the server 116 times a second if every client refreshes it once a day.
Wait, the LZMA-compressed daily diffs would still be ~40MB? If the 40MB is just the initial bloom filter download, that's not so bad. If 40MB is the daily diff, I'd be interested in seeing the calculation for that.
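For a rough sense of where a ~40MB figure comes from: a standard Bloom filter sized for 10 million entries lands in the same ballpark, under the assumption of a false-positive rate around 10^-6 (the post doesn't state the rate, so that part is a guess):

```python
import math

n = 10_000_000   # registered users
p = 1e-6         # assumed false-positive rate; the post doesn't say

# Optimal Bloom filter: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln(2) hash functions
m_bits = -n * math.log(p) / math.log(2) ** 2
k = m_bits / n * math.log(2)

print(f"{m_bits / 8 / 1e6:.0f} MB with k = {k:.0f} hash functions")
# -> ~36 MB, roughly the ~40MB figure quoted above
```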
We can definitely make lightweight diffs happen, but that initial download is tough, and is going to get tougher.
1) Bundle a snapshot of the global contact list with the APK. Assume that end users will be on 3G or Wi-Fi when installing the program. I suppose it is worth examining the assumption about the largest download we can live with under EDGE. The current TextSecure APK is 2.5MB as far as I can tell. The best case for the APK download is 1.4 minutes (at 236 kbit/s). If the extra size we can afford is essentially zero, then we need to move to idea 2.
2) Split the rate-limiter into an anonymous proxy controlled by a partially trusted third party. Open Whisper Systems controls the database, but not the proxy. Open Whisper Systems doesn't know where the request comes from. The rate limiter doesn't know what is requested.
Edit: forgot to complete a sentence
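A quick sanity check of the EDGE estimate in idea 1, taking the 2.5MB APK and ~236 kbit/s best-case throughput at face value:

```python
apk_bytes = 2.5e6        # current TextSecure APK, roughly
edge_bps = 236e3         # best-case EDGE throughput, bits/sec

minutes = apk_bytes * 8 / edge_bps / 60
print(f"{minutes:.1f} min")   # ~1.4 minutes, as stated above
```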
In the long term, where you're 10x or 50x where you are today, yes the filter size is a problem, but it's a "nice problem to have" as they say.
Side question: suppose I'm a less fortunate user of TextSecure, can I manually exchange keys with someone IRL? How does that work?
See my comment in this thread about low-resource mobiles, or the article on BZR being overtaken by GitHub and the complaints about resource use from low-memory laptop users. Please don't assume that technological knowledge or early adoption means the user has plentiful resources.
Thanks for thinking of us. :)
If that's indeed the case then, assuming that users can never be removed, a daily update would just contain the indexes of the additional bits set by the newly registered users, which could never be 40MB.
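A toy check of that intuition (the filter size follows the sizing sketch earlier in the thread; the signup rate and bits-per-user are pure guesses): XOR the old and new filters and LZMA-compress the mostly-zero diff, as the quoted article suggests.

```python
import lzma, random

M_BITS = 287_500_000                     # ~36MB filter, per the sizing sketch above
NEW_BITS = 20 * 20_000                   # guess: 20k signups/day, ~20 bits set each

old = bytearray(M_BITS // 8)             # yesterday's filter (zeros, for illustration)
new = bytearray(old)
for _ in range(NEW_BITS):                # bits added by newly registered users
    i = random.randrange(M_BITS)
    new[i // 8] |= 1 << (i % 8)

# XOR the two versions (mostly zero bytes), then LZMA-compress the diff
diff = (int.from_bytes(old, "big") ^ int.from_bytes(new, "big")).to_bytes(len(old), "big")
print(f"{len(lzma.compress(diff)) / 1024:.0f} KiB")
# orders of magnitude smaller than shipping the full ~36MB filter
```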
It seems they also forgot to list lowering the p-value of the bloom filter as a trade-off, since a false positive just means that a contact is sent to the server that shouldn't have been.
If you push that down to p=.95/.98/.99 or so you'd get a much smaller Bloom filter and many people might not object to a few % of 'collateral uploads' so to speak.
Minimizing the amount of information leaked through these various tradeoffs is an interesting problem though, since you'll have to leak something one way or the other.
You'd need to look at real data and tune the parameters to make this effective.
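To put rough numbers on that trade-off, using the same standard sizing formula as above and treating the "few % of collateral uploads" as the filter's false-positive rate (all figures illustrative):

```python
import math

n = 10_000_000

def mb_for(fp):                        # standard sizing: m = n*ln(1/fp)/(ln 2)^2 bits
    return n * math.log(1 / fp) / math.log(2) ** 2 / 8 / 1e6

for fp in (1e-6, 0.01, 0.02, 0.05):    # fp = 1 - p, i.e. the 'collateral upload' rate
    print(f"fp = {fp}: {mb_for(fp):.1f} MB")
# 1e-6 -> ~36 MB, 1% -> ~12 MB, 2% -> ~10 MB, 5% -> ~8 MB
```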
1) I upload the hashes of my contacts which enqueues a request to those that match the hashes
2) The target of the request dequeues those requests and is given the option to expose themselves to the other user
3) If approved, the original uploader gets access to that user on the service
This would stop the massive discovery issue and would only put a medium amount of friction into the growth engine. It obviously doesn't cover the issue that the service itself has access to the contacts which means they could potentially be exposed. However, if the service discards hashes without matching queues and removes matches when dequeued, the risk to the user is much reduced from the full address book storage case.
This was written off the top of my head, so forgive me if I have left a big hole somewhere.
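A minimal sketch of that queue-and-approve flow (the class, method names, and in-memory storage are purely illustrative):

```python
from collections import defaultdict

class DiscoveryQueue:
    """Illustrative in-memory model of the queue-and-approve flow above."""

    def __init__(self):
        self.registered = {}                  # contact_hash -> user_id
        self.pending = defaultdict(list)      # user_id -> [requesting user_ids]
        self.approved = defaultdict(set)      # requester -> {exposed user_ids}

    def register(self, user_id, own_hash):
        self.registered[own_hash] = user_id

    def upload_hashes(self, requester, contact_hashes):
        # 1) Enqueue a request for every hash that matches a registered user;
        #    hashes with no matching queue are discarded immediately.
        for h in contact_hashes:
            target = self.registered.get(h)
            if target is not None:
                self.pending[target].append(requester)

    def dequeue(self, user):
        # 2) The target sees who is asking and decides whether to expose themselves.
        requests, self.pending[user] = self.pending[user], []
        return requests

    def approve(self, user, requester):
        # 3) Only on approval does the requester learn that the user is on the service.
        self.approved[requester].add(user)
```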
Given that context, due to the nature of the problem, rate limiting is by definition the -only- solution to the problem (if I'm understanding the problem correctly.)
You do have a point that perhaps the question we should be asking should be 'how do we detect mutual connections,' but that's another story altogether.
The question of whether services should even be DOING this kind of open book "does this phone number have a user attached and what is their name?" query is more fundamental than any other question, imo. It's not another story, it's a skipped first story.
Case in point: I work with a guy who used to work at his dads office. He and his dad both have the same name, and they both had the same office number. This algorithm would incorrectly identify both people as being the same person when in fact they are not.
You are thinking from the point of view of a database administrator, whose job it is to make it easy to uniquely identify an entity.
In security it's about making it as hard as possible to deduce even the slightest of information about a person, increasing the k-anonymity as they say.
For all intents and purposes, the fact that a phone number resolves to only two people, who are also conveniently located geographically near each other and are probably very distinct in a whole range of other ways, makes it a perfect key for any security adversary.
I install millions of copies of the app, all with a different contact list. I do this signed query process for every possible phone number. I now have the complete list of users.
The best a privacy preserving protocol can hope to achieve is to match the characteristics of its non-privacy preserving equivalent.
Asymmetric PIR (the server transmits an unencrypted bloom filter) falls short of the non-privacy preserving equivalent, since the user makes offline queries that can't be rate limited by the server. Symmetric PIR (the server transmits an encrypted bloom filter) provides identical properties to the non-privacy preserving equivalent, since the user needs to make queries that can be rate limited by the server, even though the queries leak no information to the server.
How that rate limiting (or unlisted numbers, or whatever) works is outside the scope of the protocol, just as it is in the non-privacy preserving case. For TextSecure, you'd need to verify 1 million phone numbers in order to perform your attack. Whether you can do that or not isn't really the point: we simply want to do at least as well as everyone else, without revealing information to the server.
Is it reasonable for me to trust you to run some arbitrary binary on my phone that does some magical black-box computation on my contact list and then phones home to your service, but not to trust that you will simply not be a dick with the raw data?
Stated more simply, if I am paranoid about my privacy, then why am I running your apps and letting you read my contact list at all?
The problem with how snapchat failed isn't so much that they weren't storing the information securely but that they had absolutely no gate on getting the information aside from a global api key. If you're running their app you've already exposed the phone numbers and are trusting them not to be too evil with them.
My expectation as a user of their service (I'm not, but if I were; and this applies to things like facebook as well, of which I am a user) is that they won't reveal who I am, or whether I'm using their service, to people I don't want knowing those things.
To this end, what's needed is that before the service gives another user a positive result for my phone number (especially one tied to my name), it should have a reasonable expectation that I want them to have it. If it stores the information that lets it derive this expectation in a way that keeps my info secure, that is also fantastic, but that's a second-tier concern next to accidentally becoming, literally, the white pages.
The problem is that an application requests your contacts, uploads them to their server, and then who knows what they do with that information afterwards; maybe they store it, perhaps later that database is leaked, or they sell the stored information in the future.
If you trust the service you're using fully (you trust the server will not be compromised, the owners of the servers will not cave to demands to store/release data, the owners will not be malicious with your data) then this isn't a problem at all. Transmit the raw data over SSL then perform the checks and do not store the data.
Like it or not, people do overtrust companies like snapchat. We need to change that too.
But in the meantime, snapchat's first and biggest mistake was to leave their database an open book that could be read without anyone even having to compromise their service or servers. The people who did this used nothing that the client application doesn't use every day, and that is not an unsolvable problem at all. It's no different a problem than every single social network already has to solve, whether or not their users trust them to encrypt or otherwise obscure their data once it's obtained.
But what you are saying is like looking at football players, seeing that they could still get hurt when they wear lots of padding, and deciding either no one should play the game or we should all play inside giant airbags.
The second question I haven't seen asked here yet is why everything needs to be social to start with. My social needs are covered by Facebook and Twitter, and I don't install their apps on my phone either.
Given those two things, if an app (any app!) requires access to my contacts I don't install it. Period.
Still doesn't stop someone from attacking a single number though.
This approach seems to require that you trust the server.
Say you make the hash take 1 second on a phone. Without trying to really pin it down, I think an attacker would be able to do an area code in hours with minimal investment. They could do the entire phone number space fairly quickly for a modest investment.
(My attacker there is focusing on the pairs for one number. The area code part comes in because phone numbers really are lumpy like that.)
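Rough numbers behind that estimate, with the hash tuned to ~1 second on a phone as stated above, and an attacker assumed to be only ~1000x faster than a handset (that speedup is a guess):

```python
numbers_per_area_code = 10**7        # the 7-digit local portion
phone_hashes_per_sec = 1             # hash deliberately tuned to ~1s on a phone
attacker_speedup = 1_000             # assumed: modest attacker hardware vs. a phone

hours = numbers_per_area_code / (phone_hashes_per_sec * attacker_speedup) / 3600
print(f"~{hours:.1f} hours per area code")   # ~2.8 hours
```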
If they don't have you in their contacts it's probably for a reason. A lot of times (not all, but more than enough) it's probably because they've basically stalked you.
1) Client uploads a bloom filter with all contacts on phone
2) Server responds with a bloom filter with all registered contacts that match the client's bloom filter
3) Client displays contacts that match server's bloom filter
You can optionally trade contacts back and forth again with a larger bits/contact ratio to decrease false positives.
I think it works out so that in exchange for 7 bits of information about each contact from the client, you can reduce the server's response by a factor of 128.
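A toy version of steps 1-3 above (the filter sizes, hash counts, and sample numbers are all made up for illustration):

```python
import hashlib

class Bloom:
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # k positions derived from salted hashes of the item
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

my_contacts = ["+12125550100", "+442079460000"]    # client's address book (sample)
registered = ["+12125550100", "+15555550123"]      # server's user list (sample)

# 1) client sends a small filter of its contacts (only a few bits per contact)
client_filter = Bloom(64, 4)
for n in my_contacts:
    client_filter.add(n)

# 2) server replies with a larger filter containing only the registered users
#    that hit the client's filter (more bits per contact -> fewer false positives)
reply = Bloom(1024, 7)
for n in registered:
    if n in client_filter:
        reply.add(n)

# 3) client keeps the contacts that hit the server's reply
print([n for n in my_contacts if n in reply])
```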
It had looked more like the encrypted bloom filter was intended to prevent the client from obtaining the list of registered users.
With (1) + (2), the server only has a few bits of information about each of the phone's contacts. It would be analogous to just having the area codes.
(0) The server informs the client about the parameters of the bloom filter where it currently keeps all the contacts.
(1) The client builds the bloom filter with the same parameters based on its address book. After building the bloom filter, the client XORs it with a fuzzing pattern [a].
(2) The server performs a logical AND of the received bloom filter and (its own bloom filter XOR another fuzzing pattern [b]), and sends back the result.
(3) The client XORs the result with the fuzzing pattern [a], and uses this bloom filter to perform the local queries.
(4) The client stores the differences between subsequent results, and each time the same result is returned, it doubles the query interval [c].
(5) The client sends its own ID with each query, so as the network grows and the server inevitably needs to recreate the bloom filter with different parameters, it can do so gradually ahead of time from scratch just based on the IDs of the queriers.
[a] I haven't done any napkin calculations of how random it should be, nor how many bits would need to be set to 1. The idea is to add a certain amount of false negatives here so the server would not be able to recover the contact list just by brute-force lookups. To avoid leaking information by correlating subsequent requests, this pattern should probably be a PRF(client_contacts) rather than a purely random function - thus, if the contact list did not change, the pattern would not change either, and if the contact list did change, the pattern would change in a way that somewhat masks the change in the contact list.
[b] This may not be needed. The idea is to protect against the client sending an all-1s filter and getting back the entire bloom filter. But given that [a] and [b] may both introduce false negatives and false positives, maybe there should be another approach to restricting what is sent to the client.
[c] Especially given the [a] and [b], chances are quite high that this might be a no-op.
It is obvious this method will introduce false negatives, and if [b] is used, it can add false positives as well.
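A literal transcription of steps (1)-(3) in terms of bit operations, just to show where each XOR and AND goes; the filters and fuzzing patterns here are random placeholders, not sensible parameters:

```python
import os

M = 1 << 17                                   # filter size in bytes (placeholder)

xor = lambda x, y: bytes(p ^ q for p, q in zip(x, y))
land = lambda x, y: bytes(p & q for p, q in zip(x, y))

server_filter = os.urandom(M)                 # stand-in for the server's contact filter
client_filter = os.urandom(M)                 # stand-in for the client's address-book filter
a = os.urandom(M)                             # fuzzing pattern [a], client-side
b = os.urandom(M)                             # fuzzing pattern [b], server-side

sent = xor(client_filter, a)                  # step (1): client masks its filter
reply = land(sent, xor(server_filter, b))     # step (2): server ANDs with its masked filter
local = xor(reply, a)                         # step (3): client unmasks, then queries locally
```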
1) Client uploads a bloom filter consisting entirely of ones.
2) Server responds by sending its entire contact database.
I came up with this method for maintaining privacy while retrieving installed apps (to give app recommendations). Sounds like it might not translate across so well.
Assuming you can map phone numbers onto the integers from 0 to ten billion, trivial compression of the is-a-user/is-not-a-user bit sequence should easily fit in ~14MiB.
See also: succinct data structures, for accessing data without decompressing it.
(This is obviously a short term solution. Growth will overcome the savings.)
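The information-theoretic check behind that ~14MiB, assuming ~10 million users spread over the 10-billion-number space:

```python
import math

N = 10**10          # representable phone numbers
n = 10**7           # assumed registered users -> bit density p = 1e-3
p = n / N

# Entropy of a Bernoulli(p) bit gives the lower bound for lossless compression
h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(f"{N * h / 8 / 2**20:.1f} MiB")   # ~13.6 MiB, matching the ~14MiB estimate
```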
Fundamentally there is more value for the network operator in growing the network than there is for users in disclosing their details. Email works perfectly fine without having to disclose your address book, because no one owns the network and therefore no one cares how many people are using email.
Alternatively one could just allow users to choose whether to make their info public and then the client can just search a directory.
That way you increase the size of the problem from 10^10 to 10^11 but only increase your bandwidth 10 fold.
It still leaves the need to balance the amount of bandwidth used, but the bandwidth is no longer determined by the number of users in the system, only by the increase in problem size and the number of contacts a person has.
Is that a worthwhile solution?
Value is relative. For stakeholders, value is indeed given by the network’s size. For users (participants), it’s very debatable whether size equals value. The author seems to be mistaking the two perspectives for a single one. No paradox here.
Of course, this can only be part of the full solution.
Disclaimer: just throwing a wild idea out there; I don't know anything about the field.
Bitcoin gives an idea of the relative speeds. I looked up the speeds of some Android bitcoin miners. All were on the order of 10Khash/s.
The attacker is not constrained to use phones. Custom hardware for BitCoin mining can be on the order of 1Thash/s.
This means the custom hardware can be 10^8 times faster, but the work to be done -- 10 billion hashes vs. 1,000 -- is only 10^7 times as much. If the slow hashing algorithm used were anything like the one for Bitcoin, a determined attacker could calculate the entire table faster than a person looking up their contacts.
This is not the whole story -- this could be mitigated using better hashing algorithms that are designed to not scale well with better hardware, but the user will only wait maybe a minute, while an attacker could spend weeks.
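The ratio argument in that comment, spelled out with the rates quoted above (Android miners vs. custom mining hardware):

```python
phone_rate = 1e4          # hashes/sec on an Android bitcoin miner (quoted above)
asic_rate = 1e12          # hashes/sec on custom mining hardware (quoted above)

speed_ratio = asic_rate / phone_rate    # 1e8: how much faster the attacker hashes
work_ratio = 1e10 / 1e3                 # 1e7: whole number space vs. one address book

# Any constant slowdown in the hash applies to both sides equally, so the attacker
# still finishes the full table ~10x sooner than the user finishes 1,000 lookups.
print(speed_ratio / work_ratio)         # 10.0
```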
Additionally you could use captchas (or other humanity tests, such as SMS verification) to limit hackers creating automated accounts (in order to fight automated bots spamming and polluting the network).
This post is about how to do contact lookup without having to trust the server to maintain privacy.
And given that this is a very hard problem to solve, this kind of solution could be a good band-aid. Anyhow, that's at least my 2 cents on this issue.