
Ask HN: Securely finding the intersection of two lists? - aaronharnly
I&#x27;ve now three or four times run into the situation where I want to securely and privately find the intersection of a list I have, with a list someone else has – but without either of us seeing one another&#x27;s full list.<p>For example, I want to find discover customers with a prospective partner. But they don&#x27;t want to send us their complete customer list, and we don&#x27;t want to send ours to them.<p>There is CS research into the topic of &quot;Private Set Intersection Protocols&quot;, but I haven&#x27;t found a working implementation. And I could pay an accounting or legal firm to do this manually, but that will cost more money and time than it&#x27;s usually worth.<p>Have you ever heard of somebody doing an escrow service for this, where party A and party B each send their lists to the service, the esrvice verify the size of each list (so we&#x27;re not just spamming the service), and then they send back the matches to both parties? Or a tool simple enough to send to someone else to run locally?
======
klavinski
What about exchanging the hashes of the customers? A common hash would mean a
common customer.

"What if he sends back my list of hashes, although he has no common customer?"

I would do as follows:

1\. Both pick a hashing function (f and g).

2\. Both share their hashing function.

3\. Both hash their list with two functions.

4\. Both send the list of hashes computed with the other's function.

5\. Common hashes <=> common customer.

For example, my list: Amanda, Patrick

His list: Patrick, Stanley

I choose f, he chooses g.

My list |> f = 12, 54 (mine)

My list |> g = 36, 2 (public)

His list |> f = 54, 7 (public)

His list |> g = 2, 68 (his)

I compare both lists with f, and he does the same with g. I conclude the only
common customer is Patrick. He would not be able to send 54 without knowing
Patrick or reverse-engineering the function (or being pretty lucky).

~~~
yorwba
That scheme has the weakness that if the space of possible values is easy to
enumerate (e.g. by buying lists of leaked email addresses off the black
market), you can hash them all and find out whether they're on the other
person's list without their knowledge, and you can do that at any later point
without their cooperation.

~~~
gshdg
What possible solution to this problem isn’t vulnerable to the first part of
this, tho?

~~~
yorwba
This is essentially a "secure multi-party computation" problem: if you had a
trusted third party that both sides can communicate with using encrypted
channels, they could just send their lists to the third party and get the
intersection back.

There are standard techniques in secure multi-party computation that can be
used to eliminate the need for a trusted third party by replacing it with a
cryptographic communication protocol that can only be executed by all
participants acting together, for example garbled circuits [0]. Basically, one
participant creates an encrypted representation of the computation as a
circuit and sends it to the other participant, who can execute the circuit on
encrypted inputs to get an encrypted result, which only the sender of the
circuit can decrypt, so they need to be sent back. IIRC, there's no way to
prevent the sender from simply not sharing the decrypted results with the
receiver, but this is super obvious, so it can't happen in secret. Because the
encryption prevents either party from executing the complete computation on
their own, it's also not possible to secretly try multiple inputs to extract
additional information.

The one attack this can't protect against is if one party lies about the
contents of the list. (Note that this is different from reporting one's own
list honestly, but trying to also get the other list.) While this can't be
prevented, you could require the size of each input list to be part of the
output, which would make this kind of cheating detectable as well.

I'm sure the research literature on this problem also has some more specific
techniques that would be preferable to the generic approach described above.

[0]
[https://en.wikipedia.org/wiki/Garbled_circuit](https://en.wikipedia.org/wiki/Garbled_circuit)

------
aaronharnly
This is one of the many papers on the topic of algorithms to compute set
intersection while revealing little or no additional information - but I
haven't seen anyone running this as a service or writing a tool to make it
possible for two parties:

[https://pdfs.semanticscholar.org/389d/1af665a5111306e1f118ff...](https://pdfs.semanticscholar.org/389d/1af665a5111306e1f118ff4aa9a1dc073cc1.pdf)

~~~
kkm
There is some similar work in the space of checking if the credentials being
used by the user are compromised or not.

Example:
[https://arxiv.org/pdf/1905.13737.pdf](https://arxiv.org/pdf/1905.13737.pdf)

------
gwittel
I haven’t seen an off the shelf thing. It might be possible with PKI
incantations but the first that comes to mind without a third party is
homomorphic encryption.

Libs like MS SEAL only do limited numeric operations. Without reading the
research, that seems fine. Hash each name to a number. Encrypt, then do a
compare. (Lots of hand waving here).

Another possible scheme are things like the Google safe browsing update/client
side protocol. There you’re working against hashes only. Depending on needs it
may be good enough.

------
eb0la
If you can tolerate some error, a bloom filter might work.

Both parties make their own bloom filter and send the results to the other
party.

The good part is no personal data goes inside. I mean, you cannot tell
beforehand if someone is a customer until you try to store their data into the
filter.

Bad news: bloom filters have erros. So, before doing this better guess the
monetary value of the error.

