

Cryptographically Secure Bloom Filters [pdf] - api
http://www.tdp.cat/issues/tdp.a015a09.pdf

======
zaroth
Background

A bloom filter is a bitmap which lets you test whether an item is a member of
a set, in constant time and constant space. It tells you with 100% certainty
that an item is NOT in the set, and with variable certainty that an item MIGHT
be in the set. In other words, it answers yes/no to set membership with a
chance of false positives, but no chance of false negatives.
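
In code, the mechanics are just a bit array plus k salted hashes (a minimal
sketch; the class name, sizes, and sha256-based hashing are illustrative
choices, and real implementations use faster hashes and packed bit arrays):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=5):
        self.m, self.k = m_bits, k_hashes
        self.bits = [0] * m_bits

    def _positions(self, item):
        # Derive k positions by salting a single hash function with an index.
        return [int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest(),
                               "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # False means definitely absent; True means probably present.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("hello")
print(bf.might_contain("hello"))   # True
print(bf.might_contain("goodbye")) # False (with overwhelming probability)
```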

The probability of a false positive depends on the number of items in the set
and the number of bits in the bitmap. Here
([http://hur.st/bloomfilter](http://hur.st/bloomfilter)) is a calculator, for
example: you enter how many items are in the set and what false-positive
probability you want, and it tells you how many bits the bitmap needs and how
many bits you should set each time you add an element to the filter (they
call this 'number of hashes').
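
What that calculator computes are the standard closed-form estimates, which
fit in a few lines (a sketch; the function name is mine):

```python
import math

def bloom_params(n, p):
    """Optimal bit count m and hash count k for n items at
    false-positive rate p (the standard closed-form estimates)."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k

# e.g. 1 million items at a 1% false-positive rate:
m, k = bloom_params(1_000_000, 0.01)
print(m, k)  # roughly 9.6 million bits (~1.2 MB) and 7 hashes
```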

So you can imagine there are a lot of really interesting applications for
bloom filters when you want a really fast, compact way of knowing if something
is LIKELY to be true. Like, 'is this request in my cache?'

Bloom filters can also be used to provide some interesting privacy-preserving
properties. In Bitcoin, for example, you have a wallet with money stored at a
bunch of addresses, and unused addresses where new money can come in. You
don't want anyone to just be able to look into your wallet and know all your
addresses, but you DO want to be notified when someone sends you money. So you
load up a bloom filter with a bunch of addresses which you expect any new
payments will come in on, and you can share that bitmap with other people on
the network. When new payments come in, the server can check them against the
bitmap and forward you any matching transactions. In this case, you would
setup a high probability of false positives, because you don't want even the
servers to know which transactions are yours, but you don't want the server to
just send you EVERYTHING.

Problem Statement

So we have a set 'S' which has been encoded in a bloom filter (a.k.a. bitmap)
on the server. And we have a client with an element 'x' whose membership in
the set they want to check. But we don't want to disclose 'x' to the server,
and we don't want to disclose the bitmap to the client. As the paper puts it,
'x' and BF(S) are each kept secret from the other party.

On the face of it, this seems really easy just by the very nature of how bloom
filters work. The client would give the server a list of bits to check are
TRUE in the bitmap, doing all the hashing to compute that checklist on the
client side. This appears to keep 'x' secret from the server, but it's
obviously trivial for the client to issue enough requests to ultimately
discover the full state of the bitmap. Also, if 'x' IS in the set, the server
could compare the checklist it received from the client to the checklist of
each element in the set, and discover 'x'.
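
The naive protocol, and the second leak, can be sketched like this (sizes and
hashing choices are illustrative, not from the paper):

```python
import hashlib

M_BITS, K_HASHES = 1024, 5

def checklist(x):
    # Client-side: hash x into k bit positions, so the server never sees x.
    return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest(),
                           "big") % M_BITS
            for i in range(K_HASHES)]

def server_build(elements):
    bits = [0] * M_BITS
    for x in elements:
        for p in checklist(x):
            bits[p] = 1
    return bits

def server_check(bits, positions):
    return all(bits[p] for p in positions)

bits = server_build(["alice", "bob"])
print(server_check(bits, checklist("alice")))  # True

# The leak: the checklist is a deterministic, public function of x, so the
# server can precompute checklist(s) for every s in S and match incoming
# queries against them, recovering any queried x that is in the set.
print(checklist("alice") == checklist("alice"))  # True: queries are linkable
```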

Method

What the paper says is: instead of hashing each element to build the
checklist, we will hash a blind signature of each element. A blind signature
is a cryptographic signature over 'x' made with the server's signing key
(sk), produced in such a way that the signer never learns 'x' itself.

To construct the bloom filter on the server, instead of just hashing 'x', we
will hash the signature of 'x'. Note, at this point it's not a 'blind'
signature, since it's all happening on the server.

Then when a client wants to check 'x' against BF, the server can actually give
the BF to the client, because in order for the client to USE the bloom filter,
it needs to get its 'x' signed by the server! Now, we can use a blind-
signature protocol between the client and the server, to basically "unlock"
the BF for a specific 'x', without ever seeing 'x' on the server.

It's like server-based access control to a client-side bloom filter.
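
A toy end-to-end sketch of that shape using textbook RSA blinding (the key,
filter parameters, and hashing choices are illustrative, not the paper's
construction; a real system would use large keys and a padded, standardized
blind-signature scheme):

```python
import hashlib
import math
import random

# Toy textbook-RSA parameters (p=61, q=53). D is the server's secret key.
N, E, D = 3233, 17, 2753
M_BITS, K_HASHES = 1024, 5

def h(x):
    """Map an element to an integer message in [0, N)."""
    return int.from_bytes(hashlib.sha256(x.encode()).digest(), "big") % N

def bf_positions(sig):
    """Derive k bit positions from a signature value."""
    return [int.from_bytes(hashlib.sha256(f"{sig}:{i}".encode()).digest(),
                           "big") % M_BITS
            for i in range(K_HASHES)]

def build_filter(elements):
    # Server: the filter holds *signatures* of elements, not elements.
    bits = [0] * M_BITS
    for x in elements:
        sig = pow(h(x), D, N)                 # ordinary (non-blind) signature
        for p in bf_positions(sig):
            bits[p] = 1
    return bits

def client_blind(x):
    # Client: blind h(x) with a random factor r, so the server can sign
    # it without ever seeing h(x).
    m = h(x)
    while True:
        r = random.randrange(2, N)
        if math.gcd(r, N) == 1:
            break
    return (m * pow(r, E, N)) % N, r

def server_sign(blinded):
    return pow(blinded, D, N)                 # server learns nothing about m

def client_unblind(blinded_sig, r):
    return (blinded_sig * pow(r, -1, N)) % N  # = h(x)^D mod N

def query(bits, x):
    blinded, r = client_blind(x)
    sig = client_unblind(server_sign(blinded), r)
    return all(bits[p] for p in bf_positions(sig))

bits = build_filter(["alice", "bob"])
print(query(bits, "alice"))    # True: member
print(query(bits, "mallory"))  # False (with overwhelming probability)
```

Blinding works because (m * r^E)^D = m^D * r^(ED) = m^D * r mod N, so
multiplying by r's inverse leaves exactly the signature of m.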

BTW - I wish a prerequisite for publishing these papers was an open-source
sample implementation! :-)

~~~
saurik
> BTW - I wish a prerequisite for publishing these papers was an open-source
> sample implementation! :-)

I sometimes feel the same, but much interesting work in algorithms has been
done by people who are unable to program in an executable language.

~~~
dllthomas
Yes, but they usually have access to grad students...

------
nullc
I'll put an example in techie-English:

Say the government has a giant No-Bus-list of suspected naughty people that
shouldn't be allowed to ride busses. The complete list is highly secret,
because we wouldn't want people improperly listed challenging it in court.
Ahem I mean we wouldn't want terrorists to change their strategy to evade.

A Bus operator needs to query passengers against the list, but doesn't want to
tell the government the names of all their passengers (Because they know the
government will leak the data to their competition. What? You thought they
cared about privacy? Give me a break).

So what the government does is computes a digital signature for each national
ID number in the naughty list using a private key known only to the government
and puts the digital signature instead of the name in a bloom filter. They
send the whole bloom filter to the Bus operator. (The bloom filter is much
smaller than the list— which is good because the list is gigantic) The bus
operator learns nothing about the list except an approximate upper bound on
its size.

Then when a passenger shows up, the bus operator encrypts the passenger's ID
number and sends it to the government— imagine the encryption as a sealed
envelope. The government can't read it because it's encrypted. The government
performs a blind signing operation— sort of like signing a piece of carbon
paper inside an envelope without opening it (this requires the encryption and
signing methods be a kind that allows for this) and sends back the blind-
signed passenger name.

The Bus operator decrypts to obtain the signature and checks it against the
bloomfilter. If they get a hit on the filter they redirect the passenger to
Guantanamo for enhanced interrogation.

(A bloom filter, of course, has no false negatives: exactly the property we
need to keep America safe from Terrorism)

~~~
pudquick
Thank you for the write up here.

I was going to post something similar, but instead use the example of a lawyer
wanting to perform name searches on the list of people the NSA has information
on.

One point I do want to make, though, for anyone that can't tell the last
paragraph might be a bit of a joke: while a Bloom filter never has false
negatives, it will always have a certain amount (controllable by adjusting the
size of the filter) of false POSITIVES.

Think of it this way: a Bloom filter will always flag a member of Group A
(the set you build the filter with) -- no false negatives -- but it will
sometimes also flag someone who is only a member of Group Not-A.

In the example above, if Group A is the Evil Terrorists (which you build the
Bloom filter with), then there's a non-zero chance that a Good American member
of Group Not-A may end up falsely identified as a terrorist.

If Group A is instead made up of every Good American, you flip the problem and
now have the non-zero possibility of falsely identifying a Evil Terrorist
(from group Not-A) as a benign Good American.

If you attempt to use them to solve a problem like this, you have to have a
mechanism in place to deal with the very real case of a false positive (or
decide which group to build the filter with - which one is less damaging to
have show up as a false positive).

Bloom filters are cool, but they're a tradeoff of speed and memory for
accuracy. They are not a perfect index if you're trying to determine A vs.
Not-A-ness. The best concept is a cache lookup - if it's not a member of the
Bloom filter, you DEFINITELY don't have it cached, but if there's a hit then
you very very very likely have it cached (and if you don't, then at least
you're only redirecting a vanishingly small number of false positives back to
the live server).

Google actually uses this for determining attack sites in Chrome without
giving out an index of all the attack sites they know of. They give out a
Bloom filter instead with the web client. When you visit a site, if it's a
miss on the Bloom filter, it is currently not known by Google to be an attack
site. However, if you get a hit, it first calls home to a Google server to re-
verify that the site is not a false positive and confirm it is indeed an
attack site. In this way, since the vast majority of the web is not malicious,
they only impose a bandwidth cost when they think you may have stumbled upon a
bad site (member of the Bloom filter) - but they get a chance to weed out the
false positives before telling you.

~~~
nullc
Indeed, I wondered if the dark humor of the last comment might be a little too
subtle. :)

One of the bummers about this scheme is that it gives up a property the
regular, non-private use of bloom filters has: bandwidth that is constant
over all queries (just fetch the bloom filter once) rather than linear in the
number of lookups. But that's the tradeoff for the bi-directional privacy.

I could actually imagine a production system using this, which is something I
can't say for most privacy preserving query techniques.

------
bascule
If you think that's cool, you ain't seen nothin' yet. Check out Andrew
Miller's PhD thesis: a compiler that can take the Agda description of _any_
data structure and produce a cryptographically authenticated "Merkleized"
version of it:

[https://github.com/amiller/generic-ads](https://github.com/amiller/generic-ads)

Here's an implementation of a red-black tree:

[https://github.com/amiller/generic-ads/blob/master/RedBlack.hs](https://github.com/amiller/generic-ads/blob/master/RedBlack.hs)

~~~
jaekwon
Can you explain how say the red-black tree impl can be used in real life?

------
moxie
Basic protocol:

1) Put signatures of your elements into a bloom filter, rather than the
elements themselves.

2) Send the client the entire bloom filter.

3) Allow clients to request blind signatures for the elements they'd like to
query locally.

So unfortunately, like most solutions in the "private information retrieval"
domain, step one is "send the client your entire data set." They've explained
that cryptographically secure bloom filters are superior to secure set
intersection protocols (which are also largely unusable, although for
different reasons) for cases where the data set is so large that the server
would like to avail itself of the space savings that a bloom filter provides.

So at set sizes that large (say, 7 billion elements), step one is
unfortunately to send the client a 700MB file. Not to mention that step zero
was for the server to have calculated 7 billion digital signatures. It's
otherwise a pretty neat idea, though.

~~~
zaroth
LOL, too true! And even 700MB may be a bit stingy for a 7b element bloom
filter.
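
A back-of-envelope check using the standard estimates (my numbers, not the
paper's) bears that out:

```python
import math

n = 7_000_000_000
# Bits needed for 7 billion elements at a 1% false-positive rate:
m = -n * math.log(0.01) / math.log(2) ** 2
print(round(m / 8 / 2**30, 1))   # ~7.8 GiB, an order of magnitude over 700 MB

# Conversely, 700 MB is only ~0.8 bits per element, at which point even the
# optimal hash count leaves a false-positive rate over 50%.
bits_per_elem = 700e6 * 8 / n
k = max(1, round(bits_per_elem * math.log(2)))
p = (1 - math.exp(-k / bits_per_elem)) ** k
print(round(p, 2))
```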

Are there any options out there for privately testing set membership on a
central server where client-server message size is constant?

I can see how any type of private querying would necessarily involve putting
the data somewhere the client can access it, where the server can't observe
the access patterns. (EDIT) Or making the server look at all the data to
service a single request.

So one option is handing the client an access-controlled data set, so you know
WHO is accessing the data, but not WHAT they are accessing. I guess the other
option is trying to hide who the client is, perhaps via onion routing, so you
know WHAT data is being accessed, but not necessarily WHO is accessing it.
Maybe a third way would be to use leaky selectors -- queries that return the
desired data, plus a bunch more, so the server doesn't know exactly what piece
of data was being targeted.

They reference a "Keyword Search and Oblivious Pseudorandom Functions" paper
([http://www.cs.princeton.edu/~mfreed/docs/FIPR05-ks.pdf](http://www.cs.princeton.edu/~mfreed/docs/FIPR05-ks.pdf))
which looks like their inspiration, which is a privacy maintaining Key-Value
store "with a communication overhead of O(polylog n) and a computation
overhead of O(log n) public-key operations for the client and O(n) for the
server." I'll have to read some more about that one.

------
erikpukinskis
We are barreling towards a future where a Bitcoin/Bittorrent-like network
provides cryptographically guaranteed trust-free hosting of data and data
services. It could be distributed such that computation was spread across all
of the users of data. Bittorrent for hosting.

In such a world, merely publishing code into the network would allow anyone to
use it anywhere in the world, for the rest of time.

~~~
zaroth
It's a beautiful vision of what "P2P" could really deliver to the world --
just as you say, "publishing code into the network" and then the users decide
the resources they are willing to dedicate to your software. We are, as they
say, so close, and yet so far away from achieving anything like this that's
even remotely usable.

Maybe we can get private, secure, scalable, DDoS resistant DHT working, and go
from there? :-)

------
rca
Side question,

In the introduction the author mentions that if you don't care about privacy,
you can solve the x ∈ S problem easily by sending the whole bloom filter to
the client. Wouldn't it be simpler for the client to send x to the server
instead, and then let the server send the answer back to the client? As I
understood bloom filters can get quite large...

------
aortega
There are many ways to use an encrypted bloom filter. My thesis (shameless
plug) uses an encrypted bloom filter to share an optical fibre channel
securely among up to 128 users -
[http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6476559](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6476559)

