
Bloom filters debunked: Dispelling 30 Years of bad math with Coq - gopiandcode
https://gopiandcode.uk/logs/log-bloomfilters-debunked.html
======
curryhoward
I feel like most commenters are missing the point. The fact that this issue
was finally settled once and for all using a proof assistant is a huge
achievement! That's the highest degree of scrutiny that a proof can undergo.
This is especially important given the high number of mistakes in prior
work—one would be forgiven for distrusting new papers on this topic that
_don't_ have machine-certified proofs. I can't believe people think this is
just clickbait. Do y'all not recognize the importance of math
clickbait. Do y'all not recognize the importance of math
being...well...correct?

~~~
tom_mellior
The article doesn't say that the original _proof_ was incorrect. It argues
that one of the original theorem's _assumptions_ was not justified. A machine
certified proof doesn't help with this. [EDIT: The author points out below
that the assumption was not explicit in Bloom's work. An attempt at a machine
certified proof _would_ catch this aspect. But not the aspect of the
assumption being wrong, if it were added explicitly.]

Now, others later recognized this faulty assumption and produced a new formula
and a faulty proof for its correctness. A machine certified proof would help
here if an actual proof step is wrong, except note the footnote: "To be fair,
the error in Bose et al.'s paper was primarily due to incorrect definitions,
rather than an incorrect logical step. As an interesting anecdote, the first
version of our work was based off the incorrect definitions by Bose et al. and
ended up being rejected by reviewers who then rediscovered this error." So
again it was _human_ checking that (re-)discovered an issue. One that appears
to have been fixed by humans before this work. And one that the machine wasn't
able to find.

So now that all known incorrect assumptions are shaken out (there might be
unknown ones, Coq can't tell), this work is a machine-checked proof of something
that had been proved before. It's an achievement, and it's good to have
certainty. Almost anything that gets us closer to more formal math and
computer science is good. But this particular result is hardly spectacular.

As for the clickbait aspect, I'm also annoyed by it. The title is clearly
factually incorrect. No property of Bloom filters has been debunked. A widely
cited formula was replaced by an asymptotically equivalent one 12 years ago,
and we now have a machine checked argument that this replacement was correct.
This isn't a debunking of anything. No math has been dispelled.

Clear and honest science communication is important. Authors overselling their
work in such (obvious) ways just highlights that they themselves don't think
the work is sensational (which it doesn't have to be!), and it makes readers
wonder in what other ways they are intellectually dishonest.

~~~
gopiandcode
> The article doesn't say that the original proof was incorrect

Apologies if my wording did not make this clear: the original proof by Bloom
_is logically incorrect_.

Bloom derives his expression by performing a transformation that would only be
possible if the bits are independent, but nowhere in his proof does he state
that he is making this assumption. It just so happens that even if he had
explicitly included this assumption, it would not have been justified. So in
that sense, the debunking part is correct - we disprove Bloom's original bound
by proving the true bound under the same assumptions that he explicitly makes.

~~~
tom_mellior
> Bloom derives his expression by performing a transformation that would only
> be possible if the bits are independent, but nowhere in his proof does he
> state that he is making this assumption.

I see, thanks for this clarification. I read your post as saying that the
assumption was explicit. I agree that reasoning from an implicit assumption is
a logical error. Still, the real problem was not the assumption being
implicit/explicit but rather that one doesn't want to rely on it at all. Which
again is an extra-logical issue.

> So in that sense, the debunking part is correct - we disprove Bloom's
> original bound by proving the true bound under the same assumptions that he
> explicitly makes.

OK. I don't think that recapitulating something that has been known with more
or less certainty for 12 years counts as "debunking". For me "debunking" has a
connotation of being original, maybe for you it doesn't.

But I especially think that "the formula was only asymptotically correct" is
too weak a statement to count as a "debunking". For me "debunking" an
algorithm or a data structure would have to be something much bigger, like
"Quicksort doesn't always sort" or "binary search trees can lose data" or
"Bloom filters can sometimes have false negatives". Not "the behavior we have
been seeing for the last 50 years matches the original formula well, but
strictly speaking it matches the new formula a bit better". Your mileage
obviously varies.

EDIT: Let me stress again that I'm not pooping on your work. The work appears
sound and important, since this was so tricky to get right in the past. But I
_am_ pooping on the title.

~~~
gopiandcode
I guess we'll have to agree to disagree here.

Given the history of the proofs of the Bloom filter, I'd argue that the bound
was not really "known with more or less certainty" - if so many corrections
had to be made, why should we believe that this latest paper was truly
correct? It does not seem unreasonable to believe that there might be
other errors that had been similarly overlooked. Our research tackles
this problem - it provides a guarantee that there are no further hidden
errors.

> "the formula was only asymptotically correct"

I have no issues with the behaviors of a data structure being characterized in
an asymptotic sense, but I do think that giving a bound that you claim to hold
exactly, but then having it turn out to _only_ hold asymptotically is
incorrect and worthy of debunking. Furthermore, most citations of Bloom's
bound do not claim it is an asymptotic bound, but use it as an exact one,
which is clearly incorrect.

~~~
aflag
I think from an engineering perspective it doesn't feel like a debunking.
Nothing we assume about the bloom filter has been fundamentally changed. I
think that might be the perspective of most people in here.

~~~
gopiandcode
Yes, I think that's an accurate take on the work.

At the end of the day, this is primarily a formal result with few direct
implications on practical usage. As I mentioned in another post, I tried to
make it more relatable to an engineering audience, but it's possible that I
may have buried the lede somewhat. However, I still stand by the statement
that the title is not incorrect, at least from a mathematical perspective.

------
gregw2
I thought the article had a very nice, succinct, and clear explanation of Bloom
filters and wanted to say thanks to the author, who is reading this thread.

A year or two ago when bloom filters became a recurring popular topic on HN I
read a long illustrated medium post on them found via HN out of curiosity to
add the concept and tool to my back pocket, and it all seemed complicated and
the explanation didn't stick. Your explanation however was quick and made
complete sense and I cannot forget it. Appreciate it! Thank you!

~~~
tomxor
Agree, it almost highlights how bad others' descriptions can sometimes be. I
mean, the article makes it feel like such a simple idea given how many words
are spoken about it.

I wonder if the key is in quickly grounding the core abstractions with
concrete meaning early on, that way your mind quickly has a model of "things"
to operate on before attempting to build up the behaviour with more abstract
description.

Many descriptions stay in the abstract too long without anything to attach it
to for the uninitiated... in which case you either persevere and eventually it
clicks and all the relationships fall into place - or you give up out of
disinterest.

I've noticed this when explaining seemingly very simple things to others,
especially non-technical people: I'll attempt to describe them in multiple
ways and fail, and then realize they need clarification of the "what", after
which the explanation is easy. Sometimes you are blind to it when you already
know the "what" and already have the mental model, so you jump straight into
how and why.

------
jchw
A bit tangential, but I suspect many people don’t know that there are actually
more information-dense data structures for the use case of Bloom filters;
Cuckoo filters, for example, can get close to the theoretical lower bound of
information required and have some other interesting properties. So you should
probably consider those before reaching for Bloom filters!

~~~
haecceity
Bloom filter was probably chosen not because of its false negative rate but
because it was good enough whatever its false negative rate is.

~~~
eis
Bloom filters have no false negatives, they have only false positives. If it
were the other way round, they would be of no practical use as you'd have to
query the original datastructure in any case:

- If the Bloom filter said "True", then you go ahead and fetch the data from
the original structure.

- If the Bloom filter said "False", but that might be a false negative, then
you'd have to query the original structure anyway to be sure.

With 0% false negatives but a relatively small rate of false positives
instead, you don't have to query the original source if the filter gave a
"False".
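The asymmetry is easy to see in a toy implementation. A minimal sketch in Python (the class name and parameters are mine, and salted SHA-256 stands in for the k hash functions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions derived
    by salting SHA-256. Illustrative only, not production code."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _indices(self, item):
        # Derive k bit positions by salting the input with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def query(self, item):
        # False means definitely absent; True means only "possibly present".
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
for url in ["evil.example", "malware.example"]:
    bf.add(url)

assert bf.query("evil.example")   # inserted items always answer True
print(bf.query("safe.example"))   # usually False; True would be a false positive
```

Every bit an inserted item sets stays set, which is exactly why a "False" answer can be trusted without consulting the original structure.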

Btw, Bloom filters were one of the first probabilistic data structures with
real practical uses, and to this day they are widely used. They are still not
outclassed in every way by newer algorithms like Cuckoo filters.

~~~
emn13
Of course, that's purely a matter of perspective. If you regard a Bloom filter
as an approximation of inclusion in a set, it's a false positive, which is how
we usually look at it. But if you look at it as an approximation of safe
URLs, e.g. in a browser, it's a false negative (and there are no false
positives). The data structure doesn't care; it's just bits. If the overall
set is finite, it's even quite hard to tell which makes most sense, but in the
more usual, practically unbounded case, sure, finite subset inclusion
approximation seems a less brain-twisting interpretation.

~~~
eis
It's an interesting thought but I'd argue the following.

First, by definition a Bloom filter is a probabilistic data structure for
testing the presence of an element in a set, no matter the program logic
around it.

But secondly, given your example of safe browsing, you can flip positive and
negative meanings around freely by how you define the question. Is it "is this
a bad domain?" or "is this a good domain?", both of which you can convert to
"is this domain in the set?" and "is this domain NOT in the set?".

And that's a general problem with "positives" and "negatives", they depend
purely on the question and the question can have a boolean negator in it. But
that's a logical layer above the data structure. As you said, the data
structure does not care, it is just bits and therefore the question is "is the
element in the set?".

Same as one could ask "Is it day?" or "Is it night?" which both query the same
underlying data about the time and location but would have opposite meanings
for positive and negative results.

And so, to give the two terms a better meaning we have a definition of the
data structure and the operations on it which defines the meanings of
negatives and positives.

I think some part of the confusion around false negatives and positives
results from the connotations of "positive" (good) and "negative" (bad).

------
mdonahoe
“ To be fair, the asymptotic behaviour of Bloom's original bound is consistent
with this updated definition, so the impact is more on an issue of pedantry
rather than for practical applications.”

Would love to see a table with computed numbers comparing the rates. It’s hard
for me to understand the behavior of that second result

~~~
gopiandcode
This paper[1] which corrected Bose et al.'s original derivation, has more
discussion about the impact of the second result - in particular, Figure 2 (on
page 13) has a comparison of the relative error of the old and new bounds
against the empirically calculated rate.

[1]
[https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=90377...](https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=903775)

------
cb321
It bears mentioning that not only cuckoo filters but also simply vanilla
linear probed hash tables of B-bit truncated hashes (sometimes called
fingerprints) can be a better substitute for Bloom filters. This seems to be a
highly under-propagated fact. "High p" (order 10%) numerical examples are
often used to sell an idea that is less valuable at small p.

An (asymptotic, for p <=~ 10%) back of the envelope formula is that such a
hash table/set of fingerprints takes up a factor of about (1+log_{1/p}(N))
more space than a Bloom filter. It is not hard to derive this. Unlike the
incredibly precise Coq formula proof theme, this is all approximate, but more
engineering-relevant.

If you were targeting p=0.001 to have a small mistake rate, 1 + log_1000(N) is
pretty small (say <~ 1+3=4 for N <~ 1e9 elements). While the Bloom filter does
use only 25% of the space, it would require many more (-log_2(p) =~ 10) probes
while the LP hash table would only hit the DIMMs once. Many, but not all,
might view a 10x latency reduction as worth 4x the space in the game of space-
speed trade-offs.
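The back-of-the-envelope numbers above can be checked directly (Python sketch; the function names are mine, and the formulas are the approximate ones quoted in this comment, not exact bounds):

```python
import math

def fingerprint_space_factor(p, n):
    """Approximate space of a linear-probed fingerprint table relative to
    a Bloom filter with false-positive rate p and n elements:
    1 + log_{1/p}(n), as quoted above."""
    return 1 + math.log(n) / math.log(1 / p)

def bloom_probes(p):
    """Hash probes k for an optimally sized Bloom filter: about -log2(p)."""
    return math.ceil(-math.log2(p))

p, n = 1e-3, 1e9
print(f"space factor ~ {fingerprint_space_factor(p, n):.1f}x")  # ~4.0x
print(f"Bloom probes ~ {bloom_probes(p)}")                      # 10
```

So at p=0.001 and a billion elements, the fingerprint table costs about 4x the space but replaces ~10 probes with one.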

~~~
cb321
I should also have said that this competing idea was raised literally in _the
very same_ Burton Bloom 1970 paper that the more famous filters come from.

Analysis of speed these days (where a single main memory hit is thousands of
superscalar dynamic instructions) is tilted differently than it was in 1969.
Still, _even back then_ Bloom's own original paper had a footnote qualifying
his superiority conclusion as dependent upon memory system assumptions. Beats
me how this gets lost. Call it "The Bloom filter mystique".

------
bawolff
This is important work and all, but I can't help but feel the headline is a
bit clickbaity.

~~~
gopiandcode
Yes, that's a fair comment. I took some creative liberties with the title to
try and make this theoretical result more relatable to the average reader, but
it's possible I may have gone too far.

~~~
random314
It would be interesting if you could show how Coq refused to accept the old
formula and what error it produced.

~~~
gopiandcode
When I was attempting to prove Bloom's original incorrect bound, the work
never progressed to the point where I was actively working directly on proving
his bound - I managed to prove some intermediate theorems, but was unable to
work out a way to compose them. The issue ended up being that I was unable to
derive the independence required to prove the inductive step.

If you're interested at looking at the sources, I think the following commit
was around the place where I was working on this:
[https://github.com/certichain/ceramist/commit/70927c5b50e21a...](https://github.com/certichain/ceramist/commit/70927c5b50e21a08f510cfd9555d8324a61c1233)

------
anonymoushn
Is there a sound technique to get your k hash functions to produce k distinct
bits, such that the original derivation would become correct?

~~~
ReaLNero
It's not too difficult: hash function 1 generates an integer from 0 to n-1 and
that element is deleted; then hash function 2 generates an integer from 0 to
n-2 and that element is deleted; etc.

This is an O(n^2) algorithm, which you can improve to O(n log n) using Fenwick
trees.

In practice, however, it is much quicker to generate random hashes and repeat
until they're distinct.
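The retry approach, sketched (Python's PRNG stands in for the k hash outputs, and the function name is mine):

```python
import random

def k_distinct_indices(k, m, rng=None):
    """Draw k indices in [0, m) and retry until all are distinct.
    With k much smaller than m, a retry is needed only rarely
    (birthday bound, roughly k^2 / (2m))."""
    rng = rng or random.Random()
    while True:
        draws = [rng.randrange(m) for _ in range(k)]
        if len(set(draws)) == k:
            return draws

print(k_distinct_indices(3, 1024))
```

The expected number of rounds is close to 1 for typical Bloom filter parameters, which is why this usually beats the sampling-without-replacement bookkeeping.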

~~~
anonymoushn
In k^2 time, which is maybe faster than running a hash an additional time:

      function get_bits(input, hashes, n)
        local bits = {}
        for i=1,#hashes do
          local bit = hashes[i](input) % (n-i+1)
          for j=1,i-1 do
            if bits[j] <= bit then
              bit = bit + 1
            end
          end
          bits[i] = bit
          while i > 1 and bits[i] < bits[i-1] do
            bits[i], bits[i-1] = bits[i-1], bits[i]
            i = i - 1
          end
        end
        return bits
      end

~~~
marcan_42
Your % operation is not uniformly distributed if the hash domain isn't an
integer multiple of the output size, which is a bigger problem for formal
correctness than the colliding output issue (hashes are not uniformly
distributed).

You need to truncate to an even number of bits and retry until the output is <
n, at which point you're retrying anyway, so you can just retry on collision
instead of having all that complicated logic to skip bits.

Nice try though - if you want formally perfect results like the OP, you have
to try harder :-) (except real hash functions are only _assumed_ to have
perfectly distributed outputs anyway; that is not proven and probably not
provable, so basically you're screwed either way and none of this matters :-)
).

------
thomasahle
The original false negative rate approximation can be proved (correctly) using
Martingale arguments as done by Mitzenmacher and Upfal in 2005. The Wikipedia
page also shows this version.

~~~
gopiandcode
Ah, that's a new change on the Wikipedia page - when I started work on this
proof back in December, the Wikipedia page actually had the incorrect
derivation.

Additionally, Bloom's original bound is given (and typically quoted) as an
exact expression for the false positive rate, so while it may be correct as an
approximation, I'd say it's fair to say that the original bound is wrong.

~~~
thomasahle
Yeah, it's only right asymptotically. It's nice work testing some of those
things in Coq.

------
peter_d_sherman
>"Using a probabilistic data structure known as a Bloomfilter, Browsers
maintain a approximate representation of the set of known malicious URLs
locally. By querying this space-efficient local set, browsers will only send
up a small proportion of URLs that have a high likelihood of actually being
malicious."

------
NikkiA
Surely the original 'false positive rate' is only wrong if the hash function
is 'bad', and thus Bloom wasn't 'incorrect' as much as assuming a perfect hash
function.

~~~
gopiandcode
I don't think that would solve this issue - a perfect hash function is
guaranteed to not have any collisions for any element in some predefined set.
What Bloom's proof requires is that all of the k hash functions should not
have any collision for any input that is inserted into the Bloom filter, which
is not covered by just having each function alone be perfect. That aside,
Bloom does not make any assumptions about the chosen hash functions being
perfect or not.

~~~
jesboat
Correct.

In the extreme case, imagine that your k hash functions are nearly identical:
hash function `H_i` differs from `H_1` only in that the outputs for the `1`st
and `i`th elements in the input space are swapped. Of the `N` elements in the
input space, all but `k` will completely collide.
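A quick sketch of this adversarial family (Python; modelling inputs as integers, with a made-up multiplicative base hash standing in for `H_1` - everything here is illustrative):

```python
def make_hashes(k, m):
    """Adversarial family from the comment above: H_i equals H_1 except
    that the outputs for inputs 1 and i are swapped."""
    def base(x):
        # Knuth-style multiplicative hash, purely illustrative.
        return (x * 2654435761) % m

    def make_h(i):
        def h_i(x):
            if x == 1:
                x = i      # swap inputs 1 and i before hashing
            elif x == i:
                x = 1
            return base(x)
        return h_i

    return [make_h(i) for i in range(1, k + 1)]

hashes = make_hashes(4, 64)
# Any input outside {1..k} hashes identically under all k functions,
# so "all but k will completely collide":
print({h(100) for h in hashes})   # a single-element set
```

Each function in the family is individually a fine hash, which is why pairwise quality of the hash functions alone isn't enough for the independence assumption.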

------
dan-robertson
Here’s a derivation of the formula Bose et al give for the false positive
probability:

To calculate the probability, we will work out the number of ways to assign
_kl_ hashes to _m_ bits (i.e. functions from a set of size _kl_ to a set of
size _m_), and for each of those ways we will work out how many ways we could
assign _k_ hashes to the bits which are set (i.e. ways to get a false
positive). We then count the number of possible ways to assign our _kl_ hashes
of existing elements and _k_ hashes of the tested element to _m_ bits and
divide the former by the latter. For the argument to be valid, each assignment
of hashes to bits must have equal probability, which is true if the hashes are
independent.

The number of assignments of _kl_ hashes to _m_ bits is easy to count: for
each hash there are _m_ possible bits, so we get:

      m^(kl)

Similarly for _kl + k_ hashes:

      m^(k(l + 1))

Now we will break this count up by the number of bits which are set. Suppose
_i_ bits are set. Then the number of possibilities for those set bits out of
the _m_ total bits is (m choose i). And the number of ways the _kl_ hashes
could be assigned to the _i_ bits is equal to the number of surjections from a
set of _kl_ hashes onto a set of _i_ bits, which is i!{kl; i}, where {s; t} is
the Stirling number of the second kind: the number of ways to partition _s_
labelled objects into _t_ unlabelled non-empty partitions [the author's paper
claims {s; t} itself is the number of surjections, which is slightly wrong].
This gives the number of assignments given exactly i set bits as:

      (m choose i) i! {kl; i}

And the total as:

      m^(kl) = Sum_(i = 0)^m (m choose i) i! {kl; i}

Given exactly _i_ bits are set, how many ways can we assign _k_ hashes to
those _i_ bits? Easy: i^k. So the number of lists of kl + k hashes (integers
from 1 to m) such that the last k all appear in the earlier list of kl is:

      Sum_i i^k (m choose i) i! {kl; i}

Finally, we divide by the number of possible lists, m^(k(l+1)), to get the
probability given by Bose et al. (this is valid because the hashes are i.i.d.,
so each list has equal probability of occurring).

------
natch
Their prose description of a bloom filter firstly has a significant flaw, and
secondly falls victim to what I believe is a fallacy in many discussions about
bloom filters.

The flaw is that they do not specify that the size of the bit array should be
a prime number. This omission alone is astounding. To be fair, they state that
they assume the bits from the hash functions are randomly distributed over the
bit vector, so with this assumption they skate past this issue even though
they have apparently missed that crucial detail of part of how it is
accomplished. One wonders if they were unaware.

The fallacy (imho - this is where I depart from the community, so take me with
a grain of salt if you wish) is that you don’t need multiple independent hash
functions. You just need multiple inputs. For example, instead of hashing the
word “salad” three times, just hash the tokens salad1, salad2, and salad3. If
your hash function is worth its salt (npi) then you will be just fine.
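That scheme takes only a few lines (Python sketch; the function name is mine, and SHA-256 stands in for a hash "worth its salt"):

```python
import hashlib

def salted_indices(item, k, m):
    """Derive k bit positions from one hash function by hashing salted
    variants of the input: salad1, salad2, ..., saladk."""
    return [int(hashlib.sha256(f"{item}{i}".encode()).hexdigest(), 16) % m
            for i in range(1, k + 1)]

print(salted_indices("salad", 3, 1024))
```

Note that the derived indices are deterministic for a given input but not guaranteed distinct, so this matches the usual Bloom filter analysis rather than a distinct-bits variant.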

~~~
gopiandcode
This is a good point about practical implementations of hashing based
randomization - incorrect table sizes can invalidate assumptions about
uniformity and cause guarantees to fail. However, this work was primarily
about the theoretical analysis of these data structures, which presupposes
that we already have a means of generating uniformly randomly distributed bits
over the table space, hence the lack of discussion about table sizes.

------
phkahler
What if a distinct bit vector is used for each hash function? Then isn't the
original false positive rate correct?

Can you use less storage by having distinct bit vectors for each hash? This
seems like a natural question once we know the false positive rates are
different.

Maybe Bloom was misinterpreted and right all along?

~~~
gopiandcode
Good point. You are right that our correction does not address errors in
Bloom's original definition of a Bloom filter (which uses distinct bit
vectors), but rather in the definition of the Bloom filter that is typically
used (and referenced) in practice and in the literature (Knuth's version),
which functions in the way described in the article. This version is easier to
implement as the hash functions are independent, but it _does_ use incorrect
reasoning in its correctness proof. It's wrong to attribute this to Bloom, so
I've added a note to address this.

------
drewm1980
I'm curious about the properties of the new approximate membership algorithms
you discovered as part of this research. Are they better?

"We instantiated this interface with each of the previously defined AMQ
structures, obtaining the Blocked Bloom filters, Counting Blocked Bloom
filters and Blocked Quotient filter along with proofs of similar properties
for them, for free."

So it sounds like the new AMQ algorithms you allude to are blocked (more cache
friendly) variants of existing AMQ algorithms. Are the bounds you proved all
good in some sense? Do you know yet if they're actually faster in practice on
real hardware, or do optimized implementations still need to be written?

~~~
gopiandcode
The corresponding bounds for the variant structures we construct do result in
lower false-positive rates than a standard Bloomfilter - so, in that sense,
you could say that they are better. However, they also require more space,
striking a slightly different theoretical trade-off between space and
accuracy.

For practical purposes, you would have to take the effect of caches into
account, and the performance may vary depending on the particular choice of
hardware. Our work stuck mainly to the theoretical side, so we didn't do any
empirical testing of these new data structures. I guess the jury is still out
on whether these variants are actually better in practice than the existing
ones.

------
yomly
This was very accessible! Thank you for writing it.

There were a couple of typos - might be worth getting someone to proof read
it... (I am out atm so don't have a good way of laying out suggestions for
corrections)

~~~
gopiandcode
Thanks for the feedback. My main aim with this post was just to provide a
slightly more accessible introduction to the full work, so I'm glad that it
was successful in that sense.

It was mostly just a quick transcription of the corresponding presentation for
the paper, so some typos may have crept in. I'll do a second pass and try and
fix that later today.

~~~
PAPPPmAc
That's mostly a really nice explanation, the one thing that bugged me (which
is my pet peeve in math-y CS papers) is a not-perfectly-clear explanation of
your variables. The ones in the equations interspersed with the diagrams of
the original derivation are easy enough to pick up from the diagrams, but you
switched from $n$ to $l$ for the number of inputs in the correct expression
(and the Bose paper also used $n$, so I'm not sure why).

~~~
gopiandcode
Thanks for the comment.

I think that was actually a mistake in my writeup: I just copied the LaTeX
from the paper without adjusting it to the notation used in the article. I'll
make sure to fix it.

------
dooglius
I didn't realize Bloom filters were used in this way. Thinking adversarially
for a moment, doesn't this provide an easy way to get a target website marked
as a false positive?

~~~
a1369209993
Thinking more adversarially, doesn't this provide an easy way to track anyone
who visits a target website?

~~~
dooglius
Based on advance512's answer, it sounds that way.

------
hanoz
I can't follow all the maths but the introduction is surely wrong based on
simple probability alone.

It's claimed that the URLs which browsers send up (having been diagnosed
positive by the test) will _"have a high likelihood of actually being
malicious"_, but that by no means follows from the test's low false positive
rate. You need to consider the background rate.

Just like in the classic example of a positive diagnosis from a low-false-
positive test for a rare disease.

~~~
gopiandcode
I think in this case the fact that there are no false negatives means that the
low false positive rate is enough to infer that a positive result implies high
likelihood of actually being malicious.

You can reason roughly as follows:

      P[ pos | mal ] = 1                      (no false negatives)
      => P[ pos /\ mal ] = P[ mal ]           (Bayes)


A low false positive rate means that:

      P[ pos | ¬ mal ] ~= 0                   (low false positive rate)
      P[ pos /\ ¬ mal ] / P[ ¬ mal ] ~= 0
      (P[ pos ] - P[ pos /\ mal ]) / (1 - P[ mal ]) ~= 0
      (P[ pos ] - P[ mal ]) / (1 - P[ mal ]) ~= 0

From this fraction we can conclude:

      P[ pos ] - P[ mal ] ~= 0

Returning to the likelihood of being malicious given a positive result:

      P[ mal | pos ]
      = P[ mal /\ pos ] / P[ pos ]
      = P[ mal ] / P[ pos ]
      ~= 1.0

~~~
hanoz
I think you're right. I'll stick to the day job.

~~~
hanoz
No, wait, I think I was right after all.

Say we have a million URLs, and a thousand of them are malicious. Our filter
returns a positive result for all 1000 malicious ones (no false negatives),
and for the 999000 safe URLs only 1% will return positive (low false positive
rate), but that's still 9990 false positives. So a positive result only has a
1000 / (1000 + 9990), i.e. ~9%, chance of actually being malicious.

Even with a false positive rate of only 0.1%, the probability of a positive
result actually being malicious only rises to 50%, so still not in _"high
likelihood"_ territory.
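The base-rate arithmetic, spelled out (Python; the function name is mine, with the numbers from the example above - note that 1% of the 999000 safe URLs is 9990 false positives):

```python
def p_malicious_given_positive(n_urls, n_malicious, fp_rate):
    """P(malicious | positive) with no false negatives: true positives
    over all positives."""
    true_pos = n_malicious
    false_pos = (n_urls - n_malicious) * fp_rate
    return true_pos / (true_pos + false_pos)

print(round(p_malicious_given_positive(1_000_000, 1000, 0.01), 3))   # ~0.091
print(round(p_malicious_given_positive(1_000_000, 1000, 0.001), 3))  # 0.5
```

The posterior is dominated by the base rate of malicious URLs, not just the false positive rate, which is the classic rare-disease point.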

~~~
gopiandcode
Nice example - you were right; my wording in that section was incorrect.
sukilot quite nicely presents the error in my derivation - I guess I probably
shouldn't try coming up with proofs on the spot, especially if they're not
machine checked.

I think what I probably should have said instead was that the low false
positive rate means that only a small proportion of the honest URLs will be
sent up.

------
hyyypr
> Conversely, sending every URL that a user visits to some external service,
> where it could be logged and data-mined by nefarious third parties (i.e
> Google)

As in, just like DNS.

~~~
tzs
This suggests an interesting question to put on an exam in whatever class
teaches about Bloom filters.

----------------

Q: It is proposed to make the local DNS resolver handle IPv4 by using 64 Bloom
filters, BO1, BZ1, BO2, BZ2, ... BO64, BZ64.

Filter BOn answers the question "is bit n of the IP address a 1?", and BZn
answers the question "is bit n of the IP address a 0?".

Your resolver checks all these Bloom filters. If BOn returns "no", then it
knows bit n of the address is 0. If BZn returns "no", then it knows bit n of
the address is 1. Only if BOn and BZn both return "maybe" for some n must the
resolver actually do a DNS query over the internet.

Explain why this proposed resolver would not be useful.

~~~
maxfan8
These would have to be pretty large and updated regularly. Why not use a
k-anonymity scheme? A decent number of password managers use that to see if
the password a user generated is unique.

------
ghj
The theory of hashing never matched up well with how I've experienced it in
the real world. For example, almost all analysis of hash tables (for proving
things about load factor, chain length, probe clustering sizes, etc.) starts
off with a "uniform hashing" assumption. But that assumption has basically
never held true given how languages define their default hashes.

------
ur-whale
The actual Coq code is here:

[https://github.com/certichain/ceramist/blob/fd5e522f2c381f7d...](https://github.com/certichain/ceramist/blob/fd5e522f2c381f7dbd5b8e38b48041dfd4bd261a/Structures/BloomFilter/BloomFilter_Probability.v#L1187)

------
reanimus
This is interesting stuff! It's especially nice to see some more details
behind the math -- I did some stuff with bloom filters for work and found that
the math didn't always seem to line up with what we'd expect during tests. I
wonder if it needs some adjustment...

------
martincmartin
The new formula doesn't appear in the Wikipedia article for Bloom Filter,
although it does say that the old formula is only an approximation because of
the incorrect independence assumption.

------
fierarul
For an academic paper this page was extremely engaging! Congrats to the
author.

------
layoutIfNeeded
It’s not “Bloomfilter” but “Bloom filter” after Burton Howard Bloom.

~~~
gopiandcode
My bad - I didn't realize; I will update the post accordingly.

