I feel like most commenters are missing the point. The fact that this issue was finally settled once and for all using a proof assistant is a huge achievement! That's the highest degree of scrutiny that a proof can undergo. This is especially important given the high number of mistakes in prior work—one would be forgiven for distrusting new papers on this topic that don't have machine-certified proofs. I can't believe people think this is just clickbait. Do y'all not recognize the importance of math being...well...correct?
The article doesn't say that the original proof was incorrect. It argues that one of the original theorem's assumptions was not justified. A machine certified proof doesn't help with this. [EDIT: The author points out below that the assumption was not explicit in Bloom's work. An attempt at a machine certified proof would catch this aspect. But not the aspect of the assumption being wrong, if it were added explicitly.]
Now, others later recognized this faulty assumption and produced a new formula and a faulty proof for its correctness. A machine certified proof would help here if an actual proof step is wrong, except note the footnote: "To be fair, the error in Bose et al.'s paper was primarily due to incorrect definitions, rather than an incorrect logical step. As an interesting anecdote, the first version of our work was based off the incorrect definitions by Bose et al. and ended up being rejected by reviewers who then rediscovered this error." So again it was human checking that (re-)discovered an issue. One that appears to have been fixed by humans before this work. And one that the machine wasn't able to find.
So now that all known incorrect assumptions are shaken out (there might be unknown ones, Coq can't tell) this work is a machine checked proof of something that had been proved before. It's an achievement, and it's good to have certainty. Almost anything that gets us closer to more formal math and computer science is good. But this particular result is hardly spectacular.
As for the clickbait aspect, I'm also annoyed by it. The title is clearly factually incorrect. No property of Bloom filters has been debunked. A widely cited formula was replaced by an asymptotically equivalent one 12 years ago, and we now have a machine checked argument that this replacement was correct. This isn't a debunking of anything. No math has been dispelled.
Clear and honest science communication is important. Authors overselling their work in such (obvious) ways just highlights that they themselves don't think the work is sensational (which it doesn't have to be!), and it makes readers wonder in what other ways they are intellectually dishonest.
> The article doesn't say that the original proof was incorrect
Apologies if my wording did not make this clear: the original proof by Bloom is logically incorrect
Bloom derives his expression by performing a transformation that would only be possible if the bits are independent, but nowhere in his proof does he state that he is making this assumption. It just so happens that even if he had explicitly included this assumption, it would not have been justified. So in that sense, the debunking part is correct - we disprove Bloom's original bound by proving the true bound under the same assumptions that he explicitly makes.
> Bloom derives his expression by performing a transformation that would only be possible if the bits are independent, but nowhere in his proof does he state that he is making this assumption.
I see, thanks for this clarification. I read your post as saying that the assumption was explicit. I agree that reasoning from an implicit assumption is a logical error. Still, the real problem was not the assumption being implicit/explicit but rather that one doesn't want to rely on it at all. Which again is an extra-logical issue.
> So in that sense, the debunking part is correct - we disprove Bloom's original bound by proving the true bound under the same assumptions that he explicitly makes.
OK. I don't think that recapitulating something that has been known with more or less certainty for 12 years counts as "debunking". For me "debunking" has a connotation of being original, maybe for you it doesn't.
But I especially think that "the formula was only asymptotically correct" is too weak a statement to count as a "debunking". For me "debunking" an algorithm or a data structure would have to be something much bigger, like "Quicksort doesn't always sort" or "binary search trees can lose data" or "Bloom filters can sometimes have false negatives". Not "the behavior we have been seeing for the last 50 years matches the original formula well, but strictly speaking it matches the new formula a bit better". Your mileage obviously varies.
EDIT: Let me stress again that I'm not pooping on your work. The work appears sound and important, since this was so tricky to get right in the past. But I am pooping on the title.
Given the history of the proofs of the Bloomfilter, I'd argue that the bound was not really "known with more or less certainty" - if so many corrections had to be made, why should we believe that this latest paper was truly correct? It does not seem unreasonable to believe that there might be other errors that had been similarly overlooked. Our research tackles this problem - it provides a guarantee that there are no further hidden errors.
> "the formula was only asymptotically correct"
I have no issues with the behaviors of a data structure being characterized in an asymptotic sense, but I do think that giving a bound that you claim to hold exactly, but then having it turn out to only hold asymptotically is incorrect and worthy of debunking. Furthermore, most citations of Bloom's bound do not claim it is an asymptotic bound, but use it as an exact one, which is clearly incorrect.
I think from an engineering perspective it doesn't feel like a debunking. Nothing we assume about the bloom filter has been fundamentally changed. I think that might be the perspective of most people in here.
At the end of the day, this is primarily a formal result with few direct implications on practical usage. As I mentioned in another post, I tried to make it more relatable to an engineering audience, but it's possible that I may have buried the lede somewhat. However, I still stand by the statement that the title is not incorrect, at least if only from a mathematical perspective.
Good point. The main Bloom filter I use at work is 8 million bits in size, and at that scale, it does seem completely negligible whether you account for bits being selected by multiple hash functions since the odds of that happening are so small. We could add a few bits to the 8 million to correct for the error, but that's not meaningfully changing anything; it's just a drop in the bucket.
> Our research tackles this problem - it provides a guarantee that there are no further hidden errors
Does it? At a glance you seem to have evaluated Bloom's false positive rate against Knuth's version of the filter. I say this because your starting point is Bose, whose result is for Knuth's version.
The popular "Bloom filter" is not, in fact, Bloom's filter. Their different constructions lead to different exact false positive rates, so I suspect you've failed to prove anything about Bloom's original paper.
Of course Bloom was wrong in any case. Grandi's "On the Analysis of Bloom Filters" (2017) is the paper to read there. It was the first to offer an exact account of the false positive rate for Bloom's construction.
> so I suspect you've failed to prove anything about Bloom's original paper
In Bloom's original paper "Space/Time Trade-offs in Hash Coding with Allowable Errors", Bloom proposes multiple variants of these filter structures, and it is true that these variants do have different behaviours.
However, the expressions cited as Bloom's bound (equations 16/17 in the paper) are specifically about a data structure (he refers to it as method 2) that works exactly like the standard definition of a Bloom filter (Knuth's version). In this sense, our result holds for Bloom's bound.
For reference, Bloom's description of method 2 is as follows:
> Method 2 completely gets away from the conventional concept of organizing the hash area into cells. The hash area is considered as N individual addressable bits, with addresses 0 through N - 1. It is assumed that all bits in the hash area are first set to 0. Next, each message in the set to be stored is hash coded into a number of distinct bit addresses, say a1, a2, ..., ad. Finally, all d bits addressed by a1 through ad are set to 1.
> To test a new message a sequence of d bit addresses, say a1, a2, ..., ad, is generated in the same manner as for storing a message. If all d bits are 1, the new message is accepted. If any of these bits is zero, the message is rejected.
Clearly this method describes the conventional version of the Bloom filter.
This is an easy mistake to make. Note the word distinct:
> each message in the set to be stored is hash coded into a number of distinct bit addresses
In Knuth's filter, the addresses aren't distinct. Bloom's construction requires that they are, thus the probability of a bit being set, from (16) in Bloom's paper, is 1−(1−k/m)^n, different from the 1−(1−1/m)^(kn) in your paper.
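For what it's worth, here's a quick numerical comparison of those two expressions (parameters picked arbitrarily, just to see how the exact values compare):

    # Probability that a given bit is set, under the two constructions discussed above.
    # Parameters are illustrative: m bits, k hashes per element, n inserted elements.
    m, k, n = 1024, 4, 100
    bloom_distinct = 1 - (1 - k / m) ** n        # Bloom's distinct-address construction
    knuth_overlap = 1 - (1 - 1 / m) ** (k * n)   # the usual (Knuth-style) construction
    print(bloom_distinct, knuth_overlap)         # close, but not identical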
Cuckoo hashing is commonly simplified in a similar way, indexing into a single table and allowing the hashes to overlap, unlike Pagh's original.
Anyway, from your paper:
> Bloom then claimed that the probability of a false positive was simply the probability of a single bit being set, raised to the power of k, reasoning that a false positive for an element y ∈ bf only occurs when all the k bits corresponding to the hash outputs are set.
> Unfortunately, as was later pointed out by Bose et al.[8], as the bits specified by f_1(x),...,f_k−1(x) may overlap, we cannot guarantee the independence that is required for any simple relation between the probabilities.
You're still right that Bloom is off, but it's not due to overlap in f_1(x),...,f_k−1(x), which his construction doesn't allow.
My broader point though was about your "guarantee that there are no further hidden errors", because a theorem prover is just a tool, tools are used by people, and people make mistakes.
> This is an easy mistake to make. Note the word distinct:
> each message in the set to be stored is hash coded into a number of distinct bit addresses
Good point, you are right that our correction does not address errors in Bloom's original definition, but rather in the definition of the Bloom filter that is typically used - I'll add a note to address this.
> My broader point though was about your "guarantee that there are no further hidden errors", because a theorem prover is just a tool, tools are used by people, and people make mistakes.
Yes, you are right, that's probably too strong a claim. My aim was to emphasize that there is more certainty in the proof, making it unlikely that there are further hidden errors, but that was not clear.
> the real problem was not the assumption being implicit/explicit but rather that one doesn't want to rely on it at all. Which again is an extra-logical issue.
In a strict sense, yes, the problem with Bloom's proof was due to an implicit unjustified assumption. However, the reason this assumption was implicitly introduced is that Bloom accidentally overlooked certain dependencies between the bits - the "real problem" in this sense is the fact that when reasoning about random algorithms, it is hard to keep track of all the various dependencies involved, and this can lead to incorrect proofs. This problem of tracking dependencies is a logical issue and one that is solved by proving the bound in a proof assistant.
That's neat! Is the deviation between Bloom's equation and the correct one relevant in practice? Typical Bloom filters have millions or even billions of bits and a few to a few dozen hash functions, so I think the probability that an entry gets hashed to the same bit more than once should be vanishingly small?
Really cool otherwise, I should really look into Coq more for my own work.
For most practical applications, the sizes involved are such that you're probably dealing with the asymptotic behaviors, so the differences will be negligible and either bound can be used.
In that sense, this work is really more theoretical than practical, but still, I think, a nice result.
> a transformation that would only be possible if the bits are independent, but nowhere in his proof does he state that he is making this assumption.
Am I misunderstanding what you and/or the article means by "independent"? The fact that indexes derived from (non-overlapping) parts of a hash function output are uniformly ((preferably cryptographically-secure-)pseudo-)random (and thus uncorrelated with each other and with indexes from other hash invocations) is what the phrase "hash function" means ("avalanche effect" if I remember my terminology correctly).
> The article doesn't say that the original proof was incorrect.
The article very clearly points out an error in the original proof.
The HN community can be so toxic sometimes. Inevitably whenever someone produces a new machine-checked proof, all these commenters come out of the woodwork explaining how assumptions/definitions are not verified and therefore the whole endeavor is somehow worthless.
A researcher being excited about their research is nothing to be ashamed about. I especially appreciate the fact that one of the authors took the time to make an accessible blog post to introduce us to their work.
> The article very clearly points out an error in the original proof.
We're using slightly different notions of "proof": I am talking about (1) the proof steps involved in proving a given, fixed statement. You are talking about (2) a wider notion that includes the formulation of the statement to be proved. Both of these are valid meanings for the word "proof". Coq can only check the details of sense 1, not the additional details of sense 2. The error in the original proof (sense 2) is in these additional parts, not in the proof according to sense 1.
> assumptions/definitions are not verified and therefore the whole endeavor is somehow worthless.
This is not at all what I have done.
> A researcher being excited about their research is nothing to be ashamed about.
Right. And the actual work done here is reason enough for excitement.
> I especially appreciate the fact that one of the authors took the time to make an accessible blog post to introduce us to their work.
So do I. But the fact that the target audience keeps discussing the title shows that the title was not appropriate for the target audience.
> I am talking about (1) the proof steps involved in proving a given, fixed statement. You are talking about (2) a wider notion that includes the formulation of the statement to be proved.
No, I am not. The original proof used an implicit assumption, which is a problem with the proof proper, not the formulation of the statement to be proved. Using a proof assistant guarantees that there are no implicit assumptions, which is one reason why this work is notable. It rules out an entire class of errors which were present in previous work.
>>> The article very clearly points out an error in the original proof.
You didn't say what error you meant here, but later you wrote this:
> The original proof used an implicit assumption, which is a problem with the proof proper
This is true, but this is not pointed out in the article at all, let alone "very clearly".
As established elsethread, the article's problem with the assumption is not it being implicit or explicit, the authors simply didn't want to have to use the assumption at all.
It's a nice accomplishment, but even the article footnotes note that getting the definitions right for the paper was as important as verifying the actual proofs. And a proof assistant cannot fully verify your definitions, so these checks are still largely left to human scrutiny.
Just to clarify, Bose et al.'s paper was incorrect due to an incorrect definition of Stirling numbers of the second kind (some of their expressions used the wrong exponents). In our certified work, in light of this error, we ended up deriving the expression for Stirling numbers of the second kind from first principles[1], eliminating it as a source of errors. As such, the main definition that needs to be human-verified in our work is the actual implementation of the Bloomfilter[2], which is fairly simple and easy to check as being correct.
> That's the highest degree of scrutiny that a proof can undergo.
No it isn't.
"Beware of bugs in the above code; I have only proved it correct, not tried it." -- Donald Knuth
The highest degree of scrutiny that a proof can undergo is to run the code and observe that it works.
It's just ('just'), that that's usually computationally intractable, because you're trying to prove things about the entire set of possible inputs, and that set is infinite or at least extremely large. So you have to settle for lower degrees of scrutiny, that can be more easily verified.
I thought the article had a very nice, succinct and clear explanation of bloom filters and wanted to say thanks to the author reading this thread.
A year or two ago when bloom filters became a recurring popular topic on HN I read a long illustrated medium post on them found via HN out of curiosity to add the concept and tool to my back pocket, and it all seemed complicated and the explanation didn't stick. Your explanation however was quick and made complete sense and I cannot forget it. Appreciate it! Thank you!
Agree, it almost highlights how bad others descriptions can be sometimes. I mean the article makes it feel like such a simple idea given how many words are spoken about it.
I wonder if the key is in quickly grounding the core abstractions with concrete meaning early on, that way your mind quickly has a model of "things" to operate on before attempting to build up the behaviour with more abstract description.
Many descriptions stay in the abstract too long without anything to attach it to for the uninitiated... in which case you either persevere and eventually it clicks and all the relationships fall into place - or you give up out of disinterest.
I've noticed this when explaining things to others, especially non-technical people, when explaining seemingly very simple things, attempting to describe them in multiple ways and failing, and then realizing they need clarification of the "what", after which explanation is easy - sometimes you are blind to it when you already know "what" and already have the mental model so you jump straight into how and why.
Yeah the visual really helps with explaining the concept. I read the wiki article on bloom filters first, which was great for understanding the motivation for a bloom filter, but I subtly mis-interpreted how the hash functions were used to set the bit-vector. The visuals in the linked post clarified this really succinctly.
I always thought the explanation of bloom filters was way overcomplicated. They are simply chaining hash tables, without the chain. Chaining hash tables first hash a value to an entry in an array, then walk down the chain at that entry to see if the value is actually there and disambiguate when different values hash to the same array entry. Bloom filters just get rid of that second step and leave it up to you to determine if the thing you are looking for is "really" there.
Of course, that assumes familiarity with chaining hash tables, but I feel like every CS program goes over hash table implementations, but bloom filters weren't something I learned about until I left college.
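To make that concrete, here's a rough sketch of the idea in Python (purely illustrative; the k index functions are derived by salting a single SHA-256 hash, which is just one convenient choice, not anything prescribed by the article):

    import hashlib

    class BloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = bytearray(m)        # one byte per "bit", for clarity

        def _indices(self, item):
            # Like a chaining hash table's bucket index, but k of them and no chain.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for idx in self._indices(item):
                self.bits[idx] = 1

        def might_contain(self, item):
            # True may be a false positive; False is always definitive.
            return all(self.bits[idx] for idx in self._indices(item))

    bf = BloomFilter(m=1024, k=3)
    bf.add("hello")
    print(bf.might_contain("hello"))    # True
    print(bf.might_contain("world"))    # almost certainly False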
A bit tangential, but I suspect many people don't know that there are actually more information-dense data structures for the use case of bloom filters; Cuckoo filters for example can get close to the theoretical lower bound of information required and have some other interesting properties. So you should consider that before reaching for bloom filters, probably!
Thanks, that’s even more interesting. I thought cuckoo filters were near the theoretical limit of information density so it looks like I have some more learning to do.
Bloom filters have no false negatives, they have only false positives. If it were the other way round, they would be of no practical use as you'd have to query the original datastructure in any case:
- If the bloom filter said "True", then you go ahead and fetch the data from the original structure
- If the bloom filter said "False" but that might be a false negative then you'd have to anyway query the original structure to be sure
With 0% false negatives but a relatively small rate of false positives instead, you don't have to query the original source if the filter gave a "False".
Btw. Bloom filters were one of the first probabilistic data structures with real practical uses and still to this day are widely used. They are still not out-classed in every way by newer algorithms like Cuckoo filters.
Of course that's purely a matter of perspective. If you regard a bloom filter as an approximation of inclusion in a set, it's a false positive, which is how we usually look at it. But if you look at it as an approximation of safe urls e.g. in a browser it's a false negative (and there are no false positives). The data structure doesn't care; it's just bits: if the overall set is finite, it's even quite hard to tell which makes most sense, but in the more usual practically unbounded case, sure, finite subset inclusion approximation seems a less braintwisting interpretation.
It's an interesting thought but I'd argue the following.
First, the definition of Bloom filters is a probabilistic data structure to test the presence of an element in a set no matter the program logic around it.
But secondly, given your example of safe browsing, you can flip positive and negative meanings around freely by how you define the question. Is it "is this a bad domain?" or "is this a good domain?", both of which you can convert to "is this domain in the set?" and "is this domain NOT in the set?".
And that's a general problem with "positives" and "negatives", they depend purely on the question and the question can have a boolean negator in it. But that's a logical layer above the data structure. As you said, the data structure does not care, it is just bits and therefore the question is "is the element in the set?".
Same as one could ask "Is it day?" or "Is it night?" which both query the same underlying data about the time and location but would have opposite meanings for positive and negative results.
And so, to give the two terms a better meaning we have a definition of the data structure and the operations on it which defines the meanings of negatives and positives.
I think some part of the reason for confusion with false negatives and positives results from the connotations of "positive" (good) and "negative" (bad).
To clarify why this is the case: bloom filters effectively group everything in set X that you want to check for into a much smaller "approximated set" (the bit-vector). You can then check elements of set Y against the approximated set as a much faster test of whether they are in set X. The important detail here is that the approximated set will be positive for anything that matches an item in set X plus some additional items that are not in set X.
Right. Another way to think about bloom filters is a data structure that remembers a list of hashes. It will tell you "oh I saw this hash before". What it will NOT tell you is "oh I saw this element before".
If "Foo" and "Bar" hash to "XX" and you've shown "Foo" to the filter, it will think it saw "Bar" because what you are really asking is if it saw something that hashes to "XX".
That's the false positive.
On the flip side if "Baz" hashes to "XY" then the filter would tell you with certainty that it never saw "Baz" because it never saw anything hashing to "XY" and so "Baz" could not have been seen either.
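A toy demonstration of that, with a deliberately tiny hash space so a collision is guaranteed (illustrative only):

    import hashlib

    def tiny_hash(s, space=4):
        # A real hash truncated to a 4-slot space, purely to force collisions for the demo.
        return int(hashlib.md5(s.encode()).hexdigest(), 16) % space

    seen = {}
    for word in ["Foo", "Bar", "Baz", "Qux", "Quux", "Corge"]:
        h = tiny_hash(word)
        if h in seen:
            print(f"{word!r} hashes like {seen[h]!r}: the filter would report a false positive")
        else:
            seen[h] = word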
I like your explanation a little better than mine because it is more true to the actual behaviour of the bloom filter. Mine maybe simplifies more than is needed for a clear explanation.
“To be fair, the asymptotic behaviour of Bloom's original bound is consistent with this updated definition, so the impact is more on an issue of pedantry rather than for practical applications.”
Would love to see a table with computed numbers comparing the rates. It’s hard for me to understand the behavior of that second result
This paper[1], which corrected Bose et al.'s original derivation, has more discussion about the impact of the second result - in particular, Figure 2 (on page 13) has a comparison of the relative error of the old and new bounds against the empirically calculated rate.
It bears mentioning that not only cuckoo filters but also simply vanilla linear probed hash tables of B-bit truncated hashes (sometimes called fingerprints) can be a better substitute for Bloom filters. This seems a highly under propagated fact. "high p" (order 10%) numerical examples are often used to sell an idea less valuable at small p.
An (asymptotic, for p <=~ 10%) back of the envelope formula is that such a hash table/set of fingerprints takes up a factor of about (1+log_{1/p}(N)) more space than a Bloom filter. It is not hard to derive this. Unlike the incredibly precise Coq formula proof theme, this is all approximate, but more engineering-relevant.
If you were targeting p=0.001 to have a small mistake rate, 1 + log_1000(N) is pretty small (say <~ 1+3=4 for N <~ 1e9 elements). While it does use only 25% of the space, this Bloom filter would require many more (-log_2(p) =~ 10) probes while the LP hash table would only hit the DIMMs once. Many, but not all, might view a 10x latency reduction as worth 4x the space in the game of space-speed trade-offs.
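Plugging those same illustrative numbers into the envelope formula:

    from math import log, log2

    p, N = 0.001, 1e9
    space_factor = 1 + log(N) / log(1 / p)   # ~4x more space for the fingerprint table
    bloom_probes = -log2(p)                  # ~10 probes for the Bloom filter
    print(round(space_factor, 2), round(bloom_probes, 1))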
I should also have said that this competing idea was raised literally in the very same Burton Bloom 1970 paper that the more famous filters come from.
Analysis of speed these days (where a single main memory hit is thousands of superscalar dynamic instructions) is tilted differently than it was in 1969. Still, even back then Bloom's own original paper had a footnote qualifying his superiority conclusion as dependent upon memory system assumptions. Beats me how this gets lost. Call it "The Bloom filter mystique".
The headline and the entire intro are clickbaity. While this is interesting from a theoretical math perspective, it is completely irrelevant in practical use of bloom filters. Typical filters will have a much larger number of bits than the number of hash functions used, making the probability of overlapping bits vanishingly small, and thus the original wrong calculation is a perfectly fine approximation.
So none of this matters for malicious site filters, or anything like that. And the reason why it took 30 years to notice the original problem is that, if you just test the false positive rate for typical Bloom filters empirically, you'd come within error margins of the "wrong" result anyway. It's not like people trusted critically bad math for 30 years. The bad math was formally speaking wrong, but close enough to the right result for nobody to notice nor care.
I know mathematicians have to sell their research, but sorry, with this one you didn't advance the state of the art of practical Bloom filter usage, just the state of the art of formal proofs :-)
For Bloom filters specifically I preferred last year's "On Occupancy Moments and Bloom Filter Efficiency". You wouldn't guess from the title or abstract, but it's fairly comprehensive, including proofs of the expected false positive rates.
I don't know enough about bloom filters or the math involved to say whether or not you are correct.
But what you're saying sounds a lot like many proponents of various flavors of NoSQL, Eventual Consistency, etc. I.e., it sounds like you are saying that lots of people have been using it for a long time, and it's just good enough. The actual correctness isn't really that big a deal.
I might be crossing domain boundaries here and thinking up stuff that really doesn't matter. I mean, bloom filters have a known challenge. No one was ever supposed to use them except for statistical purposes anyway, and with a known error rate, it's fine to add into your models.
But there's also something that feels a little weird about your point. It sounds like you're saying it's good enough for X, Y, and Z use cases; therefore it doesn't really matter how technically wrong it is.
Your comparison isn't really appropriate. Bloom filters work, they worked before this proof and they work the same way after the proof. This result isn't about how correct Bloom filters are or aren't. Their caveats are understood, and even using the "bad" math from the original paper, their performance in practical scenarios could be estimated with enough accuracy to use them in practical applications. I dare say not a single user of bloom filters will change anything about their implementation stemming from this result.
For example, if you size a bloom filter a certain way, the "bad" math might tell you your false positive rate is 0.001%, while the "good" math might tell you your false positive rate is 0.001002%. It makes no difference. The error is orders of magnitude smaller than the number you get anyway. (I made those numbers up, but I've used Bloom filters and they should be in the ballpark for the sizes I've worked with). The bad math might be strictly speaking incorrect, but it's a good enough approximation for all practical purposes.
This is different from Eventual Consistency stuff, which has real practical implications from not having certain guarantees. Those limitations are real, and they have real consequences, not just a rounding error in a number.
The difference between approximate and exact false positive rate calculations in real-world scenarios is insignificant. It's fine to use the approximation.
If you had very specific tolerances here you wouldn't work in terms of the expected false positive rate anyway; you'd build your filter and validate on the real thing.
Yes, that's a fair comment. I took some creative liberties with the title to try and make this theoretical result more relatable to the average reader, but it's possible I may have gone too far.
You went too far. Your title is wrong in the same way your research shows bloom filters are wrong. You chose relatable over correct. By getting the obvious stuff wrong right at the top, you shatter readers' trust that you got the non obvious stuff right.
Thanks for the feedback - I figured presenting the exact Coq proofs directly would probably be too dry, and thus was trying to give some context for the work to make the presentation more interesting. In hindsight, I can see how this has been detrimental to some other parts of the article.
If you look at the related work section of the paper, we actually present a multitude of papers in the literature (even some as recent as 2019) that still incorrectly refer to Bloom's expression as an exact bound, so I thought at the time that the "debunking" title was not inaccurate.
I'll keep your advice in mind the next time I write an article about research work.
Yes absolutely too far: clickbait and that's unfortunate because it otherwise looks like a really well written article. It completely ignores all the work that happened in those 30 years before this Coq proof which only replicates (with more trustable steps) other work.
As someone totally unaware of bloom filters, I appreciate the click bait title because it made me aware of them, and then almost immediately understand what they are thanks to the clear explanation and visuals :)
When I was attempting to prove Bloom's original incorrect bound, the work never progressed to the point where I was actively working directly on proving his bound - I managed to prove some intermediate theorems, but was unable to work out a way to compose them. The issue ended up being that I was unable to derive the independence required to prove the inductive step.
Note that you don't really "get <a proof assistant> to refuse to accept" an incorrect result, you just fail to construct a proof for it, as the author talks about in their reply.
What you can do, sometimes, is succeed in proving the negation of the incorrect result.
Yes and it’s even worse than most clickbait, because the way it accomplishes its attention grab is to misleadingly throw shade on bloom filters themselves, which the content in no way justifies. Leaves me feeling a little sick to my stomach. Ugh.
Then again, HN doesn't auto-collect, it's all user-submitted, which makes "click bait" basically irrelevant: a real user thought this was worth sharing, and enough folks agreed to get it on the front page?
'Clickbait' can still be submitted by real users and upvoted, though. If the title is a tease that forces you to click on the link to resolve it, it's clickbait.
You could just keep hashing until you have enough non-colliding outputs, throwing away any collisions. This is sound as far as I can tell.
Relatedly, mapping real hash functions to bits is also nontrivial to get 100% mathematically correct. As far as I know, the only sound way for hash functions with a fixed number of output bits and a non-power-of-two bloom filter is to truncate to the next power of two, then throw away any results that overflow the filter size and keep trying with new hashes (or repeated hashing). This is thus easy to integrate with avoiding duplicate bits, since you're retrying anyway.
In practice, none of this matters, you can just take a hash output with enough extra bits modulo the Bloom filter size and call it a day. It'll be close enough to uniformly random and non-colliding anyway, for practical filters.
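For what it's worth, a rough sketch of the reject-and-retry scheme described above (illustrative only; the salted SHA-256 re-hashing just stands in for "keep trying with new hashes"):

    import hashlib

    def distinct_indices(key, k, m):
        # Requires k <= m. Truncate to a power-of-two range, then reject values that
        # overflow the filter size or duplicate an earlier index.
        mask = 1
        while mask < m:                 # smallest power of two >= m
            mask *= 2
        indices, counter = set(), 0
        while len(indices) < k:
            digest = hashlib.sha256(f"{key}:{counter}".encode()).digest()
            candidate = int.from_bytes(digest[:8], "big") % mask   # unbiased over [0, mask)
            counter += 1
            if candidate < m and candidate not in indices:
                indices.add(candidate)
        return sorted(indices)

    print(distinct_indices("example", k=3, m=1000))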
It's not too difficult: hash function 1 generates an integer from 0 to n-1, that index element is deleted, then hash function 2 generates an integer from 0 to n-2, that element is deleted etc.
This is an O(n^2) algorithm, which you can improve to O(nlgn) using fenwick trees.
In practice however, it is much quicker to generate random hashes and repeat until they're distinct.
In k^2 time, which is maybe faster than running a hash an additional time:
    function get_bits(input, hashes, n)
      local bits = {}
      for i=1,#hashes do
        -- Pick a slot among the n-i+1 positions not already taken by earlier bits.
        local bit = hashes[i](input) % (n-i+1)
        -- Shift past each already-chosen bit (kept sorted below) that is <= it.
        for j=1,i-1 do
          if bits[j] <= bit then
            bit = bit + 1
          end
        end
        bits[i] = bit
        -- Insertion-sort the new bit into place so the list stays sorted.
        while i > 1 and bits[i] < bits[i-1] do
          bits[i], bits[i-1] = bits[i-1], bits[i]
          i = i - 1
        end
      end
      return bits
    end
Your % operation is not uniformly distributed if the hash domain isn't an integer multiple of the output size, which is a bigger problem for formal correctness than the colliding output issue (hashes are not uniformly distributed).
You need to truncate to an even number of bits and retry until the output is < n, at which point you're retrying anyway, so you can just retry on collision instead of having all that complicated logic to skip bits.
Nice try though, but if you want formally perfect results like the OP you have to try harder :-) (except real hash functions are only assumed to have perfectly distributed outputs anyway, that is not proven and probably not provable, so basically you're screwed either way and none of this matters :-) ).
Oh, you're right. The beginning of the original derivation assumes that the bits chosen by each hash function for a single input are independent (so it does not always produce k distinct bits) but the end assumes that it does produce k distinct bits. So it would still be incorrect.
I expect that you would get a lower false-positive rate, though, if inputs could not randomly end up querying fewer than k bits of the filter!
There are variants of the Bloomfilter that do try to do this - one way is to have each hash function map to a separate distinct subsequence of the bitvector.
It does indeed reduce the false positive rate, but comes at the cost of increased space usage. As always, utility of this modification would depend on where you wanted to balance space-vs-accuracy constraints.
I would hope not to use any additional space. A naive approach is to choose 1 bit out of N from the first hash, 1 out of the remaining N-1 from the second hash, and 1 out of the remaining N-2 from the third, and so on.
That would fix the final independence problem, but it would also require further changes to other stages of the proof.
For example, this strategy would then mean that when calculating the probability of a single bit being set, the hash outcomes are no longer independent, which means that a different expression would be needed.
Additionally, from a practical sense, this variation might also be more costly to execute, as setting bits would go from a single memory access to a linear scan.
I think it might be interesting to look into though.
The original false positive rate approximation can be proved (correctly) using martingale arguments as done by Mitzenmacher and Upfal in 2005. The Wikipedia page also shows this version.
Ah, that's a new change to the Wikipedia page - when I started work on this proof back in December, it still had the incorrect derivation.
Additionally, Bloom's original bound is given (and typically quoted) as an exact expression for the false positive rate, so while it may be correct as an approximation, I'd say it's fair to say that the original bound is wrong.
>"Using a probabilistic data structure known as a Bloomfilter, Browsers maintain a approximate representation of the set of known malicious URLs locally. By querying this space-efficient local set, browsers will only send up a small proportion of URLs that have a high likelihood of actually being malicious."
Surely the original 'false positive rate' is only wrong if the hash function is 'bad', and thus Bloom wasn't 'incorrect' as much as assuming a perfect hash function.
I don't think that would solve this issue - a perfect hash function is guaranteed to not have any collisions for any element in some predefined set. What Bloom's proof requires is that all of the k hash functions should not have any collision for any input that is inserted into the Bloom filter, which is not covered by just having each function alone be perfect. That aside, Bloom does not make any assumptions about the chosen hash functions being perfect or not.
In the extreme case, imagine that your k hash functions are nearly identical: hash function `H_i` differs from `H_1` only in that the outputs for the `1`st and `i`th elements in the input space are swapped. Of the `N` elements in the input space, all but `k` will completely collide.
You're confusing statistically uniform with cryptanalytically secure. By that logic you can 'debunk' plain old hash tables because someone might feed you keys that all hash to the same slot.
Here’s a derivation of the formula Bose et al give for the false positive probability:
To calculate the probability, we will work out the number of ways to assign kl hashes to m bits (i.e. functions from a set of size kl to a set of size m), and for each of those ways we will work out how many ways we could assign k hashes to the bits which are set (i.e. ways to get a false positive). We then count the number of possible ways to assign our kl hashes of existing elements and k hashes of the tested element to m bits and divide the former by the latter. For the argument to be valid, each assignment of hashes to bits must have equal probability, which is true if the hashes are independent.
The simple way to count the number of assignments of kl hashes to m bits is easy: for each hash there are m possible bits so we get:
m^(kl)
Similarly for kl + k hashes:
m^(k(l + 1))
Now we will break this count up by the number of bits which are set. Suppose i bits are set. Then the number of possibilities for those set bits of the m total bits is (m choose i). And the number of ways the kl hashes could be assigned to the i bits is equal to the number of surjections from a set of kl hashes to a set of i bits, which is i!{kl; i}, where {s; t} is the Stirling number of the second kind, the number of ways to partition s labelled objects into t unlabelled non-empty partitions [the author’s paper claims this is the number of surjections which is slightly wrong]. This gives the number of assignments given exactly i set bits as:
(m choose i) i! {kl; i}
And the total as:
m^(kl) = Sum_(i = 0)^m (m choose i) i! {kl; i}
Given exactly i bits are set, how many ways can we assign k hashes to those i bits? Easy: i^k. So the number of lists of kl + k hashes (integers from 1 to m) such that the last k all appear in the earlier list of kl is:
Sum_i i^k (m choose i) i! {kl; i}
Finally we divide by the number of possible lists, m^(k(l+1)), to get the probability given by Bose et al (this is valid because the hashes are iid so each list has equal probability of occurring)
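If anyone wants to play with the numbers, here's a small numerical sketch of that expression against Bloom's approximation (parameters are arbitrary small values so the sum stays cheap; this is only a sanity check, not the authors' Coq development):

    from math import comb, factorial
    from fractions import Fraction

    def stirling2(n, r):
        # Stirling numbers of the second kind via the usual recurrence.
        S = [[0] * (r + 1) for _ in range(n + 1)]
        S[0][0] = 1
        for i in range(1, n + 1):
            for j in range(1, r + 1):
                S[i][j] = j * S[i - 1][j] + S[i - 1][j - 1]
        return S[n][r]

    def exact_fp(m, k, l):
        # Sum over i = number of set bits, as derived above.
        total = sum(i**k * comb(m, i) * factorial(i) * stirling2(k * l, i)
                    for i in range(1, m + 1))
        return Fraction(total, m**(k * (l + 1)))

    def bloom_fp(m, k, l):
        # Bloom's original (approximate) expression.
        return (1 - (1 - 1 / m) ** (k * l)) ** k

    m, k, l = 32, 3, 5
    print(float(exact_fp(m, k, l)), bloom_fp(m, k, l))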
Their prose description of a bloom filter firstly has a significant flaw, and secondly falls victim to what I believe is a fallacy in many discussions about bloom filters.
The flaw is that they do not specify that the size of the bit array should be a prime number. This omission alone is astounding. To be fair, they state that they assume the bits from the hash functions are randomly distributed over the bit vector, so with this assumption they skate past this issue even though they have apparently missed that crucial detail of how it is accomplished. One wonders if they were unaware.
The fallacy imho, but this is where I depart from the community, thus imho, so take me with a grain of salt if you wish, is that you don’t need multiple independent hash functions. You just need multiple inputs. For example instead of hashing the word “salad” three times, just hash the tokens salad1, salad2, and salad3. If your hash function is worth its salt (npi) then you will be just fine.
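Concretely, something like this, as a rough sketch (the choice of SHA-256 and the modulo reduction are just for illustration):

    import hashlib

    def salted_indices(key, k, m):
        # Derive k index functions from one underlying hash by appending a counter.
        return [int(hashlib.sha256(f"{key}{i}".encode()).hexdigest(), 16) % m
                for i in range(1, k + 1)]

    print(salted_indices("salad", k=3, m=1024))   # three indices from a single hash function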
This is a good point about practical implementations of hashing based randomization - incorrect table sizes can invalidate assumptions about uniformity and cause guarantees to fail. However, this work was primarily about the theoretical analysis of these data structures, which presupposes that we already have a means of generating uniformly randomly distributed bits over the table space, hence the lack of discussion about table sizes.
What if a distinct bit vector is used for each hash function? Then isn't the original false positive rate correct?
Can you use less storage by having distinct bit vectors of each hash? This seems like a natural question once we know the false positive rates are different.
Maybe Bloom was misinterpreted and right all along?
Good point, you are right that our correction does not address errors in Bloom's original definition of a Bloom filter - which uses distinct bit addresses - but rather in the definition of the Bloom filter that is typically used (and referenced) in practice and in the literature (Knuth's version), which functions in the way described in the article. This version is easier to implement as the hash functions are independent, but its correctness proof does use incorrect reasoning. It's wrong to attribute this to Bloom, so I've added a note to address this.
I'm curious about the properties of the new approximate membership algorithms you discovered as part of this research. Are they better?
"We instantiated this interface with each of the previously defined AMQ structures, obtaining the Blocked Bloom filters, Counting Blocked Bloom filters and
Blocked Quotient filter along with proofs of similar properties for them, for free."
So it sounds like the new AMQ algorithms you allude to are blocked (more cache friendly) variants of existing AMQ algorithms. Are the bounds you proved all good in some sense? Do you know yet if they're actually faster in practice on real hardware, or do optimized implementations still need to be written?
The corresponding bounds for the variant structures we construct do result in lower false-positive rates than a standard Bloomfilter - so, in that sense, you could say that they are better. However, they also require more space, striking a slightly different theoretical trade-off between space and accuracy.
For practical purposes, you would have to take the effect of caches into account, and the performance may vary depending on the particular choice of hardware. Our work stuck mainly to the theoretical side, so we didn't do any empirical testing of these new data structures. I guess the jury is still out on whether these variants are actually better in practice than the existing ones.
This was very accessible! Thank you for writing it.
There were a couple of typos - might be worth getting someone to proof read it... (I am out atm so don't have a good way of laying out suggestions for corrections)
Thanks for the feedback. My main aim with this post was just to provide a slightly more accessible introduction to the full work, so I'm glad that it was successful in that sense.
It was mostly just a quick transcription of the corresponding presentation for the paper, so some typos may have crept in. I'll do a second pass and try and fix that later today.
That's mostly a really nice explanation, the one thing that bugged me (which is my pet peeve in math-y CS papers) is a not-perfectly-clear explanation of your variables. The ones in the equations interspersed with the diagrams of the original derivation are easy enough to pick up from the diagrams, but you switched from $n$ to $l$ for the number of inputs in the correct expression (and the Bose paper also used $n$, so I'm not sure why).
I think that was actually a mistake in my writeup, I just copied the latex from the paper without adjusting it to the notations used in the article. I'll make sure to fix it.
I didn't realize Bloom filters were used in this way. Thinking adversarially for a moment, doesn't this provide an easy way to get a target website marked as a false positive?
I can't follow all the maths but the introduction is surely wrong based on simple probability alone.
It's claimed that the URLs which browsers send up (having been diagnosed positive by the test) will ”have a high likelihood of actually being malicious", but that by no means follows from the test's low false positive rate. You need to consider the background rate.
Just like in the classic example of a positive diagnosis from a low false positive test for a rare disease.
I think in this case the fact that there are no false negatives means that the low false positive rate is enough to infer that a positive result implies high likelihood of actually being malicious.
You can reason roughly as follows:
P[ pos | mal ] = 1 (no false negatives)
= P[ pos /\ mal ] = P[ mal ] (Bayes)
A low false positive rate means that:
P[ pos | ¬ mal ] ~= 0 (low false positive rate)
P[ pos /\ ¬ mal ] / P[ ¬ mal ] ~= 0
(P[ pos ] - P[ pos /\ mal])/(1 - P[mal]) ~= 0
(P[ pos ] - P[ mal ])/(1 - P[ mal ]) ~= 0
From this fraction we can conclude:
P[ pos ] - P[ mal ] ~= 0
Returning to the likelihood of being malicious given a positive result:
P[ mal | pos ]
= P[ mal /\ pos ] / P[pos]
= P[ mal ] / P[ pos ]
~= 1.0
False negative rate doesn't matter. A low false positive rate can be much larger than the true positive rate.
Imagine an extremely rare toxin that always turns your skin blue. Only 100 people in the world have it. But 1000 people are wearing blue face paint at any given time. A diagnostic test "are you blue?" would have no false negatives and very low false positives for the toxin, just like a Bloom filter. But a positive test would not mean highly likely to have the toxin; it would mean highly likely to be wearing face paint.
Say we have a million urls, and a thousand of them are malicious. Our filter returns a positive result for all the malicious 1000 (no false negatives) and for the safe 999000 urls only 1% will return positive (low false positive), but that's still 9900 false positives. So a positive result only has a (1000 / (1000 + 9900)), i.e. 9%, chance of actually being malicious.
Even with a false positive result of only 0.1%, the probability of a positive result actually being malicious only rises to 50%, so still not in "high likelihood" territory.
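A quick check of those numbers (same illustrative figures):

    def posterior(p_malicious, false_positive_rate):
        # P[malicious | positive], assuming no false negatives.
        p_positive = p_malicious + false_positive_rate * (1 - p_malicious)
        return p_malicious / p_positive

    print(posterior(1_000 / 1_000_000, 0.01))    # ~0.09
    print(posterior(1_000 / 1_000_000, 0.001))   # ~0.50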
Nice example - you were right, my wording on that section was incorrect. sukilot quite nicely presents the error in my derivation - I guess I probably shouldn't try coming up with proofs on the spot - especially if they're not machine checked.
I think what I probably should have said instead was that the low false positive rate means that only a small proportion of the honest URLs will be sent up.
> Conversely, sending every URL that a user visits to some external service, where it could be logged and data-mined by nefarious third parties (i.e Google)
This suggests an interesting question to put on an exam in whatever class teaches about Bloom filters.
----------------
Q: It is proposed to make the local DNS resolver handle IPv4 by using 64 Bloom filters, BO1, BZ1, BO2, BZ2, ... BO32, BZ32.
Filter BOn answers the question "is bit n of the IP address a 1?", and BZn answers the question "is bit n of the IP address a 0?".
Your resolver checks all these Bloom filters. If BOn return "no", then it knows bit n of the address is 0. If BZn returns "no", then it is knows bit n of the address is a 1. Only if BOn and BZn both return "maybe" for some n must the resolver actually do a DNS query over the internet.
Explain why this proposed resolver would not be useful.
These would have to be pretty large and updated regularly. Why not use a k-anonymity scheme? A decent amount of password managers use that to see if the password a user generated is unique.
It's a shame that the OP went to huge effort to make a mathematically perfect proof, and then wrote such a deceptive article about it.
It's an ironic demonstration that we shouldn't trust prose. The author's implied thesis is that papers are worthless and only code matters, which applies to the author's paper too!
Apologies if the article came off as deceptive, my intentions were not to try and mislead anyone. While the result may not have practical implications, I don't think that the "debunked" part of the story is incorrect.
To reiterate a point I made in an earlier response: In the paper, we actually present a large number of other papers in the literature (even some as recent as 2019) that still incorrectly refer to Bloom's expression as an exact bound, so I do think that this is important and somewhat justifies the debunked narrative.
You should remove the unwarranted "nefarious" slam. It's simply incorrect; the actual reason the browser does not send the URL to a central service is that users' expectations of privacy do not allow their browsing history to be logged like that, even if the purpose is for malware protection. They have not given proper informed consent, and fortunately don't need to in order to detect malware.
From "Google Chrome Privacy Whitepaper":
Chrome checks the URL of each site you visit or file you download against this local list. If you navigate to a URL that appears on the list, Chrome sends a partial URL fingerprint (the first 32 bits of a SHA-256 hash of the URL) to Google for verification that the URL is indeed dangerous. Chrome also sends a partial URL fingerprint when a site requests a potentially dangerous permission, so that Google can protect you if the site is malicious. Google cannot determine the actual URL from this information.
Good point, my inclusion of Google was mostly in jest, but in hindsight, it comes off as misleading, as actual browsers don't function in the simplified way presented in the article. I will remove the reference to Google.
The theory of hashing never matched up well with how I've experienced it in the real world. For example almost all analysis of the hash table (for proving stuff about load factor, chain length, or probe clustering sizes, etc) starts off with a "uniform hashing" assumption. But that assumption has basically never held true given how languages define their default hashes.
This is interesting stuff! It's especially nice to see some more details behind the math -- I did some stuff with bloom filters for work and found that the math didn't always seem to line up with what we'd expect during tests. I wonder if it needs some adjustment...
The new formula doesn't appear in the Wikipedia article for Bloom Filter, although it does say that the old formula is only an approximation because of the incorrect independence assumption.