Hacker News new | past | comments | ask | show | jobs | submit login
A visual interactive guide to Bloom filters (samwho.dev)
286 points by flyingsky 11 months ago | hide | past | favorite | 54 comments



Sam's visual tutorials are as good as ever (doggos are very important, by the way).

And if you want to try some code, I've prepared three toy Bloom filter implementations to accompany the article:

JavaScript: https://codapi.org/embed/?engine=browser&sandbox=javascript&...

Python: https://codapi.org/embed/?sandbox=python&src=gist:e7bde93f98...

Go: https://codapi.org/embed/?sandbox=go&src=gist:e7bde93f98c5e4...


I'm the author of this post, and am happy to answer any questions. :)


I really like this husky graphic, did you create it?

https://samwho.dev/images/sage-warning.png


I didn't, it is AI generated. I wasn't sure if I'd re-use them after the Memory Allocation post, but they've proven to work really well and people like them. I've been considering getting proper, consistent sets commissioned. One of the things I want to do in future is make it so you can pet them, which could require some animation work. Open to artists reaching out to me about this. :)


DALL-E 3? I've seen some similar images out of it while trying for more pointilist views. (DALL-E 3 has a certain obsession with regularity & centering.)


Midjourney! The prompt for the puppy with his head cocked to one side was: “playful happy cartoon huskie puppy asking an insightful question looking at the camera, question mark above, muted colors, block print.” Never could convince the model to give me that question mark.


Ah. A little surprising, MJ's tend to be messier and more textured.

But yeah, I too know the sorrow of getting MJ to generate specific text in images - funnily enough, I too wanted a question-mark for my cat image (for https://gwern.net/review/cat thumbnail ) and I just couldn't get MJv5 to do it and had to settle for making the entire cat a question-mark: https://gwern.net/doc/cat/psychology/2023-11-04-gwern-midjou... It looks cool and I'm satisfied with it, but it wasn't what I had been going after.

In MJ's defense, MJv6 (c. January 2024) is considerably better at text, and you could probably get that question-mark now with some prompting & maybe region-inpainting.


article about a related app - that some guy created in 30 mins to allow people to "pet" virtual animals: https://www.theguardian.com/games/2024/feb/15/stroke-of-geni...


What do you use to create the graphs?


https://pixijs.com/ and https://gsap.com/. All of the source code for my posts can be found at https://github.com/samwho/visualisations :)


Thanks to the author for all the hard work. Its easy to consume and understand for beginners too.


This is really reassuring to hear, thank you! I make sure my posts are reviewed by people with a range of experience and familiarity, without them the posts wouldn't be as accessible as they are. I'm glad this is working. :)


Hey Sam, how did you create the charts with the slider? I really like those!


They’re also using PixiJS. The code for them is here: https://github.com/samwho/visualisations/blob/main/bloom-fil...

Glad you like them, they were hard to get right :D


I remember discussing bloom and cuckoo filters a while back and someone linked.to.great papers from late 2010s with much more advanced probabilistic structures that were handling modification much better or had other interesting properties. Anyone here knows where to look for these? I didn't take notes...


Not sure, but I recently came across xorsat filters (https://eecs.ceas.uc.edu/~weaversa/XORSATFilters.pdf). It's not a great replacement for bloom filters in many use cases because they are not modifiable but have some other interesting properties:

1. They are very close to the information theoretic minimum needed to encode the information so are more space-efficient than bloom filters or cuckoo filters.

2. They have a natural way to store metadata (at the cost of additional space of course) so you can associate a few bits of metadata with each set bit at pretty reasonable space cost.


Interesting! 2 makes me wonder if it could work as a very bad but space efficient key-value store ;D


Maybe quotient filters?


Yes, that's one of the bloom successors I was looking for. Thanks.

Link for people reading this: https://systemdesign.one/quotient-filter-explained/

I remember there's more obscure newer stuff someone showed me in 2019 tho.


What are the business use cases we use bloom filters? Some examples would be great. Thanks!


The first I heard of a Bloom filter was when I was tasked with implementing one in iBooks. I used a Bloom filter for search.

My recollection is a little fuzzy now but as I recall, as we parsed each pagination of the ePub, we would hash or index (the first three characters of?) every word on the page. Each page got its own bitfield or some such structure to hold these hashes.

When a user typed a word to search we could potentially eliminate a large percentage of the pages of the ePub. If a page had no hash matches to the hashed search term then we knew the word was not contained on the page at all.

A best-case example might be searching for "Jabberwocky" in the book "Through the Looking Glass". Very likely all but one page of the ePub will fall away.

As is pointed out in the article, if we did get a hash match for a page it was not a guarantee the word was there — but then we would drop down to a more standard string compare to find not only if the word truly existed on the page but where it was as well (or if it existed in multiple places on the page).

The worst case was of course entering a search term like the word "the" where potentially every page would have to be exhaustively string-compared. But that kind of falls in the lap of the user — why did you search for "the"?


This is really interesting, thanks for sharing! I'd imagine the three-character trick would have failed on programming-based ebooks, right? With all those underscores and stuff. Do you remember the preprocessing steps you took for this?

On a side note, I just spent like an hour on your blog. Fascinating stuff!


Thanks. No, I don't remember the preprocessing.

WebKit, at the time, was not very good at pagination, something ePub requires. iBooks work resulted in a number of "Radars" filed against WebKit with regard to pagination. As a side effect I suspect printing a web page from Safari (who does that?) is markedly better since.


One example: I personally used it to deduplicate recently viewed ads in an app, to avoid spamming users with the same ad too much. One interesting challenge there was that we needed to make sure that the client and server encode and decode the filter exactly the same, despite being implemented in different languages.


Ooh that sounds fun.

Something I spent time thinking about, but wasn’t able to find a huge amount of information on, is how you could use compression alongside bloom filters. You could make enormous bloom filters that make use of run length encoding or sparse bitmaps. You sacrifice insert and lookup speed but you could make enormous bloom filters this way.


Bloom Filters are already lossy compression, so depending on how sparse they are I'm not sure you'll get too much benefit out of it, but maybe I'm just thinking of much, much smaller filters.

BTW we ended up open sourcing that BF library that encodes and decodes filters in multiple languages, the company has been out of business for nearly a decade but the project is still out there https://github.com/EverythingMe/inbloom


I'm not sure if it's a big deal for business, but for my bachelor's thesis, we showed how we could use a type of counting bloom filters (mentioned in the blog) known as Spectral Bloom Filters, for client-side searching on static websites. Basically, it means doing all the searching and ranking directly on the client side, without needing to send any requests to the server.

https://pncnmnp.github.io/blogs/spectral-bloom-filters.html


Wow, this is extremely cool! Thank you for sharing.


It's used a lot in databases for query processing. Specifically, if you're performing a hash join, probing the hash table is more expensive than probing the Bloom filter. If you are pretty sure you're not going to find very many matches in the hash table (i.e. the join is very selective), you can skip going into the hash table by first checking the Bloom filter.


the article provide 2 real world examples at Akamai and Google


On top of those, the Wikipedia page also has a list of examples: https://en.wikipedia.org/wiki/Bloom_filter#Examples


A wonderful article, thank you.

Rather than using multiple hash functions, would it make more sense to use a single algorithm over (prefix | input), with k different prefixes? This may allow computing those hashes in parallel, using SIMD for example, and caching the prehash state of the prefixes.

Edit: looks like there has been some research on this: https://ieeexplore.ieee.org/document/8462781


It’s fairly common to use two hash functions and then compute the remaining n hashes as hn(x) = h1(x) + n * h2(x).

Paper: https://link.springer.com/chapter/10.1007/11841036_42


There's a lot of optimizations you can do on top of the classic Bloom filter. In short, you can use a single hash to compute an offset into a table of bit patterns. SIMD lets you perform multiple lookups in parallel. I wrote a blog post about more advanced Bloom filters if you're curious:

https://save-buffer.github.io/bloom_filter.html


this is a great article! I love how you explained it. here are my takeaways from it: - Bloom filter is a space-efficient data structure - You're trading off some accuracy (possibility of false positives) for the significantly smaller memory footprint and faster membership checks


Very good explanation of Bloom filters, but are they optimum from a complexity point of view given a false positive rate and a bounded number of bits? Are there other kind of filters if we relax de condición of zero false negative rate?, are there other better but more complex data structures able to provide the same effect?


For whom that are interested I suggest also to have a look on Cukoo Filter: https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf


One of my reviewers suggested covering these too but they felt a little more complex than I wanted for the post. They are cool, though!


I appreciate the author's dedication to making animations accessible!


Thank you so much. I still have a way to go. People who struggle with low contrast will struggle with the default colour scheme, for example. But I'm trying to make progress with every new post <3


These are beautiful interactives along with the rest by Sam!


You're very kind! I'm a big believer in writing you can really lose yourself in, and try and achieve that with interactivity and visuals. The aesthetics of my posts is very important to me.


Awesome resource! Bloom filters are one of my favorite data structure (that I wish I got to employ more often in my line of work… but alas).


i-just-think-they're-neat.gif

Have loved bloom filters for a long time. If probabilistic data structures that make use of hashing are your thing, you should also check out: https://djhworld.github.io/hyperloglog/


Very cool! I was going to read up on bloom filters and this popped up the next day!


I read your mind.


Amazing! Very well explained.


Thank you!


Flows beautifully. Looking forward to the next one Sam!


Thank you, ctxc! Topic is decided but it's going to be a tricky one to get right, so it's a few months away at least I'm afraid.


Would it be better to store the hash of each output into 3 different arrays?

e.g. currently, it's doing:

    [input] -> [3 hashes] -> [single storage]
but, I'm wondering if it's better to do:

    [input] 
    -> [hash a] -> [storage a]
    -> [hash b] -> [storage b]
    -> [hash c] -> [storage c]
...this way, the output of the 3 hashes don't affect each other during the set and membership-check. I wonder how that would affect how much more data you can store. I'm sure someone has considered this.


I can’t remember the name but I remember reading about a bloom filter variant that does this. :)


very well written and informative.


Thank you!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: