Hacker News
What are Bloom filters? (2015) (medium.com/the-story)
202 points by trumbitta2 on Aug 5, 2016 | 69 comments



Whenever you're considering a Bloom filter, also look at cuckoo filters. There are pros and cons to each approach.

See https://bdupras.github.io/filter-tutorial/ and http://11011110.livejournal.com/327681.html as well as the corresponding HN discussions https://news.ycombinator.com/item?id=12124722 and https://news.ycombinator.com/item?id=11795779


For that matter, cuckoo hashing [1] in general is an interesting topic. Take a look at its application to cache-oblivious dictionaries [2]. They work quite well in practice, especially for space-constrained hash tables (e.g. caches).

[1] https://en.wikipedia.org/wiki/Cuckoo_hashing

[2] https://arxiv.org/abs/1107.4378


I read about halfway into this article, and when the author started talking about Javascript, the "this" keyword, and the politics of the job, I stopped.

What does that have to do with bloom filters? Get to the point!


Bloom filters are cool, but the writer is pretty annoying. It's written like they discovered Bloom filters or something.

Just read the last section.


I don't know, the first 5 paragraphs were about his lunch.


He admits that he's trying to imitate Paul Ford (because somebody told him to) and failing, which is a fair assessment of the situation. You can just skip to the "In Bloom" heading... there isn't really any information before that.


The post is how he's returning a favor, which he explains in the "returning the favor" section.


But that doesn't tell me how Bloom filters work. I don't care about his dinner or what favors he owes, unless he's planning on turning those expository details into a clever metaphor later in the article, or if they really are something that's important for me to understand.

I'm all for educational pieces that are told through a conversation between characters - you can get some creative writing out of that. However, the bug that he's asked to fix, the Polish typographer friend, and the dinner scene are barely connected to the subject. We don't even get a nice technical explanation of why the bug is impossible to fix or what that means with regards to the trade-offs in using Bloom filters.


There is a cool and very quick explanation of Bloom filters here:

https://www.youtube.com/watch?v=-SuTGoFYjZs
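
For anyone who'd rather read code than watch a video, the whole idea fits in a few lines of Python. This is a rough sketch, not production code; the sizes, hash choice, and names are illustrative, not tuned:

    import hashlib

    class BloomFilter:
        """Minimal sketch: k hash functions set bits in one shared bit array.
        Lookups can false-positive but never false-negative."""

        def __init__(self, num_bits=1024, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item):
            # Derive k bit positions from one strong hash (illustrative, not tuned).
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.num_hashes):
                yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.num_bits

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

You add() every item you've seen, and might_contain() then answers "definitely not" or "probably yes"; the false-positive rate is tuned via num_bits and num_hashes.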


Bloom filters are beautiful and highly underutilized.

I used a Bloom filter in an interview and they looked at me like it was some form of arcane sorcery.

Isn't this standard in algorithms and data structures courses? Does anyone know if it has been removed from the curriculum?


Bloom filters have a lot of problems for critical processes in practice. They are best used when being wrong is a non-issue, or as a type of cache, but performing a hash is generally really slow. Also, because of the way virtual memory works, looking up an answer often makes storing the new value much faster.

So, while cool, they have generally been replaced by simpler and more broadly useful techniques.

One example is if you're looking for duplicates among huge data sets. You farm out your filter to 1,000 machines, merge the filters, then re-run to get a list of possible collisions, and then compare that list. But while merge sort can do the same thing only a little slower, it's a rare problem, and the pathological case for Bloom filters is really bad. So even if you know about Bloom filters, you may go with the simpler idea and be done sooner.

PS: In that example, hash speed might mean sorting is actually faster.
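
(Worth noting for the merge step in that workflow: Bloom filters built with the same size and hash functions merge with a plain bitwise OR, which is why farming them out is cheap. A sketch, assuming each filter's bit array is a bytearray:)

    def merge_filters(bit_arrays):
        """Bitwise-OR same-parameter Bloom filter bit arrays; the result
        answers membership queries for the union of all the input sets."""
        merged = bytearray(len(bit_arrays[0]))
        for bits in bit_arrays:
            for i, b in enumerate(bits):
                merged[i] |= b
        return merged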


The place where Bloom filters really shine is doing lookups where, most of the time, the data isn't there.

For example, when you load a page on reddit, it has to check for a vote on every single comment on the page to see if you voted on it. Chances are, 99% of the time you didn't vote on an item. Using the built-in Bloom filter in Cassandra, it can very quickly tell if you voted on an item without having to hit the data store most of the time, which saves a ton of time.

(As a side note, there is an extra optimization that makes things even faster -- reddit tracks the last time you voted on anything as an attribute of your user, so if the page is newer than your last vote, it just assumes you couldn't have possibly voted on it, skipping the lookups altogether).
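
The pattern, roughly, as a Python sketch; vote_filter and vote_store are hypothetical stand-ins here, not Cassandra's or reddit's actual API:

    def get_vote(user_id, comment_id, vote_filter, vote_store):
        # vote_filter / vote_store are illustrative stand-ins.
        key = "%s:%s" % (user_id, comment_id)
        if not vote_filter.might_contain(key):
            return None                 # definite miss: ~99% of lookups end here
        return vote_store.get(key)      # possible vote (or false positive): hit the store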


A list of votes per user per page would also work. ~90% of the time that list would be empty; if it's not, you're likely going to eventually need to know which specific items were voted on, so you might as well just load them all.

Further, voting records are just not that big a data structure. Sure, it's much better than a poor implementation and might be a great optimization to bolt onto a different design, but again, it's niche with minimal benefits when starting from scratch.

Edit: Now, I really want to know how they actually do this.


> A list of votes per user per page would also work. ~90% of the time that list would be empty

Not really. People tend to vote on one or two things on a page. So you'll have a long list of people who voted on one or two of the 1000+ items, and now you've had to do an extra lookup.

It's a lot faster to just go straight for the votes and take advantage of the bloom filter.


Real-world data wins, so if you have real data then great, a write-up would be cool.

However, it's 100% empty the first time someone opens a link. If someone upvotes, there is no need to refresh, so they would need to upvote and then refresh the page, which I assume is less common. There are lots of pages that never get refreshed and a long tail of pages people go back to. So my assumption is you can easily keep a vote list in memory for popular pages, and everything else is basically a non-issue.

As to matching up votes with items, I kind of think that's for client-side JavaScript in most cases.


You work(ed) at reddit, d(o|id)n't you?


Worked, yes.


> A list of votes per user per page would also work. ~90% of the time that list would be empty; if it's not, you're likely going to eventually need to know which specific items were voted on, so you might as well just load them all.

How? This means you're now duplicating data, instead of organizing as

page has comments have comments have ... have votes

you now have

page has comments have comments have ... have votes, plus: page has a list of special comments, where you don't know where they are or what their parental hierarchy is, and you still need to look each comment up in that list anyway.

You're duplicating the "has been voted on" field.


> How?

Page has a list of people who have voted, and that list points to a list of their votes. Each comment has a a number for vote counts, but not a list of scores.

A user opens a page and gets a list of comments with scores, plus a possibly empty list of votes. 99.9% of the time that's probably fewer than 20 votes from that user, so 160 bytes or less. Worst case is, what, 8-byte comment IDs * 1,000 upvotes = 8 KB.

On a user vote, the server looks at past votes on the page, adds the vote if it's missing, then updates the comment score.

Ok, you need every vote from each user, and caching comment scores is really likely a no-brainer, but I guess you could recalculate them.

PS: Not all comments are going to be loaded, but again, if you're sending <200 bytes for the full list 99.9% of the time, then doing something else has limited value.


But that requires like entirely rearchitecting the reddit backend to be less efficient.

A page has comments; a comment has comments; a comment has votes; a vote has a user.

That is to say that a vote is a many to many join from user-comment.

To do what you want, you'd also have to give the page the concept of "people who have voted on a comment that belongs to me or one of my children".

> A user opens a page and gets a list of comments with scores, plus a possibly empty list of votes. 99.9% of the time that's probably fewer than 20 votes from that user, so 160 bytes or less. Worst case is, what, 8-byte comment IDs * 1,000 upvotes = 8 KB.

The problem isn't sending additional data over the wire; it's that, say I've voted on 500 things on a page, then we're at 250,000 operations (for each comment, check if it's in my list of voted-on comments), whereas a Bloom implementation is O(n). And there's also the whole matter of making your database structure repetitive and icky.


In your approach, comments have votes: the server must check the Bloom filter for each comment, then do a fetch for each match. In the case of 500 votes, that's 500 fetches, which cost at least votes * log(comment votes), because you can't trust the filter.

In my approach, that's client-side most of the time. Further, it's O(n) to compare matches between two sorted lists, so just sort them in O(n log n). Also, log 500 comparisons is faster than most hashes.

And of course for the most common case of no votes it's a single lookup and done.

PS: As to changing the back end: yes, if the initial implementation is bad, then you might have a point. But again, fixing the problem also works and produces far less code to maintain.
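
(For reference, the sorted-list comparison I mean is just the standard two-pointer walk; a sketch, nothing reddit-specific:)

    def intersect_sorted(a, b):
        """Two-pointer intersection of two sorted lists, O(len(a) + len(b))."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out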


At Reddit's size, wouldn't that just shift the problem from looking up votes to looking up lists? Millions of users times millions of articles? Seems like a fast way to get an early "no" would still be useful.


You're looking it up once per page vs. once per comment. So yes, it might be faster, but when something takes 1% of total time, speeding it up has massive diminishing returns.


Still, 1% of a very large number (I don't know Reddit's server costs) can be a large enough number to be worth pursuing.


Agreed. HN uses one server and /. had four, so I assume Reddit is of a similar size, but if they need a bunch of hardware, then sure, it might be useful.


Shouldn't this be a problem we can apply SSE or other vector operations to? It seems like an insufficiently explored avenue.


Stable bloom filters are a nice variant with a less bad pathological case.


It is not standard, and was never added to most uni programs in the first place.


> Does anyone know if it has been removed from the curriculum?

I don't think there is such a thing as a single curriculum taught by all colleges that it could have been removed from.


I'm speaking more of generalized educational practice than standardized curricula.


I do believe you specifically asked:

Isn't this standard in algorithms and data structures courses?

And the answer was 'no'.

It's okay to be publicly wrong. You wanted to learn something and now you know. But acting like you didn't ask it because you're feeling self-conscious just makes the whole thing awkward for everybody.


I certainly didn't get the impression that the OP's follow-up comment was trying to cover up a public display of ignorance. Rather it seemed to me that the comment about lacking a centralized curriculum was needlessly snarky and did not interpret the OP's comment charitably.


I... mis-parsed the reply as saying that bloom filters haven't been part of a curriculum from which to remove it, and took it as a dig at the lack in CS curricula (which sentiment I would empathize with).

When Google first brought it to light, I know most of the people I had access to had never encountered Bloom filters in their college careers. I can think of a handful of other things we 'should have learned' in school that just never came up (and I'm not even talking about practical skills, just theory).

Apologies for the mixup. Interesting to see the mix of up- and down-votes I got from this one. I wonder if some people made the same mistake I did...


My comment was genuinely trying to help explain that there isn't a single curriculum! I didn't post the telling off that followed his response!


Fully agreed. Your comment was a fair question for clarification of what I meant by "curriculum". I thought I answered it fairly by clarifying that I meant topics generally covered by most universities in their algorithms and data structures courses.

I'm not sure of the reason for the snark in the follow up. It was a fairly straightforward question and answer.


You are quite right. I must have conflated the tone of one comment and the text of the other in my mind when I was replying.


> Isn't this standard in algorithms and data structures courses?

It's not in my Cormen, Leiserson, & Rivest book, though my copy's nearly 20 years old so it could have been added in a later edition.


Did you go to MIT, or did you, like me, attend a college that copied MIT?


I didn't attend MIT -- why do you think it's copying MIT, because it's from MIT press? I believe that's the closest thing there is to a "standard" algorithms textbook but things may have changed since my college days...


It's taught in the intro data structures course at Queen's University in Ontario. Really cool stuff.


Bittorrent uses them for distributed calculation of statistics, specifically for counting set unions.

http://bittorrent.org/beps/bep_0033.html#abstract


It was covered in our DSA unit, at least when I took it a couple years back.


Bloom filters originally upset me because I expected the word "Bloom" to refer to an action of the algorithm. Not someone's name.

I'm fond of Concierge filtering as an easy description of how they work. I view them as a brief conversation with the front desk staff at a hotel to determine if someone is there. "Is anyone here wearing a hat? Is anyone here over 6'? Did anyone come in carrying a briefcase?"



>Bloom filters originally upset me because I expected the word "Bloom" to refer to an action of the algorithm. Not someone's name.

I thought Canny edge detection had to do with the fact that the edge detector was canny.


There ought to be a list of surprisingly-eponymous things like Page rank, Killing vector, Poynting vector, ...


How do you feel about shellsort?

https://en.m.wikipedia.org/wiki/Shellsort


Exact same problem. :)

At least that is sometimes written as Shell's sort.


Agreed!


If you like hash functions and Bloom filters, you'll also enjoy Jon Bentley's description of the original UNIX spell(1) utility, which ran in 64KB by representing its dictionary as a sparse bitmap.


I did something like that in my "Obvious password detector"[1] in 1984. This is for use in password-changing programs, to detect if a password is too obvious. It uses a 3D array of bits, 27x27x27, representing trigraphs seen in the UNIX dictionary, plus a few extras such as "aaa", "bbb", etc. Only about 30% of the possible trigraphs are used in English, so most strings that aren't a word will pass. This provides simple protection against most dictionary attacks.

[1] http://www.animats.com/source/obvious/obvious.c
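
Here's the trigraph-bitmap idea in miniature, as a rough Python sketch (not the original C; the 27 symbols are 'a'-'z' plus one bucket for everything else, and the dictionary source is a stand-in):

    SIZE = 27  # 'a'-'z' plus one bucket for every other character

    def _idx(c):
        return ord(c) - ord("a") if "a" <= c <= "z" else 26

    def build_trigraph_bits(dictionary_words):
        bits = bytearray(SIZE ** 3 // 8 + 1)
        for word in dictionary_words:
            w = word.lower()
            for i in range(len(w) - 2):
                n = (_idx(w[i]) * SIZE + _idx(w[i + 1])) * SIZE + _idx(w[i + 2])
                bits[n // 8] |= 1 << (n % 8)
        return bits

    def too_obvious(bits, candidate):
        # A candidate whose trigraphs all appear in the dictionary looks
        # word-like, so the detector rejects it as too obvious.
        w = candidate.lower()
        trigraphs = ((_idx(w[i]) * SIZE + _idx(w[i + 1])) * SIZE + _idx(w[i + 2])
                     for i in range(len(w) - 2))
        return all(bits[n // 8] & (1 << (n % 8)) for n in trigraphs)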


Marissa Mayer's interview at Google involved using a Bloom filter for spell check.


Is this in Programming Pearls? If not, where?


I saw it in a column in Comm. of the ACM a long time ago, sorry.


McIlroy's "Development of a Spelling List" (1982) paper is doi:10.1109/TCOM.1982.1095395 available at https://web.archive.org/web/20120324185456/http://unix-spell... . I found it from https://code.google.com/archive/p/unix-spell/ , which also says:

> It is interesting to note that one of the world's best compressor (paq8l) can compress the 250kb word list down to 48.5kb, less than the space taken by the lossy compression methods proposed in the paper! For comparison, regular modern compression methods (such as gzip, lzma etc) only achieve twice that size (85-90kb).

Bentley's commentary is in "A Spelling Checker", CACM 28(5): 456-462 (1985), doi:10.1145/3532.315102 , behind the paywall at http://dl.acm.org/citation.cfm?id=315102&dl=ACM&coll=DL&CFID... and in the book "Programming Pearls", Second Edition, according to http://www.linuxjournal.com/article/3846 .


He takes a long time to get to the point, doesn't he?


OpenAM uses Bloom filters to blacklist sessions. Neil has a nice write-up here:

https://neilmadden.wordpress.com/2016/02/25/stateless-sessio...

[Disclaimer, I work for ForgeRock]


I've also been researching this topic lately. You can look at my results here: https://github.com/m00dy/BloomFilter


Brilliant! I never anticipated Bloom filters would be so simple. What comes to mind is that for "forgetful" Bloom filters, one could remove a hash from the table, also erasing all equivalent entries from memory. Is this a useful trade-off in practice?


No, it isn't.

The problem here would be that you'd be essentially invalidating arbitrary hashes from the table with each operation where they share the same "row" (or address in a bitmap) with other hashes.

Suppose we have a table such that "foo", "bar", and "baz" all have overlapping entries; if we attempt to remove any of them, we'd remove the rest also.

This being said, you might be interested in a "counting filter" [0] instead, which supports delete operations.

[0] - https://en.wikipedia.org/wiki/Bloom_filter#Counting_filters
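
A counting filter in miniature, as a rough sketch along the lines of that wiki section (parameters illustrative): counters replace bits, so an item known to be present can be removed by decrementing.

    import hashlib

    class CountingBloomFilter:
        """Sketch: counters instead of bits, so known-present items
        can be removed by decrementing their counters."""

        def __init__(self, size=1024, num_hashes=4):
            self.size, self.num_hashes = size, num_hashes
            self.counts = [0] * size

        def _positions(self, item):
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.num_hashes):
                yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.size

        def add(self, item):
            for p in self._positions(item):
                self.counts[p] += 1

        def remove(self, item):
            # Only safe if the item is known to have been added (see the
            # caveat about false positives elsewhere in this thread).
            for p in self._positions(item):
                self.counts[p] -= 1

        def might_contain(self, item):
            return all(self.counts[p] > 0 for p in self._positions(item))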


Cuckoo filters also support deletion. https://github.com/efficient/cuckoofilter

(To the other person who replied: in order to delete from a counting Bloom filter, or a cuckoo filter, or any other data structure that doesn't store the original item, you have to know that the item you're deleting was present. Otherwise, you have the problem you noted.)


Thanks. That's what I meant with deleting "equivalent entries", I was wondering if there are cases where that would still be a net gain. But it makes sense that there are more specialized versions for such things.

PS I don't understand why my comment was downvoted, is there something wrong with asking such questions here?


> I don't understand why my comment was downvoted, is there something wrong with asking such questions here?

Not AFAIK, but sometimes people downvote questions that are based on misunderstandings or false premises. Give it some time and you'll probably be voted back up to where you started.


I don't see how those could work. Since bloom filters can have false positives, the counting filter could end up letting you remove an item that was never added, thereby corrupting the entries for other items in the process.


You're right. However, in cases where we know the maximum number of items we'd be storing, we can adjust the "bucket size" of each row. Quoting the above wiki link:

> Because the counting Bloom filter table cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must be known in advance. Once the designed capacity of the table is exceeded, the false positive rate will grow rapidly as more keys are inserted.


Understanding Bloom filters helps you understand Count-min sketch. https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
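
The family resemblance is easy to see in code; a count-min sketch is essentially a Bloom filter with counters plus a min(). A rough sketch (parameters illustrative):

    import hashlib

    class CountMinSketch:
        """Rough sketch: depth rows of width counters; estimates can
        overcount (due to collisions) but never undercount."""

        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.rows = [[0] * width for _ in range(depth)]

        def _col(self, item, row):
            # One independent-ish hash per row, derived by salting with the row index.
            digest = hashlib.sha256(("%d:%s" % (row, item)).encode()).digest()
            return int.from_bytes(digest[:8], "big") % self.width

        def add(self, item, count=1):
            for r in range(self.depth):
                self.rows[r][self._col(item, r)] += count

        def estimate(self, item):
            return min(self.rows[r][self._col(item, r)] for r in range(self.depth))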


This was a god-awful pain to read. It meanders around the point of Bloom filters endlessly.


Fyi... the previous discussion about venison glazed in honey (or was it bloom filters? I can't remember for sure):

(July 2015) https://news.ycombinator.com/item?id=9918365


Ah, another article about Bloom filters on HN.

That reminds me, does anyone know if Windows 10 uses Bloom filters? Specifically, if I have a folder with a lot of files inside and I paste a file into it: if the file name is different from all the names of the files in the folder, it is nearly instant (a true negative, no further testing needed), but if the file name matches one of the names, it takes significantly longer (a true or false positive of a Bloom filter -> check all files manually for a match).

It's just something that's been going through my mind lately when I moved files.



