For that matter, cuckoo hashing [1] in general is an interesting topic. Take a look at its application to cache-oblivious dictionaries [2]. They work quite well in practice, especially for space-constrained hash tables (e.g. caches).
I read about halfway into this article, and when the author started talking about JavaScript, the "this" keyword, and the politics around the job, I stopped.
What does that have to do with bloom filters? Get to the point!
He admits that he's trying to imitate Paul Ford (because somebody told him to) and failing, which is a fair assessment of the situation. You can just skip to the "In Bloom" heading... there isn't really any information before that.
But that doesn't tell me how Bloom filters work. I don't care about his dinner or what favors he owes, unless he's planning on turning those expository details into a clever metaphor later in the article, or if they really are something that's important for me to understand.
I'm all for educational pieces that are told through a conversation between characters - you can get some creative writing out of that. However, the bug that he's asked to fix, the Polish typographer friend, and the dinner scene are barely connected to the subject. We don't even get a nice technical explanation of why the bug is impossible to fix or what that means with regards to the trade-offs in using Bloom filters.
Bloom filters have a lot of problems for critical processes in practice. They are best used when being wrong is a non-issue, or as a type of cache, but performing a hash is generally really slow. Also, with the way virtual memory works, looking up an answer often makes storing the new value much faster.
So, while cool, they have generally been replaced by far more useful and less complex approaches.
One example is if you're looking for duplicates among huge data sets. You farm out your filter to 1,000 machines, merge them, then re-run to get a list of possible collisions. Then compare that list. But, well, merge sort can do the same thing a little slower, it's a rare problem, and the pathological case for Bloom filters is really bad. So, even if you know about Bloom filters you may go with the simpler idea and be done sooner.
PS: In that example hash speed might mean sorting is actually faster.
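For concreteness, here's a minimal sketch of that kind of filter and the merge step (Python, purely illustrative; the sizes, hash scheme, and names are assumptions, not anyone's production code):

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: a bit array probed at k hash-derived positions."""

        def __init__(self, num_bits=1 << 20, num_hashes=5):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, item):
            # Derive k positions from one digest (illustrative, not tuned).
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.num_hashes):
                yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            # False means definitely absent; True only means "possibly present".
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

        def merge(self, other):
            # Filters built with identical parameters merge with bitwise OR,
            # which is what makes the farm-out-and-merge step cheap.
            self.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))

Each machine builds one of these over its shard, the coordinator ORs them together, and the re-run collects every key the merged filter claims to have seen as a collision candidate (some of which will be false positives).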
The place where bloom filters really shine is doing lookups on data where most of the time the data isn't there.
For example, when you load a page on reddit, it has to check for a vote on every single comment on the page to see if you voted on it. Chances are 99% of the time you didn't vote on an item. Using the built-in bloom filter in Cassandra, it can very quickly tell if you voted on an item without having to hit the data store most of the time, which saves a ton of time.
(As a side note, there is an extra optimization that makes things even faster -- reddit tracks the last time you voted on anything as an attribute of your user, so if the page is newer than your last vote, it just assumes you couldn't possibly have voted on it, skipping the lookups altogether).
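In code, the lookup pattern is roughly this (hypothetical names throughout; this isn't reddit's or Cassandra's actual API, just the shape of the idea):

    def get_user_vote(user_id, comment_id, vote_filter, vote_store):
        """Return the user's vote on a comment, or None.

        vote_filter is a Bloom filter over (user, comment) keys and vote_store
        is the backing datastore; both are made-up names for illustration.
        """
        key = f"{user_id}:{comment_id}"
        if not vote_filter.might_contain(key):
            # Definite negative: the ~99% common case never touches the store.
            return None
        # "Maybe" answers (including false positives) fall through to the store.
        return vote_store.get(key)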
A list of votes per user per page would also work. ~90% of the time that list would be empty; if it's not, you're likely going to eventually need to know which specific items were voted on, so you might as well just load them all.
Further, voting records are just not that big a data structure. Sure, it's much better than a poor implementation and might be a great optimization to bolt onto a different design, but again it's niche with minimal benefits when starting from scratch.
Edit: Now, I really want to know how they actually do this.
> A list of votes per user per page would also work. ~90% of the time that list would be empty,
Not really. People tend to vote on one or two things on a page. So you'll have a long list of people who voted on one or two of the 1000+ items, and now you've had to do an extra lookup.
It's a lot faster to just go straight for the votes and take advantage of the bloom filter.
Real-world data wins, so if you have real data then great; a write-up would be cool.
However, it's 100% empty the first time someone opens a link. If someone upvotes, there is no need to refresh, so they would need to upvote and then refresh the page, which I assume is less common. There are lots of pages that never get refreshed, and a long tail of pages people go back to. So, my assumption is you can easily keep a vote list in memory for popular pages, and everything else is basically a non-issue.
As to matching up votes with items, I kind of think that's for client side JavaScript in most cases.
>A list of votes per user per page would also work. 90% of the time that list would be empty; if it's not, you're likely going to eventually need to know which specific items were voted on, so you might as well just load them all.
How? This means you're now duplicating data, instead of organizing as
page has comments have comments have ... have votes
you now have
page has comments have comments have ... have votes
page has list of special comments where you don't know where they are or what their parental hierarchy is and still need to look up in the list anyway.
Page has a list of people who have voted, and that list points to a list of their votes. Each comment has a number for vote counts, but not a list of scores.
User opens page, they get a list of comments with scores, and a possibly empty list of votes. 99.9% of the time that's probably fewer than 20 votes from that user, so 160 bytes or less. Worst case is what, an 8-byte comment ID * 1,000 upvotes = 8 KB.
On user vote, server looks at past votes on page and adds it if missing then updates comment score.
Ok, you need every vote from each user, and caching comment scores is really likely a no-brainer, but I guess you could recalculate it.
PS: Not all comments are going to be loaded, but again if you're sending <200 bytes for the full list 99.9% of the time, then doing something else has limited value.
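Roughly what I'm picturing, as a toy in-memory sketch (names and storage are made up; a real version would live in whatever datastore backs the site):

    from collections import defaultdict

    # page_id -> user_id -> set of comment IDs that user voted on
    votes_by_page = defaultdict(lambda: defaultdict(set))
    comment_scores = defaultdict(int)  # comment_id -> cached score

    def record_vote(page_id, user_id, comment_id):
        """On vote: add to the user's per-page set if missing, then bump the cached score."""
        page_votes = votes_by_page[page_id][user_id]
        if comment_id not in page_votes:
            page_votes.add(comment_id)
            comment_scores[comment_id] += 1

    def votes_for_page(page_id, user_id):
        """On page load: usually an empty set; worst case ~1,000 comment IDs (~8 KB)."""
        return votes_by_page.get(page_id, {}).get(user_id, set())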
But that requires like entirely rearchitecting the reddit backend to be less efficient.
A page has comments
a comment has comments
a comment has votes
a vote has a user
That is to say that a vote is a many to many join from user-comment.
To do what you want, you'd also have to give the page the concept of "people who have voted on a comment that belongs to me or one of my children".
>User opens page, they get a list of comments with scores, and a possibly empty list of votes. 99.9% of the time that's probably fewer than 20 votes from that user, so 160 bytes or less. Worst case is what, an 8-byte comment ID * 1,000 upvotes = 8 KB.
The problem isn't sending additional data over the wire, it's that, say I've voted on 500 things on a page, then we're at 250,000 operations (for each comment, check if it's in my list of voted-on comments), whereas a bloom implementation is O(n). And also the whole making your database structure repetitive and icky.
In your approach (comments have votes), the server must check the bloom filter for each comment, then do a fetch for each match; in the case of 500 votes that's 500 fetches, which cost at least votes * log(comment votes), because you can't trust the filter.
My approach is client side most of the time. Further, it's O(n) to compare matches between two sorted lists, so just sort them in O(n log n). Further, log 500 is faster than most hashes.
And of course for the most common case of no votes it's a single lookup and done.
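The sorted-list comparison above is just a linear walk over both lists, something like:

    def matching_ids(voted_ids, comment_ids):
        """Given two sorted lists of IDs, return those present in both in O(n + m)."""
        matches, i, j = [], 0, 0
        while i < len(voted_ids) and j < len(comment_ids):
            if voted_ids[i] == comment_ids[j]:
                matches.append(voted_ids[i])
                i += 1
                j += 1
            elif voted_ids[i] < comment_ids[j]:
                i += 1
            else:
                j += 1
        return matches

so the per-page cost is bounded by the number of comments plus the number of votes.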
PS: As to changing the back end, yes if the initial implementation is bad then you might have a point. But, again fixing the problem also works and produces far less code to maintain.
At Reddit's size, wouldn't that just shift the problem from looking up votes to looking up lists? Millions of users times millions of articles? Seems like a fast way to get an early "no" would still be useful.
You're looking it up once per page vs. once per comment. So, yes, it might be faster, but when something takes 1% of total time, speeding it up has massively diminishing returns.
Isn't this standard in algorithms and data structures courses?
And the answer was 'no'.
It's okay to be publicly wrong. You wanted to learn something and now you know. But acting like you didn't ask it because you're feeling self-conscious just makes the whole thing awkward for everybody.
I certainly didn't get the impression that the OP's follow-up comment was trying to cover up a public display of ignorance. Rather it seemed to me that the comment about lacking a centralized curriculum was needlessly snarky and did not interpret the OP's comment charitably.
I... mis-parsed the reply as saying that bloom filters haven't been part of a curriculum from which to remove it, and took it as a dig at the lack in CS curricula (which sentiment I would empathize with).
When Google first brought it to light I know most of the people I had access to had never encountered Bloom filters in their college careers. I can think of a handful of other things we 'should have learned' in school that just never came up (and I'm not even talking about practical skills, just theory).
Apologies for the mixup. Interesting to see the mix of up- and down-votes I got from this one. I wonder if some people made the same mistake I did...
Fully agreed. Your comment was a fair question for clarification of what I meant by "curriculum". I thought I answered it fairly by clarifying that I meant topics generally covered by most universities in their algorithms and data structures courses.
I'm not sure of the reason for the snark in the follow up. It was a fairly straightforward question and answer.
I didn't attend MIT -- why do you think it's copying MIT, because it's from MIT press? I believe that's the closest thing there is to a "standard" algorithms textbook but things may have changed since my college days...
Bloom filters originally upset me because I expected the word "Bloom" to refer to an action of the algorithm, not someone's name.
I'm fond of Concierge filtering as an easy description of how they work. I view them as a brief conversation with the front desk staff at a hotel to determine if someone is there. "Is anyone here wearing a hat? Is anyone here over 6'? Did anyone come in carrying a briefcase?"
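To map the analogy onto the mechanics: each question is one hash function probing one bit, any "no" means the guest is definitely not there, and all "yes" answers only mean "maybe". A toy illustration, not meant as a real implementation:

    # Three "questions" (hash functions) over a small bit table.
    bits = [False] * 64
    questions = [lambda s: hash(("hat", s)),
                 lambda s: hash(("height", s)),
                 lambda s: hash(("briefcase", s))]

    def check_in(guest):
        # Checking a guest in answers "yes" to each of their questions.
        for q in questions:
            bits[q(guest) % len(bits)] = True

    def maybe_present(guest):
        # Any "no" is a definite no; all "yes" is only a "maybe".
        return all(bits[q(guest) % len(bits)] for q in questions)

    check_in("alice")
    maybe_present("alice")    # True
    maybe_present("mallory")  # almost certainly False while the table is sparse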
If you like hash functions and Bloom filters, you'll also enjoy Jon Bentley's description of the original UNIX spell(1) utility, which ran in 64KB by representing its dictionary as a sparse bitmap.
I did something like that in my "Obvious password detector"[1] in 1984. This is for use in password-changing programs, to detect if a password is too obvious. It uses a 3D array of bits, 27x27x27, representing trigraphs seen in the UNIX dictionary, plus a few extras such as "aaa", "bbb", etc. Only about 30% of the possible trigraphs are used in English, so most strings that aren't a word will pass. This provides simple protection against most dictionary attacks.
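The core of it is tiny; a sketch of the trigraph table in Python (the 1984 original wasn't Python, and the exact alphabet and extras are simplified here):

    N = 27  # 'a'..'z' plus one catch-all slot, roughly the 27x27x27 array above

    def _idx(c):
        return ord(c) - ord('a') if 'a' <= c <= 'z' else 26

    def build_trigraph_table(words):
        """Set one bit per trigraph seen in the dictionary."""
        table = bytearray((N * N * N + 7) // 8)
        for word in words:
            w = word.lower()
            for a, b, c in zip(w, w[1:], w[2:]):
                pos = (_idx(a) * N + _idx(b)) * N + _idx(c)
                table[pos // 8] |= 1 << (pos % 8)
        return table

    def might_be_a_word(candidate, table):
        """True only if every trigraph of the candidate appears in the table.

        Since only ~30% of possible trigraphs occur in English, most strings
        that aren't words fail fast -- which is the "pass" case for a password.
        """
        w = candidate.lower()
        for a, b, c in zip(w, w[1:], w[2:]):
            pos = (_idx(a) * N + _idx(b)) * N + _idx(c)
            if not table[pos // 8] & (1 << (pos % 8)):
                return False
        return True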
> It is interesting to note that one of the world's best compressor (paq8l) can compress the 250kb word list down to 48.5kb, less than the space taken by the lossy compression methods proposed in the paper! For comparison, regular modern compression methods (such as gzip, lzma etc) only achieve twice that size (85-90kb).
Brilliant! I never anticipated Bloom filters would be so simple. What comes to my mind is that for "forgetful" Bloom filters, one could remove a hash from the table, also erasing all equivalent entries from the memory. Is this a useful trade-off in practice?
The problem here would be that you'd be essentially invalidating arbitrary hashes from the table with each operation where they share the same "row" (or address in a bitmap) with other hashes.
Suppose we have a table such that "foo", "bar", "baz" all have overlapping entries -- if we attempt to remove any of them, we'd remove the rest also.
This being said, you might be interested in a `Counting filter' instead, which supports delete operations.
(To the other person who replied: In order to delete from a counting bloom filter, or a cuckoo filter, or any other data structure that doesn't store the original item, you have to know the item that you're deleting was present. Otherwise, you have the problem you noted.)
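A counting filter is the same idea with small counters in place of bits; a hedged sketch (parameters illustrative, and per the caveat above, remove() is only safe for items you know were inserted):

    import hashlib

    class CountingBloomFilter:
        """Counters instead of bits, so deletes become possible."""

        def __init__(self, num_slots=1 << 16, num_hashes=4):
            self.num_slots = num_slots
            self.num_hashes = num_hashes
            self.counts = [0] * num_slots

        def _positions(self, item):
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.num_hashes):
                yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.num_slots

        def add(self, item):
            for pos in self._positions(item):
                self.counts[pos] += 1

        def remove(self, item):
            # Only safe for items that were actually added; removing a false
            # positive would corrupt counters shared with other items.
            for pos in self._positions(item):
                self.counts[pos] -= 1

        def might_contain(self, item):
            return all(self.counts[pos] > 0 for pos in self._positions(item))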
Thanks. That's what I meant with deleting "equivalent entries", I was wondering if there are cases where that would still be a net gain. But it makes sense that there are more specialized versions for such things.
PS I don't understand why my comment was downvoted, is there something wrong with asking such questions here?
> I don't understand why my comment was downvoted, is there something wrong with asking such questions here?
Not AFAIK, but sometimes people downvote questions that are based on misunderstandings or false premises. Give it some time and you'll probably be voted back up to where you started.
I don't see how those could work. Since bloom filters can have false positives, the counting filter could end up letting you remove an item that was never added, thereby corrupting the entries for other items in the process.
You're right. However, in cases where we know the maximum number of items we'd be storing, we can adjust the "bucket size" of each row. Quoting the above wiki link:
> Because the counting Bloom filter table cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must be known in advance. Once the designed capacity of the table is exceeded, the false positive rate will grow rapidly as more keys are inserted.
That reminds me, does anyone know if Windows 10 uses bloom filters? Specifically, if I have a folder with a lot of files inside and I paste a file into it, if the file name is different from all the names of the files in the folder, it is nearly instant (true negative, no further testing needed) but if the file name matches one of the names, it takes significantly longer (true or false positive of bloom filter -> check all files manually for a match).
It's just something that's been going through my mind lately when I moved files.
See https://bdupras.github.io/filter-tutorial/ and http://11011110.livejournal.com/327681.html as well as the corresponding HN discussions https://news.ycombinator.com/item?id=12124722 and https://news.ycombinator.com/item?id=11795779