Yes, it's a hash table without conflict resolution. No, that's not the joke.
From the article:
The simplest data structure that fits the description of an "opposite of a Bloom filter" is a hash map. With a large corpus like the one we were dealing with, the memory consumed by a hash map is far too large and its growth is bound only by corpus size. However, from this simple idea a more fruitful one comes out.
The design I pitched uses an array, a hash function, and the ability to swap ids in and out of the array concurrently. The hash function and the array are used in the same way they are inside of a hash map implementation. Unlike a typical hash map implementation, the array size is bounded and hash collisions are not avoided.
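The quoted design can be sketched in a few lines of Java. This is a minimal illustration, not the article's actual code: the class name, the sizing, and the hash mix are my own, and I assume ids are non-zero `long`s (so an empty slot can be represented as 0). The key move is `getAndSet` on an `AtomicLongArray`, which swaps an id into its slot atomically with no lock; collisions simply evict.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Sketch of the structure described above: a fixed-size array indexed
 * by a hash of the id, with no collision resolution. A new id simply
 * evicts whatever hashed to the same slot. containsAndAdd() never
 * reports a false positive, but it may return false for an id that
 * was seen earlier and then evicted.
 *
 * Assumes non-zero ids, since empty slots are initialized to 0.
 */
class OppositeOfBloomFilter {
    private final AtomicLongArray slots;
    private final int mask;

    OppositeOfBloomFilter(int sizePowerOfTwo) {
        this.slots = new AtomicLongArray(sizePowerOfTwo);
        this.mask = sizePowerOfTwo - 1;
    }

    /** Returns true iff id was definitely seen before; records it either way. */
    boolean containsAndAdd(long id) {
        int index = hash(id) & mask;
        long previous = slots.getAndSet(index, id); // atomic swap, no lock
        return previous == id;
    }

    private static int hash(long id) {
        long h = id * 0x9E3779B97F4A7C15L; // cheap multiplicative mix
        return (int) (h ^ (h >>> 32));
    }
}
```

Because the array size is fixed and every operation is a single CAS-style atomic swap, both memory use and per-event cost stay bounded regardless of corpus size.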
As I understand it, there's more to it than simply being a hashmap without collision resolution: performance.
They mentioned hundreds of thousands of events per second. At that level, typical Java objects and locks simply do not cut it anymore. That is why, for example, LMAX built their Disruptor pattern on gigantic primitive arrays and fast lock-free CAS operations.
So here they're using a very fast "hashmap" updated with CAS operations, which do not lock.
That said, why would it be a joke!? There are use cases for Bloom filters, so there probably are use cases for what they created here too, no!?
I'm amused by the above comment, and all the wild assumptions it makes.
Whereas a Bloom filter is relatively well known (in compsci circles), how many people know:
1) Who Wilt and Wither are?
2) What they "described back in 2003"?
Are they scientists? Did they write a paper in 2003 describing something similar? Have they or anybody else implemented anything based on their description?
The use of the word "just" is, err, especially misguided.
Can you point to other implementations? If not then it's not "just an implementation", it's the FIRST FRIGGING IMPLEMENTATION!
Out of interest, is there any difference between this, and a cache?
It is still nice to see a Bloom filter as the opposite of a cache, of course. In effect, a cache contains a subset of the things it has seen, and a Bloom filter keeps a superset.
I have used something like this to great effect in computations that make many complex trigonometric calls, which often end up using a small set of input values. Something similar can be used in front of any expensive pure function call where you expect to see the same values repeatedly, but don't want to use unbounded space or have to write a complicated cache-management system.
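The idea above, a fixed-size, collision-evicting cache in front of an expensive pure function, can be sketched as a direct-mapped memoizer. This is an illustrative example under my own naming and sizing assumptions, not the commenter's actual code:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.DoubleUnaryOperator;

/**
 * Bounded memoizer for an expensive pure function: a direct-mapped
 * cache where a hash of the input picks the slot and collisions
 * simply overwrite. No eviction policy or cache management needed;
 * memory stays fixed no matter how many distinct inputs arrive.
 */
class DirectMappedMemoizer {
    private static final class Entry {
        final double input, output;
        Entry(double input, double output) { this.input = input; this.output = output; }
    }

    private final AtomicReferenceArray<Entry> slots;
    private final int mask;
    private final DoubleUnaryOperator fn;

    DirectMappedMemoizer(int sizePowerOfTwo, DoubleUnaryOperator fn) {
        this.slots = new AtomicReferenceArray<>(sizePowerOfTwo);
        this.mask = sizePowerOfTwo - 1;
        this.fn = fn;
    }

    double apply(double x) {
        int index = mix(Double.hashCode(x)) & mask;
        Entry e = slots.get(index);
        if (e != null && e.input == x) {
            return e.output;                   // cache hit
        }
        double y = fn.applyAsDouble(x);        // expensive pure call
        slots.set(index, new Entry(x, y));     // collision? just overwrite
        return y;
    }

    private static int mix(int h) {           // spread low-entropy hash codes
        h ^= h >>> 16;
        h *= 0x45d9f3b;
        return h ^ (h >>> 16);
    }
}
```

Because the function is pure, an evicted entry costs only a recomputation, never a wrong answer, which is what makes the "don't bother resolving collisions" trade-off safe here.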
I responded to a first comment but then read the others...
I think you guys are all missing the point: they're talking about crazy high performance.
They've implemented it using non-locking algorithms, like LMAX's Disruptor pattern (gigantic primitive arrays, CAS operations).
It's not a "joke" to be able to process 12 million events per second on a single core in Java (that's where the LMAX Disruptor pattern is at right now, and it should go higher once Oracle fixes the JVM so that it uses a faster CAS operation when possible [it's possible on many mainstream CPUs, and there are RFEs for this]).
The point here is that it is fast. As in "faster than anyone commenting here that this is a joke could ever have built".
Why should I take it seriously when it's what we all learn in computer architecture 101? Yeah, it's crazy fast, but someone in the '50s or '60s invented it, and we should all know this class of algorithms like the back of our hands.
Bloom filters are interesting because they reverse what we typically do; reversing that again is not that interesting. So I thought this was a joke.
I hope this won't come off as tactless, but is there something more significant going on here than CAS and writing to the database if it's a different value? Because I do non-blocking programming as a hobby and I'd love to know.