
Why we didn't use a bloom filter - DrJosiah
http://dr-josiah.blogspot.com/2012/03/why-we-didnt-use-bloom-filter.html
======
TeMPOraL
_"Have you ever used the STL? I did for about 6 months professionally, and I'm
glad that I could go back to C and Python. More seriously, object orientation
in C++ doesn't come for free. Aside from the pain of the syntax of using C++
templates (and difficult to parse error messages), all of that object
orientation hides the complexity of method dispatch in what are known as
vtables. If your system has to call them to dispatch, you are wasting time in
that processing when you could be doing something else."_

This fragment made me WTF, so either I missed something in the C++ development
of the last few years, or...

a) How can you make standard-OOP-like method dispatch _faster_ than vtables?

b) Since when does the STL use so many virtual functions anyway? Last time I
checked, it avoided polymorphism entirely, for speed reasons. (And it doesn't
really need much of it; C++ templates are nice tools that let a pointer to
function and a function object work in the same place _just because_ you can
stick "()" to the right of the passed parameter and it will compile, with no
run-time work needed.)

c) (it seems to be implied, though not explicitly stated). Going faster than
STL... in Python?
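To make (b) concrete, here is a minimal sketch (names are mine, purely illustrative) of how a template algorithm accepts both a function pointer and a function object with no virtual dispatch at all:

```cpp
#include <cassert>
#include <cstddef>

// A trivial STL-style algorithm: count elements for which pred(x) is true.
// Pred can be a function pointer, a functor, or a lambda; overload resolution
// happens at compile time, so there is no vtable anywhere.
template <typename It, typename Pred>
int count_if_like(It first, It last, Pred pred) {
    int n = 0;
    for (; first != last; ++first)
        if (pred(*first)) ++n;  // "stick () to the right" and it compiles
    return n;
}

bool is_even_fn(int x) { return x % 2 == 0; }  // plain function pointer

struct IsEven {                                // function object (functor)
    bool operator()(int x) const { return x % 2 == 0; }
};
```

Both `count_if_like(a, a + n, is_even_fn)` and `count_if_like(a, a + n, IsEven{})` instantiate the template statically; with the functor (or a lambda), the compiler can typically inline the call entirely.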

~~~
conroy
In response to c, pypy has been getting ridiculously fast as of late. In some
very specific (i.e. pointless) benchmarks, it can outperform C [1]. And it's
only getting faster [2].

[1] <http://morepypy.blogspot.com/2011/08/pypy-is-faster-than-c-again-string.html>

[2] <http://speed.pypy.org/>

~~~
shin_lao
Every language has at least one benchmark saying "We're faster than C".

~~~
ajays
I was about to post that. Someone once joked, "a language isn't mature until
it has a benchmark showing that it is faster than C".

------
ComputerGuru
Really an unfair article. The premise, title, and bulk have nothing to do with
what actually happened.

With his _custom software_ he's solving a completely different problem with a
different algorithm that was only possible based on observations unique to his
dataset:

 _I originally started with code very similar to the C++ STL set intersection
code and was dissatisfied with the results. By observing that the longer
sequences will skip more items than the shorter sequences, I wrote a 2-level
loop that skips over items in the longer sequence before performing
comparisons with the shorter sequence. That got me roughly a 2x performance
advantage over the STL variant._
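The 2-level skip loop he describes can be sketched roughly like this (a minimal sketch from the description above; the names and details are mine, not his). It walks the shorter sorted sequence and, for each value, runs an inner loop that skips ahead in the longer sequence before comparing, returning only the count:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Count common elements of two sorted int32 sequences, skipping through
// the longer sequence in an inner loop. Only the count is returned, since
// that is all this problem needs.
size_t intersect_count(const int32_t* s, size_t ns,   // shorter sequence
                       const int32_t* l, size_t nl) { // longer sequence
    size_t count = 0, j = 0;
    for (size_t i = 0; i < ns && j < nl; ++i) {
        while (j < nl && l[j] < s[i]) ++j;  // inner loop: skip in longer seq
        if (j < nl && l[j] == s[i]) { ++count; ++j; }
    }
    return count;
}
```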

Let's call this apples. Then he talks about how using C++ STL wouldn't do the
trick since he doesn't care about the data, only the count (oranges). And how
bloom filters applied naïvely to the raw data would still take longer
(zebras).

He's solving a completely different problem, and lambasting perfectly viable
technologies for taking longer _to solve a completely different problem_.

Now obviously there's a good blog post and moral in there: don't solve the
wrong problem. Generic algorithms can't be both generic and special at once -
they're not always going to give you what you're looking for and only what
you're looking for, and that's a cost to be taken into consideration at any
time that you're trying to choose a solution. But don't criticize them for
being slow, and then in a postscript say "and I don't need the actual results
of this algorithm, anyway."

~~~
Confusion
I don't understand your response. Whom or what was he unfair to? What
technology was he 'lambasting'?

    
    
      But don't criticize them for being slow, and then in a 
      postscript say "and I don't need the actual results of
      this algorithm, anyway."
    

I don't see him criticising any solution for 'being slow' in general. Only for
'being too slow for this specific problem'.

It's very common to use a (already available) algorithm to calculate a result
related to what you actually want and then infer the answer from that result.
E.g. calculate the intersection of two sets and then count the members,
instead of intersecting and counting simultaneously and never having the
intersection available. It's usually a wise decision to use the first
approach: the performance penalty is acceptable and the advantage of code
reuse (and not having to write and test a new algorithm) is larger. His point
was that in this case, it wasn't.
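The trade-off above can be shown in a few lines (an illustrative sketch, not his code): the reuse approach materializes the intersection via the library and then counts it, while the fused approach counts matches directly and never builds the intersection.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Reuse: build the intersection with std::set_intersection, then count it.
size_t count_via_set_intersection(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out.size();
}

// Fused: count matches directly, never materializing the intersection.
size_t count_fused(const std::vector<int>& a, const std::vector<int>& b) {
    size_t i = 0, j = 0, count = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { ++count; ++i; ++j; }
    }
    return count;
}
```

Both give the same answer; the fused version simply avoids allocating and writing an output sequence nobody will read.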

------
radiospiel
I actually like both this and the "1000 fold performance improvement"
articles. There is a problem, and then there is a solution, which works, and
is faster in a spectacular way.

However, both articles mention that a C++/STL solution would bring in a vtable
runtime penalty. This is just not true - quite the opposite: using C functions
a la qsort(3) requires you to explicitly pass in a pointer to the comparison
function. Now: this is the vtable approach, but in plain sight. A C++ version
would quite likely not call any functions via pointers.

I prepared a small test which shows that in simple cases C++ can actually be
faster than C by a factor of up to 3: <http://radiospiel.org/sorting-
in-c-3-times-faster-than-c>
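The contrast is easy to see side by side (a minimal sketch, not the code from my test): qsort() must call its comparator through a function pointer on every comparison, while std::sort() takes the comparator as a template parameter, so the compiler can inline it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// C-style comparator: invoked indirectly on every comparison.
int cmp_int(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

void sort_c(int* a, size_t n) { qsort(a, n, sizeof(int), cmp_int); }

void sort_cpp(int* a, size_t n) {
    // The lambda's type is baked into this std::sort instantiation:
    // no indirect call, and the comparison can be inlined.
    std::sort(a, a + n, [](int x, int y) { return x < y; });
}
```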

------
comex
Interesting article, but I find it hard to believe that STL code would use
virtual methods. Certainly it makes sense to use a custom algorithm when only
a count is required, but was the claim about vtables actually tested?

Also, if speed is so important, why use qsort? It requires an indirect
function call for every comparison; C++'s std::sort, or any inlined
implementation, is faster.

~~~
caf
The impression I got is that the overall algorithm requires O(n) qsorts and
O(n^2) intersections, so the speed of the sorting is not important.

------
zeroonetwothree
The intersection problem they are solving is pretty trivial. I don't know why
you would even consider using a bloom filter for it, that's not exactly the
best domain for it.

Also, I would have just used the C++ set_intersection function. It seems
unlikely that the 2x speedup matters since they already got it down from 7
seconds to 6ms.

~~~
DrJosiah
With the work we were performing, the 2x improvement may not have been
necessary, but it was useful. We could run everything on 1 box and keep up
with roughly 30-50% system utilization continuously. Had we used the STL
version, we would have needed another machine, and may not have been able to
use mmaps to share data between cores.

------
gorset
Performing the intersection of sorted lists is a well studied problem in
information retrieval, so nothing new or surprising here...

The performance can be improved even further, for example by using a
compressed integer list with an embedded skip list, or by using compressed
bitmaps. Sorted int32 lists are the naive solution :-)

~~~
alecco
It's not that simple. Those solutions often require certain types of value
distributions to keep lookup easy, for example to use compression in the
style of PFOR/PFOR-Delta. If the compression isn't 1-to-1 in some way, lookup
becomes a lot harder.

------
dhruvbird
Is this guy serious or joking?

"Have you ever used the STL? I did for about 6 months professionally, and I'm
glad that I could go back to C and Python. More seriously, object orientation
in C++ doesn't come for free. Aside from the pain of the syntax of using C++
templates (and difficult to parse error messages), all of that object
orientation hides the complexity of method dispatch in what are known as
vtables."

Seriously?

1. Python is faster than C++?

2. All OO features (method invocations) in C++ use the vtable?

IMHO, just using the right iterator abstraction and using the STL algorithms
should be within performance limits.

~~~
DrJosiah
"Using C, I wrote a handful of lines of C code that took a sorted sequence of
4-byte integers that represented a Twitter user's followers, and I intersected
that sequence against thousands of other similar sequences for other Twitter
users."

In this case, a custom C algorithm is very fast. I didn't benchmark a C++
version, but I theorized that it would be slower, because my first
implementation followed the C++ STL set intersection algorithm and it was
slower than what I ended up with.

I've also updated the article to remove the vtable stuff.

------
antirez
Not strictly related, but in Redis intersections between sets encoded as an
array of fixed length 16, 32, or 64 bit numbers (we use this encoding for
small sets) may be performed very fast because they are ordered, so you can
simply take two pointers in the two sets and advance only taking common
elements.
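The two-pointer walk described above can be sketched like this (illustrative C++, not Redis's actual implementation): advance whichever pointer holds the smaller value, keeping only the common elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Intersect two sorted intsets of 64-bit values with two pointers.
std::vector<int64_t> intset_intersect(const std::vector<int64_t>& a,
                                      const std::vector<int64_t>& b) {
    std::vector<int64_t> out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;        // advance the pointer that is behind
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }  // common element
    }
    return out;
}
```

Each input element is visited at most once, which is what makes the ordered encoding pay off here.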

Of course there is a tension between fast intersections (which require a
sorted data structure) and O(1) existence tests of elements. It depends on
what you are trying to accomplish.

~~~
DrJosiah
But it's only for limited sized sets, 512 entries with the default
configuration. Our sets were millions of entries.

~~~
antirez
Sure, but we may implement this sooner or later so that small sets will be
intersected very fast.

~~~
DrJosiah
With caching, 512 intset entries, and 64 bit intset values, that's under 1
microsecond to intersect using the naive binary search algorithm for
intersection.

I don't believe that small sets are an issue for performance.

For me, the real question is whether there are ways of getting good
performance _and_ lower memory overhead across the entire range of object
sizes in Redis.

------
leon_
Talks about nanoseconds and ultra fast code ... yet doesn't say a word about
cache locality?

~~~
brazzy
You may want to reread the article. One of his main points is how his custom
algorithm accesses the dataset sequentially and thereby massively outperforms
bloom filters which need random access.

