
BitFunnel: Revisiting Signatures for Search [pdf] - jsnell
http://danluu.com/bitfunnel-sigir.pdf
======
ot
I'm one of the authors of the paper they compare against (Partitioned Elias-
Fano), and since I saw the StrangeLoop talk I've been curious to see a more
detailed description of the index, so I was very eager to read the paper as
soon as it came out.

I think it's an interesting approach, but I have a few questions about its
practical applicability.

For one, space utilization seems to be a serious issue: bits per posting are
more than double those of the Elias-Fano indexes in their experiments.
Furthermore, they claim that they used the Java Partitioned Elias-Fano
implementation of MG4J, but MG4J does not have one! It only has an Elias-Fano
index; since they get the same space from PEF and MG4J, this means that either
they are using a random ordering of the document identifiers (otherwise they'd
get another 2x improvement from PEF with proper docid ordering) or they're
misusing the PEF code.
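
To make the space argument concrete, here is a rough Python sketch (a toy model, not the PEF or MG4J code). `ef_bits` follows the standard Elias-Fano layout, whose size depends only on the number of postings and the universe size, so docid ordering cannot change it. `chunked_ef_bits` is a hypothetical, much-simplified stand-in for partitioning: it uses fixed-size chunks and an assumed 64-bit header per chunk, whereas the real PEF optimizes the partition boundaries. All function names and parameters are assumptions for illustration.

```python
import math
import random


def ef_bits(n, universe):
    """Size in bits of a plain Elias-Fano encoding of n sorted ints in [0, universe)."""
    if n == 0:
        return 0
    # Number of low bits stored verbatim per element.
    low = math.floor(math.log2(universe / n)) if universe > n else 0
    # n * low bits for the low parts, plus a unary-coded upper-bits array.
    return n * low + n + (universe >> low)


def chunked_ef_bits(postings, chunk=128, header_bits=64):
    """Toy partitioning: encode each fixed-size chunk relative to its own range.

    header_bits is an assumed per-chunk overhead, not a figure from the paper.
    """
    total = 0
    for i in range(0, len(postings), chunk):
        part = postings[i:i + chunk]
        span = part[-1] - part[0] + 1
        total += header_bits + ef_bits(len(part), span)
    return total


def random_postings(n, universe, seed=0):
    """n distinct docids drawn uniformly at random, sorted."""
    return sorted(random.Random(seed).sample(range(universe), n))


def clustered_postings(clusters, run, universe):
    """`clusters` runs of `run` consecutive docids, spread evenly over the universe."""
    stride = universe // clusters
    return [c * stride + j for c in range(clusters) for j in range(run)]


if __name__ == "__main__":
    universe = 1_000_000
    scattered = random_postings(10_000, universe)
    clustered = clustered_postings(10, 1_000, universe)
    for name, postings in [("random", scattered), ("clustered", clustered)]:
        n = len(postings)
        print(name,
              "plain EF: %.2f bits/posting" % (ef_bits(n, universe) / n),
              "chunked: %.2f bits/posting" % (chunked_ef_bits(postings) / n))
```

Under these assumptions the plain encoding costs exactly the same for both lists, while the chunked one shrinks substantially only when the docids are clustered. That is why near-identical sizes for PEF and plain EF hint at a random docid ordering.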

Also, they claim to use 5 corpora, but they're actually 5 slices of the same
corpus (Gov2), sharded by document size class, so both the collection sizes
and document sizes vary and it's not easy to correlate them with the observed
performance characteristics. A trend shows up anyway: as the collection size
grows, the performance gap between BitFunnel and EF-based indexes narrows,
which suggests that on very large collections the curves might even cross.
This is not surprising, as BitFunnel query execution is linear in the
collection size, while inverted indexes have empirical query cost closer to
linear in the number of returned results. Also, I could not find in the paper
the exact definition of "false positive rate": is the denominator the number
of results, or the number of documents in the collection?
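
The ambiguity matters a lot numerically. A small sketch of the two candidate definitions (the function name and the example numbers are made up for illustration, not taken from the paper):

```python
def false_positive_rates(candidates, true_matches, collection_size):
    """Return (FP / number of true results, FP / collection size)."""
    candidates = set(candidates)
    true_matches = set(true_matches)
    false_positives = len(candidates - true_matches)
    # Undefined when there are no true results; reported as infinity here.
    per_result = (false_positives / len(true_matches)
                  if true_matches else float("inf"))
    per_document = false_positives / collection_size
    return per_result, per_document


if __name__ == "__main__":
    # Hypothetical query: 100 true matches, 1,100 candidates from the filter,
    # in a collection of 1,000,000 documents.
    per_result, per_document = false_positive_rates(
        candidates=range(1_100),
        true_matches=range(100),
        collection_size=1_000_000,
    )
    print(per_result, per_document)
```

With these made-up numbers the same filter has a rate of 10 under the first definition and 0.001 under the second, so the reported figures are hard to interpret without knowing which one is meant.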

The experiments are on a 4-core desktop-class machine; would the results hold
up on, say, a recent 28-core Xeon processor, or would memory throughput
become the bottleneck, given that scanning the Bloom filters requires going
through a large fraction of the data?

I got the impression that a good use case for BitFunnel is not to replace
inverted indexes altogether, but just for the "fresh" part of a large index
(say, newly crawled documents). In that case we have fast insertion of new
documents, and a small collection (where the performance gap with inverted
indexes seems to be high). However, there does not seem to be an efficient way
to update existing documents (add/remove terms), because changing the term set
could change the document size shard.

~~~
matt4711
I was at the SIGIR'17 presentation of this paper (won best paper award btw)
and have some comments in general:

\- They mentioned (from what I remember) that they now use BitFunnel as the
core of the complete Bing search engine not just the fresh parts.

\- From reading the paper and looking at the code, it looks like their index
doesn't include frequency information whereas your PEF code does. It is
unclear what was counted in the experiments.

\- If you look at the code, they are actually doing much more complicated
stuff than just regular Bloom filters, "bin packing" the hash positions for
each term to reduce false positive rates (see
[https://github.com/BitFunnel/BitFunnel/issues/278](https://github.com/BitFunnel/BitFunnel/issues/278)
). I'm not sure if it is "fair" to compare a system developed by 10+ engineers
over many years to a "phd student" code base developed over a short period of
time. I think the PEF code is excellent; my point is more that engineering
effort can have a large impact on performance.

\- I'm fairly sure you are right regarding the lack of URL-sorting. However,
this can have another cause. Consider Figure 4 in the paper, which shows
how "higher ranking rows" group documents together to allow faster
intersection. URL sorting causes clusters in document-ids. Say, in the example
in Fig. 4 there might be a cluster for that specific term for documents
0,1,2,3. This would mean the "higher ranking" row approach becomes worse (more
false positives) when clustering occurs in the collection. So while URL-
sorting helps PEF, it will most likely make BitFunnel worse.
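
For readers without the paper at hand, here is a toy sketch of the "higher ranking rows" of Fig. 4 as described there, to the best of my understanding: a rank-r row keeps one bit per group of 2^r consecutive documents, set if any document in the group has the bit set at rank 0. This is only an illustration with made-up function names, not BitFunnel's actual code.

```python
def rank_row(doc_bits, rank):
    """Fold a rank-0 row (one bit per document) into a rank-`rank` row.

    Each output bit covers 2**rank consecutive documents and is the OR
    of their rank-0 bits.
    """
    group = 1 << rank
    return [int(any(doc_bits[i:i + group]))
            for i in range(0, len(doc_bits), group)]


def expand(row, rank, num_docs):
    """Map a rank-`rank` row back to one candidate bit per document."""
    group = 1 << rank
    return [row[i // group] for i in range(num_docs)]


if __name__ == "__main__":
    # Hypothetical rank-0 row: the term's bit is set for documents 1 and 6.
    doc_bits = [0, 1, 0, 0, 0, 0, 1, 0]
    folded = rank_row(doc_bits, rank=1)
    print(folded)                            # [1, 0, 0, 1]
    print(expand(folded, 1, len(doc_bits)))  # [1, 1, 0, 0, 0, 0, 1, 1]
```

Here the bit is set for documents 1 and 6 only, but the rank-1 row also flags documents 0 and 7, which would have to be ruled out by other rows or a later verification step. The row is half as long, which is where the faster intersection comes from.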

~~~
ot
> They mentioned (from what I remember) that they now use BitFunnel as the
> core of the complete Bing search engine not just the fresh parts.

I find it hard to believe this. Their main index is certainly not all-RAM
(there must be some flash and maybe even disk), and the throughput would just
not be enough for something like BitFunnel.

> From reading the paper and looking at the code, it looks like their index
> doesn't include frequency information whereas your PEF code does. It is
> unclear what was counted in the experiments.

In PEF the frequencies are not interleaved with the postings, so if you don't
read them you don't pay any computational overhead (they mention this in the
paper). However, it's not clear whether they included them when measuring the
space.

> I'm not sure if it is "fair" to compare a system developed by 10+ engineers
> over many years to a "phd student" code base developed over a short period of
> time.

I'm not trying to compare the code :) On the contrary, I'm mostly concerned
about the behavior as the collection size grows. Gov2, especially if split
into 5 pieces, is relatively small.

> So while URL-sorting helps PEF, it will most likely make BitFunnel worse.

That's possible, but I don't see why they could not use different docid
orderings for BitFunnel and PEF. If they use the one that is better for
BitFunnel, that's not fair to PEF.

~~~
matt4711
> I find it hard to believe this. Their main index is certainly not all-RAM
> (there must be some flash and maybe even disk), and the throughput would
> just not be enough for something like BitFunnel.

From looking at the github repo it does look like the system runs entirely in
main memory.

~~~
ot
Yes, I meant that I don't think they're holding an entire web index in RAM.

------
antics
I'm a co-author on the paper, and a couple of interesting quick notes:

* BitFunnel processes every query that comes into Bing, but (at least when I left MSFT) indexed "only" the superfresh tier.

* Dan and Mike spent a few months removing cruft and porting it to Linux before we open sourced it: [https://github.com/BitFunnel/BitFunnel](https://github.com/BitFunnel/BitFunnel)

* It did win best paper at SIGIR (probably the most prestigious IR conference) this year.

~~~
dwenzek
I had a glance at the engineering diary
[http://bitfunnel.org/](http://bitfunnel.org/)

I find it really interesting for a code repository to have such a companion.

~~~
antics
The original idea was to write little entries as we continued the Linux port,
explaining why certain things are the way they are -- a sort of tour of a
very serious production system. I still love the idea, but unfortunately, I
think it is safe to say that this goal failed. :(

------
majke
Transcript from Dan's talk about BitFunnel:

[http://bitfunnel.org/strangeloop/](http://bitfunnel.org/strangeloop/)

The talk itself:

[https://www.youtube.com/watch?v=80LKF2qph6I](https://www.youtube.com/watch?v=80LKF2qph6I)

