
Pilosa: open source, distributed bitmap index in Go - jaffee
https://github.com/pilosa/pilosa
======
drfuchs
Hint to the uninformed: don't look up "bitmap" and wonder why anyone would
index them. Look up "bitmap index" instead. This will save you many clicks in
the github project as well as in your favorite search engine. It's about
databases, not pixels.

~~~
jaffee
Good call - wikipedia has a nice overview:
[https://en.wikipedia.org/wiki/Bitmap_index](https://en.wikipedia.org/wiki/Bitmap_index)

The key addition with Pilosa is that it's distributed and can scale
horizontally :)

~~~
throwasehasdwi
>The key addition with Pilosa is that it's distributed and can scale
horizontally :)

Doesn't this also make it several orders of magnitude slower? Every time I
find myself using Bitmaps it's when speed is extremely important. What are
some use cases for this not covered by in-process Bitmaps, Bloom filters, and
HyperLogLog?

I'm just not aware of many use cases for bitmap indexes willing to trade that
much speed for database-like access. I think this project would be more usable
factored into a library than something external

~~~
jaffee
The use case for Pilosa is usually as an addition to something like Cassandra
or HDFS where you have terabytes (or more) of data that you want to be able to
query more flexibly - especially if there are a very large number of
attributes that you want to filter and segment on (think tens of millions).

I think you're right though, that there are use cases which would benefit from
a library exposing this functionality - you need to have quite a lot of data
before compressed bitmaps representing the relationships in that data start
overflowing memory on a single machine.

------
cejast
This is exactly something I've been looking for to experiment with user
segmentation, but crucially for fast, ad-hoc segmentation. Seems to bear some
similarity to what Facebook [1] does for its audience insights.

[1] [https://code.facebook.com/posts/382299771946304/audience-
ins...](https://code.facebook.com/posts/382299771946304/audience-insights-
query-engine-in-memory-integer-store-for-social-analytics-/)

~~~
jaffee
User segmentation was actually the original use case that prompted Pilosa's
development! Segmenting hundreds of millions of users with tens of millions of
potential attributes.

~~~
menegattig
Very nice job.

Looks like we were trying to solve the same problem (user segmentation), in
the same industry (DMP), at the same time (2013-2016). LOL.

I'm the founder of a DMP too.

[https://blog.slicingdice.com/why-we-built-
slicingdice-1beffc...](https://blog.slicingdice.com/why-we-built-
slicingdice-1beffc643571)

I will email you guys.

------
mrdmnd
How does this compare to
[https://github.com/RoaringBitmap/CRoaring](https://github.com/RoaringBitmap/CRoaring)?

Is the difference that this one is distributed?

~~~
jaffee
I'll add to travisturner's response - Pilosa has a few bells and whistles
beyond being a straight up bitmap index including:

\- associating each bit with a timestamp (at various granularities) and
queries over time ranges.

\- adding arbitrary key/value metadata to each row or column

\- automatic sorting/caching of bitmaps to support "TopN" queries

~~~
mrdmnd
Neat, thanks for the clarifications. Looks like a cool tool.

------
daemonk
Based on what I can gather from the data model, this would be great for
categorical type data where there are a limited amount of categories for each
event? Would this be fast for querying quantitative data where values might be
unique?

~~~
alanbernstein
Pilosa was built for high cardinality data, so queries on unique numeric
values would still be fast. Translating quantitative data to a categorical
form fits Pilosa’s data model better, so the index would be more powerful. We
have explored a few basic techniques for this, which you can read about here:
[https://www.pilosa.com/use-cases/taming-transportation-
data/](https://www.pilosa.com/use-cases/taming-transportation-data/)

------
solidsnack9000
I appreciate how they developed a query language that is more code-like than
SQL but still basically declarative and parsimonious.

------
bpicolo
As I understand, this sort of index is roughly how Elasticsearch is able to do
filters quite quickly after warmup. They store bitwise mappings of filter
results on a document by document basis.

(That said, this is one ES thing that's impossible to find documentation for).

------
jksmith
Great way to encode discrete answers for qualitative survey data. Reads are
smoking fast.

~~~
orixilus
could you please elaborate or give some example?

~~~
jksmith
Sure. Say you have 40k respondents in a survey who were each asked if they use
toothpaste X (yes/no). Save all answers in 5k bytes, then map to any
particular respondents answer by shifting right by 3 and checking remainder.
Then test the bit you indexed to using the remainder. All fast operations.

------
NKCSS
I love the website[0] for the product; very pleasing.

[0]: [https://www.pilosa.com/](https://www.pilosa.com/)

------
menegattig
Very similar to S1Search:

[https://blog.slicingdice.com/slicingdice-uncovered-
part-2-s1...](https://blog.slicingdice.com/slicingdice-uncovered-
part-2-s1search-33d2240a96c1)

[https://blog.slicingdice.com/slicingdice-uncovered-
part-3-s1...](https://blog.slicingdice.com/slicingdice-uncovered-
part-3-s1search-in-depth-bb72f3955c27)

