
Pilosa: An open source, distributed index - josephturnip
https://www.pilosa.com/docs/latest/data-model/
======
joshuaellinger
Technically, it is very interesting -- it uses Roaring Bitmaps under the hood
and builds a query engine on it. So an easy way to think about it is that it
maps categorical data into a giant compressible distributed bitmap.
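
To make that concrete, here is a rough sketch in plain Python. It uses Python integers as uncompressed bitsets instead of Roaring Bitmaps, and the sample records and helper names (`build_index`, `columns`) are made up for illustration; none of this is Pilosa's actual API.

```python
from collections import defaultdict

def build_index(records):
    """Map each (field, value) pair to a bitmap of record ids.

    Bitmaps are plain Python ints used as bitsets: bit i is set when
    record i has that value. A real system would use a compressed
    structure such as a Roaring Bitmap instead.
    """
    index = defaultdict(int)
    for record_id, record in enumerate(records):
        for field, value in record.items():
            index[(field, value)] |= 1 << record_id
    return index

def columns(bitmap):
    """Return the record ids whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical sample data.
records = [
    {"color": "red", "size": "small"},
    {"color": "blue", "size": "small"},
    {"color": "red", "size": "large"},
    {"color": "red", "size": "small"},
]
index = build_index(records)

# "color = red AND size = small" becomes a bitwise AND of two bitmaps.
red_and_small = index[("color", "red")] & index[("size", "small")]
matching_ids = columns(red_and_small)
```

Each categorical value gets its own row of bits, and a filter on several fields is just boolean operations across those rows.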

I've been planning to see if I can (mis)use it as an OLAP replacement but I
haven't had time to get to it.

~~~
jaffee
You _definitely_ can... the feature set keeps growing. We have multi-field
filtered GROUP BY now. It's amazing to see how flexible Roaring Bitmaps can
be!

------
jaffee
Source code:

[https://github.com/pilosa/pilosa](https://github.com/pilosa/pilosa)

------
eismcc
>Pilosa is a standalone index for big data. Its goal is to help big data
storage solutions support real time, complex queries without resorting to
pre-computation or approximation. Pilosa achieves this goal by implementing a
distributed bitmap index which provides a compact representation not of the
data itself, but of the relationships present in the data.

[https://www.pilosa.com/pdf/PILOSA%20-%20Technical%20White%20...](https://www.pilosa.com/pdf/PILOSA%20-%20Technical%20White%20Paper.pdf)

------
sktrdie
Not sure who this document is aimed at. It's not technical enough to appeal to
programmers that are working closely with Pilosa. And it's not written in a
way to make it easy to understand for people that don't know anything about
Pilosa (such as myself). I mean a subtitle called "Time Quantum" is enough to
make me confused. Would appreciate a more generic "what is this" intro if
possible.

~~~
dmos62
I found the use cases section the most informative. You can click on a use
case to get a write-up. Here's a few excerpts from transportation:

> Pilosa is a distributed bitmap index that sits on top of a data store. The
> key to understanding and then using Pilosa is converting data such that it
> is represented in ones and zeros. This dramatically reduces the size as well
> as accelerates query times.

> For example, timestamps are important information, but we tend to be
> interested in individual components of a timestamp, especially when
> analyzing data with cyclic trends. Timestamp components are stored as groups
> of bitmaps, known as “frames”. We create one frame for the day of the week,
> as illustrated in the following table. Along with similar frames for year,
> month, and time of day, this accelerates queries that ask questions about
> rides belonging to any logical combination of these time groups.

> [...]

> Because each data point includes pickup/dropoff times and total distance
> travelled, it’s easy to determine the average speed of the trip. As an
> example, we use this as a first order approximation of congestion. We
> created a frame representing average speed, with a spacing of 1 mph.

> In order to answer questions about congestion, we needed to first determine
> what speeds constitute slow traffic. One of the basic queries in Pilosa is
> the TopN function, and we used that to get a list of all the different
> average speeds. By performing a count on each we built a histogram of how
> many rides fall into each speed bucket, and decided from there which buckets
> deviate enough from the norm to constitute congestion.
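
The TopN-then-count step in that last excerpt can be sketched roughly like this. It is a toy version under stated assumptions: the frame is a plain dict from speed bucket to a Python int used as a bitset, `top_n` is a hypothetical helper and not Pilosa's TopN query, and the ride data is invented.

```python
def top_n(frame, n):
    """Return the n rows with the most set bits as (row_id, count) pairs.

    `frame` maps a row id (here, an average speed in mph) to a bitmap
    of ride ids, stored as a Python int used as a bitset.
    """
    counts = [(row_id, bin(bitmap).count("1")) for row_id, bitmap in frame.items()]
    # Sort by count descending, then by row id for a stable result.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:n]

# Hypothetical rides: ride id -> average speed in mph (1 mph buckets).
ride_speeds = {0: 12, 1: 12, 2: 25, 3: 12, 4: 25, 5: 3}

# Build the "average speed" frame: one bitmap per speed bucket.
speed_frame = {}
for ride_id, mph in ride_speeds.items():
    speed_frame[mph] = speed_frame.get(mph, 0) | (1 << ride_id)

# Asking for every row gives the full histogram of rides per bucket.
histogram = top_n(speed_frame, n=len(speed_frame))
```

Counting set bits per row is all the histogram needs, so the original ride records never have to be scanned.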

~~~
dTal
>converting data such that it is represented in ones and zeros

er, what? Isn't it all?

>This dramatically reduces the size

huh? There is no symbolic encoding less efficient for length than binary.

~~~
fnordsensei
Well, you could decide that one axis is monkeyIndex, and the other is
amountOfBananasOwned, and have a quite compact representation of which monkey
owns what number of bananas.

I.e., decide on a symbolic meaning for the axes rather than converting data
wholesale.

~~~
dTal
That doesn't sound compact at all! Every monkey's banana count uses a fixed
number n of bits, where n = max(amountOfBananasOwned). That's horribly
inefficient, when an ordinary binary counter uses n =
log2(amountOfBananasOwned).

Which is not a criticism of Pilosa - I'm sure it's doing something very clever
- I just don't understand what.

~~~
jaffee
Bit-sliced indexing is the clever magic here. This post goes very deep on it:
[https://www.pilosa.com/blog/range-encoded-bitmaps/](https://www.pilosa.com/blog/range-encoded-bitmaps/)

But really, you use one bitmap for each binary bit of an integer, and it turns
out you can generate arbitrary range queries on your dataset by doing various
combinations of boolean operations on those bitmaps.
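
A minimal sketch of that idea, assuming unsigned integers and Python ints as uncompressed bitsets. The function names and the banana counts are hypothetical, and this is a textbook-style less-than walk over the bit slices, not necessarily how Pilosa implements it.

```python
def build_bit_slices(values, bit_depth):
    """Build one bitmap per binary digit of the stored integers.

    slices[i] has bit `col` set when bit i of values[col] is 1.
    Bitmaps are Python ints used as bitsets.
    """
    slices = [0] * bit_depth
    for col, value in enumerate(values):
        for i in range(bit_depth):
            if (value >> i) & 1:
                slices[i] |= 1 << col
    return slices

def less_than(slices, num_columns, threshold):
    """Return a bitmap of columns whose value is < threshold.

    Uses only AND, OR and NOT on the slices, walking from the most
    significant bit down. Assumes threshold fits in len(slices) bits.
    """
    all_columns = (1 << num_columns) - 1
    lt = 0              # columns already known to be smaller
    eq = all_columns    # columns still tied with threshold so far
    for i in reversed(range(len(slices))):
        if (threshold >> i) & 1:
            # Threshold has a 1 here: tied columns with a 0 are smaller.
            lt |= eq & ~slices[i]
            eq &= slices[i]
        else:
            # Threshold has a 0 here: tied columns with a 1 are larger.
            eq &= ~slices[i]
    return lt

def columns(bitmap):
    """Return the column ids whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical banana counts, one per monkey (column).
bananas = [5, 0, 12, 7, 3, 9]
slices = build_bit_slices(bananas, bit_depth=4)

# Monkeys with 3 <= bananas < 9: (value < 9) AND NOT (value < 3).
in_range = less_than(slices, len(bananas), 9) & ~less_than(slices, len(bananas), 3)
result = columns(in_range)
```

Four bitmaps are enough here for any count up to 15, so the number of rows grows with log2 of the maximum value and not with the maximum itself.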

~~~
kazinator
This page perpetrates an apparent contradiction. Starts off with:

> _Pilosa is built around the concept of representing everything as bitmaps._

What this literally means is that not a single datum is stored that is not in
the bitmap.

But then the diagrammed examples look like this:

    
    
                    manatee loris sea_horse
      Vertebrate    1       1     1
      Invertebrate  0       0     0
      Breathes Air  1       1     0
    
    

Pardon my ignorance, but I see here a bitmap of 9 bits plus six character
strings that are obviously not inside that bitmap.

If those strings are removed, the bitmap means jack squat.

Are they understood to be in another bitmap?

~~~
jaffee
One does have to maintain some understanding of how integer row and column
IDs are linked to what they actually represent.

Sometimes this is a function which might map (for example) row 3 to the letter
'd', 4 to 'e', and so on. Sometimes it has to be a lookup table which can be
kept within Pilosa, or externally. Sometimes the IDs map directly to what they
represent (day-of-month, year, passenger count, etc.)

So strictly speaking, not _everything_ is a bitmap, but the bulk of the heavy
lifting in terms of serving queries is computation on bitmaps.
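
As an illustration of the function and lookup-table styles of mapping, here is a small sketch using the animal table from the parent comment. `KeyTranslator` and `letter_for_row` are hypothetical names, and the lookup table here lives outside the index in ordinary Python dicts and lists; bitmaps are Python ints used as bitsets.

```python
def letter_for_row(row_id):
    """Functional mapping: row 0 -> 'a', row 3 -> 'd', row 4 -> 'e'."""
    return chr(ord("a") + row_id)

class KeyTranslator:
    """Lookup-table mapping between string keys and integer ids.

    Ids are handed out in order of first appearance.
    """
    def __init__(self):
        self._id_by_key = {}
        self._key_by_id = []

    def id_for(self, key):
        if key not in self._id_by_key:
            self._id_by_key[key] = len(self._key_by_id)
            self._key_by_id.append(key)
        return self._id_by_key[key]

    def key_for(self, id_):
        return self._key_by_id[id_]

rows = KeyTranslator()   # attribute name <-> row id
cols = KeyTranslator()   # animal name <-> column id
bitmaps = {}             # row id -> bitmap of column ids

facts = [
    ("Vertebrate", "manatee"), ("Vertebrate", "loris"), ("Vertebrate", "sea_horse"),
    ("Breathes Air", "manatee"), ("Breathes Air", "loris"),
]
for attribute, animal in facts:
    row_id, col_id = rows.id_for(attribute), cols.id_for(animal)
    bitmaps[row_id] = bitmaps.get(row_id, 0) | (1 << col_id)

# The query itself only touches integers and bitmaps...
hits = bitmaps[rows.id_for("Vertebrate")] & bitmaps[rows.id_for("Breathes Air")]
# ...and the strings come back in only when presenting the result.
names = [cols.key_for(i) for i in range(hits.bit_length()) if (hits >> i) & 1]
```

The strings sit in the translators, not in the bitmaps, which is the sense in which not everything is a bitmap.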

------
continuations
How do you handle race conditions? E.g. an app updated the persistent store
but crashed before it could update Pilosa?

~~~
jaffee
Pilosa is best used in conjunction with something like Kafka with (e.g.)
separate consumers for Pilosa and a persistent data store.
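
One way to read that suggestion, sketched with an in-memory list standing in for the Kafka topic. Everything here is an assumption for illustration: `Log`, `Consumer`, the event shape, and the two sinks are hypothetical, and no real Kafka client or Pilosa API is used.

```python
class Log:
    """In-memory stand-in for a durable, append-only log such as a Kafka topic."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

class Consumer:
    """Applies events to its own sink and tracks its own offset."""
    def __init__(self, log, apply):
        self.log = log
        self.apply = apply
        self.offset = 0  # index of the next event to process

    def poll(self, max_events=None):
        end = len(self.log.events)
        if max_events is not None:
            end = min(end, self.offset + max_events)
        while self.offset < end:
            self.apply(self.log.events[self.offset])
            # Advance only after the event was applied, so a crash
            # before this line means the event is retried later.
            self.offset += 1

store = {}   # hypothetical persistent store: ride id -> record
index = {}   # hypothetical bitmap index: passenger count -> bitmap of ride ids

def apply_to_store(event):
    store[event["id"]] = event

def apply_to_index(event):
    row = event["passengers"]
    # Setting a bit is idempotent, so replaying an event is harmless.
    index[row] = index.get(row, 0) | (1 << event["id"])

log = Log()
store_consumer = Consumer(log, apply_to_store)
index_consumer = Consumer(log, apply_to_index)

# The app writes only to the log, never to either sink directly.
for ride_id, passengers in enumerate([1, 2, 1]):
    log.append({"id": ride_id, "passengers": passengers})

store_consumer.poll()              # the store is fully caught up
index_consumer.poll(max_events=1)  # the index consumer stops after one event
behind = index_consumer.offset < store_consumer.offset

index_consumer.poll()              # on restart it resumes from its own offset
```

Because both sinks are derived from the same log, a crash leaves the index temporarily behind, not permanently inconsistent with the store.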

------
ahazred8ta
"Continuous Analysis on Really Big Data - Pilosa is an open source,
distributed bitmap index" [https://www.pilosa.com/](https://www.pilosa.com/)

------
shuzchen
Any chance there'll be some built-in support to perform collaborative
filtering? Seems like a database of relations like this would be awesome for
user-based collaborative filtering.

~~~
jaffee
Great question! We have actually done some experiments with this in the past
and will likely be rolling out features like this on top of Pilosa as part of
Molecula: [https://www.molecula.com/is-your-data-ai-ready/](https://www.molecula.com/is-your-data-ai-ready/)

