
Building data structures that are smaller than an array and faster (in C) - siganakis
http://siganakis.com/challenge-design-a-data-structure-thats-small
======
nadam
What kind of compressed data structure I would use depends so heavily on so
many things (the probability distribution of the data, the probability
distribution of different access patterns) that it is very hard to design a
data structure for a generic case. Designing for a specific use case is a lot
of fun though!

For example this: "Use less memory than an array in all circumstances" is just
plain impossible if your data distribution is uniform. You just cannot store
1000 random 32 bit integers (preserving their order) in fewer than 1000*32
bits; it follows from a simple counting argument: there are 2^32000 equally
likely sequences, so some of them must take at least 32000 bits to encode.

~~~
alexgartrell
I agree with your point 100%, but if they're sorted you could probably just
store the deltas and then use something like the varints that google protocol
buffers use to compress those deltas down from 32 bits (with a thousand of
them, you are guaranteed that some of the deltas will be less than 23 bits,
because 1000 * 2^23 > 2^32).

With custom compression, where there's a will there's generally a way :)
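
A minimal sketch of that scheme (protobuf-style varints, written from scratch
here rather than taken from the protobuf library; names are mine): store each
gap between consecutive sorted values as a varint, 7 data bits per byte.

    
    
        #include <stdint.h>
        #include <stddef.h>
        
        /* Encode v as a protobuf-style varint: 7 data bits per byte,
           high bit set on every byte except the last. Returns bytes written. */
        static size_t varint_encode(uint32_t v, uint8_t *out)
        {
            size_t n = 0;
            while (v >= 0x80) {
                out[n++] = (uint8_t)((v & 0x7F) | 0x80);
                v >>= 7;
            }
            out[n++] = (uint8_t)v;
            return n;
        }
        
        /* Encode a sorted array as varint-encoded deltas from the previous
           value (the first delta is from 0). Returns total bytes used. */
        static size_t encode_sorted(const uint32_t *a, size_t len, uint8_t *out)
        {
            size_t used = 0;
            uint32_t prev = 0;
            for (size_t i = 0; i < len; i++) {
                used += varint_encode(a[i] - prev, out + used);
                prev = a[i];
            }
            return used;
        }
    

Small deltas cost one byte instead of four; the decoder just reads bytes until
one has its high bit clear and prefix-sums the deltas back.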

------
gorset
Wavelet trees are awesome. There are several ways to speed up the rank/select
operations, which the article doesn't mention. For example, you can create a
simple (popcount, offset) index for each bitmap. Then when you need to find
the nth bit (the select operation), you can simply do a binary search to find
the offset with the nearest count. Likewise for the access (or the rank
operation) you can find the popcount for the nearest offset in the index.
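
As a rough sketch of that (popcount, offset) index on the rank side (block
size and names are my own, and GCC/Clang's __builtin_popcountll stands in for
the popcnt instruction): sample the cumulative popcount every few words, then
rank is one table lookup plus a short scan.

    
    
        #include <stdint.h>
        #include <stddef.h>
        
        #define WORDS_PER_BLOCK 8  /* sample the cumulative count every 8 words */
        
        /* Build the block index over n words (LSB-first bit order).
           index must hold n / WORDS_PER_BLOCK + 1 entries. */
        static void build_index(const uint64_t *bits, size_t n, size_t *index)
        {
            size_t count = 0;
            for (size_t w = 0; w < n; w++) {
                if (w % WORDS_PER_BLOCK == 0)
                    index[w / WORDS_PER_BLOCK] = count;
                count += (size_t)__builtin_popcountll(bits[w]);
            }
        }
        
        /* rank1(i): number of 1-bits in bits[0..i), using the sampled index
           to skip whole blocks, then popcounting at most WORDS_PER_BLOCK words. */
        static size_t rank1(const uint64_t *bits, const size_t *index, size_t i)
        {
            size_t word = i / 64;
            size_t block = word / WORDS_PER_BLOCK;
            size_t count = index[block];
            for (size_t w = block * WORDS_PER_BLOCK; w < word; w++)
                count += (size_t)__builtin_popcountll(bits[w]);
            if (i % 64)  /* partial word: keep only the low i%64 bits */
                count += (size_t)__builtin_popcountll(bits[word] << (64 - i % 64));
            return count;
        }
    

The index costs one counter per block, so the space/time trade-off is just
WORDS_PER_BLOCK.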

I use wavelet trees heavily in my job. They are very useful as compact
representations of order-preserving collections of strings which support fast
lookups.

Update: Sux4j has some interesting implementations of fast select/rank:

<http://sux.dsi.unimi.it/>

<http://sux4j.dsi.unimi.it/docs/it/unimi/dsi/sux4j/bits/Select.html>

~~~
alexbowe
Glad you mentioned this. The author linked to my blog post, where I mention
fast rank/select structures. There are faster ways to do it without a
supporting structure too of course :) but I do like this post.

I'm interested in what you use the WT for in your job... I'm just starting off
my PhD in succinct data structures, so it's nice to know that people use them
:P

~~~
gorset
Cool, WT and succinct data structures are definitely an interesting field
these days.

I'm primarily using it as a compact order-preserving structure to support fast
lookups for both (token -> position) and (position -> token). I'm using an
updatable variant to support O(1) appends, and then use it for realtime
indexing/compression. I should probably do a blog post at some point...

Would be interesting to hear about faster ways to do rank/select without using
support structures... I'll keep an eye on your blog :-)

~~~
ot
> I'm primarily using it as a compact order-preserving structure to support fast
> lookups for both (token -> position) and (position -> token). I'm using an
> updatable variant to support O(1) appends, and then use it for realtime
> indexing/compression. I should probably do a blog post at some point...

That's very interesting. Are you using raw or compressed bitmaps (such as RLE
or RRR)? A limitation of (updatable) wavelet trees is that you need to know in
advance the set of symbols that you will see (otherwise you would need to
change the tree structure). Is that a problem for you? I wrote a paper with a
solution to this problem; I can't share it until mid-March, but let me know if
you are interested.

~~~
gorset
My solution is to use a fixed tree with bloated leaf nodes (use more bits per
element in the leaf node). A high number of nodes with a low number of
elements causes a high overhead, so I simply rebuild the tree at a few given
thresholds to keep the overhead and the cost of the bloated leaf nodes down.
We can discuss further by email, if you want. I'd love to read your paper! :-)

------
gregschlom
So wait... You have 64-bit keys and 32-bit values. Then why not store the
values in the keys? Use the lower 32 bits for the value, and the upper 32
bits to count how many duplicates you have.

    
    
        int32_t big_array[1ULL << 32] = {0};  /* 2^32 entries = 16 GB */
        
        /* returns the key */
        int64_t write(uint32_t value)
        {
            big_array[value]++;
            return ((int64_t)big_array[value] << 32) | value;
        }
        
        /* returns the value */
        int32_t read(int64_t key)
        {
            return (int32_t)key;
        }
    
        /* all counts between 1 and the return value are valid upper-half
           values for the key; the lower half is the value itself. */
        int32_t seek(uint32_t value)
        {
            return big_array[value];
        }
    

memory: 16GB

access: O(1)

seek: O(1)

Problem solved!

:)

edit: I'm missing the "Use less memory than an array in all circumstances"
requirement, but hey, you talk about "billion of records" in your post.

~~~
gorset
You also get order with a wavelet tree. You can ask "Give me the absolute
position for the 4th value 42 in the array". By using a wavelet tree with an
index, you can answer that question in almost constant time.
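
For a flavor of the select primitive that question bottoms out in, here is a
deliberately naive sketch (linear scan; a real implementation would first
binary-search a sampled (popcount, offset) index as described upthread, and
__builtin_ctzll is a GCC/Clang builtin):

    
    
        #include <stdint.h>
        #include <stddef.h>
        
        /* select1(n): position of the n-th 1-bit (1-based), LSB-first,
           or -1 if there are fewer than n set bits. */
        static long select1(const uint64_t *bits, size_t nwords, size_t n)
        {
            for (size_t w = 0; w < nwords; w++) {
                uint64_t word = bits[w];
                while (word) {
                    if (--n == 0)
                        return (long)(w * 64 + (size_t)__builtin_ctzll(word));
                    word &= word - 1;  /* clear the lowest set bit */
                }
            }
            return -1;
        }
    

"The 4th occurrence of value 42" then becomes a handful of such select calls,
one per level of the wavelet tree.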

------
tabbyjabby
Bit confused here --

The author rewrites the initial values as:

1 0 2 0 2 3

He defines the keys array as:

2 3 4 6

Using the method defined in the article to find the values of each of the
elements of the rewritten array, wouldn't the values be:

3 2 4 2 4 6

which does not equal the original values of:

3 2 4 6 2 6

~~~
siganakis
My bad, I got that wrong.

The translated array should be:

1 0 2 3 0 3.

I am correcting the article now. For a non-rushed, accurate description please
look at:

<http://www.alexbowe.com/wavelet-trees>

Edit: Got it wrong again... I guess this is why we have computers and
debuggers!

~~~
amatus
Looks like you still have some bugs:

    
    
      rewritten data: 1 0 2 3 0 3
      first row     : 0 0 1 1 0 1
      group a       : 1 0     1
      group b       :     2 2   3
    

Group a and group b don't seem to match up with the data. Or I've
misunderstood something.

~~~
foamboarder
I'm in the same boat. siganakis?

------
nkurz
Depending on your data set, there might be a way to use Compressed Bit
Vectors: <http://sdm.lbl.gov/fastbit/>

General intro: <http://crd-legacy.lbl.gov/~kewu/ps/LBNL-2164E.pdf>

"Word Aligned Hybrid" compression: <http://crd-
legacy.lbl.gov/~kewu/ps/LBNL-49627.pdf>

Bit maps as alternative to inverted index: <http://crd-
legacy.lbl.gov/~kewu/ps/LBNL-61768.pdf>

Bit maps for range queries: <http://crd-
legacy.lbl.gov/~kewu/ps/LBNL-60891.pdf>

~~~
siganakis
Bitmaps are very good at "Seek" operations (where I try to find a value), but
poor at "Access" (where I try to find the value at an index). This is because
"Access" requires checking all the bitmaps to see which one has the bit set at
that index.

Also the "Word Aligned Hybrid" method of RLE compression is patented.
<http://www.freepatentsonline.com/6831575.html>

~~~
nkurz
It's true that simple Access is O(N) as is, but you aren't limited to using
the same index for forward and inverse lookups. It's even possible that you
could use compressed bitmaps for both in less space than a fixed array. Single
or multilevel indexing into the bitmap can also speed things up. The "Inverted
Index" paper linked above has an interesting take.

Patents are thorny, and it's generally not recommended that developers read
them: willful infringement equals treble damages, caveat lector. The license
may make better reading:
<http://crd-legacy.lbl.gov/~kewu/fastbit/src/license.txt>. Search for
"software patents pose a constant threat to the existence of any free
program".

If it remains a concern, and realizing that any other scheme you choose is
also likely to be encumbered, you may be able to substitute a different
compression such as PForDelta. The important part is cache awareness and
branch prediction -- this is likely where your wavelet code is falling short.

Good luck!

------
anemitz
I've seen slab allocation schemes (not sure if that's a standard term for it)
work in scenarios like inverted indices where you want to keep your datum key
size small (32 vs. 64-bit).

The intuition is that you don't need granular memory addressing for storing
large chunks. You only need addressability at the min(sizeof(doc)) level, so
you can usually store the key in a 32-bit int or less.
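
A rough illustration of the idea (the slot size and all names here are made
up, not from any particular allocator): if every slot is, say, 256 bytes, a
32-bit slot index can address a terabyte of arena even though pointers are
64-bit.

    
    
        #include <stdint.h>
        #include <stddef.h>
        
        #define SLOT_SIZE 256  /* min(sizeof(doc)): smallest addressable unit */
        
        /* Convert a compact 32-bit slot key into a real pointer... */
        static void *slot_to_ptr(uint8_t *base, uint32_t key)
        {
            return base + (uint64_t)key * SLOT_SIZE;
        }
        
        /* ...and back. p must point into the slab arena starting at base. */
        static uint32_t ptr_to_slot(uint8_t *base, void *p)
        {
            return (uint32_t)(((uint8_t *)p - base) / SLOT_SIZE);
        }
    

The index then stores 32-bit slot keys instead of 64-bit pointers, halving
the per-entry cost.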

More detail can be found here: <https://gist.github.com/1947190>

------
wollw
A very interesting data structure. Reading this inspired me to give an
implementation a try. It was a lot harder than it seemed at first and it's
certainly a naive attempt but if anyone is interested you can take a look at
it here:

<https://gist.github.com/1947371>

------
lsb
So you have 64-bit integer keys and 64-bit integer values, and you need to
store these key-value pairs such that you get good lookups from key to list of
values and from value to list of keys?

How often are you updating? What is reading from this data structure? What's
the query structure?

If you're adding lots of records often, I'd go for a B+tree. There's also a
trie (and a B-trie), if your keys share lots of common prefixes.

If you are making consecutive queries, then you'd do well to take advantage of
what pages are in L1 and L2 cache, but if you're randomly picking things,
that's less important.

~~~
siganakis
What we have is 32-bit integers (or anything really) that we store using bit
vectors (which are implemented as arrays of 64-bit integers).

These structures are read-only. I believe there is some work on making
wavelet trees updatable, but I think that would be very expensive.

Reads are primarily Seeks (maybe 70%), I think, but Access is also important -
so the order must also be encoded.

Storing keys and values is expensive in terms of memory (storing everything
twice) so what we are trying to do is to manipulate the data such that we can
do these operations quickly while using the minimum number of bits.

Edit: Fixed iPhone related spelling / grammar.

------
helveticaman
_(particularly in the “refine” and “access” methods, where utilizing the
popcnt SSE4.2 instruction would no doubt have a massive impact on
performance)_

