

Dense bitpacking - hywel
http://writing.londonstartuptech.com/2014/04/24/dense-bitpacking/

======
eloff
The article author talks about the number of CPU ops as if they were all
equally expensive. Since he's using compile-time constants, he probably never
noticed how horribly slow division is (20-100 cycles): the compiler uses a
trick like the one in Hacker's Delight to convert constant division into
multiplication (~4 cycles). But shifts and masks are cheap (1 cycle each).
This kind of trickery is rarely worth the small space savings (to be fair,
the author seems aware of that, calling it immature optimization).

~~~
asQuirreL
Also worth noting that storing them in a long probably means 2 memory reads
per access on most modern architectures, whereas storing them in bitfields
will allow the compiler to select the containing CPU word, which will be 1
read. I would be interested to see actual performance measurements (maybe a
performance vs memory usage chart).

~~~
TheLoneWolfling
I thought that modern architectures tend to read in an entire cache line at
once?

In which case bit packing can be problematic as it can stretch across a cache
boundary?

~~~
uxcn
No, memory bus transfers happen in word sizes. This is why caches sometimes
use _critical word first_.

However, straddling word boundaries is problematic.

~~~
eloff
If you're reading sequentially in a loop, the prefetcher will detect that and
the cache lines will be loaded ahead of the accesses and the cost for
straddling will be almost nothing. If you're doing random access of misaligned
64bit words, roughly 1/8 of your accesses will need two cache lines, which we
can naively say is twice as slow as the other 7/8ths, so your runtime would be
9/8ths. That's a bit of a hit, but still not that bad.

~~~
uxcn
If the words are just unaligned, you not only run into issues spanning cache
lines, but spanning pages as well. The prefetcher can detect general linear
patterns (_±c_, etc.) after a couple of misses, but it can't prefetch across
pages. Misses are still an order of magnitude slower than hits though, and
this implies at least two misses per page (~100 cycles), plus at least one
miss per page you span. Assuming the values were tightly packed, and
depending on the number of instructions per value, it can nearly double the
cycle count. Thankfully, unaligned access is generally as fast as aligned
access on Intel, though.

When you span the word size (e.g. by adding a single bit), you not only add
two (dependent) instructions to access the value (consuming a register), but
without a linear access pattern (i.e. access by id/index), you can add
anything up to ~200 cycles per value access. So, depending on the number of
instructions per value, this could be anywhere up to 200x slower. This also
doesn't consider other possible issues like polluting the cache.

------
uxcn
This is kind of an abuse of C/C++ bitfields. Arguably you don't gain much by
using them as opposed to a memory chunk and simple accessors. Bitfields are a
way to get an explicit memory layout, but using them this way subverts word
alignment, word sizes, etc.

Is there any reason you didn't just pre-process the values and use a huffman
coding on them?

------
eska
In the context of the overall algorithm, wouldn't you cluster the movie
ratings first anyway (O(n)), so most of the algorithm would do computations on
those clusters which would have less data than single movies to begin with?
I'd worry about minimizing the size of the clusters instead. You'll also
probably want some kind of hierarchical data structure to use the cache
efficiently. This doesn't help with that. If the movie ratings are sorted, a
lot of the data becomes redundant and can be left out entirely. Those values
should all be 0-based too, since it saves space when packing (a range of 5
values per field instead of 6). With that "optimization" alone you save more
bits than with the early bit packing attempt. The solution with primes also
doesn't scale well.
The bigger the primes get, the more empty space you create.

~~~
hywel
The actual rating is a tiny part of the data per movie, so there's not much
saving there. And clustering would have to be done instead of indexing by
movie / user, so it would probably make performance worse overall.

Indexing by movie / user is done exactly for the reason of using the cache
efficiently. Unfortunately, you have to iterate through both movies and users,
so you either store the sparse matrix twice (once movie-indexed, once user-
indexed) OR you deal with lots of cache misses half the time.

And, yes, all the values are stored 0-based for exactly that reason :) It's an
even bigger saving for storing timestamps.

Not sure what you mean about the prime solution not scaling well - 3 primes of
~2^20 can be stored in ~2^60 (i.e. within 8 bytes) as opposed to within 3
4-byte integers.

When it really sucks is when you're storing lots of small integers, e.g. 20
things in [0,1,2,3] - that gets very inefficient fast, and it'd be much more
efficient to use normal bitfields.

------
TheLoneWolfling
Are there any languages that _suggest_ optimizations? Where you can explicitly
enable them?

Something like this seems like something a compiler could relatively easily
do. Have a bunch of different ways of storing structs (word-aligned, size-
aligned, bit-aligned, modulo/remainder, modulo-next-prime, modulo-next-easy-
prime, etc {what else am I missing here?}), and be able to specify to the
compiler via an annotation or similar which you want, or what your typical
sizes are and let the compiler choose from there.

~~~
hywel
That'd be hard for a compiler to suggest for this case - it requires knowing
the range of each value that you're storing in the struct (and not just the
number of bits).

But if you're happy to do that, there's no reason it couldn't be offered by a
compiler. However, it's only useful for an unusual use case: data that just
fits in memory with the bit-packing, but doesn't fit without it.

~~~
TheLoneWolfling
There are many cases where this sort of thing would be useful, not just this
edge case. For instance, automatically rewriting between a recursive function
and a function with an explicit external stack. Or swapping between an eagerly
evaluated function and a lazily evaluated one. Or swapping between different
types of allocations (stack, heap, region + bump pointer, etc). Or swapping
between different ways of parallelizing a function or loop. Or swapping
between row or column based evaluation. Or specifying two equivalent
functions and having the compiler check, as much as possible, that they
really are equivalent, calling both and asserting on debug builds and calling
the faster one on release builds (or potentially even calling _both_ in
separate threads and taking whichever returns first). Etc.

All of these are things that you can do manually, and that you can leave to
the compiler and hope that it'll pick the correct one, but currently good luck
telling the compiler that no, you really actually want to do <x>. Or to try
<x> and <y> and see which is better. And the amount of time required to do so
manually adds up, even though they are all things that could be done by a
compiler.

I personally wish that you could specify / annotate the range that you are
_actually_ intending to store in an integer (and have it optionally bound-
checked) anyway, but that is another matter.

(Actually, I wish that you could specify types with an arbitrary (probably-
pure) boolean function to indicate if something is or isn't a valid value in
the type. But that is another matter indeed.)

~~~
adrianN
Have you tried Ada?

[http://www.ada-auth.org/standards/12rat/html/Rat12-2-5.html](http://www.ada-
auth.org/standards/12rat/html/Rat12-2-5.html)

~~~
nickpsecurity
Ada solved, a while ago, a lot of the problems that still show up on SO, HN,
etc. Mature tools, too. Yet the mainstream preference is to call Ada hard and
ugly to use, while re-inventing Ada in existing languages is somehow easier.
Doesn't _sound_ easier... ;)

------
hywel
It doesn't actually use bitfields, it describes an alternative to them that
was a better choice in this specific instance.

Huffman would have been much slower to access, which would have been
unacceptable for the use case (iterating 100 million data points multiple
times a second).

~~~
uxcn
I'm not familiar with the dataset you were working with, but I would be
willing to bet certain values occur a lot more frequently than others. In
which case, you could work with the compressed values, and only uncompress
them for the results.

------
stuaxo
Awesome, I wondered whether something like this might be possible when I first
found out about bitfields, but maths isn't my strong point (and certainly
wasn't 20 years ago), so I just forgot about it.

------
periodontal
The fastest way to encode these dense bitpackings is almost certainly with the
Chinese remainder theorem.

~~~
hywel
Kind of. The Chinese Remainder Theorem tells you that a number satisfying the
congruences exists, and that it's unique (modulo the product of the moduli),
but it doesn't tell you how to calculate it.

The standard way is to do it recursively - you can see my implementation here:
[https://github.com/hcarver/Netflix/blob/2136aa5d28a209f902d4...](https://github.com/hcarver/Netflix/blob/2136aa5d28a209f902d4bbb2da8cfe731547bf5e/NetflixDataNG.cpp#L944)

~~~
periodontal
CRT is often given with a closed form for the satisfying solution, though it
requires computing multiplicative inverses mod each of the n_i (your primes).
Since all of these values are known at design time, they can be precomputed,
and the run time for encoding becomes a constant M multiplications, M-1
additions, and one mod at the end, where M is the number of items you are
encoding.

[https://en.wikipedia.org/wiki/Chinese_remainder_theorem#Gene...](https://en.wikipedia.org/wiki/Chinese_remainder_theorem#General_case)

For example, the 3, 5, 7 packing yields a closed form of 70a_1+21a_2+15a_3
(mod 105). Plugging in 2, 4, 3 like your example yields 59 directly.

------
hywel
Although obviously there's still a preprocessing step to make that possible.

------
nickpsecurity
Clever and interesting scheme for encoding the data.

------
mc_hammer
good read cheers.

