
Fast integer compression in Java - dangoldin
http://lemire.me/blog/archives/2013/07/08/fast-integer-compression-in-java/
======
andrewflnr
In what way is compressing an array of integers "less general" than a general
compression algorithm? Is that not exactly what a normal compression algorithm
does, or at least can be construed as doing, with, say, 8-bit integers?

~~~
damian2000
I was thinking the same myself - couldn't it work just as well on a random
byte stream by just passing in every 4 bytes as a 32-bit integer?

~~~
DannyBee
No, because everything would be an outlier, so it would save nothing. Here is
a basic explanation (there are niggly little details I'm not going to go over,
since the papers do a better job).

The way FOR codecs (I'll get to PFORDELTA in a sec) work is by taking numbers,
say 128 at a time, finding the max number of bits necessary (so if all numbers
are between 0 and 31, it would be 5), and then encoding them as 5-bit numbers.
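That bit-packing step can be sketched in a few lines of Java (my own
illustration, not the article's library code). `pack` assumes every value in
the block already fits in `b` bits:

```java
// FOR-style bit packing sketch: write each value of a block into b bits of a
// long[] output, and read one back by index. Assumes 1 <= b <= 32 and that
// every value fits in b bits (a real codec would verify this).
public class ForPack {
    static long[] pack(int[] block, int b) {
        long[] out = new long[(block.length * b + 63) / 64];
        int bitPos = 0;
        for (int v : block) {
            int word = bitPos >>> 6, off = bitPos & 63;
            out[word] |= ((long) v) << off;
            if (off + b > 64) {              // value straddles a word boundary
                out[word + 1] |= ((long) v) >>> (64 - off);
            }
            bitPos += b;
        }
        return out;
    }

    static int unpack(long[] packed, int b, int i) {
        int bitPos = i * b;
        int word = bitPos >>> 6, off = bitPos & 63;
        long v = packed[word] >>> off;
        if (off + b > 64) {                  // pull in the straddled high bits
            v |= packed[word + 1] << (64 - off);
        }
        return (int) (v & ((1L << b) - 1));
    }
}
```

The decode side is where these codecs get their speed: with a fixed `b` per
block, unpacking is branch-light shifting and masking.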

FOR has the problem that if you have a list of numbers like 1 2 2 2 3 728 3 2
1, the 728 screws you.

PFORDELTA instead finds the number of bits it takes to represent, say, 90% of
the numbers (i.e. it finds the b such that 90% of numbers are < 2^b), encodes
those numbers in fixed-size blocks of b bits, and along the way, "patches" out
the 10% of outliers, encoding them at the end.

It is normally used on posting lists.

In a sufficiently large random byte stream, treating every 4 bytes as an
integer, either b == 32 all the time, or over time the outliers will take more
space than you save elsewhere (where the amount of "more space" is whatever
the control overhead of the outlier encoding is).

So you will either save nothing, or grow in size. This is guaranteed by the
pigeonhole principle.

~~~
ygra
From what you're writing it seems like this _might_ work for bytes when you
treat them as such and not group four of them together as integers. Especially
text probably has pretty much the same patterns (lots of code points in a
range [lower-case letters] with a few outliers [upper-case letters and
spaces]).

But then again, for text there probably are much better-suited algorithms.

~~~
gizmo686
For text, you would probably want an encoding hard-coded into the algorithm.
Because (in a given language) the distribution of characters is relatively
consistent, you do not lose much by deciding up front which characters only
need 2 bits and which ones need 10; and you save on having to put this
information in the compressed data.
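A minimal sketch of such a hard-coded scheme, with made-up code lengths (any
real compressor would derive these from corpus statistics): frequent
characters get short prefix-free codes, rare ones get long codes, and no table
needs to be shipped with the data.

```java
import java.util.Map;

// Hard-coded prefix-free code sketch (hypothetical code lengths, not from any
// real compressor). Because no code is a prefix of another, the output bit
// string can be decoded unambiguously without separators.
public class StaticCode {
    static final Map<Character, String> CODE = Map.of(
        'e', "00",            // very frequent: 2 bits
        't', "01",
        'a', "100",
        ' ', "101",
        'q', "1100110011"     // rare: 10 bits
    );

    static String encode(String s) {
        StringBuilder bits = new StringBuilder();
        for (char c : s.toCharArray()) bits.append(CODE.get(c));
        return bits.toString();
    }
}
```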

But you are right that there are far better algorithms for text compression
(although if doing this character re-encoding makes sense, I would imagine
most implementations include it along with the main algorithm, and several
other bonus compression algorithms).

------
beagle3
FTA:

> Though you cannot reach the same kind of speed in Java as you can in C++,
> there are many good reasons to use Java instead of C++. How good is Java at
> this task? Direct comparisons between Java and C++ are difficult. I would
> estimate that the difference is a factor of 3 and more. But Java can still
> be more than fast enough.

That has also been my experience, but there's always an army of people who
claim "Java can be just as fast or faster than C++". Anyone care to prove
Mr. Lemire wrong? (My experience was more like twice as slow for
memory-intensive code, but thrice as slow still sounds like a reasonable
estimate to me.)

~~~
noelwelsh
The answer is that if you're prepared to go off-piste, mainly by using
sun.misc.Unsafe, you can get within a gnat's whisker of C++. For example:

[http://mechanical-sympathy.blogspot.co.uk/2012/07/native-cc-...](http://mechanical-sympathy.blogspot.co.uk/2012/07/native-cc-like-performance-for-java.html)

and

[http://mechanical-sympathy.blogspot.co.uk/2012/10/compact-of...](http://mechanical-sympathy.blogspot.co.uk/2012/10/compact-off-heap-structurestuples-in.html)

It's interesting to look at what Java lacks to achieve this level of
performance with less pain. The main issue, I believe, is control over memory
layout. For example, avoiding boxing in arrays of objects. Avoiding GC is also
an issue.

I think Rust is interesting as it allows the programmer to talk about these
things without going "outside" the language like one must in Java, while still
retaining a modern programming style.

~~~
beagle3
That's a good answer, but it isn't really Java - it's assembly in disguise.
You lose basically everything Java can give you when you do that. And I don't
know since when it actually works well - as of 2011, with Java 1.6, I had
similar code that used memory-mapped buffers, and it crawled like a snail
because (apparently - I couldn't directly verify) the optimizer would not
inline the memory-mapped access, meaning that every array access cost ~10
times as much as it should.

~~~
noelwelsh
I agree it isn't great, but it works, and if you need that extra performance
in a small part of your application it can allow you to stay on the JVM. I'd
rather do that than write everything in C (YMMV.)

The blog posts I linked compare performance to nio ByteBuffers, and show
improvement for Unsafe over ByteBuffer. That might be relevant to your case.

~~~
beagle3
I rewrote it in C, and got a 10x speedup and 4x less memory. So, it's no
longer relevant :)

------
nivertech
Is there lossless compression encoding like PForDelta but for floats/doubles?

~~~
gizmo686
Yes. Floats/doubles can be strictly ordered. Now, instead of storing a value
as a float, store the distance (in your strictly ordered list) from the
previous value. This is effectively mapping your floats/doubles to ints; but
it works because the mapping maintains the ordering that is central to
PForDelta.
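One concrete way to build such a mapping in Java - a standard IEEE-754 bit
trick, offered as an assumption about what the parent means, not their exact
construction. Adjacent representable floats map to adjacent ints, so the
difference between two mapped values is exactly the "distance in the strictly
ordered list" (ignoring NaN, and counting -0.0 and +0.0 as distinct):

```java
// Order-preserving float <-> int mapping. Non-negative floats already compare
// correctly as signed int bit patterns; negative floats have their magnitude
// bits inverted so that more-negative floats map to smaller ints.
public class FloatOrder {
    static int toOrderedInt(float f) {
        int bits = Float.floatToIntBits(f);
        return bits >= 0 ? bits : bits ^ 0x7FFFFFFF;
    }

    static float fromOrderedInt(int i) {
        int bits = i >= 0 ? i : i ^ 0x7FFFFFFF;
        return Float.intBitsToFloat(bits);
    }
}
```

With this in place, delta-encoding the mapped ints and feeding them to an
integer codec like PForDelta is exactly the scheme described above.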

~~~
nivertech
I doubt it will be lossless, unless you're using the same ordering and the
same compute device. My primary use case is encoding on CPU, decoding on GPU.

~~~
gizmo686
I have not worked heavily with floats, but I think numerical comparison (i.e.
greater than / less than) produces a non-ambiguous ordering. Assuming that
both ends have the same set of numbers representable as floats, it should be
lossless. If they do have a different set of floats, then you need to solve
lossless conversion before talking about lossless compression.

~~~
nivertech
How do you decide how many bits the distance should be encoded in? How many
bits for mantissas, sign bits?

I worked with half floats (FP16) and other weird partial word float formats
... I don't think it's trivial.

I guess the most straightforward way is to convert to fixed point real numbers
first, then it's essentially the same as integers.

Or maybe just do it per block of each 128 floats?

