Sorted Integer Compression (ayende.com)
63 points by GordonS on June 18, 2021 | 40 comments



For anyone who wants really good compression of data with a particular characteristic, a very useful approach is to first shrink or reformat the data based on that characteristic, then pass the result through a general purpose compressor.

For this case that might mean converting the data into deltas, then writing those deltas as byte-oriented variable-length integers. The general-purpose compressor can then pick up most of what you left on the table by not being super careful about the chosen integer format, and it might find other useful patterns in the data without you having to worry about them.
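
A minimal sketch of that pipeline, assuming a sorted list of non-negative integers (the function names are just illustrative):

    import zlib

    def varint(n):
        # LEB128-style: 7 data bits per byte, high bit means "more bytes follow"
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def compress_sorted(values):
        # delta-encode, varint-encode, then let a general-purpose compressor
        # pick up whatever patterns remain
        prev, buf = 0, bytearray()
        for v in values:
            buf += varint(v - prev)
            prev = v
        return zlib.compress(bytes(buf))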

Theoretically a single compressor crafted purely for the exact characteristics of the data at hand would be able to do a better job. But in practice the skill and time required to beat the hybrid approach tends to be a showstopper.


> For this case that might mean converting the data into deltas, then writing those deltas as byte-oriented variable-length integers.

Calculating a trend line and storing deltas off that trend line might be mildly better for some series. Having an AR model and storing the model's errors might work even better, although the obvious (and common) problem is deciding which of the many possible AR models to go for.
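
A rough sketch of the trend-line variant, assuming an integer-valued series and numpy at hand: fit a line, round the predictions, and keep the integer residuals plus the two coefficients.

    import numpy as np

    def detrend(values):
        # fit a straight line to the series; the residuals are what gets
        # delta/entropy coded, and (slope, intercept, residuals) reconstructs the data
        x = np.arange(len(values))
        slope, intercept = np.polyfit(x, values, 1)
        predicted = np.rint(slope * x + intercept).astype(np.int64)
        residuals = np.asarray(values, dtype=np.int64) - predicted
        return slope, intercept, residuals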


If you know the specifics of your data, maybe. But try to err on the side of simpler models: every detail you add slows your code down, and the difference after compression is often small.


Or simply a delta off the last delta (or mean of last n deltas, I guess), so you can do the compression in a stream.
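
Something like this sketch (streaming, one pass, no lookahead beyond the previous value):

    def double_deltas(values):
        # second-order deltas: predict each gap from the previous gap and
        # emit only the correction, so this works on a stream
        out, prev, prev_gap = [], None, 0
        for v in values:
            if prev is None:
                out.append(v)            # first value stored as-is
            else:
                gap = v - prev
                out.append(gap - prev_gap)
                prev_gap = gap
            prev = v
        return out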


If I understand you & GP, these are both just particular instantiations of Linear Predictive Coding [1]. For both of these approaches, you’d probably get similar or better results just throwing your data into an off-the-shelf LPC, and with less effort.

[1] https://en.m.wikipedia.org/wiki/Linear_predictive_coding


That might also be equivalent to one of the ARMA models, although I'd have to do the math to prove that (and to derive which one corresponds to what you're describing).

Edit: It seems that if you detrend the data, the "delta off the last delta", if I understand it correctly, is an MA(1) model. I think. Maybe. My math is wobbly.


Original poster here: I have tried doing that in the past, and the issue is that the overall cost in computation power is higher, and the compression rate is poorer than when you can optimize for that.

Another aspect is that we are usually not dealing with a large amount of data, so there isn't enough room for a general algorithm to play to its strengths.

For various reasons, we'll rarely see raw data larger than 1-2 KB per segment.


> For my purpose, I want to get as high a compression rate as possible, and I need to store just the list of integers

Then PFor(Delta) is not the best option: that class of algorithms intentionally trades compression rate for decoding speed, as it's meant to store posting lists in inverted indexes.

If you want higher compression, a better option would be Binary Interpolative Coding [1], which has one of the highest compression rates among the schemes that are relatively simple to implement. Then you can go down the rabbit hole and start to use entropy compression, complex modeling, etc... but there are diminishing returns.

> The problem with PFor, however, is that if you have a single large outlier, you’ll waste a lot of bits unnecessarily.

This is not true: the author is describing binary packing of deltas and mistaking it for PFor. What PForDelta means is (see the sketch after the list):

- P -> Patched: Outliers are stored separately in a list of "exceptions", so that does not affect the bit size of the rest of the block

- For -> Frame of Reference: the min value of the block is subtracted from the values in block, which are then encoded with the minimum number of bits necessary for the adjusted maximum (this is just binary packing)

- Delta -> Deltas are encoded instead of values, only used if your sequence is monotonic. If it is strictly monotonic, you can encode v[i+1] - v[i] - 1 and save a little more.
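
A rough sketch of those three steps in Python (pfordelta_block is a made-up name, and a real encoder would bit-pack the output and pick bit_width from the data rather than take it as a parameter):

    def pfordelta_block(sorted_block, bit_width):
        # Delta: store gaps between consecutive values instead of the values
        deltas = [b - a for a, b in zip(sorted_block, sorted_block[1:])]
        # For (frame of reference): subtract the block minimum so values start at 0
        lo = min(deltas)
        adjusted = [d - lo for d in deltas]
        # Patched: values that don't fit in bit_width bits go to an exception list,
        # so one outlier doesn't inflate the width used for the whole block
        limit = 1 << bit_width
        packed, exceptions = [], []
        for i, d in enumerate(adjusted):
            if d < limit:
                packed.append(d)
            else:
                packed.append(0)             # placeholder in the packed block
                exceptions.append((i, d))    # position + full value, stored separately
        return lo, packed, exceptions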

> This encoder will [...] Then we check the top and bottom halves of the batch, to see if we’ll get a benefit from recording them separately.

What this is describing is an algorithm for partitioning lists to optimize the overall binary packing compression rate; but it is greedy, and optimal algorithms based on dynamic programming exist [2].

EDIT: Forgot to mention that everything I wrote above is also explained in the paper cited in the post [3], which is indeed a very good survey of this stuff.

[1] https://link.springer.com/article/10.1023/A:1013002601898

[2] https://dl.acm.org/doi/10.1145/1871437.1871592

[3] https://arxiv.org/pdf/1209.2137.pdf


In a sorted list, all the outliers are grouped at the beginning or the end. It's hard to understand how storing the single big delta is using any bits unnecessarily.


The outliers the article is talking about here are in the *deltas*. If your sorted list is

0, 1, 2, 3, 1000, 1001, 1002, 1003, ...

your deltas (assuming strict monotonicity) are

0, 0, 0, 0, 996, 0, 0, 0 ...

and that 996 breaks a run that you could otherwise encode in 0 bits (plus block metadata).


Only if you naively use the number of bits required for the largest delta for all deltas. If your goal is good compression, you will use a variable-length encoding, probably based on frequency so that it is independent of the magnitude of the deltas. Why should a delta of one repeated a hundred times compress any better than a delta of a million repeated a hundred times? Arithmetic coding would probably work equally well.


"Naively"? Binary packing encoders are still competitive with the state of the art :) Which notably hasn't included variable-length encodings for quite a while.

Variable-length coding requires that each integer somehow encodes its own width. That can be very wasteful. Block-based methods exploit the fact that consecutive deltas usually require a similar number of bits, so it's better to waste a bit on some integers and amortize a single bit width over the block than to encode each width independently.
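
A tiny illustration of that amortization (block_widths is just an illustrative helper): pick one bit width per fixed-size block instead of tagging every integer with its own length.

    def block_widths(deltas, block_size=128):
        # one shared bit width per block of deltas: the width metadata is
        # paid once per block instead of once per integer
        widths = []
        for i in range(0, len(deltas), block_size):
            block = deltas[i:i + block_size]
            widths.append(max(max(block).bit_length(), 1))
        return widths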


So 1000, 1001, 1000, 1002, 1000, 1003, 1000, 1004 requiring 80 bits is good, when you could get it down to 16 bits plus a Huffman tree (which would admittedly eat up the 64-bit difference in this tiny example)? Storing the delta is essentially predicting that x[i + 1] = x[i] and storing the difference from that prediction. One could do better with a better prediction, for example x[i + 1] = x[i] + median_gap_size, which would bring the prediction errors down to 0, 1, 0, 2, 0, 3, 0, 4 in my example; those could be encoded in 24 bits as fixed-length integers, plus the median gap size. This could probably be improved further by adaptively predicting the gap size. Making the prediction errors small and independent of the gap size makes using short fixed-length integers in the common case, and switching to longer integers for outliers by reserving one short code as an escape code, a much better choice.

In the end the result is probably heavily dependent on the exact structure of the integer lists you want to compress, but I do not see how fixed length deltas can be competitive on arbitrary inputs.


I would suggest you actually read the whole thread before commenting, as it started by explaining how outliers *do not* make the rest of the list encode with 4 bits per integer.

> I do not see how fixed length deltas can be competitive on arbitrary inputs.

I also already explained this: most lists in practical applications exhibit local properties that are exploited by blocking.

Using Huffman/arithmetic coding generally doesn't work well because the cost of storing the probability tables does not break even with the savings. There was a recent line of work [1] that attempted this with ANS, a faster cousin of arithmetic coding, and the results were honestly underwhelming considering the complexity involved.

[1] https://dl.acm.org/doi/abs/10.1145/3159652.3159663

> In the end the result is probably heavily dependent on the exact structure of the integer lists

Yes, I did not claim that this works with any list. For any encoding scheme you can construct adversarial cases that do not compress well; that's just information theory. But that also applies to the schemes you're floating.


You are misunderstanding me, I am not primarily talking about outliers. I am saying if the deltas are big in magnitude but concentrated on only a few different values, then encoding the deltas directly is wasteful. Whether you then use entropy coding, which also deals with outliers for free, or whether you use something more sophisticated than deltas, for example double deltas or subtracting the minimum delta, was not really my point. Maybe it was a bit confusing that I first brought this up in the context of outliers.


This book has a nice treatment of that kind of compression:

https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...

For instance you might keep track of facts like

   the word "the" is contained in document 1
   the word "john" is contained in document 1
   the word "the" is contained in document 2
   ...
   the word "john" is contained in document 12
and you code the gaps; the word "the" appears in every document and the gap is always 1, but the gap for "john" is 11. With a variable-sized encoding you use fewer bits for smaller gaps -- with that kind of encoding you don't have to make "the" be a stopword because you can afford to encode all the postings.


That book is great, and relies on a combination of Golomb-Rice, delta coding, and arithmetic coding. I wonder how this compares with the OP's compression.


... it doesn't just "rely on" that stuff, but it explains the fundamentals of them really well!


That book is really dated (1994), but unfortunately there aren't many better options.

It is unfortunate because the understanding of the field has improved dramatically over the past decades, and explanations from a more modern perspective would benefit even novices.


The Standard MIDI file spec had something similar back in the 80s: it codes the gaps between successive events with increasingly longer byte sequences. Never mind that it wastes a whole bit to do that.


It reminds me of the classic interview problem, attributed to Google: sort 1 million 32-bit integers (provided as a stream) using 2 megabytes of memory. Most people's first instinct is that it can't be done, since you need almost 4 megabytes to store the input, but of course it can be: you keep the sorted list in compressed form.

I haven't found a really good writeup; most of the hits on Google (SO, earlier HN stories) fail to get the answer.


Guido van Rossum's answer, of course with Python:

http://neopythonic.blogspot.com/2008/10/sorting-million-32-b...

It sounds like "break up into smaller sorted sets, then do a merge sort". Does the Google interview question allow for temp file storage?


There are two versions of the question. One allows external storage (Guido's answer and the top hit on SO), the other does not.


I think there are two difficulties here.

1. How to even represent 1,000,000 sorted integers in only 2 megabytes, which means you can use (slightly more than) 2 bytes per integer. The central observation is that the sum of the deltas is no more than 4294967296. 4294967296 / 1000000 = 4295, which needs about 13 bits, so it seems like it should be possible, but it's not easy.

For example, if you use a standard variable-length encoding where the highest bit indicates continuation, you would have 7 data bits per byte, but you could have 743,665 deltas of 128 and 256,335 deltas of 16384, which would require 743,665*2 + 256,335*3 = 2,256,335 bytes, which is slightly over 2 megabytes.

If you instead use the first two bits of the initial byte to encode the length (so that K bytes encode 8*K - 2 bits), you could have 740,749 deltas of 64 and 259,251 deltas of 16384, for a total of 259,251*3 + 740,749*2 = 2,259,251 bytes, which is even slightly worse.

With 1 continuation bit per nibble the math similarly doesn't work out. So this is starting to look, if not impossible, at least very hard.

2. Imagine that you could represent 1,000,000 sorted integers in 2 megabytes somehow. Then the next problem is to actually create such a sorted sequence from the input.

  - Insertion sort works, but O(N^2) time complexity is not great with a million elements.
  - Quick sort doesn't work since you'd start with an unsorted array which you cannot represent in memory.
  - Heap sort doesn't work because it requires random access to the array, which doesn't work with a variable-length encoding.
  - Merge sort works in the beginning, but you need temporary space equal to the size of your input, so towards the end you're in trouble.
I think you could make merge sort work if the requirement is "output the elements in sorted order" rather than actually constructing a sorted representation of the array. In that case, you could create sorted sequences of the first 500,000, 250,000, 125,000 etc. elements, and do a 20-way merge in the end, which is O(N log N) overall.

This is still somewhat tricky because the average delta of e.g. 500,000 elements can be twice as large as for the overall array, so you might need slightly more space to encode it, so we would need a really efficient compression scheme to make this work.

All in all I'm gravitating towards: this problem isn't generally solvable under the given constraints. With 3 megabytes I think the scheme outlined above works.


Encoding-wise, Elias-Fano requires exactly 2 + ceil(log2(2^32 / n)) bits per element, so it would provably fit in the memory budget.
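
A quick back-of-the-envelope check of that bound (ef_bits_per_element is just an illustrative helper, not a full Elias-Fano implementation):

    import math

    def ef_bits_per_element(universe, n):
        # Elias-Fano stores ceil(log2(universe / n)) low bits per value verbatim,
        # plus roughly 2 bits per value for the unary-coded high parts
        return 2 + math.ceil(math.log2(universe / n))

    bits = ef_bits_per_element(2**32, 10**6)     # 15 bits per integer
    print(bits * 10**6 / 8 / 2**20)              # ~1.79 MiB, under the 2 MB budget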

Algorithm-wise, yeah I can only think of quadratic ones. But if in the question the bottleneck is memory, maybe that's fine? An optimization would be to do selection sort, but in each round select the next group of integers that share the top 32 - ceil(log(2^32 / n)) bits, as that is a cluster in the EF representation, and then sort the lower bits, which can be done in-place in O(k log k) where k is the group size. So if your groups are large enough you get some speedup.

EDIT:

Ah I think that the trick is to always use half of the memory left to read and sort.

- Read 256k integers (1MB), sort in-place, construct EF representation (~500kB)

- Read 128k, sort, and produce another EF (~280kB)

- Keep halving

You do a logarithmic number of rounds, so it's O(n log n). At the end you're left with log(n) EF sets, which fit in less than 2MB (in fact there's some space left, and you could save a few rounds) and you can merge them with a log(n)-way heap into a stream.


It's doable. First, you use a Golomb coding for the deltas. Tuning the parameters just right, you can get it to fit inside your budget with a little room to spare.

Then, to get reasonably good time complexity, you use that extra space as a scratch buffer - read in a chunk of input, sort in place (heapsort is good for worst-case time complexity), and merge with the compressed stream. It's not simple to implement, but it does work.
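
For reference, a toy Rice coder (Golomb with a power-of-two parameter) over the deltas; a real implementation would pack bits instead of building a Python list, and would tune k to roughly log2 of the average gap.

    def rice_encode(deltas, k):
        # each delta is split into a unary quotient and a k-bit binary remainder
        bits = []
        for d in deltas:
            q, r = d >> k, d & ((1 << k) - 1)
            bits.extend([1] * q + [0])                              # quotient in unary
            bits.extend((r >> i) & 1 for i in reversed(range(k)))   # remainder, k bits
        return bits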


Can you explain how you merge a compressed sorted stream with the sorted chunk, to obtain a bigger compressed stream (without extra memory, or quadratic complexity) ?

I can see a solution with memory usage: sizeof(compressed sorted stream) + sizeof(compressed sorted chunk) + sizeof(compressed merge output)

This would work but it means you need to fit your compressed sorted stream in less than a single MB instead of 2MB (and I'm ignoring memory fragmentation problems)


There are a bunch of ways to do this. One of the more straightforward is to use a temporary buffer big enough to hold both the new values and the existing stream, and start by shifting the existing stream to be right aligned (using memmove or the equivalent). Then read bits from the old stream while writing bits into the new stream. Using Elias-Fano coding (which I believe is an instance of Golomb coding), the write position will never overtake the read position, so the merge can be done in-place.


Thanks. I think I understood it now : https://news.ycombinator.com/item?id=27553076

In my solution, though, when the extra temporary buffer is very small, in the worst case we need to do sizeof(compressedStream)/sizeof(temporaryBuffer) memmoves, which can potentially make the algorithm quadratic.


One method is to move the compressed output from the top to the bottom of the memory with each cycle:

A = Existing compressed sorted chunk; B = Compressed merge output; C = New sorted chunk

Start with:

AAAAAAAAAAAAAAAAAAAAAAAAAAA-----CCCCCC

Merge into B, start at top of A and working your way down:

AAAAAAAAAAAAAAAAAAAAAA-----BBBBBCCCC--

Finish this iteration like:

--------BBBBBBBBBBBBBBBBBBBBBBBB------

Then refill C and sort it:

--------BBBBBBBBBBBBBBBBBBBBBBBBCCCCCC

Merge B back into A, but forwards this time:

AAAAAAAAAA------BBBBBBBBBBBBBBBB---CCC

AAAAAAAAAAAAAAAAAAAAAAAAAAA-----------


Thanks, I got it. The main trick is the standard "interview" trick of writing from right to left.

I'm not sure your solution works without memory moving, though, because in your second step, if the values of A are larger than the values of C, the growing B will hit the not-yet-consumed part of A.

To convince myself, I wrote out the following sequence.

00000224446666666666888888_11133333355555

BytesOfTheCompressedStream_NewSortedChunk________________

BytesOfTheCompressedStream_NewSortedChunk________weNsetyB

OfTheCompressedStream_SortedChunk________________weNsetyB

OfTheCompressedStream_SortedChunk________detroSfOweNsetyB

TheCompressedStream_Chunk________________detroSfOweNsetyB

TheCompressedStream_Chunk_______detroSehTdetroSfOweNsetyB

CompressedStream_Chunk__________detroSehTdetroSfOweNsetyB

Here notice that we hit the existing streams so we need to memmove

CompressedStream_Chunk_esserpmoCdetroSehTdetroSfOweNsetyB

dStream_Chunk__________esserpmoCdetroSehTdetroSfOweNsetyB

dStream_Chunk____knuhCdesserpmoCdetroSehTdetroSfOweNsetyB

Stream_____maertSknuhCdesserpmoCdetroSehTdetroSfOweNsetyB

___________maertSknuhCdesserpmoCdetroSehTdetroSfOweNsetyB

Then reverse it in place and memmove


The entropy of 1e6 sorted 32-bit integers is log2((2^32 + 1e6 - 1) choose 1e6) bits, or about 1.61 MiB, so it should be possible. Really difficult, obviously, but possible.
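
That figure is easy to sanity-check with log-gamma (avoiding the astronomically large exact binomial):

    from math import lgamma, log

    def log2_choose(n, k):
        # log2 of the binomial coefficient via log-gamma, since the exact value
        # would itself be a multi-megabit integer
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

    bits = log2_choose(2**32 + 10**6 - 1, 10**6)
    print(bits / 8 / 2**20)   # ~1.61 MiB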


When that problem came out I solved it quickly (it was 1 million 8-digit numbers sorted in 1MB of RAM, which I think is the original problem that gained notoriety) with basically arithmetic compression of the delta stream, since the number of possible states fits decently inside the required RAM.

I also saw I could sort it using zero bytes of storage, using a lot of code :)

Think nested switches that eventually print a lot of numbers.


Er... is it just me, or is this a terrible interview question? Assuming no external storage, I can imagine spending a few hours working out this problem for fun, but to do it in 45 minutes on a whiteboard seems exceptionally difficult if you're not familiar with it already. (Or is this one of those "let's talk it through, you don't need to arrive at a perfect answer" deals?)

I could be wrong. Maybe I should just sit down and try it, but timed problems stress me out in general, even if nothing's at stake.


There is the Hutter Prize for English text compression (http://prize.hutter1.net/), I wish there was an equivalent for the https://oeis.org database: https://github.com/void4/notes/issues/72


The Burrows-Wheeler transform (https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transf...) might be handy for this problem case; encode the deltas with something like UTF-8 first.


Elias-Fano would probably be a better option here


EF doesn't compress that well if your data exhibits regularities: it uses a number of bits that depends only on the sequence's upper bound and length.

There are ways to make it compress better though, like splitting the sequence into chunks, optimizing for the overall compressed size.



I've been having loads of fun and success with roaring bitmaps recently, to create this type of index for large data.



