Hacker News new | past | comments | ask | show | jobs | submit login

This book has a nice treatment of that kind of compression:


For instance you might be keep track of facts like

   the word "the" is contained in document 1
   the word "john" is contained in document 1
   the word "the" is contained in document 2
   the word "john" is contained in document 12
and you code the gaps; the word "the" appears in every document and the gap is always 1, but the gap for "john" is 11. With a variable-sized encoding you use fewer bits for smaller gaps -- with that kind of encoding you don't have to make "the" be a stopword because you can afford to encode all the postings.

That book is great, and relies on a combination of Golomb-Rice, delta coding, and arithmetic coding. I wonder how this compares with the OP's compression.

... it doesn't just "rely on" that stuff, but it explains the fundamentals of them really well!

That book is really dated (1994), but unfortunately there aren't much better options.

It is unfortunate because the understanding of the field has improved dramatically over the past decades, and explanations from a more modern perspective would benefit even novices.

The Standard MIDI file spec had something similar back in 80s. They code gaps between successive events with increasingly longer byte sequences. Never mind that they waste a whole bit to do that.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact