
Show HN: How to store a set of four 5-bit values in one 16-bit value - andromeduck
https://github.com/isometric/BucketCompressionTrick
======
amptorn
Ah, without preserving order. That's a set of four 5-bit values, not a
sequence of four 5-bit values.

~~~
ajkjk
To answer the obvious next question:

Four 5-bit values in order have 20 bits of entropy, so cannot be stored in 16
bits.

Four 5-bit values without order have 20 - log2(4!) =~ 20 - 4.59 = 15.41 bits
of entropy (corresponding to 2^20/4! possible configurations), and thus can
fit in 16 bits of data if you're clever about it.
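As a quick check of the arithmetic (a sketch, not code from the repo; note the 15.41-bit figure is a slight underestimate because duplicates mean fewer than 4! orderings collapse together):

```python
from math import comb, factorial, log2

print(log2(factorial(4)))          # ~4.585 bits of ordering information
print(20 - log2(factorial(4)))     # ~15.415 bits left, under 16

# Exact count of unordered selections with duplicates allowed (stars and bars):
multisets = comb(2**5 + 4 - 1, 4)
print(multisets, log2(multisets))  # 52360, ~15.676 bits -- still fits in 16
```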

~~~
Buttons840
Is there a specific topic of study that taught you those formulas?

~~~
SilasX
As cameldrv said, information theory. The late David MacKay has a good free
book:

[http://www.inference.org.uk/itprnn/book.html](http://www.inference.org.uk/itprnn/book.html)

The intro chapters cover the topic of measuring informational entropy.

Aside: One interesting application is to the problem of “use a balance scale
only three times to determine which of twelve balls has the wrong weight (too
heavy or light)”. You choose the weighings so as to maximize the entropy of
the outcome, i.e. so there are many possible outcomes of equal probability.
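The entropy accounting for that puzzle checks out in a couple of lines (the numbers only; designing the actual weighings is the hard part):

```python
from math import log2

# 12 balls, each possibly heavy or light: 24 outcomes to distinguish.
outcomes = 12 * 2
# Each weighing has 3 results (left heavy, right heavy, balanced),
# so three weighings distinguish at most 3^3 = 27 cases.
print(log2(outcomes))   # ~4.58 bits needed
print(3 * log2(3))      # ~4.75 bits available, so three weighings can suffice
```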

------
yshklarov
Taking this to the extreme, you could "store" 65535 1-bit values in one 16-bit
value. You just have to keep track of how many of them are equal to 1.
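In code, that degenerate "storage" is just a counter (a sketch; which positions were 1 is exactly the information being thrown away):

```python
def store(bits):
    # "Store" up to 65535 one-bit values in 16 bits by keeping only
    # the number of ones.
    assert len(bits) <= 65535
    return sum(bits)          # fits in 16 bits

def load(count, n):
    # One representative multiset with the right number of ones.
    return [1] * count + [0] * (n - count)

code = store([1, 0, 1, 1])
assert code == 3 and sorted(load(code, 4)) == [0, 1, 1, 1]
```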

~~~
uniformlyrandom
Thank you - this is a much better explanation of how this 'storage' works.

------
utopcell
For completeness: The number of possible multisets of 4 numbers (duplicates
allowed) from the range [0..15] is (16+4-1) choose 4, or 3876.

Let's assume that our set is (a, b, c, d), ordered so that a <= b <= c <= d.
For example, a valid set is (2, 4, 4, 10).

Consider all possible numbers that may be in our set, laid out in order.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

For each element in the set, we mark its position in this order by placing a
marker to the left of the corresponding number.

For our example, we first mark element 2:

0 1 m 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Then 4:

0 1 m 2 3 m 4 5 6 7 8 9 10 11 12 13 14 15

Then, 4 again:

0 1 m 2 3 m m 4 5 6 7 8 9 10 11 12 13 14 15

And finally, 10:

0 1 m 2 3 m m 4 5 6 7 8 9 m 10 11 12 13 14 15

Now, out of these 16+4 = 20 places, only 19 of them could be a valid placement
for a marker, as a marker is always placed to the left of a number (or, said
differently, "15" will always be at the end).

Therefore, there are 19 choose 4 ways to pick the locations of the markers.
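This marker construction translates directly into code; a sketch (names are mine, not from the repo): the i-th smallest value v lands in marker slot v + i, and the resulting 4-element subset of the 19 slots is ranked with the standard combinadic formula.

```python
from math import comb
from itertools import combinations_with_replacement

def multiset_rank(vals):
    # Sorted 4-multiset from 0..15 -> marker positions (value + index),
    # which form a 4-element subset of the 19 usable slots.
    pos = [v + i for i, v in enumerate(sorted(vals))]
    # Combinadic rank of that subset: sum of C(p_i, i+1).
    return sum(comb(p, i + 1) for i, p in enumerate(pos))

assert comb(16 + 4 - 1, 4) == 3876
ranks = {multiset_rank(m) for m in combinations_with_replacement(range(16), 4)}
assert ranks == set(range(3876))   # a bijection onto 0..3875
```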

~~~
shaklee3
Thanks. That was really helpful.

------
bringtheaction
That is one big-ass makefile though. I realize it has some extra niceties, but
I have to ask: did you ever try


    make main

without any makefile at all?

If you haven't done so, delete the makefile now (you have it in version
control anyway) and give it a try.

~~~
torstenvl
GNU Make has a number of implicit rules. A blank target rule (or no target
rule) in the Makefile (or no Makefile at all!) will cause `make` to try to
create the specified target `foo` from `foo.c` or `foo.cpp` (or others) if
they exist. More complete explanation here:

[https://www.gnu.org/software/make/manual/html_node/Catalogue-of-Rules.html#Catalogue-of-Rules](https://www.gnu.org/software/make/manual/html_node/Catalogue-of-Rules.html#Catalogue-of-Rules)

Note: I was initially annoyed with parent's "do this and see what happens"
post and its lack of substantive communication, and responded poorly. This
comment is substantially edited.

~~~
LukeShu
Your response of linking the GNU Make manual comes off as even more self-
impressed and coy than the parent.

The parent suggests removing the makefile entirely and running `make main`,
and observing the result. That would be a learning experience (and much better
than your "RTFM")! But, if you don't want to humor that, the thing to say is:
The result is that it works, and correctly builds the program, without a
makefile at all.

The explanation is that GNU Make has a catalog of built-in rules (which you
linked to); but it's quite a leap from knowing that fact, and even making use
of them, to realizing that it means in some cases you don't need a makefile at
all.

~~~
bringtheaction
> The parent suggests removing the makefile entirely and running `make main`,
> and observing the result. That would be a learning experience! But, if you
> don't want to humor that, the thing to say is: The result is that it works,
> and correctly builds the program, without a makefile at all.

Exactly my intention. Thank you, I was worried after the first response from
the other person that I'd worded myself too poorly and that more people would
think I was being "self-impressed and coy" when that was not the intention at
all.

------
nightcracker
It is possible to do this in O(1) memory, for arbitrarily sized collections,
efficiently.

The first trick is to make a function that can directly calculate the nth set
with k elements from some universe U (in this case U = {0..2^5-1}), without
order and without duplicates. This is done using the
[https://en.wikipedia.org/wiki/Combinatorial_number_system](https://en.wikipedia.org/wiki/Combinatorial_number_system).
This is very efficient.

Then, to encode duplicates you use the stars and bars trick.
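A sketch of the decoding direction (my own code and naming, not from the repo): unrank a k-element subset directly via the combinatorial number system, largest element first, then undo the stars-and-bars shift to recover the multiset.

```python
from math import comb

def unrank_subset(n, k, rank):
    # rank-th k-element subset of {0..n-1} in combinadic order, O(k) memory
    out = []
    for i in range(k, 0, -1):
        c = i - 1
        # largest c with comb(c, i) <= rank
        while c + 1 < n and comb(c + 1, i) <= rank:
            c += 1
        rank -= comb(c, i)
        out.append(c)
    return out[::-1]

# Stars and bars maps a 4-multiset of 5-bit values to a 4-subset of {0..34}
# (value + index), so unranking there decodes the multiset:
subset = unrank_subset(35, 4, 0)
multiset = [p - i for i, p in enumerate(subset)]
assert multiset == [0, 0, 0, 0]
```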

~~~
tveita
Arithmetic coding will work too, without any special trick to code duplicates.

~~~
rafael859
Can you elaborate? I am not very familiar with arithmetic coding, but I
thought it required arbitrary precision floating point numbers, so not O(1)
memory.

~~~
tveita
You don't need arbitrary precision, just enough precision. Specifically, in
this case you only need 16 bits. You can start with an upper and lower bound,
and narrow it down for each number.

Here's an example with numbers dumped from a quick prototype encoder. The
exact range numbers aren't that important, but note how the range gets smaller
for every symbol we encode:

Say we want to encode the ordered numbers 10, 15, 31, 31.

We start with the full range [0, 65536)

There are C(32 - 10 + 2, 3) = 2024 combinations that start with 10, out of
52360 possible, so that narrows it down to around 4% of the full range, we'll
use [49702, 52236)

For the next number we only allow numbers 10 and above; 15 has 153 of the
2024 possibilities, range [51022, 51214)

For the next number, 31 is only 1 of the 153 possibilities, so it gets a tiny
range, [51212, 51214).

And now 31 is the next number in 1 out of 1 cases, so the range is again
[51212, 51214)

We can code the sequence as either 51212 or 51213; since we used a range that
is slightly larger than needed, some combinations have multiple codes. We
could have started with the smaller range [0, 52360) instead to get a
bijection.
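With the exact range [0, 52360), the coder collapses into a plain ranking function; a sketch (my own code, not the prototype above) that, at each position, adds up the combinations beginning with a smaller symbol:

```python
from math import comb
from itertools import combinations_with_replacement

def rank(vals):
    # vals: nondecreasing 4-tuple of 5-bit values; result is in [0, 52360)
    r, prev = 0, 0
    for idx, v in enumerate(vals):
        m = 3 - idx                      # symbols still to encode after this one
        for u in range(prev, v):
            # multisets of size m drawn from {u..31}: C(32-u+m-1, m)
            r += comb(32 - u + m - 1, m)
        prev = v
    return r

# Ranks enumerate all 52360 sorted 4-multisets in lexicographic order:
assert rank((0, 0, 0, 0)) == 0
assert rank((31, 31, 31, 31)) == 52359
```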

------
mgelbart
I wrote a blog post several years ago that's basically about this same effect,
although from a different angle:
[https://hips.seas.harvard.edu/blog/2013/06/05/compressing-genomes/](https://hips.seas.harvard.edu/blog/2013/06/05/compressing-genomes/)

------
barrkel
The fact that it's using lookup tables to store the mapping between the set of
numbers and the encoded 16-bit number makes it less interesting; it basically
just enumerates the set of all possible combinations of numbers, and uses the
ordinal as the encoding. A directly calculated scheme without a lookup would
be niftier, though given the way it steals bits from the redundancy of
duplicates, I suspect in practice it wouldn't be pretty.

~~~
hinkley
It does seem like it should be calculable without the lookup table. Something
with modular math perhaps.

------
huftis
This reminds me of the following old puzzle.

Ten people are standing in a line. Each person is wearing a black or a white
hat and can only see the colours of the hats of the people _in front of_ them.
The goal is to guess the colour of one’s own hat. First, the last person in
the line shouts out their guess (‘black’ or ‘white’), then the person in front
of them shouts out their guess, etc. No other form of communication is allowed
(no clapping, no touching, no texting – and no tricks involving, e.g.,
encoding information in the delay between it being ‘your turn’ and actually
shouting out your guess). The group ‘wins’ if at most _one_ person guesses
wrong.

The group is allowed to discuss a strategy before taking part in this puzzle
(i.e., before being given their hats). Which strategy should they choose to
maximise the chance of winning?

They’re only allowed _one_ wrong guess, so this is basically trying to encode
9 bits of information in 1 bit. And here the ordering of the bits actually
_matters_. Still, it’s possible to choose a strategy which gives the group
100% chance of winning! Good luck trying to solve this one. :)

~~~
steve_musk
Seems impossible to me.

~~~
TekMol
The way huftis stated it, it is indeed impossible.

In the popular, solvable version of the puzzle, the group also gets the
information if each guess was wrong or right. It's not at all about encoding
"9 bits of information in 1 bit".

~~~
huftis
No, you don’t need information on whether each guess is right. (And, of
course, if the group is playing it correctly, _every_ guess except possibly
the first one is correct.)

~~~
TekMol
Damn, you are right! I wonder why the first two instances of this puzzle I
googled included that info.

Still, it's not about encoding "9 bits of information in 1 bit".

~~~
huftis
You’re of course right about this not being encoding ‘9 bits of information in
1 bit’. That would be impossible; to encode 9 (arbitrary) bits of information,
you _need_ 9 bits.

On the other hand, you must encode information on the colour of 9 hats, and
only the last person in the line is _free_ to provide some information. He
can’t possibly know the colour of his own hat, so he can only guess – _or_ use
his turn to provide 1 bit of information to the rest of the group (while the
other people _have_ to shout out their correct hat colour). The difficulty
lies in figuring out why the puzzle as stated does _not_ need you to encode 9
bits using 1 bit …
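(Spoiler for the puzzle.) That single free bit is enough if it encodes a parity; a simulation sketch of the standard strategy:

```python
import random

def play(hats):
    # hats[0] is the front of the line; person i sees hats[:i]
    n = len(hats)
    guesses = [0] * n
    # The last person spends their guess announcing the parity of what they see.
    parity = sum(hats[:n - 1]) % 2
    guesses[n - 1] = parity
    # Everyone else deduces their own hat from that parity, the (correct)
    # guesses already heard, and the hats they still see in front of them.
    for i in range(n - 2, -1, -1):
        seen = sum(hats[:i]) % 2
        guesses[i] = (parity - seen) % 2
        parity = seen
    return guesses

hats = [random.randint(0, 1) for _ in range(10)]
guesses = play(hats)
assert guesses[:-1] == hats[:-1]   # at most the parity-announcer is wrong
```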

------
utopcell
The title is deceptive. It should state that the goal is to store a _set_ of
four 5-bit values.

------
raphlinus
This is quite related to the problem of sorting a million 32 bit integers
using only 2M of RAM (and no disk). It can be done.

~~~
Matheus28
A list of deltas should do it. In the worst case, the deltas would use
log2[2^32/1000000] * 1000000 bits, so about 1.5 MB. Plus some space because of
base 128 encoding (it increases size up to 37/32, rounded up per byte).

I got a worst case of exactly 2 MB (1.907 MiB) (all deltas being 4294, so the
list is 0, 4294, 8588...), but maybe it's possible to get better than that.

It would be uber slow though, probably n^2.
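A quick check of that worst-case figure (a sketch; "varint" here is the usual base-128 encoding, 7 payload bits per byte):

```python
def varint_bytes(d):
    # base-128 encoding: 7 payload bits per byte, minimum one byte
    return max((d.bit_length() + 6) // 7, 1)

n = 1_000_000
# The list 0, 4294, 8588, ...: n sorted values, n-1 deltas of 4294 each,
# and the largest value (4294 * 999999) still fits in 32 bits.
deltas = [4294] * (n - 1)
total = varint_bytes(0) + sum(varint_bytes(d) for d in deltas)
print(total, total / 2**20)   # 1999999 bytes, ~1.907 MiB
```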

~~~
lalaland1125
The idea is good, but base128 won't work. The worst case scenario is around
250k offsets of 2^14 (requiring 3 bytes each) and 750k offsets of 2^7
(requiring 2 bytes each). That's 2.25 MB

~~~
Matheus28
That's true. A 5-bit header before each number, specifying how many bits the
number uses, would be enough.

------
uiri
Conversely: the order for 5 4-bit values itself has 4-bits of information.

~~~
olympus
The order for _five_ four-bit values has about 6.91 bits of information:
log_2(5!)=6.91.

More relevant to the problem, the order for _four_ five-bit values has about
4.59 bits of information (log_2(4!)=4.59). Not all of the sets actually have
four distinct numbers (many have duplicates; the list [1 1 2 3], say, is the
set {1 2 3}), and averaging over the 52360 possible multisets gets us down to
about 4.32 bits of ordering information (log_2(2^20/52360)=4.32), which is
still more than the 4 bits we need to shed, so the example works.

At least that's how I understand this. I could be wrong.

------
AlphaGeekZulu
Ahh, just yesterday in HN: Operation Gunman

[http://www.cryptomuseum.com/covert/bugs/selectric/index.htm](http://www.cryptomuseum.com/covert/bugs/selectric/index.htm)

Quotes:

"The data from the 6 magnetometers (i.e. 6 bits) was somehow digitally
compressed into 4-bit words and then stored in a magnetic-core buffer that
could hold up to 8 such 4-bit data words." "It is unknown why and how the data
was compressed from 6 to 4 bits, and the NSA report is very vague on this
point. It is possible that the Soviets used 4-bit logic and had to spread the
6-bit data over more than one 4-bit data word, but it is more likely that they
used frequency analysis"

"According to the NSA report, the Russians compressed the 6-bit data into a
4-bit frequency select word. Although the report doesn't explain what they
mean by this, we can make a few educated guesses. The reason for compressing
it into 4-bits, was probably the fact that the Russians only had access to
4-bit digital technology at the time. The problem with 4 bits however, is that
each data word has just 16 possible combinations"

...

Maybe a similar "hack"?

------
YesThatTom2
Which reminds me of this old programming joke:

While developing a CAD system we saved a lot of RAM compared to our
competitor's product. They were storing each (X1,Y1)(X2,Y2) as four integers.
We figured out how to do the same thing in just 16 bits.

Yes, you guessed it (wait for it....) we could fit an edge in word-wise!

(Disclaimer: I didn't say it was a GOOD joke!)

------
LukeShu
I had to add `#include <algorithm>` to get it to compile.

------
rasen58
Why is this first line true: "This works because there are 3876 possible
unique values for a set of 4 4 bit values"

Each 4 bit number has 16 possible values. And you can order them in the set in
4! = 24 ways. So I thought you would get 24 * 16 = 384. Can someone explain?

~~~
karlding
They're taking the multiset [0].

So it's \binom{2^4 + 4 - 1}{4} = 3876

In other words, there are 3876 multisets of cardinality 4 with elements taken
from the set containing all 4 bit values.

[0]
[https://en.wikipedia.org/wiki/Multiset](https://en.wikipedia.org/wiki/Multiset)

~~~
rasen58
Can you explain the top number (2^4 + 4 - 1)? I understand the bottom number 4
is probably because you're choosing 4 values out of the set of numbers from
the top, but not sure how you got 2^4 + 4 - 1.

------
RandomCSGeek
I haven't learned about compression of data, but from what I know, one has to
make some assumptions to store n bit data into m bit space, where m < n. Here,
the assumption seems to be that numbers are necessarily 5 bit, and that there
are only 4 numbers.

What I fail to understand is how this helps to compress the data in this case.
The code isn't really that nice, and the comments seem to assume prior
knowledge in this field, which I don't have.

So can someone give a ELI-Noob?

~~~
SilasX
It’s not perfect compression; it’s lossy. Per the top comment, this method
treats it as a _set_ i.e. ignore the order. So you’re throwing out some
information.

[https://news.ycombinator.com/item?id=16249092](https://news.ycombinator.com/item?id=16249092)

------
psyc
This is a great little lesson in information entropy. I work on similar
problems quite often in my work, and understanding this was a pleasant
epiphany.

------
ttoinou
Could one practical use of the general case be to store factorial based
numbers efficiently ?
[https://en.wikipedia.org/wiki/Factorial_number_system](https://en.wikipedia.org/wiki/Factorial_number_system)

------
sabujp
how are you getting 3876 unique values for a set of 4 4 bit values (16, 16,
16, 16) ?

~~~
jandrese
He doesn't care about the order, so 1, 2, 3, 4 is stored the same as 4, 3, 2,
1. That's where the savings come from.

~~~
lostatseajoshua
I really don’t understand why not caring about the order would save space.
Could you explain in simple terms please?

~~~
jandrese
If you care about the order then

1, 2, 3, 4 has to be stored differently than 4, 3, 2, 1, as well as 1, 3, 2,
4, and so on.

If you don't care about the order, all of those can be stored in exactly the
same way. Think of it like lossy compression: if you don't care about some of
the detail, you can ignore it (the order of the numbers in this case) and save
some space.

------
XtremAlRaven
Any reason why store last 4 bits as is and not just making a LUT out of 4
sorted 5-bit values?

------
lostatseajoshua
Can someone explain this in very simple terms without math? An ELI5 example.
Thank you.

~~~
hinkley
If you’re playing Yahtzee or cards it doesn’t matter what order you get the
dice/cards. It just matters if you got them at all.

If you have two values, you can arrange them two ways. Three values you can
arrange six ways. Four values you can arrange 24 ways, and 0-23 takes a little
over four bits to represent.

If you have a set, you don’t care about order. So if you save the data without
an order to it, it takes a little less space to hold it. About four bits
worth.

With five values you can save almost seven bits (120) and with six it’s over 9
bits (720).
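The pattern behind those numbers, as a quick check (k values have k! orderings, worth log2(k!) bits):

```python
from math import factorial, log2

# orderings of k values, and the bits that ordering information is worth
for k in range(2, 7):
    print(k, factorial(k), round(log2(factorial(k)), 2))
# 2 values: 2 orderings (1 bit); 4: 24 (~4.58 bits); 6: 720 (~9.49 bits)
```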

------
anonu
What's the practical use for this?

~~~
andromeduck
Saving space in data structures that use buckets such as hash tables. I
actually discovered this in a paper on hash tables that kind of just mentioned
that it was a thing but I had a harder time finding out exactly how it was
done.

~~~
pm215
Perhaps you could mention the motivation for the trick in the readme? My
initial reaction was "either this is an april-fools style joke, or it's a cute
but pointless trick", and it wasn't til I got to the bottom of the HN comments
that I found an explanation of why it would be worth knowing...

~~~
andromeduck
done :)

------
espeed
De Bruijn sequences

------
grawprog
It's pretty cool and all, but what are the uses for this? What set of 4
unordered 5-bit numbers would I need to store that I couldn't just store as 3
bytes? That way I waste only 4 bits while preserving the order of the values,
if need be. I can think of very few occasions where 4 bits would matter more
than order, but nothing realistic. Again, it's a cool trick and I'm not trying
to be a dick about it. I like cool little tricks like this even if there's no
purpose. I'm honestly curious about some realistic use cases for this, or some
variation of this with larger numbers.

~~~
joshuak
It's 4 5-bit values, not 5 4-bit values. 5 bits is enough to index each bit in
a 32-bit bitfield. This means any application that sets 4 or fewer flags out
of a possible 32 need only use a 16-bit field.

As I said in another comment I just tested this with a 32-bit hash array
mapped trie, and it would result in a significant reduction in the storage
overhead of the trie (which is already more efficient than typical hash tables
for map like data structures).
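A hedged sketch of that use case (my own code and names; it uses a lookup table like the repo does): pack the set-bit positions of a sparse 32-bit bitfield, at most 4 of them, into 16 bits.

```python
from itertools import combinations_with_replacement

# All sorted 4-multisets of 5-bit values: 52360 entries, so indices fit in 16 bits.
TABLE = list(combinations_with_replacement(range(32), 4))
INDEX = {m: i for i, m in enumerate(TABLE)}

def pack(flags):
    # flags: a nonempty set of at most 4 bit positions in 0..31
    vals = sorted(flags)
    vals += [vals[-1]] * (4 - len(vals))   # pad by repeating the largest
    return INDEX[tuple(vals)]

def unpack(code):
    return set(TABLE[code])               # padding collapses back out

for flags in ({3}, {3, 17}, {0, 9, 17, 31}):
    code = pack(flags)
    assert code < 2**16 and unpack(code) == flags
```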

