
As the article states, you need to apply some heuristics to choose between cases (2) and (3). I'm not a big fan of that, because it means the encoding is not deterministic (as in, it leaves room for interpretation/optimization by encoders).

Just for fun I implemented this algorithm: https://go.dev/play/p/72GyYmVnwyc

The algorithm described in the article uses prefixes 0, 10, and 11. My algorithm uses different prefixes: 0, 10, 110, 1110, 11110, [...]. By default, the subsequence that is used is identical to the one that contained the differing bits during the previous iteration. The prefixes described previously can be used to widen the subsequence:

- 0: Keep the subsequence the same width.

- 10: Make the subsequence (1<<1)-1 = 1 bit wider, both on the left and right side.

- 110: Make the subsequence (1<<2)-1 = 3 bits wider, both on the left and right side.

- 1110: Make the subsequence (1<<3)-1 = 7 bits wider, both on the left and right side.

- ...

For the sequence of numbers presented under "Tweaking the algorithm", the algorithm in the article produces 969 bits of data. My algorithm produces just 823 bits.

Edit: Fixed a tiny bug in the original code.

Edit 2: Maybe it's smarter to widen the subsequence using triangular numbers. Doing it exponentially seems a bit too aggressive.




What you're working towards here is essentially Golomb coding. The optimal divisor will depend on how the data is distributed.

https://en.wikipedia.org/wiki/Golomb_coding


As I'm not allowed to edit the post above: using a unary number system instead of exponential/triangular widening makes the implementation even simpler, and cuts the output for the example down even further, to 798 bits.


> The prefixes described previously can be used to widen the subsequence

So you can only widen the subsequence, never shrink it. Shrinking would address the issue raised by TFA:

> If a series contains an outlier value that requires a very large window to encode, and all subsequent values have nonzero values only in a much smaller subwindow, the inefficient larger window will get locked in because we never hit the condition that would trigger a reset of the window size.


Shrinking happens automatically, because every iteration takes the actually meaningful width of the previous iteration.


> By default, the subsequence that is used is identical to that containing differing bits during the previous iteration

If I'm understanding correctly, this means that if the first choice of subsection size is oversized, this will carry through to future samples (unless the encoder chooses to re-size it explicitly). I wonder if, with your cheaper "loosen" encoding, it is worth automatically tightening the bit delta. For example, if I go to an initial window of "5 zeros, the 7 bits 1001101, remainder zeros" (which I'll call 5;7), and the next encoding is "same, 0101110", the leading and trailing zeros adjust the window to (in this example) 6;5, "tightening" around the non-zero bits. With your low-cost "loosen", doing this aggressively (instead of, say, only when it happens for multiple samples in a row) seems sane. It also means that the fact that your loosen encoding is symmetric (which makes sense in terms of bit efficiency) is somewhat mitigated, because the implicit tighten can break the symmetry.


> > By default, the subsequence that is used is identical to that containing differing bits during the previous iteration

> If I'm understanding correctly, this means that if the first choice of subsection size is oversized, this will carry through to future samples (unless the encoder chooses to re-size it explicitly).

And that's exactly what I do: I resize it. The subsequence that is picked during round n is the one that actually contained the differing bits in round n-1.





