Hacker News new | past | comments | ask | show | jobs | submit login

> BWWWWBWWWW can be saved as 1B3W1B4W

I've never understood, in explanations of run length encoding, why they always default to putting counts before literals. Isn't it sensible to know what a thing is before knowing how much of it there is?

Count-first also tends to waste encoding space, since it's not clear what a code ending on a count should mean. Literal-first makes it more intuitive to make the code bijective (all codes corresponding to exactly one decoding), and why wouldn't you want to?




I'm curious what you mean. Why does it matter which order the count and literal are encoded? You always need both anyway to do anything useful.


Some RLE encodings use negative values for a block of non-repeating data. So in its simple form AAABCD would be encoded as 3A -3BCD . Without negative values this would be encoded as 3A 1B 1C 1D. The .TGA RLE format worked roughly like this. All of this means you need the length before you can look at the next bytes.

See e.g. page 24/27 of https://www.dca.fee.unicamp.br/~martino/disciplinas/ea978/tg...


The reason in explanations is clear, it's how people naturally think about it "You have 1 B, then 3 Ws...". People don't say aloud "You have B times 1, then W times 3".


Well, it's how people talk about it, in English. Unless they're talking about dollars or yen, where the unit comes first for some reason!


In english writing, sure.

I don't think I've ever heard someone say "That'll be dollar 50" though? I they did, I'd assume they were talking about something that costs $1.50


Which he doesn't: both times it's four Ws. Seemed odd to me, but then bugs get into everything :)


I do RLE for my interview question, so I've considered it a lot.

Either way works ok. If you're doing RLE at the bit level, you only need counts and not values, but that's obscure.

If you do count first, you can use a zero count to indicate a run of single repeats. There are many other ways to do that, though.

I've never seen anyone come up with an encoding that was ambiguous to decode. Of course, there's usually many ways to encode something.


Whenever there are many ways to encode something, there's wasted room in the encoding space :) Not usually a big deal, but when the goal is compression, I think it's nice if it can be avoided.


Regardless of how you do your RLE, you're almost certainly going to have many ways to encode something: you can almost certainly encode a run of four as two runs of two (you wouldn't want to, of course, but you could). It's not an encoder you tend to use if you want to use the absolute smallest amount of space, it's good if your data has enough repeats to get compression and your encoding/decoding systems are very resource constrained.

It makes a nice interview problem because compression is a real problem, but RLE needs almost zero previous knowledge, has many reasonable answers, doesn't have a gotcha solution, and most people haven't done it before. Lots of typically better compression algorithms out there, but I don't think any are suitable for general interview.


Yeah, in a bijective RLE, one needs to take into account that, say, A2A2 can't mean the same as A4.

However, since you also want to be able to represent runs longer than 255 (if you're doing it on byte level), you can solve both problems. If you encounter repeated runs of the same literal, just combine them into a bigger number. So A2A2 should be read as "A times (0b00000010 00000010)" whereas A4 should be read as "A times (0b00000100)."

There are some subtleties still to make it a bijective code, related to end handling and how to represent variable length numbers uniquely, but it's perfectly possible, and once you get it right it isn't really larger than "regular" RLE.

It's an interesting challenge to write bijective encodings. I feel it's a bit like writing quines, you have to keep in mind a thing you usually don't. For quines, "what this does to the text you need to output", for bijective encodings, "what this does for the inverse function". Not saying you should give it as an interview question, though!


> Literal-first makes it more intuitive to make the code bijective (all codes corresponding to exactly one decoding), and why wouldn't you want to?

I don't understand how literal-first make all codes correspond to exactly one decoding. Do you have an example?


Also, why 1B and not just B? As such the compressed size would be at most the same as original, but never larger than?


Well that one is more understandable. You can't tell literal from counts except by convention. If we assume ASCII encoding, you don't know if it's a 'B' or a count of 66. It makes sense when you're encoding something with lots of runs, that you use every second word for literals and every second word for counts.

I just don't understand why they start with counts, that's all.


Ah, so obvious indeed.


So is '2' a one-byte sequence or the start of a three-byte sequence?


91 ? 9 : 111111111




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: