because then you don't know when you jump in the middle of a stream and see a 11...

kstenerud · on Oct 10, 2019

Yes, the second one is not self-synchronizing, so it's out. However, I don't see much utility in getting first-character detection from the data format itself. You're very unlikely to miss the beginning of a stream of characters in any modern system, and in the event that your medium has no error detection, it would be trivial to add a zero synchronization byte to the beginning of the field.

My point is that embedding transport level metadata into the data format seems like a poor tradeoff because of the sheer inflation potential of the data encoding (potentially 10%), when a single guard byte per field would solve the problem of first character truncation detection.

labawi · on Oct 11, 2019

> I don't see much utility in getting first-character detection ..

> .. in the event that your medium has no error detection ... add a zero synchronization byte

How would such encoding deal with non-utf8-safe editors, copy-pasting, programs truncating, then inserting previously broken sequences, etc?

Encoding obviously can't fix all errors, but it is quite useful if broken sequences are obviously broken and non-broken sequences remain valid when handling text in non-aware/non-safe applications.

I think in UTF8 two splices can generate a random character, but in a characters + splice combination, the character remains recognizable in any order and combination and a lone splice is also recognizable as an error.