Because then, when you jump into the middle of a stream and see a 11xxxxxx byte, you don't know whether it's the beginning of a valid multibyte character that you have to keep, or part of a multibyte character that you have to discard.
Same with the second encoding: if a 1xxxxxxx byte happens to take the value 11010101, how would you know whether it's a multibyte start or a continuation?
Basically, in the current solution, if you read the head of a multibyte character you know it's a valid head; if you read the head of a multibyte character in the proposed encoding, you can't know whether it's valid.
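To make the point about the current solution concrete, here is a minimal sketch of resynchronizing in standard UTF-8 after landing at an arbitrary byte offset. The helper names `is_continuation` and `next_boundary` are made up for illustration; the only fact relied on is that UTF-8 continuation bytes always look like 10xxxxxx, so they can never be mistaken for a lead byte.

```python
def is_continuation(byte: int) -> bool:
    # In UTF-8, continuation bytes always match 10xxxxxx.
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, offset: int) -> int:
    # Hypothetical helper: skip continuation bytes until we hit an ASCII
    # byte or a lead byte (or run off the end of the buffer).
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


text = "naïve café".encode("utf-8")

# Offset 3 lands on the second byte of "ï" (C3 AF), i.e. mid-character.
start = next_boundary(text, 3)
print(text[start:].decode("utf-8"))
```

An encoding in which lead and continuation bytes share bit patterns has no equivalent of `is_continuation`, which is the objection raised above.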
Yes, the second one is not self-synchronizing, so it's out. However, I don't see much utility in getting first-character detection from the data format itself. You're very unlikely to miss the beginning of a stream of characters in any modern system, and in the event that your medium has no error detection, it would be trivial to add a zero synchronization byte to the beginning of the field.
My point is that embedding transport-level metadata into the data format seems like a poor tradeoff because of the sheer inflation potential of the data encoding (potentially 10%), when a single guard byte per field would solve the problem of detecting first-character truncation.
> I don't see much utility in getting first-character detection ..
> .. in the event that your medium has no error detection ... add a zero synchronization byte
How would such an encoding deal with non-UTF-8-safe editors, copy-pasting, programs that truncate and then insert previously broken sequences, etc.?
An encoding obviously can't fix all errors, but it is quite useful if broken sequences are obviously broken and non-broken sequences remain valid when text is handled by non-aware/non-safe applications.
I think in UTF-8 two splices can combine into a random character, but in a character + splice combination the character remains recognizable in any order and combination, and a lone splice is also recognizable as an error.
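A rough sketch of those three cases in standard UTF-8, using Python's built-in decoder. The helper `is_valid_utf8` and the sample characters are my own choices for illustration, not anything from the proposal being discussed.

```python
def is_valid_utf8(data: bytes) -> bool:
    # Hypothetical helper: strict decoding rejects any malformed sequence.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


e_acute = "é".encode("utf-8")  # C3 A9
cent = "¢".encode("utf-8")     # C2 A2

head = e_acute[:1]  # lone lead byte, as left behind by a truncation
tail = cent[1:]     # lone continuation byte

# A lone splice next to intact characters is recognizable as an error...
print(is_valid_utf8(b"ab" + head + b"cd"))  # False
print(is_valid_utf8(b"ab" + tail + b"cd"))  # False

# ...and the intact characters around it still decode as themselves.
print((b"ab" + head + b"cd").decode("utf-8", errors="replace"))

# Two splices from different characters can join into a valid but
# unrelated character: C3 + A2 is "â".
print((head + tail).decode("utf-8"))
```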