
UTF-8 bit by bit (2001) - networked
https://wiki.tcl-lang.org/page/UTF%2D8+bit+by+bit
======
puzzledobserver
Here's a potentially random question: Is UTF-8 essentially a fancy way of
representing arbitrary precision BigInts where the character-to-integer
mapping is specified by Unicode, and which coincides with ASCII encodings on a
certain range of numbers?

~~~
masklinn
UTF-8 is a "fancy" way of representing 32 bit integers in a self-synchronising
byte format.

I can see an extension to 7 bytes (by making the leading byte 11111110, which I
think would still be unambiguous) going up to 36 bits, but at 8 bytes you'd
start getting collisions in the byte patterns and would lose the self-
synchronising properties.
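
To illustrate what "self-synchronising" buys you, here is a minimal Python sketch (not from the article; `is_continuation` and `char_start` are names made up for this example). It assumes the input is well-formed UTF-8 and relies only on the fact that continuation bytes always look like 10xxxxxx, so from any offset you can back up to the start of the current character:

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx (hypothetical helper)."""
    return (byte & 0b11000000) == 0b10000000


def char_start(data: bytes, offset: int) -> int:
    """Back up from an arbitrary offset to the first byte of its character.

    Assumes `data` is well-formed UTF-8 and 0 <= offset < len(data).
    """
    while offset > 0 and is_continuation(data[offset]):
        offset -= 1
    return offset


data = "a€b".encode("utf-8")  # bytes: 61 E2 82 AC 62
print([char_start(data, i) for i in range(len(data))])  # [0, 1, 1, 1, 4]
```

No state from earlier in the stream is needed, which is the property that would be lost if lead and continuation patterns could collide.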

~~~
kstenerud
You could in theory extend the pattern up to 42 bits:

    
    
        0xxxxxxx
        110xxxxx 10xxxxxx
        1110xxxx 10xxxxxx 10xxxxxx
        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
        11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
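
As a rough check on the 42-bit figure, the payload of each row can be counted mechanically. This is only a sketch of the hypothetical extended pattern above (`payload_bits` is an illustrative name); standard UTF-8 as defined in RFC 3629 stops at 4 bytes and U+10FFFF:

```python
def payload_bits(length: int) -> int:
    """Count the x bits in a sequence of `length` bytes in the table above."""
    if not 1 <= length <= 8:
        raise ValueError("the table only covers 1 to 8 bytes")
    if length == 1:
        return 7  # 0xxxxxxx
    # The lead byte spends `length` ones plus a zero on its prefix,
    # except FF, which has no room left for the terminating zero.
    lead_bits = max(0, 7 - length)
    return lead_bits + 6 * (length - 1)  # each 10xxxxxx carries 6 bits


for n in range(1, 9):
    print(n, payload_bits(n))
# 1 7, 2 11, 3 16, 4 21, 5 26, 6 31, 7 36, 8 42
```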

~~~
masklinn
True, not sure why I figured FF was not an acceptable leading byte.

Possibly because I assumed you'd need a special continuation byte and all of
them would be unavailable, but the existing continuation bytes work fine so
that's just not correct. FF could even lead to special patterns of
continuation bytes allowing for smuggling more data in there I guess.

------
klyrs
This leaves a question: what of 0xFF? It can't be a "trailer", and has no
trailing zero required of a "locomotive". No worries, it's simply forbidden.

~~~
tialaramex
If you are processing UTF-8 and can't handle "forbidden" input (for example,
you are in a language with no exception handling, or the code has a must-
never-fail constraint), you should treat whatever forbidden sequence you've
encountered as U+FFFD ("Replacement Character") and continue. The replacement
character is commonly rendered as a black diamond with a white question mark
inside it, so humans will recognise that something went wrong. A machine,
meanwhile, can't confuse U+FFFD with any letter, digit, whitespace, or other
commonly used element it may be treating specially, which prevents many
potential attacks on a parser.
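
For what it's worth, Python's built-in decoder offers this behaviour through its `errors="replace"` mode. A small sketch (the wrapper name `decode_lossy` is just for illustration):

```python
def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for anything malformed."""
    return raw.decode("utf-8", errors="replace")


print(decode_lossy(b"ab\xffcd") == "ab\ufffdcd")  # True

# The default (strict) mode raises instead of substituting.
try:
    b"ab\xffcd".decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding refused:", exc.reason)
```

The overlong C0 80 form discussed in this thread is also rejected by the strict decoder and comes out as replacement characters in lossy mode.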

The article is unfortunate because it's actually handling a common non-UTF-8
encoding in which U+0000 is encoded differently so as to keep C-style null-
terminated string semantics while allowing the NUL character. You should
probably avoid the necessity for such semantics.

~~~
electrum
This special encoding of the null character is Modified UTF-8 and is used by
various Java and JVM APIs and in Java class files.

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
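
A minimal sketch of just the NUL part of that scheme, in Python (the function names are made up here, and this is not Java's actual implementation; real Modified UTF-8 also encodes supplementary characters as surrogate pairs, which this sketch ignores):

```python
def encode_nul_safe(text: str) -> bytes:
    """UTF-8 encode, but write U+0000 as C0 80 so no 0x00 byte appears."""
    # In standard UTF-8 the byte 0x00 only ever encodes U+0000,
    # so a plain byte-level replace is safe.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Inverse of encode_nul_safe."""
    # C0 never occurs in valid standard UTF-8, so this cannot clash.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


encoded = encode_nul_safe("a\x00b")
print(encoded.hex())                         # 61c08062
print(decode_nul_safe(encoded) == "a\x00b")  # True
```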

------
stormdennis
I found this analogy of trains and railcars good but I'm lost on this bit

(110)00000 (10)000000 => C0 80

EDIT: OK, I assume he's referring to C and octal here, and not hexadecimal
or decimal numbers?

~~~
duskwuff
Binary 11000000 10000000 = hexadecimal C0 80.
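
If it helps, here is a small Python sketch of how the two-byte form is assembled (`encode_two_bytes` is a hypothetical helper; it deliberately does not reject overlong forms, which a conforming UTF-8 encoder would never emit):

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Pack a code point into the 110xxxxx 10xxxxxx pattern."""
    if not 0 <= code_point <= 0x7FF:
        raise ValueError("two bytes only carry 11 payload bits")
    lead = 0b11000000 | (code_point >> 6)           # top 5 bits
    trail = 0b10000000 | (code_point & 0b00111111)  # low 6 bits
    return bytes([lead, trail])


print(encode_two_bytes(0x00).hex())  # c080 (the overlong NUL from the article)
print(encode_two_bytes(0xE9).hex())  # c3a9 (same as "é".encode("utf-8"))
```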

~~~
stormdennis
Thank you, got it now!

I think he might have been clearer if he'd said

"To represent a NUL byte without any physical NUL bytes, we _don't discard
the indicators and instead_ treat it like a character above ASCII, which must
be a minimum two bytes long 11000000 10000000 => hexadecimal C0 80."

------
dependenttypes
One small fun fact: UTF-8 uses big-endian order, with the most significant
bits of the code point in the first byte.
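
A quick way to see this is to decode a three-byte sequence by hand. This is only a sketch (`decode_three_bytes` is an illustrative name and does no validation of the prefix bits):

```python
def decode_three_bytes(data: bytes) -> int:
    """Decode 1110xxxx 10xxxxxx 10xxxxxx into a code point."""
    b0, b1, b2 = data
    # The first byte supplies the highest bits, the last byte the lowest.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


euro = "€".encode("utf-8")
print(euro.hex())                     # e282ac
print(hex(decode_three_bytes(euro)))  # 0x20ac
```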

