It’s so easy to play games with specs like this, though. I could choose any of a dozen of the original ASCII control characters, or U+FFFE, or some other character that I never expected to encounter, and I’d probably never hear about the problems it caused, perhaps even because there were none (depending on how obscure the character is). I maintain that it’s bad design. It’s not that much more expensive to wrap or escape your text properly. Hell, there are even unused bytes that can’t appear in valid UTF-8 that you could use instead. Just use FF and FE as your delimiters and sleep easy with full domain support.
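For example, here's a rough sketch in Java of what I mean (class and method names are made up; the point is just that 0xFF can frame arbitrary UTF-8, NULs included):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class FfDelimiter {
        // 0xFF never appears in a valid UTF-8 byte sequence, so it is
        // safe as a record terminator for arbitrary text, including
        // text that contains U+0000.
        static final byte DELIM = (byte) 0xFF;

        static byte[] join(List<String> records) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (String r : records) {
                byte[] b = r.getBytes(StandardCharsets.UTF_8);
                out.write(b, 0, b.length);
                out.write(DELIM); // terminate every record
            }
            return out.toByteArray();
        }

        static List<String> split(byte[] framed) {
            List<String> records = new ArrayList<>();
            int start = 0;
            for (int i = 0; i < framed.length; i++) {
                if (framed[i] == DELIM) {
                    records.add(new String(framed, start, i - start,
                            StandardCharsets.UTF_8));
                    start = i + 1;
                }
            }
            return records;
        }

        public static void main(String[] args) {
            // A record with an embedded NUL survives the round trip.
            List<String> in = List.of("hello\u0000world", "naïve");
            System.out.println(split(join(in)).equals(in)); // true
        }
    }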
You know, the funny thing is, I broadly agree with you :-)
If you're coding to the spec, code to the spec. If you're using something reserved for a purpose it isn't reserved for, then you deserve everything you get.
The trouble is, I'm not going to get any problems: I have a better chance of winning the National Lottery Jackpot and never having to work again than running into a system that uses U+0000 for anything useful (other than termination).
If I read input until U+0000, there is never going to be a problem, because that input did not come from a language and is thus not Unicode anyway. If I am writing output and emit U+0000, the receiver (if they don't use it to delimit strings) isn't going to be able to display it as language anyway, because there is no glyph for it in any language.
At this point in time, my feeling is that this particular ship has sailed. Enforcing U+0000 as reserved, but not for any language, is not a hill worth dying on, which is why I am always surprised to see the argument get made.
> Just use FF and FE as your delimiters and sleep easy with full domain support.
The BOM marks? Sure, that could work, but it has additional meaning. If, while reading UTF-8, you encounter 0xFFFE or 0xFEFF, that means the string has ended and a new UTF-16 string follows (either LE or BE), and you need to parse accordingly (or emit a message saying UTF-16 is not supported in this stream).
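For what it's worth, BOM sniffing at the start of a stream looks roughly like this (a simplified sketch; real encoding detectors also look at byte statistics):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class BomSniff {
        // Guess a charset from a byte-order mark at the start of the data.
        static Charset fromBom(byte[] b) {
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                    && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
                return StandardCharsets.UTF_8;
            }
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
            return StandardCharsets.UTF_8; // default when no BOM is present
        }
    }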
> The trouble is, I'm not going to get any problems: I have a better chance of winning the National Lottery Jackpot and never having to work again than running into a system that uses U+0000 for anything useful (other than termination).
It's not up to the wrapper protocol or UTF-8 processing application to decide whether the characters in the stream are sufficiently "useful". And you will encounter them, if you process enough text from enough sources. A web form, maybe not. But a database, stream processing system, or programming language will definitely run into NUL characters in text. They were probably terminators to somebody at some point, or perhaps they are intended to be seen as terminators by somebody downstream of your user.

For example, I could write a Java source file that builds up a C in-memory layout, so my Java is writing NUL characters. I can write those NULs using the escape sequence '\0', or I could just put actual NUL characters in a String literal in the source file. Editors let me read and write it, the Java compiler is fine with it, and it works as expected. That Java source file is a UTF-8 text file with NULs in it that I suppose are terminators... but not in the way you're implying. They're meant as terminators from somebody else's perspective.
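A contrived but concrete sketch of what I mean (file contents invented for illustration):

    // Layout.java: a perfectly ordinary UTF-8 source file. The string
    // literal below uses the escape sequence \0, but pasting literal
    // U+0000 characters into the file compiles identically.
    public class Layout {
        // Three C-style NUL-terminated fields packed into one string.
        static final String RECORD = "alice\0bob\0carol\0";

        public static void main(String[] args) {
            // Downstream C code treats each NUL as a terminator; to Java
            // it is just another char, and length() counts all of them.
            System.out.println(RECORD.length()); // 16
        }
    }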
But yeah, it is easy to look at that and go "no, gross, why aren't they just doing it the normal way, we're not supporting that". On the other hand, it's also easy to just support all of UTF-8 like javac and everything else I used in that example do.
> The BOM marks?
UTF-8 BOM is EF BB BF. FF and FE would appear in a UTF-16 stream, but never in UTF-8. In UTF-8 you can use the bytes C0, C1, and F5 through FF without colliding with bytes in decodable text. There is no "new UTF-16 string" in the middle of your UTF-8 stream, at least not one that your UTF-8 processing application should care about. Your text encoding detector, maybe.
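If you want to convince yourself, it's a one-screen brute force (throwaway code, not a proof):

    import java.nio.charset.StandardCharsets;

    public class InvalidBytes {
        public static void main(String[] args) {
            // Encode every code point UTF-8 can represent and record
            // which byte values ever occur. Surrogates are skipped
            // because they are not encodable as UTF-8.
            boolean[] seen = new boolean[256];
            for (int cp = 0; cp <= 0x10FFFF; cp++) {
                if (cp >= 0xD800 && cp <= 0xDFFF) continue;
                byte[] b = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte x : b) seen[x & 0xFF] = true;
            }
            // Print the byte values that never appear in valid UTF-8.
            // Prints: C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF
            for (int v = 0; v < 256; v++) {
                if (!seen[v]) System.out.printf("%02X ", v);
            }
            System.out.println();
        }
    }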