
The strangest thing about Unicode (any flavor) is that NULL, aka \0, aka "all zeros" is a valid character.

If you claim to support Unicode, you have to support NULL characters; otherwise, you support a subset.

I find most OS utilities that "accept" Unicode fail to accept the NULL character.

FWIW, UTF-8 has a few invalid byte values (bytes that can never appear anywhere in a valid UTF-8 string). Any one of them could be used as an "end of string" terminator if so desired, for situations where the string length is not known up front.

We could even standardize which one (hint hint). I suggest -1 (all 1s).
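
For illustration, here's a minimal C sketch of what I mean (the function name is made up); a 0xFF byte marks the end, and embedded NULs pass straight through as data:

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical sketch: a UTF-8 string terminated by the byte 0xFF,
       which can never appear in well-formed UTF-8. Embedded NUL bytes
       are treated as ordinary data. */
    static size_t utf8_ff_strlen(const unsigned char *s)
    {
        size_t n = 0;
        while (s[n] != 0xFF)      /* stop at the sentinel, not at NUL */
            n++;
        return n;
    }

    int main(void)
    {
        const unsigned char buf[] = { 'A', 0x00, 'B', 0xFF };
        printf("length = %zu\n", utf8_ff_strlen(buf));   /* prints 3 */
        return 0;
    }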

UPDATE: I meant "strange" as in "surprising", especially for those coming from a C background, like me.




No. NUL is backwards-compatible with ASCII, and is used everywhere. Choosing some arbitrary invalid UTF-8 byte for use as a terminator would be a terrible decision. If you want to handle NUL, simply use length-annotated slices instead of C-style NUL-terminated strings. Anything else is completely wrong.
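
For the record, a length-annotated slice can be as simple as this minimal C sketch (the struct name is illustrative, not from any particular library); embedded NULs are just data:

    #include <stdio.h>
    #include <string.h>

    /* Minimal sketch: carry the length alongside the pointer instead of
       relying on a terminator byte. */
    struct slice {
        const char *data;
        size_t      len;
    };

    int main(void)
    {
        static const char bytes[] = { 'a', '\0', 'b' };
        struct slice s = { bytes, sizeof bytes };

        printf("slice length:  %zu\n", s.len);          /* 3 */
        printf("strlen() sees: %zu\n", strlen(bytes));  /* 1 -- stops at NUL */
        return 0;
    }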


Did you even read my comment? NULL is a valid UTF-8 character.

If specific languages and their standard libraries choose to treat it as a string terminator (C, I'm looking at you), well, then fine.

But it's still valid Unicode. If you claim to support Unicode strings, but don't support the NULL character in those strings, you don't support Unicode strings in their entirety.

Going further, there's nothing in the ASCII spec that requires NULL to only appear at the end of a valid string. That's a C language convention, AFAIK (maybe it started earlier...).


I think what you are trying to say is:

"because UTF-8 has invalid character sequences, we could potentially use one of them to represent end-of-string, which would allow us the flexibility of a null-terminated string (not keeping track of the length) without the restriction of no-nulls-allowed."

You're right! Great. But you are not revealing a "strange thing" about Unicode. You are instead making a general comment about null-terminated strings. So why use such inflammatory and misleading language like "If you claim to support Unicode, you have to support NULL characters"?

Update: I don't object to your idea at all, it's a neat trick! It's just that the way it's phrased, it sounds like Unicode's design contributed to this NULL-terminator problem, when in fact even NULL-terminated ASCII strings cannot 'handle' a null character in this sense.

To augment your idea, though, how about using '0xFF 0x00' as the terminator? That way, backward compatibility is preserved in all cases except UTF-8 => ASCII with embedded NULLs, and in that case the string is truncated rather than overrunning the buffer (i.e. it "fails closed").
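
A minimal C sketch of that convention (purely hypothetical, not any standard): legacy code scanning for a lone 0x00 stops either at an embedded NUL or at the sentinel's trailing 0x00, so the worst case is truncation, never reading past the end of the buffer.

    #include <stddef.h>

    /* Hypothetical: a string terminated by the two-byte sequence 0xFF 0x00.
       0xFF never occurs in well-formed UTF-8, so the pair is unambiguous. */
    static size_t ff00_strlen(const unsigned char *s)
    {
        size_t n = 0;
        while (!(s[n] == 0xFF && s[n + 1] == 0x00))
            n++;
        return n;
    }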


@bobbydavid Thanks for re-stating what I'm saying.

Re: "So why use such inflammatory and misleading language like "If you claim to support Unicode, you have to support NULL characters"?"

I'm not trying to be "inflammatory" or "misleading". In my opinion, if a given API claims to support Unicode strings but disallows certain Unicode characters from appearing in those strings, then that API only partially supports Unicode strings.

Others could have different opinions (evidently, you do).

I don't know if handling all Unicode characters should be a requirement for an implementation to call itself "Unicode-compliant", but barring another standard, that seems reasonable to me.

A better option, IMO, would have been to never include NULL in Unicode in the first place, since it is so widely used in system software as an end-of-string terminator. But that ship has sailed...


UPDATE: Yes, I read yours. I suggested a non-NULL terminator for strings whose length is not known up front. Your alternative only applies to length-annotated strings.

You manage this by having the sending end slice things into lengths for the receiver. Is there some globally-recognized standard for doing so that I'm not aware of?

Because if there isn't, my termination proposal (an invalid Unicode byte) is just as valid as your framing protocol.
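
For what it's worth, the usual framing pattern is a length prefix; a minimal C sketch (the 4-byte big-endian field is an arbitrary choice here, not a recognized standard):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical length-prefix framing: a 4-byte big-endian length,
       followed by that many bytes of UTF-8 (which may include NULs).
       Returns the number of bytes written into 'out'. */
    static size_t frame_write(unsigned char *out,
                              const unsigned char *payload, uint32_t len)
    {
        out[0] = (unsigned char)(len >> 24);
        out[1] = (unsigned char)(len >> 16);
        out[2] = (unsigned char)(len >> 8);
        out[3] = (unsigned char)(len);
        memcpy(out + 4, payload, len);
        return 4 + (size_t)len;
    }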


I'm sorry, did you read mine? "If you want to handle NUL, simply use length-annotated slices instead of C-style NUL-terminated strings."

Seriously though, using 0xFF as a UTF-8 string terminator would be a terrible mistake.


-1 is not a valid Unicode code point. "All 1s" is not adequately defined without saying how many 1s – and Unicode does not specify a maximum bit width. Even if you said "the maximum Unicode code point", that is not all 1s – it is 0x10FFFF.


That's the entire point of choosing -1 as an "end of sequence" marker for a UTF-8 string when the length is not known up front.

A byte containing all 1s (0xFF) can never appear in a valid UTF-8 string, so if one appears, you'd know you had hit the end of the string.


This doesn't work well with handling ill-formed sequences.

The length is of course known up front, if not for the whole string then at least for each individual small substring.


OK, I thought you meant a code point containing all 1s. Thanks for clearing that up.


> The strangest thing about Unicode (any flavor) is that NULL, aka \0, aka "all zeros" is a valid character.

This is either false or misleading, depending on what you're talking about.

In UTF-8, '\0' is all-bits-zero, one byte, and means the same thing as it does in ASCII. It cannot occur in the encoding of any other character.

In UTF-16, the byte 0x00 may validly occur within the encoding of a character that is not '\0'. The same is true for UCS-4.
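
A quick way to see this (UTF-16LE byte order chosen arbitrarily for the example):

    #include <stdio.h>

    /* The text "A", NUL, "B" as raw bytes in two encodings. In UTF-8 only
       the NUL character itself produces a 0x00 byte; in UTF-16LE the 'A'
       and 'B' code units contain 0x00 bytes of their own. */
    int main(void)
    {
        const unsigned char utf8[]    = { 0x41, 0x00, 0x42 };
        const unsigned char utf16le[] = { 0x41, 0x00, 0x00, 0x00, 0x42, 0x00 };
        size_t i;

        printf("UTF-8:    ");
        for (i = 0; i < sizeof utf8; i++)    printf("%02X ", utf8[i]);
        printf("\nUTF-16LE: ");
        for (i = 0; i < sizeof utf16le; i++) printf("%02X ", utf16le[i]);
        printf("\n");
        return 0;
    }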

This is a big reason UTF-8 is as popular as it is.



