"Undefined" bytes in this context do not equate to "invalid" bytes; they were more like don't-cares. Suppose they really were invalid: then ISO/IEC 8859-1 would not allow a newline or a tab, since neither is defined in ISO/IEC 8859-1 itself; both belong to the ISO/IEC 6429 C0 control codes. But a character set without a newline sounds... absurd?

It should be pointed out that the historical model of character sets was quite different from today's. First, a recap:

* A (coded) character set is a partial function from an integer to a defined character meaning.

* A character encoding is a total function from a stream of bytes to either a stream of characters or an error.
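To make the two definitions concrete, here is a rough sketch in Python. The names `charset_8859_1`, `decode_strict` and `DecodeError` are made up for illustration. The graphic ranges 0x20-0x7E and 0xA0-0xFF are the ones 8859-1 defines; raising an error for everything else is just one possible way to get a total function, chosen here only to show the shape of the two types.

```python
from typing import Optional


class DecodeError(Exception):
    """Hypothetical error type for bytes the character set says nothing about."""


def charset_8859_1(code: int) -> Optional[str]:
    """Coded character set: a PARTIAL function from an integer to a character.

    ISO/IEC 8859-1 only defines the graphic ranges 0x20-0x7E and 0xA0-0xFF.
    None means "no meaning defined here", not "invalid".
    """
    if 0x20 <= code <= 0x7E or 0xA0 <= code <= 0xFF:
        return chr(code)  # the same numbers are reused by Unicode
    return None


def decode_strict(data: bytes) -> str:
    """Character encoding: a TOTAL function from bytes to characters or an error.

    Illustrative choice: undefined positions become an error.
    """
    out = []
    for b in data:
        ch = charset_8859_1(b)
        if ch is None:
            raise DecodeError(f"byte 0x{b:02X} has no meaning in the character set")
        out.append(ch)
    return "".join(out)
```

Note that with this strict totalization a plain newline already fails, which is exactly the absurdity mentioned above.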

ISO/IEC 8859-1 is a coded character set, but not a character encoding. It used to be possible to treat character sets as character encodings, and in fact the distinction became apparent only after the rise of Unicode. But as you can see, 8859-1 does not have a newline, so something else had to provide it. Thus there were "adapter" character encodings that make use of the desired character sets: most prominently ISO/IEC 2022 and ISO/IEC 4873. In most practical implementations of both, 6429 is a default building block, so 8859-1 used as a character encoding contains 6429, although 8859-1 itself was never really a proper encoding.

One more point: 2022 and 4873 were not the only character encodings available at that time. One may simply define a character encoding by turning a partial function into a total function, or by defining a total function from the beginning, and that's what IANA did [1]. IANA's version of 8859-1 ("ISO-8859-1") [2] is a proper character encoding with all control codes defined. And I believe the alias "latin1" actually came from this registration!
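A small sketch of "turning a partial function into a total function", again in Python. The helper names `graphic_8859_1` and `decode_filled` are hypothetical. The gaps are filled with the control code that has the same number, and the result is compared against Python's built-in `iso-8859-1` codec. Treating that codec as a stand-in for the IANA registration is my assumption, not something the registry itself states.

```python
from typing import Optional


def graphic_8859_1(code: int) -> Optional[str]:
    """Partial function: only the graphic ranges of ISO/IEC 8859-1 are defined."""
    if 0x20 <= code <= 0x7E or 0xA0 <= code <= 0xFF:
        return chr(code)
    return None


def decode_filled(data: bytes) -> str:
    """Total function: every undefined position is filled with the control
    code at the same number (C0, DEL, C1), so no byte can fail to decode."""
    out = []
    for b in data:
        ch = graphic_8859_1(b)
        out.append(ch if ch is not None else chr(b))
    return "".join(out)


all_bytes = bytes(range(256))

# Python's codec also maps every one of the 256 bytes to the same code point.
same_as_builtin = decode_filled(all_bytes) == all_bytes.decode("iso-8859-1")

# "latin1" is accepted by Python as another name for the same codec.
same_alias = all_bytes.decode("latin1") == all_bytes.decode("iso-8859-1")
```

Both comparisons come out true, and `decode_filled(b"a\nb")` now simply gives back the text with its newline instead of an error.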

[1] https://www.iana.org/assignments/character-sets/character-se...

[2] https://tools.ietf.org/html/rfc1345#page-63



