Character Encoding and UTF-8 (fredrb.com)
17 points by fredrb on July 31, 2022 | 17 comments



There's some confused or misunderstood stuff in here but I guess it's just somebody doing the equivalent of musing out loud about what they've learned? I noticed:

Those aren't "ASCII code pages". "Code pages" are a way to talk about character encodings, mostly these days used by Microsoft in its Windows operating systems, but historically because IBM's manuals would dedicate a whole page to each such encoding. They aren't "ASCII" code pages, although many of them reserve the first 128 codes for the same things ASCII put there.

"The upper 128 bits from the ASCII table" is presumably a mistake and means the upper 128 code values maybe?

It's called "us-ascii" because that's the name IANA assigned to the ASCII encoding. IANA keeps registries of a lot of stuff... here's the one with character sets in it: https://www.iana.org/assignments/character-sets/character-se...


> "The upper 128 bits from the ASCII table" is presumably a mistake and means the upper 128 code values maybe?

Probably, the bullets at the start say "upper 128 positions".


> If someone from Brazil writes a message using the letter é to multiple people, they would read as ة in Arabic, и in Russian and as a corner pipe (╔) if they’re using IBM’s code page 850.

Apologies for the minor nitpick: и is in Cyrillic, not Russian. Cyrillic is the script, Russian is the language. There are other languages that use Cyrillic besides Russian (and the script itself was developed around Greece/Bulgaria before Russia even existed).

For Arabic it's the language and the script so you're OK there!


Thanks for pointing this out! Should have looked it up before. Fixed in the post.


> You need to know the encoding of any text, otherwise it’s impossible to decipher the message (although it’s common for applications to assume the encoding).

Even with the caveat in parentheses, this is quite misleading. For example, the following line is some text, with no specified encoding:

> hello world

Now, while it's true this could be some exotic encoding, or maybe just random binary data, I wouldn't call it impossible to decipher. More accurate would be "impossible to decipher with 100% certainty". The same issue exists with Protocol Buffers, or any format that is not self-describing. The data is not a black box, it's just annoying to deal with.
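
To make that concrete, here's a small sketch (Python, standard codecs; the list of encodings is just a sample I chose): for plain ASCII bytes, most common encodings agree on the answer, which is why an unlabelled "hello world" is still perfectly readable.

    # These bytes decode to the same string under every ASCII-compatible
    # encoding tried here; only something like UTF-16 would read differently.
    data = b"hello world"
    for codec in ("ascii", "utf-8", "latin-1", "cp1252", "koi8-r"):
        print(codec, data.decode(codec))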


Right. Impossible might have been an exaggeration; I will fix that. The point is that if you're reading a file with the text "hello world", you can only make out the characters because you know the encoding. Given two completely different encodings that map the same hex values in the message, it would be impossible to determine which is the correct string. There is no such thing as plain text.


> Given two completely different encodings that map the same hex values in the message it would be impossible to determine which is the correct string.

Sorry, but I don't agree with this either. You can, as a human being (or smart enough AI), look at the result in both encodings, and make an educated guess as to which is correct. If they are wholly different as you say, then one should be gibberish, and one should map to some dictionary.


What you are feeling is called cognitive dissonance. You have the idea of text so hammered into your mind that, when you realize it's merely a convention that makes it readable in practice without needing to know the encoding, you cannot concede the point, despite it being an obvious truth. This phenomenon is called "un*x braindamage".

More or less all possible interpretations of what this person said are correct. But UN*X braindamage also comes with Dunning-Kruger: after memorizing so many factoids over the years, you think someone doesn't know what they're talking about when they get them wrong, despite their overall idea being correct.


The only reason it's easy to decode (as in, by a casual reader, not requiring information theoretic techniques or something like file(1)) is that almost all popular character encodings have gone far out of their way to map the first 128 byte values to ASCII. This idea that text is the common medium / lowest common denominator is a misconception and why UN*X is buggy and half working. Just because it appears easy to read in common tools doesn't mean you have a correct semantic understanding of it. Text is also inefficient and leads to escaping problems, whereby it becomes unreadable again.


I have no idea what you mean by "this idea that text is the common medium / lowest common denominator", nor what these "escaping problems" are (I have to escape ASCII codes 0x00 through 0x1f, I guess, but it's unclear to me why that makes the result unreadable, especially since I hardly ever have to escape anything but \n and maybe \t). And the claim that "UN*X is buggy and half working" is just bizarre.


The UN*X mantra is that text is the common medium and data should be transferred as plain text, as opposed to any other way of encoding data structures like binary.

Escaping problems as in, you embed data structures into text via JSON or XML, and have to write \uXXXX and \" etc, making it unreadable once again.
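
For instance, a quick sketch with Python's json module (ensure_ascii is on by default, which is what produces the \uXXXX escapes):

    import json

    # Quotes get backslash-escaped and the non-ASCII é becomes \u00e9,
    # so the embedded text stops being directly readable.
    print(json.dumps('say "héllo"'))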

No, the claim that UN*X is stable is bizarre.


There's... a lot of wrong stuff here. Tackling some of the highlights:

> ASCII code pages map the upper 128 positions (0x7F:0xFF) of the ASCII byte. Each page holds a different character set. This is one way internationalisation can be achieved.

This is at best a poor explanation, and at worst outright wrong. The actual key thing is the charset--there's a wide variety of charsets. Because ASCII is an inherently 7-bit charset, a lot of charsets were created by keeping the first 128 characters as ASCII and mapping different characters into the upper 128 positions. IBM (I believe) came up with the term 'code page' to refer to the different character sets they came up with.

> Unicode provides a unique code for every character, regardless of the language.

That's not really true. Unicode keeps track of "code points". Several code points may together make up what we think of as a character--consider that something like à can consist of either a precomposed "à" code point or an "a" + "` diacritic" sequence. Thus there's an entire concern about Unicode normalization that a lot of people prefer to sweep under the rug.
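
A short sketch of that with Python's unicodedata module (the code points are just the standard à example):

    import unicodedata

    precomposed = "\u00e0"    # à as a single precomposed code point
    combining = "a\u0300"     # "a" followed by COMBINING GRAVE ACCENT

    print(precomposed == combining)                                # False: different code point sequences
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True: equal after normalization
    print(len(precomposed), len(combining))                        # 1 2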

> When creating a new file using touch, your computer will interpret that file as binary file.

Okay, what's happening here is you've got a command, the file command, whose entire job is to look at a file and guess what its contents are. For text files, part of that guessing process often involves guessing what the character encoding of the file is. That guessing is not always correct--there's the infamous "can't print on Tuesdays" bug, where the date string in the print file, on Tuesdays, caused the file command to think it was an entirely different type of file [1]. There's another famous bug where starting a text file with a 4-letter word, two three-letter words, and another 4-letter word would cause Notepad to think the text file was in UTF-16 instead of ASCII [2].
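
The second one is easy to reproduce. A sketch in Python (the string is the classic example from [2]; the point is only that plain ASCII bytes can also be read as plausible UTF-16):

    # 18 ASCII bytes, read two at a time as UTF-16LE code units, happen to
    # form valid CJK-looking characters, which is roughly what fooled
    # Notepad's detection heuristic.
    data = b"Bush hid the facts"
    print(data.decode("ascii"))      # the intended reading
    print(data.decode("utf-16-le"))  # the misdetected reading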

With regards to guessing charsets, this is not always a particularly feasible process. Some charsets are more reliable to guess than others. UTF-8, for example, tends to stick out--continuation bytes form a pattern that most charsets are unlikely to keep up with for long. Guessing ASCII for text that contains no bytes with the high bit set is pretty safe, since almost every charset is designed with ASCII-subset-safety in mind, and those that aren't (EBCDIC, UTF-7, UTF-16/UTF-32) are found in relatively constrained environments [3].

[1] https://beza1e1.tuxen.de/lore/print_on_tuesday.html

[2] https://en.wikipedia.org/wiki/Bush_hid_the_facts

[3] ISO-2022-* charsets are mode-switching, relying on the ESC character as part of the sequence to switch to different encodings. So you also have to consider the ESC character as a non-7-bit encoding for reliable ASCII detection.
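
To make the UTF-8 point concrete, here's a minimal detection sketch in Python (the function name is mine, and strict decoding stands in for the byte-pattern check described above):

    def looks_like_utf8(data: bytes) -> bool:
        # Strict UTF-8 decoding enforces the lead-byte/continuation-byte
        # pattern, which other 8-bit code pages rarely produce by accident.
        try:
            data.decode("utf-8", errors="strict")
            return True
        except UnicodeDecodeError:
            return False

    print(looks_like_utf8("héllo".encode("utf-8")))    # True
    print(looks_like_utf8("héllo".encode("latin-1")))  # False: a lone 0xE9 is not valid UTF-8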


> Unicode provides a unique code for every character, regardless of the language.

Moreover, this is not strictly true even after the generous reinterpretation (assuming "unique-under-normalization", "code point sequence", "abstract character" and "script") because Unicode still doesn't encode some scripts [1].

[1] https://www.unicode.org/standard/unsupported.html


… and Han unification means that you’ll often get one code point representing several different “characters”, and you must convey the language out-of-band (e.g. via an XML or HTML lang attribute) for the text to be correctly understood, sometimes.

https://en.wikipedia.org/wiki/Han_unification


Yeah I've got beef with Unicode. It doesn't support CJK*. Since I work in games and there are a lot of Japanese games that want to be sold in the Chinese market (and vice versa), this is A Problem. I don't know where they got off thinking those character sets were the same, because if I treat them the same, I don't get paid.

*in my particular example, you can say Unicode doesn't support Japanese, /or/ doesn't support Chinese. The answer depends on what font you're using. "Han unification" affects more than just those two languages, but that's what I have experience with.


Thank you for these points. I've made some corrections in the post.

> consider that something like à can consist of either a precomposed "à" code point or an "a" + "` diacritic" sequence

If Unicode provides a precomposed combination, doesn't that mean it in fact has a code point for every character? Regardless of offering diacritic combination codes?


From my understanding the precomposed ones don't exist for every character; for Latin scripts this might be true, but other scripts are more complex.

A simple example is emoji, where there isn't a precomposed code point for every combination.
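
For instance, in Python (the woman-technologist emoji is defined as a ZWJ sequence; there is no single precomposed code point for it):

    # U+1F469 WOMAN + U+200D ZERO WIDTH JOINER + U+1F4BB PERSONAL COMPUTER
    emoji = "\U0001F469\u200D\U0001F4BB"   # 👩‍💻
    print(len(emoji))   # 3: three code points rendered as one glyph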



