
There's... a lot of wrong stuff here. Tackling some of the highlights:

> ASCII code pages map the upper 128 positions (0x7F:0xFF) of the ASCII byte. Each page holds a different character set. This is one way internationalisation can be achieved.

This is at best a poor explanation, and at worst outright wrong. The actual key concept is the charset--there's a wide variety of charsets. Because ASCII is an inherently 7-bit charset, a lot of charsets were created by keeping the first 128 values as ASCII and mapping different characters into the upper 128 values (0x80 through 0xFF). IBM (I believe) came up with the term 'code page' to refer to the different character sets they came up with.
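A minimal Python sketch of that point, assuming the stock CPython codecs `cp437`, `cp1252`, and `koi8_r` (the byte 0xE9 is just an illustrative pick): the same high byte decodes to a different character under each code page, while bytes below 0x80 come out identically.

```python
# One byte above 0x7F, decoded under three different code pages.
HIGH_BYTE = bytes([0xE9])
CODE_PAGES = ["cp437", "cp1252", "koi8_r"]

decoded = {name: HIGH_BYTE.decode(name) for name in CODE_PAGES}
for name, char in decoded.items():
    print(f"{name}: U+{ord(char):04X} {char}")

# The lower 128 values are plain ASCII in all three code pages.
ascii_sample = b"plain ASCII"
ascii_results = {name: ascii_sample.decode(name) for name in CODE_PAGES}
print(ascii_results)
```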

> Unicode provides a unique code for every character, regardless of the language.

That's not really true. Unicode keeps track of "code points". Several code points may together make up what we think of as a character--consider that something like à can consist of either a precomposed "à" code point or an "a" + "` diacritic" sequence. Thus there's an entire concern about Unicode normalization that a lot of people prefer to sweep under the rug.
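A short Python sketch of the two spellings of à, using the standard `unicodedata` module (the variable names are only illustrative): the strings compare unequal until both are normalized to the same form.

```python
import unicodedata

precomposed = "\u00e0"   # LATIN SMALL LETTER A WITH GRAVE, one code point
decomposed = "a\u0300"   # "a" followed by COMBINING GRAVE ACCENT, two code points

# They render the same but are different code point sequences.
print(precomposed == decomposed, len(precomposed), len(decomposed))

# NFC composes where a precomposed form exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)
```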

> When creating a new file using touch, your computer will interpret that file as binary file.

Okay, what's happening here is you've got a command, the file command, whose entire job is to look at a file and guess what the contents of that file are. For text files, part of that guessing process often involves guessing what the character encoding of the file is. That guessing is not always correct--there's the infamous "the printer can't print on Tuesdays" bug that was caused by the date string in the printer file, on Tuesdays, causing the file command to think it was an entirely different type of file [1]. There's another famous bug where starting a text file with a 4-letter word, two 3-letter words, and a 5-letter word would cause Notepad to think the text file was in UTF-16 instead of ASCII [2].
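To see why that Notepad guess was even possible, here is a small Python sketch. It does not reproduce Notepad's actual heuristic; it only shows that the same 18 bytes are valid both as ASCII and as little-endian UTF-16, so a guesser has nothing but statistics to go on.

```python
data = "Bush hid the facts".encode("ascii")   # 18 bytes, all below 0x80

# Read as ASCII: the text the author typed.
as_ascii = data.decode("ascii")

# Read as UTF-16 (little-endian): each byte pair becomes one code unit,
# and every pair here happens to land in the CJK Unified Ideographs block.
misread = data.decode("utf-16-le")

print(as_ascii)
print([f"U+{ord(c):04X}" for c in misread])
```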

With regards to guessing charsets, this is not always a particularly feasible process. Some charsets are more reliable to guess than others are. UTF-8, for example, tends to stick out--continuation bytes form a pattern that text in most other charsets is unlikely to keep up for long. Guessing ASCII for text in which no byte has the high bit set is pretty safe, since almost every charset is designed with ASCII-subset-safety in mind, and those that aren't (EBCDIC, UTF-7, UTF-16/UTF-32) are found in relatively constrained environments [3].
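A rough Python sketch of that guessing order. `guess_charset` and its return labels are hypothetical, and this is nowhere near a real detector: it only checks for high bits, treats ESC as a hint of a mode-switching encoding per footnote [3], and then tests whether the bytes survive a strict UTF-8 decode.

```python
def guess_charset(data: bytes) -> str:
    """Hypothetical helper: a coarse guess, not a real charset detector."""
    if all(b < 0x80 for b in data):
        # ESC (0x1B) hints at a mode-switching ISO-2022-* encoding.
        if 0x1B in data:
            return "maybe-iso-2022"
        return "ascii"
    try:
        # Strict decoding fails quickly on text in most 8-bit charsets,
        # because their high bytes rarely form valid UTF-8 sequences.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown-8bit"
    return "utf-8"


print(guess_charset(b"hello"))
print(guess_charset("\u00e0".encode("utf-8")))
print(guess_charset("\u00e0".encode("cp1252")))
print(guess_charset("\u65e5\u672c".encode("iso2022_jp")))
```

Anything that lands in the "unknown-8bit" bucket still needs statistics or outside information to tell one legacy charset from another.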

[1] https://beza1e1.tuxen.de/lore/print_on_tuesday.html

[2] https://en.wikipedia.org/wiki/Bush_hid_the_facts

[3] ISO-2022-* charsets are mode-switching, relying on the ESC character as part of the sequence to switch to different encodings. So for reliable ASCII detection you also have to treat the ESC character as a sign that the text is not plain ASCII.

> Unicode provides a unique code for every character, regardless of the language.

Moreover, this is not strictly true even after the generous reinterpretation (assuming "unique-under-normalization", "code point sequence", "abstract character" and "script") because Unicode still doesn't encode some scripts [1].

[1] https://www.unicode.org/standard/unsupported.html

… and Han unification means that you’ll often get one code point representing several different “characters”, and sometimes you must convey the language out-of-band (e.g. via an XML or HTML lang attribute) for the text to be correctly understood.


Yeah I've got beef with Unicode. It doesn't support CJK*. Since I work in games and there are a lot of Japanese games that want to be sold in the Chinese market (and vice versa), this is A Problem. I don't know where they got off thinking those character sets were the same, because if I treat them the same, I don't get paid.

*in my particular example, you can say Unicode doesn't support Japanese, /or/ doesn't support Chinese. The answer depends on what font you're using. "Han Unification" affects more than just those two languages, but that's what I have experience with.

Thank you for these points. I've made some corrections in the post.

> consider that something like à can consist of either a precomposed "à" code point or an "a" + "` diacritic" sequence

If Unicode provides a precomposed combination, doesn't that mean it does in fact have a code point for every character, regardless of also offering combining diacritic codes?

From my understanding, the precomposed ones don't exist for every character. For Latin scripts this might be true, but other scripts are more complex.

A simple example is emoji, where there isn't a precomposed code point for every combination.
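A quick Python check of the emoji case, with two sequences picked purely for illustration: a thumbs-up with a skin tone modifier and a family joined with ZERO WIDTH JOINER. Each displays as one glyph on most systems, yet NFC normalization leaves it as several code points because no precomposed form exists.

```python
import unicodedata

# THUMBS UP SIGN + EMOJI MODIFIER FITZPATRICK TYPE-4
thumbs_up_medium = "\U0001F44D\U0001F3FD"
# WOMAN + ZWJ + WOMAN + ZWJ + GIRL
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467"

for label, text in [("thumbs up", thumbs_up_medium), ("family", family)]:
    nfc = unicodedata.normalize("NFC", text)
    # len() counts code points, not what a reader perceives as characters.
    print(label, len(text), len(nfc), [f"U+{ord(c):04X}" for c in nfc])
```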
