
Why is text such a shitshow?



Like everything else, because of all the edge cases.

Every symbol in Morse? Easy. Every symbol in ASCII? Easy. Add accents, and now you need to ask: are these (1) unique letters with their own sort order, or (2) modifiers that go "on top of" other letters? Answers may vary by language. Add ligatures, and now you're also forced to care about character length even if you would rather not, e.g. ﷽ (U+FDFD). Emoji? Simple ones are fine… but the general case has ways to combine characters to make different icons, making them a systematic synthlang in the same style as Chinese and Japanese[0].
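To poke at this yourself, here's a quick Python sketch (stdlib unicodedata only; any language with a Unicode library would show the same thing) of a precomposed accent vs. a combining one, plus an emoji ZWJ sequence that is five code points but renders as one icon:

    import unicodedata

    # "é" as one precomposed code point vs. base letter + combining accent.
    precomposed = "\u00e9"  # é
    combining = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
    assert precomposed != combining  # different code point sequences...
    assert unicodedata.normalize("NFC", combining) == precomposed  # ...same letter

    # Emoji ZWJ sequence: woman + ZWJ + woman + ZWJ + girl.
    family = "\U0001F469\u200D\U0001F469\u200D\U0001F467"
    print(len(family))  # 5 code points for a single visible glyph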

What even is the right sort order of "4", "四", and "៤"? Or "a" vs. "A"?
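Concretely (stdlib-only Python; the setlocale call assumes an en_US.UTF-8 locale is installed, which may not hold on every system):

    import locale
    import unicodedata

    # Unicode agrees all three mean "four"...
    for ch in "4四៤":
        print(hex(ord(ch)), unicodedata.numeric(ch))  # numeric value 4.0 each

    # ...but default sorting is by raw code point, so "A" < "a".
    print(sorted(["a", "A", "b"]))  # ['A', 'a', 'b']

    # Locale-aware collation gives a different, locale-dependent answer.
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    print(sorted(["a", "A", "b"], key=locale.strxfrm))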

What about writing directions? LTR, RTL, the ancient Greek thing whose name I forget with alternating directions on each line, the Egyptian hieroglyphics where it can be either depending on which way the heads face, or vertical?
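(For the horizontal cases, at least, the Unicode bidi algorithm starts from a per-character direction class, which you can inspect from Python:)

    import unicodedata

    # 'L' = left-to-right, 'R' = right-to-left, 'AN' = Arabic number.
    for ch in ("a", "\u05d0", "\u0661"):  # Latin a, Hebrew alef, Arabic-Indic 1
        print(hex(ord(ch)), unicodedata.bidirectional(ch))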

What about alphabets like Arabic where glyphs are not in general separate, and where they can take different forms depending on if they're the first in a word, the last in a word, in the middle of a word, or isolated?
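Unicode's answer there is to encode the letter once and let the renderer pick the contextual shape, though the legacy "presentation forms" that spell out each shape are still in there for compatibility. A small stdlib Python sketch:

    import unicodedata

    # The four contextual shapes of beh exist as legacy code points...
    for cp in (0xFE8F, 0xFE90, 0xFE91, 0xFE92):
        print(hex(cp), unicodedata.name(chr(cp)))  # isolated/final/initial/medial
        # ...and NFKC folds each back to the single base letter U+0628.
        assert unicodedata.normalize("NFKC", chr(cp)) == "\u0628"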

What about archaic letter forms? e.g. ſ, þ, and ð in Old English.

[0] where I will probably embarrass myself by suggesting 犭(dog) + 瓜 (melon) = 狐 (fox), which, though cute, feels like as much a false etymology as suggesting that in English "congress" is the opposite of "progress". Or perhaps Japanese foxes really are melon dogs — I don't know, I can barely count to 4 in that specific writing system.


> where I will probably embarrass myself by suggesting 犭(dog) + 瓜 (melon) = 狐 (fox), which, though cute, feels like as much a false etymology

I don't know Japanese, but from what I know about classical Chinese character construction, I'd expect that melon acts as a phonetic hint and dog hints at the meaning (i.e. the word this character represents sounds like "melon" but is related to "dog").

Edit: I was curious and looked it up, it's exactly this https://en.wiktionary.org/wiki/%E7%8B%90#Chinese


> the ancient Greek thing whose name I forget with alternating directions on each line

I believe the word you're looking for is "boustrophedon"


The word comes from plowing a field: literally "as the ox turns" at the end of each furrow.


Yup, that's the one. Ευχαριστώ ("thank you")! :)


Mainly because Windows adopted UCS-2, and the hacky extension UTF-16, around the time the superior, ASCII-compatible UTF-8 was invented.

And Java and JavaScript followed Windows, and Python is constrained by it

UTF-16's surrogate pairs even infected JSON, and thus JSON implementations in all languages; funnily enough, encoded JSON is specified to be UTF-8, which is better but a bit confusing.
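You can see both halves of that from plain Python:

    import json

    # \u escapes in JSON are UTF-16 code units, so one astral-plane
    # character takes a surrogate pair in escaped form...
    assert json.loads('"\\ud83d\\ude00"') == "😀"
    assert json.dumps("😀") == '"\\ud83d\\ude00"'  # ensure_ascii=True by default

    # ...even though the bytes on the wire are supposed to be UTF-8.
    assert json.dumps("😀", ensure_ascii=False).encode("utf-8") == b'"\xf0\x9f\x98\x80"'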

Newer, sane languages like Go and Rust are more Unix-like and use UTF-8 natively

It’s basically a Windows vs Unix problem


As a bit of history, Windows NT 3.1 was released in summer '93, and it was the first Windows version with Unicode support (UCS-2, not UTF-16, which didn't exist yet). Presumably development started well before that.

UTF-8 was publicly presented at USENIX at the beginning of '93. Not sure when Unicode incorporated it.

It is unlikely that Windows would have been changed at the last minute to use it, especially as UTF-8's variable-length encoding was significantly more complicated than fixed-size UCS-2.


Thanks, yeah that's basically what I thought, but it's nice to know it was the same year!

If only UTF-8 had been invented a little earlier, we could have avoided so much pain :-(

The idea of global variables like LANG= and LC_CTYPE= in C is utterly incoherent.

Python's notion of "default file system encoding" is likewise incoherent.

You can obviously process strings with two different encodings in the same program!!! Encodings are metadata, and metadata should be attached to data. Encodings shouldn't be global variables!
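A trivial Python sketch of what "metadata attached to data" means in practice; note there is no global locale state anywhere in it:

    # Two byte strings, each carrying its own declared encoding.
    latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'
    utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'

    # Each decode names its encoding explicitly; LANG= never enters into it.
    assert latin1_bytes.decode("latin-1") == utf8_bytes.decode("utf-8") == "café"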

Python 3 made things worse in many ways, largely due to adherence to Windows legacy, and then finally introduced UTF-8 mode:

https://vstinner.github.io/painful-history-python-filesystem...


> You can obviously process strings with two different encodings in the same program!!! Encodings are metadata, and metadata should be attached to data.

So you can't, because Unicode processing can be locale-dependent (though I'm not sure how much of it is), and that metadata is NOT attached to the data. The Unicode Consortium has messed up non-Latin languages multiple times, causing hacks and new standards to be built on top of UTF-8. Han Unification immediately comes to mind[1], but there are others, such as the Korean mess[2] and the Cambodian Khmer problem[3], to name a few. I don't quite understand why it always has to be like that.

1: Sets of characters from zh-Hans (zh-CN), zh-Hant (zh-TW), ko-KR, and ja-JP that were deemed the "same" were merged into the same code points, in an attempt to keep the commonly used characters within a nice 16-bit code space

2: Korean Hangul characters were literally relocated between Unicode 1.1 and Unicode 2.0, causing affected characters written under 1.1 to display as completely unrelated characters (a sketch of the modern layout follows after these notes)

3: Reportedly the Consortium simply did not have a Cambodian linguist (???) (partly due to the unrest and genocide that took place during the '60s-'80s)
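Re: note 2, the layout Hangul was relocated into is at least fully algorithmic: every modern precomposed syllable is pure arithmetic over its jamo. A minimal Python sketch (the constants are the ones from the Unicode standard):

    # Hangul syllable arithmetic per Unicode 2.0+.
    S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28

    def decompose(ch):
        # Split a precomposed syllable into lead/vowel/tail jamo indices.
        i = ord(ch) - S_BASE
        if not 0 <= i < L_COUNT * V_COUNT * T_COUNT:
            raise ValueError("not a precomposed Hangul syllable")
        return i // (V_COUNT * T_COUNT), (i // T_COUNT) % V_COUNT, i % T_COUNT

    print(decompose("한"))  # (18, 0, 4): hieut + a + nieun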


Well, what I'm saying is: if you have 2 different web pages, with 2 different declared encodings

Then a decent library design would let you process those in different threads in the same program

A global variable like LANG= inhibits that

So if you have metadata, it should be attached to the DATA, and not the CODE

---

Same thing with a file system. You can obviously have 2 different files on the same disk with different encodings. So Python's global filesystem encoding and global default encoding don't make any sense.

They are basically "punting" on the problem of where the metadata is, and the programmer often has NO WAY to solve that problem!
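(The workaround Python eventually settled on is worth seeing once: undecodable filename bytes get smuggled through str via the surrogateescape error handler. This sketch assumes a UTF-8 filesystem encoding:)

    import os

    raw = b"caf\xe9.txt"             # Latin-1 bytes, invalid as UTF-8
    name = os.fsdecode(raw)          # 'caf\udce9.txt', with a lone surrogate
    assert os.fsencode(name) == raw  # round-trips, but the str can't be
                                     # re-encoded as strict UTF-8 or printed safely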

---

The issues you mention are interesting but I think independent of what I'm saying


Because human language is very diverse and not optimized for computers.


And current-day thinking is very conservative; Unicode in particular wants to be able to capture and preserve every little detail of writing that ever existed. Historically, human languages were far more flexible and did adapt to change: not being able to reproduce everything exactly as handwritten didn't stop the printing press, and various languages added and removed letters from their alphabets. Try writing things a different way these days and you'll be told they're misspelled or otherwise wrong, just because they don't match what some grumpy old men decreed a hundred (or two hundred) years ago.


Because it uses human scripts, of which there are many around the world, and which have a ton of complexity because natural language generally is quite complex.


History and organic growth.


Genesis 11:1–9



