Cuneicode, and the Future of Text in C (thephd.dev)
125 points by g0xA52A2A on June 7, 2023 | 107 comments



The author is mistaken about fwrite() being exempt from locale conversions on Windows. Here's a snippet to test this:

    #include <stdio.h>
    #include <locale.h>
    #include <stdlib.h>
    #include <windows.h>
    #include <io.h>

    int main() {
        if (!setlocale(LC_ALL, "Japanese_Japan.932")) {
            puts("setlocale failed");
            exit(1);
        }

        // U+00A7 SECTION SIGN encoded in CP932
        char cp932_string[] = "\x81\x98\n";

        // this converts the encoding:
        printf("%s", cp932_string);
        // this too:
        fwrite(cp932_string, 1, sizeof(cp932_string) - 1, stdout);
        // this too:
        _write(1, cp932_string, sizeof(cp932_string) - 1);
        // this outputs raw bytes:
        DWORD bytes_written;
        WriteFile(
            GetStdHandle(STD_OUTPUT_HANDLE),
            cp932_string, sizeof(cp932_string) - 1, &bytes_written, NULL
        );

        return 0;
    }
And this is the output (compiled with Visual C++ 2022 on Windows 10):

    >chcp
    Active code page: 852

    >.\a.exe
    §
    §
    §
    üś


https://learn.microsoft.com/en-us/cpp/c-runtime-library/refe... says:

> When stream is opened in Unicode translation mode—for example, if stream is opened by calling fopen and using a mode parameter that includes ccs=UNICODE, ccs=UTF-16LE, or ccs=UTF-8, or if the mode is changed to a Unicode translation mode by using _setmode and a mode parameter that includes _O_WTEXT, _O_U16TEXT, or _O_U8TEXT—buffer is interpreted as a pointer to an array of wchar_t that contains UTF-16 data. An attempt to write an odd number of bytes in this mode causes a parameter validation error.

I presume stdout is open in that mode, but can't verify it.

Try adding:

    _setmode(_fileno(stdout), _O_BINARY);
https://learn.microsoft.com/en-us/cpp/c-runtime-library/refe...

But you are right, that's not mentioned in the linked-to examples.

Does setlocale() have a side-effect of modifying the mode?


I think it isn't documented, but the console is a special case. The default mode of stdout is _O_TEXT.

_write(), which is used internally by stdio, performs encoding translation if all of the following are true:

- The current locale isn't "C".

- The _O_TEXT flag is set.

- The fd points to the console.

When you redirect the stdout (e.g. with > in a shell), it's no longer considered a console handle, so the translation won't be applied.

Personally, I think it was a weird design decision to include encoding and CRLF translation in what Microsoft calls "low-level I/O".
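A minimal sketch of the workaround this implies (assuming the MSVC CRT): switch the fd to binary mode so _write() stops translating, guarded by _isatty() since only the console case is affected.

    #include <stdio.h>
    #include <io.h>
    #include <fcntl.h>

    int main(void) {
        // Redirected output is already untranslated; only the console needs this.
        if (_isatty(_fileno(stdout))) {
            // Replacing _O_TEXT with _O_BINARY disables both the locale-based
            // encoding translation and the CRLF translation.
            _setmode(_fileno(stdout), _O_BINARY);
        }
        fwrite("\x81\x98\n", 1, 3, stdout); // CP932 bytes now pass through untouched
        return 0;
    }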


Windows isn't unusual for the CRLF translation part. Unix-based operating systems, such as Linux, macOS and the BSDs, do something similar. When writing to a terminal device, which can be a console device, the terminal driver performs CRLF conversions and other special character handling depending on the terminal settings. If an application wants to disable these character operations on a terminal device, it has to explicitly turn them off, which is a bit like using _O_BINARY on Windows. CRLF conversions aren't done in the application's stdio code, but the effect is similar to a Windows console fd.
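For comparison, here's a minimal POSIX sketch of turning that off (the function name is mine), roughly the Unix counterpart of _O_BINARY:

    #include <termios.h>
    #include <unistd.h>

    // Disable output post-processing, which includes the NL -> CR-NL mapping,
    // on a terminal fd. Has no effect on non-terminal fds.
    static void disable_output_translation(int fd) {
        struct termios t;
        if (isatty(fd) && tcgetattr(fd, &t) == 0) {
            t.c_oflag &= ~(OPOST | ONLCR);
            tcsetattr(fd, TCSANOW, &t);
        }
    }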


The first thing that comes to mind with C and locales is this legendary MPV commit: https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...


that was a fun read—you can feel the catharsis the author felt in writing it all out.


Come back wm4, mpv needs you


The author is conflating terms. The meaning of execution character set is [1]:

> The execution character set is the encoding used for the text of your program that is input to the compilation phase after all preprocessing steps. This character set is used for the internal representation of any string or character literals in the compiled code.

But the author seems to be using "execution encoding" to mean "the current locale of the running program". These are completely different things.

See the following links for the test program that's outputting "Execution Encoding":

https://github.com/soasis/cuneicode/blob/018d284088fce910ac8...

https://github.com/soasis/idk/blob/ada0f119da3ccce0f1f58abc9...

1: https://learn.microsoft.com/en-us/cpp/build/reference/execut...


The C standard defines the execution character set.

First it defines the base execution character set. That includes: upper and lower case A to Z, digits 0 to 9, these symbols !"#%&'*+,-./:;<=>?[\]^_{|}~ , space, horizontal tab, vertical tab, form feed, alert, backspace, CR, LF. These will all have single byte encoding.

The full execution character set is further defined as the basic execution characters (single byte), plus additional locale-specific characters (either single or multi-byte):

> The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:

> - The basic character set shall be present and each character shall be encoded as a single byte.

> - The presence, meaning, and representation of any additional members is locale-specific. [ED: emphasis added]

> - [allows for shift-dependent encodings]

> - A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.

---

Which is to say, that the basic characters must have identical encodings in all supported char* strings, and everything else is dependent on the locale, which means setting the locale effectively modifies the execution character set. If you want to correctly interpret any string literal that exceeds the basic character set, you need to know what locale the compiler used when converting to the execution character set, and set the locale to match.

Any encoding that does not give the basic characters those same single-byte values is not technically valid for strings, though it could still be represented as an arbitrary byte array, even if you choose to spell that byte array as `char*`. This also implies that no locale may legally use such an encoding.
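As a minimal sketch of that dependency (the locale name below is just an example and may not be installed): a literal outside the basic character set only decodes correctly when the runtime locale's codeset matches the compiler's execution character set.

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const char *lit = "\xE4"; // 'ä' if the execution character set is Latin-1
        wchar_t wc;
        // The locale's codeset must match the execution character set,
        // otherwise mbtowc() misinterprets (or rejects) the byte.
        if (!setlocale(LC_ALL, "de_DE.ISO8859-1"))
            return 1;
        if (mbtowc(&wc, lit, 1) > 0)
            printf("decoded as U+%04lX\n", (unsigned long)wc);
        return 0;
    }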


> Standard C’s primary deficiency is its constant clinging to and dependency upon the “multibyte” encoding and the “wide character” encoding.

What? No, UTF-8 won for a reason, and that reason is not just that C has a deficiency in this area but, rather, that UTF-8 is: a) simple and much saner than UTF-16, b) self-synchronizing in both directions, c) as -or even more- space efficient than UTF-16 on average even for non-Latin text, d) even UTF-32 doesn't make it possible to turn logical character string indices into 32-bit word string indices. (d) is the real killer.
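A small sketch of property (b), with a made-up helper name: resynchronizing from an arbitrary byte offset only requires skipping continuation bytes (10xxxxxx), in either direction.

    #include <stddef.h>

    // Step backwards to the start of the code point containing position pos;
    // UTF-8 continuation bytes are exactly the bytes of the form 10xxxxxx.
    static size_t utf8_char_start(const unsigned char *s, size_t pos) {
        while (pos > 0 && (s[pos] & 0xC0) == 0x80)
            pos--;
        return pos;
    }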

One just cannot assume that a character or glyph requires just one codepoint to express, therefore one can't assume that a character or glyph will require some fixed number of code units to express, therefore one might as well use UTF-8 because it's saner and more efficient than the other UTFs, therefore... <drumroll/> a "constant clinging to and dependency upon the “multibyte” encoding" is NOT a deficiency of C but an advantage of C.

C does have problems here though, namely all the usual problems it has:

  - C strings (NUL-terminated) suck
  - C doesn't have a first class string type,
      only pointer to char
  - C doesn't have a first class string type
      that indicates what encoding the string
      uses
Now, C's wchar_t is not even a deficiency but a disaster.


I don't think that's what the complaint is. The complaint is that "multibyte" is not necessarily UTF-8. You can't just blindly convert to multibyte assuming that it's UTF-8, because it might not be. You can't convert between two encodings by just going through "multibyte", because it might actually not support all characters you might need to support.

So it really is a deficiency in C. It's nearly useless to have a "multibyte" or "wide character" encoding when those can mean anything. Having conversion between UTF-8 and UTF-32 is useful. Having conversion between "implementation and platform dependent 'multibyte'" and "implementation and platform dependent 'wide character'" strings is nearly useless.
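A POSIX-only sketch of the check this forces onto portable code (nl_langinfo isn't available everywhere): before assuming the multibyte encoding is UTF-8, you have to ask the locale.

    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        setlocale(LC_ALL, "");                  // adopt the user's locale
        const char *codeset = nl_langinfo(CODESET);
        printf("multibyte encoding here is: %s\n", codeset);
        if (strcmp(codeset, "UTF-8") != 0)
            puts("the multibyte side of mbrtowc()/mbrtoc32() is not UTF-8 here");
        return 0;
    }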


C multibyte, I believe, was designed around ISO2022-style stateful code switching. It predates Unicode.


> You can't just blindly convert to multibyte assuming that it's UTF-8, because it might not be.

I mentioned that. TFA didn't say that specifically.


It sort of did, but in a completely different place past the critique section:

> But, rather than using them and needing to praying to the heaven’s the internal Multibyte C Encoding is UTF-8 (like with the aforementioned wcrtomb -> mbrtoc8/16/32 style of conversions), we’ll just provide a direction conversion routine and cut out the wchar_t encoding/multibyte encoding middle man.

Not sure why it wasn't mentioned up top. When trying to convert between UTF-8 and UTF-16 without doing it myself or pulling in external dependencies, this was the most annoying thing that slapped me in the face. This is the problem that makes reliable charset conversions between specific encodings actually impossible using just the stdlib functions.


Standards-wise the only answer to this is to deprecate all non-UTF-8 locales and leave non-UTF-8 codesets outside the scope of C.

Basically, non-Unicode needs to always be at the edge, while in the middle everything needs to be Unicode.

From an application perspective it's easy: document that it only works in UTF-8 locales. Really, that is my position for my software. Anything else is ETOOHARD.


I just want reliable conversions. In my situation (duct taping a very old service to a newer one), I needed to read structured files with UTF-16 fields, and process them into an eventual UTF-8 file written to a different location. The host this needed to run on did not have any unicode locales installed (and incidentally, I hate changing locales for my software because it's a program-global switch to flip, and most of my program still wants to run in the user's locale).

I found it ridiculous that there was no way to just convert UTF-16 to UTF-8 without either reinventing that wheel, pulling in an external dependency, or changing global state and having the right system locales installed (as well as knowing the name of at least one of those locales, and guessing a language along with it), despite having the latest C and C++ compilers at my disposal.
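For reference, the wheel in question is small; here is a minimal sketch of the UTF-8 encoding half (the function name is made up, and validation plus surrogate-pair assembly from UTF-16 are elided):

    #include <stddef.h>
    #include <stdint.h>

    // Encode one Unicode scalar value (already assembled from UTF-16, i.e.
    // surrogate pairs combined) as UTF-8; returns the number of bytes written.
    static size_t encode_utf8(uint32_t cp, unsigned char out[4]) {
        if (cp < 0x80) {
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else {
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
    }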


If I understand you correctly, you think that statement is knocking utf-8. And if I understand the article correctly, it's not. Something here is incongruous, and it may very well be me. I believe the author is knocking how the C language poorly supports utf-8 due to legacy issues with supporting wide characters, which aligns with your criticism.


Not just C, the language. On Linux, if you create a `FILE *` with `fopencookie`, your program suddenly terminates if you apply `fgetwc` to it. The Linux C standard library is unable to read Unicode in 2023.


Finally a sane string encoding conversion API, but insane function names. If they get overlong, accept it and write them out.

And better, write a proper string API. It's still missing.


I agree about the function names. At least have an official alternate long name for each function, like wideToUtf8().


> The reason we always write out Unicode data using fwrite rather than fprintf/printf or similar is because on Microsoft Windows, the default assumption of input strings is that they have a locale encoding. ... even if I put UTF-8 data into fprintf("%s", (const char*)u8" meow ");, it can assume that the data I put in is not UTF-8 but, in fact, ISO 8859-1 or Mac Cyrillic or GBK

I had no clue!

I'm just a happy little Python developer for Unix, and I think I'll stay that way.


> I'm just a happy little Python developer for Unix, and I think I'll stay that way.

I assume you haven’t supported a twenty-year-old website entirely encoded in KOI8-R—the ability to occasionally just set LC_ALL=ru_RU.KOI8-R can be invaluable, and the same caveats would apply there.

Granted, I don’t think modern Python versions will even run properly by default under such conditions. (ETA: Ah, no, I was thinking about PEP 528 and 529, and those only apply to Windows. So dunno.)


Nope! Happy little me pretty much only needs to deal with monolingual English, and prices in £ and €.

Verifying your "Ah, no" on a Mac:

  % python3.11
  >>> print("\N{CYRILLIC CAPITAL LETTER SHORT I}")
  Й
  >>> ^D
  % env LC_ALL=ru_RU.KOI8-R python3.11
  >>> print("\N{CYRILLIC CAPITAL LETTER SHORT I}")
  �
  >>> ^D
Going to make some happy little clouds now. :)


To be entirely fair, that’s a different problem: just because you launched something with a KOI8-R locale inside your terminal emulator doesn’t mean the emulator itself stopped expecting UTF-8. (I mean, it didn’t change its UI language to Russian either, did it?) If you use

  luit -encoding koi8-r env LC_ALL=ru_RU.KOI8-R python3.11
instead, for example, things do work out on my Linux machine, so apparently I was wrong as far as Unix versions of Python go.


If English was good enough for Jesus Christ, it's good enough for me. (/s of course)


But Jesus required more than ASCII! According to the divinely inspired translators of the 1611 King James Bible ("the Only True Bible"), John 3:16, from https://archive.org/details/1611-the-authorized-king-james-b... , is:

  𝔉𝔬𝔯 𝔊𝔬𝔡 ſ𝔬 𝔩𝔬𝔳𝔢𝔡 𝔶ͤ 𝔴𝔬𝔯𝔩𝔡,𝔱𝔥𝔞𝔱
  𝔥𝔢 𝔤𝔞𝔲𝔢 𝔥𝔦𝔰 𝔬𝔫𝔩𝔶 𝔟𝔢𝔤𝔬𝔱𝔱𝔢𝔫 𝔖𝔬𝔫𝔫𝔢: 𝔱𝔥𝔞𝔱
  𝔴𝔥𝔬ſ𝔬𝔢𝔲𝔢𝔯 𝔟𝔢𝔩𝔢𝔢𝔲𝔢𝔱𝔥 𝔦𝔫 𝔥𝔦𝔪 , ſ𝔥𝔬𝔲𝔩𝔡
  𝔫𝔬𝔱 𝔭𝔢𝔯𝔦ſ𝔥, 𝔟𝔲𝔱 𝔥𝔞𝔳𝔢 𝔢𝔲𝔢𝔯𝔩𝔞𝔰𝔱𝔦𝔫𝔤 𝔩𝔦𝔣𝔢.
ASCII can't represent the yͤ shorthand for "the", nor the long-s that Jesus used. /s


Python has its own set of Unicode problems. Sometimes it thinks the terminal shouldn't be in Unicode, when it is, or vice-versa. Java has the same problem. But both languages' Unicode problems are so tiny compared to C that you don't need to worry too much.


From the tables in the article, it looks to me like the existing(?) ztd.text already solves all the issues he mentions (at least the ztd.text column has checkmarks everywhere). I don't understand why there is a need for this new ztd.cuneicode?

But I have only skimmed the article so maybe I missed something


ztd.text is a C++ library. cuneicode is a C library.


that explains it, thanks!


Thank God someone has the capacity to attack this issue in depth. The UTF-8 situation with regard to systems-level programming is scary and annoying at the same time.

edit: I note that HN filters out utf8 emojis..


> The UTF-8 situation with regard to systems-level programming is scary and annoying at the same time

These days one of the systems level programming languages (Rust) is natively UTF-8.

This would have been a huge source of drama in say, 1993 (when UTF-8 is basically brand new) and maybe even in 2003 (by which point it's clear Unicode is a success, but still conceivable that UTF-16 "wins" in some sense) but in 2023 people just shrug - obviously it's UTF-8, why not?


On Windows this isn’t obvious at all, as the native encoding is UTF-16, and converting to/from UTF-8 on every Windows API call is not only a pessimization both in runtime and memory usage, but also introduces complications in having to handle possible conversion failures (involving unpaired surrogate characters, in particular).

C in principle allows abstracting over the system encoding.


Rust has an abstraction for the system encoding: OsStr. Though I think most Rust programs just assert everything is UTF-8, the language goes to great lengths to ensure it is possible, and not even particularly difficult, to write Rust code that can handle Windows's broken UTF-16, yet which isn't Windows-specific.


Eh, the `OsStr` API has warts still, some parts are unstable (https://github.com/rust-lang/rust/issues/111544).


What about compiling programs with `cl -utf-8` and using -A APIs? I know the docs say “If the ANSI code page is configured for UTF-8, -A APIs typically operate in UTF-8”, to which I say, just “typically”?! But in practice is it not sufficient?


The -A APIs are converted to calls to the -W APIs, IIRC, so you pay that cost everywhere. Not sure if that's universally true anymore, but it has been that way at least on the NT based kernel since...inception? Probably a good question for Raymond Chen. Sadly, I can't seem to find anything relevant in a quick search of the Old New Thing.


The GP is referencing the new UTF-8 APIs that reuse the -A suffixes.[0]

> Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs typically operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes.

This requires Windows 10 version 1903 or newer. As the version number suggests, this feature is only four years old, and using them prevents your program from working on older versions.

[0]: https://learn.microsoft.com/en-us/windows/apps/design/global...


Yes, precisely. I have never tested it conclusively, but (taking it as a given that we are targeting >=1903) is it not sufficient and optimal to use `-utf-8` and -A APIs?


Windows UTF-8 support is relatively recent and I have no experience with it, so I don’t know. However I expect Windows to just do the same conversions internally or in the linked runtime, that the program would otherwise have to do by itself. I’d assume that there will be edge cases that such programs then can’t handle, such as UI input and file paths containing unpaired surrogates.


> the native encoding is UTF-16

There's no reason all the rest of your software needs to pay for Microsoft's mistake.

> and converting to/from UTF-8 on every Windows API call is not only a pessimization both in runtime and memory usage, but also introduces complications in having to handle possible conversion failures (involving unpaired surrogate characters, in particular).

If the text isn't actually Unicode, then all you're demonstrating by not handling that in another language is that you didn't care about correctness. There's no magic here, unpaired surrogates aren't somehow actually valid Unicode so long as you stick with C.


Windows APIs (including any COM and .NET libraries you may interoperate with) can serve you strings containing unpaired surrogates. That has nothing to do with that I “didn’t care about correctness”. Processing Windows UTF-16 as-is simply avoids complications that you would add by converting back and forth from and to UTF-8. If you’re working with other I/O (network requests, file contents) that is UTF-8-based, it makes much more sense to strategically pick the places where you perform conversions within your application. The conversion then is also limited to where data actually crosses between Windows APIs and other I/O within your application.

Traditionally, strings in C are encoding-agnostic. By default, the encoding of the current locale is assumed, and applications are expected to convert to other encodings when specifically needed. Forcing everything into “there can only be one encoding we can work with” isn’t a great solution. One can wish that everything was UTF-8 from the beginning, but that’s not the world we live in.
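A minimal sketch of converting only at that boundary, using the Win32 conversion APIs (the wrapper name is mine); WC_ERR_INVALID_CHARS makes unpaired surrogates fail loudly instead of being silently replaced:

    #include <windows.h>

    // Returns the number of UTF-8 bytes written, or 0 on failure
    // (for example, an unpaired surrogate in the input).
    static int win_utf16_to_utf8(const wchar_t *in, int in_len,
                                 char *out, int out_cap) {
        return WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                   in, in_len, out, out_cap, NULL, NULL);
    }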


> There's no reason all the rest of your software needs to pay for Microsoft's mistake.

How is it a mistake when UTF-8 was not even a thing when the NT kernel was developed?


Not every mistake is something you could have prevented/ avoided. Sometimes you just get unlucky.


UTF-8 should never contain surrogates as they serve no purpose when all code points can be encoded directly. That doesn't prevent programmers doing stupid things with them but that's not a Unicode problem.


Correct – but that’s exactly why you can’t just say “all strings are UTF-8, we’re done here”. Because Windows filenames can contain unpaired surrogates, which don’t have a sensible UTF-8 translation. (Unix has a similar problem with filenames being arbitrary bytes that may not be valid UTF-8.)

Note that Rust has OsString to handle both of these cases, but it’s somewhat hard to use.


With networking code you sometimes need an OsString for the platform you're not on. Rust doesn't seem to support this, other than by writing your own.


> These days one of the systems level programming languages (Rust) is natively UTF-8.

Except that the geniuses who maintain the extant Unix systems still couldn't bring themselves to at least start transitioning pathnames to utf8, so now we're forever stuck with the terrible OsString cruft and wasting brain and CPU cycles converting from and to OsString in some unsatisfactory manner.

At least ZFS (which also incidentally is pretty much the only non-terrible filesystem) has an utf8only flag.


> Except that the geniuses who maintain the extant Unix systems still couldn't bring themselves to at least start transitioning pathnames to utf8, [...]

As far as POSIX and Unix `open(2)` and friends go, pathnames are opaque binary with only two special byte values: NUL (0x00, because it's C strings, which are NUL-terminated) and '/' (because that's the path component separator). Any codeset and form that is compatible with that will "work" -- for some value of "work" where if the codeset isn't Unicode in UTF-8 then you'll be sad.

Nothing keeps users / sites from declaring that thou shall use UTF-8 on the filesystem, or, even better, that thou shall use UTF-8 locales only, as the latter is the only simple way to [mostly] get the former.

> At least ZFS (which also incidentally is pretty much the only non-terrible filesystem) has an utf8only flag.

Note that ZFS can't tell if the strings from user-land are in UTF-8. ZFS can only tell if they're not valid UTF-8.
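That check is mechanical; here's a minimal sketch of the only test a filesystem can apply (function name made up): the bytes are either well-formed UTF-8 or they aren't, and nothing says what encoding was intended.

    #include <stdbool.h>
    #include <stddef.h>

    // Well-formedness check: rejects truncated sequences, overlong forms,
    // surrogate code points, and anything above U+10FFFF.
    static bool is_valid_utf8(const unsigned char *s, size_t n) {
        for (size_t i = 0; i < n; ) {
            unsigned char b = s[i];
            size_t len;
            if (b < 0x80) { i++; continue; }
            else if ((b & 0xE0) == 0xC0 && b >= 0xC2) len = 2;
            else if ((b & 0xF0) == 0xE0)              len = 3;
            else if ((b & 0xF8) == 0xF0 && b <= 0xF4) len = 4;
            else return false;
            if (i + len > n) return false;
            for (size_t k = 1; k < len; k++)
                if ((s[i + k] & 0xC0) != 0x80) return false;
            if (len == 3 && (b == 0xE0 ? s[i + 1] < 0xA0
                                       : (b == 0xED && s[i + 1] >= 0xA0)))
                return false; // overlong form or UTF-16 surrogate
            if (len == 4 && (b == 0xF0 ? s[i + 1] < 0x90
                                       : (b == 0xF4 && s[i + 1] >= 0x90)))
                return false; // overlong form or above U+10FFFF
            i += len;
        }
        return true;
    }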


the geniuses that maintain the extant Unix systems invented UTF-8 precisely because they value backward compatibility above everything else and won't contemplate breaking stuff for the sake of programmer convenience.


UTF-8 was invented by people who were sufficiently fed up with the shortcomings of Unix that they created a quite incompatible but much improved evolution of it. And it's not just some hack for the sake of backwards compatibility at the cost of other properties, it's a vastly superior design compared to the then existing unicode encodings.

And there is no serious backwards compatibility problem. You just introduce a mount option to enforce utf-8, flip it to default after, say, a decade, and people who then still have file systems with pathnames in latin-1 (or other craziness) can then flip it off and get a few extra decades to migrate their stuff. In the meantime the rest of the world just writes software under the assumption that it will be deployed to non-crazy installs, and software becomes magically more reliable and quicker to write.

And it's not like even 1% of software now would handle non-utf-8 filenames robustly anyway (I mean even if you are aware of the problem and diligent about it, there is absolutely no good way to deal with non-utf8 filenames if you need to output them as text for consumption by humans or other programs, which is close to 100% of cases, because at the very least you will need to do so in error messages if there is an IO problem).


> UTF-8 was invented by people who were sufficiently fed up with the shortcomings of Unix that

Uh, not exactly, not quite. UTF-8 was invented by Unix people who were fed up with all the alternative proposals at the time. Those people did not then go and "fix Unix" in any way regarding this. Those people were none other than Ken Thompson and Rob Pike, who were among the creators of Unix -- you could only have gotten closer to "those people" being the creators of Unix by having had Dennis Ritchie at that diner table on that fateful day.


Then what happens when you mount it with the utf-8 option on one machine, and without it on another. The one without the utf-8 options writes a filename that breaks utf-8 encoding. You then try to mount it again with the utf-8 option. What happens then?


Not only mount. What happens when you unpack an archive that uses non-fully utf-8 compliant names. Or checkout the history of a git repo with non-utf-8 filenames. Network filesystems are also a source of pain.

And even if you settle for utf-8 you'll still have to deal with differences with filesystems automatically canonicalizing names or being case insensitive.


> What happens when you unpack an archive that uses non-fully utf-8 compliant names.

Exactly what you would want to happen (this is the actual error message you will get, BTW)?

   > (cd /mnt/funkyfilesystem && touch $'\370' && tar cf my-bad-file.tar $'\370')
   > tar xf /mnt/funkyfilesystem/my-bad-file.tar
   tar: \370: Cannot open: Invalid or incomplete multibyte or wide character
   tar: Exiting with failure status due to previous errors
Or are you trying to tell me that your life isn't complete without tar or a git checkout silently dropping some nonsense byte sequences into your filesystem? In both cases you can map the names to something saner explicitly and continue with whatever you were doing, or checkout to e.g. a tmpfs where you flipped off utf8-enforcement. BTW, unless you work with fairly unusual git repos or tar archives on a regular basis, this is very unlikely to ever happen to you.

> And even if you settle for utf-8 you'll still have to deal with differences with filesystems automatically canonicalizing names or being case insensitive.

(Case-preserving you mean? I think truly case-insensitive died with 8.3 DOS filenames). These are also annoying, but a much more minor issue, e.g. you don't need to create a new weird string type just because of them.


Your tar example appears to Work On My Machine. Probably I don't have a funky enough filesystem. Of course if the filesystem itself doesn't support invalid Unicode names, there isn't much the OS or tools can do.

But hey, if enough tools and applications stop supporting non-utf-8 names, it is possible that in a a few decades you might get what you want as de-facto non-utf8 files will just disappear.


I agree it's something that needs to be dealt with, but can you name a plausible scenario in which this would actually happen and in which the desired behavior would be that you get silently some garbage filename written to your disk?

Let's say you get an EILSEQ on accessing the dirent with readdir and now have to debug how you got some corrupted filename you clearly didn't want in the first place, remount and fix it. How is that worse than having the corrupted filename and not knowing about it before something more insidious happens?


Oh, and BTW: Plan-9, the OS utf-8 was specifically invented for did actually enforce utf-8 for filenames.


There is a simple solution: use UTF-8 in the middle and push all codeset conversions to the edges. And, ideally: deprecate non-UTF-8 locales.


Falling back to emoticons is always an option ;-)


Back when I was still single, some people unmatched me on dating apps for using them. Great way to filter out people not worth your time, really


You could do that, or you could set your age minimum to 30 or so.


I'm sure it would have been worse with people in their tweens, but it's not exclusive to them weirdly enough.

Another good one is being completely honest about your height (mostly just filters out American expats but still).

Anyway, the only reason I used emoticons was because I use this keyboard:

http://www.exideas.com/ME/index.php

... but it apparently had some side-effects.


The inability to represent text in the vast majority of the world’s languages is a far bigger issue than the lack of emoji.


The elephant in the room is that Unicode is a steaming pile of shit. It is the source of all the complexity that results in people picking and choosing what they dare to try implementing, and document other things as being off the plate (the application programmer can fend for themselves as best they can, based on what of the missing cruft they need).


It's not Unicode's fault. Even if we got rid of precompositions there would still be a tremendous amount of complexity in Unicode that all derives from the complexity of the scripts that humans use every day (as well as scripts of dead languages that we choose to support still).

That a and á and so on are related is a feature of the Latin scripts, and the need/desire to have a coding system that can express that relatedness would exist even if you object to that desire.

Almost everything about Unicode that people love to hate comes from humans who have nothing to do with Unicode:

  - composition (thus multi-codepoint glyphs)
  - weird collation rules
  - weird equivalence rules
  - the need for normalization and/or
    normalization-insensitive string hashing
    and comparison
Even emoji are like this because emoji are really just a new kind of script (with the extra feature of color, which all other scripts lack).

The only complexity that could have been elided would have been pre-compositions and NFC and NFKC, and maybe the K mappings so that NFKD also could be dispensed with. Everything else would still necessarily be a part of Unicode or any competitor to Unicode.

And no, lots of codesets, one per-script, does not work as an alternative -- we tried it and it sucks, and that's why Unicode exists.

To complain about Unicode's complexity is to tilt at windmills.


> scripts of dead languages that we choose to support

... but it's not Unicode's fault!

> To complain about Unicode's complexity is to tilt at windmills.

A specification that people don't want to fully support is a bad specification.

Complaining about programs and tools that don't implement everything there is to Unicode is what is tilting at windmills.

At the end of the day, the programs win; they determine the functionality you have, not the specification.

A specification that people don't want to fully embrace is a bad specification.


> > scripts of dead languages that we choose to support

> ... but it's not Unicode's fault!

Right, it's not.

> [the rest]

You say that as if current support for Unicode will not improve, as if somehow it has never improved (but then, even what we have now wouldn't exist). That's clearly not the case.


The lack of superscript uppercase C F Q S X Y and Z really accents U+1F4A9.


What would you change?


Not them, but:

- Combining characters, joiners, modifiers, etc. should have been prefix, not suffix/infix, and in blocks by arity. This would have allowed detecting incomplete sequences, allowed code to skip combinations without knowing their meaning, and made dead-key input trivial.

- A standard normalized form should have been specified up front.

- Once blocks overflowed, subsequent blocks should have been made large enough that more than one would not be necessary. Unicode has ten different Latin blocks. You've got ‘Kana Extended-B’, ‘Kana Supplement’, ‘Kana Extended-A’ (in that order!), and ‘Small Kana Extension’. (If there's an explanation for ‘Supplement’ vs ‘Extended’ vs ‘Extension’, I've never seen it.)

- Character ordering is inconsistent even within scripts, e.g. ÀÁÂ… àáâ but ĀāĂ㥹.

- Assigning code points for the UTF-16 encoding is ridiculous.

- The contentious cases of CJK unification should have been disambiguated promptly, before use of the originals became entrenched. (I know where to find 2K code points they could have used…)

- They recently changed some characters to have ‘emoji presentation’ by default, meaning that existing text needs to be edited to preserve its original appearance.


> - Combining characters, joiners, modifiers, etc. should have been prefix, not suffix/infix, and in blocks by arity. This would have allowed detecting incomplete sequences, allowed code to skip combinations without knowing their meaning, and made dead-key input trivial.

Yes, this is a very good suggestion. UTF-8 is a multi-byte encoding that is self-synchronizing in either direction. Multi-codepoint characters/glyphs should have been similarly self-synchronizing in either direction.

> - A standard normalized form should have been specified up front.

There are four. The only reasonable simplifications here would have been to get rid of NFC and NFKC, and maybe NFKD, leaving only NFD, and then maybe also remove all precompositions (but this would have made take-up of Unicode take much longer because it would have required more complexity in implementations earlier).

> - Once blocks overflowed, subsequent blocks should have been made large enough that more than one would not be necessary. Unicode has ten different Latin blocks. You've got ‘Kana Extended-B’, ‘Kana Supplement’, ‘Kana Extended-A’ (in that order!), and ‘Small Kana Extension’.

That's a function of how allocations were made, especially before UTF-8 and UTF-16, when we were stuck with just the BMP.

Well, anyways, we'd need a time machine to get any of these things fixed now.

> - Character ordering is inconsistent even within scripts, e.g. ÀÁÂ… àáâ but ĀāĂ㥹.

Do you mean codepoint ordering? Hardly matters: the actual ordering will vary by locale anyways.

> - Assigning code points for the UTF-16 encoding is ridiculous.

That's not a thing.

> - The contentious cases of CJK unification should have been disambiguated promptly. (I know where to find 2K code points they could have used…)

That would have required getting representatives from those countries to sit down at the UC table sooner.

> - They recently changed some characters to have ‘emoji presentation’ by default, meaning that existing text needs to be edited to preserve its original appearance.

Blech. Occasionally the UC has to make backwards-incompatible changes, but they should try really hard to keep those to the absolute bare minimum -- zero would be ideal.


> That's not a thing.

I was referring to U+D800–U+DFFF being reserved just because UTF-16 happens to use 0xD800–0xDFFF for its encoding scheme. Contrast valid code points U+0080–U+00FF, while (most of) 0x80–0xFF are used by UTF-8.


Yes, surrogate pairs are stupid. We should do away with UTF-16 and liberate the remaining 10 bits of codespace in UTF-8.


Easier said than done. Programs that don't care about surrogate pairs already use blocks like U+DCxx for their purpose. Liberating the code space for character use would be a poor idea; blessing it for local use would make more sense, but it would just be approving existing practices.

I have an implementation in which codes U+DC00 through U+DCFF indicate "this was a bad byte in the original UTF-8". When UTF-8 is generated, these values reproduce the original invalid bytes. Additionally one of those values, U+DC00, is particularly useful since it encodes a null byte that occurred in the UTF-8 stream. While that is not strictly necessary as far as UTF-8 goes, the encoding provides a way to embed the null character in null-terminated strings.
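A tiny sketch of that scheme (names made up for illustration): an undecodable input byte b round-trips as the ill-formed code point U+DC00 + b, so the null byte comes out as U+DC00.

    #include <stdint.h>

    static uint32_t escape_bad_byte(uint8_t b)   { return 0xDC00u + b; }
    static int      is_escaped_byte(uint32_t cp) { return cp >= 0xDC00u && cp <= 0xDCFFu; }
    static uint8_t  unescape_byte(uint32_t cp)   { return (uint8_t)(cp - 0xDC00u); }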


Dead-key input, if done according to ISO 9995, is trivial for exactly that reason. Type all of the combining diacritics, then the base character.

* https://jdebp.uk/Softwares/nosh/guide/commands/console-fb-re...


I would throw out all obscure languages and useless symbols.

One character = one code point.

I would use 32 bits for a code point. The lower 16 bits would be code; the upper 16 bits contain some flags and fields for classification. This would allow simple bit operations to inquire about important properties. There would be a 4-bit flexible type code, leaving 12 bits, some of which would have a code-specific meaning, others fixed. Or something like that.

The goal would be to have a code that programs can work with, without requiring megabytes and megabytes of meta-data about the code.


Didn't read the post thoroughly, will do, but why no libgrapheme in there?


In my understanding this library is meant to be a part of the future C standard library (but that attempt was not successful). As such, it has to keep all past craziness imposed by ISO C---libgrapheme on the other hand only supports UTF-8 and pseudo-UTF-32 ("pseudo" because of uint_least32_t).


Why is text such a shitshow?


Like everything else, because of all the edge cases.

Every symbol in Morse? Easy. Every symbol in ASCII? Easy. Add accents, and now you need to ask: are these (1) unique letters with their own sort order, or (2) modifiers that go "on top of" other letters? Answers may vary by language. Add ligatures, and now you're also forced to care about character length even if you would rather not, e.g. ﷽ (U+FDFD). Emoji? Simple ones are fine… but the general case has ways to combine characters to make different icons, making them a systematic synthlang in the same style as Chinese and Japanese[0].

What even is the right sort order of "4", "四", and "៤"? Or "a" vs. "A"?

What about writing directions? LTR, RTL, the ancient Greek thing whose name I forget with alternating directions on each line, the Egyptian hieroglyphics where it can be either depending on which way the heads face, or vertical?

What about alphabets like Arabic where glyphs are not in general separate, and where they can take different forms depending on if they're the first in a word, the last in a word, in the middle of a word, or isolated?

What about archaic letter forms? e.g. ſ, þ, and ð in old English.

[0] where I will probably embarrass myself by suggesting 犭(dog) + 瓜 (melon) = 狐 (fox), which, though cute, feels like as much a false etymology as suggesting that in English "congress" is the opposite of "progress". Or perhaps Japanese foxes really are melon dogs — I don't know, I can barely count to 4 in that specific writing system.


> where I will probably embarrass myself by suggesting 犭(dog) + 瓜 (melon) = 狐 (fox), which, though cute, feels like as much a false etymology

I don't know Japanese but with what I know about classical Chinese character construction, I'd expect that melon acts as a phonetic hint and dog hints at the meaning (e.g. the word this character represents sounds like "Melon" but is related to "Dog").

Edit: I was curious and looked it up, it's exactly this https://en.wiktionary.org/wiki/%E7%8B%90#Chinese


> the ancient Greek thing whose name I forget with alternating directions on each line

I believe the word you're looking for is "boustrophedon"


The word comes from the concept of plowing a field.


Yup, that's the one. Ευχαριστώ! :)


Mainly because Windows adopted UCS-2 and the hacky extension UTF-16 around when the superior and ASCII-compatible UTF-8 was invented

And Java and JavaScript followed Windows, and Python is constrained by it

The surrogate pairs of UTF-16 even infected JSON and thus implementations in all languages, but funny enough encoded JSON is specified to be UTF-8, which is better but a bit confusing

Newer, sane languages like Go and Rust are more Unix-like and use UTF-8 natively

It’s basically a Windows vs Unix problem


As a bit of history, Windows NT 3.1 was published in summer '93, and it was the first Windows version with Unicode support (UCS-2, not UTF-16 that didn't exist yet). Presumably development started well before that.

UTF-8 was publicly presented at USENIX at the beginning of '93. Not sure when Unicode incorporated it.

It is unlikely that Windows would have been changed at the last minute to use it, especially as the variable encoding of UTF-8 was significantly more complicated than the fixed size UCS-2.


Thanks, yeah that's basically what I thought, but it's nice to know it was the same year!

If only UTF-8 had been invented a little earlier, we could have avoided so much pain :-(

The idea of global variables like LANG= and LC_CTYPE= in C is utterly incoherent.

Python's notion of "default file system encoding" is likewise incoherent.

You can obviously process strings with two different encodings in the same program !!! Encodings are metadata, and metadata should be attached to data. Encodings shouldn't be global variables!

Python 3 made things worse in many ways, largely due to adherence to Windows legacy, and then finally introduced UTF-8 mode:

https://vstinner.github.io/painful-history-python-filesystem...


> You can obviously process strings with two different encodings in the same program !!! Encodings are metadata, and metadata should be attached to data.

So, you can't, because Unicode processing can be (though I'm not sure how much of it is) locale-dependent, and that metadata is NOT attached to the data. The Unicode Consortium has messed up non-Latin languages multiple times, causing hacks and new standards to be built on top of UTF-8. Han Unification immediately comes to mind[1], but there are others, such as the Korean Mess[2] and the Cambodian Khmer problem[3], to name a few. I don't quite understand why it always has to be like that.

1: Sets of characters from zh-Hans (zh-CN), zh-Hant (zh-TW), ko-KR, and ja-JP that were deemed "the same" were merged into the same code points, in an attempt to keep commonly used characters within a nice 2 bytes

2: Korean Hangul characters were literally relocated between Unicode 1.1 and Unicode 2.0, causing affected characters written under 1.1 to display as unrelated characters

3: Reportedly the Consortium simply did not have a Cambodian linguist(???) (partly due to unrest and genocide that took place during 60s-80s)


Well, what I'm saying is: if you have 2 different web pages with 2 different declared encodings

Then a decent library design would let you process those in different threads in the same program

A global variable like LANG= inhibits that

So if you have metadata, it should be attached to the DATA, and not the CODE

---

Same thing with a file system. You can obviously have 2 different files on the same disk with different encodings. So Python's global FS encoding and global encoding don't make any sense.

They are basically "punting" on the problem of where the metadata is, and the programmer often has NO WAY to solve that problem!

---

The issues you mention are interesting but I think independent of what I'm saying
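For what it's worth, POSIX 2008 has a partial escape hatch in per-thread locales (a sketch; the locale name is just an example and must be installed), though it still ties the encoding to the code rather than to the data:

    #include <locale.h>

    static void process_koi8r_page(void) {
        locale_t ru = newlocale(LC_ALL_MASK, "ru_RU.KOI8-R", (locale_t)0);
        if (ru != (locale_t)0) {
            locale_t prev = uselocale(ru);  // affects only the calling thread
            /* ... decode this page with mbrtowc() etc. here ... */
            uselocale(prev);
            freelocale(ru);
        }
    }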


Because human language is very diverse and not optimized for computers.


And current-day thinking is very conservative; Unicode especially wants to be able to capture and preserve every single little detail of writing that ever existed. Historically, human languages were far more flexible and did adapt to changes; not being able to reproduce everything exactly as in handwriting didn't stop the printing press, and various languages added and removed letters from their alphabets. Try writing things in a different way these days and you will be told that stuff is misspelled or otherwise wrong, just because it doesn't match what some grumpy old men decreed a hundred (or two hundred) years ago.


Because it uses human scripts, of which there are many around the world, and which have a ton of complexity because natural language generally is quite complex.


History and organic growth.


Genesis 11:1–9


These aren't mistakes of C, these are mistakes made hundreds and hundreds of years ago in the design of various language writing systems. English gets a lot right with its writing system, mostly in its choice of a small alphabet, a small set of punctuation marks, and an almost complete absence of diacritical marks.


Maybe, but we also have to understand that by an accident of history, computers and the internet were originated by English-speaking people, so it makes sense that the defaults (ASCII and so on) were geared toward the specific needs of that language— it makes sense that everything not-English is going to have some degree of feeling "bolted on" to an underlying framework built for English.

If French or Greek or Russian or Japanese had happened to be the lingua franca of the Internet, the features those writing systems require would be first class, and the implementation of English would be the one that felt like a bit more of a kludge. Imagine if computers were designed around unicase Arabic languages [1], and then years later there was a "Han Unification"-type summit where it was decided that capitalization was to be handled with extra metadata, since upper- and lower-case Latin letters are really just the same thing and why bother wasting double the codepoints on that.

(I don't have a link handy, but I believe I read somewhere that you can see some of this effect in NES-era RPGs from Japan— that some of the narrative had to be simplified down for the English translations because the text boxes were originally built for a small number of complex characters rather than the larger number of simpler ones that would be required to capture the same meaning.)

[1]: https://en.wikipedia.org/wiki/Unicase


The narrative that Han unification is this thing imposed by ignorant westerners on East Asian computer users is simply not true. The criteria for which characters were unified was mostly based on the criteria used in legacy East Asian encodings, which already had to deal with the question of what counts as the same character and what does not.

Unicode has round trip compatibility with the old encodings, too, without the use of any of that extra metadata, which is only used for incredibly minor character variations which are pretty much never semantically meaningful the way capitalization is. A human copyist might change one for another in copying a text by hand, just as when copying English, you do not consider it semantically meaningful whether lowercase “a” is drawn as a circle and a line, or a circle and a hook.


Is there a longer-form piece that would be helpful for me to read giving more of this perspective and showing receipts on it? As someone with little knowledge outside the Wikipedia article, it seems to me most damning that Shift JIS is still so popular in Japan.



I wish I could think of a good long form article about this. The best I can immediately think of is a pretty informative FAQ about Japanese encoding.

https://www.sljfaq.org/afaq/encodings.html

What I will point out though, that you can easily verify with a simple search, is that Shift-JIS can be represented losslessly in Unicode, and that this has been true for as long as Unicode CJK has existed.

The continued popularity of Shift-JIS is worth noting, but it is also important to note that its continued use is not a stable thing, and has been declining for two decades. The most popular websites in Japan no longer use it, and among smaller websites, the percentage that use it gets smaller each year. Secondly, there is absolutely no benefit to the user in terms of what can be encoded and distinguished, because it will get converted into Unicode at some stage in the pipeline anyway. Any modern text renderer used in a GUI toolkit will use Unicode internally. “Weird byte sequences to indicate Unicode characters” is really all that legacy encodings are from the perspective of new software. They are essentially just incompatible ways of specifying a subset of Unicode characters.

As for why it has held on so long, there are a few reasons. The Japanese tech world can be quite conservative and skeptical about new things. That is one factor. But I think another is that Japan was really a computing pioneer in the 1980s, and local standards ruled supreme. Compatibility was not a big concern, and even the mighty IBM PC and its clones barely made an impact there for a long time, as it was completely eclipsed by Japanese alternatives. Now, everyone is forced by our increasingly interconnected world to work on international standards, and I can’t help but feel that there is some resentment at not being able to just “do their own thing” anymore. Every time a new encoding extension is proposed, they have to present it to an organization that includes China, Korea, Taiwan, and Vietnam, who will scrutinize it. A few years ago JIS (the Japanese standards organization) actually proposed that each East Asian country should just get their own blocks in Unicode and they should be able to encode whatever they want with no input from others. Of course, none of the other East Asian countries took their proposal seriously. I wish I could find the proposal, because I hate saying stuff like this without sources, but I can tell you that it is buried somewhere among all of the proposal documents that you can find on the website of the Ideographic Research Group, which is the organization that East Asian countries participate in, and which is responsible for CJK encoding in Unicode. You might find it here with enough scrolling. I have to get on the subway though, so I have to end this comment here.

https://appsrv.cse.cuhk.edu.hk/~irg/


ASCII was always a multi-byte codeset. The characters were designed so that one could use overstrike to write things like a-acute (á) by writing a BS ' or ' BS a. This worked for most lower-case characters that have diacritical marks in Latin scripts. It didn't work for upper-case characters unless the terminal or printer understood this and had special fonts, but then, that's why Spanish historically doesn't require accenting upper-case letters.

This all goes back to typewriters, where this is how one typically wrote accented characters.

And most of this got lost. Though we still use overstrike for underline and bold in the terminal.


The English language evolved alongside the technology.

English used to be cursive-only, somewhat like Arabic. But the language evolved to fit into a modern world, and I think other written languages would benefit from a simplified "print" version as well.


Almost every language already has a print version, newspapers and books aren't a new thing.

Computers merely got it wrong several times, mostly for silly reasons.


> English gets a lot right with its writing system

More like English uses a relatively simple writing system which has the drawback of being unable to properly represent all the sounds of the language.

Like a lot of things in life it's a question of tradeoffs, not being right or wrong


It doesn’t matter. If C is supposed to be a language for solving real-world problems, its strings have to be real-world strings. Wishing writing were more elegant gets you nowhere.


Languages with diacritical marks have added them because they are beneficial in disambiguating different sounds.

English's lack of diacritical marks is why romanized Korean is so difficult to read, with digraphs used for monophthongs.



