
Should we not be using wchar_t strings in modern C?

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(int argc, char *argv[]) {
        wchar_t buf[100];
        setlocale(LC_ALL, "");  /* so mbsrtowcs uses the system encoding */
        wprintf(L"Hello, world!\ntype something>");
        if (fgetws(buf, 100, stdin))
            wprintf(L"You typed '%ls'\n", buf);
        if (argc > 1) {
            const char *s = argv[1];
            /* Convert char string to wchar_t string */
            size_t len = mbsrtowcs(buf, &s, 99, NULL);
            if (len != (size_t)-1) {
                buf[len] = 0;
                wprintf(L"argv[1] is '%ls'\n", buf);
            }
        }
        return 0;
    }
It's a pain, but the advantage is access to iswalpha() and friends.


"Modern" C is char16_t and char32_t. The old wchar_t type has many issues. You can read more here: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1286.pdf


char16_t and char32_t are useless. The C standard declares functions in <uchar.h> for converting them to and from char, but not wchar_t. The conversion to char may be lossy depending on the platform. No other interfaces use those types, and there's no portable, lossless path for converting them to and from wchar_t.
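To illustrate how little surface area there is: the only standard path in or out of char32_t goes through the locale's char encoding, via mbrtoc32()/c32rtomb() in <uchar.h>. A minimal sketch (the literal assumes a UTF-8 source encoding and locale):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <uchar.h>

    int main(void) {
        setlocale(LC_ALL, "");   /* use the system encoding */
        const char *s = "héllo";
        mbstate_t st = {0};
        char32_t c;
        size_t n = mbrtoc32(&c, s, strlen(s), &st);
        if (n != (size_t)-1 && n != (size_t)-2)
            printf("first codepoint: U+%04X\n", (unsigned)c);
        return 0;
    }

There is no c32rtowc() or wcrtoc32(); reaching wchar_t means round-tripping through char, which is only lossless if the locale encoding can represent every codepoint.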


There are various "solutions" to this problem of holding one "character" per instance of a type. If for some reason you don't want to use char * (for example, you want to find the length in characters of a multi-byte string), there's https://github.com/cls/libutf
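For the "length in characters" case specifically, plain libc can also do it by walking the string with mbrtowc(). A minimal sketch, assuming setlocale(LC_ALL, "") has already been called; the name mbslen is made up:

    #include <string.h>
    #include <wchar.h>

    /* Count characters (not bytes) in a multi-byte string.
       Returns (size_t)-1 on an invalid sequence. */
    size_t mbslen(const char *s) {
        mbstate_t st = {0};
        size_t count = 0;
        for (size_t len = strlen(s); len > 0; count++) {
            size_t n = mbrtowc(NULL, s, len, &st);
            if (n == (size_t)-1 || n == (size_t)-2)
                return (size_t)-1;
            s += n;
            len -= n;
        }
        return count;
    }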


The point is that wchar_t should be removed entirely. Assume it doesn't exist anymore and use char16_t and char32_t everywhere.


This is entirely undesirable. First of all, char16_t and char32_t are kinda useless, as there are no standard interfaces using them and no conversion functions to and from wchar_t.

Secondly, no, you're asking for a massive addition: two new versions of every interface that mentions wchar_t. That's a huge addition to standard libraries; it's error-prone and bloats things up. On top of that, you're asking for a rewrite of all software using wchar_t. And until everything is transitioned, which isn't going to happen, the standard libraries will be much larger.

The solution is rather to embrace wchar_t and fix it. All sensible and modern platforms, which is a premise of this article on modern POSIX functions, have a 32-bit wchar_t type. That's excellent. It's only Windows that, due to historical short-sightedness, has a 16-bit wchar_t. But writing portable C for native Windows is a losing game; the winning move is not to play. (Do see midipix, which is upcoming and will provide a new POSIX environment for Windows with musl and a 32-bit wchar_t.) In fact, a 16-bit wchar_t violates the C standard. The moment you give up broken platforms with a 16-bit wchar_t, wchar_t works as intended, and this is a non-problem. Embracing char16_t and char32_t is a worse problem and doesn't solve anything.


On Windows, sure. On other platforms UTF-8 is generally preferable (in my opinion).


Annoyingly, there is no simple user-accessible UTF-8 decoder in libc. The only standard way to use iswalpha() is to convert to wchar_t first.

One hack is to assume that bytes of UTF-8 encoded strings above 127 are all letters. It mostly works :-)
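For illustration, the hack might look like this (the name is made up):

    #include <ctype.h>

    /* Treat any non-ASCII byte as a letter. Wrong for non-ASCII
       punctuation and symbols, but often good enough for
       tokenizing identifiers in UTF-8 text. */
    static int isualpha(unsigned char c) {
        return c > 127 || isalpha(c);
    }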


> Annoyingly, there is no simple user accessible UTF-8 decoder in libc.

Am I misunderstanding you, because I've always thought that's what the mbtowc(3) family of functions was?


Well, you are right, but these functions are not terribly fun to use. Consider a parsing function which extracts an identifier. For ASCII it's:

    if (isalpha(*s)) {
        *d++ = *s++;
        while (isalnum(*s))
            *d++ = *s++;
    }
Using UTF-8 / Unicode should require only small changes:

    if (iswalpha(decode(&s))) {
        encode(&d, advance(&s));
        while (iswalnum(decode(&s)))
            encode(&d, advance(&s));
    }
For efficiency, don't decode twice; have the decoder return a pointer to the next sequence:

    if (iswalpha(c = utf8(&s, &n))) {
        encode(&d, c);
        s = n;
        while (iswalnum(c = utf8(&s, &n))) {
            encode(&d, c);
            s = n;
        }
    }
You should also be able to match a string inline:

    if ('A' == utf8(&s, &t) && 'B' == utf8(&t, &s) && 'C' == utf8(&s, &t)) // we have 'ABC'.
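The utf8() helper used here is hypothetical; a minimal sketch of such a decoder might look like the following (it omits the overlong and surrogate checks a production decoder needs):

    #include <stdint.h>

    /* Decode the UTF-8 sequence at *s, set *n to the byte after it,
       and return the codepoint (0xFFFD on an invalid sequence). */
    static uint32_t utf8(const char **s, const char **n) {
        const unsigned char *p = (const unsigned char *)*s;
        uint32_t c;
        int len;
        if (p[0] < 0x80)                { c = p[0];        len = 1; }
        else if ((p[0] & 0xE0) == 0xC0) { c = p[0] & 0x1F; len = 2; }
        else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; len = 3; }
        else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; len = 4; }
        else { *n = *s + 1; return 0xFFFD; }
        for (int i = 1; i < len; i++) {
            if ((p[i] & 0xC0) != 0x80) { *n = *s + i; return 0xFFFD; }
            c = (c << 6) | (p[i] & 0x3F);
        }
        *n = *s + len;
        return c;
    }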


mbtowc isn't necessarily thread-safe; it's better to recommend mbrtowc.


Just use setlocale(LC_ALL, "") in main, and use mbrtowc to translate from whatever the system encoding is into the wchar_t type. There's no need to bake assumptions about the system encoding into most programs.
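A minimal sketch of that pattern (the sample string assumes a UTF-8 source encoding):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>
    #include <wctype.h>

    int main(void) {
        setlocale(LC_ALL, "");          /* pick up the system encoding */
        const char *s = "naïve";
        mbstate_t st = {0};
        for (size_t len = strlen(s); len > 0; ) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, s, len, &st);
            if (n == (size_t)-1 || n == (size_t)-2)
                break;                  /* invalid or truncated input */
            printf("U+%04lX %s\n", (unsigned long)wc,
                   iswalpha(wc) ? "alpha" : "other");
            s += n;
            len -= n;
        }
        return 0;
    }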


No, it's important to understand the distinction between char and wchar_t. Both are relevant, but in different contexts. char should be considered a byte type to pass around UTF-8 with. This is the appropriate level for the large majority of common string operations: concatenation, outputting strings directly, parsers that only handle ASCII characters specially, and so on.

Those applications don't really care about the actual Unicode codepoints beyond ASCII. If you start to deal with the visual representation of strings, calculating the column for error messages, advanced Unicode-aware parsing, font rendering, and so on, then you do want to convert on the fly to wchar_t. mbsrtowcs and such are kinda bad, because they convert the whole string at once, which means an allocation that can fail in the unbounded case. It's usually sufficient to decode one wchar_t at a time with mbrtowc.

This way, char and wchar_t are not replacements for each other but complement each other, each being the better abstraction for different purposes. Now, the wide stdio functions are where things start to get a bit useless, because the regular char stdio functions are perfectly fine and the wide ones don't really play to the strengths of wchar_t.
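As a sketch of that decode-as-you-go pattern, here's a display-column counter along the lines described above; wcwidth() is POSIX rather than ISO C, the function name is made up, and setlocale(LC_ALL, "") is assumed to have been called:

    #include <string.h>
    #include <wchar.h>

    /* Keep the string as UTF-8 char data; decode one wchar_t at a
       time only where a codepoint-level view is needed. Returns the
       display width in columns, or -1 on invalid input. */
    int display_columns(const char *s) {
        mbstate_t st = {0};
        int cols = 0;
        for (size_t len = strlen(s); len > 0; ) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, s, len, &st);
            if (n == (size_t)-1 || n == (size_t)-2)
                return -1;
            int w = wcwidth(wc);        /* -1 for non-printable */
            if (w > 0)
                cols += w;
            s += n;
            len -= n;
        }
        return cols;
    }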



