
Should we not be using wchar_t strings in modern C?

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(int argc, char *argv[]) {
        wchar_t buf[100];
        setlocale(LC_ALL, "");  /* so mbsrtowcs uses the system encoding */
        wprintf(L"Hello, world!\ntype something>");
        if (fgetws(buf, 100, stdin))
            wprintf(L"You typed '%ls'\n", buf);
        if (argc > 1) {
            const char *s = argv[1];
            /* Convert char string to wchar_t string */
            size_t len = mbsrtowcs(buf, &s, 99, NULL);
            if (len != (size_t)-1) {
                buf[len] = 0;
                wprintf(L"argv[1] is '%ls'\n", buf);
            }
        }
        return 0;
    }
It's a pain, but the advantage is access to iswalpha() and friends.


"Modern" C is char16_t and char32_t. The old wchar_t type has many issues. You can read more here: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1286.pdf


char16_t and char32_t are useless. The C standard declares functions in <uchar.h> for converting them to and from char, but not wchar_t. The conversion to char may be lossy depending on the platform. No other interfaces use those types, and there's no portable, lossless path for converting them to and from wchar_t.
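To illustrate how little surface area there is: the only standard path in or out of char32_t goes through the locale's char encoding, via mbrtoc32()/c32rtomb() in <uchar.h>. A minimal sketch (the literal assumes a UTF-8 source encoding and locale):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <uchar.h>

    int main(void) {
        setlocale(LC_ALL, "");   /* use the system encoding */
        const char *s = "héllo";
        mbstate_t st = {0};
        char32_t c;
        size_t n = mbrtoc32(&c, s, strlen(s), &st);
        if (n != (size_t)-1 && n != (size_t)-2)
            printf("first codepoint: U+%04X\n", (unsigned)c);
        return 0;
    }

There is no c32rtowc() or wcrtoc32(); reaching wchar_t means round-tripping through char, which is only lossless if the locale encoding can represent every codepoint.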


There are various "solutions" to this problem of holding one "character" per instance of a type. If for some reason you don't want to use char * (for example, you want to find the length in characters of a multi-byte string), there's https://github.com/cls/libutf
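For the "length in characters" case specifically, plain libc can also do it by walking the string with mbrtowc(). A minimal sketch, assuming setlocale(LC_ALL, "") has already been called; the name mbslen is made up:

    #include <string.h>
    #include <wchar.h>

    /* Count characters (not bytes) in a multi-byte string.
       Returns (size_t)-1 on an invalid sequence. */
    size_t mbslen(const char *s) {
        mbstate_t st = {0};
        size_t count = 0;
        for (size_t len = strlen(s); len > 0; count++) {
            size_t n = mbrtowc(NULL, s, len, &st);
            if (n == (size_t)-1 || n == (size_t)-2)
                return (size_t)-1;
            s += n;
            len -= n;
        }
        return count;
    }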


The point is that wchar_t should be removed entirely. Assume it doesn't exist anymore and use char16_t and char32_t everywhere.


This is entirely undesirable. First of all, char16_t and char32_t are kinda useless, as there are no standard interfaces using them and no conversion functions to and from wchar_t.

Secondly, no, you're asking for a massive addition: two new versions of every interface that mentions wchar_t. That's a huge addition to standard libraries; it's error-prone and bloats things up. On top of that, you're asking for a rewrite of all software using wchar_t. And until everything is transitioned, which isn't going to happen, the standard libraries will be much larger.

The solution is rather to embrace wchar_t and fix it. All sensible and modern platforms, which is a premise of this article on modern POSIX functions, have a 32-bit wchar_t type. That's excellent. It's only Windows that, due to historical short-sightedness, has a 16-bit wchar_t. But writing portable C for native Windows is a losing game; the winning move is not to play. (Do see midipix, which is upcoming and will provide a new POSIX environment for Windows with musl and a 32-bit wchar_t.) In fact, a 16-bit wchar_t violates the C standard. The moment you give up broken platforms with a 16-bit wchar_t, wchar_t works as intended, and this is a non-problem. Embracing char16_t and char32_t is a worse problem and doesn't solve anything.


On Windows, sure. On other platforms UTF-8 is generally preferable (in my opinion).


Annoyingly, there is no simple user-accessible UTF-8 decoder in libc. The only standard way to use iswalpha() is to convert to wchar_t first.

One hack is to assume that bytes of UTF-8 encoded strings above 127 are all letters. It mostly works :-)
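For illustration, the hack might look like this (the name is made up):

    #include <ctype.h>

    /* Treat any non-ASCII byte as a letter. Wrong for non-ASCII
       punctuation and symbols, but often good enough for
       tokenizing identifiers in UTF-8 text. */
    static int isualpha(unsigned char c) {
        return c > 127 || isalpha(c);
    }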


> Annoyingly, there is no simple user accessible UTF-8 decoder in libc.

Am I misunderstanding you, because I've always thought that's what the mbtowc(3) family of functions was?


Well, you are right, but these functions are not terribly fun to use. Consider a parsing function which extracts an identifier. For ASCII it's:

    if (isalpha(*s)) {
        *d++ = *s++;
        while (isalnum(*s))
            *d++ = *s++;
    }
Using UTF-8 / Unicode should require only small changes:

    if (iswalpha(decode(&s))) {
        encode(&d, advance(&s));
        while (iswalnum(decode(&s)))
            encode(&d, advance(&s));
    }
For efficiency, don't decode twice; have the decoder return a pointer to the next sequence:

    if (iswalpha(c = utf8(&s, &n))) {
        encode(&d, c);
        s = n;
        while (iswalnum(c = utf8(&s, &n))) {
            encode(&d, c);
            s = n;
        }
    }
You should also be able to match a string inline:

    if ('A' == utf8(&s, &t) && 'B' == utf8(&t, &s) && 'C' == utf8(&s, &t)) // we have 'ABC'.
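The utf8() helper used here is hypothetical; a minimal sketch of such a decoder might look like the following (it omits the overlong and surrogate checks a production decoder needs):

    #include <stdint.h>

    /* Decode the UTF-8 sequence at *s, set *n to the byte after it,
       and return the codepoint (0xFFFD on an invalid sequence). */
    static uint32_t utf8(const char **s, const char **n) {
        const unsigned char *p = (const unsigned char *)*s;
        uint32_t c;
        int len;
        if (p[0] < 0x80)                { c = p[0];        len = 1; }
        else if ((p[0] & 0xE0) == 0xC0) { c = p[0] & 0x1F; len = 2; }
        else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; len = 3; }
        else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; len = 4; }
        else { *n = *s + 1; return 0xFFFD; }
        for (int i = 1; i < len; i++) {
            if ((p[i] & 0xC0) != 0x80) { *n = *s + i; return 0xFFFD; }
            c = (c << 6) | (p[i] & 0x3F);
        }
        *n = *s + len;
        return c;
    }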


mbtowc isn't necessarily thread-safe; it's better to recommend mbrtowc.


Just use setlocale(LC_ALL, "") in main, and use mbrtowc to translate from whatever the system encoding is into the wchar_t type. There's no need to bake assumptions about the system encoding into most programs.
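A minimal sketch of that pattern (the sample string assumes a UTF-8 source encoding):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>
    #include <wctype.h>

    int main(void) {
        setlocale(LC_ALL, "");          /* pick up the system encoding */
        const char *s = "naïve";
        mbstate_t st = {0};
        for (size_t len = strlen(s); len > 0; ) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, s, len, &st);
            if (n == (size_t)-1 || n == (size_t)-2)
                break;                  /* invalid or truncated input */
            printf("U+%04lX %s\n", (unsigned long)wc,
                   iswalpha(wc) ? "alpha" : "other");
            s += n;
            len -= n;
        }
        return 0;
    }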


No, it's important to understand the distinction between char and wchar_t. Both are relevant, but in different contexts. char should be considered a byte type to pass around UTF-8 with. This is the appropriate level for the large majority of common string operations: concatenation, outputting strings directly, parsers that only handle ASCII characters specially, and so on.

Those applications don't really care about the actual Unicode codepoints beyond ASCII. If you start to deal with the visual representation of strings, calculating the column for error messages, advanced Unicode-aware parsing, font rendering, and so on, then you do want to convert on the fly to wchar_t. mbsrtowcs and such are kinda bad, because they convert the whole string at once, which means an allocation that can fail in the unbounded case. It's usually sufficient to decode one wchar_t at a time with mbrtowc.

This way, char and wchar_t are not replacements for each other but complement each other, each being the better abstraction for different purposes. Now, the wide stdio functions are where things start to get a bit useless, because the regular char stdio functions are perfectly fine and the wide ones don't really play to the strengths of wchar_t.
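As a sketch of that decode-as-you-go pattern, here's a display-column counter along the lines described above; wcwidth() is POSIX rather than ISO C, the function name is made up, and setlocale(LC_ALL, "") is assumed to have been called:

    #include <string.h>
    #include <wchar.h>

    /* Keep the string as UTF-8 char data; decode one wchar_t at a
       time only where a codepoint-level view is needed. Returns the
       display width in columns, or -1 on invalid input. */
    int display_columns(const char *s) {
        mbstate_t st = {0};
        int cols = 0;
        for (size_t len = strlen(s); len > 0; ) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, s, len, &st);
            if (n == (size_t)-1 || n == (size_t)-2)
                return -1;
            int w = wcwidth(wc);        /* -1 for non-printable */
            if (w > 0)
                cols += w;
            s += n;
            len -= n;
        }
        return cols;
    }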



