In gamedev there is a simple rule: don't try to do any of that.
If it is text the game needs to show to the user, then every version of that text is a translated text. The programmer will never know whether the context or locale will require word order changes or anything more complicated. Just trust the translation team.
If text is coming from the user, then change the design until there's no need to 'convert' it. There are major issues just in showing the user back what they entered! The font for editing and the font for display could be different. Not even mentioning RTL and other issues.
Once people learn about localization, questions like 'why doesn't a programming language do this simple text operation' become a newcomer detector. :)
>Once people learn about localization, questions like 'why doesn't a programming language do this simple text operation' become a newcomer detector. :)
I think you are purposefully misinterpreting the question. They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.
What if your game needs to talk to a server and do some string manipulation in between requests? Are you really going to architect everything so that the client doesn't need to handle any of that ever?
>What if your game needs to talk to a server and do some string manipulation in between requests? Are you really going to architect everything so that the client doesn't need to handle any of that ever?
Of course! String manipulation on user-entered attributes like display names or chat messages is one millimeter away from good old SQL injection à la little Bobby Tables ('Robert'); DROP TABLE Students;--'). Never ever do that if you can avoid it. Every time someone 'just concatenates' two strings, e.g. to add the glyph that represents an input button, they create a bug that is both annoying and wrong. Games should use substitution patterns guided by the translation team, because there is no ASCII-only culture among the ~15 locales typically supported by big publishers.
There are exceptions, like platform-provided services to filter banned words in chat. And even there you don't have to do 'things with ASCII characters'. Yes, players will input unsupported symbols everywhere they can, so you need good replacement characters for those, and you need to fix support for popular emojis regularly. Communities expect that now.
> They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.
I'm confused now. The article specifically mentions issues with UTF-16 and UTF-32 and with Unicode characters outside the Basic Multilingual Plane (BMP).
I'm referring to the people who call case conversion in general "a simple text operation". Say you have an std::string and you want to make it lower case. If you assume it contains just ASCII, that's a simpler operation than if you assume it contains UTF-8, but C++ doesn't provide a single function that does either of them. A person can rightly complain that the former is basic functionality the language should include; personally, I would agree. And you could say "wow, doesn't this person realize that case conversion in Unicode is actually complicated? They must be really inexperienced." It could be that the other person really doesn't know about Unicode, or it could mean that you and they are thinking about entirely different problems and you're being judgemental a bit too eagerly.
For ASCII in C++, isn't there std::tolower / std::toupper? If you're not dealing with unsigned char types there isn't a simple case conversion function, but that's for a good reason, as the article lays out.
Those functions take and return single characters. What's missing is functions that operate on strings. You can use them in combination with std::transform(), but as the article points out, even if you're just dealing with ASCII you can easily do it wrong. I've been using C++ for over 20 years and I didn't know tolower() and toupper() were non-addressable. There's really no excuse for the library not having simple case conversion functions that operate on strings in-place.
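For reference, a minimal sketch of the 'correct' ASCII-only incantation (the helper name is mine): the lambda is needed precisely because the functions are non-addressable, and the unsigned char parameter avoids undefined behavior for bytes with the high bit set.

    #include <algorithm>
    #include <cctype>
    #include <string>

    // ASCII-only, in-place lowercasing via std::transform.
    void ascii_lower(std::string& s) {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    }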
std::transform() seems like overkill when you can just iterate over the string and modify it in place. And in my opinion, transform is way less readable than seeing a loop over some array with a single operation inside.
The article talks about wstrings for good reason. If you're converting narrow strings, you don't need to be this fancy. Just loop over the string and edit it in place.
If you are operating on wide strings, there is no suitable single solution, partly because wstring is a terrible type. It's different widths on different platforms, and no string encoding format uses a generalized wstring; they all have mandatory min/max character byte widths. So a wstring tells you nothing about the semantic representation of the actual encoded string contents.
The C++ stdlib could include a fully unicode aware string type set, and surrounding library. But personally I think C++ isn't the kind of language to provide an opinionated stdlib module for such a complex task. And there's no way to implement such a module without being very opinionated about something.
> The article talks about wstrings for good reason. If you're converting narrow strings, you don't need to be this fancy. Just loop over the string and edit it in place.
Since you mention narrow strings in the context of wstring, just to make sure... you can't convert a UTF-8 std::string character by character, in-place (in case that's what you meant).
7-bit ASCII code points are fine, but outside that it's not guaranteed that one UTF-8 byte converts into exactly one UTF-8 byte when converting case.
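A few concrete (and real) case mappings show how the byte count shifts:

    U+0131 "ı" (2 bytes in UTF-8) uppercases to "I" (1 byte)
    U+00DF "ß" (2 bytes) uppercases to "SS" (2 bytes, but now 2 code points)
    U+0130 "İ" (2 bytes) lowercases to "i" + U+0307 (3 bytes)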
Yeah, if you're using narrow strings for UTF-8 you're making a mistake. wstrings are also not a good representation because of the platform differences, unless you don't care about Windows, in which case it's fine but still not great semantically.
In most type definitions you cannot convert UTF-8 via simple iteration because the element type generally represents a code unit (or at best a code point), not a character.
You can have a library where UTF8 characters are a native type and code points are a mostly-hidden internal element. But again, that's highly opinionated for C++.
I'm not 100% sure what you mean by narrow string, but if you refer to std::string vs std::wstring, then std::string is perfectly fine for encoding UTF8, as that uses 8 bit code units which are guaranteed to fit in a char. On the other hand, std::wstring would be a bizarre choice for UTF8 on any platform.
It's not guaranteed for 7-bit ASCII either, because tolower/toupper are locale-dependent, and with tr_TR the lowercase of I (U+0049) is ı (U+0131, aka dotless i), which encodes as two bytes in UTF-8.
That's not ASCII then. It's byte-width compatible (to a certain degree, as you point out), but it's not ASCII. ASCII defines 128 code points and the handling of an escape character. It doesn't handle locales.
std::u8string, std::u16string and std::u32string are supposed to be the portable unicode string types, but a lot of machinery is missing and some that has been added has since been deprecated.
> there's no way to implement such a module without being very opinionated about something.
Indeed! Boost.Nowide[1] is such an opinionated library.
Yep, there's also ICU and utf8cpp, and many others. They all have trade-offs. So I just don't think the stdlib should cover this because there is no objectively best way to handle it.
I know I can simply iterate. The point is that it's a function that should be included, not that it's impossible without it. It's one of the most common string operations.
To me that feels like the JS community asking for left-pad or is-even in a module. Why have a dedicated function for 2 lines of code?
And it's a huge footgun. There is no ASCII type in C++. People will use the generalized tolower for UTF-8 encoded in narrow strings and have issues.
You could say the generalized tolower should support all the different width/encoding combinations and sort it out. But that's still highly opinionated as far as performance is concerned.
Generalized string conversion is a very complex problem and you really cannot simplify it in a way that will satisfy most C++ users. Just use ICU or utf8cpp if you want to do string operations and don't care what's going on under the hood. But even then I can't recommend just 1 library, because no perfect 3rd party library exists. A perfect first party library definitely could not exist.
>Why have a dedicated function for 2 lines of code?
Then why does std::max() exist?
>People will use the generalized tolower for UTF-8 encoded in narrow strings and have issues.
tolower() and toupper() work correctly on UTF-8 strings (at least under the C locale), because UTF-8 was specifically designed so that non-ASCII characters are represented by sequences of purely non-ASCII bytes, which per-byte tolower() then leaves untouched.
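A sketch of that property in action (helper name mine): an ASCII-only, locale-free lowercase that is safe to run over UTF-8 bytes.

    #include <string>

    // Lowercase only the ASCII letters A-Z. Every byte of a multibyte UTF-8
    // sequence has its high bit set, so it can never match A-Z and passes
    // through untouched.
    void ascii_lower_utf8_safe(std::string& s) {
        for (char& c : s) {
            if (c >= 'A' && c <= 'Z') c += 'a' - 'A';
        }
    }
    // On the UTF-8 bytes of "Straße" this yields "straße": only the 'S'
    // changes; the 0xC3 0x9F pair encoding 'ß' is left alone.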
>Generalized string conversion is a very complex
Hence why people who say C++ should have a tolower() that operates on strings are not asking for more complex Unicode support.
> There's really no excuse for the library not having simple case conversion functions that operate on strings in-place.
Could not agree more. Any time I touch C I want to scoop my brain out of my ear. So many simple, unbelievably common operations have fifty "best" ways to do them, when they should have one happy path that covers the 99% of use cases baked in. Nobody should ever have to seriously consider something as ridiculous as "is tolower addressable?".
std::tolower / std::toupper are rubbish functions that can't do proper Unicode but still pull in the bloated locale machinery for what should be a simple conditional integer addition if all you care about is ASCII. Neither has a valid use case; both should be marked [[deprecated]] and erased from all teaching materials.
> What if your game needs to talk to a server and do some string manipulation in between requests?
What conceivable reason would there be to ever need to do that? If the server takes commands in upper case, then have them in upper case from the start. If the server takes commands in lower case, have them in lower case from the start. If the server specifies that you need to invert the case of its response to use in the next request, find a server developed by someone not crazy.
Those sound exactly like the newcomer detectors GP was referring to. What you want is a case-insensitive string comparison, and outside ASCII that's not equivalent to just turning both strings to lowercase and checking equality (or doing a substring search, or whatever the task requires).
Nobody is thinking about converting the case of ASCII characters. To be thinking that, they would have to explicitly exclude most of the world's cultures from entering common names correctly. Restricting thought to ASCII is a lack of thought, not an active thought.
>If text is coming from the user, then change the design until there's no need to 'convert' it
In games, you can possibly get away with this. Most other people need to worry about things like string collation (locale-aware sorting) for user-supplied text.
> In gamedev there is a simple rule: don't try to do any of that.
I am not in gamedev, but I frequently have to develop middleware that takes in user-entered data and formats it in a way that will import into a 3rd-party system without errors. And that sometimes means changing the case of strings.
In my experience as a developer, this is a very, very common requirement.
Luckily I am not forced to use a low level language for any of my work. In C# I can simply do this: "hello world".ToUpper();
If you're putting data into a third-party system, you might want `ToUpperInvariant`, not `ToUpper`. (Just checking that you know the difference, because most people don't!)
The problem is that such third-party requirements are usually wrong.
Two decades ago some developer probably went "Yeah, obviously all names start with capital letters!", not realizing that there are in fact plenty of names which start with a lowercase letter. So they added an input validation test which checks for capitals, which meant everyone feeding that system had to format their data. A whole ecosystem grew around the format of the output of that system, and now you're suddenly rewriting the system and you run into weird and plain wrong capitalization requirements for no technical reason whatsoever.
Alternatively, the same but start with punch cards which predate ASCII and don't distinguish between uppercase and lowercase letters.
> In C# I can simply do this: "hello world".ToUpper()
... which does not work.
Take a look at the German word "straße" (street), for example. Until very recently the "ß" character did not have an uppercase variant, so a ToUpper would convert it to "STRASSE". This is a lossy operation, as the reverse isn't true: the lowercase variant of "KONGRESSSTRASSE" (congress street) is not "kongreßstraße" - it's supposed to be "Kongressstraße".
It can get even worse: the phrase "in Maßen" (in moderate amounts) naively has the uppercase variant "IN MASSEN" - but that means "in huge amounts"! In that case it is probably better to stick to "IN MASZEN".
And then there's Turkish, where the uppercase variant of the letter "i" is of course "İ" rather than "I" - note the dot.
So no, you cannot "simply" use ToUpper() / ToLower(). They might work well enough for basic ASCII in languages like English, but they have a habit of making a mess out of everything else. You're supposed to use CultureInfo.TextInfo.ToUpper() and explicitly specify what locale the text is in so that it can use the right converter. Which is of course essentially impossible in general-purpose text fields.
In practice that means your options are a) giving up on the concept of uppercase/lowercase conversion and just passing it as-is, or b) accepting that you are inevitably going to be silently corrupting your data.
> So no, you cannot "simply" use ToUpper() / ToLower(). They might work well enough for basic ASCII in languages like English, but they have a habit of making a mess out of everything else. You're supposed to use CultureInfo.TextInfo.ToUpper() and explicitly specify what locale the text is in so that it can use the right converter. Which is of course essentially impossible in general-purpose text fields.
Yes. Now try applying it to something like this very HN comment section, which is mixing words belonging to different cultures inside a single comment - and in some cases even inside the same word.
Sure, you can now do case conversion for a specific culture, but which one?
It's a lossy operation, but it does work. By this logic JPEG and MPEG don't work either, but we're watching those videos daily.
Yes, we can simply ToUpper(). We just can't ToUpper().ToLower(), but that's useless anyway: we have the original string if we need it, and it's fine if we don't.
The point is that what ToUpper does depends on locale AND Unicode version. Thus, for many applications it only appears to work, until it fails spectacularly in production.
I don't think you can say this is universally known in 'game dev'. In fact, just last week I stumbled over a game UI that let me enter a name for something, which it then displayed in uppercase.
Game UI is the place I’d expect to most likely come across horrific abuses of localization precisely because game UI is such a cobbled together layer of hacks on hacks.
> There are major issues just in showing the user back what they entered! The font for editing and the font for display could be different. Not even mentioning RTL and other issues.
Your web browser is doing it right now as you are reading this comment.
And it's not just Unity. Several exist for Unreal as well.
Why? Specifically because 2D layout and text rendering suck so much in game engines. What's ~50MB matter when you're shipping several GB of game assets?
It is issues like this that made me give up on C++. There are so many ways to do something and every way is freaking wrong!
An acceptable solution is given at the end of the article:
> If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
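For illustration, a minimal sketch of that route (UTF-16 buffers; pre-flighting and full error handling omitted):

    #include <unicode/ustring.h>
    #include <unicode/utypes.h>

    // Uppercase a UTF-16 string under an explicit locale; passing "tr"
    // would give the Turkish dotted/dotless I behavior discussed elsewhere
    // in this thread. Returns the result length, or -1 on error.
    int32_t to_upper(UChar* dest, int32_t destCapacity,
                     const UChar* src, int32_t srcLength,
                     const char* locale /* e.g. "tr", or "" for root */) {
        UErrorCode status = U_ZERO_ERROR;
        int32_t len = u_strToUpper(dest, destCapacity, src, srcLength,
                                   locale, &status);
        return U_SUCCESS(status) ? len : -1;
    }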
Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with itself more syntax and more complexity in the language. But as a user of C++ I don't need more syntax and more complexity in the language. But I do need more standard library functions that solves these ordinary real-world programming problems.
I don't think it's a C++ problem. You just can't transform anything developed in "ancient" times to be Unicode-aware in a single swoop.
On the other hand, libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.
Any tool which is old enough will have a thousand ways to do something. This is the inevitability of software and programming languages. In the domain of C++, which is mammoth in size now, everyone expects this huge pony to learn new tricks, but everybody has a different idea of what the "new tricks" should be, so more features get added on top of its already impressive and very long list of features and capabilities.
You want libICU built-in? There must be other folks who want that too. So you may need to find them and work with them to make your dream a reality.
So, C++ is doing fine. It's not that they omitted Unicode during the design phase; Unicode arrived later, and it has to be integrated by other means. This is what libraries are for.
>You just can't transform anything developed in "ancient" times to be Unicode-aware in a single swoop.
Even for Python it took well over a decade, and people still complain about the fact that they don't get to treat byte-sequences transparently as text any more - as if they want to wrestle with the `basestring` supertype, getting `UnicodeDecodeError` from an encoding operation or vice-versa, trying to guess the encoding of someone else's data instead of expecting it to be decoded on the other side....
But in C++ (and in C), you have the additional problem that the 8-bit integer type was named for the concept of a character of text, even though it clearly cannot actually represent any such thing. (Not to mention the whole bit about `char` being a separate type from both `signed char` and `unsigned char`, without defined signedness.)
Being developed in, and having to stay compatible with, ancient times is a real problem of C++.
The now-invalid assumptions couldn't have been avoided 50 years ago. Fixing them now in C++ is difficult or impossible, but still, the end result is a ton of brokenness baked into C++.
Languages developed in the 21st century typically have some at least half-decent Unicode support built-in. Unicode is big and complex, but there's a lot that a language can do to at least not silently destroy the encoding.
That explains why there are two functions, one for ASCII and one for Unicode. That doesn't explain why the Unicode functions are hard to use (per the article).
Almost no programming language, perhaps other than Swift, solved that problem. Just use the article's examples as test cases. It's just as wrong as the C++ version in the article, except it's wrong with nicer syntax.
Python's strings have uppercase, lowercase and case-folding methods that don't choke on this. They don't use UTF-16 internally (they can use UCS-2 for strings whose code points fit in that range; while a string might store code points from the surrogate-pair range, they're never interpreted as surrogate pairs, but instead as an error encoding so that e.g. invalid UTF-8 can be round-tripped), so they're never worried about surrogate pairs, and they know a few things about localized text casing.
C#'s "ToUpper" takes an optional CultureInfo argument if you want to play around with how to treat different languages. Again, solved problem decades ago.
Well, languages and conventions change. The € sign was added not that long ago and it was somewhat painful. The Chinese language uses a single character to refer to each chemical element, so when IUPAC names new elements, new characters get invented. Etc.
There shouldn't be an uppercase version of ß because there is no word in the German language that uses it as the first letter. The German language didn't think of allcaps. Please correct me if I am wrong. If written in uppercase it should be converted to SZ or the new uppercase ß… which my iPhone doesn't have… and
converting anything to uppercase SS isn't something Germany wants…
> There shouldn't be an uppercase version of ß because there is no word in the German language that uses it as the first letter. The German language didn't think of allcaps.
Allcaps (and smallcaps) has always existed in signage everywhere. Before the computing age, letters were just arbitrary metal stamps -- and just whatever you could draw before that. Historically, language was not as standardized as it is today.
[Notice that this is in fact entirely impossible with the naive strategy since Greek cares about position of symbols]
Some of the latter examples aren't cases where a programming language or library should just "do the right thing" but cases of ambiguity where you need locale information to decide what's appropriate, which isn't "just as wrong as the C++ version"; it's a whole other problem. It isn't wrong to capitalise A-acute as a capital A-acute, it's just not always appropriate depending on the locale.
For display it doesn't matter, but most other applications really want some kind of normalization, which does much, much more, so having a convenient to_lowercase() doesn't buy you as much as you think and can be actively misleading.
That doesn’t prevent adding a new function that converts an entire string to upper or lowercase in a Unicode aware way.
What would be wrong with adding new correct functions to the standard library to make this easy? There are already namespaces in C++ so you don’t even have to worry about collisions.
That’s the problem I see. It’s fine if you have a history of stuff that’s not that great in hindsight. But what’s wrong with having a better standard library going forward?
The reason that wasn't done is because Unicode is not really in older C++ standards. I think it may have been added to C++23 but I am not familiar with that. There are many partial solutions in older C++ but if you want to do it well then you need to get a library for it from somewhere, or else (possibly) wait for a new standard.
Unicode and character encodings are pretty esoteric. So are fonts. The stuff is technically everywhere and fundamental, but there are many encodings, technical details, etc. And most programmers only care about one language, or else only use UTF-8 with the most basic chars (the ones that agree with ASCII). That isn't terrible. You only need what you actually need. Most programs don't strictly have to be built for multiple random languages, and there is kind of a standard methodology to learn before you can do that.
I politely disagree. None of the programming languages which started integrating Unicode were targeting everything from bare metal to GUI, including embedded and OS development, at the same time.
C++ has a far larger target area compared to other programming languages. There are widely used libraries which compile correctly on PDP-11s, even while being updated constantly.
You can't just say "I'll be just making everything Unicode aware, backwards compatibility be damned, eh".
But we don't have to make everything Unicode aware. Backward compatibility is indeed very important in C++. Like you rightly said, it still has to work for PDP-11 without breaking anything.
But the C++ overlords could always add a new type that is Unicode-aware. Converting one Unicode string to another is a purely in-memory, in-CPU operation. It does not need any I/O and it does not need any interaction with peripherals. So one can dream that such a type along with its conversion routines could be added to an updated standard library without breaking existing code that compiles correctly on PDP-11s.
> Converting one Unicode string to another is a purely in-memory, in-CPU operation.
...but it's a complex operation. This is what libICU is mostly for. You can't just look up a single table and convert one string to another the way you can with the ASCII table or any other simple encoding.
Germans have their ß to SS (or to capital ẞ, depending on the year), Turkish has its ı/I and i/İ pairs, and tons of other languages have other rules.
Especially these I/ı and İ/i pairs break tons of applications in very unexpected ways. I don't remember how many bugs I have reported, and how many workarounds I have implemented in my systems.
Adding a type is nice, but the surrounding machinery is so big that it brings tons of work with it. Unicode is such a complicated system that I read you can even need two UTF-16 code units (4 bytes in total) to encode a single character. This is insane (as in complexity; I guess they have their reasons).
> Unicode is such a complicated system that I read you can even need two UTF-16 code units (4 bytes in total) to encode a single character. This is insane (as in complexity; I guess they have their reasons).
Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.
It's because Unicode doesn't allow for language switching.
It takes up to eight bytes per character in Unicode if you want to support both Chinese and Japanese in a single font using IVS (and I don't think there's any font that actually supports this).
AFAICS (as far as I can search), the Simplified (PRC) and Traditional (Taiwan) Chinese encodings are respectively called GB2312 and Big5, and they're both two-byte encodings with good practical coverage. The same applies to Japanese Shift_JIS. If e.g. :flag_cc: were allowed as a start-of-language marker, one could theoretically cut that back down to two bytes per character without losing much, actually improving language support.
The number of characters is not the problem; the mess due to legacy compatibility is. Case folding and normalization could be much simpler if the codepoints were laid out with that in mind. There's also the fact that Unicode can't make up its mind whether it wants to encode glyphs (Turkish I and i, Han unification), semantic characters (e.g. Cyrillic vs. Latin letters), or just "ideas" (emojis).
I mean, I already know some Unicode internals and linguistics (since I developed a language-specific compression algorithm back in the day), but I have never seen a single character requiring four bytes (and I know about emoji chaining for skin color, etc.).
So seeing this just moved the complexity of Unicode one notch up in my head, and I respect the people who designed it and made it work. It was not whining or complaining of any sort. :)
Lots of emoji are outside the Basic Multilingual Plane and need 4 bytes in UTF-8 and UTF-16. That's without going into skin color and other modifiers and combinations.
Java was built from scratch as a heavyweight language with a whole portability layer that C++ does not have. Also, libraries have been around to do this stuff in C++, but maybe some people thought it better not to require C++ itself to support Unicode.
Until the mid-2000s there was no certainty that Unicode would eventually defeat its competitors. In reality it hasn't fully won even now: GB2312 and TRON still prevail locally, and IBM still jogs along with EBCDIC. But in its early days nobody was reasonably sure, and the Java attempt could have failed as well. (More so since the Java approach of UCS-2 was wrong - already commented on nearby.)
Indeed, ICU as well, and then they all moved to UTF-16, which, again, in the long term lost to UTF-8. My point is that committing to a specific Unicode design 30 years ago was not, in retrospect, necessarily a good idea.
By not committing to UCS-2 early, C++ left the road open to UTF-8. I'll concede that UTF-8 has been the clear winner for more than a decade and that C++ is well past the point where it should have at least basic built-in support. The problem is that at least one important C++ platform only very recently added full support for the encoding in its native API.
"no excuse" -- I would respectfully disagree here. There are lots of very smart people who have worked on Qt. Really, some insanely good C++ programmers have worked on that project. I have no doubt that they have discussed changing class QString to use UTF-8 internally. To be clear, probably QChar would also need to change, or a new class (QChar8?) would be needed, in parallel to QChar. I guess they concluded the API breakage would be too severe. I assume Java and Win32/DotNet decided the same. Finally, you can Google for old mailing list discussions about QString using UTF-16. Many before have asked "can we just change to UTF-8?".
Java embraced Unicode, and ended up with a mess as Unicode changed underneath it.
You can actually end up in a cleaner state in C++, as there is no obligation to use the standard library string classes, whereas in Java using the built-in String is pretty much required.
Java has 16-bit character types. It is in no way better at modern Unicode than C++ while being needlessly less efficient for mostly-ASCII text like XML-like markup.
> Any tool which is old enough will have a thousand ways to do something.
Only because of the strange desire of programmers to never stop. Not every program is a never ending story. Most are short stories their authors bludgeon into a novel.
Programming languages bloat into stupidity for the same reason. Nothing is ever removed. Programmers need editors.
So how do you design a language that accommodates both the people who need a codebase to be stable for decades and the people who want the bleeding edge all the time, backwards compatibility be damned?
You don't. Any language that tries to do both turns into an unusable abomination like C++. Good languages are stable and the bleeding edge is just the "new thing" and not necessarily better than the old thing.
> There are so many ways to do something and every way is freaking wrong!
That's life! The perfect way does not exist. The best you can do is be aware of the tradeoffs, and languages like C++ absolutely throw them in your face at every single opportunity. It's fatiguing, and writing in javascript or python allows us to uphold the facade that everything is okay and that we don't have to worry about a thing.
JS and Python are still old enough to have been created when Unicode was in its infancy, so they have their own share of problems from using UCS-2 (such as indexing strings by what is now a UTF-16 code unit, rather than by a codepoint or a grapheme cluster).
Swift has been developed in the modern times, and it's able to tackle Unicode properly, e.g. makes distinction between codepoints and grapheme clusters, and steers users away from random-access indexing and having a single (incorrect) notion of a string length.
Well, the only time doing str lower runs into Unicode locale-awareness problems is when you do it on user input, like names.
How about you just don't? If it's a constant in your code, you probably use ASCII anyway or can do a static mapping. If it's user input -- just don't str lower / str upper it.
Yes, except when it is not your choice. If the requirements are to display some strings in lower/uppercase then you need to find a way to do that. That doesn't have to be using the standard library though.
If anything, it should be harder to add things to the language. Too many new additions have been half-arsed and needed to be changed or deprecated soon after.
Yes, and significantly smaller libraries have had a hard time getting into the standard. Getting the equivalent of ICU in would be almost impossible. And good luck keeping it up to date.
> Makes you wonder why this isn't part of the C++ standard library itself.
Because the C++ standard library cares about binary size and backwards compatibility, both of which are incompatible with a full Unicode implementation. Putting this in the stdlib means everyone has to pay for it even when they don't need it.
Libraries are fine, not everything needs to be defined by the language itself.
> Makes you wonder why this isn't part of the C++ standard library itself.
Plainly, there's no need if there is a separate, easily attachable library (with a permissive license). What C++ had to do - provide character (char{8,16,32}_t) and string types - it has done.
As a C++ dev, I have never run into the problem the post is describing. Upper and lowercase conversion has always worked just fine. Though then again, I don't fiddle with mixed unicode and non-unicode situations.
That is neither up-casing nor down-casing, but (de)capitalization, which is a significantly more complex task (one that ultimately requires up- or down-casing, but a whole lot more before then).
I am not aware of a Unicode concept of "the Latin letter o followed by an apostrophe followed by another Latin letter". Unicode would identify the glyphs for such a concept, but I don't see how Unicode is involved in any way in the process of deciding what "capitalized o'reilly" means.
The fact that the standard library works against you doesn't help (tolower takes an int, but only kind of works (sometimes) correctly on unsigned char, and wchar_t is implicitly promoted to int).
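Concretely, the trap looks like this:

    #include <cctype>

    void example(char c) {
        // std::tolower(c);  // UB if c is negative: the int argument must be
        //                   // representable as unsigned char, or be EOF.
        std::tolower(static_cast<unsigned char>(c));  // well-defined
    }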
tolower is in the std namespace but is actually just part of the C89 standard, meaning it predates both UTF-8 and UTF-16. Is the alternative that it should be made unusable, and more existing code broken? A modern user has to include one of the c-prefixed headers to use it, which already hints that 'here be dragons'.
But there are always dragons. It's strings. The mere assumption that they can be transformed int-by-int, irrespective of encoding, is wrong. As is the assumption that a sensible transformation to lower case without error handling exists.
> Is the alternative that it should be made unusable, and more existing code broken?
It should be marked [[deprecated]], yes. There is no good reason to use std::tolower/toupper anywhere - they can neither do Unicode properly nor are they anywhere close to efficient for ASCII. And their behavior depends on the process-global locale.
The key takeaway here is that you can't correctly process a string if you don't know what language it's in. That includes variants of the same language with different rules, e.g. en-US and en-GB, or es-MX and es-ES.
If you are handling multilingual text the locale is mandatory metadata.
I thought the German language deprecated the use of ß years ago, no? I learned German for a year and that's what the teacher told us, but maybe it's not the whole story.
If it could be done, it would already have been done. (It definitely won't happen on an echo-change day like today, either. ;))
Contra my comrade's comment, Esperanto orthography is firmly European, and so retains European-style casing distinctions; every sound thus still has two letters -- or at least two codepoints.
(There aren't any eszett-esque digraphs, but that's not saying much.)
Language is just part of the problem. Unicode lets you store text as entered, but what you do with that text completely depends on what your problem domain is. When you're writing software to validate that the name on someone's ID matches that on a ticket, you're probably going to normalise that name to your (customer's) locale rather than render each name in the locale it was originally written in. As long as you keep your locale settings consistent and don't do bad stuff like "iterate over characters and individually transform them", you're probably fine, unless your problem domain calls for something else.
If you're printing a name, you're probably printing the name for the current user, not for the person who entered it at some point. If you're going to try to convert back like that, you also need to store a timestamp with every string in case a language changes its rules (such as permitting ẞ instead of SS when capitalising ß). And even then, someone might intend to use the new spelling rules, or they might not, who knows!
This article probably boils down to "programmers don't realise graphemes aren't characters and characters aren't bytes even though they usually are in US English". The core problem, "text processing looks easy as long as you only look at your own language", is one that doesn't just affect computers.
Your best bet is to just avoid the entire problem by not processing input further than basic input sanitisation, such as removing whitespace prefixes/suffixes and maybe stripping out invalid unicode so it can't be used as a weird stored attack.
islower is actually supposed to account for the user's "locale", which includes their language.
The key takeway is that lowercasing a string needs to be done on the whole string, not individual characters, even if std::string had a way to iterate over codepoints instead of bytes (or code units, in the case of wstring).
And there isn't a standard way to do that; you either need to use a platform-specific API, like the Windows function mentioned, or use a library like ICU.
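A hedged sketch of the Windows route (assuming the function meant is LCMapStringEx), using the usual size-then-fill double call:

    #include <windows.h>
    #include <string>

    // Lowercase a whole UTF-16 string under an explicit locale.
    std::wstring to_lower_win(const std::wstring& s,
                              const wchar_t* locale = LOCALE_NAME_INVARIANT) {
        int needed = LCMapStringEx(locale, LCMAP_LOWERCASE, s.c_str(),
                                   static_cast<int>(s.size()),
                                   nullptr, 0, nullptr, nullptr, 0);
        if (needed <= 0) return {};
        std::wstring out(needed, L'\0');
        LCMapStringEx(locale, LCMAP_LOWERCASE, s.c_str(),
                      static_cast<int>(s.size()),
                      &out[0], needed, nullptr, nullptr, 0);
        return out;
    }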
> That said, 99% time when doing upper- or lowercase operation you're interested just in the 7-bit ASCII range of characters.
I think it's more the exact opposite.
The only times I'm dealing with 7-bit ASCII is for internal identifiers like variable names or API endpoints. Which is a lot of the time, but I can't ever think of when I've needed my code to change their case. It might literally be never.
On the other hand, needing to switch between upper, lower, and title case happens all the time, always with people's names and article titles and product names and whatnot. Which are never in ASCII because this isn't 1990.
> Which are never in ASCII because this isn't 1990.
This is a very silly statement. I'm willing to believe that you have lots of cases where those things are outside the ASCII range. Perhaps even most of the cases, depending on where you live. But I do not believe for one second that it never happens.
Never stored in ASCII, never limited to ASCII. They're UTF-8, usually.
If somebody's name happens to fit into ASCII that's irrelevant because it's not guaranteed, so you can never blindly do an ASCII case conversion.
For text data meant for users, I literally cannot remember the last time I used a string in ASCII format as opposed to UTF-8 (or UTF-16 in JS). It's certainly over a decade ago.
So yes, when I say never, I literally mean never. Nothing "very silly" about it, sorry.
(Again, excepting identifiers, where case conversion is not generally applicable.)
And you could argue that if the internal identifiers need to be capitalized or lower-cased, you've already lost.
On an enterprise app these little string manipulations are a drop in the bucket. In a game they might not be. Sort that stuff out at compile time, or commit time.
You can't always control the case you get but often you can not care about anything outside ASCII. Scripts and configuration or text-based data formats are common examples.
Yes please, keep making software that mangles my actual last name at every step of the way. 99% of the world loves it when you only care about the USA.
Usually they'll accept it, but some parts of the backend are still running code from the 60's.
So you get your name rendered properly on the web interface and in most core features, but one day you wander off the beaten path by, say, requesting some insurance contract, and you'll see your name at the top with some characters mangled, depending on what your name's like. Mine is just accented Latin characters, so it usually drops the accents; not sure how it would work if your name was in an entirely different alphabet.
Guess what, I'm part of this 70% and I also work in a bank and I know exactly how.
Not a single letter in my name (any of them) can be represented with ASCII. When it is represented in UTF-8, most of the people who have to see it can't read it anyway.
So my identity document issued by the country which doesn't use Latin alphabet includes ASCII-representation of my name in addition to canonical form in Ukrainian Cyrillic. That ASCII-rendering is happily accepted by all kinds of systems that only speak ASCII.
People still can't pronounce it and it got misspelled like yesterday when dictated over the phone.
Now, regarding the accents: it's illegal not to support them per GDPR (as per case law, discussed here a few years ago).
Why can't these people understand that, for that 70% of the world, ASCII is "the computer language", not English, and UTF-8 is "whatever soup that only works inside files and forms and can't be manipulated programmatically"?
Maybe it needs to be communicated more often, like way more often, until it sticks.
It’s totally reasonable to assume your users are in the US if your business only sells to people in the US. I work in the health insurance sector; there’s absolutely no chance my company ever sells these products internationally. We can’t even sell them in every state.
It's reasonable to assume that all users can deal with having to encode their names in 7-bit ASCII. Otherwise you might as well demand that computer systems need to support arbitrary drawings in the name field at which point you might as well not have a name field at all because even most humans won't be able to deal with what you want to put in there.
No, when you are doing string manipulation you are almost never interested in just the seven-bit ASCII range, as there is almost no language that can be written using just that.
Right. That’s why I still get mail with my name mangled and my street name barely recognisable. Because I’m in the 1%. Too bad for me…
In all seriousness, though, in the real world ASCII works only for a subset of a handful of languages. The vast majority of the population does not read or write any English in their day to day lives. As far as end users are concerned, you should probably swap your percentages.
ASCII is mostly fine within your programs like the parser you mention in your other comment. But even then, it’s better if a Chinese user name does not break your reporting or logging systems or your parser, so it’s still a good idea to take Unicode seriously. Otherwise, anything that comes from a user or gets out of the program needs to behave.
I said use a Unicode library if input data is actual human language. Which names and addresses are.
The 99% case being ASCII data generated by other software of unknown provenance. (Or sometimes by humans, but it's still data for machines, not for humans.)
User-provided data, yes, but also data where you can treat non-ASCII bytes as garbage in -> garbage out. E.g. the config file might be typed by a human but if you need to support case-insensitive keys you still don't need to worry about Unicode.
"The vast majority of the population does not read or write any English in their day to day lives."
This is doubtful: https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
While English speakers are not a majority, it is the most popular language.
And one should also note that given English is the lingua franca of programming, I'd suspect that English as a second language is actually a majority for programmers.
So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.
> "The vast majority of the population does not read or write any English in their day to day lives." This is doubtful: https://en.wikipedia.org/wiki/List_of_languages_by_total_num... While English speakers are not a majority, it is the most popular language.
That is the number of English-speaking people, as in people who can speak English. Not necessarily people who use it every day. In any case, ASCII only works for a subset of even English if you ignore all loan words and diacritics in things like proper names.
> So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.
That would not be much code at all, given that most code deals with user interfaces or user-provided data. That is the point: it’s not because the code is in basic English simplified enough to fit in ASCII that you can ignore Unicode and don’t need to consider text encoding.
> That’s why I still get mail with my name mangled
Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.
> it’s better if a Chinese user name does not break your reporting or logging systems
You should not be casually dumping Chinese usernames into logs without warnings; in fact, you should not be using Chinese characters for usernames at all. Lots of Chinese online services exclusively use numeric IDs and e-mails as login IDs. "Usernames in natural human language" is a valid concept only in the ASCII cultural sphere.
> Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.
That is not always possible, and the translation from a local writing system to ASCII is often ambiguous rather than unique. There really is no excuse for this sort of thinking. Even American programmers have to realise at some point that programs serve some purpose, and that their failure to represent how the world works is just that: a failure. There is no excuse for programs not to support UTF-8 from user input to any output, including all the processing in between.
It's funny how software developers live in bubbles so much. Whether you deal with human language a lot or almost not at all depends entirely on your specific domain. Anyone working on user interfaces of any kind must accommodate for proper encoding, for example; that includes pretty much every line-of-business app out there, which is a lot of code.
Every search feature everywhere has to be case-insensitive or it's unusable. Search seems like a pretty ubiquitous feature in a lot of software, and has to work regardless of locale/encoding.
Converting string case is almost never something you want to do for text that's displayed to the end user, but there are many situations where you need to do it internally. Generally when the spec is case insensitive, but you still need to verify or organize things using string comparison.
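A sketch of that internal use (names mine): comparing spec-defined ASCII tokens such as HTTP header names, deliberately locale-free so the behavior never changes under tr_TR and friends.

    #include <cstddef>
    #include <string_view>

    constexpr char ascii_lower(char c) {
        return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + ('a' - 'A')) : c;
    }

    constexpr bool ascii_iequals(std::string_view a, std::string_view b) {
        if (a.size() != b.size()) return false;
        for (std::size_t i = 0; i < a.size(); ++i)
            if (ascii_lower(a[i]) != ascii_lower(b[i])) return false;
        return true;
    }

    // ascii_iequals("Content-Length", "content-length") == true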
Ah, I see, we disagree on what "human language" is. An abbreviation like HTML and its different capitalisations sound a lot like a feature of human language to me.
Is this a serious argument? Humans don't directly use HTML to communicate with each other. It's a document markup language rendered by user agents, developed against a specification.
Markup languages, and SGML in particular, absolutely are designed for digital text communication by humans and to be written using plain text editors; that's kind of the entire point of avoiding binary data constructs.
And to GP: SGML/HTML actually has a facility to define uppercasing rules beyond ASCII, namely the LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR options in the SYNTAX NAMING section of the SGML declaration, introduced in the "Extended Naming Rules" revision of ISO 8879 (the SGML standard, cf. https://sgmljs.net/docs/sgmlrefman.html). Like basically everything else on this level, these rules are still used by HTML 5 to this date; in particular, while element names can contain arbitrary characters, only those in the IRV (ASCII) get case-folded for canonicalization.
HTML is a text-based medium. But that doesn't make it a human language. Some human languages are not text-based. And some text is not a human language.
ANSI C was designed to be written by humans using a plain text editor. That doesn't make it a human language.
But but, I want to have a custom web component and register it under my own name, which can only be properly written in Ukrainian Cyrillic. How dare you not let me have it.
I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.
The other normal cases of string usage are file paths and user interface elements, and the needed operations can be done with simple string functions; even in UTF-8 encoding, the characters you care about are in the ASCII range.
With file paths the manipulations that you're most often doing is path based so you only care about '/', '\', ':', and '.' ASCII characters.
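A sketch of that kind of path manipulation (function name mine): only the ASCII separators are inspected, and any UTF-8 bytes in between pass through untouched.

    #include <string>

    // Replace a path's extension, looking only at ASCII '/', '\' and '.'.
    std::string replace_extension(const std::string& path,
                                  const std::string& new_ext /* e.g. ".bak" */) {
        auto slash = path.find_last_of("/\\");
        auto dot = path.find_last_of('.');
        if (dot == std::string::npos ||
            (slash != std::string::npos && dot < slash))
            return path + new_ext;  // no extension present: just append
        return path.substr(0, dot) + new_ext;
    }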
With user interface elements you're likely to be using them as just static data and only substituting values into placeholders when necessary.
> I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.
Why would you argue that? In my experience it's about formatting things that are addressed to the user, where the hardest and most annoying localization problems matter a lot. That includes sorting the last name "van den Berg" just after "Bakker", stylizing it as "Berg, van den", and making sure this capitalization is correct and not "Van Den Berg". There is no built in standard library function in any language that does any of that. It's so much larger than ascii and even larger than unicode.
Another user said that the main takeaway is that you can't process strings until you know their language (locale), and that is exactly correct.
I would maintain that your program has more string manipulation for error messages and logging than for generating localised formatted names.
Further, I do say that if you're creating text to present to the user, the most common operation is replacement of some field in pre-defined text.
In your case I would design it so that the correctly capitalised first name, surname, and the variations of those needed for sorting are generated at the data-entry point (manually or automatically) and then just used as-is in user-facing text generation. Then the only string operation needed is the replacement of placeholders, which fmt and the standard library provide. This uses more memory and storage, but those are cheap now.
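A sketch of that placeholder-only operation (using C++20 std::format as a stand-in; fmt works the same way):

    #include <format>
    #include <string>

    // display_name was produced, correctly cased, at data-entry time;
    // presentation only substitutes it into a pre-translated template.
    std::string greeting(const std::string& display_name) {
        return std::format("Welcome back, {}!", display_name);
    }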
I agree, but the logging formatters don't really do much beyond trivially pasting in placeholders.
And as for data entry... Maybe in an ideal world. In the current world, marred by importing previously mangled datasets, a common solution in the few companies I've worked at is to just not do anything, which leaves ugly edges, yet is "good enough".
File paths are scary. The last I checked (which is admittedly a while ago), Windows didn't for example care about correct UTF-16 surrogate pairs at all, it'd happily accept invalid UTF-16 strings.
So use standard string processing libraries on path names at your own peril.
It's a good idea to consider file paths as a bag of bytes.
IIRC, the FAT filesystem (before Windows 95) allowed lowercase letters, but there's a layer in the filesystem driver that converted everything to uppercase, e.g. if you did the command "more readme.txt", the more command would ask the filesystem for "readme.txt" and it would search for "README.TXT" in the file allocation table.
I think I once hex-edited the file allocation table to change a filename to have a lowercase letter (or maybe it was disk corruption); trying to delete that file didn't work, because the delete would look for "FOO" and couldn't find it, since the file was named "FOo".
> It's a good idea to consider file paths as a bag of bytes
(Nitpick: sequence of bytes)
Also very limiting. If you do that, you can’t, for example, show a file name to the user as a string or easily use a shell to process data in your file system (do you type “/bin” or “\x2F\x62\x69\x6E”?)
Unix, from the start, claimed file names were byte sequences, yet assumed many of those bytes encoded ASCII.
That's what I mean: you treat filesystem paths as bags of bytes separated by known ASCII characters, as the only path manipulation you generally need is appending a path, removing a path, or changing an extension - things that only care about those ASCII characters. You only modify the path strings at those known characters and leave everything in between as-is (with some exceptions, using OS-specific API functions as needed).
Just using UTF-8 for the user name at all is problematic. That has been a major PSA item for Windows users in my language literally since the 90s, and it still is. Microsoft switched home folder names from the Microsoft Account username to a shortened form of the user's email for that reason.
Yes and most importantly, that interpretation is for display purposes ONLY. If your file manager won't let me delete a file because the name includes invalid UTF-16/UTF-8 then it is simply broken.
Better to just convert WTF-16 (Windows filenames are not guaranteed to be valid UTF-16) to/from WTF-8 at the API boundary and then do the same processing internally on all platforms.
UNIX land is even worse in international language support.
As in, there isn't even anything in POSIX at the level other operating systems support for localisation.
Yes, there is some locale stuff, but not enough for everything, hence why every modern programming language ships this as part of its standard library.
Wow, I came here to write exactly that, and it's heartening to see that I am not crazy.
Just reading the title, with microsoft.com in brackets, I knew two things:
1. It would be written by Raymond Chen
2. That article is going to be awesome
> From the article: "I find it quaint that Unicode character names are ALL IN CAPITAL LETTERS, in case you need to put them in a Baudot telegram or something."
I had to do that. When we had our steampunk telegraph office at steampunk conventions [1], people could text in a message via SMS, it would be printed on a Model 14 or 15 Teletype, put in an envelope, and hand-delivered. People would use emoji in messages, and the device could only print Baudot, or International Telegraphic Alphabet #2, which is upper case only with some symbols.
Emoji translation would cause the machine to hammer out the emoji's official Unicode name, in all capitals.
Handle text in two ways: either it's controlled by you and you can do simple, efficient, and naive processing, or it's not (it's translated resources, or user input) and you can't.
For the former case, you don't need any complex logic. A very typical example would be: I'm serializing a field or constructing a URL, so I want the variable name "Someproperty" as a lowercase string. The lowercase transform is completely naive. I know exactly what the range of possible characters is, and they aren't going to be Turkish or emoji, not least because I have asserted they won't be. And THIS is what the regular programming functions for upper/lower case are for. They are important, and they are most often correct. Because for all the other cases (i18n, user input, ...) you probably don't want to do toUpper/toLower at all to begin with!
Example: if you present a message to the user from resources, so your code is translate("USER_DIALOG_QUESTION_ABOUT_FISH"), which you look up knowing it will be in sentence case but need to present as uppercase, what will you do? Here you likely can't, and shouldn't, do toUpper(translate(resourceKey)). Just use two resources if you want correctly transformed text. The toUpper function isn't made for this.
Trying to use a complex i18n-ready toUpper/toLower only helps part of the way. It still might not understand whether a double S should be contracted, or whether something is a proper noun and must stay capitalized. So it adds complexity and still isn't correct. Just use two resources!
> For the former case, you don't need any complex logic. A very typical example would be: I'm serializing a field or constructing a URL, so I want the variable name "Someproperty" as a lowercase string. The lowercase transform is completely naive. I know exactly what the range of possible characters is, and they aren't going to be Turkish or emoji, not least because I have asserted they won't be. And THIS is what the regular programming functions for upper/lower case are for. They are important, and they are most often correct. Because for all the other cases (i18n, user input, ...) you probably don't want to do toUpper/toLower at all to begin with!
C++ std::tolower/toupper (which are really just C tolower/toupper) are the wrong tool for that too, though, because they depend on the process locale, which makes them a) horribly inefficient and b) prone to blow your program up in interesting ways on customer systems. Not quite as bad as the locale-dependent standard number-parsing functions that want '.' in some locales and ',' in others, but they still should never be used.
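For illustration, here is what careful use of std::tolower has to look like, and where the fragility comes from (a sketch):

    #include <cctype>
    #include <clocale>
    #include <iostream>

    int main() {
        // The result of std::tolower depends on the global C locale, which
        // any part of the program can change at runtime.
        std::setlocale(LC_CTYPE, "C");

        // The argument must be representable as unsigned char (or be EOF);
        // passing a plain char that happens to be negative is undefined
        // behavior, hence the cast.
        char c = 'A';
        char lower = static_cast<char>(
            std::tolower(static_cast<unsigned char>(c)));
        std::cout << lower << '\n'; // prints "a"
    }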
Small nitpick: the example "LATIN SMALL LETTER SHARP S (“ß” U+00DF) uppercases to the two-character sequence “SS”:³ Straße ⇒ STRASSE" is slightly wrong, it seems to me, as we now actually do have an uppercase version of it, so it should uppercase to LATIN CAPITAL LETTER SHARP S (U+1E9E).
The double-S thing is still widely used, though.
Duden mentions this: "When writing in capital letters, SS traditionally stands for ß. In some typefaces, however, there is also a corresponding capital letter; its use is optional ‹§ 25 E3›."
But isn't it also dependent on the glyphs available in the font used? So, e.g., it needs to be ensured that U+1E9E exists?
I don't think there exists any code that makes uppercasing decisions based on the selected font. Besides, if it doesn't exist in the current font, there's probably a fallback font.
But what if you need to uppercase a historical record in a vital-records registry, written in the 1950s but OCRed last week? Now you need to be not just locale-aware; your locale should be versioned.
Not just a Swiss user, as there are many German words that use ss and not ß. And having an ss where there should be a ß will be a lot less disruptive than the inverse, because people are used to ASCII limitations.
> And in certain forms of the French language, capitalizing an accented character causes the accent to be dropped: à Paris ⇒ A PARIS.
That's incorrect: using diacritics on capital letters is always the preferred form; it's just that dropping them is acceptable, as it was often done for technical reasons.
I generally just use the language-supported tolower/toupper() (or similar) routines. I assume that they take things like UTF encodings and alternative writing systems into account.
I'm not sure about other languages, but Swift has pretty intense String support[0], and can go quite a long ways.
Someone actually wrote a whole book about just Swift Strings[1].
Strings in C++ standard library do suck (and C++ is my favorite language).
As for UTF-16, well, I don't know that UTF-8 is a whole lot more intuitive:
> And for UTF-8 data, you have the same issues discussed before: Multibyte characters will not be converted properly, and it breaks for case mappings that alter string lengths.
Surrogate pairs aren't more complex than UTF-8's scheme for determining the number of bytes used to represent a code point. (Arguably the logic is slightly simpler.) But the important point is that UTF-16 pretends to be a constant-length encoding while actually having the surrogate-pair loophole - that's because it's a hack on top of UCS-2 (which originally worked well enough for Microsoft to get married to; but then the BMP turned out not to be enough code points). UTF-8 is clearly designed from scratch to be a multi-byte encoding (and, while the standard now makes the corresponding sequences illegal, the scheme was designed to be able to support much higher code points - up to 2^42 if we extend the logic all the way; hypothetical 6-byte sequences starting with values FC or FD would neatly map up to 2^31).
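To see the comparison concretely, here is roughly how length determination works in each encoding (a sketch):

    #include <cstdint>

    // UTF-8: the lead byte alone tells you the sequence length.
    int utf8_sequence_length(uint8_t lead) {
        if (lead < 0x80) return 1;         // 0xxxxxxx
        if ((lead >> 5) == 0x6)  return 2; // 110xxxxx
        if ((lead >> 4) == 0xE)  return 3; // 1110xxxx
        if ((lead >> 3) == 0x1E) return 4; // 11110xxx
        return -1;                         // continuation byte or invalid
    }

    // UTF-16: the first code unit likewise tells you the length --
    // arguably even simpler than the UTF-8 bit patterns.
    int utf16_sequence_length(uint16_t unit) {
        if (unit >= 0xD800 && unit <= 0xDBFF) return 2;  // high surrogate
        if (unit >= 0xDC00 && unit <= 0xDFFF) return -1; // stray low surrogate
        return 1;                                        // BMP code point
    }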
First, you should consider if you even need case folding; for many uses it will be unnecessary, anyways.
Furthermore, the proper way to do case folding will depend on such things as the character set, the language, the specific context of the text being converted (e.g. in some cases specific letters are required, such as abbreviations of the names of SI units), etc. And then, it is not necessarily only "uppercase" and "lowercase", anyways.
There might even be different ways to do it within the same language, with possible disagreements about usage (e.g. the German eszett did not have an official capital form until 2017, although apparently some type designers made one anyway (and it was in Unicode before then, despite that)).
If the character set is Unicode, then there is not actually one correct way to do it, despite what the Unicode Conspiracy insists.
Also, for some uses there will be a specific way the conversion is required to work (because of how a file format or a protocol or whatever works), so in such a case, if the character set is anything other than ASCII, you cannot just assume that it will always work the same way.
You also cannot necessarily depend on the locale for such a thing, since it might depend on the data, as well.
These things can be bad enough on their own, but Unicode makes them worse. If a program requires a specific case folding, it will not work correctly under the wrong version of Unicode, which can turn into a security issue and/or other problems.
(Another problem, which applies even if you do not use case folding, is that some people think that all text is or should be Unicode and that one character set is suitable for everything. Actually, one character set cannot be suitable for everything, regardless of what character set it is. Even if it was (which it isn't), it wouldn't be Unicode.)
nothing about working with locales, or text in general, is basic. we were decades into working with digital computers before we moved past switchboards and LEDs. don't take for granted just how high of a perch upon the shoulders of giants you have. that's exactly how the mistakes in the blog post get made.
> Okay, so those are the problems. What’s the solution?
> If you need to perform a case mapping on a string, you can use LCMapStringEx with LCMAP_LOWERCASE or LCMAP_UPPERCASE, possibly with other flags like LCMAP_LINGUISTIC_CASING. If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
The correct thing to do is to not do it at all. If text is 3rd-party supplied, treat it like an opaque byte sequence. Alternatively, pay a well-trained human to do it by hand.
All other options are going to result in edge cases where you're not handling it properly. It's like trying to programmatically split a full name into a first name and a last name: language doesn't work like that.
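For anyone who does take the API route the quoted article recommends, ICU usage looks roughly like this (a sketch; assumes a modern ICU build where UChar is char16_t):

    #include <unicode/ustring.h>
    #include <cstdio>

    int main() {
        const UChar src[] = u"Straße";
        UChar dest[32];
        UErrorCode status = U_ZERO_ERROR;
        // The locale argument matters: "tr" uppercases 'i' to 'İ' (U+0130),
        // while "de" gives the default full mapping, ß -> SS.
        int32_t len = u_strToUpper(dest, 32, src, -1, "de", &status);
        if (U_SUCCESS(status))
            std::printf("%d UTF-16 code units\n", len); // "STRASSE": 7 units
    }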
Thank you for this universal approach. I can now toggle capitalization on/off for any character, instead of just being limited to alphabetic ones!
Jokes aside, I was kinda hoping for a good answer that doesn't rely on a Windows API or an external library, but I'm not sure there is one. It's a rather complex problem when you account for more than just ASCII and the English language.
C is hard, and it seems like C++ just made things way harder. I don't regret skipping it. Why not go right to Java, C#, JS, Haskell, etc., and drop down to C for the parts you need?
Man, I'm happy we don't need to deal with this crap in Rust, and we can just use String::to_lowercase. Not having to worry about things makes coding fun.
While certainly much better, you still need to be aware that doing case conversion absent any locale information will never be perfect. If you want proper locale-aware conversion you can use the icu crate (https://docs.rs/icu/latest/icu/).
A footnote in the article provides the following explanation:
"The standard imposes this limitation because the implementation may need to add default function parameters, template default parameters, or overloads in order to accomplish the various requirements of the standard."
...and that is why you use QString if you are using the Qt framework. QString is a string class that actually does what you want when used in the obvious way. It probably helps that it was mostly created by people with "ASCII+" native languages. Or with customers that expect not exceedingly dumb behavior. The methods are called QString::toUpper() and QString::toLower() and take only the implicit "this" argument, unlike Win32 LCMapStringEx() which takes 5-8 arguments...
Qt does have a locale-aware equivalent (QLocale::toUpper/toLower) which calls out to ICU if available. Otherwise it falls back to the QString functions, so you have to be confident about how your build is configured. Whether it works or not has very little to do with the design of QString.
I don't see a problem with that. You can have it done locale-aware or not and "not" seems like a sane default. QString will uppercase 'ü' to 'Ü' just fine without locale-awareness whereas std::string doesn't handle non-ASCII according to the article. The cases where locale matters are probably very rare and the result will probably be reasonable anyway.
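For example (a sketch showing the behavior described in this thread: Unicode case mapping with no locale setup at all):

    #include <QString>
    #include <QDebug>

    int main() {
        QString s = QString::fromUtf8("über");
        qDebug() << s.toUpper();                  // "ÜBER", no locale needed
        qDebug() << QString("STRASSE").toLower(); // "strasse", not "straße" --
                                                  // no function can recover
                                                  // the ß without context
    }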