In gamedev there is a simple rule: don't try to do any of that.
If it is text the game needs to show to the user, then every version of that text is a translated text. The programmer will never know whether the context or locale will require word order changes or anything more complicated. Just trust the translation team.
If text is coming from the user, then change the design until there's no need to 'convert' it. There are major issues just in showing the user back what they entered! The font for editing and the font for display could be different. Not even mentioning RTL and other issues.
Once people learn about localization, questions like 'why doesn't a programming language do this simple text operation' become a newcomer detector. :)
>Once people learn about localization, questions like 'why doesn't a programming language do this simple text operation' become a newcomer detector. :)
I think you are purposefully misinterpreting the question. They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.
What if your game needs to talk to a server and do some string manipulation in between requests? Are you really going to architect everything so that the client doesn't need to handle any of that ever?
>What if your game needs to talk to a server and do some string manipulation in between requests? Are you really going to architect everything so that the client doesn't need to handle any of that ever?
Of course! String manipulation on user-entered attributes like display names or chat messages is one millimeter away from good old SQL injection à la little Bobby Tables ('Robert'); DROP TABLE Students;--'). Never ever do that if you can avoid it. Every time someone 'just concatenates' two strings, e.g. to add the glyph that represents an input button, they create a bug that is both annoying and wrong. Games should use substitution patterns guided by the translation team, because there is no ASCII-only culture among the ~15 locales typically supported by big publishers.
There are exceptions, like platform-provided services to filter banned words in chat. And even there you don't have to do 'things with ASCII characters'. Yes, players will input unsupported symbols everywhere they can, so you need good replacement characters for those, and you need to fix support for popular emojis regularly. Communities expect that now.
> They're not asking about converting the case of any Unicode string with locale sensitivity, they're asking about converting the case of ASCII characters.
I'm confused now. The article specifically mentions issues with UTF-16 and UTF-32 and with Unicode characters outside the Basic Multilingual Plane (BMP).
I'm referring to the people who call case conversion in general "a simple text operation". Say you have an std::string and you want to make it lower case. If you assume it contains just ASCII, that's a simpler operation than if you assume it contains UTF-8, but C++ doesn't provide a single function that does either of them. A person can rightly complain that the former is basic functionality the language should include; personally, I would agree. And you could say "wow, doesn't this person realize that case conversion in Unicode is actually complicated? They must be really inexperienced." It could be that the other person really doesn't know about Unicode, or it could mean that you and they are thinking about entirely different problems and you're being judgemental a bit too eagerly.
For ASCII in C++, isn't there std::tolower / std::toupper? If you're not dealing with unsigned char types there isn't a simple case conversion function, but that's for a good reason, as the article lays out.
Those functions take and return single characters. What's missing is functions that operate on strings. You can use them in combination with std::transform(), but as the article points out, even if you're just dealing with ASCII you can easily do it wrong. I've been using C++ for over 20 years and I didn't know tolower() and toupper() were non-addressable. There's really no excuse for the library not having simple case conversion functions that operate on strings in-place.
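For reference, a minimal sketch of the 'correct' ASCII-only incantation (the helper name is mine): the lambda is needed precisely because the functions are non-addressable, and the unsigned char parameter avoids undefined behavior for bytes with the high bit set.

    #include <algorithm>
    #include <cctype>
    #include <string>

    // ASCII-only, in-place lowercasing via std::transform.
    void ascii_lower(std::string& s) {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    }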
std::transform() seems like overkill when you can just iterate over the string and modify it in place. And in my opinion, transform is way less readable than seeing a loop over some array with a single operation inside.
The article talks about wstrings for good reason. If you're converting narrow strings, you don't need to be this fancy. Just loop over the string and edit it in place.
If you are operating on wide strings, there is no suitable single solution, partly because wstring is a terrible type. It's different widths on different platforms, and no string encoding format uses a generalized wstring; they all have mandatory min/max character byte widths. So a wstring tells you nothing about the semantic representation of the actual encoded string contents.
The C++ stdlib could include a fully unicode aware string type set, and surrounding library. But personally I think C++ isn't the kind of language to provide an opinionated stdlib module for such a complex task. And there's no way to implement such a module without being very opinionated about something.
> The article talks about wstrings for good reason. If you're converting narrow strings, you don't need to be this fancy. Just loop over the string and edit it in place.
Since you mention narrow strings in the context of wstring, just to make sure... you can't convert a UTF-8 std::string character by character, in-place (in case that's what you meant).
7-bit ASCII code points are fine, but outside that it's not guaranteed that one UTF-8 byte converts into exactly one UTF-8 byte when converting case.
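A few concrete (and real) case mappings show how the byte count shifts:

    U+0131 "ı" (2 bytes in UTF-8) uppercases to "I" (1 byte)
    U+00DF "ß" (2 bytes) uppercases to "SS" (2 bytes, but now 2 code points)
    U+0130 "İ" (2 bytes) lowercases to "i" + U+0307 (3 bytes)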
Yeah, if you're using narrow strings for UTF-8 you're making a mistake. wstrings are also not a good representation because of the platform differences, unless you don't care about Windows, in which case it's fine but still not great semantically.
In most type definitions you cannot convert UTF-8 via simple iteration because the element type generally represents a code unit (or at best a code point), not a character.
You can have a library where UTF8 characters are a native type and code points are a mostly-hidden internal element. But again, that's highly opinionated for C++.
I'm not 100% sure what you mean by narrow string, but if you refer to std::string vs std::wstring, then std::string is perfectly fine for encoding UTF8, as that uses 8 bit code units which are guaranteed to fit in a char. On the other hand, std::wstring would be a bizarre choice for UTF8 on any platform.
It's not guaranteed for 7-bit ASCII either, because tolower/toupper are locale-dependent, and with tr_TR the lowercase of I (U+0049) is ı (U+0131, aka dotless i), which encodes as two bytes in UTF-8.
That's not ASCII then. It's byte-width compatible (to a certain degree, as you point out), but it's not ASCII. ASCII defines 128 code points and the handling of an escape character. It doesn't handle locales.
std::u8string, std::u16string and std::u32string are supposed to be the portable unicode string types, but a lot of machinery is missing and some that has been added has since been deprecated.
> there's no way to implement such a module without being very opinionated about something.
Indeed! Boost.Nowide[1] is such an opinionated library.
Yep, there's also ICU and utf8cpp, and many others. They all have trade-offs. So I just don't think the stdlib should cover this because there is no objectively best way to handle it.
I know I can simply iterate. The point is that it's a function that should be included, not that it's impossible without it. It's one of the most common string operations.
To me that feels like the JS community asking for left-pad or is-even in a module. Why have a dedicated function for 2 lines of code?
And it's a huge footgun. There is no ASCII type in C++. People will use the generalized tolower for UTF-8 encoded in narrow strings and have issues.
You could say the generalized tolower should support all the different width/encoding combinations and sort it out. But that's still highly opinionated as far as performance is concerned.
Generalized string conversion is a very complex problem and you really cannot simplify it in a way that will satisfy most C++ users. Just use ICU or utf8cpp if you want to do string operations and don't care what's going on under the hood. But even then I can't recommend just 1 library, because no perfect 3rd party library exists. A perfect first party library definitely could not exist.
>Why have a dedicated function for 2 lines of code?
Then why does std::max() exist?
>People will use the generalized tolower for UTF-8 encoded in narrow strings and have issues.
tolower() and toupper() work correctly on UTF-8 strings (at least under the C locale), because UTF-8 was specifically designed so that non-ASCII characters are represented by sequences of purely non-ASCII bytes, which per-byte tolower() then leaves untouched.
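A sketch of that property in action (helper name mine): an ASCII-only, locale-free lowercase that is safe to run over UTF-8 bytes.

    #include <string>

    // Lowercase only the ASCII letters A-Z. Every byte of a multibyte UTF-8
    // sequence has its high bit set, so it can never match A-Z and passes
    // through untouched.
    void ascii_lower_utf8_safe(std::string& s) {
        for (char& c : s) {
            if (c >= 'A' && c <= 'Z') c += 'a' - 'A';
        }
    }
    // On the UTF-8 bytes of "Straße" this yields "straße": only the 'S'
    // changes; the 0xC3 0x9F pair encoding 'ß' is left alone.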
>Generalized string conversion is a very complex
Hence why people who say C++ should have a tolower() that operates on strings are not asking for more complex Unicode support.
> There's really no excuse for the library not having simple case conversion functions that operate on strings in-place.
Could not agree more. Any time I touch C I want to scoop my brain out of my ear. So many simple, unbelievably common operations have fifty "best" ways to do them, when they should have one happy path that covers the 99% of use cases baked in. Nobody should ever have to seriously consider something as ridiculous as "is tolower addressable?".
std::tolower / std::toupper are rubbish functions that can't do proper Unicode but still pull in the bloated locale machinery for what should be a simple conditional integer addition if all you care about is ASCII. Neither has a valid use case; both should be marked [[deprecated]] and erased from all teaching materials.
> What if your game needs to talk to a server and do some string manipulation in between requests?
What conceivable reason would there be to ever need to do that? If the server takes commands in upper case, then have them in upper case from the start. If the server takes commands in lower case, have them in lower case from the start. If the server specifies that you need to invert the case of its response to use in the next request, find a server developed by someone not crazy.
Those sound exactly like the newcomer detectors GP was referring to. What you want is a case-insensitive string comparison, and outside ASCII that's not equivalent to just turning both strings to lowercase and checking equality (or doing a substring search, or whatever the task requires).
Nobody is thinking about converting the case of ASCII characters. To be thinking that, they would have to explicitly exclude most of the world's cultures from entering common names correctly. Restricting thought to ASCII is a lack of thought, not an active thought.
>If text is coming from the user, then change the design until there's no need to 'convert' it
In games, you can possibly get away with this. Most other people need to worry about things like string collation (locale-aware sorting) for user-supplied text.
> In gamedev there is a simple rule: don't try to do any of that.
I am not in gamedev, but I frequently have to develop middleware that takes in user-entered data and formats it in a way that will import into a 3rd-party system without errors. And that sometimes means changing the case of strings.
In my experience as a developer, this is a very, very common requirement.
Luckily I am not forced to use a low level language for any of my work. In C# I can simply do this: "hello world".ToUpper();
If you're putting data into a third-party system, you might want `ToUpperInvariant`, not `ToUpper`. (Just checking that you know the difference, because most people don't!)
The problem is that such third-party requirements are usually wrong.
Two decades ago some developer probably went "Yeah, obviously all names start with capital letters!", not realizing that there are in fact plenty of names which start with a lowercase letter. So they added an input validation test which checks for capitals, which meant everyone feeding that system had to format their data. A whole ecosystem grew around the format of the output of that system, and now you're suddenly rewriting the system and you run into weird and plain wrong capitalization requirements for no technical reason whatsoever.
Alternatively, the same but start with punch cards which predate ASCII and don't distinguish between uppercase and lowercase letters.
> In C# I can simply do this: "hello world".ToUpper()
... which does not work.
Take a look at the German word "straße" (street), for example. Until very recently the "ß" character did not have an uppercase variant, so a ToUpper would convert it to "STRASSE". This is a lossy operation, as the reverse isn't true: the lowercase variant of "KONGRESSSTRASSE" (congress street) is not "kongreßstraße" - it's supposed to be "Kongressstraße".
It can get even worse: the phrase "in Maßen" (in moderate amounts) naively has the uppercase variant "IN MASSEN" - but that means "in huge amounts"! In that case it is probably better to stick to "IN MASZEN".
And then there's Turkish, where the uppercase variant of the letter "i" is of course "İ" rather than "I" - note the dot.
So no, you cannot "simply" use ToUpper() / ToLower(). They might work well enough for basic ASCII in languages like English, but they have a habit of making a mess out of everything else. You're supposed to use CultureInfo.TextInfo.ToUpper() and explicitly specify what locale the text is in so that it can use the right converter. Which is of course essentially impossible in general-purpose text fields.
In practice that means your options are a) giving up on the concept of uppercase/lowercase conversion and just passing it as-is, or b) accepting that you are inevitably going to be silently corrupting your data.
> So no, you cannot "simply" use ToUpper() / ToLower(). They might work well enough for basic ASCII in languages like English, but they have a habit of making a mess out of everything else. You're supposed to use CultureInfo.TextInfo.ToUpper() and explicitly specify what locale the text is in so that it can use the right converter. Which is of course essentially impossible in general-purpose text fields.
Yes. Now try applying it to something like this very HN comment section, which is mixing words belonging to different cultures inside a single comment - and in some cases even inside the same word.
Sure, you can now do case conversion for a specific culture, but which one?
It's a lossy operation, but it does work. By this logic JPEG and MPEG don't work either, but we're watching those videos daily.
Yes, we can simply ToUpper(). We just can't ToUpper().ToLower(), but that's useless anyway: we have the original string if we need it, and it's fine if we don't.
The point is that what ToUpper does depends on locale AND Unicode version. Thus, for many applications it only appears to work, until it fails spectacularly in production.
I don't think you can say this is universally known in 'game dev'. In fact, just last week I stumbled over a game UI that let me enter a name for something, which it then displayed in uppercase.
Game UI is the place I’d expect to most likely come across horrific abuses of localization precisely because game UI is such a cobbled together layer of hacks on hacks.
> There are major issues just in showing the user back what they entered! The font for editing and the font for display could be different. Not even mentioning RTL and other issues.
Your web browser is doing it right now as you are reading this comment.
And it's not just Unity. Several exist for Unreal as well.
Why? Specifically because 2D layout and text rendering suck so much in game engines. What's ~50MB matter when you're shipping several GB of game assets?
It is issues like this that made me give up on C++. There are so many ways to do something and every way is freaking wrong!
An acceptable solution is given at the end of the article:
> If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
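For illustration, a minimal sketch of that route (UTF-16 buffers; pre-flighting and full error handling omitted):

    #include <unicode/ustring.h>
    #include <unicode/utypes.h>

    // Uppercase a UTF-16 string under an explicit locale; passing "tr"
    // would give the Turkish dotted/dotless I behavior discussed elsewhere
    // in this thread. Returns the result length, or -1 on error.
    int32_t to_upper(UChar* dest, int32_t destCapacity,
                     const UChar* src, int32_t srcLength,
                     const char* locale /* e.g. "tr", or "" for root */) {
        UErrorCode status = U_ZERO_ERROR;
        int32_t len = u_strToUpper(dest, destCapacity, src, srcLength,
                                   locale, &status);
        return U_SUCCESS(status) ? len : -1;
    }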
Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with itself more syntax and more complexity in the language. But as a user of C++ I don't need more syntax and more complexity in the language. But I do need more standard library functions that solves these ordinary real-world programming problems.
I don't think it's a C++ problem. You just can't transform anything developed in "ancient" times to be Unicode-aware in a single swoop.
On the other hand, libicu is 37MB by itself, so it's not something someone can write in a weekend and ship.
Any tool which is old enough will have a thousand ways to do something. This is the inevitability of software and programming languages. In the domain of C++, which is mammoth in size now, everyone expects this huge pony to learn new tricks, but everybody has a different idea of what the "new tricks" should be, so more features get added on top of its already impressive and very long list of features and capabilities.
You want libICU built-in? There must be other folks who want that too. So you may need to find them and work with them to make your dream a reality.
So, C++ is doing fine. It's not that they omitted Unicode during the design phase; Unicode arrived later, and it has to be integrated by other means. This is what libraries are for.
>You just can't transform anything developed in "ancient" times to be Unicode-aware in a single swoop.
Even for Python it took well over a decade, and people still complain about the fact that they don't get to treat byte-sequences transparently as text any more - as if they want to wrestle with the `basestring` supertype, getting `UnicodeDecodeError` from an encoding operation or vice-versa, trying to guess the encoding of someone else's data instead of expecting it to be decoded on the other side....
But in C++ (and in C), you have the additional problem that the 8-bit integer type was named for the concept of a character of text, even though it clearly cannot actually represent any such thing. (Not to mention the whole bit about `char` being a separate type from both `signed char` and `unsigned char`, without defined signedness.)
Being developed in, and having to stay compatible with, ancient times is a real problem of C++.
The now-invalid assumptions couldn't have been avoided 50 years ago. Fixing them now in C++ is difficult or impossible, but still, the end result is a ton of brokenness baked into C++.
Languages developed in the 21st century typically have some at least half-decent Unicode support built-in. Unicode is big and complex, but there's a lot that a language can do to at least not silently destroy the encoding.
That explains why there are two functions, one for ASCII and one for Unicode. That doesn't explain why the Unicode functions are hard to use (per the article).
Almost no programming language, perhaps other than Swift, solved that problem. Just use the article's examples as test cases. It's just as wrong as the C++ version in the article, except it's wrong with nicer syntax.
Python's strings have uppercase, lowercase and case-folding methods that don't choke on this. They don't use UTF-16 internally (they can use UCS-2 for strings whose code points fit in that range; while a string might store code points from the surrogate-pair range, they're never interpreted as surrogate pairs, but instead as an error encoding so that e.g. invalid UTF-8 can be round-tripped), so they're never worried about surrogate pairs, and they know a few things about localized text casing.
C#'s "ToUpper" takes an optional CultureInfo argument if you want to play around with how to treat different languages. Again, solved problem decades ago.
Well, languages and conventions change. The € sign was added not that long ago and it was somewhat painful. The Chinese language uses a single character to refer to each chemical element, so when IUPAC names new elements, new characters get invented. Etc.
There shouldn't be an uppercase version of ß because there is no word in the German language that uses it as the first letter. The German language didn't think of allcaps. Please correct me if I am wrong. If written in uppercase it should be converted to SZ or the new uppercase ß… which my iPhone doesn't have… and
converting anything to uppercase SS isn't something Germany wants…
> There shouldn't be an uppercase version of ß because there is no word in the German language that uses it as the first letter. The German language didn't think of allcaps.
Allcaps (and smallcaps) has always existed in signage everywhere. Before the computing age, letters were just arbitrary metal stamps -- and just whatever you could draw before that. Historically, language was not as standardized as it is today.
[Notice that this is in fact entirely impossible with the naive strategy since Greek cares about position of symbols]
Some of the latter examples aren't cases where a programming language or library should just "do the right thing" but cases of ambiguity where you need locale information to decide what's appropriate, which isn't "just as wrong as the C++ version"; it's a whole other problem. It isn't wrong to capitalise A-acute as a capital A-acute, it's just not always appropriate depending on the locale.
For display it doesn't matter, but most other applications really want some kind of normalization, which does much, much more, so having a convenient to_lowercase() doesn't buy you as much as you think and can be actively misleading.
That doesn’t prevent adding a new function that converts an entire string to upper or lowercase in a Unicode aware way.
What would be wrong with adding new correct functions to the standard library to make this easy? There are already namespaces in C++ so you don’t even have to worry about collisions.
That’s the problem I see. It’s fine if you have a history of stuff that’s not that great in hindsight. But what’s wrong with having a better standard library going forward?
The reason that wasn't done is because Unicode is not really in older C++ standards. I think it may have been added to C++23 but I am not familiar with that. There are many partial solutions in older C++ but if you want to do it well then you need to get a library for it from somewhere, or else (possibly) wait for a new standard.
Unicode and character encodings are pretty esoteric. So are fonts. The stuff is technically everywhere and fundamental, but there are many encodings, technical details, etc. And most programmers only care about one language, or else only use UTF-8 with the most basic chars (the ones that agree with ASCII). That isn't terrible. You only need what you actually need. Most programs don't strictly have to be built for multiple random languages, and there is kind of a standard methodology to learn before you can do that.
I politely disagree. None of the programming languages which started integrating Unicode were targeting everything from bare metal to GUI, including embedded and OS development, at the same time.
C++ has a far larger target area compared to other programming languages. There are widely used libraries which compile correctly on PDP-11s, even while being updated constantly.
You can't just say "I'll be just making everything Unicode aware, backwards compatibility be damned, eh".
But we don't have to make everything Unicode aware. Backward compatibility is indeed very important in C++. Like you rightly said, it still has to work for PDP-11 without breaking anything.
But the C++ overlords could always add a new type that is Unicode-aware. Converting one Unicode string to another is a purely in-memory, in-CPU operation. It does not need any I/O and it does not need any interaction with peripherals. So one can dream that such a type along with its conversion routines could be added to an updated standard library without breaking existing code that compiles correctly on PDP-11s.
> Converting one Unicode string to another is a purely in-memory, in-CPU operation.
...but it's a complex operation. This is what libICU is mostly for. You can't just look up a single table and convert one string to another the way you can with the ASCII table or any other simple encoding.
Germans have their ß to SS (or to capital ẞ, depending on the year), Turkish has its ı/I and i/İ pairs, and tons of other languages have other rules.
Especially these I/ı and İ/i pairs break tons of applications in very unexpected ways. I don't remember how many bugs I have reported, and how many workarounds I have implemented in my systems.
Adding a type is nice, but the surrounding machinery is so big that it brings tons of work with it. Unicode is such a complicated system that I read you can even need two UTF-16 code units (4 bytes in total) to encode a single character. This is insane (as in complexity; I guess they have their reasons).
> Unicode is such a complicated system that I read you can even need two UTF-16 code units (4 bytes in total) to encode a single character. This is insane (as in complexity; I guess they have their reasons).
Because there are more than 65,535 characters. That's just writing systems, not Unicode's fault. Most of the unnecessary complexity of Unicode is legacy compatibility: UTF-16 & UTF-32 are bad ideas that increase complexity, but they predate UTF-8 which actually works decently well so they get kept around for backwards compatibility. Likewise with the need for multiple normalization forms.
It's because Unicode doesn't allow for language switching.
It takes up to eight bytes per character in Unicode if you want to support both Chinese and Japanese in a single font using IVS (and I don't think there's any font that actually supports this).
AFAICS (as far as I can search), the Simplified (PRC) and Traditional (Taiwan) Chinese encodings are respectively called GB2312 and Big5, and they're both two-byte encodings with good practical coverage. The same applies to Japanese Shift_JIS. If e.g. :flag_cc: were allowed as a start-of-language marker, one could theoretically cut that back down to two bytes per character without losing much, actually improving language support.
The number of characters is not the problem; the mess due to legacy compatibility is. Case folding and normalization could be much simpler if the codepoints were laid out with that in mind. There's also the fact that Unicode can't make up its mind whether it wants to encode glyphs (Turkish I and i, Han unification), semantic characters (e.g. Cyrillic vs. Latin letters), or just "ideas" (emojis).
I mean, I already know some Unicode internals and linguistics (since I developed a language-specific compression algorithm back in the day), but I have never seen a single character requiring four bytes (and I know about emoji chaining for skin color, etc.).
So seeing this just moved the complexity of Unicode one notch up in my head, and I respect the people who designed it and made it work. It was not whining or complaining of any sort. :)
Lots of emoji are outside the Basic Multilingual Plane and need 4 bytes in UTF-8 and UTF-16. That's without going into skin color and other modifiers and combinations.
Java was built from scratch as a heavyweight language with a whole portability layer that C++ does not have. Also, libraries have been around to do this stuff in C++, but maybe some people thought it better not to require C++ itself to support Unicode.
Until the mid-2000s there was no certainty that Unicode would eventually defeat its competitors. In reality it hasn't fully won even now: GB2312 and TRON still prevail locally, and IBM still jogs along with EBCDIC. But in its early days nobody was reasonably sure, and the Java attempt could have failed as well. (More so since the Java approach of UCS-2 was wrong - already commented on nearby.)
Indeed, ICU as well, and then they all moved to UTF-16, which, again, in the long term lost to UTF-8. My point is that committing to a specific Unicode design 30 years ago was not, in retrospect, necessarily a good idea.
By not committing to UCS-2 early, C++ left the road open to UTF-8. I'll concede that UTF-8 has been the clear winner for more than a decade and that C++ is well past the point where it should have at least basic built-in support. The problem is that at least one important C++ platform only very recently added full support for the encoding in its native API.
"no excuse" -- I would respectfully disagree here. There are lots of very smart people who have worked on Qt. Really, some insanely good C++ programmers have worked on that project. I have no doubt that they have discussed changing class QString to use UTF-8 internally. To be clear, probably QChar would also need to change, or a new class (QChar8?) would be needed, in parallel to QChar. I guess they concluded the API breakage would be too severe. I assume Java and Win32/DotNet decided the same. Finally, you can Google for old mailing list discussions about QString using UTF-16. Many before have asked "can we just change to UTF-8?".
Java embraced Unicode, and ended up with a mess as Unicode changed underneath it.
You can actually end up in a cleaner state in C++, as there is no obligation to use the standard library string classes, whereas in Java using the built-in String is pretty much required.
Java has 16-bit character types. It is in no way better at modern Unicode than C++ while being needlessly less efficient for mostly-ASCII text like XML-like markup.
> Any tool which is old enough will have a thousand ways to do something.
Only because of the strange desire of programmers to never stop. Not every program is a never ending story. Most are short stories their authors bludgeon into a novel.
Programming languages bloat into stupidity for the same reason. Nothing is ever removed. Programmers need editors.
So how do you design a language that accommodates both the people who need a codebase to be stable for decades and the people who want the bleeding edge all the time, backwards compatibility be damned?
You don't. Any language that tries to do both turns into an unusable abomination like C++. Good languages are stable and the bleeding edge is just the "new thing" and not necessarily better than the old thing.
> There are so many ways to do something and every way is freaking wrong!
That's life! The perfect way does not exist. The best you can do is be aware of the tradeoffs, and languages like C++ absolutely throw them in your face at every single opportunity. It's fatiguing, and writing in javascript or python allows us to uphold the facade that everything is okay and that we don't have to worry about a thing.
JS and Python are still old enough to have been created when Unicode was in its infancy, so they have their own share of problems from using UCS-2 (such as indexing strings by what is now a UTF-16 code unit, rather than by a codepoint or a grapheme cluster).
Swift has been developed in the modern times, and it's able to tackle Unicode properly, e.g. makes distinction between codepoints and grapheme clusters, and steers users away from random-access indexing and having a single (incorrect) notion of a string length.
Well, the only time doing str lower runs into Unicode locale-awareness problems is when you do it on user input, like names.
How about you just don't? If it's a constant in your code, you probably use ASCII anyway or can do a static mapping. If it's user input -- just don't str lower / str upper it.
Yes, except when it is not your choice. If the requirements are to display some strings in lower/uppercase then you need to find a way to do that. That doesn't have to be using the standard library though.
If anything, it should be harder to add things to the language. Too many new additions have been half-arsed and needed to be changed or deprecated soon after.
Yes, and significantly smaller libraries have had a hard time getting into the standard. Getting the equivalent of ICU in would be almost impossible. And good luck keeping it up to date.
> Makes you wonder why this isn't part of the C++ standard library itself.
Because the C++ standard library cares about binary size and backwards compatibility, both of which are incompatible with a full Unicode implementation. Putting this in the stdlib means everyone has to pay for it even when they don't need it.
Libraries are fine, not everything needs to be defined by the language itself.
> Makes you wonder why this isn't part of the C++ standard library itself.
Plainly, there's no need if there is a separate, easily attachable library (with a permissive license). What C++ had to do - provide character (char{8,16,32}_t) and string types - it has done.
As a C++ dev, I have never run into the problem the post is describing. Upper and lowercase conversion has always worked just fine. Though then again, I don't fiddle with mixed unicode and non-unicode situations.
That is neither up-casing nor down-casing, but (de)capitalization, which is a significantly more complex task (one that ultimately requires up- or down-casing, but a whole lot more before then).
I am not aware of a Unicode concept of "the Latin letter o followed by an apostrophe followed by another Latin letter". Unicode would identify the glyphs for such a concept, but I don't see how Unicode is involved in any way in the process of deciding what "capitalized o'reilly" means.
The fact that the standard library works against you doesn't help (tolower takes an int, but only kind of works (sometimes) correctly on unsigned char, and wchar_t is implicitly promoted to int).
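Concretely, the trap looks like this:

    #include <cctype>

    void example(char c) {
        // std::tolower(c);  // UB if c is negative: the int argument must be
        //                   // representable as unsigned char, or be EOF.
        std::tolower(static_cast<unsigned char>(c));  // well-defined
    }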
tolower is in the std namespace but is actually just part of the C89 standard, meaning it predates both UTF-8 and UTF-16. Is the alternative that it should be made unusable, and more existing code broken? A modern user has to include one of the c-prefixed headers to use it, which already hints that 'here be dragons'.
But there are always dragons. It's strings. The mere assumption that they can be transformed int-by-int, irrespective of encoding, is wrong. As is the assumption that a sensible transformation to lower case without error handling exists.
> Is the alternative that it should be made unusable, and more existing code broken?
It should be marked [[deprecated]], yes. There is no good reason to use std::tolower/toupper anywhere - they can neither do Unicode properly nor are they anywhere close to efficient for ASCII. And their behavior depends on the process-global locale.
The key takeaway here is that you can't correctly process a string if you don't know what language it's in. That includes variants of the same language with different rules, e.g. en-US and en-GB, or es-MX and es-ES.
If you are handling multilingual text the locale is mandatory metadata.
I thought the German language deprecated the use of ß years ago, no? I learned German for a year and that's what the teacher told us, but maybe it's not the whole story.
If it could be done, it would already have been done. (It definitely won't happen on an echo-change day like today, either. ;))
Contra my comrade's comment, Esperanto orthography is firmly European, and so retains European-style casing distinctions; every sound thus still has two letters -- or at least two codepoints.
(There aren't any eszett-esque digraphs, but that's not saying much.)
Language is just part of the problem. Unicode lets you store text as entered, but what you do with that text completely depends on what your problem domain is. When you're writing software to validate that the name on someone's ID matches that on a ticket, you're probably going to normalise that name to your (customer's) locale rather than render each name in the locale it was originally written in. As long as you keep your locale settings consistent and don't do bad stuff like "iterate over characters and individually transform them", you're probably fine, unless your problem domain calls for something else.
If you're printing a name, you're probably printing the name for the current user, not for the person who entered it at some point. If you're going to try to convert back like that, you also need to store a timestamp with every string in case a language changes its rules (such as permitting ẞ instead of SS when capitalising ß). And even then, someone might intend to use the new spelling rules, or they might not, who knows!
This article probably boils down to "programmers don't realise graphemes aren't characters and characters aren't bytes even though they usually are in US English". The core problem, "text processing looks easy as long as you only look at your own language", is one that doesn't just affect computers.
Your best bet is to just avoid the entire problem by not processing input further than basic input sanitisation, such as removing whitespace prefixes/suffixes and maybe stripping out invalid unicode so it can't be used as a weird stored attack.
islower is actually supposed to account for the user's "locale", which includes their language.
The key takeway is that lowercasing a string needs to be done on the whole string, not individual characters, even if std::string had a way to iterate over codepoints instead of bytes (or code units, in the case of wstring).
And there isn't a standard way to do that; you either need to use a platform-specific API, like the Windows function mentioned, or use a library like ICU.
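A hedged sketch of the Windows route (assuming the function meant is LCMapStringEx), using the usual size-then-fill double call:

    #include <windows.h>
    #include <string>

    // Lowercase a whole UTF-16 string under an explicit locale.
    std::wstring to_lower_win(const std::wstring& s,
                              const wchar_t* locale = LOCALE_NAME_INVARIANT) {
        int needed = LCMapStringEx(locale, LCMAP_LOWERCASE, s.c_str(),
                                   static_cast<int>(s.size()),
                                   nullptr, 0, nullptr, nullptr, 0);
        if (needed <= 0) return {};
        std::wstring out(needed, L'\0');
        LCMapStringEx(locale, LCMAP_LOWERCASE, s.c_str(),
                      static_cast<int>(s.size()),
                      &out[0], needed, nullptr, nullptr, 0);
        return out;
    }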
> That said, 99% time when doing upper- or lowercase operation you're interested just in the 7-bit ASCII range of characters.
I think it's more the exact opposite.
The only times I'm dealing with 7-bit ASCII is for internal identifiers like variable names or API endpoints. Which is a lot of the time, but I can't ever think of when I've needed my code to change their case. It might literally be never.
On the other hand, needing to switch between upper, lower, and title case happens all the time, always with people's names and article titles and product names and whatnot. Which are never in ASCII because this isn't 1990.
> Which are never in ASCII because this isn't 1990.
This is a very silly statement. I'm willing to believe that you have lots of cases where those things are outside the ASCII range. Perhaps even most of the cases, depending on where you live. But I do not believe for one second that it never happens.
Never stored in ASCII, never limited to ASCII. They're UTF-8, usually.
If somebody's name happens to fit into ASCII that's irrelevant because it's not guaranteed, so you can never blindly do an ASCII case conversion.
For text data meant for users, I literally cannot remember the last time I used a string in ASCII format as opposed to UTF-8 (or UTF-16 in JS). It's certainly over a decade ago.
So yes, when I say never, I literally mean never. Nothing "very silly" about it, sorry.
(Again, excepting identifiers, where case conversion is not generally applicable.)
And you could argue that if the internal identifiers need to be capitalized or lower-cased, you've already lost.
On an enterprise app these little string manipulations are a drop in the bucket. In a game they might not be. Sort that stuff out at compile time, or commit time.
You can't always control the case you get but often you can not care about anything outside ASCII. Scripts and configuration or text-based data formats are common examples.
Yes please, keep making software that mangles my actual last name at every step of the way. 99% of the world loves it when you only care about the USA.
Usually they'll accept it, but some parts of the backend are still running code from the 60's.
So you get your name rendered properly on the web interface and in most core features, but one day you wander off the beaten path by, say, requesting some insurance contract, and you'll see your name at the top with some characters mangled, depending on what your name's like. Mine is just accented Latin characters, so it usually drops the accents; not sure how it would work if your name was in an entirely different alphabet.
Guess what, I'm part of this 70% and I also work in a bank and I know exactly how.
Not a single letter in my name (any of them) can be represented with ASCII. When it is represented in UTF-8, most of the people who have to see it can't read it anyway.
So my identity document issued by the country which doesn't use Latin alphabet includes ASCII-representation of my name in addition to canonical form in Ukrainian Cyrillic. That ASCII-rendering is happily accepted by all kinds of systems that only speak ASCII.
People still can't pronounce it and it got misspelled like yesterday when dictated over the phone.
Now, regarding the accents: it's illegal not to support them per GDPR (as per case law, discussed here a few years ago).
Why can't these people understand that, for that 70% of the world, ASCII is "the computer language", not English, and UTF-8 is "whatever soup that only works inside files and forms and can't be manipulated programmatically"?
Maybe it needs to be communicated more often, like way more often, until it sticks.
It’s totally reasonable to assume your users are in the US if your business only sells to people in the US. I work in the health insurance sector; there’s absolutely no chance my company ever sells these products internationally. We can’t even sell them in every state.
It's reasonable to assume that all users can deal with having to encode their names in 7-bit ASCII. Otherwise you might as well demand that computer systems need to support arbitrary drawings in the name field at which point you might as well not have a name field at all because even most humans won't be able to deal with what you want to put in there.
No, when you are doing string manipulation you are almost never interested in just the seven-bit ASCII range, as there is almost no language that can be written using just that.
Right. That’s why I still get mail with my name mangled and my street name barely recognisable. Because I’m in the 1%. Too bad for me…
In all seriousness, though, in the real world ASCII works only for a subset of a handful of languages. The vast majority of the population does not read or write any English in their day to day lives. As far as end users are concerned, you should probably swap your percentages.
ASCII is mostly fine within your programs like the parser you mention in your other comment. But even then, it’s better if a Chinese user name does not break your reporting or logging systems or your parser, so it’s still a good idea to take Unicode seriously. Otherwise, anything that comes from a user or gets out of the program needs to behave.
I said use a Unicode library if input data is actual human language. Which names and addresses are.
The 99% case being ASCII data generated by other software of unknown provenance. (Or sometimes by humans, but it's still data for machines, not for humans.)
User-provided data, yes, but also data where you can treat non-ASCII bytes as garbage in -> garbage out. E.g. the config file might be typed by a human but if you need to support case-insensitive keys you still don't need to worry about Unicode.
"The vast majority of the population does not read or write any English in their day to day lives."
This is doubtful: https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
While English speakers are not a majority, it is the most popular language.
And one should also note that given English is the lingua franca of programming, I'd suspect that English as a second language is actually a majority for programmers.
So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.
> "The vast majority of the population does not read or write any English in their day to day lives." This is doubtful: https://en.wikipedia.org/wiki/List_of_languages_by_total_num... While English speakers are not a majority, it is the most popular language.
That is the number of English-speaking people, as in people who can speak English. Not necessarily people who use it every day. In any case, ASCII only works for a subset of even English if you ignore all loan words and diacritics in things like proper names.
> So any code that deals solely with programmers as users can easily just use standard ASCII as default, and never see any problems.
That would not be much code at all, given that most code deals with user interfaces or user-provided data. That is the point: it’s not because the code is in basic English simplified enough to fit in ASCII that you can ignore Unicode and don’t need to consider text encoding.
> That’s why I still get mail with my name mangled
Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.
> it’s better if a Chinese user name does not break your reporting or logging systems
You should not be casually dumping Chinese usernames into logs without warnings; in fact, you should not be using Chinese characters for usernames at all. Lots of Chinese online services exclusively use numeric IDs and e-mails as login IDs. "Usernames in natural human language" is a valid concept only in the ASCII cultural sphere.
> Which is why you always type out addresses in ASCII representations in any foreign transactions even if it's not going to match your identity documents, unless the other party specifically demands it in UTF-8 and insists that they can handle it.
That is not always possible, and the translation from a local writing system to ASCII is often ambiguous rather than unique. There really is no excuse for this sort of thinking. Even American programmers have to realise at some point that programs serve some purpose, and that their failure to represent how the world works is just that: a failure. There is no excuse for programs not to support UTF-8 from user input to any output, including all the processing in between.
It's funny how software developers live in bubbles so much. Whether you deal with human language a lot or almost not at all depends entirely on your specific domain. Anyone working on user interfaces of any kind must accommodate for proper encoding, for example; that includes pretty much every line-of-business app out there, which is a lot of code.
Every search feature everywhere has to be case-insensitive or it's unusable. Search seems like a pretty ubiquitous feature in a lot of software, and has to work regardless of locale/encoding.
Converting string case is almost never something you want to do for text that's displayed to the end user, but there are many situations where you need to do it internally. Generally when the spec is case insensitive, but you still need to verify or organize things using string comparison.
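A sketch of that internal use (names mine): comparing spec-defined ASCII tokens such as HTTP header names, deliberately locale-free so the behavior never changes under tr_TR and friends.

    #include <cstddef>
    #include <string_view>

    constexpr char ascii_lower(char c) {
        return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + ('a' - 'A')) : c;
    }

    constexpr bool ascii_iequals(std::string_view a, std::string_view b) {
        if (a.size() != b.size()) return false;
        for (std::size_t i = 0; i < a.size(); ++i)
            if (ascii_lower(a[i]) != ascii_lower(b[i])) return false;
        return true;
    }

    // ascii_iequals("Content-Length", "content-length") == true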
Ah, I see, we disagree on what "human language" is. An abbreviation like HTML and its different capitalisations sound a lot like a feature of human language to me.
Is this a serious argument? Humans don't directly use HTML to communicate with each other. It's a document markup language rendered by user agents, developed against a specification.
Markup languages, and SGML in particular, absolutely are designed for digital text communication by humans and to be written using plain text editors; that's kind of the entire point of avoiding binary data constructs.
And to GP: SGML/HTML actually has a facility to define uppercasing rules beyond ASCII, namely the LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR options in the SYNTAX NAMING section of the SGML declaration, introduced in the "Extended Naming Rules" revision of ISO 8879 (the SGML standard, cf. https://sgmljs.net/docs/sgmlrefman.html). Like basically everything else on this level, these rules are still used by HTML 5 to this date; in particular, while element names can contain arbitrary characters, only those in the IRV (ASCII) get case-folded for canonicalization.
HTML is a text-based medium. But that doesn't make it a human language. Some human languages are not text-based. And some text is not a human language.
ANSI C was designed to be written by humans using a plain text editor. That doesn't make it a human language.
But but, I want to have a custom web component and register it under my own name, which can only be properly written in Ukrainian Cyrillic. How dare you not let me have it.
I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.
The other normal cases of string usage are file paths and user interface elements, and the needed operations can be done with simple string functions; even in UTF-8 encoding, the characters you care about are in the ASCII range.
With file paths the manipulations that you're most often doing is path based so you only care about '/', '\', ':', and '.' ASCII characters.
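A sketch of that kind of path manipulation (function name mine): only the ASCII separators are inspected, and any UTF-8 bytes in between pass through untouched.

    #include <string>

    // Replace a path's extension, looking only at ASCII '/', '\' and '.'.
    std::string replace_extension(const std::string& path,
                                  const std::string& new_ext /* e.g. ".bak" */) {
        auto slash = path.find_last_of("/\\");
        auto dot = path.find_last_of('.');
        if (dot == std::string::npos ||
            (slash != std::string::npos && dot < slash))
            return path + new_ext;  // no extension present: just append
        return path.substr(0, dot) + new_ext;
    }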
With user interface elements you're likely to be using them as just static data and only substituting values into placeholders when necessary.
> I would argue that for most programs when you're doing string manipulation you're doing it for internal programming reasons - logs, error messages, etc. In that case you are in nearly full control of the strings and therefore can declare that you're only working with ASCII.
Why would you argue that? In my experience it's about formatting things that are addressed to the user, where the hardest and most annoying localization problems matter a lot. That includes sorting the last name "van den Berg" just after "Bakker", stylizing it as "Berg, van den", and making sure this capitalization is correct and not "Van Den Berg". There is no built in standard library function in any language that does any of that. It's so much larger than ascii and even larger than unicode.
Another user said that the main takeaway is that you can't process strings until you know their language (locale), and that is exactly correct.
I would maintain that your program has more string manipulation for error messages and logging than for generating localised formatted names.
Further, I do say that if you're creating text to present to the user, the most common operation is replacement of some field in pre-defined text.
In your case I would design it so that the correctly capitalised first name, surname, and the variations of those needed for sorting are generated at the data-entry point (manually or automatically) and then just used as-is in user-facing text generation. Then the only string operation needed is the replacement of placeholders, which fmt and the standard library provide. This uses more memory and storage, but those are cheap now.
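A sketch of that placeholder-only operation (using C++20 std::format as a stand-in; fmt works the same way):

    #include <format>
    #include <string>

    // display_name was produced, correctly cased, at data-entry time;
    // presentation only substitutes it into a pre-translated template.
    std::string greeting(const std::string& display_name) {
        return std::format("Welcome back, {}!", display_name);
    }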
I agree, but the logging formatters don't really do much beyond trivially pasting in placeholders.
And as for data entry... Maybe in an ideal world. In the current world, marred by importing previously mangled datasets, a common solution in the few companies I've worked at is to just not do anything, which leaves ugly edges, yet is "good enough".
File paths are scary. The last I checked (which is admittedly a while ago), Windows didn't for example care about correct UTF-16 surrogate pairs at all, it'd happily accept invalid UTF-16 strings.
So use standard string processing libraries on path names at your own peril.
It's a good idea to consider file paths as a bag of bytes.
IIRC, the FAT filesystem (before Windows 95) allowed lowercase letters, but there's a layer in the filesystem driver that converted everything to uppercase, e.g. if you did the command "more readme.txt", the more command would ask the filesystem for "readme.txt" and it would search for "README.TXT" in the file allocation table.
I think I once hex-edited the file allocation table to change a filename to have a lowercase letter (or maybe it was disk corruption); trying to delete that file didn't work, because the delete would look for "FOO" and couldn't find it, since the file was named "FOo".
> It's a good idea to consider file paths as a bag of bytes
(Nitpick: sequence of bytes)
Also very limiting. If you do that, you can’t, for example, show a file name to the user as a string or easily use a shell to process data in your file system (do you type “/bin” or “\x2F\x62\x69\x6E”?)
Unix, from the start, claimed file names were byte sequences, yet assumed many of those bytes encoded ASCII.
That's what I mean: you treat filesystem paths as bags of bytes separated by known ASCII characters, as the only path manipulation you generally need is appending a path, removing a path, or changing an extension - things that only care about those ASCII characters. You only modify the path strings at those known characters and leave everything in between as-is (with some exceptions, using OS-specific API functions as needed).
Just using UTF-8 for the user name at all is problematic. That has been a major PSA item for Windows users in my language literally since the 90s, and it still is. Microsoft switched home folder names from the Microsoft Account username to a shortened form of the user's email for that reason.
Yes and most importantly, that interpretation is for display purposes ONLY. If your file manager won't let me delete a file because the name includes invalid UTF-16/UTF-8 then it is simply broken.
Better to just convert WTF-16 (Windows filenames are not guaranteed to be valid UTF-16) to/from WTF-8 at the API boundary and then do the same processing internally on all platforms.
UNIX land is even worse in international language support.
As in, there isn't even anything in POSIX at the level other operating systems support for localisation.
Yes, there is some locale stuff, but not enough for everything, hence why every modern programming language ships this as part of its standard library.
Wow, I came here to write exactly that, and it's heartening to see that I am not crazy.
Just reading the title, with microsoft.com in brackets, I knew two things:
1. It would be written by Raymond Chen
2. That article is going to be awesome
> From the article: "I find it quaint that Unicode character names are ALL IN CAPITAL LETTERS, in case you need to put them in a Baudot telegram or something."
I had to do that. When we had our steampunk telegraph office at steampunk conventions [1], people could text in a message via SMS, it would be printed on a Model 14 or 15 Teletype, put in an envelope, and hand-delivered. People would use emoji in messages, and the device could only print Baudot, or International Telegraphic Alphabet #2, which is upper case only with some symbols.
Emoji translation would cause the machine to hammer out the emoji's official Unicode name, in all capitals.
Handle text in two ways: either it's controlled by you and you can do simple, efficient, and naive processing, or it's not (it's translated resources, or user input) and you can't.
For the former case, you don't need any complex logic. A very typical example would be: I'm serializing a field or constructing a URL, so I want the variable name "Someproperty" as a lowercase string. The lowercase transform is completely naive. I know exactly what the range of possible characters is, and they aren't going to be Turkish or emoji, not least because I have asserted they won't be. And THIS is what the regular programming functions for upper/lower case are for. They are important, and they are most often correct. Because for all the other cases (i18n, user input, ...) you probably don't want to do toUpper/toLower at all to begin with!
Example: if you present a message to the user from resources, so your code is translate("USER_DIALOG_QUESTION_ABOUT_FISH"), which you look up knowing it will be in sentence case but need to present as uppercase, what will you do? Here you likely can't, and shouldn't, do toUpper(translate(resourceKey)). Just use two resources if you want correctly transformed text. The toUpper function isn't made for this.
Trying to use a complex i18n-ready toUpper/toLower only helps part of the way. It still might not understand whether a double S should be contracted, or whether something is a proper noun and must stay capitalized. So it adds complexity and still isn't correct. Just use two resources!
> For the former case, you don't need any complex logic. A very typical example would be: I'm serializing a field or constructing a URL, so I want the variable name "Someproperty" as a lowercase string. The lowercase transform is completely naive. I know exactly what the range of possible characters is, and they aren't going to be Turkish or emoji, not least because I have asserted they won't be. And THIS is what the regular programming functions for upper/lower case are for. They are important, and they are most often correct. Because for all the other cases (i18n, user input, ...) you probably don't want to do toUpper/toLower at all to begin with!
C++ std::tolower/toupper (which are really just C tolower/toupper) are the wrong tool for that too, though, because they depend on the process locale, which makes them a) horribly inefficient and b) prone to blow your program up in interesting ways on customer systems. Not quite as bad as the locale-dependent standard number-parsing functions that want '.' in some locales and ',' in others, but they still should never be used.
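For illustration, here is what careful use of std::tolower has to look like, and where the fragility comes from (a sketch):

    #include <cctype>
    #include <clocale>
    #include <iostream>

    int main() {
        // The result of std::tolower depends on the global C locale, which
        // any part of the program can change at runtime.
        std::setlocale(LC_CTYPE, "C");

        // The argument must be representable as unsigned char (or be EOF);
        // passing a plain char that happens to be negative is undefined
        // behavior, hence the cast.
        char c = 'A';
        char lower = static_cast<char>(
            std::tolower(static_cast<unsigned char>(c)));
        std::cout << lower << '\n'; // prints "a"
    }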
Small nitpick: the example "LATIN SMALL LETTER SHARP S (“ß” U+00DF) uppercases to the two-character sequence “SS”:³ Straße ⇒ STRASSE" is slightly wrong, it seems to me, as we now actually do have an uppercase version of it, so it should uppercase to LATIN CAPITAL LETTER SHARP S (U+1E9E).
The double-S thing is still widely used, though.
Duden mentions this: "When writing in capital letters, SS traditionally stands for ß. In some typefaces, however, there is also a corresponding capital letter; its use is optional ‹§ 25 E3›."
But isn't it also dependent on the glyphs available in the font used? So, e.g., it needs to be ensured that U+1E9E exists?
I don't think there exists any code that makes uppercasing decisions based on the selected font. Besides, if it doesn't exist in the current font, there's probably a fallback font.
But what if you need to uppercase a historical record in a vital-records registry, written in the 1950s but OCRed last week? Now you need to be not just locale-aware; your locale should be versioned.
Not just a Swiss user, as there are many German words that use ss and not ß. And having an ss where there should be a ß will be a lot less disruptive than the inverse, because people are used to ASCII limitations.
> And in certain forms of the French language, capitalizing an accented character causes the accent to be dropped: à Paris ⇒ A PARIS.
That's incorrect: using diacritics on capital letters is always the preferred form; it's just that dropping them is acceptable, as it was often done for technical reasons.
I generally just use the language-supported tolower/toupper() (or similar) routines. I assume that they take things like UTF encodings and alternative writing systems into account.
I'm not sure about other languages, but Swift has pretty intense String support[0], and can go quite a long ways.
Someone actually wrote a whole book about just Swift Strings[1].
Strings in C++ standard library do suck (and C++ is my favorite language).
As for UTF-16, well, I don't know that UTF-8 is a whole lot more intuitive:
> And for UTF-8 data, you have the same issues discussed before: Multibyte characters will not be converted properly, and it breaks for case mappings that alter string lengths.
Surrogate pairs aren't more complex than UTF-8's scheme for determining the number of bytes used to represent a code point. (Arguably the logic is slightly simpler.) But the important point is that UTF-16 pretends to be a constant-length encoding while actually having the surrogate-pair loophole - that's because it's a hack on top of UCS-2 (which originally worked well enough for Microsoft to get married to; but then the BMP turned out not to be enough code points). UTF-8 is clearly designed from scratch to be a multi-byte encoding (and, while the standard now makes the corresponding sequences illegal, the scheme was designed to be able to support much higher code points - up to 2^42 if we extend the logic all the way; hypothetical 6-byte sequences starting with values FC or FD would neatly map up to 2^31).
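To see the comparison concretely, here is roughly how length determination works in each encoding (a sketch):

    #include <cstdint>

    // UTF-8: the lead byte alone tells you the sequence length.
    int utf8_sequence_length(uint8_t lead) {
        if (lead < 0x80) return 1;         // 0xxxxxxx
        if ((lead >> 5) == 0x6)  return 2; // 110xxxxx
        if ((lead >> 4) == 0xE)  return 3; // 1110xxxx
        if ((lead >> 3) == 0x1E) return 4; // 11110xxx
        return -1;                         // continuation byte or invalid
    }

    // UTF-16: the first code unit likewise tells you the length --
    // arguably even simpler than the UTF-8 bit patterns.
    int utf16_sequence_length(uint16_t unit) {
        if (unit >= 0xD800 && unit <= 0xDBFF) return 2;  // high surrogate
        if (unit >= 0xDC00 && unit <= 0xDFFF) return -1; // stray low surrogate
        return 1;                                        // BMP code point
    }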
First, you should consider if you even need case folding; for many uses it will be unnecessary, anyways.
Furthermore, the proper way to do case folding will depend on such things as the character set, the language, the specific context of the text being converted (e.g. in some cases specific letters are required, such as abbreviations of the names of SI units), etc. And then, it is not necessarily only "uppercase" and "lowercase", anyways.
There might even be different ways to do it within the same language, with possible disagreements about usage (e.g. the German eszett did not have an official capital form until 2017, although apparently some type designers made one anyway (and it was in Unicode before then, despite that)).
If the character set is Unicode, then there is not actually one correct way to do it, despite what the Unicode Conspiracy insists.
Also, for some uses there will be a specific way the conversion is required to work (because of how a file format or a protocol or whatever works), so in such a case, if the character set is anything other than ASCII, you cannot just assume that it will always work the same way.
You also cannot necessarily depend on the locale for such a thing, since it might depend on the data, as well.
These things can be bad enough on their own, but Unicode makes them worse. If a program requires a specific case folding, it will not work correctly under the wrong version of Unicode, which can turn into a security issue and/or other problems.
(Another problem, which applies even if you do not use case folding, is that some people think that all text is or should be Unicode and that one character set is suitable for everything. Actually, one character set cannot be suitable for everything, regardless of what character set it is. Even if it was (which it isn't), it wouldn't be Unicode.)
nothing about working with locales, or text in general, is basic. we were decades into working with digital computers before we moved past switchboards and LEDs. don't take for granted just how high of a perch upon the shoulders of giants you have. that's exactly how the mistakes in the blog post get made.
> Okay, so those are the problems. What’s the solution?
> If you need to perform a case mapping on a string, you can use LCMapStringEx with LCMAP_LOWERCASE or LCMAP_UPPERCASE, possibly with other flags like LCMAP_LINGUISTIC_CASING. If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
The correct thing to do is to not do it at all. If text is 3rd-party supplied, treat it like an opaque byte sequence. Alternatively, pay a well-trained human to do it by hand.
All other options are going to result in edge cases where you're not handling it properly. It's like trying to programmatically split a full name into a first name and a last name: language doesn't work like that.
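For anyone who does take the API route the quoted article recommends, ICU usage looks roughly like this (a sketch; assumes a modern ICU build where UChar is char16_t):

    #include <unicode/ustring.h>
    #include <cstdio>

    int main() {
        const UChar src[] = u"Straße";
        UChar dest[32];
        UErrorCode status = U_ZERO_ERROR;
        // The locale argument matters: "tr" uppercases 'i' to 'İ' (U+0130),
        // while "de" gives the default full mapping, ß -> SS.
        int32_t len = u_strToUpper(dest, 32, src, -1, "de", &status);
        if (U_SUCCESS(status))
            std::printf("%d UTF-16 code units\n", len); // "STRASSE": 7 units
    }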
Thank you for this universal approach. I can now toggle capitalization on/off for any character, instead of just being limited to alphabetic ones!
Jokes aside, I was kinda hoping for a good answer that doesn't rely on a Windows API or an external library, but I'm not sure there is one. It's a rather complex problem when you account for more than just ASCII and the English language.
C is hard, and it seems like C++ just made things way harder. I don't regret skipping it. Why not go right to Java, C#, JS, Haskell, etc., and drop down to C for the parts you need?
Man, I'm happy we don't need to deal with this crap in Rust, and we can just use String::to_lowercase. Not having to worry about things makes coding fun.
While certainly much better, you still need to be aware that doing case conversion absent any locale information will never be perfect. If you want proper locale-aware conversion you can use the icu crate (https://docs.rs/icu/latest/icu/).
A footnote in the article provides the following explanation:
"The standard imposes this limitation because the implementation may need to add default function parameters, template default parameters, or overloads in order to accomplish the various requirements of the standard."
...and that is why you use QString if you are using the Qt framework. QString is a string class that actually does what you want when used in the obvious way. It probably helps that it was mostly created by people with "ASCII+" native languages. Or with customers that expect not exceedingly dumb behavior. The methods are called QString::toUpper() and QString::toLower() and take only the implicit "this" argument, unlike Win32 LCMapStringEx() which takes 5-8 arguments...
Qt does have a locale-aware equivalent (QLocale::toUpper/toLower) which calls out to ICU if available. Otherwise it falls back to the QString functions, so you have to be confident about how your build is configured. Whether it works or not has very little to do with the design of QString.
I don't see a problem with that. You can have it done locale-aware or not and "not" seems like a sane default. QString will uppercase 'ü' to 'Ü' just fine without locale-awareness whereas std::string doesn't handle non-ASCII according to the article. The cases where locale matters are probably very rare and the result will probably be reasonable anyway.
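For example (a sketch showing the behavior described in this thread: Unicode case mapping with no locale setup at all):

    #include <QString>
    #include <QDebug>

    int main() {
        QString s = QString::fromUtf8("über");
        qDebug() << s.toUpper();                  // "ÜBER", no locale needed
        qDebug() << QString("STRASSE").toLower(); // "strasse", not "straße" --
                                                  // no function can recover
                                                  // the ß without context
    }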