The issue is that the concept of “length” without additional context no longer applies to Unicode strings. There isn’t one length; there are four or five, at least (a quick sketch after the list illustrates).
1. Length in the context of storage or transmission is the number of bytes of its UTF-8 representation - and even that depends on which normalization form (composed or decomposed) you pick.
2. Length in terms of number of visible characters is the number of grapheme clusters.
3. Length in terms of visible space on screen is the number of pixels wide and tall when rendered in a given font.
4. Length when parsing is based on the number of code points.
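A minimal Swift sketch of the difference (#3 can't be computed from the string alone; the other numbers assume a recent toolchain):

    let s = "Straße"
    print(s.utf8.count)            // 7 -- #1: bytes of the UTF-8 encoding (ß takes 2)
    print(s.unicodeScalars.count)  // 6 -- #4: code points
    print(s.count)                 // 6 -- #2: grapheme clusters ("visible characters")
    // #3 only exists once a font and a layout engine enter the picture.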
If you look at it that way it makes perfect sense that the “length” of a string could change when any operation is performed on it - because there’s no such thing as a canonical length.
tl;dr bug or not, the concept of a canonical string length is an anachronism from the pre-Unicode days when all those were the same thing. There’s no such thing as a string length anymore.
Except almost everyone always means #2. No one asked for strings to be ruined in this way, and this kind of pedantry has caused untold frustration from developers who just want their strings to work properly.
If you must expose the underlying byte array do so via a sane access function that returns a typed array.
As for “string length in pixels”, that has absolutely nothing to do with the string itself as that’s determined in the UI layer that ingests the string.
Until the string has to be stored in a database. Or transmitted over HTTP. Or copy-pasted in Windows running Autohotkey. Or stored in a logfile. Or used to authenticate. Or used to authorize. Or used by a human to self-identify. Or encoded. Or encrypted. Or used in an element on a web page. Or sent in an email to 12,000,000 users, some of whom might read it on a Windows 2000 box running Nutscrape. Or sent to a vendor in China. Or sent to a client in Israel. Or sent in an SMS message to 12,000,000 users, some of whom might read it on a Nokia 3310. Or sent to my exwife.
The English-speaking world has developed an intuition about strings from ASCII that simply fails when it comes to Unicode, and that basically explains a lot of these pitfalls.
String length as defined by #2 is also fairly complex when it comes to some languages such as Hindi. There are some symbols in Hindi which are not characters and can never exist as their own character, but when placed next to a character they create a new character. So when you type them out on a keyboard you have to hit two keys, but only one character will appear on screen. Unicode, too, represents this as two separate code points, but to the human eye it is one character.
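A hedged Swift sketch of that Hindi case, using DEVANAGARI LETTER KA plus the dependent vowel sign I (two keystrokes, two code points, one visible unit कि); the cluster count assumes segmentation per the current UAX #29 rules:

    let kaI = "\u{0915}\u{093F}"     // क + ि
    print(kaI.unicodeScalars.count)  // 2 -- stored as two code points
    print(kaI.count)                 // 1 -- one grapheme cluster, i.e. one "visible character"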
I would consider ligatures a text rendering concept, which allows for but is distinct from the linguistic concept described by GP.
Edit: to further illustrate my point, in the ligatures I'm familiar with (including the ones in your link), the component characters exist standalone and can be used on their own, unlike GP's example.
In the example "Straße", the ß is, in fact, derived from an ancient ligature for sz.
Old German fonts often had s as ſ, and z as ʒ. This ſʒ eventually became ß.
We (completely?) lost ſ and ʒ over the years, but ß was here to stay.
Its usage changed heavily over time (coming to replace ss rather than sz), I think for the last time in the 90s (https://en.wikipedia.org/wiki/German_orthography_reform_of_1...), when the rules for when to use ß and when to use ss changed.
So while we do replace ß with ss if we uppercase or have no ß available on the keyboard, no one would ever replace ß by sz (or even ſʒ) today, unless for artistic or traditional reasons.
Many people uppercase ß with lowercase ß or, for various reasons, an uppercase B. I have yet to see a real world example of an uppercase ẞ, it does not seem to exist outside of the internet.
For example, "Straße" could be seen capitalized in the wild as STRAßE, STRASSE, STRABE, with Unicode it could also be STRAẞE. It would not be capitalized with sz (STRASZE) or even ſʒ (STRAſƷE – there is no uppercase ſ) – at least not in Germany. In Austria, sz seeems to be an option.
So, for most ligatures I would agree with you, but specifically ß is one of those ligatures I would call an outlier, at least in Germany.
P.S.: Maybe the ampersand (&), which is derived from ligatures of the Latin "et", has sometimes similar problems, although on a different level, since it replaces a whole word. However, I have seen it being used as part of "etc.", as in "&c." (https://en.wiktionary.org/wiki/%26c.), so your point might also hold.
P.P.S.: I wonder why the uppercasing in the original post did not use ẞ, but I guess it is because of the rules in https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.... (link taken from the feed). The Wikipedia entry says we adopted the capital ẞ in 2017 (but it has been part of Unicode since 2008). It also states that the replacement SZ should be used if the meaning would otherwise get lost (e.g. "in Maßen" vs. "in Massen" would both be "IN MASSEN" but mean either "in moderate amounts" or "in masses", forcing the first to be capitalized as MASZEN). I doubt any programming language or library handles this. I would not have even handled it myself in a manual setting, as it is such an extreme edge case. And when I read it, I would stumble over it.
JavaScript's minimal standard library is of course not great, but there are libraries which can help, e.g. grapheme-splitter, although it's not language-aware by design, so in this instance it'll return 2.
We even already had something like this in pure ASCII: "a\bc" has "length" 3 but appears as one glyph when printed (assuming your terminal interprets backspace).
Except for SMS, #2 should work fine for all of those uses. And the reason you'd need a different measure for SMS is because SMS is bad.
And half the things on your list are just places that Unicode itself might fail, regardless of what the length is? That has nothing to do with how you count, because the number you get won't matter.
No, #2-length is a higher level of abstraction, for display to humans. E.g. calculating pixel widths for sizing a column in a GUI table, etc.
I don't think you understood the lower level of abstraction in the examples that gp (dotancohen) laid out. (Encrypting/compressing/transmitting/etc strings.) In those cases, you have to know the exact count of bytes and not the composed graphemes that collapse to a human-visible "length".
In other words, the following example code to allocate a text buffer array would be wrong:
char *t = malloc(length_type_2); // bug: the grapheme-cluster count can be far smaller than the UTF-8 byte count, so the buffer is undersized
>is a higher level of abstraction, for display to humans
Am I missing something here? Who do you think is actually writing this stuff? Aliens?
I just want the first 5 characters of an input string, or to split a string by ":" or something.
I have plenty of apps where none of my strings _ever_ touch a database and are purely used for UI. I'll ask this again: Why break strings _globally_ to solve some arcane <0.0001% edge case instead of fixing the interface at the edges for that use case?
All this mindset has done is cause countless hours of harm/frustration in the developer community.
It's causing frustration because it's breaking people's preconceived notions about how text is simple. Previously, the frustration was usually inflicted on the end users of apps written by people with those preconceived notions. And there are far more end users than developers, so...
>Am I missing something here? [...] I have plenty of apps where none of my strings _ever_ touch a database and are purely used for UI.
Yes, the part you were missing is that I was continuing a discussion about a specific technical point brought up by the parent commenters Dylan16807 and dotancohen. The fact that you manipulate a lot of strings without databases or persistence/transmission across other i/o boundaries is not relevant to the context of my reply. To restate that context... Dylan16807's claim that length_type_#2 (counting human-visible grapheme clusters) is all the information needed for i/o boundaries is incorrect and will lead to buggy code.
With that aside, I do understand your complaint in the following:
>I just want the first 5 characters of an input string, or to split a string by ":" or something.
This is a separate issue about ergonomics of syntax and has been debated often. A previous subthread from 6 years ago had the same debate: https://news.ycombinator.com/item?id=10519999
In that thread, some commenters (tigeba, cookiecaper) had the same complaint as you that Swift makes it hard to do simple tasks with strings (e.g. IndexOf()). The other commenters (pilif, mikeash) respond that Swift makes the high-level & low-level abstractions of Unicode more explicit. Similar philosophy in this Quora answer by a Swift compiler contributor highlighting tradeoffs of programmers not being aware if string tasks are O(1) fast -vs- O(n) slow:
And yes, your complaint is shared by many... By making the string API explicitly confront "graphemes -vs- code units -vs- code points -vs- bytes", you do get clunky, cumbersome syntax such as "string.substringFromIndex(string.startIndex.advancedBy(1))" as in the Q&A answer:
I suppose Swift could have designed the string API to your philosophy. I assume the language designers anticipated it would have led to more bugs. The tradeoff was:
- "cumbersome syntax with less bugs"
...or...
- "easier traditional syntax (like simple days of ASCII) with more hidden bugs and/or naive performance gotchas"
The main reason I care about the length of strings is to limit storage/memory/transmission size. 1) and to a lesser degree 4) achieve that (max 4 utf8 bytes per codepoint).
2) comes with a lot of complexity, so I'd only use 2) in a few places where string lengths need to be meaningful for humans and when stepping through a string in the UI.
* One (extended) grapheme cluster can require a lot of storage (I'm not sure if it's even bounded at all), so it's unsuitable for length limitations (see the sketch after this list)
* Needs knowledge of big and regularly updated unicode tables. So if you use it in an API chances are high that both sides will interpret it differently, so it's unsuitable for API use.
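A sketch of the first point, assuming a recent Swift toolchain: one base letter with 50 combining accents is still a single grapheme cluster, but costs 101 UTF-8 bytes.

    let zalgo = "e" + String(repeating: "\u{0301}", count: 50)  // e + 50 combining acute accents
    print(zalgo.count)       // 1
    print(zalgo.utf8.count)  // 101 (1 byte for "e" + 2 bytes per U+0301)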
> One (extended) grapheme cluster can require a lot of storage (I'm not sure if it's even bounded at all), so it's unsuitable for length limitations
You could take inspiration from one of the annexes and count no more than 31 code points together.
> Needs knowledge of big and regularly updated unicode tables. So if you use it in an API chances are high that both sides will interpret it differently, so it's unsuitable for API use.
It depends on what you're doing. That could cause problems, but so could counting bytes. Users will not be thrilled when their character limit varies wildly based on which characters they use. Computers aren't going to be thrilled either when replacing one character with another changes the length. Or, you know, when they can't uppercase a string.
And well, I don’t think anyone’s arguing that a “length” function needs to be exposed by the standard library of any language (I saw that Swift does btw).
Aside from byte count, it’s the only interpretation of “length” that I think makes sense for users (like I mentioned above, validating that a name has at least length 1, for example)
One problem with grapheme clusters as length is that you can't do math with it. In other words, string A (of length 1) concatenated with string B (of length 1) might result in string C that also has length 1. Meanwhile, concatenating them in the reverse order might make a string D with length 2.
This won't matter for most applications but it's something you may easily overlook when assuming grapheme clusters are enough for everything.
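A small Swift sketch of that, using a combining accent (a lone combining mark counts as its own degenerate cluster):

    let a = "e"           // count == 1
    let b = "\u{0301}"    // COMBINING ACUTE ACCENT, count == 1 on its own
    print((a + b).count)  // 1 -- "é": the accent merges with the preceding "e"
    print((b + a).count)  // 2 -- a dangling accent, then an "e"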
> Grapheme clusters and no. Let’s see if anyone has a different interpretation or unsupported use-case.
Here's one. How many characters is this: שלום
And how about this: שַלוֹם
Notice that the former is lacking the diacritics of the latter. Is that more characters?
Maybe a better example: ال vs ﻻ. Those are both Alif-Lamm, and no that is not a ligature. The former is often normalized to the latter, but could be stored separately. And does adding Hamza ء add a character? Does it depend if it's at the end of the word?
I completely lack the cultural context to tell if this is trivial or not.
Naive heuristic: look at the font rendering on my Mac:
ء
: 1 (but it sounds like it can be 0 depending on context)
ﻻ
: 1
ال
: 2
שלום
: 4
שַלוֹם
: 4
Do you see any issues?
In Japanese, which I am more familiar with, maybe it’s similar to ゙, which as you can see here has width 1 but is mostly used as a voicing sound mark as in turning か (ka) into が (ga), thus 0 (or more properly, as part of a cluster with length 1)
The use-case I’m getting at is answering the question “does this render anything visible (including white space)”? I think it’s a reasonable answer on the whole falsehood-programmers-believe-about-names validation deal (after trimming leading and trailing whitespace).
It’s the most intuitive and practically useful interpretation of “length” for arbitrary Unicode as text. The fact that it’s hard, complex, inelegant and arguably inconsistent does not change that.
Unicode has a fully spelled out and testable algorithm that covers this. There's no need to make anything up on the spot.
You don't need cultural context. It's not trivial. The algorithm is six printed pages.
No individual part of it is particularly hard, but every language has its own weird stuff to add.
I feel strongly that programmers should implement Unicode case folding with tests at least once, the way they should implement sorting and network connections and so on.
.
> I think it’s a reasonable answer on the whole falsehood-programmers-believe-about-names validation deal
All these "false things programmers believe about X" documents? Go through Unicode just one time, you know 80% of all of them.
I get that you like the framing you just made up, but there's an official one that the world's language nerds have been arguing over for 20 years, including having most of the population actually using it.
I suggest you consider that it might be pretty well developed by now.
.
> The use-case I’m getting at is answering the question “does this render anything visible (including white space)”?
This is not the appropriate approach, as white space can be rendered by non-characters, and characters can render nothing and still be characters under Unicode rules.
.
> It’s the most intuitive and practically useful interpretation of “length” for arbitrary Unicode as text.
It also fails for Mongol, math symbols, music symbols, the zero-width space, the conjoiner system, et cetera.
.
> The fact that it’s hard, complex, inelegant and arguably inconsistent
Fortunately, the real approach is easy, complex, inelegant, and wholly consistent. It's also well documented, and works reliably between programming languages that aren't Perl.
At any rate, you're off in another thread instructing me that the linguists are wrong with their centuries of focus because of things you thought up on the spot, and that Unicode is wrong because it counts spaces when counting characters, so, I think I'm going to disconnect now.
The only real issue that I see is that ﻻ is considered to be two letters, even if it is a single codepoint or grapheme.
Regarding שַלוֹם I personally would consider it to be four letters, but the culture associated with that script (my culture) enjoys dissecting even the shape of the lines of the letters to look for meaning; I'm willing to bet that I'll find someone who will argue it to be six for the context he needs it to be. That's why I mentioned it.
> In Japanese, which I am more familiar with, maybe it’s similar to ゙,
Yes, for the Hebrew example, I believe so. If you're really interested, it would be more akin to our voicing mark dagesh, which e.g. turns V ב into B בּ and F פ into P פּ.
Wait'll you find out that whether ß is one letter or two varies based on which brand of German you're drinking.
Programmers should understand that it doesn't matter if they think the thousands of years of language rules are irrational; Unicode either handles them or it's wrong.
Unicode doesn't exist to standardize the world's languages to a representation that programmers enjoy. It exists to encode all the world's languages *as they are*.
Whether you are sympathetic to the rules of foreign languages isn't super relevant in practice.
> Unicode doesn't exist to standardize the world's languages to a
> representation that programmers enjoy. It exists to encode all
> the world's languages *as they are*.
Thank you for this terrific quote that I'm going to have to upset managers and clients with. This is the eloquence that I've been striving to express for a decade.
If you give a Japanese writer 140 characters, they can already encode double the amount of information an English writer can. Non-storage “character”-based lengths have always been a poor estimation of information encoding so worrying about some missing graphemes feels like you’re missing the bigger problem.
I mean character count. The unicode standard defines that as separate from and meaningfully different than grapheme clusters or codepoints.
The confusion you're relying on isn't real.
.
> Do you want to count zero-width characters?
Are they characters? Yes? Then I want to count them.
.
> More specifically what are you trying to do?
(narrows eyes)
I want to count characters. You're trying to make that sound confusing, but it really isn't.
I don't care if you think it's a "zero width" character. Zero width space is often not actually zero width in programmers' fonts, and almost every font has at least a dozen of these wrong.
I don't care about whatever other special moves you think you have, either.
This is actually very simple.
The unicode standard has something called a "character count." It is more work than the grapheme cluster count. The grapheme cluster count doesn't honor removals, replacements, substitutions, and it does something different in case folding.
The people trying super hard to show how technically apt they are at the difficulties in Unicode are just repeating things they've heard other people say.
The actual unicode standard makes this straightforward, and has these two terms separated already.
If you genuinely want to understand this, read Unicode 14 chapter 2, "general structure." It's about 50 pages. You can probably get away with just reading 2.2.3.
It is three pages long.
It's called "characters, not glyphs," because 𝐞𝐯𝐞𝐧 𝐭𝐡𝐞 𝐚𝐮𝐭𝐡𝐨𝐫𝐬 𝐨𝐟 𝐭𝐡𝐞 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝐬𝐭𝐚𝐧𝐝𝐚𝐫𝐝 𝐰𝐚𝐧𝐭 𝐩𝐞𝐨𝐩𝐥𝐞 𝐭𝐨 𝐬𝐭𝐨𝐩 𝐩𝐫𝐞𝐭𝐞𝐧𝐝𝐢𝐧𝐠 𝐭𝐡𝐚𝐭 "𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫" 𝐦𝐞𝐚𝐧𝐬 𝐚𝐧𝐲𝐭𝐡𝐢𝐧𝐠 𝐨𝐭𝐡𝐞𝐫 𝐭𝐡𝐚𝐧 𝐚 𝐟𝐮𝐥𝐥𝐲 𝐚𝐠𝐠𝐫𝐞𝐠𝐚𝐭𝐞𝐝 𝐬𝐞𝐫𝐢𝐞𝐬.
The word "character" is well defined in Unicode. If you think it means anything other than point 2, 𝒚𝒐𝒖 𝒂𝒓𝒆 𝒔𝒊𝒎𝒑𝒍𝒚 𝒊𝒏𝒄𝒐𝒓𝒓𝒆𝒄𝒕, 𝒏𝒐𝒕 𝒑𝒍𝒚𝒊𝒏𝒈 𝒂 𝒅𝒆𝒆𝒑 𝒖𝒏𝒅𝒆𝒓𝒔𝒕𝒂𝒏𝒅𝒊𝒏𝒈 𝒐𝒇 𝒄𝒉𝒂𝒓𝒂𝒄𝒕𝒆𝒓 𝒆𝒏𝒄𝒐𝒅𝒊𝒏𝒈 𝒕𝒐𝒑𝒊𝒄𝒔.
All you need is on pages 15 and 16. Go ahead.
.
I want every choice to be made in accord with the Unicode standard. Every technicality you guys are trying to bring up was handled 20 years ago.
These words are actually well defined in the context of Unicode, and they're non-confusing in any other context. If you struggle with this, it is by choice.
Size means byte count. Length means character count. No, it doesn't matter if you incorrectly pull technical terminology like "grapheme clusters" and "code points," because I don't mean either of those. I mean 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫 𝐜𝐨𝐮𝐧𝐭.
If you have the sequence `capital a`, `zero width joiner`, `joining umlaut`, `special character`, `emoji face`, `skin color modifier`, you have:
1. Six codepoints
2. Three grapheme clusters
3. Four characters
Please wait until you can explain why before continuing to attempt to teach technicalities, friend. It doesn't work the way you claim.
Here, let's save you some time on some other technicalities that aren't.
1. If you write a unicode modifying character after something that cannot be modified - like adding an umlaut to a zero width joiner - then the umlaut will be considered a full character, but also discarded. Should it be counted? According to the Unicode standard, yes.
2. If you write a zero width anything, and it's a character, should it be counted? Yes.
3. If you write something that is a grapheme cluster, but your font or renderer doesn't support it, so it renders as two characters (for example, they added couple emoji in Unicode 10 and gay couples in Unicode 11, so phones released in between would render a two-man couple as two individual men), should that be counted as a single character as it's written, or a double character as it's rendered? Single.
4. If you're in a font that includes language variants - for example, Swiss German fancy fonts sometimes separate the Eszett (ß) into two distinct S characters that are flowed separately - should that be calculated as one character or two? Two, it turns out.
5. If a character pair is kerned into a single symbol, like an English cursive capital F to lower case E, should that be counted as one character or two? Two.
There's a whole chapter of these. If you really enjoy going through them, stop quizzing me and just read it.
These questions have been answered for literal decades. Just go read the standard already.
You just said there was one canonical length for strings then gave me 4: bytes, code points, grapheme clusters and characters, each of which applies in a different context.
I’m not 100% sure what context characters even apply in on a computer, other than for interest’s sake.
Invisible/zero-width characters are not interesting when editing text, and character count doesn’t correlate with size, therefore there’s no canonical length.
> You just said there was one canonical length for strings then gave me 4: bytes, code points, grapheme clusters and characters, each of which applies in a different context.
The other things have different labels than "length," which you've already been told.
.
> Invisible/zero-width characters are not interesting when editing text
> These words are actually well defined in the context of Unicode, and
> they're non-confusing in any other context.
John, you've set right so many misconceptions here, and taught much. I appreciate that. However, unfortunately, this sentence I must disagree with. Software developers are not reading the spec, and the terms therefore 𝒂𝒓𝒆 confusing them.
As with security issues, I see no simple solution to get developers to familiarize themselves with even the basics before writing code, nor a way to get PMs to vet and hire such developers. Until the software development profession becomes accredited like engineering or law, Unicode and security issues (and accessibility, and robustness, and dataloss, and performance, and maintainability issues) will continue to plague the industry.
It absolutely is, unless you allow xenophilia to overcomplicate the issue. You don't need to completely bork your string implementation to support every language ever concocted by mankind. No one is coding in cuneiform.
According to the CIA, 4.8% of the world's population speaks English as a native language and further references show 13% of the entire global population can speak English at one level or another [0]. For reference, the USA is 4.2% of the global population.
The real statement is that no one asked a few self-chosen individuals who have never travelled beyond their own valley to ruin text handling in computers like the American Standard Code for Information Interchange has.
It seems unfortunate now, but I'd argue it was pretty reasonable at the time.
Unicode is the best attempt I know of to actually account for all of the various types of weirdness in every language on Earth. It's a huge and complex project, requiring a ton of resources at all levels, and generates a ton of confusion and edge cases, as we can see by various discussions in this whole comments section. Meanwhile, when all of the tech world was being built, computers were so slow and memory-constrained that it was considered quite reasonable to do things like represent years as 2 chars to save a little space. In a world like that, it seems like a bit much to ask anyone to spin up a project to properly account for every language in the world and figure out how to represent them and get the computers of the time to actually do it reasonably fast.
ASCII is a pretty ugly hack by any measure, but who could have made something actually better at the time? Not many individual projects could reasonably do more than try to tack on one or two other languages with more ugly hackery. Probably best to go with the English-only hack and kick the can down the road for a real solution.
I don't think anyone could have pulled together the resources to build Unicode until computers were effective and ubiquitous enough that people in all nations and cultures were clamoring to use them in their native languages, and computers would have to be fast and powerful enough to actually handle all the data and edge cases as well.
> The real statement is that no one asked a few self-chosen individuals who have never travelled beyond their own valley to ruin text handling in computers like the American Standard Code for Information Interchange has.
That much is objectively false. Language designers made a choice to use it; they could’ve used other systems.
Also LBJ mandated that all systems used by the federal government use ASCII starting in 1969. Arguably that tied language designers' hands, since easily the largest customer for computers had chosen ASCII.
>no one asked a few self-chosen individuals who have never travelled beyond their own valley to ruin text handling in computers like the American Standard Code for Information Interchange has.
Are you seriously mad that the people who invented computers chose to program in their native language, and built tooling that was designed around said language?
That's a rather uncharitable take on ASCII when you consider the context. Going all the way back to the roots of encodings, it was a question of how much you could squeeze into N bits of data, because bits were expensive. In countries where non-Latin alphabets were used, they used similar encoding schemes with those alphabets (and for those variants, Latin was not supported!).
The primary problem is language/library designers/users believing there must be one true canonical meaning of the word „length“ like you just did, and that „length“ would be the best name for the given interface.
In database code, or more subtly in various filesystem code, the notion of bytes or codepoints might be more relevant.
By the way, what about ASCII control characters? Does carriage return have some intrinsic or clearly well defined notion of „length“ to you?
What about digraphs like ij in Dutch? Are they a singular grapheme cluster? Is this locale dependent? Do you have all scripts and cultures in mind?
And some clients expect that whitespace is not included in string length. "I asked to put 50 letters in this box, why can I only put 42?" would not be an unexpected complaint when working with clients. Even if you manage to convey that spaces are something funny called "characters", they might not understand that newlines are characters as well. Or emojis.
Credit card numbers come to mind: printed, they are often grouped into blocks of four digits separated by whitespace, e.g. "5432 7890 6543 4365" - now try to copy-paste that into a form field of "length" 16.
Ok, that's more of a usability issue and many front end developers seem to be rather disconnected from the real world. Phone number entry is an even worse case, but I digress ...
Since you ask for an example: moving the cursor in a terminal-based text editor requires #2 string length. That said I understand your point about focusing on what is actually needed.
Moving the cursor doesn't require string length, it requires finding the new cursor position. You can compute #2 length by moving the cursor one grapheme cluster at a time until you hit the end, but if you only need to move the cursor once, the length is irrelevant. (Also, cursor positions in a terminal-based text editor don't necessarily correspond to graphemes.)
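A Swift sketch of that distinction: the cursor is just a String.Index, advanced one grapheme cluster at a time, and no length is ever computed.

    let text = "e\u{301}x"              // "éx": e + combining accent, then x
    var cursor = text.startIndex
    cursor = text.index(after: cursor)  // jumps past the whole cluster "é"
    print(text[cursor])                 // "x"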
The Unicode character doesn't really have an intrinsic width; the font will define one (and I think a font could define a glyph even for things like zero-width spaces). But as soon as you want to display text you'll get into all sorts of fun problems and string length is most likely never applicable in any way. At that point the Unicode string becomes a sequence of glyphs from a font with their respective placements and there's no guaranteed relationship between the number of glyphs and the original number of code points, anyway.
Even for this you probably wouldn't want to use #2, because some grapheme clusters exist only to be composed into ligatures and wouldn't be used outside of that composition, so they shouldn't ever form a valid string by themselves. But by the #2 analysis such a cluster would read as "length 1".
No. In fixed width settings, it’s usually very important to distinguish 0-width characters (I’ll use “characters” for grapheme clusters), 1-width characters, and 2-width characters (e.g. CJK characters and emojis). See wcwidth(3).
Very much so. And one may add that the implementation of UAX#11 "East Asian Width"[1] was done in a myopic, backwards-oriented way in that it only distinguishes between multiples 0, 1 and 2 of a 'character cell' (conceptually equivalent to the width one needs in monospace typesetting to fit in a Latin letter). There are many Unicode glyphs that would need 3 or more units to fit into a given text.
0-width characters should get a deprecation warning in Language 2.0, along with silent letters and ones that change pronunciation based on the characters surrounding them.
Honestly I mostly only want #1 on that list, not #2, since most of my stuff is fairly low level (systems programming, network stuff, etc.). So that’s not the best generalisation.
That's funny. As I read this I kept thinking "#1 is obviously the most relevant except for layout and rendering, where #3 is the important one."
But that's because I was thinking about allocation, storage, etc. For logic traversing, comparing, and transforming strings #4 is certainly more useful.
Oh, but you said #2? I guess that is important since it's what the user will most easily notice.
I don't think this is true. Certainly there is no string-length library I'm aware of that handles it that way. The usual default these days (correct or not) is #4 -- length is the number of unicode code points.
In Swift a character is a grapheme cluster, and so the length (count) of a string is in fact the number of grapheme clusters it contains.
The number of codepoints (which is almost always useless but what some languages return) is available through the unicodeScalars view.
The number of code units (what most languages actually return, and which can at least be argued to be useful) is available through the corresponding encoded views (utf8 and utf16 properties)
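Concrete numbers for those views (a sketch; the grapheme count assumes a toolchain whose Unicode tables treat the emoji ZWJ sequence as one cluster):

    // MAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + BOY
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
    print(family.count)                 // 1  -- Characters (grapheme clusters)
    print(family.unicodeScalars.count)  // 7  -- code points
    print(family.utf16.count)           // 11 -- UTF-16 code units
    print(family.utf8.count)            // 25 -- UTF-8 code units (bytes)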
I do think #4 is worse than #2. There is extremely little useful information one can get by knowing the code points but not the grapheme clusters, and even for text editing these are confusing - if I move the cursor right starting before è, I definitely don't want to end up between the e and the combining accent.
That’s what Rust does too, in the standard library offering #1 and #4, UTF-8 bytes and code points. Because they named a code point type a char, though, folks continue to assume a code point is a grapheme cluster. Which it frequently is, and that makes things worse.
Some languages that had the misfortune to be developed while people thought UCS-2 would be enough define string length as number of UTF-16 code units, which is not even on the list because it's not a useful number.
Yeah, and most languages/libraries which predate Unicode define string length as the number of bytes (#1). That's probably the most common interpretation, actually. But most new implementations count code point, in my experience. I believe this is also the Unicode recommendation when context doesn't determine a different algorithm to read. Been a while since I read those recommendations though, so I could be wrong.
My experience is that most new stuff is deliberately choosing and even exposing UTF-8 these days, counting in code units (which for UTF-8 is equivalent to bytes).
I’d say that not much counts in code points, and in fact that it’s unequivocally bad to do so, posing a fairly significant compatibility and security hazard (it’s a smuggling vector) if handled even slightly incautiously. Python is the only thing I can think of that actually does count in code points: because its strings are not even potentially ill-formed Unicode strings, but sequences of Unicode code points ('\ud83d\ude41' and '\U0001f641' are both allowed, but different—though concerning the security hazard, it protects you in most places, by making encoding explicit and having UTF codecs decline to encode surrogates). String representation, both internal and public, is a thing Python 3 royally blundered on, and although they fixed as much as they could somewhere around 3.4, they can’t fix most of it without breaking compatibility.
JavaScript is a fair example of one that mostly counts in UTF-16 code units instead (though strings can be malformed Unicode, containing unmatched pairs). Take a U+1F641 and you get '\ud83d\ude41', and most techniques of looking at the string will look at it by UTF-16 code units—but I’d maintain that it’s incorrect to call it code points, because U+D83D has no distinct identity in the string like it can in Python, and other techniques of looking at the string will prevent you from seeing such surrogates.
It would have been better for Python to have real Unicode strings (that is, exclude the surrogate range) and counted in scalar values instead. Better still to have gone all in on UTF-8 rather than optimising for code point access which costs you a lot of performance in the general case while speeding up something that roughly no one should be doing anyway.
(I firmly believe that UTF-16 is the worst thing to ever happen to Unicode. Surrogates are a menace that were introduced for what hindsight shows extremely clearly were bad reasons, and which I think should have been fairly obviously bad reasons even when they standardised it, though UTF-8 did come about two years too late when you consider development pipelines.)
It's not just about optimizing Unicode access. It's also because system libraries on many common platforms (Windows, macOS) use UTF-16, so if you always store in UTF-8 internally, you have to convert back and forth every time you cross that boundary.
Most languages that seem to use UTF-16 code units are actually mixed-representation ASCII/UTF-16, because UTF-16’s memory cost is too high. I think all major browsers and JavaScript engines are (though the Servo project has shown UTF-16 isn’t necessary, coining and using WTF-8), and Swift was until they migrated to pure UTF-8 in 5.0 (though whether it was from the first release, I don’t know—Swift has significantly changed its string representation several times; https://www.swift.org/blog/utf8-string/ gives details and figures). Python is mixed-representation ASCII/UTF-16/UTF-32!
So in practice, a very significant fraction of the seemingly-UTF-16 places were already needing to allocate and reencode for UTF-16 API calls.
UTF-16 is definitely on the way out, a legacy matter to be avoided in anything new. I can’t comment on macOS UTF-16ness, but if you’re targeting recent Windows (version 1903 onwards for best results, I think) you can use UTF-8 everywhere: Microsoft has backed away from the UTF-16 -W functions, and now actively recommends using code page 65001 (UTF-8) and the -A functions <https://docs.microsoft.com/en-us/windows/apps/design/globali...>—two full decades after I think they should have done it, but at least they’re doing it now. Not sure how much programming language library code may have migrated yet, since in most cases the -W paths may still be needed for older platforms (I’m not sure at what point code page 65001 was dependable; I know it was pretty terrible in Command Prompt in Windows 7, but I’m not sure what was at fault there, whether cmd.exe, conhost.exe or the Console API Kernel32 functions).
Remember also that just about everything outside your programming language and some system and GUI libraries will be using ASCII or UTF-8, including almost all network or disk I/O, so if you use UTF-16 internally you may well need to do at least as much reencoding to UTF-8 as you would have the other way round. Certainly it varies by your use case, but the current consensus that I’ve seen is very strongly in favour of using UTF-8 internally instead of mixed representations, and fairly strongly in favour of using UTF-8 instead of UTF-16 as the internal representation, even if you’ll be interacting with lots of UTF-16 stuff.
I dunno about most languages - both JVM and CLR use UTF-16 for internal representation, for example, hence every language that primarily targets those does that also; and that's a huge slice of the market right there.
Regarding Windows, the page you've linked to doesn't recommend using CP65001 and the -A functions over -W ones. It just says that if you already have code written with UTF-8 in mind, then this is an easy way to port it to modern Windows, because now you actually fully control the codepage for your app (whereas previously it was a user setting, exposed in the UI even, so you couldn't rely on it). But, internally, everything is still UTF-16, so far as I know, and all the -A functions basically just convert to that and call the -W variant in turn. Indeed, that very page states that "Windows operates natively in UTF-16"!
FWIW I personally hate UTF-16 with a passion and want to see it die sooner rather than later - not only it's an ugly hack, but it's a hack that's all about keeping doing the Wrong Thing easy. I just don't think that it'll happen all that fast, so for now, accommodations must be made. IMO Python has the right idea in principle by allowing multiple internal encodings for strings, but not exposing them in the public API even for native code.
Neither the JVM nor the CLR guarantee UTF-16 representation.
From Java 9 onwards, the JVM defaults to using compact strings, which means mixed ISO-8859-1/UTF-16. The command line argument -XX:-CompactStrings disables that.
CLR, I don’t know. But presuming it’s still pure UTF-16, it could still change that as it’s an implementation detail.
(As for UTF-16, not only is it an ugly hack, it’s a hack that ruined Unicode for all the other transformation formats.)
I don’t think Python’s approach was at all sane. The root problem is they made strings sequences of Unicode code points rather than of Unicode scalar values or even UTF-16 code units. (I have a vague recollection of reading some years back that during the py3k endeavour they didn’t have or consult with any Unicode experts, and realise with hindsight that what they went with is terrible.) This bad foundation just breaks everything, so that they couldn’t switch to a sane internal representation. I described the current internal representation as mixed ASCII/UTF-16/UTF-32, but having gone back and read PEP 393 now (implemented in Python 3.3), I’d forgotten just how hideous it is: mixed Latin-1/UCS-2/UCS-4, plus extra bits and data assigned to things like whether it’s ASCII, and its UTF-8 length… sometimes. It ends up fiendishly complex in their endeavour to make it more consistent across narrow architectures and use less memory, and is typically a fair bit slower than what they had before.
Many languages have had an undefined internal representation, and it’s fairly consistently caused them at least some grief when they want to change it, because people too often inadvertently depended on at least the performance characteristics of the internal representation.
By comparison, Rust strings have been transparent UTF-8 from the start—having admittedly the benefit of starting later than most, so that UTF-8 being the best sane choice was clear—which appropriately guides people away from doing bad things by API, except for the existence of code-unit-wise indexing via string[index] and string.len(), which I’m not overly enamoured of (such indexing is essentially discontinuous in the presence of multibyte characters, panicking on accessing the middle of a scalar value, making it too easy to mistake for code point or scalar value indexing). You know what you’re dealing with, and it’s roughly the simplest possible thing and very sane, and you can optimise for that. And Rust can’t change its string representation because it’s public rather than implementation detail.
> I believe this is also the Unicode recommendation when context doesn't determine a different algorithm to read.
Except that emojis are universally two "characters", even those that are encoded as several codepoints. Also, non-composite Korean jamo versus composited jamo.
Japanese kana also count as two characters. Which they largely are when romanized, on average. Korean isn’t identical but the information density is approximately the same. Good enough to approximate as such and have a consistent rule.
Though this was technically correct in 2019, meanwhile the Raku Programming Language has continued to evolve, de-emphasizing its heritage more and more. For example, Covid restrictions allowing, 2022 will see the first European Raku Conference.
Is it a fork, or just a rename for a breaking version tree? I understand that backwards compatibility is not a required goal. So isn't it more like Python 3 rather than an entirely different programming language?
It's not a rename for a breaking version tree any more than a C# compiler is of GCC.
It is backwards compatible. The first strong demonstration of this was using the Catalyst framework, a major Perl package, in Raku programs, 6 years ago.
It's not remotely like Python 3 vs Python 2. Python 3 is a tweak of Python 2 that utterly failed to confront hard problems that desperately needed -- and still need -- to be confronted. For example, Raku sits alongside Swift and Elixir as one of the few mainstream PLs to have properly addressed character processing in the Unicode era. In contrast, Python 3's doc doesn't even mention graphemes. Rakudo has no GIL; Python is stuck with CPython's. And so on for a long list of truly huge PL problems.
Raku is an entirely different programming language. (This is so despite Raku(do) being able to support backwards compatibility via the same mechanism that allows it to use Python modules as if they were Raku modules. And C libraries, and Ruby ones, and so on.)
Not in the slightest. Python 3 made breaking changes to the CPython code base (yes, that's a simplification). There's enough backwards incompatibility to consider them different languages, but I'd guess that there might still be bits and pieces of code that are identical (or largely unchanged) between the Python 2 and 3 implementations.
Raku - on the other hand - does not use any of the Perl code base; it's a completely different runtime written from the ground up.
This particular capitalization would have generated a longer string without Unicode - it's a language convention that a capitalization routine could apply to just about any encoding that has an ß.
That is an issue, for sure. But that is not "the issue" that they're describing. They are clearly talking about (2), so confusion about other meanings is irrelevant.
Their point is that they would expect the number of visible grapheme clusters to be the same when converting between cases. But in some languages that's not true, as their example demonstrates. (An upper case ß does exist in Unicode but culturally it's less correct than SS.)
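The original post's example, as a Swift sketch (the default case-mapping tables map ß to SS rather than to U+1E9E ẞ):

    let lower = "Straße"
    let upper = lower.uppercased()
    print(upper)        // "STRASSE"
    print(lower.count)  // 6
    print(upper.count)  // 7 -- uppercasing grew the string by one grapheme cluster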
5. Length in terms of time until complete display when rendered in a given architecture.
6. Length in terms of Braille characters needed to display it.
7. Length in terms of reading time at 300 wpm.
String length is overwhelmingly discussed as character length, meaning #2. Length in bytes should only be an issue for data transmission or storage, and people who still work with ASCII. Rendering dimensions are not relevant to the published article.
I used to say the same thing, but then I was hired at a big company to work on products that needed to handle Arabic, Thai, Devanagari and Bengali.
They don't have a proper concept of characters, and diacritics count as a character rather than as an accent, so you have verticality in a string!
I've come to respect the unicode spec as something with a lot of inherent complexity that you have to understand to work with. Your inner voice will scream "there has to be a better way" all the way through the project, but there actually isn't, you just have to accept that human language and writing is complex and representing it in software is equally complex.
I understand that representing human languages is even more complex than I suppose, it's just that the article posted is a tweet surprised that a German word is one character wider in uppercase. It was not about graphic representation but number of characters.
Nah this came out of personal experience working at a company where a designer wanted to limit a block of text to a certain “length” but in reality they wanted as many characters as would fit in a box without scrolling, scaling or wrapping. This has to be calculated by bounding box.
> 2. Length in terms of number of visible characters is the number of grapheme clusters.
There's a fun subtlety in this case too: a single grapheme cluster need not draw as a single "visible character" (or glyph) on-screen. The visual representation of a grapheme cluster is entirely dependent on the text system which draws this, and the meaning that system applies to the cluster itself. This is especially true for multi-element emoji clusters, whose recommended meanings[1] change with evolving versions of Unicode.
To add to this, Unicode 12 simplified the definition of grapheme clusters by actually generalizing them so that they can be matched effectively by a regex. (See the "extended grapheme cluster" definition in TR29[2].) This reduced the overall number of special cases and hard-coded lists of combinations in the definitions of grapheme clusters (particularly around emoji), but it also means that there are now infinitely more valid grapheme clusters that don't necessarily map to a single glyph.
(Edit: it appears that HN is actually stripping out the ZWJ text from this example and leaving just the Copyright symbol. See below for how to reproduce this text on your machine.)
(I picked this combination somewhat randomly, but ideally, this is an example that should hopefully last as it feels unlikely that "horse copyright" would have a meaningful glyph definition in the future. As of posting this, the above text shows up as two side-by-side glyphs on my machine (macOS Monterey 21A559): a horse, followed by the copyright sign. This may look similar on your machine, or it may not.)
Importantly, you can tell this is actually treated as a real grapheme cluster by the text system on macOS because if you copy that string into a Cocoa text view (e.g., TextEdit), you will only be able to place your cursor on either side of the cluster, but not split it in the middle. A nice interactive way to see this in action is inserting U+1F434 into the document, followed by U+00A9. Then, move your cursor in between those two glyphs and insert U+200D: your cursor should then bounce out from the middle of the newly-formed cluster to the beginning.
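Since HN strips the ZWJ, here is a Swift sketch that rebuilds the same sequence with escapes; the single-cluster count assumes a toolchain implementing the current emoji-ZWJ grapheme rule (GB11):

    let horseCopyright = "\u{1F434}\u{200D}\u{00A9}"  // HORSE FACE + ZWJ + COPYRIGHT SIGN
    print(horseCopyright.unicodeScalars.count)        // 3
    print(horseCopyright.count)                       // 1 -- one extended grapheme cluster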
This was a pretty short example, but this is arbitrarily extensible: (Edit: Originally I had posted U+2705 <check mark symbol> + U+200D + U+1F434 <horse head> + U+200D + U+1F50B <battery> + U+200D + U+1F9F7 <safety pin> [sorry, no staple emoji] but HN stripped that out too. It does appear correctly in the text area while typing, but HN replaces the sequence with spaces after posting.)
As linked above, Unicode does offer a list of sequences like this that are considered to be "meaningful"[1], which you can largely expect vendors which offer emoji representations to respect (and some vendors may offer glyphs for sequences beyond what is suggested here). If you've ever run into this: additions to this list over time explains why transporting a Unicode sequence which appears as a single glyph on one OS can appear as multiple glyphs on an older one (each individual glyph may be supported, but their combination may or may not have a meaning).
In general, if this is interesting to you, you may enjoy trawling through the Unicode emoji data files [3]. You may discover something new!
Another fun fact: Upper-casing is language dependent. In English uppercasing 'i' gets you 'I'. But Turkish has a dotted and un-dotted 'i', each with an uppercase variant. So if your user's language was Turkish, uppercasing 'i' would give you 'İ', and lowercasing 'I' would give you 'ı'.
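A sketch of that with Foundation's locale-aware casing (the output assumes the Turkish tailoring is available):

    import Foundation

    print("i".uppercased())                                   // "I" -- default mapping
    print("i".uppercased(with: Locale(identifier: "tr_TR")))  // "İ" -- dotted capital
    print("I".lowercased(with: Locale(identifier: "tr_TR")))  // "ı" -- dotless small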
Makes me wonder how case insensitive file systems handle this... and for more fun, handle the situation where the user changes the system language. I know that the Turkish 'I' delayed at least one big company's Turkish localization efforts for a while.
> Makes me wonder how case insensitive file systems handle this...
They generally don't. It is true that several case-insensitive file systems (including NTFS, exFAT and ext4 [1]) maintain some sort of configurable case-folding maps, but they are mostly used to guard against future Unicode updates and do not vary across locales.
Another example is that in Dutch, the bigram 'ij' is considered a single letter, and so at the beginning of a word, both have to be uppercased. See for example the Dutch Wikipedia page for Iceland: https://nl.wikipedia.org/wiki/IJsland.
And it's present in Unicode also! IJ and ij are single characters (try selecting only the i or only the j in those). Their use is discouraged though:
> They are included for compatibility and round-trip convertibility with legacy encodings, but their use is discouraged. Therefore, even with Unicode available, it is recommended to encode ij as two separate letters.
A somewhat surprising and interesting side effect of this can be found in the blog post "Hacking GitHub with Unicode's dotless i" [1], which is now fairly well known.
It goes beyond just language... uppercasing can also be locale-dependent. In Microsoft Word, for example, uppercasing an "é" gets you "É" in the French (Canada) locale but "E" in the French (France) locale.
My understanding is that case insensitive filesystems that wish to be portable have a mapping table in the metadata. A quick search showed references to an 'upcase table', although I'm not sure of the accuracy of the source, so I won't link it.
Just because the user changed the system language doesn't mean the system should be expected to change the upcase table though. That operation would need to be very carefully managed; you can't change the rules if there are existing files that would conflict in the new mapping. And you might have symbolic links that matched because of case insensitivity that won't anymore... Pretty tricky.
I think this goes to show that the file system is not the right place to implement natural language handling to its full extent, like it's probably not a good idea to implement time zones as a primitive function in your new programming language.
This could've been a solution but is not what Unicode chose to do (in this particular case). Also relevant: discussion around unifying glyphs like A, B, C, E and so on across Cyrillic, Latin, and Greek, all with their language-specific casing behavior.
That's correct, but this could be very surprising for anyone not familiar with Turkish. For example if you have an English text which has a Turkish word in it, maybe a place or person name. Or for example when doing OCR.
I sort of expect that nothing can be assumed when talking about strings and characters anymore. Waiting for the post on HN one day that says that its in the unicode spec that characters can animate or be in superposition until observed…
When I read GP that was the first thing I thought of.
Animated emojis probably aren't too far off the horizon, and I could definitely see a world where multiple glyphs could be combined to affect the animation much like we do for skin tone or the family emojis.
Emojis in Unicode may have been massive scope creep, but I don’t think the poop emoji or love hotel emoji represent it. Love hotels are a legitimate phenomenon in Japan, where emoji originate (heck, even the word emoji is Japanese.) It fits in well alongside other map icons, even if the cultural clashes can be amusing. And the poop emoji may be silly, but it’s hard to deny its application for expression.
Now what I find is less practical and more ridiculous is the sheer number of ways you can combine various codepoints to get different emoji representing families. On one hand, it’s impressive dedication to trying to be neutral and encompassing, on the other hand it is nightmarish just how many possibilities exist.
> It fits in well alongside other map icons, even if the cultural clashes can be amusing.
Another example is the omission of the middle finger emoji in earlier Unicode versions but inclusion of various different hand emojis that insult people in East Asia or the Middle East. E.g. OK Hand Sign and Call me hand, which are both completely mislabeled when used in other cultures.
> Now what I find is less practical and more ridiculous is the sheer number of ways you can combine various codepoints to get different emoji representing families. On one hand, it’s impressive dedication to trying to be neutral and encompassing, on the other hand it is nightmarish just how many possibilities exist.
My problem with this is that it's trying to solve a font-level problem at the alphabet-level. For example, it's fair to ask why a male-looking construction worker appears on some particular emoji keyboard, whilst a female-looking construction worker doesn't; yet that bias exists in the choice of font, not in the Unicode standard (where U+1F477 simply defines 'construction worker').
Vendors like Apple created this problem when they moved away from simple silhouettes and (AFAIK) neutral 'smiley faces', to more detailed images which required a bunch of arbitrary choices to be made (gender, skin tone, etc.). Rather than making a bunch of alternative fonts, akin to light/bold/serif/monospace/etc. those choices were instead shoehorned into Unicode's modifier-character system :(
I sometimes believe that the full general-purpose embrace of Unicode for text, with no clean distinction between "machine-friendly" text and "for human" natural-language text (in every script since the dawn of time plus every goofy emoji anyone dreams up, with all the complexity these entail), is a major mistake that has led computing astray. I fear, though, that it is impractical to separate these things, short of entirely shunning the latter, and tempting as it is I can't quite advocate a return to pure ASCII.
Programming languages really should have distinct string-like types:
identifier -- printable ASCII characters only, an array of 8-bit chars
ucs16 -- An array of 16-bit chars for compatibility with Windows, Java, and .NET
utf8 -- Normalised, 100% valid UTF-8 with potentially some "reasonableness" constraints
text -- Abstraction over arbitrary code pages, including both Unicode and legacy encodings.
Languages like Rust kinda-sorta implement this. For example, the PathBuf type internally uses a "WTF8" encoding that is vaguely Utf-8 compatible, but allows the invalid code sequences that can turn up in Win32 system API calls.
IMHO that's a good try but not ideal. The back-and-forth conversion is complex, requires temporary buffers, and is slow as molasses for many types of API calls.
The ideal would be to have abstractions (traits, interfaces, whatever) that cover all the use-cases. E.g.: it should be possible to test if a 'utf8' string contains an 'identifier' string. It should be possible to compare strings without having to convert their formats. Etc...
My take on the 'string like types' languages should have:
# Simple indexed access (often an array, possibly an array of arrays)
raw / octets -- Not-classified sequence of raw bytes
# Fancy strings, which MAY be validated (but don't have to be), and MAY stay validated (if the operations are known to be simple enough), and MAY also have multiple types of index for speedy access to specific points by raw byte, unit run of encoded components, complete display units (a single displayed element), or even a cached last known render left bound on an output.
UTF-8 / WTF8 / UTF8 / ASCII -- A fancy string with octet components of possibly multi-byte (variable) length
UCS-2 / UTF-16 / etc -- A legacy string encoding family that no one should use, as it too is variable length (surrogate pairs) and suffers from endian confusion in raw data storage.
In an object-based language the latter two would probably use the first as a raw storage mechanism for the strings, while they'd also have some associated attributes for the desired encoding, whether it's validated, how it is known to be normalized, and storage for different index aids.
Crucially the system libraries should have the same interface. If there isn't library support for converting / normalizing encoded text the results should always be raw octets. If library support is included then WTF8 should probably be the result target of any operations, possibly upgraded to UTF-8 if the results are theoretically still valid.
> ucs16 -- An array of 16-bit chars for compatibility with Windows, Java, and .NET
Just no. Maintaining compatibility with broken implementations that hold incorrect assumptions is not something that should be done. Especially not encoding it into some kind of standard.
I’ll say it: it was a mistake. Every developer who learns Swift for the next hundred years will curse the designers the first time they need to do “basic” string processing. It’s gotten slightly better, but it’s still a joke.
Care to explain? I believe Swift’s model to be quite sensible for human non-US text, having an API that puts Unicode before ASCII. The bytes are still available if you want to shoot yourself in the foot.
As a developer whose native language is not English, I'm really glad that Swift has finally forced developers from Western countries (and especially US) contemplate what strings actually are and aren't, and how to properly use them in different contexts.
There are two German words that uppercase to the same string: Masse (a physical quantity, usually measured in kg: mass) and Maße (plural of Maß: measurement). So downcasing MASSE either requires understanding the text or results in a superposition.
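A quick check in Python (whose str.upper() follows the default Unicode mapping of ß to SS) shows the collision:

    print("Maße".upper())    # 'MASSE'
    print("Masse".upper())   # 'MASSE'
    print("MASSE".lower())   # 'masse' -- the ß is gone and can't be recovered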
This happens even in English. Downcasing POLISH could give you either an Eastern European nationality or the action you do to shine a shoe, and so needs textual understanding to disambiguate.
Or that some people’s names can only be represented as a bitmap or vector graphic. Or what if some people’s names can only be represented in a computer by that computer executing arbitrary code? Then all computer software that accepts the input of human names must by definition have arbitrary code execution vulnerabilities!
Sometimes I wish the popular programming languages had been invented by people whose written language had no concept of UPPER and lower case. There's so much cruft in code bases because of it. Conventions include kUPPERCASE for constants, CapitalizedCamelCase for classes and/or functions, sometimes snake_case for variables, and so on. So then you have millions of wasted person-hours on things that could have been automated if the names matched, but they don't.
Example
    enum Commands {
      kCOPY,
      kCUT,
      kPASTE,
    };

    class CopyCmd : public Cmd { ... };
    class CutCmd : public Cmd { ... };
    class PasteCmd : public Cmd { ... };

    Cmd* MakeCommand(Commands cmd) {
      switch (cmd) {
        case kCUT: return new CutCmd();
        case kCOPY: return new CopyCmd();
        case kPASTE: return new PasteCmd();
        ...
The fact that in some places it's COPY, in others Copy, and probably in others copy means work that would disappear if it was just always COPY. All of it is superfluous work that someone from a language without the concept of upper and lower case would never even have considered when coming up with coding standards. Hey, I could just copy and paste this.... oh, the cases need to be fixed. Oh, I could code-generate this with a macro.... oh, well, I've got to put every case form of the name in the macro so I can use the correct one, etc...
a) Is your tooling not identifier-aware? That removes the hassle of remembering the case correctly. And for that matter, so does having a consistent standard...
b) When is capitalization ever the thing stopping you from copying and pasting some code?
Capitalization only differs if the semantic meaning differs. Like what's a specific example in your code where this is avoided:
> Hey, I could just copy and paste this.... oh, the cases need to be fixed. Oh, I could code generate this with a macro.... oh, well, I've got to put every case form of the name in the macro so I can use the correct one, etc...
I'm not sure what your point was. My point is that because these languages were invented in English, and because the English language has a concept of UPPER and lower case, English-speaking programmers made up naming conventions using different case variations of the same identifier. The most common convention I've run into I listed above. So I type out some enum using kUPPER for 50 things, I then copy and paste that somewhere to generate 50 functions, but according to the naming conventions I now have to fix the case of all of those functions. That "fix the case" is wasted work. Then we also end up with kUPPER except if it's a compound word, so kPASTEANDGO becomes kPASTE_AND_GO, which then has to be translated to some `class PasteAndGo` because of naming conventions. Again, wasted work.
There are lots of cases where this conversion ends up causing all kinds of headaches, major and minor, just to keep everything matching the naming convention. Sure, you could choose to ignore the convention in this case but I've rarely seen a code base that does.
Whereas, if the languages had come from a culture with a written language that doesn't have the concept of UPPER and lower case, then these conventions might never have happened. You'd maybe have something like K_COPY_AND_GO, class C_COPY_AND_GO, etc., and when copying and pasting ids around or code generating, there'd be no hoops to jump through to match the case conventions for different uses of the same id.
> Capitalization only differs if the semantic meaning differs
Yes, and I'm positing that prefixes or suffixes would be better than case differences
This is an unreasonable expectation to have for Unicode anyway.
1. Assume that some letter X has only a lower-case version. It’s represented with two bytes in UTF-8.
2. A capitalized version is added way later
3. There are no more two-byte codepoints available
4. So it has to use three bytes or more
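This isn't purely hypothetical, either. If I'm reading the Unicode data correctly, ȿ (U+023F) was encoded years before its capital, which ended up at U+2C7E, outside the two-byte UTF-8 range:

    s = "\u023f"                            # ȿ LATIN SMALL LETTER S WITH SWASH TAIL
    print(len(s.encode("utf-8")))           # 2 bytes
    print(s.upper())                        # 'Ȿ' (U+2C7E)
    print(len(s.upper().encode("utf-8")))   # 3 bytes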
I see people are jumping on the “oh the wicked complexity” bandwagon in here but I don’t see what the big deal is.
Interestingly, and I can't decide if I hate it or not, when searching for "æ" to get back to this comment after stupidly scrolling away from it, Chrome matched on "ae" in other comments, not just on the letter itself.
In Norwegian, "ae" is an accepted transliteration of "æ", but "æ" is very clearly considered its own character, and we'd expect the answer to "how many characters is 'æ'?" to be 1 (which creates fun sorting issues too - "æ" is expected to sort after "z"), just like "aa" is a transliteration of "å" except when it isn't (it's always a transliteration outside of names; within names all bets are off - some names traditionally use "aa", some use "å", and when the latter are written with "aa" it represents a transliteration; to make it worse, some names exist in both forms - e.g. Haakon vs. Håkon).
Now the interesting question to ask Norwegians is "how many characters are 'ae'?" On one hand the obvious answer is 2, but I might pause to think "hold on, you meant to write "æ" but transliterated it, and the answer should be 1". Except it might occur to me it's a trick question - but it could be a trick question expecting either answer. Argh.
[I just now realised a search for "å" matches "a" and "å", but "aa" is treated as two matches on "a", and that I definitively hate, though I understand it's a usability issue that it matches on "a", and that matching on "aa" makes no sense if the matched term is in another language.]
[EDIT2: I've also done some genealogy recently, and to be honest, the spelling of priests makes it quite hard to hold on to any desire for precision in find/search]
To make things worse (or better), æ, œ, and ß have evolved from ligatures to becoming letters in their own right (a choice that may depend on the language), while fi stayed a ligature. So 'fieß', if there was such a word in German, would count as 4 letters.
Also, talking about 'glyphs', one has to clearly separate what Unicode encodes from what a given font uses; in the context of fonts, 'glyphs' are also called 'glyfs' or 'outlines', and any OTF may choose to render any given string with any number of glyfs/outlines, including using a visual ligature to display 'f+i' = fi, or doing so by nudging the outlines for f and i closely together.
I would be surprised if most languages implement it that way.
Counting code points would be a start and would solve this particular problem. But that’s not glyph count. You really need to count grapheme clusters. For example, two strings that are visually identical and contain “é” might have a different number of code points, since one might use the combining acute accent code point while the other might use the precomposed “e with acute accent”.
Even a modern language like Rust doesn’t have standard library support for counting grapheme clusters; you have to use a third party one.
And if you do count grapheme clusters then people will eventually complain about that as well (“TIL that getting the length of a String in X takes linear time”).
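To make the "é" case concrete in Python (grapheme-cluster counting itself needs a third-party package, e.g. the regex module's \X):

    import unicodedata

    s1 = "\u00e9"      # 'é', precomposed
    s2 = "e\u0301"     # 'e' + COMBINING ACUTE ACCENT
    print(len(s1), len(s2))                        # 1 2  -- code point counts differ
    print(s1 == s2)                                # False
    print(unicodedata.normalize("NFC", s2) == s1)  # True after normalization
    # Grapheme clusters: e.g. len(regex.findall(r"\X", s2)) == 1 with the
    # third-party regex module.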
Python has str.casefold() for caseless comparisons that handles the example in the OP[1]:
> str.casefold()
> Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.
> Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter 'ß' is equivalent to "ss". Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to "ss".
> The casefolding algorithm is described in section 3.13 of the Unicode Standard.
I believe this does work for German (or at least I can't think of an example where it doesn't). But a case I can think of where it doesn't work is with standard modern Greek. In all-caps words, accents are omitted, while at least Python's implementation of casefold produces non-equal strings for the all-caps and lowercase versions:
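(A minimal sketch of the mismatch I mean, assuming standard Python 3 behaviour:)

    lower = "έλα"            # accented lowercase word
    caps  = "ΕΛΑ"            # conventional all-caps spelling, accent omitted
    print(lower.casefold())  # 'έλα'
    print(caps.casefold())   # 'ελα'
    print(lower.casefold() == caps.casefold())   # False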
But I'd consider that incorrect, or at least nonstandard. The Greek alphabet does have accented versions of capital letters, but they can only be used as the first letter of a word in mixed case (e.g. if a sentence starts with έλα, you write it Έλα), never in the middle of a capitalized word. However maybe this slides too far to the "language" rather than "encoding" side of the space that Unicode considers outside of its purview.
This sounds like a question they'd ask during a tech interview and when you ask which language or what they mean by "length," the "senior" developer says, "Heh, guess you're not one of us who work for a living. You couldn't even answer a basic programming question. You see, length means how long a thing is. Everyone knows that. NEXT."
As heavyset_go mentioned casefolding is your friend here, but you'll also likely want to do some unidecoding too depending on your use case. With that lens, going from one to two characters is pretty tame as there are numerous examples of single characters that unidecode to eight! And also ones that go to zero.
We've done some pretty fun work at Inscribe ensuring things like this work across multiple languages, in particular for matching names. So we can now verify that a document does indeed belong to "Vladimir" even if the name on the document is "Владимир". The trickier part is not simply matching those, but returning the original raw result, as for that you need to keep track of how your normalization has changed the length of things.
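A rough sketch of that kind of normalization (this is not Inscribe's actual pipeline; it uses the third-party Unidecode package, and real name matching needs far more care):

    from unidecode import unidecode  # third-party: pip install Unidecode

    def normalize_name(name: str) -> str:
        # Fold case first, then transliterate to ASCII.
        return unidecode(name.casefold())

    print(normalize_name("Владимир"))   # expected: 'vladimir'
    print(normalize_name("Vladimir"))   # 'vladimir'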
If you're interested in this kind of thing, shoot me a mail at oisin@inscribe.ai. We're growing rapidly and hiring to match.
It is not a bug. That very character was submitted to Unicode by none other than DIN (the German national standardization organization) [1], and it was clearly indicated that the capital eszett should be encoded without case folding to the original eszett (p. 4). As others mentioned, the capital eszett is allowed but not enforced, so DIN didn't want to break the original case mapping between ß and SS. And because of the Unicode stability policy [2], it can no longer be made into a case pair even if DIN changes its mind.
If you read the official announcement for the German language (cited in the wiki article), it states that SS is the capitalization of ß, but that ẞ is also permitted. So the standard form is double S.
Question: is there any reason to expect that the double-s form will be replaced, or is this more of a permanent quirk of a language adopting quirks from its neighbors?
Difficult to say. The double-s form has to stay (for lowercase and capitals) so as not to deviate too much from Swiss German (it has to stay compatible). Generally older people think the capital ẞ is ugly, but maybe this will change over the generations. My best bet is that ß falls out of favor entirely and Germany goes the Swiss way. Let's talk again in 60 years.
When I first heard about it a few years ago I did indeed think that it was ugly-looking and the whole idea rather pointless.
I've recently re-discovered it, though and somehow this time around actually taken a liking to it, and have smuggled it into a few reports I've had to prepare since then.
Armenian, Coptic, Cherokee, Garay, Osage, Adlam, Warang Citi, and Deseret, as well.
These are called "bicameral scripts."
Whether Japanese has them or not is arguable - consider katakana, which the writing system switches to in order to mark words as external (loanwords and the like).
Whether Arabic has it or not is also arguable. Many Arabic letters have alternate forms (English speakers might comparatively call them "modern," "holy", "cursive," and "historic.")
You might argue that English has other alternate cases, such as cursive, handwriting, blackletter, doublestrike, smallcaps, title case, and so forth.
Arabic I actually know, and it has nothing like case. The letters have alternate forms depending on their surroundings. And of course there are various shorthands used in handwriting, much like our cursive.
I don't see the connection between cursive, etc. and case at all. Case exists in all those you mention, it's orthogonal.
> I don't see the connection between cursive, etc. and case at all.
Linguists do. It's one of their standard examples.
Bicameral case comes in because some words get special rules for letters. Capital letters in English, for example, are used to lead proper nouns.
To get an English speaker through this, usually the path is:
1. In English, it's expected that formal text writes God in blackletter. Though this practice has fallen out of use, it's still a rule.
2. In English, it's expected that quotations and references are placed in italics.
3. In English, it's expected that the use of cursive implies non-formal, friendly text.
Et cetera.
Given those, the fact that there are different letterforms for religious text should be relatively easy to understand as a divisive case (pardon the phrasing.)
At any rate, the academics recognize it; if you don't, that's fine by me.
.
> Case exists in all those you mention, it's orthogonal.
It really isn't. They're exactly the same thing. In each setup, it's letters that get a different letter-form because of rules about what's being written.
.
> And of course various shorthands use in handwriting, much like our cursive.
> Linguists do. It's one of their standard examples.
Well, I do have a degree in linguistics, so I know as well as anyone that you can always dig up a linguist who'll argue some position, however untenable. Can you specify your exact source, maybe?
If I'm following you correctly, you are saying that some specialists argue that the distinction between cursive, blackletter etc. is the same as that between upper and lowercase? This does not make sense at all, regardless of who says it. Different font/script families like cursive, blackletter, etc. all themselves include upper-case and lower-case variants, so by definition these are orthogonal categories. Additionally, excepting the one example you cite with the name of God in formal (archaic) texts, cursive, blackletter and Fraktur are generally not mixed in anything like the way uppercase and lowercase are. It's quite simply a different kind of category, a different level of analysis. Fonts are not cases.
> it's letters that get a different letter-form because of rules about what's being written.
This is an extremely abstract definition of case. I also don't see how it covers the different forms and combinations of letters in Perso-Arabic script, since these are purely mechanical, like ligatures in fine typesetting, except some of them are optional and some are only used in handwriting. I also don't necessarily agree that it covers the difference between blackletter and roman, since really this is more likely to be decided by who is doing the writing and who they're writing for rather than what they are writing.
> If I'm following you correctly, you are saying that some specialists argue that the distinction between cursive, blackletter etc. is the same as that between upper and lowercase?
No, that's not my position.
.
> Fonts are not cases.
Those aren't fonts.
.
> Different font/script families like cursive, blackletter, etc. all themselves include upper-case and lower-case variants
It's not clear to me why you felt the need to say this.
.
> > it's letters that get a different letter-form because of rules about what's being written.
>
> This is an extremely abstract definition of case.
that is:
1. not a definition
2. not about case
3. not abstract
.
> I also don't see how it covers the different forms and combinations of letters in Perso-Arabic script, since these are purely mechanical
Some are, some aren't. Sure, there are ligatures, but there are special forms for holy topics, and so forth.
Of course, I'm not talking about Perso-Arabic; I'm talking about Arabic. They're not exchangeable that way.
Perso-Arabic is not how you describe that writing system, by the way. Perso-Arabic is how you reference the specific branch of Arabic that's been bent to writing in Iran. It's meant to distinguish Farsi usage, not to encompass Arabic as a whole.
Anyway, this shouldn't be surprising, since Arabic is an abjad, so it doesn't really handle this discussion, since it's got entire implicit letters.
.
> I also don't necessarily agree that it covers the difference between blackletter and roman, since really this is more likely to be decided by who is doing the writing
My opinion is that you missed the purpose of my comment, which was not to talk about stylistic choices by authors.
Of course, as a user of the internet I am aware of things like fonts.
But since you're apparently a linguist, I'm sure that you're aware that in Europe, for about 400 years, the word "god" was written in what you appear to think is a different font, at a different size, in some countries even by law.
This is what I'm talking about. Not someone sitting in Word trying to make something look pretty; rather, language rules which a schoolteacher would use in grading something as correct or incorrect. Unicode doesn't encode stylistic choices, but it does encode language rules. If something is in Unicode, either it's an Emoji, or someone thought it was a valid part of language somewhere. Admittedly, that music is in there means the definition of language is being stretched a bit; still if it is in Unicode and isn't an emoji, it's because someone is trying to make a real world pre-existing rule usable.
It's got nothing to do with fonts, you'll find. There's a blackletter block in most fonts, and it looks different font to font.
I'll tie the knot.
There's an author of comedy - I think he recently passed away, from early-onset Alzheimer's - named Terry Pratchett. I always found his style to be similar to the better-known Douglas Adams. His world theme was swords and sorcery rather than science fiction, but it's all about the wordplay and nonsense nonetheless.
Most of his books take place in a shared universe called "discworld," in which ᴅᴇᴀᴛʜ gets her name written in small caps in every single book. (See how I did that? HN doesn't have fonts.)
Now, you might argue that that's a stylistic choice, and therefore doesn't qualify for discussion here. Indeed, if you did, I would even agree with you: that's my point, after all. I'd even skip the bitter argument about how smallcaps aren't small caps, because the letter width is different, the e and x heights are different, et cetera. Try to tell a typesetter "it's just a font" and you might get yelled at
I would say that the reason there's a meaningful difference is that no schoolteacher will ever mark your paper wrong for not following Terry Pratchett's standards, whereas schoolteachers *will* mark you wrong in Ecumenical School for not using blackletter when writing God.
Because there are rules. And even if you, apparently a linguist, don't know about them, they're still there.
That is, of course, why Unicode has alternate letterforms, such as 𝗯𝗼𝗹𝗱, 𝘪𝘵𝘢𝘭𝘪𝘤, 𝓉𝒽ℯ 𝒶𝒻ℴ𝓇ℯ𝓂ℯ𝓃𝓉𝒾ℴ𝓃ℯ𝒹 𝒸𝓊𝓇𝓈𝒾𝓋ℯ, 𝓲𝓽𝓼 𝓫𝓸𝓵𝓭, 𝕕𝕠𝕦𝕓𝕝𝕖-𝕤𝕥𝕣𝕦𝕔𝕜 𝕔𝕙𝕒𝕣𝕒𝕔𝕥𝕖𝕣𝕤 for math, 𝖆𝖓𝖉 𝖊𝖛𝖊𝖓 𝖙𝖍𝖊 𝖇𝖑𝖆𝖈𝖐𝖑𝖊𝖙𝖙𝖊𝖗 𝖙𝖍𝖆𝖙 𝖞𝖔𝖚'𝖗𝖊 𝖈𝖚𝖗𝖗𝖊𝖓𝖙𝖑𝖞 𝖆𝖗𝖌𝖚𝖎𝖓𝖌 𝖆𝖇𝖔𝖚𝖙.
Because they're not fonts! They're meaningfully distinct letter forms with well-understood usage patterns. (Blackletter shows up twice, with identical symbols, once for math and once for religion, and there's talk of adding a third for physics, because 𝕮, 𝕮, and 𝕮 all mean different things, and Unicode's position is that they should have different character representations for different meanings, despite being potentially identical graphically.)
Indeed, ask a linguist "what is the most common use in English of double-struck letters," and they'll probably have a ready answer for you.
Heck, the Unicode formal character names have the answer baked right in pretty often.
U+1D539 MATHEMATICAL DOUBLE-STRUCK CAPITAL B (alias Bopf; compatibility mapping to U+0042 B)
Not a font, friend. Not even if you really believe it is.
If you need clarity on the matter, use your own tool: you wanted to remind me that blackletter "wasn't uppercase" (nobody said it was) because it has upper case and lower case variants. Well, here's some of your own medicine. Fonts, like Arial and Times New Roman, have their own distinct blackletter and double-struck representations. And they look different, in the way that fonts do. Because they need to support the valid, different letterforms. A blackletter meant for Arial will look terrible set into Times, or vice versa.
There is an actual right and wrong here. This isn't an issue of opinion.
Of course, if you're a linguist, all I really need to do is say "swiss eszett."
At any rate, maybe all of my textbooks are wrong. No, I'm not interested in digging them out and looking them up for you, and it does not compel me that a stranger says they're wrong until I spend time finding it. If you're evidence motivated, bring some. If not, fine by me. Since you're a linguist, and I'm just a lowly programmer, if you actually want to check reference, go right ahead, be my guest; I'm sure you own much more of it than I do.
It's irrelevant to me: Unicode has rules, they cover this topic, they're clear on the point, they aren't set on website discussions, and if you want to debate them, we'll see you at the next meeting.
Good luck. There's an awful lot of inertia behind how it's currently seen, and it's extensively supported by the literature in the free, publicly available unicode meeting minutes.
Thank you for typing all of that out. First of all, I should like to apologize for the tone in my last post.
It seems quite strange to me to say that blackletter is not a font/typeface/script (or a clade of such objects). Historically, before we had mixed usage of the scripts, they absolutely were different typefaces, and the distinction between who used blackletter and who used roman was obviously one of time and place. The two are closely related, like dialects, but they are distinct.
Now, mixing scripts is also common, as you mention, as is the recruitment of single symbols, like the use of blackletter in mathematics. But, I mean, Hebrew letters have also been used. Does that make them part of our script in any sense?
As I hinted at I have some knowledge of some languages of the Islamic world, and as you know the inclusion of Arabic words and phrases, written in the Arabic script, is very common in texts written in non-Arabic languages. It is similar to how Western philosophers routinely write Greek words without transliteration to look smart. As humans, we have no problem switching between different scripts and languages that we have knowledge of. Now, computers do have such problems, hence Unicode to capture all of it instead of the mess that came before. But they're still distinct scripts.
Now, contrast that with case. The concept of case is at the very core of the modern Latin alphabet in all its derivations. The history of it is completely different to the distinction between roman and blackletter. The alternation between uppercase and lowercase is not something that occurs only in specialist contexts, nor is it analogous to mathematicians running out of symbols and recruiting new ones from different scripts. I'm simply saying that 'A' and 'a' have a relationship that's similar to the one between their blackletter counterparts, but the relationship between those two sets is very different. And I don't think inventing a category like "letters that get a different letter-form because of rules about what's being written" is very helpful, except maybe outside of the very specific technical context of solving the problem Unicode is solving.
(By the way, you make it sound like you are part of the Unicode process? If so, thank you very much for your service.)
> First of all, I should like to apologize for the tone in my last post.
Color me surprised. Thank you for saying so.
.
> It seems quite strange to me to say that blackletter is not a font/typeface/script (or a clade of such objects). Historically, before we have mixed usage of the scripts, they absolutely were different typefaces, and the distinction between who used blackletter and who used roman was obviously one of time and place. The two are closely related, like dialects, but they are distinct.
I don't mean to seem rude, but it's not at all clear to me how you can believe this.
Typefaces begin with the invention of movable type, which depending on what you count as "enough" is probably Bi Sheng. However since the context is blackletter and international trade hadn't emerged yet, we're stuck with Laurens Janszoon Coster in context. His work is generally believed to have been around 1420, though records aren't terribly clear. (Any number involving 420 is pretty good though, so just run with it.)
Prior to this, in European context, there is no such thing as a typeface.
The emergence of blackletter is contentious - it's not clear whether to include Gothic minuscule (or, at the other end, textura) - but my preference is for the standard issue of 1150, about 300 years before type existed.
Blackletter is so old that its nearest unambiguous parent is Carolingian.
Blackletter is older than fully 70% of yo momma jokes, both in content and target.
Amusingly, it is actually well known and well documented that the first typeface Gutenberg used was blackletter. This seems invalidating to me.
> the distinction between who used blackletter and who used roman was obviously one of time and place
My problem with the word "obviously" is that in my personal experience, it has been more often used about things that are incorrect than things that are correct.
Blackletter has been pretty squarely relegated to The Bible, in truth. Its emergent use in mathematics is recent, and the result of our running out of several other groups of symbols to use - they had already run through latin, cyrillic, greek, doublestruck, bold, italics, capital variants of all the preceding, and a variety of related symbols.
Kanzlei, the ancestor of the font, was developed by monks for the specific purpose of producing visually pleasing, salable bibles. As this preceded movable type, Bibles, which were very long, were intensely expensive - sometimes a full year of a person's income. This is also why they were so ornately decorated.
Frankly, it's an awful font, and should never be used for anything. Not even wedding invitations. And for once, all of humanity (you know, the same humanity that still uses comic sans) just stayed away, probably because of the intense eye pain.
.
> But, I mean, Hebrew letters have also been used. Does that make them part of our script in any sense?
No. It makes them a part of mathematics, which is not a part of "our" script either.
As far as I know, this starts and stops with aleph. Aleph in hebrew is U+05D0. Aleph in mathematics is U+2135.
Are there any other Hebrew letters in use by math at all?
.
> as you know the inclusion of Arabic words and phrases, written in the Arabic script, is very common in texts written in non-Arabic languages
Transclusion of foreign segments is formally irrelevant to the nature of native orthographies.
.
> But they're still distinct scripts.
Unicode has a vast and unwieldy array of counterexamples in its documentation, some of which were already linked for you, but I'm glad that you're confident.
Again: swiss eszett.
.
> Now, contrast that with case. The concept of case is at the very core of the modern Latin alphabet in all its derivations.
This is, of course, 100% incorrect. The Latin alphabet was 100% what we would call uppercase until around 200 AD, and minuscule didn't normalize across the language for another 300 years.
Latin spent more time monocameral than bicameral by more than triple. We currently think of Latin as emerging in 700 BC, and we typically think of the death of Latin as the second fall of Rome in 476 AD.
So in reality, Latin was single case for 900 years and double case for less than 300.
I strongly recommend you start verifying your beliefs before presenting them as fact. You're batting 1 for 19 at the time of this writing.
.
> The history of it is completely different to the distinction between roman and blackletter.
The academics do not agree.
.
> The alternation between uppercase and lowercase is not something that occurs only in specialist contexts
I never said it was, and this is not the case for most of my examples.
I feel that you are arguing for the sake of arguing, and have lost track of my intent for the hope of going for each sentence out of context.
.
> nor is it analogous to mathematicians running out of symbols and recruiting new ones from different scripts.
Cool, didn't say this was either.
.
> I'm simply saying that 'A' and 'a' have a relationship that's similar to the one between their blackletter counterparts
Say that as many times as you like. The historians and linguists don't agree, even though you have a belief system to display.
.
> And I don't think inventing a category like "letters that get a different letter-form because of rules about what's being written" is very helpful
That's nice. I didn't invent that. I'd tell you to talk to the person who did, but he's been dead for more than a thousand years.
Kinda spooky.
.
> except maybe outside of the very specific technical context of solving the problem Unicode is solving.
"Except outside of the very specific technical context of the exact thing this post and you were talking about, and has been the core of your point the whole time."
Thanks, I guess.
I'd respond, but I think this isn't very helpful, except maybe outside of the very specific technical context of human language as it's used on Earth.
Imagine trying to argue about what things actually count as letters, then begging off with "you're being too technical".
You are right, I maybe have lost track of what we're talking about, or maybe I am reading too quickly - a bit like I think you are when you take my reference to "the modern Latin alphabet and all its derivatives" to be about antiquity.
So let us backtrack. I am still reacting to your original point, which was that English has blackletter and cursive as "alternate cases". I understood that to mean that if we take the idea of the letter A (the thing that the uppercase and lowercase A have in common), that the cursive and blackletter realizations of it somehow enter into the same paradigm as the two roman glyphs do in English. I really don't see how you could argue that that is the case.
> I don't mean to seem rude, but it's not at all clear to me how you can believe this.
I can believe it because blackletter typefaces were dominant in my country, for all purposes, into the early modern era, when they were supplanted by roman ones. I think this is at least partially related to the unification of Germany and our losing a war against them.
Hiragana/Katakana absolutely can not be interpreted as “uppercase/lowercase”. BTW kana also has full-width/half-width versions, which while also distinct would be closer.
> Hiragana/Katakana absolutely can not be interpreted as “uppercase/lowercase”.
I didn't say it was upper or lower case.
I said it was a difficult and arguable parallel, because just like upper and lower case, it's a setup where the letters you use are switched out for others because of context.
.
> For a more pure example, consider Bopomofo.
From my perspective, it's literally exactly the same thing: a case where the letters you use are switched out because of context.
In fact, my linguistics textbook gives Bopomofo as an example *in the same sentence as katakana*.
Depending on what that sentence is, it doesn’t sound contradictory.
I considered it more as its own set of characters (in the same way that kanji and Roman letters are their own).
A difference here is that while Japanese words can correctly be written as either hiragana or katakana, the idiomatic way is to mix them depending on the word (and occasionally context).
Bopomofo is its own writing system and doesn’t really get mixed with, say, pinyin.
I was testing this lately: Teradata, SQL Server, Oracle, and others just return ß for upper('ß'). Snowflake returns SS.
There is, by the way, an uppercase letter defined for ß.
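For comparison, Python's str.upper() follows the default Unicode mapping to SS, even though that capital (ẞ, U+1E9E) exists:

    print("ß".upper())        # 'SS'  -- default Unicode uppercase mapping
    print("\u1e9e")           # 'ẞ'   LATIN CAPITAL LETTER SHARP S
    print("\u1e9e".lower())   # 'ß'   -- the capital lowercases back to ß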
The conversion is lossy, since lowercase ss (e.g. Gasse) is valid and common German and capitalizes to SS as well. Short of the caller learning some German, there's no neatly reversing this.
There are even cases where it is context-dependent: Maße (measurements) and Masse (mass). Both are written MASSE uppercased, but how do you know which one's which?
For additional fun: Swiss German doesn't use ß at all, so there is no difference between "in Maßen genießen" (to enjoy in moderation) and "in Massen genießen" (to enjoy in large amounts).
Making X.upper().lower() return X.lower() is doable with an expanded string type that keeps track of more context, such that modifiers don't apply until the final output. In that case, it would be relatively simple to say that with multiple upper and lower calls, only the last survives.
Making 'STRASSE'.lower() return 'straße' requires that the caller have knowledge of the written language in use and a lookup through a language dictionary. IIRC, not all German words with two consecutive 's'es are properly written with an 'ß', and I don't know much about other languages that use that character. Blindly changing any SS to ß on lowercasing isn't what anyone wants, but strings are rarely annotated with the written language they contain, and it gets worse because strings can contain multiple written languages, which is only extremely rarely annotated.
You could plausibly capitalize the ß into one of the homoglyphs for S. Then, lower() could detect the homoglyph and know how to lowercase it. This will have some side effects, though (e.g. a doubled lowercase version of your homoglyph will now not round-trip).
I'll paypal $20 to anyone who can name a situation where string length (in number of visible characters) is actually required by any reasonable algorithm.
(High roller, I know.)
No games, though. If you say "An algorithm to compute the length of the string in number of visible characters," that obviously is designed to pass the test rather than to do anything useful.
Maybe framing it as a bet will break the illusion that string length is ever required.
Set top boxes (the things you use to decode a paid TV signal and watch "the [shows] programming guide") use a format called EIT (Event Information Table), which is transferred as an encoded binary file, for which MANY of the titles and TV show infos are capitalized (mostly* due to readability, you know, so an L does not look like a 1). I worked on a project for generating such files and it is HIGHLY sensitive to the length of the data (since it will be encoded to binary, and if you add a single byte to the length of such a file the rest of the info turns into garbage)... Now that I think about it, I should probably give one of those guys still there a call to let them know the shows in German may have a bug... Can I get my 20 now?
I'll send you $20 for the interesting reply and reporting unicode bugs, if you want! (Feel free to email me your info.) But in the situation you mention, you care about the byte length, not the string length. Byte length is a totally valid thing to care about, but string length isn't needed.
> I'll paypal $20 to anyone who can name a situation where string length (in number of visible characters) is actually required by any reasonable algorithm.
I've already described above where I've had this exact requirement. The client demands that "50 letters" be allowed in a title field for a website element. He means letters: spaces (and presumably newlines) don't count, neither do punctuation. Emojis were not a concern at the time. I wrote Javascript and PHP validators for client- and server-side validation and stored the string in a VARCHAR(127) to be safe.
And this is completely reasonable. He is a human, not "a computer person" (his terminology) and pays me to abstract from him those "computer things" which he has no desire to understand.
I guess I'll admit that fixed width fonts are a valid use case, but I'd argue that that defeats the point of displaying language to humans. There are so many cases this doesn't apply: RTL languages, kanji, pretty much anything that's not eurocentric.
My original response was going to be that you don't actually care about "foo".length in that context, either. You care whether one string is larger than the other, i.e. the actual value doesn't matter -- all that matters is that if string X is longer than string Y, then X's reported length should be greater. But mathematically, that's just a precise definition of requiring the string length, so that logic doesn't work out.
If you email me your paypal/venmo/whatever, you can collect the $20, since you were the first to point out monospaced font. I still feel that monospaced fonts are sidestepping the whole point, but I won't be pedantic about it.
Actually, I've been meaning to ask: what's a good charity to donate to? I haven't had much money until recently, so I've never sat down and figured out a charity.
There are lots to choose from, and I suppose I don't really care about making a perfect choice. I'm mostly just curious what other people donate to.
Re: the data tables, it's tempting to think that you need the string length for that case. But what you really care about is whether the strings start at the same place, and whether the string fits in a column -- both of which are questions best answered by a font subsystem rather than relying on string length.
But the data table case is so common that I agree it fulfills the terms of the bet, in the strictest sense. But e.g. that Competition / Name column will break for various non-English languages if you try to base the formatting on "number of visible characters."
FYI, depending on where you live, charitable donations can have tax-benefits (i.e. in the form of a deduction). It can be worth making sure you're donating to charities which are enrolled in such programs. For example, Singapore has a 250% tax deductions on donations (this is pretty extreme though, I'd bet there are very few countries with > 100%).
I think you'd find most people donate based on what's touched their lives, e.g. breast cancer research if you lost a loved one to breast cancer.
Width of text is a distinct thing from string "length"… or, well, string "length" is particularly vague. But let's presume the OP meant length in Unicode code points or grapheme clusters. (Neither of these will be width in a fixed-width font. There's also length in bytes, which ofc. is also not width in a fixed-width font.)
E.g., most Asian characters & emoji are 2 cells wide in a fixed-width font for a single character. The character "e" is 1 byte in UTF-8, "é" is >1. The "family of four" emoji can be up to 11 code points, 41 bytes (in UTF-8), and 2 cell widths (still just one emoji), and is, I presume, a single grapheme cluster.
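For instance, even the plain family-of-four ZWJ sequence (no skin-tone modifiers) shows the spread, if I have the counts right:

    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # man+ZWJ+woman+ZWJ+girl+ZWJ+boy
    print(len(family))                  # 7 code points (4 emoji + 3 zero-width joiners)
    print(len(family.encode("utf-8")))  # 25 bytes
    # Rendered as a single grapheme cluster / one double-width emoji where supported.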
Oh that's a great one! I have a thermal receipt printer (can recommend anyone to buy one, they're amazing) that is 80 characters wide. Would definitely screw up the formatting if it ever encountered this situation. (and no further validation was performed on the uppercase'd string)
At the very least you need to know which code point clusters are one visual character for stepping the cursor left and right through text in a text-editing control.
Which is 99% of the work to also calculate the length in those terms.
If your substr() cuts characters in half, then the reason is simple - it's either not aware of multi-byte encodings, so it cuts by bytes instead of cutting by characters, or the encoding it uses is different from the encoding of your string, so substr() cuts it wrong.
You may visualize all that by converting your strings to lists of bytes.
That's the point. You can lose accents, or other parts of a character, even if you are aware of the encoding. You need to be aware of the modifiers as well.
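For example, in Python (which indexes by code point), even an encoding-aware slice can silently drop a combining mark, and a byte slice can cut one in half:

    s = "Cafe\u0301"     # 'Café' written as 'e' + COMBINING ACUTE ACCENT
    print(s)             # Café
    print(s[:4])         # 'Cafe' -- the accent is a separate code point and gets dropped
    print(s.encode("utf-8")[:5].decode("utf-8", errors="replace"))  # 'Cafe�' -- the byte slice cut the accent in half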
I'm not sure I understand what you're getting at. But if changing the case didn't alter the length, you could do it in-place, without having to allocate memory. I always thought this was the point?
Visible character count isn't necessarily related to storage size.
If the uppercase form requires a diacritic but the lowercase doesn't, that doesn't (usually) change the visible character count, but it likely increases storage space, depending on encodings and normalization forms. Think of i to capital I with a dot above (which I can't currently type), which would need more storage in UTF-8, whether stored precomposed or as I + combining dot.
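The dotted İ (U+0130) makes a tidy example, assuming Python's default (non-locale-aware) case mapping:

    cap = "\u0130"     # 'İ' LATIN CAPITAL LETTER I WITH DOT ABOVE
    low = cap.lower()  # default Unicode full lowercasing: 'i' + COMBINING DOT ABOVE
    print(len(cap), len(low))                                  # 1 2  (code points)
    print(len(cap.encode("utf-8")), len(low.encode("utf-8")))  # 2 3  (bytes)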
implementing tab-to-spacification in a way that requires tabs to work "as tabs" (i.e. you get your alignment along 4-space lines) with a mono-spaced font.
I suppose some terminal rendering (where you're taking in chunks and you might want to be a bit clever about jumping to spaces to figure out where breaks should go).
There is a notion of UTF codepoints that you can index into (see how Python does string indexing), though I generally think that people who whine a lot about string behavior for UTF tend to just not be reaching for the right tools.
To not have users complain that you're withholding characters when you implement artificial input length limits? ;) (Although e.g. Twitter does not accurately count visible characters, despite the character limit being a prominent feature, suggesting it's not that important.)
"What size does this UI element have to fit this text in our specific mono-spaced font"?
But the point is it ought to be. Tell a user they have "a 280 character limit" and the intuitive interpretation for 98% of the world population that doesn't know the intricacies of Unicode is "a limit of 280 visible characters."
That's what Twitter's tweet limit ought to be, and if there was an easy, culturally neutral algorithm for it I'm sure that's what they would have done.
Twitter clearly has chosen to intentionally deviate from it. E.g. what they specify on that page is not "unicode code points" either, but indeed their own mapping of "this character is worth that many points".
From what that page describes, it appears Twitter's algorithm is in fact "number of unicode code points" except for specific carve-out cases which are so obscenely unfair by that metric as to justify counting differently. E.g. counting Unihan characters as two makes the 280 character limit in these languages be approximately the same limit on information as western languages, since they are about twice as compact in terms of information density. And emojis are an insane amount of unicode code points, particularly when modifiers are applied. Etc.