String Lengths in Unicode (hsivonen.fi)
230 points by panic 35 days ago | 140 comments

The closest to insight in this post is a quote from somebody else. “String length is about as meaningful a measurement as string height”

Under that quote three rationales are offered.

1. "to allocate memory". Fine, like an integer your very low level code needs to actually store strings in RAM and in this sense both integers and strings have a size. I don't think I'd name this size "length" though and certainly if length(5) is an error in your language it seems as though length("word") is also an error by this definition.

2. "quota limit". Just pick anything, there is definitely no reason to name the value you're using for this arbitrary quota "length". If you have a _reason_ for the quota, then you need to measure the reason, e.g. bytes on disk.

3. "display space". This one is a length! But it's in pixels (or millimeters, or some other type of distance unit) and it requires a LOT of additional parameters to calculate. And you'll notice that at last there is actually a string height too for this case, which likewise requires many additional parameters to calculate.

Treat code that cares about string "length" as code smell. If it's in your low-level fundamentals (whether in the memory allocator or the text renderer) it might, like a lot of other things which smell down there, have a good reason to be there; make sure it's seriously unit tested and clearly documented as to why it's necessary. If it's in your "brilliant" new social media web site backend, it's almost certainly wrong and figuring out what you ought to be doing instead of asking the "length" of a string will probably fix a bug you may not know you have.

I'd say all of the above objections don't fix any of the actual issues (which stem from there being multiple, similar metrics one wants regarding the size in bytes, characters, normalized characters, etc. of a Unicode string), while introducing extra semantic bike-shedding!

Whether the metric is called "length", or "size", or whatever, is irrelevant, and length for strings and arrays and lists is so well entrenched and understood that the objections don't make sense.

What's actually the problem is not that "string length" is not "really" a length in the way that a leg or a wall has length, but that for unicode strings it's difficult to calculate, and confusing, unless one understands several unicode implementation mechanics...

The name length itself, for example, or the measuring of said length, was never an issue with ASCII strings (the issue there was to remember to add/subtract the NUL at the end).

It doesn’t matter how well you understand Unicode, it is impossible to compute “the length of a string” because there is no single metric that means that.

It’s like measuring “the size of a box.” I want volume, you want maximum linear dimension, UPS wants length plus width plus height. The problem with “the size of a box” isn’t that it’s hard to measure, it’s that it doesn’t exist. Imagine your favorite language has a Box type with a “size” property. What does it return? How likely is it that the thing it measures is the thing you want?

Of course it was never a problem for ASCII, because ASCII is structured to make most of the measurements people care about be the same value. I want bytes, you want code points, he wants “characters,” doesn’t matter, they’re all the same number.

ASCII is also incapable of representing real world text with good fidelity. It’s inherently impossible to remedy that while maintaining a singular definition of “length.” If you value length measurement over fidelity, you can keep using ASCII.
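To make the point above concrete, here is a minimal Python sketch (the function name `measures` is made up for illustration) that reports several candidate "lengths" of the same string. For pure ASCII they all agree; as soon as a combining accent appears they drift apart.

```python
import unicodedata

def measures(s: str) -> dict:
    """Return several common 'lengths' of the same string."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),
        "nfc_code_points": len(unicodedata.normalize("NFC", s)),
    }

ascii_text = measures("word")
accented = measures("e\u0301")  # 'e' followed by COMBINING ACUTE ACCENT

print(ascii_text)
print(accented)
```

None of these four numbers is "the" length; each answers a different question.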

>It doesn’t matter how well you understand Unicode, it is impossible to compute “the length of a string” because there is no single metric that means that.

That's already covered in my comment: ("[the problem] stems from there being multiple, similar, metrics one wants").

Whether we call all of them length or a specialized name is not the real problem.

The real problem is you need to know what you want of each, and some of them (e.g. regarding normalization, decomposition, and so on) can be hard to grasp.

In 99% of cases people either want to know "how many bytes", or "how many discrete character glyphs of final output" (even if they have combining diacritics etc).

It's really rare to care about the number of glyphs. That's not something you can answer for a string in isolation, anyway; it depends on the font being used to render it. The only code that would care about this would be something like a text rendering engine allocating a buffer to hold glyph info.

I suspect you mean the number of grapheme clusters, which is Unicode's attempt to define something that lines up with the intuitive notion of "a character." This is basically a unit that your cursor moves over when you press an arrow key.
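As a rough illustration of the idea, and explicitly not an implementation of the Unicode segmentation rules, the sketch below counts every code point that is not a combining mark as the start of a new cluster. The name `approx_clusters` is hypothetical, and the approach is known to be wrong for flags, ZWJ emoji sequences, Hangul jamo and more; a real program should call a library that implements the actual rules.

```python
import unicodedata

def approx_clusters(s: str) -> int:
    """Crude estimate: count code points that are not combining marks.

    This is NOT Unicode grapheme cluster segmentation; it only folds
    combining marks (categories Mn, Mc, Me) into the preceding code point.
    """
    count = 0
    for ch in s:
        is_mark = unicodedata.category(ch).startswith("M")
        if not is_mark or count == 0:
            count += 1
    return count

print(approx_clusters("abc"))                    # three plain letters
print(approx_clusters("e\u0301"))                # 'e' + combining acute accent
print(approx_clusters("\U0001F1E9\U0001F1EA"))   # a flag: this estimate is too high
```

The flag case shows why the shortcut is only an approximation: it reports two units where a proper segmenter reports one.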

However, it's pretty uncommon to want to know the number of grapheme clusters too. Lots of people think they want to know it, but I struggle to come up with a use case where it's actually appropriate. An intentionally arbitrary limit like Tweet length is the best I can think of.

"How many bytes" is ambiguous. Do you mean UTF-8, UTF-16, UTF-32, or something else?

There are a lot of different ways to answer the question, "how long is this string?"
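A small Python sketch of the "how many bytes" ambiguity: the same text encoded three ways, without a byte order mark. The helper name `encoded_sizes` is only for illustration.

```python
def encoded_sizes(s: str) -> dict:
    """Byte count of the same text in three Unicode encodings (no BOM)."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# "naïve" followed by a space and U+1F926 (FACE PALM)
sizes = encoded_sizes("na\u00efve \U0001F926")
print(sizes)
```

Seven code points produce three different byte counts, so a "byte length" only means something once the encoding is named.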

You did mention similar metrics, but you then went on to say that the objections don't make sense and that the actual problem is that length for a unicode string is difficult to calculate.

My point is that the difficulty of calculating a length is not the problem. It's annoying, but people have written the code to do it and there's rarely any reason to write it yourself. Just call into whatever library and have it do the work. The problem is that you have to know what kind of question to ask so you can make the call that will actually give you the answer that you need. And that is not the sort of thing that can be wrapped up in a nice little API.

> I struggle to come up with a use case where it's actually appropriate.

TCP packet size is a very real thing for me.

Also in forms it helps to know, for example, that no one has entered more than 2 characters for a USA state abbreviation.

Definitely open to bigger ways about thinking of these things though.

How is calculating the number of grapheme clusters relevant to TCP packet size?

As for state abbreviations, is $7 a valid abbreviation? Is XQ? You’d probably want to validate those against a list of known good ones.
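A minimal sketch of validating against a known-good list instead of checking a length. The set below is deliberately partial and purely illustrative; a real form would load the full list of valid codes.

```python
# Deliberately partial list for illustration only.
KNOWN_STATES = {"CA", "NY", "TX", "WA"}

def is_valid_state(code: str) -> bool:
    """Accept only codes that appear in the known-good set."""
    return code.strip().upper() in KNOWN_STATES

print(is_valid_state("ca"))
print(is_valid_state("$7"))
print(is_valid_state("XQ"))
```

A two-character length check would accept both "$7" and "XQ"; membership in the set rejects them without ever asking how long the input is.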

For which use case is "how many character glyphs" the measure you want to know?

Super curious about why UPS wants l+w+h. Any detail you can link?

Apparently I misremembered slightly: it's actually width + 2 x length + 2 x height.

That link doesn't seem to explain the why, but my understanding is that it's just a decent heuristic for the general difficulty of handling packages as they go through the system. Volume wouldn't be appropriate, because a really long, skinny box is harder to handle than a cube of the same volume.

How do you decide which dimension is the width versus the length? I assume height is often significant for packages containing things that shouldn't be turned upside down, but length versus width seems pretty arbitrary. Is width just assumed to be the shorter of the two dimensions?

Length is defined as the longest side. The other two sides are interchangeable so pick what you like. This measure doesn’t appear to account for packages that require a certain orientation.

Fun fact: the iOS bug where you could brick a device by sending some strings in Arabic to the device was essentially due to assuming that there is a monotonic relationship between some measure of "string length" and "display space".

(essentially, iOS was trying to truncate strings in notifications to make them fit available space, except it did that in a very naive way, and would crash on any case where removing codepoints from a string made it display _longer_. Pretty easy to trigger with letters like ي)

Can you go into a little more detail how that works? I know almost nothing about Arabic.

The same letter can look different depending on whether it's at the beginning, in the middle of, or at the end of a word. So if you remove a letter with a fairly short final form, which is preceded by a letter with a fairly short medial form but a fairly long final form, it's possible that the new rendering of this letter would be longer than the previous two.

This problem isn't exactly Arabic-specific, it's just super easy to trigger in Arabic. You've got kerning in the Latin script as well, and you could theoretically have a font that makes `the` look smaller than `th`.

But in Arabic, letters change form based on where they are in the word. All three things with two dots on them are the same letter in تتت, but the middle one is much smaller. I can imagine something like يم being truncated to ي and the end result being longer visually, depending on the font.
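One way to avoid depending on that assumption is to measure every candidate and never infer that a shorter string renders narrower. The sketch below is hypothetical throughout: `longest_fitting_prefix` is an invented name, and `toy_measure` is a lookup table standing in for a real text layout engine. It also cuts by code point, which a real implementation would want to replace with proper cluster boundaries.

```python
from typing import Callable

def longest_fitting_prefix(text: str, max_width: int,
                           measure: Callable[[str], int]) -> str:
    """Return the longest prefix whose measured width fits.

    Every candidate is measured; nothing assumes that a shorter
    prefix renders narrower than a longer one.
    """
    for end in range(len(text), -1, -1):
        candidate = text[:end]
        if measure(candidate) <= max_width:
            return candidate
    return ""  # not even the empty string fits

# Toy widths: the one-letter prefix "a" is wider than "ab",
# mimicking a letter whose final form is wider than its medial form.
TOY_WIDTHS = {"": 0, "a": 5, "ab": 3, "abc": 6}

def toy_measure(s: str) -> int:
    return TOY_WIDTHS[s]

print(repr(longest_fitting_prefix("abc", 4, toy_measure)))
print(repr(longest_fitting_prefix("abc", 2, toy_measure)))
```

With a budget of 2 the loop passes over "a" (width 5) even though "ab" was narrower, and ends at the empty string instead of looping or crashing.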

Interestingly, in many use cases all these "lengths" agree. For example an empty string is 0 in all these measures, and fuzzy metrics like "unreasonably long password" work equally well with "500 normalized characters" "500 bytes" "1000px at font size 8" etc.

The problems only arise when talking about "a string of length 20" which indeed has a strong smell

"unreasonably long password" must go into the area of "Content-Length is too long for this HTTP request to login/change password".

>The problems only arise when talking about "a string of length 20" which indeed has a strong smell

It has no smell at all. We interface with all kinds of systems with limited display space, external forms, POS displays, databases with varchar columns, and so on all the time.

Whether certain strings should have a specific length is a business decision, not a code smell...

limited display space is a case for "string length in px", which is notoriously hard to calculate and has poor library support. Just because 20 "x" fit doesn't mean 20 "w" will fit. Fixed space fonts are an exception, but they have problems with Chinese.

Databases with varchar columns exist, but varchar(20) sounds generally suspect unless it's a hash or something else that's fundamentally limited in length.

>limited display space is a case for "string length in px", which is notoriously hard to calculate and has poor library support.

It is notoriously easy when the display is an LED display, a banking terminal, a form-based monospaced POS, something that goes to a printed out receipt (like an airline check-in or a luggage tag), a product / cargo label maker, and tons of other systems billions depend upon everyday, where one visible glyph = 1 length unit, and type design doesn't come into much play...

It's only easy if the system forbids everything that would make calculating visible length hard, which I think constitutes extremely poor library support. I want to see the monospaced system that can correctly print Mongolian: ᠮᠣᠩᠭᠣᠯ ᠪᠢᠴᠢᠭ If properly implemented, it should join the characters and display them vertically. But your browser is probably showing them horizontally right now, because support for vertical writing is seriously broken: https://en.wikipedia.org/wiki/Mongolian_script#Font_issues

Most of those systems are terrible at handling non-latin text because they get all these things wrong. Of course it's "easy" to handle length in these cases, they've selected out any kind of text that makes handling length hard.

The mere assumption that "one glyph" is a meaningful well-defined concept that works across languages is the problem here.

I was at one time responsible for code to populate preprinted forms. I only had to deal with ascii but the best solution was still to just do the layout and then check that the bounding box wasn't too big.

I quite agree... mostly... but I tend to abbreviate input from untrusted sources using String.length() anyway, because it's so simple, and a single-digit overestimate is often safe.

For example, I have a phone app that abbreviates various things at 300 whatevers because I know that on one hand 300 is more than the phone will display, but on the other it's little enough that it'll cause no processing problems. If I process 300 whatevers and then the display can only show 120, I haven't burned much of the battery. I overestimated by a factor of 2.5, burned a minuscule part of the battery, and I gained simplicity.

I strongly agree that numbers as low as 20 are a code smell.

Mechanically abbreviating strings is likely to have nasty corner cases. Avoid if at all possible.

There are fancy Unicode corner cases, like the flags (cutting a flag emoji in half doesn't get you half a flag, it gets you one half of the flag's country code identifier) but there are plenty of corner cases already in ASCII.
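The flag case can be seen directly with Python's standard `unicodedata` module. A flag emoji is a pair of regional indicator code points, so slicing by code point leaves a lone indicator letter behind; the German flag is used here only as an example.

```python
import unicodedata

flag = "\U0001F1E9\U0001F1EA"  # regional indicators D + E, rendered as the German flag
half = flag[:1]                # slicing by code point keeps only the first indicator

names = [unicodedata.name(ch) for ch in flag]
print(names)
print(unicodedata.name(half))
```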

Abbreviating "Give all the money to Samantha's parents to hold in trust" as "Give all the money to Sam" doesn't involve any scary modern encoding, just the problem that this isn't how human languages work.

Can you suggest a relevant nasty corner case?

Consider an unbounded input string (generally less than 1k, but sometimes more than 100k), a phone display large enough to display, say, 80-120 glyphs, and a mechanical abbreviation to a few hundred codepoints soon after input, before all expensive processing. What are the nasty corner cases?

You are not thinking outside of your web development box.

Embedded systems care deeply about display sizes and know them exactly.

Embedded systems increasingly have screens with real pixels and support for weird languages. At least they are low-level enough that "give me the width and height of that string" is actually an achievable task.

The ambiguity of which length is being talked about is what smells. It is a bad odor, not bad enough that I'd not use a library that had 'length' as a property, but it is absolutely a smell because I can't be sure that every single last usage of 'length', inside the library or in my code, is using the same definition for 'length' as is meant.

It's subtle, which is what makes it a smell, rather than a "DO NOT WANT".

> "1000px at font size 8"

That one is likely to get you into trouble when you allow 0 width characters.

1. You don't iterate over bytes in an integer, so you don't need to know the size of an int since all uses are handled by the compiler. If you do care, you can always find out with a sizeof check (or preferably specify the size directly). The compiler can't handle Unicode for you since it's typically dynamic data.

2. This is reasonable.

3. What about fixed-width fonts like a terminal screen? In this case, I care about how many distinct (non-combined) code points exist, and a count of code points is often an acceptable approximation.
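For the fixed-width case in point 3, a hedged Python sketch of a column estimate using only the standard library. The function name `terminal_columns` is invented, and the rules are a simplification: combining marks and format characters count as zero cells, East Asian wide and fullwidth characters as two, everything else as one. Real terminals differ in details such as emoji sequences and control characters.

```python
import unicodedata

def terminal_columns(s: str) -> int:
    """Estimate how many fixed-width cells a string occupies."""
    columns = 0
    for ch in s:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # combining marks and format characters take no cell
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2  # wide and fullwidth characters take two cells
        else:
            columns += 1
    return columns

print(terminal_columns("abc"))
print(terminal_columns("\u4e2d\u6587"))  # two CJK ideographs
print(terminal_columns("e\u0301"))       # 'e' + combining acute accent
print(terminal_columns("a\u200bb"))      # zero-width space between letters
```

This shows where a plain code point count over- or under-estimates: it would report 2, 2, 2 and 3 for the last three inputs.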

That being said, when talking about string length, I mostly care when iterating and allocating memory, and those are two different usages of the term "length". I like how Go handles it: len(string) gives the number of bytes, len([]rune(string)) gives the number of Unicode code points.

Everything else is a special case IMO that shouldn't use length, except perhaps as a conservative estimate (e.g. fixed width rendering).

If you're not using a modern iterator, you'll want string length and string indexing. Otherwise you don't need string lengths except for the reasons given in TFA and above, but (1) (to allocate memory) can be hidden in most cases, (2) is mostly not necessary (account for encoded bytes where possible and be done), and there's very little code that needs to figure out (3).

For almost every task, an iterator that iterates graphemes (extended grapheme clusters) is the right thing to do.

I do think it's annoying that Unicode ended up with such a complex extended grapheme cluster scheme though.

> Treat code that cares about string "length" as code smell

This is nonsense. Pretty much anywhere that you're accepting user input on the web you want to have some sort of length limit. Even if you're Hacker News rather than Twitter, you probably don't want to let anybody post a 10GB comment. You just need to pick an appropriate length metric. That's part of the post's point, and it's also apparently your point a few sentences after you rubbish it; why are you insulting the post and then suddenly agreeing with it?
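If the metric chosen for such a limit is encoded bytes, one possible sketch in Python looks like this. The name `truncate_utf8` is hypothetical. It guarantees valid UTF-8 within the byte budget, but it can still cut a grapheme cluster in the middle, so rejecting over-long input outright may be the better policy.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Trim text so its UTF-8 encoding is at most max_bytes long."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial sequence left by the byte slice.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_utf8("h\u00e9llo", 2))   # the 2-byte 'é' does not fit after 'h'
print(truncate_utf8("h\u00e9llo", 3))
print(truncate_utf8("h\u00e9llo", 10))
```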

For what it's worth, I have used the string byte length (i.e. the UTF-8 code unit count) as a (weak) component for word weight in a clustering algorithm.

I think it’s good that Rust sort of forces you to recognize the complexity in this. If you try s.len(), it’ll be in terms of UTF-8 bytes, which might be what you want… or might be far from it. If you switch to s.chars().count(), you’ll get Unicode scalar values, which may be closer to the mark. And if you need proper segmentation by “user-perceived character,” as set out in Unicode Annex 29, you’ll have to bring in an external crate. That’s fair enough.

Keep in mind that even Annex 29 is, on some level, just a starting point. Its segmentation rules don’t work “out of the box” for me with Persian and Arabic text. I’m not totally on-board with its treatment of the zero-width non-joiner (U+200C), and it doesn’t deal with manual tatweel/kashida (U+0640) at all. So you make the necessary adjustments for your use case. The rabbit hole is deep.

As a Python user who came to Rust, encodings never really made much sense to me. I knew what encodings basically were, but I knew neither why it has to be that way nor how to deal with common situations without feeling insecure.

Rust put the whole thing front and center and made it incredibly clear why they made that choice. The design choices around encodings have also been better thought out than those in Python; it never becomes a “hairy” mess, but something you can rationally manage.

> I think it’s good that Rust sort of forces you to recognize the complexity in this.

It's not bad, but Swift still does better:

* String.Index is an opaque type (ideally it would be linked to a specific string too)

* it returns grapheme cluster counts by default, that's probably the least surprising / useless though not necessarily super useful either

* other lengths go through explicit lazy views e.g. string.utf8.count will provide the UTF8 code units count, string.unicodeScalars.count will provide the USVs count

> If you switch to s.chars().count(), you’ll get UTF-8 scalar values

You get unicode scalar values (which I assume is what you meant).

I think what Swift does is good for its design space. I definitely would not like it if Rust did the same thing, because you need fast ways to store and reuse indices for parsing and other things.

Swift is primarily geared towards UI, and in that case using EGCs is the right design choice.

The big problem with extended grapheme clusters by default is that it changes behavior completely as Unicode spec changes. So Swift strings are nearly guaranteed to change their default behavior between versions.

TFA makes a very good case that Swift got it wrong.

Perhaps string length should be more hidden though, or parametrized with a unit to count. Perhaps it should be

  let nu = s._smelly_length(UTF8CodeUnits);
  let nc = s._smelly_length(ExtendedGraphemeClusters);

Stop using "zero length" to denote empty string, just have an empty string method.

Not sure how you read "Swift’s approach to string length isn’t unambiguously the best one" and interpret it as "Swift got it wrong."

The author's substantive criticism of Swift's String/Character/etc. types seems to be complications arising from the dependency on ICU (such as not necessarily being able to persist indices across icu4c versions).

Good catch! I was still able to edit the comment.

I agree here! If anything, in my Rust code, I've learned that I very rarely actually need to index a string; `String::contains`, `String::starts_with`, and `String::ends_with` cover the vast majority of my use cases, and if for some reason I actually need to inspect the characters, `String::chars` will let me iterate over them.

There is no char concept in Go, only rune, which means a code point. This reduces much confusion.

It's probably a false sense of security though. If runes are in fact code points, that's about the least helpful abstraction. They do not encode grapheme clusters meaning some items, such as emoji, may be represented by multiple runes, and if you split them along the boundary, you may end up with invalid strings or two half-characters in each substring. You can't really do much with code points -- they're useful when parsing things or when implementing unicode algorithms, and not much else.

If you care about grapheme clusters, it doesn't mean using characters is wrong: you can layer a higher level API on top of characters.

For example, if you want to encode emojis some non-Unicode abstract description goes in and a sequence of characters go out, and if you want to figure out the boundaries of letters clustered with the respective combining diacritical marks they are subsequences in a sequence of characters.

> If you care about grapheme clusters, it doesn't mean using characters is wrong: you can layer a higher level API on top of characters.

Be very careful here; when talking Unicode, there is no such single thing as a "character", and people being insufficiently precise here has been the source of much confusion.

"You can't really do much with code points -- they're useful when parsing things or when implementing unicode algorithms, and not much else."

That is exactly what "rune" is intended for, though. In the several years I've been using Go, I believe I've used it once, precisely in a parsing situation.

For the most part, Go's answer to Unicode is to just treat them as bytes, casually assume they're UTF-8 unless you really go out of your way to ensure they are something else, and to not try to do anything clever to them. As long as you avoid writing code that might accidentally cut things in half... and I mean that as a serious possible problem... it mostly just works.

To be honest, if you're trying to tear apart this example emoticon at all, you're almost certainly doing something wrong already. The vast bulk of day-to-day [1] code most people are going to write should take the Unicode characters in one side, and just spit them out somewhere else without trying to understand or mangle them. A significant amount of the code that is trying to understand them should still not be writing their own code but using something else that has already implemented things like word tokenization or something, or performing operations using Unicode-aware libraries (e.g., implementing a find-and-replace function can basically do a traditional bytestring-based replacement; code to replace "hello" with "goodbye" will work even if it has this emoticon in the target text if you do the naive thing). What's left is specialized enough to expect people to obtain extra knowledge and use specialized libraries if necessary.
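The find-and-replace claim can be checked with a short Python sketch that works purely on UTF-8 bytes. It relies on a property of UTF-8: bytes of ASCII characters never occur inside a multi-byte sequence, so a byte-level match on "hello" cannot land in the middle of the emoji.

```python
# The facepalm emoji sequence: 5 code points, 17 UTF-8 bytes.
text = "hello \U0001F926\U0001F3FC\u200D\u2642\uFE0F world"
data = text.encode("utf-8")

# Plain byte-level replacement; the emoji bytes are never inspected.
replaced = data.replace(b"hello", b"goodbye").decode("utf-8")
print(replaced)
```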

A lot of what's going wrong here is putting all these things in front of everybody, which just confuses people and tempts them into doing inadvisable things that half work. In a lot of ways the best answer is to stop making this so available and lock it behind another library, and guide most programmers into an API that treats the strings more opaquely, and doesn't provide a lot of options that get you into trouble.

It isn't necessarily a perfect solution, but it's a 99% solution at worst. I write a lot of networking code that has to work in a Unicode environment, so it's not like I'm in a domain far from the problem when I say this. It's just, 99% of the time, the answer is, don't get clever. Leave it to the input and rendering algorithms.

[1]: I say this to contrast things like font rendering, parsing code, and other things that, while they are vitally important parts of the programming ecosystem and execute all the time, aren't written that often.

Does it? I doubt people expect len(facepalmguy) to be 17 whether you say that's counting "runes" or "chars" (with both being defined as UTF-8 code points). If you know enough about Unicode to object to code points being called a shorthand for "characters," I doubt you'll be confused for more than a second.

> There is no char concept in Go, only rune, which means a code point.

Go named their char rune. That's a distinction without a difference.

Rune in Go means a scalar value i.e. the same as Rust char but not the same as C char or Java char.

> Rune in Go means a scalar value i.e. the same as Rust char

Yes. It's also the original intended meaning of Java char, it just didn't survive the transition to 21 bit USVs.

So distinction without a difference.

I seriously applaud the writer for diving into Unicode in such detail and comparing multiple implementations in different languages. That must have taken a while!

He is working for Mozilla and I guess he needs to actually know all those nitty-gritty details. I can't imagine even mustering the patience to analyze it.

In some way I am really scared about Unicode. I don't care if the programming language simply allows input and output of Unicode in text-fields and configuration files. But where it gets tough is when you actually need to know the rendered size of a string, or convert encodings because the output format requires it. There are so many places where stuff can go awry, and there's only a small passage in the blog post which mentions fonts. Dragons abound: font encoding, kerning and whatnot.

Imagine writing a game. Everything works nice and dandy with your ASCII format and now your boss approaches you and wants to distribute the game for the Asian market! Oh, dear... My nightmares are made out of this!

How do you handle this? Do you use a programming language which does everything you need? (Which one?) Do you use some special libraries? What about font rendering? Any recommendations?

> Imagine writing a game. Everything works nice and dandy with your ASCII format and now your boss approaches you and wants to distribute the game for the asian market!

You just make sure all translated text is in UTF-8, and use Google Noto fonts for those languages. All game engines I know render UTF-8 text without problems if you supply a font that has the needed glyphs.

Source: I'm an indie game developer and have recently localized my game to Chinese. The game is a mix of RPG and roguelike, so it has a lot of text (over 10000 words). I used SDL_TTF to render text. Precisely: TTF_RenderUTF8_Blended() function. The only issue I had is with multiline/wrapped text. SDL_TTF doesn't break lines on Chinese punctuation characters (.,;:!?) so I would search+replace strings at runtime to add a regular space character after those.

> SDL_TTF doesn't break lines on Chinese punctuation characters (.,;:!?)

Those aren't Chinese punctuation characters. Chinese punctuation characters are full-width, including the spacing that should follow them (or in the case of "(", precede) within the glyph itself: (。,;:!?). (You may also notice that the period is radically different.) Chinese text should almost never include space characters.

Chinese applications seem happy to break lines anywhere including in the middle of a word, but punctuation seems like an especially good place for a line break, so I'm confused why SDL_TTF would go out of its way to avoid breaking there.

It sounds more like a bug in SDL_TTF than a deliberate attempt to not break the line on Chinese punctuation marks.

I wonder if SDL_TTF works with Unicode Zero-Width Space (U+200B). If so, that would probably be the right choice.

> Those aren't Chinese punctuation characters.

I know, I meant the actual ones you wrote above.

> I'm confused why SDL_TTF would go out of its way to avoid breaking there.

SDL_TTF doesn't break at all. If you have a long Chinese text which uses proper punctuation characters, it would never break, because it only breaks on ASCII whitespace.

I wanted to avoid breaking lines in the middle of a word, so I added extra "regular" space characters to force breaking the line.
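A sketch of that search-and-replace workaround in Python, not the author's actual code. The function name and the `break_char` parameter are made up for illustration. It inserts a break character after each full-width punctuation mark; whether a given renderer treats U+200B as a break opportunity is something to test rather than assume.

```python
import re

# Full-width punctuation: ideographic full stop, comma, semicolon,
# colon, exclamation mark, question mark.
FULLWIDTH_PUNCTUATION = "\u3002\uff0c\uff1b\uff1a\uff01\uff1f"
_PUNCT_RE = re.compile(f"([{FULLWIDTH_PUNCTUATION}])")

def add_break_opportunities(text: str, break_char: str = " ") -> str:
    """Insert break_char after each full-width punctuation mark."""
    return _PUNCT_RE.sub(lambda m: m.group(1) + break_char, text)

sample = "\u4f60\u597d\uff0c\u4e16\u754c\u3002"  # "Hello, world." in Chinese
print(add_break_opportunities(sample))
print(repr(add_break_opportunities(sample, break_char="\u200b")))
```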

You don't really need to only break on punctuation. There is no convention to do so and so long as you do not break any logograms in half, the resulting text reads perfectly fine. In fact, the convention is to have left and right justified text with equal numbers of monospaced logograms, including punctuation, on each line (or the equivalent for vertical text). Classical Chinese before the 20th century was seldom punctuated.

I wasn't aware of this. Thanks.

>You just make sure all translated text is in UTF-8, and use Google Noto fonts for those languages. All game engines I know render UTF-8 text without problems if you supply a font that has the needed glyphs.

The game could have used a custom engine (like tons of games do), or the requirement could include e.g. Arabic or some such RTL text, further messing up the display...

First rule: don't panic :)

Also disclaimer: I've been working on games which were translated to >20 languages including some exotic ones, but I'm in no way a Unicode expert.

- most important: consider using UTF-8 as text encoding everywhere, and only encode and decode from and to other text encodings when needed (for instance when talking to APIs which don't understand UTF-8, like Windows)

- be very careful in all places where users can enter strings, and with filesystem paths, this is where most bugs happen (one of the most popular bugs is when a user has Unicode characters in his login name, and the game can't access the user's "home directory", happens to the big ones too: https://what.thedailywtf.com/topic/15579/grand-theft-auto-v-...)

- get familiar with how UTF-8 encoding works and how it is "backward compatible" with 7-bit ASCII, there are good chances you don't need to change much of your old string processing code

- rendering is where it gets interesting, and here it makes sense to only do what's needed:

(1) The easiest case is American and European languages, these are all left-to-right, have fairly small alphabets and don't have complicated 'text transformation' rules

(2) East-Asian languages with huge alphabets can be a problem if you need to pre-render all font textures.

(3) The next step is languages which render from right-to-left, the interesting point is that substrings may still need to be rendered left-to-right (for instance numbers, or "foreign" strings)

(4) And finally there are languages like Arabic which rely heavily on modifying the shape of 'characters' based on where they are positioned in words or in relation to other characters, you need some sort of language-specific preprocessing of strings before you forward them to the renderer. HarfBuzz is a general solution for this problem, but it's also a lot of code to integrate (we created a specialized transformation only for Arabic).

(5) For actual rendering, all text rendering engines which can use TTF fonts are usually ready for rendering UNICODE text

So basically, the problem becomes a lot easier if you only need to support a specific set of languages.
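On the earlier point that UTF-8 is "backward compatible" with 7-bit ASCII, a short Python sketch of what that means in practice. ASCII text has identical bytes in both encodings, and old code that splits on an ASCII delimiter keeps working on UTF-8 bytes because every byte of a multi-byte sequence is 0x80 or above. The sample strings are invented.

```python
ascii_text = "save_game.dat"
mixed_text = "r\u00e9sum\u00e9/\u4e2d\u6587.txt"

# Pure ASCII encodes to the same bytes either way.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Splitting the UTF-8 bytes on an ASCII delimiter cannot cut a character.
parts = [p.decode("utf-8") for p in mixed_text.encode("utf-8").split(b"/")]

print(same_bytes)
print(parts)
```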

The last part about having support for a limited number of languages is bigger than you probably expect. Generally, unless a piece of software has actually been adapted to and tested with a specific language, it shouldn't claim to support it, even if it is just processing UTF-8 encoded text in that language.

English-native developers building apps for mainly Western languages can easily introduce encoding i18n bugs that are really unfair for other folks. The rise of emoji in everyday text has been great to force developers to deal with the upper end of the unicode spectrum and make fewer assumptions about inputs. Often in a data processing app I'll throw a few emoji in my unit tests.

Are you familiar with the big list of naughty strings?


I have seen it before, but that's a good reminder in this case!

Shameless plug: a few years ago I wrote a little Python library with the aim to "sanitize" filenames in a cross-platform, cross-filesystem manner: https://github.com/ksze/filename-sanitizer

By "sanitize", I mean it should take as input any string for a filename, and clean it up so it looks the same and can be safely stored on any platform/filesystem. This would allow people to, for instance, download or exchange files over the Internet and know deterministically how the filename shall look on the receiving side, instead of relying on the receiving side to clean up the filename.

The length of Unicode strings was definitely one of the pain points. Back then I only knew about NFC vs NFD.
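For anyone who hasn't hit the NFC vs NFD issue: a small sketch using only Python's standard `unicodedata` module. The helper `same_text` is a hypothetical name, and this is not taken from the library linked above; it only shows why two filenames that look identical can have different lengths.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point (NFC form)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (NFD form)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(composed), len(decomposed))    # 1 2
print(composed == decomposed)            # False
print(same_text(composed, decomposed))   # True

# The byte length differs too, which matters for filename length limits.
nfc_bytes = len(unicodedata.normalize("NFC", decomposed).encode("utf-8"))
nfd_bytes = len(unicodedata.normalize("NFD", composed).encode("utf-8"))
print(nfc_bytes, nfd_bytes)              # 2 3
```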

Now that I have read this article, I realise that my algorithm to determine where to truncate a filename is probably still wrong, and that I need to dig both much deeper and much wider.

If anybody wants to help dig some serious rabbit holes, you're most welcome to fork and make PRs.

As far as I can tell, it looks like progress on getting this into the Python stdlib has stalled: https://bugs.python.org/issue30717

In the absence of support in the stdlib, is https://github.com/alvinlindstam/grapheme the best to use?

I would highly recommend getting to know https://github.com/ovalhub/pyicu, which in addition to counting grapheme clusters with BreakIterator can also do things like normalisation, transliteration, and anything else you might need to do relating to Unicode.

There’s a lot of commentary in this post about storage size, but outside of serialisation (for file storage or network reasons) the actual byte count is typically not what you want.

What you want in your UI is the number of glyphs that will be presented to the user. Which means it is correct for it to change across OS or library versions, as new emoji will result in a different number of rendered glyphs.

There are very few places in program UI that want anything at all other than the actual glyph count - emoji aren’t even the first example of this, there are numerous “characters” that are made up of letters+accents that are not single code units/points/whatever (I can never recall), for which grapheme count is what you want.

The only time you care about the underlying codepoints is really if you yourself are rendering the characters, or you’re implementing complex text entry (input managers for many non-English characters). The latter of which you would pretty much never want to do yourself - I say this having done exactly that, and making it work correctly (especially on windows) was an utter nightmare.

Then once you get past code point/units to the level of bytes your developer set divides neatly in two:

* people who think the world is fixed # of bytes per character

* people who know to use the API function to get the number of bytes

But for any actual use of a string in a UI you want the number of glyphs that will actually be displayed.

My assumption is that that logic is why Swift gives you the count it does.

The article acknowledges that the Swift design makes sense for the UI domain when the Unicode data actually stays up-to-date: "It’s easy to believe that the Swift approach nudges programmers to write more extended grapheme cluster-correct code and that the design makes sense to a language meant primarily for UI programming on a largely evergreen platform (iOS)."

The problem I had was your framing of this as bad because the “size” of a string changes as library/OS versions change, whereas I believe that is desirable behavior.

My point is generally that the only measurements a developer should ever care about are the visible glyph count and the byte/storage size.

What do you do with the glyph count? I've never needed it. For storage I need to know the size in bytes. For the GUI I need to know the dimensions of the rendered result.

Caret positioning, selection, etc /if/ you’re implementing text entry yourself - though doing so means accepting the huge amount of work required to support input managers.

In every other case you are 100% correct: you want either the graphical bounds or the number of bytes. The other measures seem of questionable utility to anyone outside of the Unicode library implementation :)

> But I Want the Length to Be 1! > There’s a language for that

Perl 6: https://docs.perl6.org/routine/chars

And of course, if you want the other length forms, use .codes or .encode('UTF-8').bytes. But internally to Rakudo, an emoji is really just one code point, so most of the common string ops are O(1). There's a bit of an optimization if all of the code points fit into ASCII, but otherwise we use synthetic code points to represent all of the composed characters.

This is probably the biggest mystery to me of the Python 3 migration. If they were going to break backcompat, why on Earth didn't they fix Unicode handling all the way? They didn't have to go completely crazy with new syntax like Perl 6 did, but most languages shift too much of the burden of handling unicode correctly onto the programmer.

With Unicode being a moving target I'm not sure any language will truly "fix it all the way": building in things like grapheme-cluster breaking/counting to the language just means the language drifts in and out of "correctness" as the rules or just definitions of new or existing characters change. Of course, this is covered in the article, but when you "clean up" everything such that the language hides the complexity away you can still have people bitten (say, by not realizing a system/library/language update might suddenly change the "length" of a stored string somewhere). Or you could simply have issues because developers aren't totally familiar with what the language considers a "character," as there's essentially no agreement whatsoever across languages on that front (Perl 6 itself listing the grapheme-cluster-based counting as a potential "trap" and noting that the behavior differs if running on the JVM.) I don't think a "get out of jail free card" for Unicode handling is really possible.

The codepoint-based string representation used by Python 3 may be "the worst" (I'm not totally sure I agree) but it's fine. The article's main beef is about the somewhat complex nature of the internal storage and the obfuscation of the underlying lengths.

I mentioned Perl 6's in-RAM storage format.

I didn't seek to mention every programming language for everything. E.g. I didn't mention C#, since UTF-16 was already illustrated using JavaScript.

So does php with `mb_strlen`

"Python 3’s approach is unambiguously the worst one"

No rationale for this seems to be included in the article.

It's in the section with the heading "Which Unicode Encoding Form Should a Programming Language Choose?":

"Reacting to surrogate pairs by wishing to use UTF-32 instead is a bad idea, because if you want to write correct software, you still need to deal with variable-width extended grapheme clusters.

The choice of UTF-32 arises from wanting the wrong thing."

But Python's strings are not UTF-32, they are sequences of Unicode code points, not code units in some encoding. I don't remember how they're stored internally; that's an implementation detail not relevant to the programmer who uses Python.

Whether the use of Unicode code points instead of some Unicode encoding is a good thing or not, that I don't know.

> But Python's strings are not UTF-32

The article says "Python 3 strings have (guaranteed-valid) UTF-32 semantics" and later argues that the fact that there's a distinction between the semantics and actual storage is a data point against UTF-32.

> they are sequences of Unicode code points, not code units in some encoding

They are sequences of _scalar values_ (all scalar values are code points but surrogate code points are not scalar values). Exposing the scalar value length and exposing indexability by scalar value index is the same as "(guaranteed-valid) UTF-32 semantics".

Note that e.g. Rust strings are conceptually sequences of scalar values, and you can iterate over them as such, but they don't provide indexing by scalar value or expose the scalar value length without iteration.

JavaScript strings, on the other hand, are conceptually sequences of code points.

> I don't remember how they're stored internally

The article says how they are stored...

> They are sequences of _scalar values_ (all scalar values are code points but surrogate code points are not scalar values). Exposing the scalar value length and exposing indexability by scalar value index is the same as "(guaranteed-valid) UTF-32 semantics".

Sorry. I'm shocked that I tested wrong when researching the article. Python 3 indeed has code point semantics and not scalar value semantics. I've added a note to the article that I've edited in corrections accordingly.
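A quick way to see the code point (rather than scalar value) semantics for yourself. This is just a sketch; `is_scalar_value_string` is a hypothetical helper, not something from the article or the standard library.

```python
lone = "\ud83e"  # a high surrogate on its own: a code point, but not a scalar value


def is_scalar_value_string(s: str) -> bool:
    """True if every code point in s is a Unicode scalar value (no surrogates)."""
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


# Python happily stores and counts the lone surrogate...
print(len(lone))                        # 1
print(is_scalar_value_string(lone))     # False

# ...but it cannot be encoded as well-formed UTF-8.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                        # False
```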

Python 3 is even more messed up than I thought!

What can you use the number of code points in a string for? I can't think of a single use case where that is actually useful.

Length of a string in bytes is useful, though most code shouldn't need to operate at that level of abstraction.

Length of a string in glyphs is useful if you are formatting something to a fixed-width display, though that is kind of a niche use-case.

Length of a string when rendered is frequently useful, though impossible to calculate from just the string's contents.

Length of a string in code points can't be used correctly for anything.

from the article:

CPython since 3.3 makes the same idea three-level with UTF-32 semantics: Strings are stored as UTF-32 if at least one character has a non-zero bit in its most-significant half. Else if a string has a non-zero bits in its second-least-significant 8 bits of at least one character, the string is stored as UCS2 (i.e. UTF-16 excluding surrogate pairs). Otherwise, the string is stored as Latin1.
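The three storage widths in that quote can be observed from Python itself. This is a rough sketch that assumes CPython (`sys.getsizeof` reports implementation-specific sizes, so other interpreters will differ), and `bytes_per_code_point` is a made-up helper name.

```python
import sys


def bytes_per_code_point(ch: str, n: int = 1000) -> int:
    """Estimate per-code-point storage by measuring how much n extra copies add.

    Comparing two strings of the same kind cancels out the fixed header size.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


latin1_width = bytes_per_code_point("a")           # fits in Latin1
ucs2_width = bytes_per_code_point("\u0100")        # needs 16 bits
ucs4_width = bytes_per_code_point("\U0001F926")    # needs more than 16 bits

# On CPython 3.3+ this is expected to print: 1 2 4
print(latin1_width, ucs2_width, ucs4_width)
```

Note that a single astral character anywhere in the string widens the storage of every character in it, while `len()` stays the same.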

I think the author meant that UTF-32 is the worst encoding, though that seems to conflate choice of "length" definition with how the value is encoded internally.

No, just that you only want to expand the encoded string into Unicode wide bytes for certain operations and you want to pay that price only on demand.

The article also explains how internally this is an optimization the Python interpreter actually does.

We went through a lot of pain to get this right in Tamgu (https://github.com/naver/tamgu). In particular, emojis can be encoded across 5 or 6 Unicode characters. A "black thumb up" is encoded with 2 Unicode characters: the thumb glyph and its color.

This comes at a cost. Every time you extract a sub-string from a string, you have to scan it first for its codepoints, then convert character positions into byte positions. One way to speed things up a bit is to check whether the string is pure ASCII (see https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u...) and, if so, apply the regular operators.
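For readers who want to see the shape of that cost, here is a minimal Python sketch of the general idea (not Tamgu's actual code, which is C++ and uses intrinsics): map a code point index to a byte offset by scanning for lead bytes, with an ASCII fast path. `byte_offset` and `substring` are hypothetical names, and the input is assumed to be valid UTF-8.

```python
def byte_offset(data: bytes, char_index: int) -> int:
    """Byte offset of the char_index-th code point in valid UTF-8 data."""
    if data.isascii():
        # Fast path: one byte per code point, so index and offset coincide.
        if char_index > len(data):
            raise IndexError("code point index out of range")
        return char_index
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:      # not a continuation byte: a new code point starts
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:             # index just past the last code point
        return len(data)
    raise IndexError("code point index out of range")


def substring(data: bytes, start: int, stop: int) -> bytes:
    """Slice UTF-8 bytes by code point positions (two scans in the slow path)."""
    return data[byte_offset(data, start):byte_offset(data, stop)]


sample = "héllo wörld".encode("utf-8")
print(substring(sample, 1, 5).decode("utf-8"))   # éllo
print(substring(b"hello", 1, 3))                 # b'el'
```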

We implemented many techniques based on "intrinsics" instructions to speed up conversions and search in order to avoid scanning for codepoints.

See https://github.com/naver/tamgu/blob/master/src/conversion.cx... for more information.

It's not sufficient to use code points right? Some characters, for instance your example of the black thumbs up emoji, are grapheme clusters [1] composed of multiple code points. I think you have to iterate in grapheme clusters and convert that back to an offset in the original underlying encoding.

If you just rely on code points you risk splitting up a grapheme cluster into (in your example) two graphemes, one in each sub-string, the left representing "black" and the right representing "thumbs up." Further, you actually need to utilize one of the unicode normalization forms to perform meaningful operations like comparison or sorting.
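A tiny Python illustration of that split, using a thumbs up with the darkest skin tone modifier. `split_at_code_point` is just an illustrative name; Python's stdlib has no grapheme cluster segmentation, so this only demonstrates the failure mode, not the fix.

```python
import unicodedata

thumbs_up = "\U0001F44D\U0001F3FF"   # thumbs up followed by a skin tone modifier
text = "ok" + thumbs_up


def split_at_code_point(s: str, index: int) -> tuple:
    """Naive split by code point index; unaware of grapheme clusters."""
    return s[:index], s[index:]


left, right = split_at_code_point(text, 3)

print(len(thumbs_up))                        # 2 code points, one visible emoji
print([unicodedata.name(ch) for ch in thumbs_up])
print(left.endswith("\U0001F44D"))           # True: left keeps the bare thumb
print(right == "\U0001F3FF")                 # True: right is an orphaned modifier
```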

This is one thing Rust's string API gets right, allowing you to iterate over a string as UTF-8 bytes in constant time -- and, by walking, your choice of codepoints and (currently unstable, or in the unicode-segmentation crate) grapheme clusters. Even that though is a partial solution. [2]

Definitely a tough problem!

[1] https://mathias.gaunard.com/unicode/doc/html/unicode/introdu...

[2] https://internals.rust-lang.org/t/support-for-grapheme-clust...

Exactly my point. Most modern emojis cannot rely on pure codepoints to be extracted.

Makes sense! Did you end up implementing normalization for your sub-string find, or did you work around it some other way? I couldn't seem to see it when skimming.

You can have a look at: s_is_emoji...

In https://github.com/naver/tamgu/blob/master/include/conversio..., I have implemented a class: agnostring which derives from "std::string".

There are some methods to traverse a UTF8 string:

  begin(): to initialize the traversal
  end() : is true when the string is fully traversed
  next(): which goes to the next character and returns the current character.

  s.begin();
  while (!s.end()) {
    u = s.next();  // u holds the current character
  }

Very cool, thanks!

Funnily enough this "one graphical unit" is rendered as two graphical units in my environment.

Yeah I've occasionally gotten SMS in the form person+gender symbol. Makes me nervous that my emoji will sometimes render like that and convey a different message than intended.

I wonder if it's a specific app or just something in Android not rendering the font correctly?

I also see it as two characters in Firefox. (I also see 'st' ligatures throughout the article, which is pretty unusual.)

I believe that's more of a font support thing than a browser thing. I'm pretty sure I've seen Firefox render e.g. "male shrug" correctly on MacOS but they're two separate symbols with the default sans-serif font on Windows. I can see it rendered correctly on Windows on sites that define the font to be something that actually has a glyph of a shrugging man.

On what platform does Firefox render the emoji as more than one glyph? Firefox works for me on Ubuntu, macOS, Windows 10, and Android.

The level of ligatures requested in CSS makes sense for the site-supplied font. (I need to regenerate the fonts due to the table at the end of the article using characters that the subsets don't have glyphs for.) If you block the site-supplied font from loading, the requested level of ligatures may be excessive for your fallback font.

Firefox on Debian 9 (XFCE) seems to render it as two.

Displayed as one for me. I'm using https://github.com/eosrei/twemoji-color-font

In Firefox on Windows it renders for me as 1.

Discord has the same issue when I type from my phone: it converts a shrug emoji into a shrug + female gender emoji.

When I worked on Amazon Redshift, one bug I fixed was that the coordinator and workers had different behaviors for string length for multi-byte UTF-8 characters.

Swift is not the only language which recognizes it as a single grapheme, Erlang does as well:

   > string:length(" ️").
Indeed it considers it as a single grapheme made of 5 code points:

  > string:to_graphemes(" ️").
EDIT: Pasting from the terminal into the comment box on HN somehow replaced the emoji with a single blank.

HN bans most emoji. That's pretty annoying on a post like this.



> HN bans most emoji. That's pretty annoying on a post like this.

Not just "emoji" either, it bans random codepoints it doesn't consider "textual" enough and rather than tell you they're just removed from the comment when submitting leaving you to find out your comment is completely broken.

It's infuriating.

For completeness sake, it's also possible to get the number of code points in JavaScript:

    Array.from('<emoji here>').length
or equivalently:

    [...'<emoji here>'].length

If I am correct, it is in this one from WWDC[0] that one of the Swift engineers talks about how they implemented the String API for Swift and how it works under the hood. It is fun and interesting to watch. It starts around the 28 minute mark.


In the last five years, I've not encountered a single valid use for character count.

1. If you're using character count to do memory related things, you're introducing bugs. Not every character takes up the same amount of space (see: emoji)

2. If you're using character count to affect layout, you're introducing bugs. Not all characters are the same width. Characters can increase or decrease in size (see: ligatures, diacritics). Any proper UI library will give you a way to measure the size of text (see JS's measureText API)

3. Even static text changes. Unless you never plan to localize your application, pin it to use one specific font (that you're bundling in your app, because not all versions of the same font are made equal), and bringing your own text renderer (because not all rendering engines support all the same features), you're introducing bugs. The one exception is perhaps the terminal, but your support for unicode in the terminal is probably poor anyway.

Even operations like taking a substring are fraught for human-readable strings. Besides worrying about truncating text in the middle of words or in punctuation (which should be a giant code smell to begin with), slicing a string is not "safe" unless you're considering all of the grammars of all of the languages you'll ever possibly deal with. It's unlikely, even if your string library perfectly supported unicode, that you'd correctly take a substring of a human-readable string. It's better to design your application with this in mind.

I have some unicode string truncation code at work. It just mindlessly chops off any codepoints that won't fit in N bytes. No worrying about grammar, combining characters, multi-codepoint-emoji, etc.

This is because the output doesn't have to be perfect, but it does absolutely positively have to have bounded length or various databases start getting real grumpy.
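Not my actual work code, but the idea fits in a few lines of Python. A sketch, assuming `max_bytes` is non-negative; `truncate_utf8` is a hypothetical name. It cuts on a code point boundary only, so it will cheerfully separate a base letter from its combining mark.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Longest prefix of text whose UTF-8 encoding fits in max_bytes.

    Cuts on code point boundaries, not grapheme cluster boundaries.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The input was valid UTF-8, so the only possible damage is a partial
    # sequence at the cut point; errors="ignore" drops exactly that.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("héllo", 2))          # h   (the 2-byte é does not fit)
print(truncate_utf8("héllo", 3))          # hé
print(truncate_utf8("e\u0301", 1))        # e   (the combining accent is chopped off)
```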

If you're chopping a diacritic off, you're changing meaning. If you're chopping an emoji off with a dangling ZWJ, you've potentially got an invalid character. Depending on the language and text, you might be completely changing the meaning of what you're storing.

Your database might be grumpy otherwise, but that doesn't make arbitrary truncation correct. This is an issue with your schema, it doesn't mean truncation is the best solution.

This is what I get on Python 3.6:

  Python 3.6.8 (default, Dec 30 2018, 13:01:27)
  Type 'copyright', 'credits' or 'license' for more information
  IPython 7.5.0 -- An enhanced Interactive Python. Type '?' for help.

  Out[1]: 1

That would be because it's a single codepoint (U+1F926 FACE PALM).

Try out the family or flag emoji (which are composite) and you should get a different result.
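Since HN strips the emoji, here is the same experiment written with escapes so it survives posting. A small sketch; `flag` is a hypothetical helper that maps a two-letter country code onto regional indicator symbols.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER


def flag(country_code: str) -> str:
    """Build a flag sequence from an ASCII two-letter code, e.g. 'DE'."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(base + ord(c) - ord("A")) for c in country_code.upper())


facepalm = "\U0001F926"                                   # a single code point
flag_de = flag("DE")                                      # two regional indicators
family = ZWJ.join(["\U0001F468", "\U0001F469",            # man, woman,
                   "\U0001F467", "\U0001F466"])           # girl, boy

print(len(facepalm), len(flag_de), len(family))           # 1 2 7
```

The family comes out as 7 because it is four emoji plus the three joiners between them.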

I just wanted to say, this is a very very good article worth reading. Clearly, the author has put serious work into researching for and writing this article. A lot of original and deeply technical information is presented in an entertaining and easily understandable manner.

gawd can't we all just stick to ascii. /s(/s(/s(..)))

what someone didn't like my fixed point sarcasm tag?

I get it. I hate self-referential humor too.

tl;dr: Unicode codepoints don't have a 1-to-1 relationship with characters that are actually displayed. This has always been the case (due to zero-width characters, accent modifiers that go after another character, Hangul etc.) but has recently got complicated by the use of ZWJ (zero-width joiner) to make emojis out of combinations of other emojis, modifiers for skin colour, and variation selectors. There is also stuff like flag being made out of two characters, e.g. flag_D + flag_E = German flag.

Your language's length function is probably just returning the number of unicode codepoints in the string. You need a function that computes the number of 'extended grapheme clusters' if you want to get actually displayed characters. And if that function is out of date, it might not handle ZWJ and variation selectors properly, and still give you a value of 2 instead of 1. Make sure your libraries are up to date.

Also, if you are writing a command line tool, you need to use a library to work out how many 'columns' a string will occupy for stuff like word wrapping, truncation etc. Chinese and Japanese characters take up two columns, many characters take up 0 columns, and all the above (emoji crap) can also affect the column count.
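To give a feel for what such a library does, here is a deliberately rough approximation using only Python's `unicodedata`. `display_width` is a made-up name, and this sketch ignores emoji sequences and terminal quirks entirely, so treat it as an illustration rather than something to ship.

```python
import unicodedata


def display_width(text: str) -> int:
    """Very rough estimate of terminal columns occupied by text."""
    width = 0
    for ch in text:
        # Combining marks and format characters (e.g. ZWJ) take no column.
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide and Fullwidth characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


print(display_width("abc"))        # 3
print(display_width("日本語"))      # 6
print(display_width("e\u0301"))    # 1
```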

In short the Unicode standard has gotten pretty confusing and messy!

> Your language's length function is probably just returning the number of unicode codepoints in the string.

The article didn't say that!

"Number of Unicode code points" in a string is ambiguous, because surrogates and astral characters both are code points, so it's ambiguous if a surrogate pair counts as two code points or one. (It unambiguously counts as two UTF-16 code units and as one Unicode scalar value.)

The article presented four kinds of programming language-reported string lengths:

1. Length is number of UTF-8 code units.

2. Length is number of UTF-16 code units.

3. Length is number of UTF-32 code units, which is the same as the number of Unicode scalar values.

4. Length is number of extended grapheme clusters.
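The first three kinds of length can be reproduced from Python for the facepalm string discussed in the article, written here with escapes. `code_unit_lengths` is a hypothetical helper; the fourth kind (extended grapheme clusters) needs a segmentation library and is not available in the standard library, so it is left out.

```python
def code_unit_lengths(s: str) -> dict:
    """Length of s in UTF-8, UTF-16 and UTF-32 code units."""
    return {
        "utf8": len(s.encode("utf-8")),             # 1 byte per code unit
        "utf16": len(s.encode("utf-16-le")) // 2,   # 2 bytes per code unit
        "utf32": len(s.encode("utf-32-le")) // 4,   # 4 bytes per code unit
    }


# Face palm + skin tone modifier + ZWJ + male sign + variation selector-16
facepalm_man = "\U0001F926\U0001F3FC\u200d\u2642\ufe0f"

lengths = code_unit_lengths(facepalm_man)
print(lengths)   # {'utf8': 17, 'utf16': 7, 'utf32': 5}
```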

The article perpetuates the fiction that Chinese characters provide more than phonetic information. Mandarin uses a syllabary, so one character represents what would be two or three phonemes, which about matches the numbers in the table at the end.

Sorry, what? Chinese is emphatically NOT a syllabary. Characters have meaning, two characters that are pronounced the same can have different meanings. The spoken language is syllabary-ish but wide regional dialectic variations shift different characters in different ways, which could not be the case in a true syllabary.

To give you just one little example of how you are not correct:

长久 - cháng jiǔ - long time/long lasting

尝酒 - cháng jiǔ - to taste wine

Japanese uses Kanji (Chinese characters) not for their phonetic value (they’ve got two whole syllabaries for that), but for their meaning.

There are lots of characters that represent the same syllable, and complicated rules about which to use for writing common words, which helps to disambiguate the very numerous homonyms.

The "dialectical variations" are really different languages. Mandarin speakers are taught that only pronunciations vary, but a person transcribing Cantonese to Mandarin is doing translation to a degree comparable to Italian -> French. Politically this fact is not allowed in China, but ask anyone who is bilingual in Cantonese and Mandarin. Prepare to be surprised.

Mandarin and Cantonese are both widespread enough to have dialects of their own. For example, Dalian Mandarin has different tones and some syllables that don't exist in Standard Mandarin: https://en.wikipedia.org/wiki/Dalian_dialect

> There are lots of characters that represent the same syllable,

And also (less common, but existent) characters that represent different syllables in different contexts. A prominent example would be 觉, pronounced jué in 觉得 but jiào in 睡觉

Hanzi are logographs, not a syllabary.

> and complicated rules about which to use for writing common words

Maybe if you have some need to believe in the bizarre fiction that Chinese characters only provide phonetic information. In the real world, the characters have meaning and history, which simply dictates what characters to use for which word.

> Hanzi are logographs, not a syllabary.

This is like saying English is written with logographs, not an alphabet. Kanji are logographs and not a syllabary. Hanzi are closer to being a syllabary than they are to being logographs. The sound is the primary concept.

Obviously, the characters do convey considerably more than just phonetic information. But phonetic information is the first and most important thing they carry.

It is quite remarkable how people continue to believe what their elementary teacher told them, even after years and years' experience to the contrary.

In English, people believe that "Elements of Style" is full of good advice despite everything good they have ever read violating every rule on every page.

The article makes a claim about Han script information density relative to measures of length. It makes no claim about how the Han script achieves it.

If you consider this from the perspective of Huffman coding, it shouldn't be surprising that the script that gets the most code space in Unicode would carry the most information per symbol on average.

The point is that if they really were logographs, and not just a complicated syllabary, you would get a lot more than the equivalent of two or three letters out of them.

Your math is wrong.

The Chinese (Traditional) translation consists of 2202 characters (UTF-32). The English translation consists of 8555 characters. This amounts to ~3.88 English letters per Chinese character, quite a bit more than "two or three".

None of the actual syllabaries get even close to the density of Chinese. Korean, the closest comparison point (a true syllabary) clocks in at 3856 characters - a far cry from 2202.

You are arguing with something that the article did not say. The substring "logo" does not occur in the article. The article makes no claim about what the Han script encodes beyond implying that the Han script isn't alphabetic, which shouldn't be controversial.

> Each of the languages above reports the string length as the number of code units that the string occupies. Python 3 strings have (guaranteed-valid) UTF-32 semantics, so the string occupies 5 code units. In UTF-32, each Unicode scalar value occupies one code unit. JavaScript (and Java) strings have (potentially-invalid) UTF-16 semantics, so the string occupies 7 code units. Rust strings are (guaranteed-valid) UTF-8, so the string occupies 17 code units. It is intentional that the phrasing for the Rust case differs from the phrasing for the Python and JavaScript cases. We’ll come to back to that later.

...And the OP is wrong.

1) python doesn't count byte sizes, or UTFxxx stuff. python counts the number of codepoints. Do you want the byte length? Encode to a byte array and count the bytes.

2) javascript doesn't know about bytes, nor characters, knows only about the fact that "a char is a 16 bit chunk", with a UTF16 encoding. there is no such thing as a "code unit with UTF-16 semantics". Similar for java.

oh, and by the way, there are byte sequences that are invalid if decoded from utf-8, so I'm not sure about the "guaranteed-valid" utf-8 rust strings.. (if you want an encoding that can map each byte sequence to a character, there are some, like Latin1 and so on, but that's a different matter)

> python doesn't count byte sizes, or UTFxxx stuff. python counts the number of codepoints.

UTF32 and codepoints is an identity transformation.

> knows only about the fact that "a char is a 16 bit chunk", with a UTF16 encoding. there is no such thing as a "code unit with UTF-16 semantics". Similar for java.

A UTF-16 code unit is 16 bits. The difference between "UTF16 encoding" and "UTF16 code units" is the latter makes no guarantee that the sequence of code units is actually validly encoded. Which is very much an issue in both Java and Javascript (and most languages which started from UCS2 and back-defined that as UTF-16): both languages expose and allow manipulation of raw code units and allow unpaired surrogates, and thus don't actually use UTF16 strings, however these strings are generally assumed to be and interpreted as UTF16.

Which I expect is what TFA means by "UTF-16 semantics".

> oh, and by the way, there are byte sequences that are invalid if decoded from utf-8, so I'm not sure about the "guaranteed-valid" utf-8 rust strings..

Your comment makes no sense. There are byte sequences which are not valid UTF-8. They are also not valid as part of a Rust string. Creating a non-UTF8 rust string is UB.

> Which I expect is what TFA means by "UTF-16 semantics".

The article says "(potentially-invalid) UTF-16 semantics". The "potentially-invalid" part means that the JavaScript programmer can materialize unpaired surrogates. The "semantics" part means that the JavaScript programmer sees the strings acting as if they were potentially-invalid UTF-16 even when the storage format in RAM is actually Latin1 in SpiderMonkey and V8.

> Creating a non-UTF8 rust string is UB.

So how does rust deal with filenames under Linux? Use something other than strings?

Yep. Rust has a PathBuf[1] type for dealing with paths in a platform-native manner. You can convert it to a Rust string type, but it's a conversion that can fail[2] or may be lossy[3].

[1] https://doc.rust-lang.org/std/path/struct.PathBuf.html

[2] https://doc.rust-lang.org/std/path/struct.PathBuf.html#metho...

[3] https://doc.rust-lang.org/std/path/struct.PathBuf.html#metho...
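The same "can fail" versus "may be lossy" distinction can be sketched in Python for comparison. This is only an analogy using Python's codec error handlers, not the Rust API; `decode_strict` is a hypothetical helper and the filename is made up.

```python
raw_name = b"report-\xff.txt"   # a legal Linux filename that is not valid UTF-8


def decode_strict(raw: bytes):
    """Fallible conversion: returns None when the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


strict = decode_strict(raw_name)

# Lossy conversion: the bad byte becomes U+FFFD and cannot be recovered.
lossy = raw_name.decode("utf-8", errors="replace")

# Lossless conversion: the bad byte is smuggled through as a lone surrogate
# and restored when encoding with the same error handler.
escaped = raw_name.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

print(strict, lossy == "report-\ufffd.txt", restored == raw_name)
```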

Yes, this is what OsStr/OsString[1] are for.

[1] https://doc.rust-lang.org/std/ffi/struct.OsString.html

Yes: https://doc.rust-lang.org/std/ffi/struct.OsString.html

This is also used to deal with filenames on Windows, as they're a different flavour of "not unicode", IIRC that's the original use case for WTF8.

It also has a separate pseudo-string type to deal with C strings: https://doc.rust-lang.org/std/ffi/struct.CString.html
