
String Lengths in Unicode - panic
https://hsivonen.fi/string-length/
======
tialaramex
The closest to insight in this post is a quote from somebody else. “String
length is about as meaningful a measurement as string height”

Under that quote, three rationales are offered.

1\. "to allocate memory". Fine: like an integer, your very low-level code
needs to actually store strings in RAM, and in this sense both integers and
strings have a size. I don't think I'd name this size "length", though, and
certainly if length(5) is an error in your language it seems as though
length("word") should also be an error by this definition.

2\. "quota limit". Just pick anything; there is definitely no reason to name
the value you're using for this arbitrary quota "length". If you have a
_reason_ for the quota, then you need to measure the reason, e.g. bytes on
disk.

3\. "display space". This one is a length! But it's in pixels (or
millimeters, or some other unit of distance) and it requires a LOT of
additional parameters to calculate. And you'll notice that at last there is
actually a string height too for this case, which likewise requires many
additional parameters to calculate.

Treat code that cares about string "length" as a code smell. If it's in your
low-level fundamentals (whether in the memory allocator or the text renderer),
it might, like a lot of other things that smell down there, have a good reason
to be there; make sure it's seriously unit-tested and clearly documented as to
why it's necessary. If it's in your "brilliant" new social media web site
backend, it's almost certainly wrong, and figuring out what you ought to be
doing instead of asking for the "length" of a string will probably fix a bug
you may not know you have.

~~~
wongarsu
Interestingly, in many use cases all these "lengths" agree. For example, an
empty string is 0 in all these measures, and fuzzy metrics like "unreasonably
long password" work equally well with "500 normalized characters", "500
bytes", "1000px at font size 8", etc.

The problems only arise when talking about "a string of length 20", which
indeed has a strong smell.

~~~
coldtea
> _The problems only arise when talking about "a string of length 20", which
> indeed has a strong smell_

It has no smell at all. We interface with all kinds of systems with limited
display space, external forms, POS displays, databases with varchar columns,
and so on all the time.

Whether certain strings should have a specific length is a business decision,
not a code smell...

~~~
wongarsu
Limited display space is a case for "string length in px", which is
notoriously hard to calculate and has poor library support. Just because 20
"x" fit doesn't mean 20 "w" will fit. Fixed-width fonts are an exception, but
they have problems with Chinese.

Databases with varchar columns exist, but varchar(20) sounds generally suspect
unless it's a hash or something else that's fundamentally limited in length.

~~~
coldtea
> _limited display space is a case for "string length in px", which is
> notoriously hard to calculate and has poor library support._

It is notoriously easy when the display is an LED display, a banking terminal,
a form-based monospaced POS, something that goes to a printed-out receipt
(like an airline check-in or a luggage tag), a product/cargo label maker, or
any of the tons of other systems billions depend upon every day, where one
visible glyph = 1 length unit and type design doesn't come into much play...

~~~
yorwba
It's only easy if the system forbids everything that would make calculating
visible length hard, which I think constitutes _extremely_ poor library
support. I want to see the monospaced system that can correctly print
Mongolian: ᠮᠣᠩᠭᠣᠯ ᠪᠢᠴᠢᠭ. If properly implemented, it should join the characters
and display them vertically. But your browser is probably showing them
horizontally right now, because support for vertical writing is seriously
broken:
[https://en.wikipedia.org/wiki/Mongolian_script#Font_issues](https://en.wikipedia.org/wiki/Mongolian_script#Font_issues)

------
theobeers
I think it’s good that Rust sort of forces you to recognize the complexity in
this. If you try s.len(), it’ll be in terms of UTF-8 bytes, which might be
what you want… or might be far from it. If you switch to s.chars().count(),
you’ll get Unicode scalar values, which may be closer to the mark. And if you
need proper segmentation by “user-perceived character,” as set out in Unicode
Annex 29, you’ll have to bring in an external crate. That’s fair enough.
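A minimal sketch of those three levels (the grapheme count assumes the
third-party unicode-segmentation crate, since the standard library doesn't do
segmentation):

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "e\u{0301}"; // "é" as U+0065 plus U+0301 COMBINING ACUTE ACCENT
        assert_eq!(s.len(), 3);                   // UTF-8 code units (bytes)
        assert_eq!(s.chars().count(), 2);         // Unicode scalar values
        assert_eq!(s.graphemes(true).count(), 1); // extended grapheme clusters
    }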

Keep in mind that even Annex 29 is, on some level, just a starting point. Its
segmentation rules don’t work “out of the box” for me with Persian and Arabic
text. I’m not totally on-board with its treatment of the zero-width non-joiner
(U+200C), and it doesn’t deal with manual tatweel/kashida (U+0640) at all. So
you make the necessary adjustments for your use case. The rabbit hole is deep.

~~~
masklinn
> I think it’s good that Rust sort of forces you to recognize the complexity
> in this.

It's not bad, but Swift still does better:

* String.Index is an opaque type (ideally it would be linked to a specific string too)

* it returns grapheme cluster counts by default, which is probably the least surprising / least useless choice, though not necessarily super useful either

* other lengths go through explicit lazy views e.g. string.utf8.count will provide the UTF8 code units count, string.unicodeScalars.count will provide the USVs count

> If you switch to s.chars().count(), you’ll get UTF-8 scalar values

You get _unicode_ scalar values (which I assume is what you meant).

~~~
cryptonector
TFA makes a very good case that Swift got it wrong.

Perhaps string length should be more hidden though, or parametrized with a
unit to count. Perhaps it should be

    
    
      let nu = s._smelly_length(UTF8CodeUnits);
      let nc = s._smelly_length(ExtendedGraphemeClusters);
    

Stop using "zero length" to denote the empty string; just have an is-empty
method.

~~~
favorited
Not sure how you read "Swift’s approach to string length isn’t unambiguously
the best one" and interpret it as "Swift got it wrong."

The author's substantive criticism of Swift's String/Character/etc. types
seems to be complications arising from the dependency on ICU (such as not
necessarily being able to persist indices across icu4c versions).

------
zubspace
I seriously applaud the writer for diving into Unicode in such detail and
comparing multiple implementations in different languages. That must have
taken a while!

He works for Mozilla, and I guess he actually needs to know all those
nitty-gritty details. I could not imagine even mustering the patience to
analyze it all.

In some way I am really scared of Unicode. I don't care if the programming
language simply allows input and output of Unicode in text fields and
configuration files. But where it gets tough is if you actually need to know
the rendered size of a string, or to convert encodings when the output format
requires it. There are so many places where stuff can go awry, and there's
only a small passage in the blog post that mentions fonts. Dragons abound,
like font encoding, kerning and whatnot.

Imagine writing a game. Everything works nice and dandy with your ASCII
format, and now your boss approaches you and wants to distribute the game for
the Asian market! Oh, dear... My nightmares are made of this!

How do you handle this? Do you use a programming language which does
everything you need? (Which one?) Do you use some special libraries? What
about font rendering? Any recommendations?

~~~
babuskov
> Imagine writing a game. Everything works nice and dandy with your ASCII
> format and now your boss approaches you and wants to distribute the game for
> the Asian market!

You just make sure all translated text is in UTF-8, and use Google Noto fonts
for those languages. All game engines I know render UTF-8 text without
problems if you supply a font that has the needed glyphs.

Source: I'm an indie game developer and have recently localized my game to
Chinese. The game is a mix of RPG and roguelike, so it has a lot of text (over
10,000 words). I used SDL_TTF to render text; precisely, the
TTF_RenderUTF8_Blended() function. The only issue I had is with
multiline/wrapped text. SDL_TTF doesn't break lines on Chinese punctuation
characters (.,;:!?) so I would search+replace strings at runtime to add a
regular space character after those.
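For illustration, the workaround amounts to something like this hypothetical
helper (sketched here in Rust with an abbreviated punctuation set; the game
itself presumably does the same in C):

    // Insert an ASCII space after full-width CJK punctuation so that a
    // renderer that only breaks lines on ASCII whitespace gets a break
    // opportunity after each clause.
    fn add_break_opportunities(text: &str) -> String {
        const CJK_PUNCT: &[char] = &['。', '，', '；', '：', '！', '？', '）'];
        let mut out = String::with_capacity(text.len() + 8);
        for ch in text.chars() {
            out.push(ch);
            if CJK_PUNCT.contains(&ch) {
                out.push(' ');
            }
        }
        out
    }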

~~~
thaumasiotes
> SDL_TTF doesn't break lines on Chinese punctuation characters (.,;:!?)

Those aren't Chinese punctuation characters. Chinese punctuation characters
are full-width, including the spacing that should follow them (or in the case
of "(", precede) within the glyph itself: （。，；：！？）. (You may also notice that
the period is radically different.) Chinese text should almost never include
space characters.

Chinese applications seem happy to break lines anywhere including in the
middle of a word, but punctuation seems like an especially _good_ place for a
line break, so I'm confused why SDL_TTF would go out of its way to avoid
breaking there.

~~~
babuskov
> Those aren't Chinese punctuation characters.

I know, I meant the actual ones you wrote above.

> I'm confused why SDL_TTF would go out of its way to avoid breaking there.

SDL_TTF doesn't break at all. If you have a long Chinese text which uses
proper punctuation characters, it would never break, because it only breaks on
ASCII whitespace.

I wanted to avoid breaking lines in the middle of a word, so I added extra
"regular" space characters to force breaking the line.

~~~
buntsai
You don't really need to break only on punctuation. There is no convention to
do so, and as long as you do not break any logograms in half, the resulting
text reads perfectly fine. In fact, the convention is to have left- and
right-justified text with equal numbers of monospaced logograms, including
punctuation, on each line (or the equivalent for vertical text). Classical
Chinese before the 20th century was seldom punctuated.

~~~
babuskov
I wasn't aware of this. Thanks.

------
cmancini
English-native developers building apps for mainly Western languages can
easily introduce encoding and i18n bugs that are really unfair to other folks.
The rise of emoji in everyday text has been great for forcing developers to
deal with the upper end of the Unicode spectrum and to make fewer assumptions
about inputs. Often in a data processing app I'll throw a few emoji into my
unit tests.

~~~
andreareina
Are you familiar with the big list of naughty strings?

[https://github.com/minimaxir/big-list-of-naughty-strings/](https://github.com/minimaxir/big-list-of-naughty-strings/)

~~~
cmancini
I have seen it before, but that's a good reminder in this case!

------
k_sze
Shameless plug: a few years ago I wrote a little Python library that aims to
"sanitize" filenames in a cross-platform, cross-filesystem manner:
[https://github.com/ksze/filename-sanitizer](https://github.com/ksze/filename-sanitizer)

By "sanitize", I mean it _should_ take as input any string for a filename, and
clean it up so it looks the same and can be safely stored on any
platform/filesystem. This would allow people to, for instance, download or
exchange files over the Internet and know deterministically how the filename
shall look on the receiving side, instead of relying on the receiving side to
clean up the filename.

The length of Unicode strings was definitely one of the pain points. Back then
I only knew about NFC vs NFD.

Now that I have read this article, I realise that my algorithm to determine
where to truncate a filename is probably still wrong, and that I need to dig
both much deeper and much wider.

If anybody wants to help dig some serious rabbit holes, you're most welcome to
fork and make PRs.

------
CaliforniaKarl
As far as I can tell, it looks like progress on getting this into the Python
stdlib has stalled:
[https://bugs.python.org/issue30717](https://bugs.python.org/issue30717)

In the absence of support in the stdlib, is
[https://github.com/alvinlindstam/grapheme](https://github.com/alvinlindstam/grapheme)
the best to use?

~~~
sciyoshi
I would highly recommend getting to know
[https://github.com/ovalhub/pyicu](https://github.com/ovalhub/pyicu), which in
addition to counting grapheme clusters with BreakIterator can also do things
like normalisation, transliteration, and anything else you might need to do
relating to Unicode.

------
olliej
There’s a lot of commentary in this post about storage size, but outside of
serialisation (for file storage or network reasons) the actual byte count is
typically not what you want.

What you want in your UI is the number of glyphs that will be presented to the
user. Which means it is correct for it to change across OS or library
versions, as new emoji will result in a different number of rendered glyphs.

There are very few places in program UI that want anything at all other than
the actual glyph count. Emoji aren’t even the first example of this: there are
numerous “characters” that are made up of letters+accents and are not single
code units/points/whatever (I can never recall), for which grapheme count is
what you want.

The only time you care about the underlying codepoints is really if you
yourself are rendering the characters, or you’re implementing complex text
entry (input managers for many non-English characters). The latter of which
you would pretty much never want to do yourself - I say this having done
exactly that, and making it work correctly (especially on windows) was an
utter nightmare.

Then once you get past code point/units to the level of bytes your developer
set divides neatly in two:

* people who think the world is fixed # of bytes per character

* people who know to use the API function to get the number of bytes

But for any actual use of a string in a UI you want the number of glyphs that
will actually be displayed.

My assumption is that that logic is why Swift gives you the count it does.

~~~
hsivonen
The article acknowledges that the Swift design makes sense for the UI domain
when the Unicode data actually stays up-to-date: "It’s easy to believe that
the Swift approach nudges programmers to write more extended grapheme cluster-
correct code and that the design makes sense to a language meant primarily for
UI programming on a largely evergreen platform (iOS)."

~~~
olliej
The problem I had was your framing it as bad that the “size” of a string
changes as library/OS versions change, whereas I believe that is desirable
behavior.

My point is generally that the only measurements a developer should ever care
about are the visible glyph count and the byte/storage size.

------
peteretep
> But I Want the Length to Be 1!

There’s a language for that. Perl 6:
[https://docs.perl6.org/routine/chars](https://docs.perl6.org/routine/chars)

~~~
labster
And of course, if you want the other length forms, use .codes or
.encode('UTF-8').bytes. But internally to Rakudo, an emoji is really just one
code point, so most of the common string ops are O(1). There's a bit of an
optimization if all of the code points fit into ASCII, but otherwise we use
synthetic code points to represent all of the composed characters.

This is probably the biggest mystery to me of the Python 3 migration. If they
were going to break backcompat, why on Earth didn't they fix Unicode handling
all the way? They didn't have to go completely crazy with new syntax like Perl
6 did, but most languages shift too much of the burden of handling Unicode
correctly onto the programmer.

~~~
zerocrates
With Unicode being a moving target, I'm not sure any language will truly "fix
it all the way": building things like grapheme-cluster breaking/counting into
the language just means the language drifts in and out of "correctness" as the
rules or just the definitions of new or existing characters change. Of course,
this is covered in the article, but when you "clean up" everything such that
the language hides the complexity away, you can still have people bitten (say,
by not realizing a system/library/language update might suddenly change the
"length" of a stored string somewhere). Or you could simply have issues
because developers aren't totally familiar with what the language considers a
"character," as there's essentially no agreement whatsoever across languages
on that front (Perl 6 itself lists the grapheme-cluster-based counting as a
potential "trap" _and_ notes that the behavior differs when running on the
JVM). I don't think a "get out of jail free" card for Unicode handling is
really possible.

The codepoint-based string representation used by Python 3 may be "the worst"
(I'm not totally sure I agree) but it's fine. The article's main beef is about
the somewhat complex nature of the internal storage and the obfuscation of the
underlying lengths.

------
pjtr
"Python 3’s approach is unambiguously the worst one"

No rationale for this seems to be included in the article.

~~~
hsivonen
It's in the section with the heading "Which Unicode Encoding Form Should a
Programming Language Choose?":

"Reacting to surrogate pairs by wishing to use UTF-32 instead is a bad idea,
because if you want to write correct software, you still need to deal with
variable-width extended grapheme clusters.

The choice of UTF-32 arises from wanting the wrong thing."

~~~
roelschroeven
But Python's strings are not UTF-32; they are sequences of Unicode _code
points_, not _code units_ in some encoding. I don't remember how they're
stored internally; that's an implementation detail not relevant to the
programmer who uses Python.

Whether the use of Unicode code points instead of some Unicode encoding is a
good thing or not, that I don't know.

~~~
hsivonen
> But Python's strings are not UTF-32

The article says "Python 3 strings have (guaranteed-valid) UTF-32 semantics"
and later argues that the fact that there's a distinction between the
semantics and actual storage is a data point against UTF-32.

> they are sequences of Unicode code points, not code units in some encoding

They are sequences of _scalar values_ (all scalar values are code points but
surrogate code points are not scalar values). Exposing the scalar value length
and exposing indexability by scalar value index is the same as "(guaranteed-
valid) UTF-32 semantics".

Note that e.g. Rust strings are conceptually sequences of scalar values, and
you can iterate over them as such, but they don't provide indexing by scalar
value or expose the scalar value length without iteration.
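For example (a small sketch; char in Rust is a Unicode scalar value):

    fn main() {
        let s = "e\u{0301}"; // one grapheme cluster, two scalar values, three bytes
        // Iteration yields scalar values along with their byte offsets:
        for (byte_index, scalar) in s.char_indices() {
            println!("byte {}: {:?}", byte_index, scalar);
        }
        // There is no s[i] and no O(1) scalar count; nth() walks the string:
        assert_eq!(s.chars().nth(1), Some('\u{0301}'));
    }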

JavaScript strings, on the other hand, are conceptually sequences of code
points.

> I don't remember how they're stored internally

The article says how they are stored...

~~~
hsivonen
> They are sequences of _scalar values_ (all scalar values are code points but
> surrogate code points are not scalar values). Exposing the scalar value
> length and exposing indexability by scalar value index is the same as
> "(guaranteed-valid) UTF-32 semantics".

Sorry. I'm shocked that I tested wrong when researching the article. Python 3
indeed has code point semantics and not scalar value semantics. I've added a
note to the article that I've edited in corrections accordingly.

Python 3 is even more messed up than I thought!

------
clauderoux
We went through a lot of pain to get this right in Tamgu
([https://github.com/naver/tamgu](https://github.com/naver/tamgu)). In
particular, emojis can be encoded across 5 or 6 Unicode characters. A "black
thumbs up" is encoded with 2 Unicode characters: the thumb glyph and its
color.

This comes at a cost. Every time you extract a sub-string from a string, you
have to scan it first for its codepoints, then convert character positions
into byte positions. One way to speed things up a bit is to check whether the
string is pure ASCII (see
[https://lemire.me/blog/2018/05/16/validating-utf-8-strings-u...](https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/))
and apply regular operators then.

We implemented many techniques based on "intrinsics" instructions to speed up
conversions and search in order to avoid scanning for codepoints.

See
[https://github.com/naver/tamgu/blob/master/src/conversion.cx...](https://github.com/naver/tamgu/blob/master/src/conversion.cxx)
for more information.

~~~
arcticbull
It's not sufficient to use code points, right? Some characters, for instance
your example of the black thumbs up emoji, are grapheme clusters [1] composed
of multiple code points. I think you have to iterate in grapheme clusters and
convert that back to an offset in the original underlying encoding.

If you just rely on code points you risk splitting up a grapheme cluster into
(in your example) two graphemes, one in each sub-string, the left representing
"black" and the right representing "thumbs up." Further, you actually need to
utilize one of the unicode normalization forms to perform meaningful
operations like comparison or sorting.

This is one thing Rust's string API gets right, allowing you to iterate over a
string as UTF-8 bytes in constant time -- and, by walking, your choice of
codepoints and (currently unstable, or in the unicode-segmentation crate)
grapheme clusters. Even that though is a partial solution. [2]

Definitely a tough problem!

[1]
[https://mathias.gaunard.com/unicode/doc/html/unicode/introdu...](https://mathias.gaunard.com/unicode/doc/html/unicode/introduction_to_unicode.html)

[2] [https://internals.rust-lang.org/t/support-for-grapheme-clust...](https://internals.rust-lang.org/t/support-for-grapheme-clusters-in-std/7339/4)
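A minimal sketch of that approach in Rust, assuming the unicode-segmentation
crate: walk the grapheme clusters together with their byte offsets, and only
ever slice at those offsets:

    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        // 'a', then thumbs-up + dark skin tone (one cluster, two scalar
        // values), then 'b'.
        let s = "a\u{1F44D}\u{1F3FF}b";
        for (byte_offset, grapheme) in s.grapheme_indices(true) {
            println!("{}: {:?}", byte_offset, grapheme);
        }
        // Take the first two grapheme clusters without splitting the emoji:
        let end = s.grapheme_indices(true).nth(2).map_or(s.len(), |(i, _)| i);
        assert_eq!(&s[..end], "a\u{1F44D}\u{1F3FF}");
    }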

~~~
clauderoux
Exactly my point. Most modern emojis cannot be extracted by relying on pure
codepoints.

~~~
arcticbull
Makes sense! Did you end up implementing normalization for your sub-string
find, or did you work around it some other way? I couldn't seem to see it when
skimming.

~~~
clauderoux
You can have a look at s_is_emoji...

~~~
clauderoux
In
[https://github.com/naver/tamgu/blob/master/include/conversio...](https://github.com/naver/tamgu/blob/master/include/conversion.h),
I have implemented a class, _agnostring_, which derives from "std::string".

There are some methods to traverse a UTF-8 string:

    begin(): initializes the traversal
    end():   true when the string is fully traversed
    next():  advances to the next character and returns the current one

    s.begin();
    while (!s.end()) {
      u = s.next();
    }

~~~
arcticbull
Very cool, thanks!

------
caf
Funnily enough this "one graphical unit" is rendered as two graphical units in
my environment.

~~~
akersten
Yeah I've occasionally gotten SMS in the form person+gender symbol. Makes me
nervous that my emoji will sometimes render like that and convey a different
message than intended.

I wonder if it's a specific app or just something in Android not rendering the
font correctly?

~~~
salutonmundo
I also see it as two characters in Firefox. (I also see 'st' ligatures
throughout the article, which is pretty unusual.)

~~~
hsivonen
On what platform does Firefox render the emoji as more than one glyph? Firefox
works for me on Ubuntu, macOS, Windows 10, and Android.

The level of ligatures requested in CSS makes sense for the site-supplied
font. (I need to regenerate the fonts due to the table at the end of the
article using characters that the subsets don't have glyphs for.) If you block
the site-supplied font from loading, the requested level of ligatures may be
excessive for your fallback font.

~~~
caf
Firefox on Debian 9 (XFCE) seems to render it as two.

------
dehrmann
When I worked on Amazon Redshift, one bug I fixed was that the coordinator and
workers had different behaviors for string length for multi-byte UTF-8
characters.

------
rdtsc
Swift is not the only language that recognizes it as a single grapheme; Erlang
does as well:

    
    
       > string:length(" ️").
       1
    

Indeed it considers it as a single grapheme made of 5 code points:

    
    
      > string:to_graphemes(" ️").
      [[129318,127996,8205,9794,65039]]
    

EDIT: Pasting from the terminal into the comment box on HN somehow replaced
the emoji with a single blank.

~~~
Thorrez
HN bans most emoji. That's pretty annoying on a post like this.

[https://news.ycombinator.com/item?id=17508962](https://news.ycombinator.com/item?id=17508962)

[https://news.ycombinator.com/item?id=19482991](https://news.ycombinator.com/item?id=19482991)

~~~
masklinn
> HN bans most emoji. That's pretty annoying on a post like this.

Not just "emoji" either: it bans random codepoints it doesn't consider
"textual" enough, and rather than telling you, they're silently removed from
the comment on submission, leaving you to find out later that your comment is
completely broken.

It's infuriating.

------
exogen
For completeness' sake, it's also possible to get the number of code points in
JavaScript:

    
    
        Array.from('<emoji here>').length
    

or equivalently:

    
    
        [...'<emoji here>'].length

------
Sjenk
If I am correct, it is in this WWDC talk [0] that one of the Swift engineers
talks about how they implemented the String API for Swift and how it works
under the hood. It is fun and interesting to watch. It starts around the
28-minute mark.

[0] [https://developer.apple.com/videos/play/wwdc2017/402/](https://developer.apple.com/videos/play/wwdc2017/402/)

------
bastawhiz
In the last five years, I've not encountered a single valid use for character
count.

1\. If you're using character count to do memory-related things, you're
introducing bugs. Not every character takes up the same amount of space (see:
emoji).

2\. If you're using character count to affect layout, you're introducing bugs.
Not all characters are the same width, and characters can increase or decrease
in size (see: ligatures, diacritics). Any proper UI library will give you a
way to measure the size of text (see JS's measureText API).

3\. Even static text changes. Unless you never plan to localize your
application, pin it to one specific font (which you bundle in your app,
because not all versions of the same font are made equal), and bring your own
text renderer (because not all rendering engines support all the same
features), you're introducing bugs. The one exception is perhaps the terminal,
but your support for Unicode in the terminal is probably poor anyway.

Even operations like taking a substring are fraught for human-readable
strings. Besides worrying about truncating text in the middle of words or in
punctuation (which should be a giant code smell to begin with), slicing a
string is not "safe" unless you're considering all of the grammars of all of
the languages you'll ever possibly deal with. It's unlikely, even if your
string library perfectly supported Unicode, that you'd correctly take a
substring of a human-readable string. It's better to design your application
with this in mind.

~~~
missblit
I have some unicode string truncation code at work. It just mindlessly chops
off any codepoints that won't fit in N bytes. No worrying about grammar,
combining characters, multi-codepoint-emoji, etc.

This is because the output doesn't have to be perfect, but it _does_
absolutely positively have to have bounded length or various databases start
getting real grumpy.
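A minimal sketch of that kind of truncation in Rust: bounded byte length, code
points kept whole, grapheme clusters deliberately ignored:

    /// Truncate to at most `max_bytes` bytes without splitting a code point.
    fn truncate_to_bytes(s: &str, max_bytes: usize) -> &str {
        if s.len() <= max_bytes {
            return s;
        }
        // Walk back from the limit until we land on a UTF-8 char boundary.
        let mut end = max_bytes;
        while !s.is_char_boundary(end) {
            end -= 1;
        }
        &s[..end]
    }

Note that this can still strip a combining mark or half of a ZWJ sequence,
which is exactly the trade-off described above.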

~~~
bastawhiz
If you're chopping a diacritic off, you're changing meaning. If you're
chopping an emoji off with a dangling ZWJ, you've potentially got an invalid
character. Depending on the language and text, you might be completely
changing the meaning of what you're storing.

Your database might be grumpy otherwise, but that doesn't make arbitrary
truncation correct. This is an issue with your schema; it doesn't mean
truncation is the best solution.

------
m-arnold
This is what I get on Python 3.6:

    
    
      Python 3.6.8 (default, Dec 30 2018, 13:01:27)
      Type 'copyright', 'credits' or 'license' for more information
      IPython 7.5.0 -- An enhanced Interactive Python. Type '?' for help.
    
      In [1]: len("FACEPALM_EMOJI_DELETED_BY_HN")
      Out[1]: 1

~~~
masklinn
That would be because it's a single codepoint (U+1F926 FACE PALM).

Try out the family or flag emoji (which are composite) and you should get a
different result.

------
Kenji
I just wanted to say: this is a very, very good article, worth reading.
Clearly, the author has put serious work into researching and writing it. A
lot of original and deeply technical information is presented in an
entertaining and easily understandable manner.

------
jonnycomputer
gawd can't we all just stick to ascii. /s(/s(/s(..)))

~~~
jonnycomputer
what, someone didn't like my fixed-point sarcasm tag?

~~~
jonnycomputer
I get it. I hate self-referential humor too.

------
HeraldEmbar
tl;dr: Unicode codepoints don't have a 1-to-1 relationship with characters
that are actually displayed. This has always been the case (due to zero-width
characters, accent modifiers that go after another character, Hangul, etc.)
but has recently been complicated by the use of ZWJ (zero-width joiner) to
make emojis out of combinations of other emojis, modifiers for skin colour,
and variation selectors. There is also stuff like flags being made out of two
characters, e.g. flag_D + flag_E = German flag.

Your language's length function is probably just returning the number of
unicode codepoints in the string. You need a function that computes the number
of 'extended grapheme clusters' if you want the actually displayed characters.
And if that function is out of date, it might not handle ZWJ and variation
selectors properly, and still give you a value of 2 instead of 1. Make sure
your libraries are up to date.

Also, if you are writing a command line tool, you need to use a library to
work out how many 'columns' a string will occupy for stuff like word wrapping,
truncation etc. Chinese and Japanese characters take up two columns, many
characters take up 0 columns, and all the above (emoji crap) can also affect
the column count.
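For example, wcwidth-style libraries expose this; a minimal sketch in Rust,
assuming the unicode-width crate:

    use unicode_width::UnicodeWidthStr;

    fn main() {
        assert_eq!("abc".width(), 3);        // ASCII: one column per character
        assert_eq!("日本語".width(), 6);     // CJK: two columns per character
        assert_eq!("e\u{0301}".width(), 1);  // combining accent adds no columns
    }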

In short the Unicode standard has gotten pretty confusing and messy!

~~~
hsivonen
> Your language's length function is probably just returning the number of
> unicode codepoints in the string.

The article didn't say that!

"Number of Unicode code points" in a string is ambiguous, because surrogates
and astral characters both are code points, so it's ambiguous if a surrogate
pair counts as two code points or one. (It unambiguously counts as two UTF-16
code units and as one Unicode scalar value.)

The article presented four kinds of programming language-reported string
lengths:

1\. Length is the number of UTF-8 code units.

2\. Length is the number of UTF-16 code units.

3\. Length is the number of UTF-32 code units, which is the same as the number
of Unicode scalar values.

4\. Length is the number of extended grapheme clusters.

------
ncmncm
The article perpetuates the fiction that Chinese characters provide more than
phonetic information. Mandarin uses a syllabary, so one character represents
what would be two or three phonemes, which about matches the numbers in the
table at the end.

~~~
nneonneo
Sorry, what? Chinese is emphatically NOT a syllabary. Characters have meaning,
two characters that are pronounced the same can have different meanings. The
spoken language is syllabary-ish but wide regional dialectic variations shift
different characters in different ways, which could not be the case in a true
syllabary.

To give you just one little example of how you are not correct:

长久 - cháng jiǔ - long time/long lasting

尝酒 - cháng jiǔ - to taste wine

Japanese uses Kanji (Chinese characters) not for their phonetic value (they’ve
got two whole syllabaries for that), but for their meaning.

~~~
ncmncm
There are lots of characters that represent the same syllable, and complicated
rules about which to use for writing common words, which helps to disambiguate
the very numerous homonyms.
The "dialectical variations" are really different languages. Mandarin speakers
are taught that only pronunciations vary, but a person transcribing Cantonese
to Mandarin is doing translation to a degree comparable to Italian -> French.
Politically this fact is not allowed in China, but ask anyone who is bilingual
in Cantonese and Mandarin. Prepare to be surprised.

~~~
brazzy
> There are lots of characters that represent the same syllable,

And also (less common, but existent) characters that represent different
syllables in different contexts. A prominent example would be 觉, pronounced
jué in 觉得 but jiào in 睡觉.

Hanzi are logographs, not a syllabary.

> and complicated rules about which to use for writing common words

Maybe if you have some need to believe in the bizarre fiction that Chinese
characters only provide phonetic information. In the real world, the
characters have meaning and history, which simply dictates what characters to
use for which word.

~~~
thaumasiotes
> Hanzi are logographs, not a syllabary.

This is like saying English is written with logographs, not an alphabet. Kanji
are logographs and not a syllabary. Hanzi are closer to being a syllabary than
they are to being logographs. The sound is the primary concept.

Obviously, the characters do convey considerably more than just phonetic
information. But phonetic information is the first and most important thing
they carry.

~~~
ncmncm
It is quite remarkable how people continue to believe what their elementary
teacher told them, even after years and years' experience to the contrary.

In English, people believe that "Elements of Style" is full of good advice
despite everything good they have ever read violating every rule on every
page.

------
_ZeD_
> Each of the languages above reports the string length as the number of code
> units that the string occupies. Python 3 strings have (guaranteed-valid)
> UTF-32 semantics, so the string occupies 5 code units. In UTF-32, each
> Unicode scalar value occupies one code unit. JavaScript (and Java) strings
> have (potentially-invalid) UTF-16 semantics, so the string occupies 7 code
> units. Rust strings are (guaranteed-valid) UTF-8, so the string occupies 17
> code units. It is intentional that the phrasing for the Rust case differs
> from the phrasing for the Python and JavaScript cases. We’ll come back to
> that later.

...And the OP is wrong.

1) Python doesn't count byte sizes, or UTF-xxx stuff. Python counts the number
of codepoints. Do you want the byte length? Encode to a byte array and count
the bytes.

2) JavaScript doesn't know about bytes, nor characters; it knows only that "a
char is a 16-bit chunk", with a UTF-16 encoding. There is no such thing as a
"code unit with UTF-16 semantics". Similar for Java.

Oh, and by the way, there are byte sequences that are invalid if decoded from
UTF-8, so I'm not sure about the "guaranteed-valid" UTF-8 Rust strings... (If
you want an encoding that can map each byte sequence to a character, there are
some, like Latin-1 and so on, but that's a different matter.)

~~~
masklinn
> Python doesn't count byte sizes, or UTF-xxx stuff. Python counts the number
> of codepoints.

Converting between UTF-32 code units and codepoints is an identity
transformation; counting one counts the other.

> it knows only that "a char is a 16-bit chunk", with a UTF-16 encoding. There
> is no such thing as a "code unit with UTF-16 semantics". Similar for Java.

A UTF-16 code unit is 16 bits. The difference between "UTF-16 encoding" and
"UTF-16 code units" is that the latter makes no guarantee that the sequence of
code units is actually validly encoded. Which is very much an issue in both
Java and JavaScript (and most languages which started from UCS-2 and
back-defined that as UTF-16): both languages expose and allow manipulation of
raw code units and allow unpaired surrogates, and thus don't actually use
UTF-16 strings; however, these strings are generally assumed to be, and
interpreted as, UTF-16.

Which I expect is what TFA means by "UTF-16 semantics".
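A small Rust sketch of that distinction, using std::char::decode_utf16 to
surface unpaired surrogates hiding in a sequence of raw 16-bit code units:

    fn main() {
        // A valid surrogate pair (U+1F600) followed by a lone surrogate.
        let units: [u16; 3] = [0xD83D, 0xDE00, 0xD800];
        for result in std::char::decode_utf16(units.iter().copied()) {
            match result {
                Ok(c) => println!("scalar value: {:?}", c),
                Err(e) => println!("unpaired surrogate: 0x{:04X}",
                                   e.unpaired_surrogate()),
            }
        }
    }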

> Oh, and by the way, there are byte sequences that are invalid if decoded
> from UTF-8, so I'm not sure about the "guaranteed-valid" UTF-8 Rust
> strings...

Your comment makes no sense. There are byte sequences which are not valid
UTF-8. They are also not valid as part of a Rust string. Creating a non-UTF8
rust string is UB.

~~~
account42
> Creating a non-UTF8 rust string is UB.

So how does Rust deal with filenames under Linux? Use something other than
strings?

~~~
loonyphoenix
Yep. Rust has a PathBuf[1] type for dealing with paths in a platform-native
manner. You can convert it to a Rust string type, but it's a conversion that
can fail[2] or may be lossy[3].

[1] [https://doc.rust-lang.org/std/path/struct.PathBuf.html](https://doc.rust-lang.org/std/path/struct.PathBuf.html)

[2] [https://doc.rust-lang.org/std/path/struct.PathBuf.html#metho...](https://doc.rust-lang.org/std/path/struct.PathBuf.html#method.to_str)

[3] [https://doc.rust-lang.org/std/path/struct.PathBuf.html#metho...](https://doc.rust-lang.org/std/path/struct.PathBuf.html#method.to_string_lossy)
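A small sketch of what that looks like in practice (the path here is
hypothetical; on Linux the underlying OsStr is arbitrary bytes rather than
guaranteed UTF-8):

    use std::path::PathBuf;

    fn main() {
        let path = PathBuf::from("/tmp/example.txt");
        // to_str() returns None if the path is not valid UTF-8:
        match path.to_str() {
            Some(s) => println!("valid UTF-8 path: {}", s),
            None => println!("not representable as a Rust string"),
        }
        // to_string_lossy() always succeeds, substituting U+FFFD as needed:
        println!("lossy: {}", path.to_string_lossy());
    }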

