
What Is the “Length” of a String? - gk1
https://blog.etleap.com/2019/10/03/what-is-the-length-of-a-string/
======
mjevans
IIRC from some other article: most JavaScript engines use the UTF-16 Unicode
encoding for strings, and it looks like the .length property reports the
memory footprint in terms of 16-bit code units. A single grapheme (display
'character', pictograph, or fixed-size space) may be composed of one or more
code points.

'length' is subjective. It makes sense to want to know how many storage units
are required.

Knowing the number of printed 'graphemes' (do we count non-printing
'characters'?) might also be useful.
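
For illustration, here's a rough sketch of those three different counts in
JavaScript (assuming an engine new enough to have Intl.Segmenter):

    // U+1F469 WOMAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL
    const s = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}"; // one family emoji
    // UTF-16 code units -- what .length reports
    console.log(s.length);                    // 8
    // Unicode code points
    console.log([...s].length);               // 5
    // graphemes, via Intl.Segmenter
    const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
    console.log([...seg.segment(s)].length);  // 1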

The display length (at a given scaling size, taking into consideration font
kerning/etc.) can also be useful.

That's where the linked article really goes off the rails. Characters don't
have printing widths outside of extremely specific circumstances. You have to
ask the layout engine in use what dimensions are occupied after it solves that
very complex problem.

------
udp
It really doesn't matter. A string is a binary blob until you need to parse it
or display it. If you need to parse a string and it's UTF-8, you can pretend
it's an ASCII string, because the delimiter characters you need for parsing
are probably ASCII characters such as {, [, or ". UTF-8 encodes ASCII
characters unchanged, and every byte of a multi-byte sequence has the high bit
set, so an ASCII delimiter never appears inside one and you can use all of the
standard C library functions. If you need to display a string, you're already
using a font library which, in addition to providing the logic to iterate
through the string character by character, can work out kerning etc.
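
A rough sketch of that property, in JavaScript terms for consistency with the
rest of the thread (TextEncoder gives you the UTF-8 bytes): scanning byte by
byte for an ASCII delimiter can never match inside a multi-byte character.

    // Every byte of a multi-byte UTF-8 sequence is >= 0x80, so the ASCII
    // quote byte 0x22 below can only ever be a real '"' character.
    const bytes = new TextEncoder().encode('{"name":"José"}');
    const quotePositions = [];
    for (let i = 0; i < bytes.length; i++) {
      if (bytes[i] === 0x22) quotePositions.push(i);
    }
    console.log(quotePositions); // offsets of the four '"' bytes, none inside "é"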

I never understood the rationale for using encodings such as UTF-16. They seem
to be the worst of both worlds: strings for which ASCII would be adequate take
2x the space, and the encoding is still multi-byte. I once worked with a
Windows developer who swore blind that UTF-16 was not a multi-byte encoding.
When I provided evidence to the contrary, they responded something along the
lines of "ok, but who would ever need more than 16 bits worth of characters?".
¯\_(ツ)_/¯
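
For the record, the surrogate pairs are easy to see from a JS console, since
JS strings are UTF-16:

    const s = "\u{1D11E}";                      // MUSICAL SYMBOL G CLEF, outside the BMP
    console.log(s.length);                      // 2 -- two 16-bit code units
    console.log(s.charCodeAt(0).toString(16));  // "d834" (high surrogate)
    console.log(s.charCodeAt(1).toString(16));  // "dd1e" (low surrogate)
    console.log(s.codePointAt(0).toString(16)); // "1d11e" (the actual code point)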

~~~
mehrdadn
The (or at least a) rationale for UTF-16 is that it's more space-efficient for
languages whose characters sit at higher code points but still within the BMP,
like East Asian languages.
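
For example (a rough sketch; BMP CJK characters take 2 bytes in UTF-16 but 3
in UTF-8):

    const s = "日本語";                              // three CJK characters, all in the BMP
    console.log(s.length * 2);                       // 6 bytes as UTF-16
    console.log(new TextEncoder().encode(s).length); // 9 bytes as UTF-8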

~~~
jandrese
IIRC the real rationale is that back in the early days people bet that 16 bits
would be enough for a fixed-length encoding (UCS-2), but the bet didn't pay
off and now they're stuck with the worst of both worlds.

~~~
lokedhs
Your explanation is correct. UTF-16 is a hack on top of a hack, and no one
uses it in new software. However, Java and JavaScript are stuck with it for
legacy reasons.

It wouldn't be so bad if people just understood that there are (almost) no
valid use cases for measuring the size of a string. As someone else
mentioned, it's a binary blob for all intents and purposes, and if processing
of its content needs to be done (such as displaying it on the screen, or
performing, say, word-wrap) then it should be handed over to libraries that
have been designed for this purpose, because these things are very
complicated.

Simply because I have a single non-ASCII character in my name, I see software
fail at this on a regular basis.

------
fireattack
I prefer this article: https://hsivonen.fi/string-length/ (HN discussion:
https://news.ycombinator.com/item?id=20914184)

------
maxdamantus
> Instead of rendering all the strings in each column, we can split the
> strings into their corresponding graphemes and render them individually.
> This allows us to cache the pixel length of each grapheme we encounter.

I wonder if they're aware of the `measureText` method used in canvas:
https://developer.mozilla.org/en-US/docs/Web/API/CanvasRenderingContext2D/measureText

Seems a lot simpler than trying to split strings up by graphemes, and probably
more reliable. Pairs of letters such as "fi" are rendered as ligatures (single
glyphs) in some fonts, and Unicode no longer standardises them (since the
combinations are completely arbitrary). Also, grapheme clustering is going to
change according to Unicode version, and not everyone's browser is going to be
using the same Unicode version.
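
Something like this minimal sketch is what I have in mind (assuming a browser
environment; the font string here is an arbitrary example and would need to
match whatever the table actually renders with):

    const ctx = document.createElement("canvas").getContext("2d");
    ctx.font = "14px sans-serif";        // must match the rendered table's font
    const { width } = ctx.measureText("office");
    console.log(width); // reflects whatever ligatures/kerning the font applies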

If you want to know the "string length", you basically want one of two things:
the number of code units needed to store it (in languages that handle strings
sensibly this is just the number of bytes, since UTF-8 should always be used
except for historical reasons), or to pass it to a font rendering library and
have it tell you something about pixels.

------
kevincox
Of course measuring each character also doesn't work because of kerning,
ligatures and probably other rendering features.

------
wintorez
There are two data types that the more you think about them, the weirder they
become: String & Date

