
UTF-8 String Indexing Strategies - ingve
https://nullprogram.com/blog/2019/05/29/
======
maxdamantus
Is there any real case where code point indexing is useful? It seems like all
these attempts to restrict strings in such a way to accommodate code points is
just introducing complexity with no gain.

UTF-8 was designed to be an encoding (of code points) on top of the "bytes"
abstraction just as Unicode is designed to be an encoding (of human text) on
top of the "code points" abstraction. I think it should be uncontroversial
that there are very good reasons to at least handle arbitrary sequences of
code points (eg, you want to be able to handle input from future versions of
Unicode, and you don't know about the grapheme clustering of those code
points), but I don't see a good reason _not_ to handle arbitrary sequences of
bytes.

The only reason I can see is in ensuring that text is losslessly convertible
to other UTFs, particularly UTF-16 (which exists for historical reasons), but
this just seems like a matter of _when_ the information is lost (is it during
conversion from string to UTF-16, or from bytes to string), not _if_ it is
lost.

As far as I can tell with the Python story, for example, people decided to add
special "Unicode" strings into Python 2, then presumably some code used the
"Unicode" strings and some code used the "byte" strings, so this situation is
obviously undesirable .. then in Python 3 they tried fixing it by replacing
which sort of string was the default. Why would it not have been better to
just improve Unicode support for the existing strings instead of splitting the
type into two and forcing everyone to decide whether their strings are for
"bytes" or for "Unicode"?

~~~
dtech
Programmers are rarely interested in individual bytes, except sometimes if a
string is abused as a byte array. In all other cases iterating or indexing
over characters is the intention, and code points are the proper abstraction
for characters, not bytes.

Also I might be wrong, but you can tell from the bytes alone how many bytes a
UTF-8 character occupies: a lead byte in the range 0 to 127 is a complete
one-byte character, a lead byte starting with two, three, or four 1 bits
begins a sequence of that many bytes, and bytes of the form 10xxxxxx are
continuation bytes, never the start of a character.
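
This rule can be read straight off the lead byte's count of leading 1 bits; a
small sketch (`seq_len` is a hypothetical helper name, not a std function):

```rust
// Hypothetical helper: length of a UTF-8 sequence, read from its lead byte.
fn seq_len(lead: u8) -> Option<usize> {
    match lead.leading_ones() {
        0 => Some(1),                  // 0xxxxxxx: one-byte ASCII character
        n @ 2..=4 => Some(n as usize), // 110…/1110…/11110…: n-byte sequence
        _ => None,                     // 10xxxxxx continuation byte, or invalid
    }
}

fn main() {
    assert_eq!(seq_len(b'a'), Some(1));
    assert_eq!(seq_len("é".as_bytes()[0]), Some(2));
    assert_eq!(seq_len("€".as_bytes()[0]), Some(3));
    assert_eq!(seq_len("😀".as_bytes()[0]), Some(4));
    assert_eq!(seq_len(0x80), None); // continuation byte
}
```

This self-synchronizing property is what lets you find the previous or next
character boundary from any byte position without rescanning the string.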

~~~
mort96
Except that code points aren't the proper abstraction for "characters". Most
people would think of <Family: Woman, Woman, Girl, Boy>[1] as one character,
but it's really five code points; woman, zero width joiner, woman, zero width
joiner, girl, zero width joiner, boy. If you tried doing an operation like
reversing a string or removing the last character, and you treated a unicode
code point as a "character", you would end up with the wrong result. If you
just removed the last code point to implement a backspace, you would end up
with a string which ends in a zero width joiner, which makes little sense; and
when the user wants to insert, say, a girl emoji, that emoji will end up as a
part of the family due to that trailing joiner, when the user expected it to
be a separate emoji.

This applies to more than just emojis by the way; there are languages whose
unicode representation is much more complicated than english or other
languages with latin characters.

[1]: [https://emojipedia.org/family-woman-woman-girl-boy/](https://emojipedia.org/family-woman-woman-girl-boy/)

EDIT: This comment originally used the actual emojis as examples, but hacker
news just replaced every code point in the emoji with a space.
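
A small sketch of the backspace problem, with the emoji spelled out as escapes
(since HN strips them):

```rust
fn main() {
    // Family: woman, ZWJ, woman, ZWJ, girl, ZWJ, boy: seven code points,
    // one user-perceived character.
    let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
    assert_eq!(family.chars().count(), 7);

    // "Backspace" implemented as remove-the-last-code-point leaves the
    // string dangling on a zero width joiner.
    let mut s = family.to_string();
    s.pop();
    assert!(s.ends_with('\u{200D}'));
}
```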

~~~
a1369209993
You don't even need emoji; eg U+63,U+300 (c̀) is one character but two code
points, and U+1C4 (Ǆ) is two characters, but one code point. There's also
U+1F3,U+308 (ǳ̈), which is two characters in two code points, but segments
incorrectly if you split on code points instead of characters.

It's ambiguous how to encode latin-small-a-with-ring-above (U+E5 vs
U+61,U+30A). Decoding is also ambiguous (most infamously Han grapheme
clusters), but I'm not fluent enough in any of the affected languages to have
a ready example.

Also, that's seven code points, not five.
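
These counts are easy to check; a quick sketch:

```rust
fn main() {
    // U+63 U+300 (c̀): one user-perceived character, two code points.
    assert_eq!("c\u{300}".chars().count(), 2);

    // U+1C4 (Ǆ): two characters (D + Ž), one code point.
    assert_eq!("\u{1C4}".chars().count(), 1);

    // å: the precomposed (U+E5) and decomposed (U+61 U+30A) spellings are
    // distinct code point sequences, even though they mean the same thing.
    assert_ne!("\u{E5}", "a\u{30A}");
}
```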

~~~
planteen
Your example is a little confusing from the use of the word "characters". I
think glyphs would be clearer (U+1C4 is two glyphs). Though it might not
actually be 2 glyphs; it's dependent on how the font implements it.

At the end of the day, an OpenType font through substitutions can do far more
crazy things than these "double glyph" examples. I once made a font that took
a 3 letter acronym, substituted this for a hidden empty glyph in the font,
then substituted this single hidden empty glyph into 20 glyphs. You were left
with something like your U+1C4 in software where you could only highlight all
20 glyphs or none of them. And this was happening on text where all input code
points were under 127. People often don't realize how much logic and
complexity can be put into a font or how much the font is responsible for
doing.

~~~
a1369209993
No, "characters" is the correct term; U+1C4 is two characters: latin-capital-d
followed by latin-capital-z-with-caron (or whatever you want to call the
little v thing). As you note, this means that non-buggy fonts will generally
use two glyphs to render it, but that's an implementation detail; a font could
process the entire word "the" as one glyph, or render "m" as dotless-i +
right-half-of-n + right-half-of-n, but that wouldn't affect how many
characters either string has.

~~~
planteen
Using characters is confusing because I don't know if you mean before or after
shaping. U+1C4 is unquestionably a single Unicode code point. I've heard
people call this 1 logical character. Other people might say how many
characters it requires for encoding in UTF-8 or in UTF-16. After shaping, some
people might say it is 1 or 2 "shaped characters". It's all horribly
confusing. I find using the term code point more precise.

~~~
a1369209993
There's no shaping involved; I'm not talking about the implementation details
of the rendering algorithm. There is a D, followed by a Ž. This only seems
confusing because Unicode (and - to be fair - other, earlier character
encodings) willfully misinterprets the term "character" for self-serving
purposes.

------
ChrisSD
> One issue to consider is that strings typically feature random access
> indexing of code points

True but I find it's much rarer to actually need random access to arbitrary
code points. Most of the time I either use strings as opaque "things" or I'm
iterating through characters to find something interesting (e.g. parsing)
where I can build my own index, if and when necessary.

I do agree with the article that if an abstraction is very leaky it's better
to be upfront about that.

~~~
richardwhiuk
I think that accessing characters by index is _probably_ a code smell in most
places, especially if that string may contain arbitrary UTF-8.

~~~
masklinn
Accessing string content by arbitrary indices is probably an error. Accessing
string content by indices you got from previous lookups is useful for a number
of situations.

~~~
scatters
If you have an index from a previous lookup, that can be a byte index.

------
nabla9
Iterating code points is OK, as long as you know that iterating code points
is not the same as iterating grapheme clusters aka user perceived
characters. You get away with it most of the time, but you should know you are
not dealing with full Unicode and have a plan to deal with exceptions. Unicode
normalization does not solve it all.

Unfortunately almost all "Absolute minimum you must know about Unicode"
articles don't cover the absolute minimum you have to know about Unicode.

Arbitrary well formed UTF-8 combined with advanced string algorithms and data
structures where the unit is 'char' requires more than code points.

~~~
eridius
Iterating grapheme clusters is OK, as long as you know that iterating grapheme
clusters is not the same as iterating unicode scalars (code points) aka the
fundamental unit of textual parsing grammars.

This is something that really bugs me about how Swift changed its mind and
made the String type a Collection of Characters (i.e. grapheme clusters).
Originally they recognized this issue and required you to write
`str.characters` to work with the grapheme clusters as a collection (and
String itself wasn't a collection at all), but then in Swift 3 (I think) they
changed course and said String is a collection after all. And the problem is
now people work with Characters without even thinking about it when they
really should be working with unicode scalars.

In my personal experience, I only ever actually want to work with grapheme
clusters when I'm doing something relating to user text editing (for example,
if the user hits delete with an empty selection, I want to delete the last
grapheme cluster). Most of my string manipulation wants to operate on scalars
instead.

~~~
raphlinus
The rules for what you want to do on backspace are complex - you want to
delete the grapheme cluster if it's an emoji or ideograph with variation
selector, but if it's a combining mark, most of the time you want to just
delete that. One place this is written down is [1].

Of course, this might sound like a nitpick but only confirms the actual point
you were making, that treating text as a sequence of grapheme clusters is
often but not always the right way to view the problem.

If you're talking about cursor motion when hitting an arrow key, then yeah,
grapheme cluster.

[1]: [https://github.com/xi-editor/xi-editor/blob/master/rust/core...](https://github.com/xi-editor/xi-editor/blob/master/rust/core-lib/src/backspace.rs)

~~~
eridius
macOS and iOS delete the entire grapheme cluster on backspace, not just the
combining mark (which is to say, backspace with no selection is identical to
shift-left to select the previous character and then hitting backspace).

~~~
svat
Not sure which scripts you intended your comment to cover, but this is not true
in general. If I type anything like किमपि (“kimapi”) and hit backspace, it turns
into किमप (“kimapa”). That is, the following sequence of codepoints:

    0915 DEVANAGARI LETTER KA
    093F DEVANAGARI VOWEL SIGN I
    092E DEVANAGARI LETTER MA
    092A DEVANAGARI LETTER PA
    093F DEVANAGARI VOWEL SIGN I

made of three grapheme clusters (containing 2, 1, and 2 codepoints
respectively), turns after a single backspace into the following sequence:

    0915 DEVANAGARI LETTER KA
    093F DEVANAGARI VOWEL SIGN I
    092E DEVANAGARI LETTER MA
    092A DEVANAGARI LETTER PA

This is what I expect/find intuitive, too, as a user. Similarly अन्यच्च is
made of 3 grapheme clusters but you hit backspace 7 times to delete it (though
there I'd slightly have preferred अन्यच्च→अन्यच्→अन्य→अन्→अ instead of
अन्यच्च→अन्यच्→अन्यच→अन्य→अन्→अन→अ that's seen, but one can live with this).
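
In code point terms, the backspace behaviour described above is exactly
remove-the-last-code-point; a sketch:

```rust
fn main() {
    // किमपि: five code points in three grapheme clusters (कि, म, पि).
    let s = "\u{915}\u{93F}\u{92E}\u{92A}\u{93F}";
    assert_eq!(s.chars().count(), 5);

    // One backspace removes only the final vowel sign (the last code
    // point), not the whole final cluster, leaving किमप.
    let mut t = s.to_string();
    t.pop();
    assert_eq!(t, "\u{915}\u{93F}\u{92E}\u{92A}");
}
```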

~~~
eridius
Looks like you're right. I don't have experience with languages like this one.
I was thinking more of things like é (e followed by U+301), or 🇦🇧 (which is
two regional indicator symbols that don't map to any current flag), or a
snippet of Z̛̺͉̤̭͈̙A̧̦͉̗̩̞͙LG͈͎͍̺̖̹̘O̵̫ which has tons of combining marks but
each cluster is still deleted with a single backspace.

~~~
raphlinus
Interesting. The rules seem to be different on different systems. Deleting two
RIS symbols (whether they map to a flag or not) seems right in any case. Some
other systems (Android included) will take the accents off separately when
they are decomposed (but not for precomposed accented characters). Also note
macOS takes just the accent off for Arabic (tested on U+062F U+064D).

------
bakery2k
The article does not mention Python, other than to reference CPython's
"Flexible String Representation". However, it's interesting that alternative
Python implementations have decided against that model and indeed use UTF-8
strings internally.

MicroPython saves memory by simply making indexing into its strings O(n) [1],
while PyPy's UTF-8 strings have "an optional extra index data structure to
make indexing O(1)" [2].

For compatibility, of course, Python implementations have to provide indexing
of code points - it would be interesting to examine the pros & cons of the
different string representations. I wonder if new high-level languages would
be better off using one of these representations, or taking the Go/Julia
approach of only indexing bytes.

[1]
[https://github.com/micropython/micropython/blob/a4f1d82757b8...](https://github.com/micropython/micropython/blob/a4f1d82757b8e95c21a095c99b7c3f04ded88104/py/objstrunicode.c#L156)

[2]
[https://twitter.com/pypyproject/status/1095971192513708032](https://twitter.com/pypyproject/status/1095971192513708032)
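
The extra index PyPy describes can be sketched roughly like this (an assumed
design, not PyPy's actual code): record the byte offset of every k-th code
point, so indexing a code point costs at most k steps of scanning instead of a
scan from the start of the string.

```rust
// Sketch of a code point index over UTF-8 (assumed design, not PyPy's code).
struct Utf8Index {
    stride: usize,
    offsets: Vec<usize>, // byte offset of every `stride`-th code point
}

impl Utf8Index {
    fn new(s: &str, stride: usize) -> Self {
        let offsets = s.char_indices().map(|(b, _)| b).step_by(stride).collect();
        Utf8Index { stride, offsets }
    }

    // Byte offset of the i-th code point: jump to the nearest recorded
    // offset, then scan forward at most `stride - 1` code points.
    fn byte_of(&self, s: &str, i: usize) -> usize {
        let start = self.offsets[i / self.stride];
        let (rel, _) = s[start..].char_indices().nth(i % self.stride).unwrap();
        start + rel
    }
}

fn main() {
    let s = "héllo wörld";
    let idx = Utf8Index::new(s, 4);
    // The index agrees with a full linear scan at every position.
    for (i, (b, _)) in s.char_indices().enumerate() {
        assert_eq!(idx.byte_of(s, i), b);
    }
}
```

The memory/speed trade-off is the stride: stride 1 is a full O(1) index, a
large stride degrades toward MicroPython's O(n) scan.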

~~~
raiph
"इंडेक्स" का क्या अर्थ है? [1]

Including the quote marks, spaces, and question mark, that's 18 characters.
This isn't just about text editing, far from it. For a lot of string
processing, indexing into the underlying codepoints is even less interesting
than indexing into the underlying bytes.

[1]
[https://translate.google.com/#view=home&op=translate&sl=hi&t...](https://translate.google.com/#view=home&op=translate&sl=hi&tl=en&text=%22%E0%A4%87%E0%A4%82%E0%A4%A1%E0%A5%87%E0%A4%95%E0%A5%8D%E0%A4%B8%22%20%E0%A4%95%E0%A4%BE%20%E0%A4%95%E0%A5%8D%E0%A4%AF%E0%A4%BE%20%E0%A4%85%E0%A4%B0%E0%A5%8D%E0%A4%A5%20%E0%A4%B9%E0%A5%88%3F)
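
For the curious, counting that string by code points (what most languages'
length functions report) gives neither 18 nor 15; a sketch with the string
spelled out as escapes:

```rust
fn main() {
    // "इंडेक्स" का क्या अर्थ है? spelled out as code point escapes.
    let s = "\"\u{907}\u{902}\u{921}\u{947}\u{915}\u{94D}\u{938}\" \
             \u{915}\u{93E} \u{915}\u{94D}\u{92F}\u{93E} \
             \u{905}\u{930}\u{94D}\u{925} \u{939}\u{948}?";

    // 26 code points and 64 UTF-8 bytes; the 18 (or 15) user-perceived
    // characters discussed in this thread are a count the standard
    // library can't produce at all.
    assert_eq!(s.chars().count(), 26);
    assert_eq!(s.len(), 64);
}
```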

~~~
onboardram
I am not a linguist, but as a native speaker, shouldn't they be considered 15
characters? क्स, क्या and र्थ each form individual conjunct consonants.
Counting them as two would then beget the question as to why डे is not
considered two characters too, seeing as it is formed by combining ड and ए,
much like क्स is formed by combining क् and स.

~~~
raiph
If you say they should be considered 15 characters then software and devs
should support optionally indexing and counting them as 15 characters. This is
the most important point.

And, as a corollary, software devs should aspire to have and know about string
functions in software that recognize that the text string I used is 15
characters long in contexts where that's the right way to view it.
Furthermore, those functions should asap be as easily available for use as
they are today for recognizing that the text 'What does "index" mean?' is 23
characters long.

This notion of software and devs properly indexing and counting characters was
the ultimate point of my comment, as I will elaborate below. I hope that you
will reply to confirm you understand the gist of what follows; that would make
my day and leave this exchange on HN to hopefully shine light where it's
sorely needed. :)

\----

The OP title is "UTF-8 String Indexing Strategies". I could write that this
begs the question: what does "index" mean? _Unfortunately it seems it still
doesn't beg the question -- in 2019 -- for most western devs._

Last century devs generally assumed the index unit was bytes. So they created
programming languages whose string type assumed indexing in bytes and
functions and libraries that did the same. Nowadays they're starting to assume
"codepoints", which is an equally broken assumption. (Codepoints are a Unicode
notion and they're great for what they're great for. But being "characters"
is, in the general case, something they're terrible for.)

Both these western devs and the OP are effectively ignoring the possibility
that "इंडेक्स" का क्या अर्थ है? could be considered to be 15 characters (or
18). They're ignoring you, the half of the planet that are in a similar boat,
and the whole of the planet that's coming together, sharing text like we are
here.

\----

bakery2k demonstrated the problem. They wrote:

> MicroPython ... indexing ... O(n) ... PyPy's ... O(1)

Neither of these deals with indexing _characters_, as one might expect based
on an ordinary human's understanding of the word "characters". Instead they're
myopically focused on indexing bytes and codepoints.

This goes hand-in-hand with Python's length function returning 26 for the text
"इंडेक्स" का क्या अर्थ है?. It's counting codepoints, not characters, which is
close to useless for that text.[1]

But you wouldn't have any clue about that from bakery2k's comment and it looks
like bakery2k has no awareness of this:

> I wonder if new high-level languages would be better off using one of these
> [byte and codepoint] representations, or taking the Go/Julia approach of
> only indexing bytes.

Imo that's shockingly retrogressive given the lack of discussion of
characters.

\----

Chances are good that if you try to select the text I wrote one character at a
time you will find that you can cursor across 18 units.

Why/how does software do this? It relies on part of the Unicode standard for
indexing that builds on the concept of "what a user thinks of as a
character".[2]

This mechanism allows the string to be indexed/counted as N characters, where
N varies according to the definition of "character". Software is supposed to
choose the definition with appropriate adherence to the Unicode standard,
which includes customizing it as necessary to produce practical results. And,
as I noted, most good modern software dealing with cursoring/editing text gets
it right per the Unicode standard.

My guess is that the Unicode standard by default has software consider क्स to
be 2 characters because the consonant is composed of क् and स placed visually
side-by-side, whereas it has डे considered 1 character because it's composed
of ड and ए somehow overlapping visually. (That's a pure guess. Please let me
know if it sounds crazy. :))

For some other use cases, like a native speaker just reading text abstractly,
what's a character changes. You say the text I wrote is 15 characters;
therefore software should be able to index and count it as 15 characters.

I hope that all makes sense. Thank you for your comment, reading my reply, and
TIA for any reply. :)

[1]
[https://tio.run/##K6gsycjPM/7/v6AoM69EIyc1T0Nd6cGS9gdLmh4sWf...](https://tio.run/##K6gsycjPM/7/v6AoM69EIyc1T0Nd6cGS9gdLmh4sWfhgKZAx9cHS3gdLdigpgJhL9inARNZDOK0PlmwA85cCOTsfLO2wV9fU/P8fAA)

[2]
[https://unicode.org/glossary/#grapheme](https://unicode.org/glossary/#grapheme)

~~~
onboardram
Sorry for the late reply, I don't use HN much. No idea if you'll actually
notice this, does HN even have a "Reply Notification" feature?

Regarding what you wrote, I agree pretty much. As I said, I am not an expert
in this field, so I am not aware of the most cutting-edge stuff out there. But
even the few languages I know and have seen are so different from each other
(some more than others) that it seems unlikely that a single "theory of
everything" would suffice for text, especially in the way we process text
presently.

Perhaps there is some way to abstract out the differences, but I don't really
see how. After all, characters are where the differences only begin. Start
thinking about words or sentences and no single route seems viable for the way
we do string processing today.

You probably expected a more substantial comment, but I don't really know
enough of this field to make one.

Regarding क्स and डे, the difference between them is that the former is a
combination of two consonants (pronounced "ks") while the latter is formed by
a consonant and a vowel ("de"). However, looking at the visual representation
is wrong, since डा (consonant+vowel) would also look like two characters. If
you copy these into a text field and try to erase them through backspace or
delete, you should see how it all works (assuming the text field functions
correctly).

But again, these confusions only exist because Devanagari allows simple
characters to form compound characters. That is obviously completely different
than how Roman script works, which is probably completely different than
various pictographic languages. So, how to reconcile the differences (except
by hiring native speakers of every language out there)? I wish I knew, but
currently I don't.

------
masklinn
It's sad and odd that Rust and (probably especially) Swift are missing from
the article.

~~~
chrisseaton
Why are there interesting technical differences in the way those languages do
things compared to the other examples given?

The author obviously can't cover _all_ languages and strategies in a short
article can they?

~~~
saagarjha
Yes: Swift groups by grapheme clusters, and Rust makes it difficult to do byte
indexing.

~~~
masklinn
> Rust makes it difficult to do byte indexing.

Not sure how. If you want to get a specific byte, just convert to a bytes
slice (that's free) and index that. And you can slice strings (using byte-
indexed indices), but your boundaries have to fall on codepoint boundaries.
The only thing that's difficult is getting a codepoint at a specific index
(byte or otherwise).

~~~
afiori
> Not sure how. If you want to get a specific byte, just convert to a slice
> (that's free) and index that.

But then it is not automatic to cast that slice as a string.

~~~
burntsushi
If you want a single byte and `s` is a `&str`, then `s.as_bytes()[i]` returns
a `u8` in `s` at index `i`. If the index `i` is out of bounds, then it panics,
but no other UTF-8 checking is performed.

You do not need to do this if you're slicing. For example, if you know that
`i..j` indexes a valid UTF-8 subslice of `s`, then `&s[i..j]` returns a
subslice of `s` with type `&str`.

The only reason to subslice `s.as_bytes()` is if you want the raw bytes which
may or may not be valid UTF-8. And in this case, it is a _good thing_ that it
is not automatic to convert that back to a `&str` since it may not be valid
UTF-8.
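
The description above can be checked directly; a quick sketch:

```rust
fn main() {
    let s = "héllo";

    // Single-byte access: bounds-checked, but no UTF-8 validity checking.
    assert_eq!(s.as_bytes()[1], 0xC3); // first byte of the two-byte 'é'

    // Slicing a &str by byte range works when the boundaries fall on
    // code point boundaries; 'é' occupies bytes 1..3.
    assert_eq!(&s[0..3], "hé");

    // &s[0..2] would panic at runtime: byte 2 falls inside 'é'.
}
```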

~~~
afiori
> it is a good thing that it is not automatic to convert that back to a `&str`
> since it may not be valid UTF-8.

My comment was unclear in meaning, but the aim was to point out exactly this.

------
Someone
_”So, Emacs pretends it has constant time access into its UTF-8 text data, but
it’s only faking it with some simple optimizations. This usually works out
just fine.”_

Usually, except when you’re writing in, and searching for Chinese, Greek,
Hindi, Korean, Russian, Turkish, etc, text, like 50+% of the world’s
population? It seems Emacs is made for programmers, who predominantly type and
search for ASCII text.

~~~
munchbunny
That sounds about right. You don't pick up vim for a shopping list. You pick
it up because you're a programmer.

~~~
anoncake
Some people use Emacs just for Org. That's a lot closer to a shopping list
than to programming. And programmers sometimes write text in natural language.

~~~
munchbunny
I would guess that the parent comment's point is still true: Emacs (and Vim)
are far more commonly used for programming and other work, probably ASCII
heavy, than for natural language text editing.

I'd be willing to bet that for both Emacs and Vim, 90%+ characters by volume
are ASCII. I wouldn't make a similar bet for Microsoft Word.

------
cjohansson
Thanks for the article. I'm curious what the author thinks of string indexing
in Rust. It's also explicit, so I guess you would like it as well.

~~~
the_mitsuhiko
From my personal experience I think Rust's string system is hard to beat at
the moment. It's pretty darn good from a usability point of view, and it also
found a nice solution for working with UCS-2 Windows APIs by providing an
OsStr type.

~~~
chrismorgan
I’m glad that Rust strings aren’t indexable by integer, but I think that
making them indexable by range (of UTF-8 code unit offsets) was an error.
`foo[0..10]` should have been `foo.slice(0..10)` or similar instead.

~~~
the_mitsuhiko
It’s a bit of a footgun indeed but it’s quite handy in combination with the
char index iterator.

~~~
chrismorgan
Sure, you do want to be able to index by code unit range, but it shouldn’t
have been with the Index trait.

------
jasonhansel
Could you store the string in multibyte form, and then keep a skip list (or
other data structure) to get indexing in O(log n)?

------
cryptonector
The xi rope science blog post series is, I think, the definitive answer to
UTF-8 "indexing".

------
tracker1
Isn't it just best to do NFKC (or similar) normalization on input for
indexing?

------
rwmj
It's a good article, but it would be nice if he'd also covered the Python 3 C
API anti-pattern: forcing strings to be UTF-8. This means that you have latent
bugs in your code. Notably when trying to treat all filenames as strings,
sooner or later your code will explode when it meets someone's filesystem
which has ancient Latin1 filenames. Also when dealing with unfiltered user
input.

(And to head off replies - yes I understand you can "just do X" where "X" is
some complicated thing to avoid the bug if you remember to do "X" beforehand)
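
The Latin-1 filename case is concrete: in Latin-1, é is the single byte 0xE9,
which can never appear on its own in well-formed UTF-8. A sketch of the same
failure mode:

```rust
fn main() {
    // "café" as a Latin-1 byte string: the lone 0xE9 is not valid UTF-8.
    let latin1_name: &[u8] = b"caf\xE9";

    // Strict UTF-8 decoding rejects it, so any code path that assumes
    // filenames decode cleanly hits this as a latent bug.
    assert!(std::str::from_utf8(latin1_name).is_err());
}
```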

~~~
eadmund
I don't really consider that an anti-pattern. Sure, you can get away with just
blitting them to the terminal and hoping that they display properly, but
sooner or later you're going to have to decode such byte strings anyway.

The real anti-pattern is conflating byte strings and character strings in the
first place. We got away with it for decades, but in a UTF-8 world it just
isn't possible.

~~~
skybrian
I don't see why not? Go has a string type that contains arbitrary bytes,
interpreted as UTF-8. This seems to work as well as anything else. If there
are non-codepoints and it matters, you just have to deal with it (for example
by printing an escape sequence or Unicode replacement character).

[https://blog.golang.org/strings](https://blog.golang.org/strings)
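
Go's range loop substitutes U+FFFD for invalid bytes; the closest analogue in
Rust's standard library is lossy decoding, sketched here:

```rust
fn main() {
    // 0xFF can never appear in well-formed UTF-8.
    let bytes: &[u8] = b"ok\xFFok";

    // Lossy decoding maps the bad byte to U+FFFD, the replacement
    // character, much like ranging over a Go string does.
    let s = String::from_utf8_lossy(bytes);
    assert_eq!(s, "ok\u{FFFD}ok");
}
```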

~~~
anoncake
Because arbitrary bytes cannot be interpreted as UTF-8. I guess this kind of
thing is tolerated by Go users because anyone who values a proper type system
uses a language with generics.

~~~
skybrian
How do you fix a file that has errors in it if the standard library of the
language you're using won't even let you read it?

~~~
eadmund
If you're fixing bytes then you load bytes and fix them.

You won't, though, fix bytes by loading characters and then trying … to fix
the bytes … the characters encode to. Just doesn't make sense.

We were able to get away with stuff for a long time because bytes were
characters and characters were bytes and we could think sloppily and not break
anything. But with Unicode they really are different things, and we need to be
tidier in our thinking.

~~~
skybrian
Seems like you're just reasserting it doesn't make sense, without giving a
reason. But it does make sense in Go.

~~~
eadmund
> But it does make sense in Go.

No, Go doesn't work that way. You asked, 'How do you fix a file that has
errors in it if the standard library of the language you're using won't even
let you read it?' In Go, you don't read file as strings, but rather as bytes
(proof: [https://golang.org/pkg/os/#Open](https://golang.org/pkg/os/#Open),
which returns a File which implements Read:
[https://golang.org/pkg/os/#File.Read](https://golang.org/pkg/os/#File.Read)).

You would do the same thing in Python: open the file in binary mode, and the
iterate over the bytes it yields.

Now, the one thing that _would_ be annoying in Go is fixing a broken filename.
I'd have to think a bit to figure that out.

~~~
skybrian
You can cast between byte arrays and strings in Go. The difference is that
strings are immutable (so it does a copy).

~~~
eadmund
> You can cast between byte arrays and strings in Go.

Yes, you can. But, in the specific case you mentioned, no competent programmer
would cast the bytes of an invalidly-encoded file to a string, then iterate
through the runes of the string. That wouldn't even begin to make sense!

I really don't understand what you're trying to argue here.

~~~
skybrian
Although it only works for smallish files, that seems fairly useful for
getting as much info as you can out of a corrupt but mostly UTF-8 file?

Any runes that aren't valid will come back as the replacement character. And
you can count newlines and print the location of the error(s). You also have
the index of the error.

