Why is capitalizing the first letter of a string so convoluted in Rust?

Someone · on June 9, 2021

There’s also Dutch, where ‘ij’ is sort of a single letter, so capitalizing “ijs” yields “IJs” (https://en.wikipedia.org/wiki/IJ_(digraph)#Capitalisation).

Unicode has separate code points for those, but their use is discouraged.

emergie · on June 10, 2021

I had the same thought. Uppercasing the first "character" is the wrong way to achieve titlecase.

The 'dz' digraph can be expressed in a few ways:

  dz - \u0064\u007a, 2 basic latin block codepoints
  DZ - \u0044\u005a
  Dz - \u0044\u007a
  
  ǳ - \u01f3, lowercase, single codepoint
  Ǳ - \u01f1, uppercase
  ǲ - \u01f2, TITLECASE!

Uppercasing the first character works only if the original text used 'dz' in its decomposed, two-character form.

A similar thing happens with transliteration. You cannot just transliterate Þ -> TH, because then transliterating something like Þorlákshöfn would yield THorlakshofn.
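To make the 'dz' case concrete, here is a minimal Rust sketch using only the standard library. The helper name `upper` is made up for illustration. As far as I know, stable Rust's `char` offers uppercase and lowercase mappings but no titlecase mapping, so the precomposed ǳ comes out as Ǳ rather than ǲ.

```rust
/// Hypothetical helper: the full uppercase mapping of one `char`, as a String.
/// `to_uppercase` returns an iterator because one char can map to several.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // Basic Latin 'd' maps to 'D', so decomposed "dz" becomes "Dz" as hoped.
    assert_eq!(upper('d'), "D");
    // Precomposed U+01F3 maps to U+01F1 (all caps), not to the
    // titlecase code point U+01F2.
    assert_eq!(upper('\u{01f3}'), "\u{01f1}");
    assert_ne!(upper('\u{01f3}'), "\u{01f2}");
}
```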

inshadows · on June 9, 2021

I know nothing about Rust, and couldn't grok most of what was said in the post, but I'm curious. Is this all just to please the type system, and will it result in efficient code as if I did the following in C (for some implementation of to_uppercase())?

    c[0] = to_uppercase(c[0]);

steveklabnik · on June 9, 2021

Doing that in C is not Unicode-aware, whereas Rust's built-in string type (&str) and its main standard library string type (String) are UTF-8.

You could write the same thing as the C version if you had a string type that was encoded differently, say, in ASCII.
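For illustration, a rough Rust equivalent of the C one-liner when the data is a plain byte buffer assumed to hold ASCII. The helper name `upper_first_ascii` is hypothetical.

```rust
/// Hypothetical helper: uppercase the first byte in place, assuming ASCII data.
fn upper_first_ascii(bytes: &mut [u8]) {
    if let Some(first) = bytes.first_mut() {
        // Leaves any byte outside b'a'..=b'z' unchanged.
        first.make_ascii_uppercase();
    }
}

fn main() {
    // Copy the literal into a mutable [u8; 3] on the stack.
    let mut c = *b"cat";
    upper_first_ascii(&mut c);
    assert_eq!(&c, b"Cat");
}
```

No allocation and no UTF-8 bookkeeping are involved here, because `[u8]` makes no promise about encoding.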

scoutt · on June 10, 2021

But leaving aside the Unicode reason, wouldn't the answer be the following?

&str is a slice of something that is const (it can be in flash, for example). Like:

    const char* str = "cat";
    str[0] = 'C'; // <--- wrong

So, because of this, an extra allocation and copy is required.

(Read the above answer as coming from an embedded C developer learning Rust, which I am.)

steveklabnik · on June 10, 2021

Sure, there are multiple reasons. str[0] doesn't exist either; that would fail regardless of the mutability issue. But you're also right that the &str is immutable, so that would fail too.
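A minimal sketch of the allocate-and-copy route for a `&str` input. The function name `capitalized_copy` is invented for this example, and this is one of several reasonable ways to write it.

```rust
/// Hypothetical helper: build a new String with the first char uppercased.
fn capitalized_copy(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => {
            // The one allocation; uppercase may be longer than `s`, so this
            // capacity is only a starting guess.
            let mut out = String::with_capacity(s.len());
            out.extend(first.to_uppercase());
            // Copy the remainder of the input unchanged.
            out.push_str(chars.as_str());
            out
        }
        None => String::new(),
    }
}

fn main() {
    let s: &str = "cat"; // a literal: immutable, may live in read-only memory
    // let first = s[0]; // does not compile: `str` cannot be indexed by an integer
    let owned = capitalized_copy(s);
    assert_eq!(owned, "Cat");
    assert_eq!(s, "cat"); // the original is untouched
}
```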

ibraheemdev · on June 9, 2021

Yeah, when working with ASCII, I just use `Vec<u8>`/`&[u8]`. There's also the `ascii` crate, which encodes ASCII into the type system (enums), but I find it easier to work with raw bytes. The equivalent code in Rust would be:

    let mut bytes = c.as_bytes();
    bytes[0].make_ascii_uppercase();

Arnavion · on June 9, 2021

Your code will not compile. `str::as_bytes` returns a `&[u8]` and indexing that gives a `&u8`, but `u8::make_ascii_uppercase` requires `&mut u8`. There is no safe way to get a `&mut [u8]` from a `&mut str` or a `&mut String` (because it would violate str's UTF-8 invariant).

You don't need to downgrade to `[u8]` in the first place. `&mut str` already has its own `make_ascii_uppercase`.
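A minimal sketch of that in-place approach, assuming the caller has a `String` or `&mut str`. The helper name `upper_first_in_place` is hypothetical; it uses `str::get_mut` so a multi-byte first character is skipped rather than causing a panic.

```rust
/// Hypothetical helper: ASCII-uppercase the first byte of a `&mut str` in place.
/// Returns false if the string is empty or starts with a multi-byte character.
fn upper_first_in_place(s: &mut str) -> bool {
    // `get_mut(..1)` yields None unless byte offset 1 is in bounds
    // and falls on a char boundary.
    match s.get_mut(..1) {
        Some(first) => {
            first.make_ascii_uppercase();
            true
        }
        None => false,
    }
}

fn main() {
    let mut s = String::from("cat");
    assert!(upper_first_in_place(&mut s));
    assert_eq!(s, "Cat");

    // Starts with U+00E9, which is two bytes in UTF-8: left unchanged.
    let mut t = String::from("\u{e9}t\u{e9}");
    assert!(!upper_first_in_place(&mut t));
    assert_eq!(t, "\u{e9}t\u{e9}");
}
```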

ibraheemdev · on June 9, 2021

I assumed `c` was a `String` - my code should be `c.into_bytes()`. If you have a str, then of course mutating the bytes directly is unsafe.

Also, `str::make_ascii_uppercase` would require using `get_mut`, which does UTF-8 checks.

ibraheemdev · on June 9, 2021

Rust strings are guaranteed to be valid UTF-8. Mutating a string byte directly might result in an invalid String. You perform a free conversion to bytes, mutate the slice, and then perform either a free but unsafe conversion back to String, or a potentially expensive but safe (UTF-8 validation) conversion. If you want to handle non-ASCII, then you must search the string for Unicode character indices, which is not free.
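Roughly, that round trip could look like the sketch below. The function name `upper_first_via_bytes` is made up; the safe `String::from_utf8` path is shown, with the unchecked alternative only mentioned in a comment.

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper: String -> bytes -> mutate -> String.
fn upper_first_via_bytes(s: String) -> Result<String, FromUtf8Error> {
    // Free conversion: takes over the String's buffer without copying.
    let mut bytes: Vec<u8> = s.into_bytes();
    if let Some(first) = bytes.first_mut() {
        // Only touches b'a'..=b'z', so the buffer stays valid UTF-8.
        first.make_ascii_uppercase();
    }
    // Safe conversion back: scans the whole buffer to validate UTF-8.
    // The free alternative is `unsafe { String::from_utf8_unchecked(bytes) }`.
    String::from_utf8(bytes)
}

fn main() {
    let out = upper_first_via_bytes(String::from("cat")).expect("still valid UTF-8");
    assert_eq!(out, "Cat");
}
```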

patrick451 · on June 10, 2021

I don't know Rust, but it sounds like you're saying that indexing into a string gives you the i-th byte, not the i-th Unicode character. If so, that sounds like a design flaw.

TheCoelacanth · on June 10, 2021

There is no valid use case for indexing into a Unicode string by character. Any language that allows you to do that has a design flaw.

wtetzner · on June 11, 2021

There isn't really even a good definition for "character" in Unicode.

There are code points, grapheme clusters, etc.
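A small sketch of why "character" is slippery, using only the standard library. The helper `sizes` is hypothetical. Both strings below render as é, yet they differ in byte length and code point count; counting grapheme clusters would need an external crate, which is not shown here.

```rust
/// Hypothetical helper: (length in UTF-8 bytes, number of code points).
fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let precomposed = "\u{e9}"; // é as a single code point
    let decomposed = "e\u{301}"; // 'e' followed by a combining acute accent

    assert_eq!(sizes(precomposed), (2, 1));
    assert_eq!(sizes(decomposed), (3, 2));
    // Same appearance on screen, but not equal as strings.
    assert_ne!(precomposed, decomposed);
}
```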

ibraheemdev · on June 10, 2021

No, you cannot index into a string directly. You must either convert it to bytes to index by byte, or call `.chars()` to get an iterator over Unicode scalar values. The former is free, but the latter requires scanning the string to find character boundaries.
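The two options side by side, as a minimal sketch. The helper names `nth_byte` and `nth_char` are invented for illustration.

```rust
/// Hypothetical helper: the i-th byte, a constant-time slice lookup.
fn nth_byte(s: &str, i: usize) -> Option<u8> {
    s.as_bytes().get(i).copied()
}

/// Hypothetical helper: the i-th scalar value, found by walking from the start.
fn nth_char(s: &str, i: usize) -> Option<char> {
    s.chars().nth(i)
}

fn main() {
    let s = "h\u{e9}llo"; // "héllo": 5 scalar values, 6 bytes

    // Byte 1 is the first half of the two-byte encoding of U+00E9.
    assert_eq!(nth_byte(s, 1), Some(0xC3));
    assert_eq!(nth_byte(s, 2), Some(0xA9));

    // Scalar value 1 is the whole 'é'; scalar value 2 is 'l'.
    assert_eq!(nth_char(s, 1), Some('\u{e9}'));
    assert_eq!(nth_char(s, 2), Some('l'));
}
```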