Uppercasing the first character works only if the original text used 'dz' in its decomposed, two-character form.
A similar thing happens with transliteration. You cannot just transliterate Þ -> TH, because then transliterating something like Þorlákshöfn would yield THorlakshofn instead of Thorlakshofn.
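To make the Þ case concrete, here is a minimal sketch of a context-sensitive replacement. The function name `transliterate_thorn` and the "look at the next letter" heuristic are my own assumptions for illustration, not an established transliteration standard, and it only handles Þ/þ (it leaves á, ö and friends alone).

```rust
/// Hypothetical helper: replaces Þ/þ with a Latin digraph, choosing
/// "TH" or "Th" for the capital depending on the following character.
fn transliterate_thorn(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            'þ' => out.push_str("th"),
            'Þ' => {
                // Assumption: an uppercase follower means the word is in
                // all caps; otherwise emit title case.
                let next_is_upper = chars.peek().map_or(false, |n| n.is_uppercase());
                out.push_str(if next_is_upper { "TH" } else { "Th" });
            }
            other => out.push(other),
        }
    }
    out
}
```

The point is only that the mapping for a single character depends on its neighbours, so a plain one-to-one table is not enough.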
I know nothing about Rust, and couldn't grok most of what was said in the post, but I'm curious. Is this all just to please the type system, and will it result in efficient code as if I did the following in C (for some implementation of to_uppercase())?
Sure, there are multiple reasons. str[0] doesn't exist either, that would fail regardless of the mutability thing, but you're also right that the &str is immutable, and that would fail.
Yeah, when working with ASCII, I just use `Vec<u8>`/`&[u8]`. There's also the `ascii` crate, which encodes ASCII into the type system (enums), but I find it easier to work with raw bytes. The equivalent code in Rust would be:
let mut bytes = c.as_bytes();
bytes[0].make_ascii_uppercase();
Your code will not compile. `str::as_bytes` returns a `&[u8]`, and indexing through that shared reference only gives read-only access to the `u8`, but `u8::make_ascii_uppercase` requires `&mut u8`. There is no safe way to get a `&mut [u8]` from a `&mut str` or a `&mut String` (because it would let you violate str's UTF-8 invariant).
You don't need to downgrade to `[u8]` in the first place. `&mut str` already has its own `make_ascii_uppercase`.
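A minimal sketch of that approach, assuming you only care about an ASCII first letter. The function name `capitalize_ascii_in_place` is made up for illustration; `str::get_mut` and `str::make_ascii_uppercase` are standard library methods.

```rust
/// Hypothetical helper: uppercases the first character in place,
/// but only if it is a single-byte (ASCII) character.
fn capitalize_ascii_in_place(s: &mut str) {
    // `get_mut(0..1)` returns None for an empty string, or when byte 1
    // is not a char boundary (i.e. the first character is multi-byte),
    // so this never panics and never breaks UTF-8 validity.
    if let Some(first) = s.get_mut(0..1) {
        first.make_ascii_uppercase();
    }
}
```

Calling it on a `String` holding "hello" leaves "Hello"; a string starting with a non-ASCII character such as "élan" is left unchanged.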
Rust strings are guaranteed to be valid UTF-8. Mutating a byte of a string directly might result in an invalid String. You can perform a free conversion to bytes, mutate the slice, and then perform either a free but unsafe conversion back to String, or a safe but potentially expensive (UTF-8 validation) one. If you want to handle non-ASCII, then you must search the string for Unicode character indices, which is not free.
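The safe round trip described above could look roughly like this. The function name `capitalize_via_bytes` is an assumption for illustration; `String::into_bytes` and `String::from_utf8` are the standard library calls.

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper: String -> Vec<u8> -> String, with validation.
fn capitalize_via_bytes(s: String) -> Result<String, FromUtf8Error> {
    // Free: just hands over the underlying buffer.
    let mut bytes = s.into_bytes();
    // Only touches ASCII a-z; any other byte is left as it was.
    if let Some(b) = bytes.first_mut() {
        b.make_ascii_uppercase();
    }
    // Safe, but re-validates the whole buffer as UTF-8.
    String::from_utf8(bytes)
}
```

The unsafe alternative is `String::from_utf8_unchecked`, which skips the validation pass and leaves it up to you to guarantee that the bytes are still valid UTF-8.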
I don't know Rust, but it sounds like you're saying that indexing into a string gives you the i-th byte, not the i-th Unicode character. If so, that sounds like a design flaw.
No, you cannot index into a string directly. You must either convert it to bytes to index by byte, or call `.chars()` to get an iterator over Unicode scalar values. The former is free, but the latter has to decode the string from the start to find where each character begins.
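For the non-ASCII case, a sketch using `.chars()` might look like the following. The function name `capitalize` is an assumption; `char::to_uppercase` is the standard library method, and it returns an iterator because one character can uppercase to several.

```rust
/// Hypothetical helper: returns a new String with the first Unicode
/// scalar value uppercased and the rest copied through unchanged.
fn capitalize(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        // `to_uppercase()` yields one or more chars; chain the rest on.
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}
```

Note that this does uppercasing, not titlecasing: "ßa" comes out as "SSa", which is the same kind of problem as the digraph and Þ examples earlier in the thread.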
Unicode has separate code points for those, but their use is discouraged.