Do you need a vector of displayed characters? Usually all you care about is the ...

nabla9 · on Jan 16, 2014

Yes. If I want to work with language and do some stringology, that's what I want. I might want to swap some characters, find the length of words, etc. To have vector of characters (as what humans consider characters) is valuable.

twic · on Jan 18, 2014

Yes. A really good string type, that actually modelled a sequence of characters, not bytes, codepoints, interspersed glyphs and modifiers, or what have you, would be very useful at times.

The acid test for this sort of 'humane string' type would be whether you could splice together any two substrings from any two input strings and get something that could be validly displayed. UTF-8 bytes fail because you can get fractional codepoints. Codepoints in >=21-bit integers fail because you can get modifier characters which don't attach to anything.
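To make the two failure modes concrete, here is a rough Python sketch. The sample strings and variable names are my own, picked only for illustration, and the standard `unicodedata` module is used just to check whether a code point is a combining mark.

```python
import unicodedata

# Failure 1: slicing UTF-8 bytes can cut a code point in half.
s = "na\u00efve"            # "naive" with a precomposed i-diaeresis (2 bytes in UTF-8)
raw = s.encode("utf-8")
fragment = raw[:3]          # 'n', 'a', and only the first byte of U+00EF
try:
    fragment.decode("utf-8")
    bytes_ok = True
except UnicodeDecodeError:
    bytes_ok = False        # the fragment is not valid UTF-8 on its own

# Failure 2: slicing code points can strand a combining mark.
t = "e\u0301tat"            # "etat" with e + U+0301 COMBINING ACUTE ACCENT
tail = t[1:]                # now begins with the accent, attached to nothing
dangling = unicodedata.combining(tail[0]) != 0

print(bytes_ok, dangling)   # False True
```

Both slices are perfectly legal operations on their respective unit types, which is the point: neither unit is safe to splice on.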

A similar test would be whether you can reverse any input string by reversing the sequence of units. For example, reversing "amm͊z" should yield "zm͊ma", which it doesn't if you reverse Unicode codepoints, because "m͊" is made with a combining mark and has no precomposed form.
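A minimal sketch of the reversal test in Python, assuming the mark in the example is U+034A (any combining mark behaves the same way). The helper names `clusters` and `reverse_clusters` are made up here, and the grouping is only an approximation: it glues combining marks to the preceding code point and is not full grapheme cluster segmentation as defined in UAX #29.

```python
import unicodedata

def clusters(text):
    """Group each code point with the combining marks that follow it.

    Approximation only: real grapheme segmentation has many more rules.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch       # attach the mark to the previous unit
        else:
            out.append(ch)      # start a new unit
    return out

def reverse_clusters(text):
    return "".join(reversed(clusters(text)))

word = "am" + "m\u034a" + "z"       # a, m, m + combining mark, z

naive = word[::-1]                  # reverses code points
fixed = reverse_clusters(word)      # reverses units

print(naive == "z\u034amma")        # True: the mark has jumped onto the z
print(fixed == "zm\u034ama")        # True: the mark stays on its m
```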

For extra fun, i suspect that reversing the string "œ" should yield "eo".

It should also be simple to do things like search for particular characters in a modifier-insensitive way. For example, i should be able to count that "sš" contains two copies of the letter 's' without having to do any deciphering. I suspect i should also be able to count that the string "ß" contains two copies of the letter 's', but i'm not nearly as sure about that.
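As a rough illustration of what that "deciphering" looks like today, here is a Python sketch using canonical decomposition (NFD) from the standard `unicodedata` module. The function name `count_base_letter` is hypothetical. Note that "ß" has no decomposition, so it only counts as two copies of 's' if you also apply case folding, which is a separate and more debatable step.

```python
import unicodedata

def count_base_letter(text, letter):
    """Count occurrences of `letter`, ignoring any combining marks on it."""
    decomposed = unicodedata.normalize("NFD", text)   # s-caron becomes s + U+030C
    return sum(1 for ch in decomposed if ch == letter)

two = count_base_letter("s\u0161", "s")                  # "s" followed by s-caron
eszett_plain = count_base_letter("\u00df", "s")          # sharp s, no decomposition
eszett_folded = count_base_letter("\u00df".casefold(), "s")  # casefold gives "ss"

print(two, eszett_plain, eszett_folded)   # 2 0 2
```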

I think i essentially want a string that looks like:

List<Pair<Character, List<Modifier>>>

But i'm not sure. And i'm even less sure about how i'd encode it efficiently.
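One possible reading of that type, sketched in Python rather than as a serious proposal: each unit is a (base, modifiers) pair built from the NFD form of the input. The names `HumaneString`, `to_humane` and `from_humane` are invented for this sketch, it treats only combining marks as modifiers, and it makes no attempt at an efficient encoding.

```python
import unicodedata
from typing import List, Tuple

# Hypothetical stand-in for List<Pair<Character, List<Modifier>>>.
HumaneString = List[Tuple[str, List[str]]]

def to_humane(text: str) -> HumaneString:
    """Split text into (base, [combining marks]) units."""
    units: HumaneString = []
    for ch in unicodedata.normalize("NFD", text):
        if units and unicodedata.combining(ch):
            units[-1][1].append(ch)     # modifier for the previous base
        else:
            units.append((ch, []))      # new base (a leading stray mark lands here too)
    return units

def from_humane(units: HumaneString) -> str:
    """Flatten units back into an ordinary string."""
    return "".join(base + "".join(mods) for base, mods in units)

# Splicing slices of two inputs never separates a mark from its base.
left = to_humane("e\u0301a")[:1]        # just the accented e
right = to_humane("zm\u034a")[1:]       # just the m with its mark
spliced = from_humane(left + right)

print(to_humane("s\u0161"))             # [('s', []), ('s', ['\u030c'])]
print(spliced == "e\u0301m\u034a")      # True
```

With this shape, the modifier-insensitive count is just a comparison on the first element of each pair.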

pornel · on Jan 16, 2014

> To have vector of characters (as what humans consider characters) is valuable.

That might be an awful can of worms. Are Arabic vowels characters? What about the "ij" letter in Dutch? Would you separate Korean text into letters or treat each block of letters as a character?

sanxiyn · on Jan 17, 2014

I can answer the question on Korean. Treat each block of letters as a character. Never ever separate for human uses.
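For reference, a small Python sketch of how this shows up in Unicode, using the syllable block U+D55C as my own example. A block can be stored as one precomposed code point or as a sequence of conjoining jamo, and the jamo are not combining marks in the combining-class sense, so a naive "base plus combining marks" grouping would split the block apart.

```python
import unicodedata

block = "\ud55c"                                  # one Hangul syllable block
jamo = unicodedata.normalize("NFD", block)        # decomposed into conjoining jamo

as_block = len(block)                             # 1 code point
as_jamo = len(jamo)                               # 3 code points: U+1112 U+1161 U+11AB
round_trip = unicodedata.normalize("NFC", jamo) == block

# All three jamo have combining class 0, so a combining-class check
# alone would treat each of them as a separate "character".
no_marks = all(unicodedata.combining(c) == 0 for c in jamo)

print(as_block, as_jamo, round_trip, no_marks)    # 1 3 True True
```

So keeping blocks whole means either normalizing to NFC first or using real grapheme cluster segmentation.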

Guvante · on Jan 21, 2014

Can you really automate that problem? You could provide a "split at glyphs" function, but I doubt that would actually be useful without tons of caveats. Even English doesn't do split at glyphs well given the existence of ligatures. `flat -> fl a t`.
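A quick Python sketch of the ligature wrinkle, with the caveat that most ligatures are chosen by the font at render time and never appear in the text at all. The case below only covers the legacy compatibility code point U+FB02 (LATIN SMALL LIGATURE FL), which compatibility normalization (NFKC) expands back into two letters; the variable names are mine.

```python
import unicodedata

typed = "flat"                  # four code points; a font may still draw "fl" as one glyph
legacy = "\ufb02at"             # U+FB02 ligature code point followed by "at"

typed_len = len(typed)          # 4
legacy_len = len(legacy)        # 3

canonical = unicodedata.normalize("NFC", legacy)    # leaves the ligature alone
folded = unicodedata.normalize("NFKC", legacy)      # expands it to "fl"

print(typed_len, legacy_len, canonical == legacy, folded == typed)   # 4 3 True True
```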

Not to mention you would need to make any such function language aware since different languages could theoretically have different mapping rules for the same sequence of characters.