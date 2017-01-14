Another algorithm on Unicode strings that wants to index by code point is Unicode regular expression matching. I've heard that you do lose something here if you can't assume every code point encodes to the same length. Unlike the other algorithms you mentioned, which only iterate forward, a regex engine may need to backtrack, depending on its implementation.
I once heard that the complexity of implementing regex directly on UTF-8 is one reason why Python will not use UTF-8 internally.
The reason I bring it up is that UTF-16 is a very commonly used internal encoding for strings, and regular expression engines for platforms which use UTF-16 internally are generally designed to work on that encoding without conversion to UTF-32. Examples that spring to mind are most JavaScript regular expression engines, the ICU regular expression engine, and java.util.regex. If variable-width encodings were so difficult to deal with, you would expect these engines to have run into problems with them; however, as far as I can tell, they have not. This is evidence against variable-width encodings being difficult for regular expression engines to deal with.
(Sanity check for my argument above: If you read through https://swtch.com/~rsc/regexp/regexp2.html, which covers both backtracking and non-backtracking regex engines and includes code, none of the code examples would need more than trivial modification to deal with variable-width encodings.)
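To make the "trivial modification" point concrete, here is a minimal sketch of my own (not code from that article or from any engine mentioned here): a tiny recursive backtracking matcher, in the spirit of the simple matchers that article discusses, running directly on UTF-8 bytes. It assumes the input is valid UTF-8 and supports only literals, `.` and `*`. The names `cp_len`, `match_one`, `match_here` and `full_match` are made up for illustration. The only encoding-aware parts are that a step forward advances by `cp_len` bytes instead of 1, and that the positions the matcher returns to are byte offsets, which always sit on code point boundaries.

```python
def cp_len(lead: int) -> int:
    """Byte length of the UTF-8 sequence starting with this lead byte.

    Assumes valid UTF-8, so `lead` is never a continuation byte.
    """
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def match_one(c: str, text: bytes, ti: int) -> bool:
    """Does pattern char c match the code point starting at byte offset ti?"""
    n = cp_len(text[ti])
    return c == "." or text[ti:ti + n] == c.encode("utf-8")


def match_here(pat: str, pi: int, text: bytes, ti: int) -> bool:
    """Match pat[pi:] against text[ti:], requiring the whole text to be consumed."""
    if pi == len(pat):
        return ti == len(text)
    if pi + 1 < len(pat) and pat[pi + 1] == "*":
        # Try zero repetitions first, then one more code point at a time.
        # ti is a byte offset, so retrying from it needs no index arithmetic.
        while True:
            if match_here(pat, pi + 2, text, ti):
                return True
            if ti < len(text) and match_one(pat[pi], text, ti):
                ti += cp_len(text[ti])
            else:
                return False
    if ti < len(text) and match_one(pat[pi], text, ti):
        return match_here(pat, pi + 1, text, ti + cp_len(text[ti]))
    return False


def full_match(pat: str, text: str) -> bool:
    """Convenience wrapper: encode the text and match from byte offset 0."""
    return match_here(pat, 0, text.encode("utf-8"), 0)
```

For example, `full_match("a.c", "aéc")` succeeds even though `é` is two bytes, because `.` consumes one code point rather than one byte. Nothing in the matcher ever asks for "the n-th code point", so the lack of O(1) code point indexing never comes up.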
burntsushi's utf8-ranges library, used by the Rust regex lib, probably makes this easier to deal with. I wonder how it performs relative to the Python implementation (which I assume is in native code).
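For anyone wondering what that kind of library does: as I understand it, the idea is to turn a range of code points into a small set of byte-range sequences, so an automaton can match raw UTF-8 bytes without ever decoding. Below is a rough Python sketch of that decomposition. It is my own reconstruction of the idea, not the crate's code or API, and `utf8_sequences` and `fmt` are hypothetical names.

```python
def utf8_sequences(lo: int, hi: int):
    """Yield lists of (min_byte, max_byte) pairs that together match exactly
    the UTF-8 encodings of the scalar values lo..hi (inclusive)."""
    stack = [(lo, hi)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Surrogates (U+D800..U+DFFF) are not scalar values: cut them out.
        if lo <= 0xDFFF and hi >= 0xD800:
            stack.append((0xE000, hi))
            stack.append((lo, 0xD7FF))
            continue
        # Split where the encoded length changes (1, 2, 3, 4 bytes).
        for top in (0x7F, 0x7FF, 0xFFFF):
            if lo <= top < hi:
                stack.append((top + 1, hi))
                stack.append((lo, top))
                break
        else:
            if hi <= 0x7F:
                yield [(lo, hi)]
                continue
            # Split until every trailing byte covers a full, aligned range,
            # so that per-byte ranges describe the set exactly.
            for i in (1, 2, 3):
                m = (1 << (6 * i)) - 1
                if (lo & ~m) != (hi & ~m):
                    if (lo & m) != 0:
                        stack.append(((lo | m) + 1, hi))
                        stack.append((lo, lo | m))
                        break
                    if (hi & m) != m:
                        stack.append((hi & ~m, hi))
                        stack.append((lo, (hi & ~m) - 1))
                        break
            else:
                a = chr(lo).encode("utf-8")
                b = chr(hi).encode("utf-8")
                yield list(zip(a, b))


def fmt(seq) -> str:
    """Render one sequence like [E1-EC][80-BF][80-BF]."""
    return "".join(
        f"[{a:02X}]" if a == b else f"[{a:02X}-{b:02X}]" for a, b in seq
    )
```

Running `[fmt(s) for s in utf8_sequences(0, 0x10FFFF)]` gives nine sequences, starting with `[00-7F]` and `[C2-DF][80-BF]` and ending with `[F4][80-8F][80-BF][80-BF]`. A character class like "any code point" then becomes an alternation of those byte sequences, which a DFA can handle one byte at a time.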
