That's not right at all. Two equal strings /may/ be represented by the same pointer. Only if all strings are interned will this trick be useful and I don't know of any languages that intern all strings
Also, cache and branch prediction effects don't seem to be accounted for.
This is similar to a trick for representing bignums on a 32-bit system. You can use the lowest bit of a 32-bit value as a tag: if it's zero, then the number is a 31-bit integer; if it's 1, then the number (with the low bit cleared) is a pointer to a heap-allocated bignum. The common case is that you're doing operations with numbers that fit in 31 bits, so the code for handling bignums is avoided by the branch predictor. (This works even better on a 64-bit system, where the common case is even more common.)
As for cache effects, I think the main issues there, for ropes, are tree depth and chunk size. You want the tree to be shallow (so as to minimize the number of potential cache misses before you reach the chunk you're looking for) and you want the chunks to be large enough that they get good cache locality, but small enough that you get fast insertion/deletion/concatenation/etc.
> This also makes it possible to store only a single copy of short strings, to improve memory efficiency, cache locality, and so on.
He's specifically talking about interning strings. He even links to the wikipedia article on string interning. I recall reading that Lua interns strings by default, and a quick google search seems to confirm my belief. However, with only one example, you're right: interning /all/ strings is unusual. Interning all "short" strings, his suggestion, is a more feasible heuristic, and then (if finding the length of a string is O(1)) comparing two short strings amounts to little more than a pointer comparison. I don't know whether there are any languages that do that.
...and the author names every data type after something stringy. I wanted to continue the theme, but it took some serious Wikipedia digging before I could find something that was textile-related and not already taken. In retrospect, this was a bad choice, since I kept forgetting how to spell it. :-(
And for your future endeavors: twine, thrum, strand, and lisle are, as far as I know, unused.
If you are building an index ahead of time, just use byte offsets in your index instead of character offsets. You'll still be iterating through it from the beginning to build the index.
If you use a fast text search algorithm like Aho-Corasick, you'll still get reliable and accurate fast text search, and while you will perform unnecessary checks against invalid UTF-8 sequences, you won't have to create extra data structures or use any other operations and the algorithm can be implemented in a way that's pretty straightforward. The convenience of that outweighs the benefits you'd get of using a larger, elaborate data structure.
This is not an argument against heavyweight strings, but an argument for better programming languages.
If you combine this with heavyweight strings, you can get incremental and/or parallel string matching, but it's almost as easy to do on a byte array.
If I recall correctly an implementation of "ropes" ships with the Boehm GC sources.