
How to represent strings, the heavy way - pjscott
http://finger-tree.blogspot.com/2010/11/my-strings-can-beat-up-your-strings.html
======
pjscott
It's funny, strings are one of the most often-used data structures, and yet
people don't think about them as much as they think about more sexy things
like hash tables. Strings are just part of the scenery. They deserve more
glory. (Except for C strings, which need to die.)

------
prodigal_erik
Has anyone ever created a robust Unicode string API that treats graphemes as
indivisible units? I keep seeing libraries that force you to directly handle
codepoints, making it your problem to be aware of which characters are
combining and have to be kept together (which almost nobody has done their
homework on) so as not to corrupt the text.

~~~
ewiethoff
Would Perl's Unicode::Normalize module fit your bill? I believe
Unicode::Collate uses it so that a single codepoint accented 'e', for example,
sorts the same as 'e' with a combining accent.

~~~
prodigal_erik
That's similar. What I'd really like to see is a string API that doesn't let
you separate the letter and the combining accent at all, whether it has a
precomposed normalization or not (unless you dig all the way down to asking
which codepoints a grapheme is composed from).

------
ryanpetrich
> two equal strings will always be represented by the same pointer.

That's not right at all. Two equal strings /may/ be represented by the same
pointer. Only if all strings are interned will this trick be useful, and I
don't know of any languages that intern all strings.

Also, cache and branch prediction effects don't seem to be accounted for.

~~~
pjscott
Okay, two _interned_ strings will be represented by the same pointer. That
doesn't complicate the comparison logic that much; the common case of
interned-interned comparisons can be checked for first, and the branch
predictor will get rid of most of the overhead of having an extra check in
there.
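A rough sketch of that fast path in Python, where sys.intern plays the role of the intern table (names like str_eq are made up for illustration, not from any particular implementation):

```python
import sys

def str_eq(a, b):
    # Fast path: two interned strings that are equal are the same object,
    # so an identity (pointer) comparison settles it immediately.
    if a is b:
        return True
    # Slow path: ordinary character-by-character comparison for everything
    # else, including interned-vs-uninterned pairs.
    return a == b

x = sys.intern("heavyweight")
y = sys.intern("".join(["heavy", "weight"]))  # interned to the same object
assert x is y                                 # fast path fires
assert str_eq(x, y)

p = "".join(["heavy", "weight"])              # built at runtime, not interned
assert p is not x                             # identity check fails...
assert str_eq(p, x)                           # ...but the fallback still works
```

The branch predictor learns that the identity check almost always decides interned-interned comparisons, which is why the extra test is nearly free.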

This is similar to a trick for representing bignums on a 32-bit system. You
can use the lowest bit of a 32-bit value as a tag: if it's zero, then the
number is a 31-bit integer; if it's 1, then the number (with the low bit
cleared) is a pointer to a heap-allocated bignum. The common case is that
you're doing operations with numbers that fit in 31 bits, so the code for
handling bignums is avoided by the branch predictor. (This works even better
on a 64-bit system, where the common case is even more common.)
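The tagging scheme looks roughly like this, simulated in Python (a real implementation works on raw machine words; here an int stands in for a word and a list stands in for the heap):

```python
HEAP = []  # stand-in for heap-allocated bignum objects

def encode(n):
    if -(2**30) <= n < 2**30:          # fits in 31 bits (signed)
        return n << 1                  # low bit 0: immediate integer
    HEAP.append(n)                     # "allocate" a bignum
    return ((len(HEAP) - 1) << 1) | 1  # low bit 1: "pointer" into the heap

def decode(w):
    if w & 1 == 0:
        return w >> 1                  # common case: just untag the small int
    return HEAP[w >> 1]                # rare case: follow the "pointer"

assert decode(encode(12345)) == 12345  # small int, no allocation
assert decode(encode(-5)) == -5
assert decode(encode(2**40)) == 2**40  # spills to the heap
```

The untag-and-go path is a shift and a branch, and since that branch is almost always taken the same way, the bignum machinery stays out of the hot loop.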

As for cache effects, I think the main issues there, for ropes, are tree depth
and chunk size. You want the tree to be shallow (so as to minimize the number
of potential cache misses before you reach the chunk you're looking for) and
you want the chunks to be large enough that they get good cache locality, but
small enough that you get fast insertion/deletion/concatenation/etc.

------
RodgerTheGreat
This type of data structure is sometimes called a "rope". (get it?)

~~~
pjscott
The list of textile-related terms which correspond to data structures includes
string, rope, cord, yarn, weave, weft, thread, and even quipu. It's getting
kind of ridiculous.

~~~
andolanra
I looked all of these up; in doing so, the only quipu I could find is on what
I believe to be your own GitHub page. Are quipus (or, to use the Quechua
plural, quipukuna) something you devised, or are you borrowing the name from
elsewhere?

~~~
pjscott
That would be my own contribution to the ridiculousness. See, I'm working on
operational transformation based on this paper:

<http://bouillon.math.usu.ru/articles/ctre.pdf>

...and the author names every data type after something stringy. I wanted to
continue the theme, but it took some serious Wikipedia digging before I could
find something that was textile-related and not already taken. In retrospect,
this was a bad choice, since I kept forgetting how to spell it. :-(

~~~
andolanra
Most varieties of Quechua wouldn't spell it 'quipu' any more; that's the
influence of Spanish orthography, where 'qu' meant 'k' and 'hu' meant 'w'.
Most modern orthographies would spell it 'khipu', which might be easier to
remember.

And for your future endeavors: twine, thrum, strand, and lisle are, as far as
I know, unused.

------
moultano
If you need to access a utf-8 string indexed by _character_ rather than byte,
you are probably doing something wrong.

~~~
tomjen3
You have an issue with fast text search?

~~~
moultano
If you aren't building an index ahead of time, you're writing a state machine,
in which case you're better off treating it as bytes for efficiency and
simplicity. (An array of 256 pointers vs a nearly unbounded dynamic map of
pointers.)

If you are building an index ahead of time, just use byte offsets in your
index instead of character offsets. You'll still be iterating through it from
the beginning to build the index.
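A quick sketch of that, as a toy word index in Python (the regex and index shape are made up for illustration): every stored offset is a byte offset into the UTF-8 buffer, and lookups slice the bytes directly with no character counting anywhere.

```python
import re

text = "naïve café — naïve again"
data = text.encode("utf-8")

# One pass from the beginning, storing *byte* offsets for each word.
index = {}
for m in re.finditer(rb"\S+", data):
    index.setdefault(m.group(), []).append(m.start())

# Lookup: slice the byte array at the stored offsets, decode at the end.
key = "naïve".encode("utf-8")
hits = [data[o:o + len(key)].decode("utf-8") for o in index[key]]
assert hits == ["naïve", "naïve"]
```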

~~~
tomjen3
You can't treat it as bytes because you are then dependent on what came
before those bytes, which messes with your ability to skip ahead on a miss.

~~~
andolanra
He's arguing that you don't need to skip ahead on a miss. If you're testing
some UTF-8 string for membership in some other UTF-8 text, then go ahead and
compare the individual bytes even if they don't line up as characters.

If you use a fast text search algorithm like Aho-Corasick, you'll still get
reliable and accurate fast text search, and while you will perform some
unnecessary checks against invalid UTF-8 sequences, you won't have to build
extra data structures or add any special-case operations, and the algorithm
can be implemented in a pretty straightforward way. That convenience
outweighs the benefits you'd get from a larger, more elaborate data structure.
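The property that makes this safe is that UTF-8 is self-synchronizing: lead bytes and continuation bytes occupy disjoint ranges, so a complete, valid encoded needle can only match at a character boundary. A small demonstration:

```python
haystack = "smörgåsbord"
needle = "gås"

hb = haystack.encode("utf-8")
nb = needle.encode("utf-8")

# Plain byte search: no decoding, no character bookkeeping.
byte_off = hb.find(nb)
assert byte_off != -1

# The match decodes back to exactly the needle...
assert hb[byte_off:byte_off + len(nb)].decode("utf-8") == needle
# ...and everything before it decodes cleanly too, i.e. the match landed
# on a character boundary, not mid-way through a multi-byte sequence.
assert hb[:byte_off].decode("utf-8") == haystack[:haystack.index(needle)]
```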

~~~
pjscott
If you just need get-current-char and go-forward-n-chars operations, you can
get those cheaply by having heavyweight strings implement an iterable API.
(Yes, I know you also need the ability to iterate backward for Boyer-Moore,
but that's also fairly straightforward.)

This is not an argument against heavyweight strings, but an argument for
better programming languages.

~~~
andolanra
Well, the other advantage to using language-provided ropes or some variation
on it, especially in functional languages with the capability for parallel
execution, is that the rope structures (i.e. trees) are similar enough to
Steele's conc lists (c.f. Guy Steele's "Organizing Functional Code for
Parallel Execution") that searching in parallel could be achieved with little
to no extra work from the programmer, which when possible would be much better
than straight Aho-Corasick on a byte array.

~~~
pjscott
What if you're searching for substrings which span more than one chunk? What
you'd end up needing to do is make a monoid of Aho-Corasick automaton state
transition functions, and use a parallel prefix method. Steele gives examples
of this sort of thing in the talk you mentioned. He and Danny Hillis go into
more detail in their classic paper:

<http://portal.acm.org/citation.cfm?doid=7902.7903>

If you combine this with heavyweight strings, you can get incremental and/or
parallel string matching, but it's almost as easy to do on a byte array.
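For concreteness, here's a toy version of that monoid in Python: a single-pattern matcher stands in for a full Aho-Corasick automaton, and a sequential reduce stands in for the parallel prefix step (reporting the actual matches would mean augmenting each summary with a count, which is omitted here).

```python
from functools import reduce

# Toy matcher for the single pattern "ab"; state = length of the longest
# prefix of "ab" matched so far. A real Aho-Corasick automaton just has
# more states, built from many patterns.
NSTATES = 3

def step(state, byte):
    if byte == ord("a"):
        return 1
    if byte == ord("b") and state == 1:
        return 2  # full match
    return 0

def summarize(chunk):
    # A chunk's whole effect, as a function state -> state, tabulated
    # as a tuple: t[s] is the state after running the chunk from s.
    t = tuple(range(NSTATES))
    for b in chunk:
        t = tuple(step(s, b) for s in t)
    return t

def compose(f, g):
    # Monoid operation: run f's chunk, then g's. It's associative, with
    # the identity mapping as the unit, so a parallel prefix scan applies.
    return tuple(g[s] for s in f)

text = b"xxabxaab"
chunks = [text[i:i + 3] for i in range(0, len(text), 3)]  # rope-like pieces

seq = 0
for b in text:            # sequential reference run
    seq = step(seq, b)

# Each summary is computed independently (parallelizable); combining them
# associatively recovers the sequential result, even for matches that
# straddle chunk boundaries.
combined = reduce(compose, map(summarize, chunks))
assert combined[0] == seq
```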

------
turbofail
Related paper by Boehm:

[http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/...](http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf)

If I recall correctly an implementation of "ropes" ships with the Boehm GC
sources.

------
postfuturist
Small nit: Ruby actually represents strings as mutable objects.

~~~
gchpaco
Which does cause bugs that are occasionally hilarious, but usually
mind-rendingly irritating to ferret out. Common Lisp strings are the same
way (symbols in both languages being the immutable type).

