How to represent strings, the heavy way (finger-tree.blogspot.com)
36 points by pjscott 2402 days ago | 29 comments

It's funny, strings are one of the most often-used data structures, and yet people don't think about them as much as they think about more sexy things like hash tables. Strings are just part of the scenery. They deserve more glory. (Except for C strings, which need to die.)

Has anyone ever created a robust Unicode string API that treats graphemes as indivisible units? I keep seeing libraries that force you to directly handle codepoints, making it your problem to be aware of which characters are combining and have to be kept together (which almost nobody has done their homework on) so as not to corrupt the text.

Would Perl's Unicode::Normalize module fit your bill? I believe Unicode::Collate uses it so that a single codepoint accented 'e', for example, sorts the same as 'e' with a combining accent.
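I can't speak for the Perl modules' exact internals, but the underlying idea is Unicode normalization; a sketch of the same thing using Python's unicodedata:

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single precomposed codepoint
combining = "e\u0301"    # 'e' followed by a combining acute accent

# The two are different codepoint sequences...
assert precomposed != combining

# ...but normalizing both to the same form (NFC here) makes them compare
# equal, which is what a collation library needs before sorting.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", combining)
assert nfc_a == nfc_b
```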

That's similar. What I'd really like to see is a string API that doesn't let you separate the letter and the combining accent at all, whether it has a precomposed normalization or not (unless you dig all the way down to asking which codepoints a grapheme is composed from).
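As a rough sketch of what such an API would sit on top of: naive grapheme clustering just glues combining marks to the preceding base character. (Real grapheme segmentation follows UAX #29 and handles many more cases; this is only the core idea.)

```python
import unicodedata

def graphemes(s):
    # Naive clustering: attach each combining mark to the preceding
    # base character so the pair is treated as one indivisible unit.
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' + combining accent stays together even without a precomposed form
assert graphemes("e\u0301x") == ["e\u0301", "x"]
```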

> two equal strings will always be represented by the same pointer.

That's not right at all. Two equal strings /may/ be represented by the same pointer. This trick is only useful if all strings are interned, and I don't know of any languages that intern all strings.

Also, cache and branch prediction effects don't seem to be accounted for.

Okay, two interned strings will be represented by the same pointer. That doesn't complicate the comparison logic that much; the common case of interned-interned comparisons can be checked for first, and the branch predictor will get rid of most of the overhead of having an extra check in there.
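In Python, sys.intern gives you this on demand; a comparison routine with the interned fast path might look like this (a sketch of the idea, not how any particular runtime actually compares strings):

```python
import sys

def fast_eq(a, b):
    # Fast path: two interned (or otherwise shared) strings are equal
    # iff they are the same object -- a single pointer comparison.
    if a is b:
        return True
    # Slow path: ordinary character-by-character comparison.
    return a == b

a = sys.intern("some longish string value")
b = sys.intern("some longish string value")
assert a is b          # interning collapsed them to one object
assert fast_eq(a, b)
```

Non-interned strings still compare correctly; they just take the slow path.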

This is similar to a trick for representing bignums on a 32-bit system. You can use the lowest bit of a 32-bit value as a tag: if it's zero, then the number is a 31-bit integer; if it's 1, then the number (with the low bit cleared) is a pointer to a heap-allocated bignum. The common case is that you're doing operations with numbers that fit in 31 bits, so the code for handling bignums is avoided by the branch predictor. (This works even better on a 64-bit system, where the common case is even more common.)
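Python has no raw pointers, so the following simulates the tagging scheme with a side table standing in for the heap; the bit manipulation, though, is the same as in a real 32-bit implementation:

```python
# Simulation of the low-bit tagging scheme. A "heap pointer" here is an
# index into a side table, since Python has no raw pointers.
heap = []

SMALL_MIN, SMALL_MAX = -(1 << 30), (1 << 30) - 1   # 31-bit signed range

def make_number(n):
    if SMALL_MIN <= n <= SMALL_MAX:
        return (n << 1) & 0xFFFFFFFF       # low bit 0: inline 31-bit integer
    heap.append(n)                         # too big: "heap-allocate" it
    return ((len(heap) - 1) << 1) | 1      # low bit 1: heap reference

def read_number(v):
    if v & 1 == 0:                         # the common, branch-predicted case
        n = v >> 1
        return n - (1 << 31) if n > SMALL_MAX else n   # undo two's-complement wrap
    return heap[v >> 1]                    # rare case: chase the "pointer"
```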

As for cache effects, I think the main issues there, for ropes, are tree depth and chunk size. You want the tree to be shallow (so as to minimize the number of potential cache misses before you reach the chunk you're looking for) and you want the chunks to be large enough that they get good cache locality, but small enough that you get fast insertion/deletion/concatenation/etc.
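A minimal rope sketch (leaf chunks plus length-carrying internal nodes) makes that tradeoff concrete: every Concat hop during an index lookup is a potential cache miss, which is why you want shallow trees over biggish chunks.

```python
class Leaf:
    def __init__(self, text):
        self.text = text
        self.length = len(text)

class Concat:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length

def rope_index(node, i):
    # Walk down from the root, steering by the left subtree's length.
    while isinstance(node, Concat):
        if i < node.left.length:
            node = node.left
        else:
            i -= node.left.length
            node = node.right
    return node.text[i]
```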

Read the sentence before that:

> This also makes it possible to store only a single copy of short strings, to improve memory efficiency, cache locality, and so on.

He's specifically talking about interning strings. He even links to the wikipedia article on string interning. I recall reading that Lua interns strings by default, and a quick google search seems to confirm my belief. However, with only one example, you're right: interning /all/ strings is unusual. Interning all "short" strings, his suggestion, is a more feasible heuristic, and then (if finding the length of a string is O(1)) comparing two short strings amounts to little more than a pointer comparison. I don't know whether there are any languages that do that.
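A sketch of that heuristic (the cutoff and the table are made up here; a real runtime would do this once, at string-construction time):

```python
INTERN_LIMIT = 16        # "short" cutoff -- an arbitrary choice for illustration
_intern_table = {}

def make_string(s):
    # Intern only short strings; long ones aren't worth the table lookup.
    if len(s) <= INTERN_LIMIT:
        return _intern_table.setdefault(s, s)
    return s

def short_eq(a, b):
    # For two strings already known to be short (an O(1) length check),
    # equality after make_string is just a pointer comparison.
    return a is b
```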

This type of data structure is sometimes called a "rope". (get it?)

The list of textile-related terms which correspond to data structures includes string, rope, cord, yarn, weave, weft, thread, and even quipu. It's getting kind of ridiculous.

To be fair, a quipu has been a data structure, or rather a data storage medium, for at least 1200 years, and possibly several thousand years: http://www.charlesmann.org/articles/Science-khipu-decipher-0...

You forgot warp. And no, it is not ridiculous, because the first thing in the world you could program was a loom. We are all an offshoot of the textile industry.

I looked all of these up; in doing so, the only quipu I could find is on what I believe to be your own GitHub page. Are quipus (or, to use the Quechua plural, quipukuna) something you devised, or are you borrowing the name from elsewhere?

That would be my own contribution to the ridiculousness. See, I'm working on operational transformation based on this paper:


...and the author names every data type after something stringy. I wanted to continue the theme, but it took some serious Wikipedia digging before I could find something that was textile-related and not already taken. In retrospect, this was a bad choice, since I kept forgetting how to spell it. :-(

Most varieties of Quechua wouldn't spell it 'quipu' any more; that's the influence of Spanish orthography, where 'qu' meant 'k' and 'hu' meant 'w'. Most modern orthographies would spell it 'khipu', which might be easier to remember.

And for your future endeavors: twine, thrum, strand, and lisle are, as far as I know, unused.

Lolcode uses YARN. Then again, it's almost the same as BUKKIT OF NUMBAR... No wonder other languages don't use silly names.

Are you telling me that thread now means something in addition to an execution context?

If you need to access a UTF-8 string indexed by character rather than byte, you are probably doing something wrong.

You have an issue with fast text search?

If you aren't building an index ahead of time, you're writing a state machine, in which case you're better off treating it as bytes for efficiency and simplicity. (An array of 256 pointers vs a nearly unbounded dynamic map of pointers.)

If you are building an index ahead of time, just use byte offsets in your index instead of character offsets. You'll still be iterating through it from the beginning to build the index.
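For example (the strings here are hypothetical; the point is that a byte offset indexes straight into the UTF-8 buffer, while a character offset doesn't):

```python
text = "naïve café"
data = text.encode("utf-8")

needle = "café"
char_off = text.find(needle)                  # character offset: 6
byte_off = data.find(needle.encode("utf-8"))  # byte offset: 7, since 'ï' is 2 bytes

# An index storing byte offsets lets you slice the raw buffer directly:
assert data[byte_off:].decode("utf-8") == needle
```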

You can't treat it as bytes, because then you're dependent on what came before those bytes, which messes with your ability to skip ahead on a miss.

He's arguing that you don't need to skip ahead on a miss. If you're testing some UTF-8 string for membership in some other UTF-8 text, then go ahead and compare the individual bytes even if they don't line up as characters.

If you use a fast text search algorithm like Aho-Corasick, you'll still get reliable, accurate, fast text search. You will perform some unnecessary checks against invalid UTF-8 sequences, but you won't need any extra data structures or operations, and the algorithm can be implemented in a straightforward way. That convenience outweighs whatever you'd gain from a larger, more elaborate data structure.
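The reason byte-wise search is safe at all: UTF-8 is self-synchronizing, so a byte-level match of a valid UTF-8 needle can only start on a character boundary. A sketch using bytes.find as a stand-in for a real Aho-Corasick automaton:

```python
haystack = "日本語のテキストです".encode("utf-8")
needle = "テキスト".encode("utf-8")

pos = haystack.find(needle)   # pure byte-by-byte search
# The match necessarily lands on a character boundary, so the prefix
# before it is itself valid UTF-8.
assert pos == 12              # "日本語の" is 4 characters, 3 bytes each
assert haystack[:pos].decode("utf-8") == "日本語の"
```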

If you just need get-current-char and go-forward-n-chars operations, you can get those cheaply by having heavyweight strings implement an iterable API. (Yes, I know you also need the ability to iterate backward for Boyer-Moore, but that's also fairly straightforward.)
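Forward and backward codepoint stepping over raw UTF-8 is cheap because continuation bytes are all marked 10xxxxxx; a sketch (this steps by codepoint, not by grapheme, so a real go-forward-n-chars would need a bit more on top):

```python
def next_char_start(data, i):
    # Advance past one UTF-8-encoded codepoint starting at offset i.
    i += 1
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1   # skip continuation bytes (10xxxxxx)
    return i

def prev_char_start(data, i):
    # Step back to the start of the codepoint preceding offset i.
    i -= 1
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i
```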

This is not an argument against heavyweight strings, but an argument for better programming languages.

Well, the other advantage to using language-provided ropes (or some variation on them), especially in functional languages with the capability for parallel execution, is that rope structures (i.e. trees) are similar enough to Steele's conc lists (c.f. Guy Steele's "Organizing Functional Code for Parallel Execution") that searching in parallel could be achieved with little to no extra work from the programmer. When possible, that would be much better than straight Aho-Corasick on a byte array.

What if you're searching for substrings which span more than one chunk? What you'd end up needing to do is make a monoid of Aho-Corasick automaton state transition functions, and use a parallel prefix method. Steele gives examples of this sort of thing in the talk you mentioned. He and Danny Hillis go into more detail in their classic paper:


If you combine this with heavyweight strings, you can get incremental and/or parallel string matching, but it's almost as easy to do on a byte array.
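To make the monoid idea concrete, here is a toy version with a hypothetical three-state machine matching b"ab" (a real Aho-Corasick automaton has more states, but its per-chunk summaries compose the same way):

```python
def step(state, byte):
    # Tiny DFA for the pattern b"ab"; state 2 means "seen it" (absorbing).
    if state == 2:
        return 2
    if byte == ord("a"):
        return 1
    if state == 1 and byte == ord("b"):
        return 2
    return 0

def summarize(chunk):
    # A chunk's effect is a function state -> state, stored as a tuple.
    table = []
    for s in range(3):
        t = s
        for b in chunk:
            t = step(t, b)
        table.append(t)
    return tuple(table)

def combine(f, g):
    # Function composition: run f's chunk, then g's. Composition is
    # associative, so this is a monoid and admits a parallel prefix/reduce.
    return tuple(g[f[s]] for s in range(3))

chunks = [b"xxa", b"bxx"]   # the match "ab" spans the chunk boundary
total = summarize(chunks[0])
for c in chunks[1:]:
    total = combine(total, summarize(c))
found = total[0] == 2       # apply the combined effect to the start state
```

Because the summaries can be combined in any grouping, each chunk's table can be built in parallel before one cheap combining pass.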

Additionally, the larger elaborate data structure will almost certainly be slower.

I'm not understanding. In which case? Aho-Corasick works just as well on a sequence of bytes as on a sequence of characters.

Related paper by Boehm:


If I recall correctly, an implementation of "ropes" ships with the Boehm GC sources.

Small nit: Ruby actually represents strings as mutable objects.

Which does cause bugs from time to time that are occasionally hilarious, but usually mind-rendingly irritating to ferret out. Common Lisp strings are the same way (symbols in both languages being the immutable type).
