It works brilliantly for mostly string concatenation workloads, like rendering web page templates.
It gets tricky when deciding when to flatten it, e.g. if people are repeatedly mutating single characters or indexing.
Cool thing is that it can provide a char* interface to a rope data structure, in order to support C extensions not expecting ropes.
What surprised me is that Ruby template engines make heavy use of string concatenation? Like 'a' + 'b' ?
The generated Ruby fragment from ERB makes heavy use of string concatenation, which is generally favorable to ropes. The right-most graph of Figure 5 shows that TruffleRuby ropes are 9.4× faster than MRI, rendering around 870 000 templates per second.
I would have thought the idiom for writing a template engine is to use something like Java's StringBuffer, or the Python pattern of appending immutable strings to a list, and joining them at the end. So you avoid a ton of intermediate strings that way. (And I have written a template engine myself in both Python and JS, and that is the technique I used, and it's plenty fast)
Would that technique speed up ERB, or is there something I'm missing? Not to say it's not worthwhile to still do ropes, but it seems like low hanging fruit.
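The append-and-join idiom I mean, as a minimal Python sketch (the `render` function and the hole/const shapes are invented for illustration, not from any real engine):

```python
def render(parts, values):
    # Accumulate immutable fragments in a list; nothing is copied or
    # concatenated until the very end.
    out = []
    for part in parts:
        if isinstance(part, str):
            out.append(part)              # constant template text
        else:
            out.append(values[part[0]])   # a "hole" filled in from values
    # One join allocates the final string exactly once, instead of
    # building O(n^2) intermediate strings via repeated `+`.
    return "".join(out)

print(render(["Hello, ", ("name",), "!"], {"name": "world"}))  # Hello, world!
```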
Though what turned me off of ropes is that for many workloads, most strings are small. And small strings are often smaller than records of pointers (and they have more locality). If I have a pointer to a string, a start offset, and an end offset, on machines that's 8 + 4 + 4 = 16 bytes, which is already approaching the size of an HTML tag or two. And a GC has to chase that pointer (in my situation), whereas a GC can stop if it's a blob of bytes.
Not to say it's not a useful technique of course ... I think it does make sense as an optimization when you have existing workloads to measure, but I was hesitant to do it starting from scratch.
somewhat related: Elixir RAM and the Template of Doom https://www.evanmiller.org/elixir-ram-and-the-template-of-do... (a syscall optimization that goes well with ropes)
But doesn't appending to a string buffer still mean copying a lot of bytes as you build up the final string?
In ropes you don't need to copy anything, you just create a little node of a couple of words to remember that two strings were joined. Yeah there's non-locality and things like that, but most people are building up strings not reading them. Then you write out the final string - and we have system calls like writev that work with ropes.
Finally, remember that if your string is created and used within a single compilation unit, the rope nodes will disappear with escape analysis and scalar replacement of aggregates. That helps operations on strings fold and become constant.
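A toy sketch of what that "little node" looks like (illustrative only — class names are mine, and real rope implementations like TruffleRuby's track lengths, balance, and encodings):

```python
class Leaf:
    def __init__(self, s):
        self.s = s

class Concat:
    def __init__(self, left, right):
        # O(1) concatenation: allocate one small node pointing at the
        # operands; no bytes are read or copied.
        self.left, self.right = left, right

def flatten(rope):
    # Only when the final string is actually needed do we walk the
    # tree once and copy each leaf's bytes out.
    parts, stack = [], [rope]
    while stack:
        node = stack.pop()
        if isinstance(node, Leaf):
            parts.append(node.s)
        else:
            stack.append(node.right)  # push right first so left pops first
            stack.append(node.left)
    return "".join(parts)

r = Concat(Concat(Leaf("foo"), Leaf("bar")), Leaf("baz"))
print(flatten(r))  # foobarbaz
```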
Look at the article and see for yourself. They do the string-builder measurement and show results. It's in figure 5.
The "GEOMETRIC" approach to string building from small fragments is slightly faster than the rope approach. That's where characters are copied to a contiguous buffer, and the buffer is reallocated by doubling in size whenever it fills up.
The slowest by far is the "UNBUFFERED" [cord] version, which creates a new cord node for each append, instead of copying the characters.
"BUFFERED" [cord] is quite good. It's still a bit slower than "GEOMETRIC", but good enough.
In a nutshell, the paper shows that copying bytes, as long as each individual copy is small, is essential to all fast methods of string building.
That's because creating nodes and changing pointers is slower than copying contiguous bytes, up to some number of bytes. That number is probably larger now than it was when the paper was written due to modern memory architecture.
It claims that they make systems "more robust". I think they basically mean that certain apps that use long strings are less likely to run out of memory.
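The GEOMETRIC strategy described above is just the standard amortized doubling buffer; roughly this, as a sketch (not the paper's actual C code):

```python
class GeometricBuilder:
    def __init__(self):
        self.buf = bytearray(16)  # initial capacity
        self.len = 0

    def append(self, data):
        need = self.len + len(data)
        if need > len(self.buf):
            cap = len(self.buf)
            while cap < need:
                cap *= 2                          # double on overflow
            new = bytearray(cap)
            new[:self.len] = self.buf[:self.len]  # each byte is recopied
            self.buf = new                        # O(1) amortized times
        self.buf[self.len:need] = data
        self.len = need

    def result(self):
        return bytes(self.buf[:self.len])

b = GeometricBuilder()
b.append(b"hello ")
b.append(b"world")
print(b.result())  # b'hello world'
```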
I guess my response is that I don't think Python or JS really need native ropes, even for string heavy workloads like templating: https://news.ycombinator.com/item?id=24960806
(Also, I would expect penalties of pointer chasing and thus ropes to be much more severe in 2020 vs. 1995, like maybe 10x more?)
We claim that the traditional implementations of strings, and often the supported functionality, are not well suited to such general-purpose use. They should be confined to applications with specific, and unusual, performance requirements. We present ‘ropes’ or ‘heavyweight’ strings as an alternative that, in our experience, leads to systems that are more robust, both in functionality and in performance.
It is not clear whether the implementation we suggested above is in any sense optimal. However, it performs sufficiently well for our purposes, and sufficiently well that for most purposes it is preferred over traditional ‘flat’ string representations. Our experience has been that it results in software that is noticeably more robust, particularly in that it scales correctly to unexpectedly large inputs.
edit: I guess they are making a computational complexity claim then? But again I would say it's a common idiom in Python and JS to append to a list and join(). At least now it is, maybe in 1995 programmers weren't used to that idiom.
I guess what I'm saying is that if you "properly" write a template engine, there should be no advantage to ropes at all.
Rather than a tree of non-copied fragments in the language runtime (ropes), you can simply create an array of non-copied fragments in the language itself.
Following the template engine I wrote in Python and JS:
- Fragments come from two places: holes and consts. Holes come from variables, and consts are figured out at template compile time (which happens when the web server starts, and shouldn't be benchmarked).
- Fragments are never copied. They are simply appended to a big array of fragments. The template engine is a recursive tree walk interpreter that always appends to an array.
This is even true for a feature like #include. Some template engines may do the "right" thing for local fragments, but the copying may sneak back in for #include. But if you care, it's easy to handle #include in a consistent way.
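The architecture above, as a minimal sketch (node shapes and names are invented for illustration, not from my actual engine):

```python
def render(node, ctx, out):
    # Recursive tree-walk interpreter: every case appends fragments to
    # the shared `out` list, and never concatenates or copies.
    kind = node[0]
    if kind == "const":
        out.append(node[1])            # literal text, fixed at compile time
    elif kind == "hole":
        out.append(str(ctx[node[1]]))  # variable substitution
    elif kind == "seq":
        for child in node[1]:
            render(child, ctx, out)
    elif kind == "include":
        render(node[1], ctx, out)      # included templates share `out` too

tree = ("seq", [("const", "Hi "), ("hole", "name"),
                ("include", ("const", "!"))])
out = []
render(tree, {"name": "Ada"}, out)
print("".join(out))  # Hi Ada! -- the only copy happens here, once
```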
So if you use that "obvious" architecture, then I claim there should be no advantage to ropes. But I guess that's not what happens in practice!
(Well, seeing some sibling comments about copying being faster, at least the claim is that this architecture should be as fast or as slow as ropes! :) But yeah I think ERB is probably doing something different and even worse, which is not surprising.)
If you're building up your string as an array of strings, and you want to pass that to some existing code that expects a full string (like the web framework you're using) then you're going to need to flatten it to create the final full string.
Ropes don't need to do that. They still have an advantage.
Also... the implementation doing things for you so you don't have to manually optimise your code is kind of the point of an optimisation. We can always say 'but I could do this manually' to any optimisation, couldn't we?
- This wasn't an optimization, it was literally the first thing I thought of when I wrote my first and only template engine 10 years ago. And I have seen plenty of others do it; it is "obvious".
The code is the exact same length as the string version, so I don't consider it an optimization.
- Python's WSGI accepts an iterable of strings for the body. From a cursory reading, looks like Ruby's rack does too:
e.g. it's ["Hello, World"] and not "Hello, World" (I don't know Ruby that well though)
Maybe frameworks mess it up in the middle...
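For reference, a minimal WSGI app really does hand the server an iterable of byte strings, never one flat string (a sketch; any real app would set Content-Length etc.):

```python
def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    # The body is an iterable of byte-string fragments; the server can
    # write them out one by one without ever joining them itself.
    return [b"Hello, ", b"World"]
```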
Depends on what you mean by "a lot", doesn't it? Yes, you are going to copy all the bytes in the final result once (at least). But it's hard to see how to avoid that, and no, delegating that task to the kernel doesn't magically/consistently make it faster.
> you just create a little node of a couple of words
Hmm..."just" creating a "little" node can often be quite expensive on today's machines. And "words" are pretty big, typically 8 bytes per word.
> Yeah there's non-locality and things like that
Well, non-locality and "things like that" are kind of a biggie, these days. The biggie, actually, most of the time.
"...computation is essentially free, because it happens “in the cracks” between data fetch and data store; on a clocked processor, there is little difference in energy consumption between performing an arithmetic operation and performing a no-op." -- Andrew Black in Object-oriented programming: some history, and challenges for the next fifty years
Particularly, you 'actually...' me about how fetch and store dominate... but we're not storing anything into existing memory! To concatenate strings we only create new nodes and refer to the old nodes. You don't even need to read what's in the existing nodes! The essential fast-path of building up a template is write-only with bump-allocation into the extremely well-managed TLAB. No reading required. That's why it works so well.
Yes traversing the tree means writing it out is a little slower, but that's one small part at the end, with good pre-fetching opportunities.
Now CPUs are in the GHz range, and memory is a huge bottleneck. This is why a significant portion of the CPU is the integrated memory controller and various levels of cache.
With the current paradigm, cache friendliness will often trump something that has fewer instructions. With regular strings, scanning or copying a string is an in-order memory access which will take advantage of the cache's pre-fetching. With ropes, there will be pointer chasing, which will cause all sorts of cache misses.
So, I think ropes are likely an even more niche use case than they were in 1995.
Yes it was. 1995 was not 1965.
> This is why a significant portion of the CPU is the integrated memory controller and various levels of cache.
Microprocessor architectures available in 1995 had L1 and L2 caches. L2 was available for Pentiums, so it basically trickled to the consumer market.
These caches were small: misses were more likely, so optimizing for locality was super important.
I remember hearing that same talk in the 1990's: you don't want linked lists: think of the pointer chasing and cache misses, and the storage overhead for the pointers that wastes your precious little cache space (8K in an 80486).
> With regular strings, scanning or copying a string is in order memory access which will take advantage of pre-fetching of the cache.
The point of ropes is to handle very long strings. Scanning a long string, many kilobytes or even megabytes long, is not a simple memory access, though. If you have to copy the entire tail of the flat string to insert a character, but a rope does not, the rope can win in that case. See the motivating statements in the paper: "Common string operations should scale to long strings. There should be no practical bound on the length of strings. Performance should remain acceptable for long strings."
There are various trade-off levels possible in ropes depending on the chunk sizes: a rope can have a small number of large pieces or a large number of small ones.
I suspect why ropes are perhaps not popular is that people don't care that much about the utmost performance in string processing. A performance hiccup in some string processing isn't going to make video skip, a game framerate lag, or a CNC cutter spoil a piece. A lot of text processing is batch processing, or else non-real-time interactive (like in a text editor). Moreover, the cases where ropes are useful (huge strings) are a niche within text processing. People often don't have long strings so they don't feel the sting of bad performance.
in a 2020 machine with a normal workload I have yet to see a list outperform a contiguous array outside of pathological cases
> The point of ropes is to handle very long strings. Scanning a long string, many kilobytes or even megabytes long
I mean, my 2016 CPU has 20 megabytes of L3 cache, you can fit a fair amount of text in that, likely more than 99.9% of average text editor usage
Even if we completely eliminate all caching effects, so that every memory access is uniform, regardless of past access history, then arrays are still going to be faster in many cases than linked lists. Obviously, in the random access case: fetching the n-th element. Pointers are metadata; a linked list has metadata for every node, and that requires memory bandwidth to suck through the processor. If every list node has a word of pointer for a word of data, then it means scanning through a million word list means we are loading two million words. If the data is aggregated into an array, then scanning it means we are loading only one million words. That's half the cycles. Likely fewer instructions executed in the loop and so on.
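That bandwidth arithmetic, written out (a back-of-envelope sketch assuming one machine word per datum and per pointer, ignoring allocator headers and caching):

```python
WORD = 8  # bytes per machine word on a 64-bit machine

def bytes_loaded(n, linked):
    # Scanning n one-word elements: an array loads n data words; a
    # singly linked list also loads one next-pointer per node.
    words = n * (2 if linked else 1)
    return words * WORD

n = 1_000_000
print(bytes_loaded(n, linked=False))  # 8000000  (8 MB of traffic)
print(bytes_loaded(n, linked=True))   # 16000000 (twice the traffic)
```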
Note that arguments like, "but in year 20XX, such a case now fits into my Ln cache" work toward arguing that caching is less relevant. The more cache you have, the less relevant it is. As a cache gets bigger, more and more situations that previously exhibited "poor locality" now have "good locality" of reference.
Your 2016 machine can cache much more of a given linked list than your 1995 machine; it performs relatively better with linked lists than the 1995 machine, even if we account for basic differences of instruction issue rate (pipelining, branch prediction, etc).
I guess size(array)+1 is a pathological case. Shame it seems to happen in every code base I've looked at.
Like, every codebase I've worked on had some array sorting at some point in it... but the frequency of needing to sort something vs having to loop over all the X makes it a completely not-relevant thing.
If the only time you add an element to an array is as a result of the user clicking somewhere, then I'd say it's definitely not worth taking into account if your array isn't 5000000 elements (which it likely won't be if you are giving the user the ability to just add something to it).
You should probably only use data structures with a lot of indirection if you really can measure it to be faster in the situation.
Aren't these all arguments in favour of ropes?
Ropes are write-only when being constructed. They're not just cache friendly, they're cache oblivious!
Do read the caveats in the header comments before using it in lieu of std::string.
This would give us structures that work well in multi-processor environments, due to their immutability and lock-free nature, handle Unicode in a sane and safe manner that saves a lot of headaches, and do small, common operations quickly, saving many pointer operations.
They're also good for forming larger strings by concatenating smaller ones. But if you can structure your code so that it just appends to a buffer, for example, by threading a mutable string buffer reference through function calls rather than returning strings, that's also faster.
So, as with most such tradeoffs, it depends on what you're trying to do.
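Threading a mutable buffer through calls, versus returning strings, looks like this (an illustrative sketch with made-up function names):

```python
def greet_returning(name):
    # Returns a fresh string; callers assembling many pieces this way
    # allocate an intermediate string at every level of the call chain.
    return "Hello, " + name + "!"

def greet_threaded(name, buf):
    # Appends into a buffer owned by the caller; nothing is returned,
    # so no intermediate strings are built up the call chain.
    buf.append("Hello, ")
    buf.append(name)
    buf.append("!")

buf = []
greet_threaded("world", buf)
print("".join(buf))  # Hello, world!
```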
Another reason might be compacting GC. When you compact the heap, you need to rewrite pointers, so you want fewer objects and fewer pointers overall. When your string is just one array, that's much friendlier than having a bunch of substrings with pointers.
If you're dealing with megabyte strings, you need to think about algorithms, ropes might be the right choice.
Nearly three decades of utility since, quite the occasion!
Certainly, we could all have been wrong this entire time. So often in programming I find that my working, perfectly useful software, unbeknownst to me, was wrong all along! Imagine my face on the day!
(I am being facetious, it is of course perfectly valid to challenge the status quo)
Otherwise, why do we say 'trees' when we could use the more general term 'graphs' if that's your argument?