
Never create Ruby strings longer than 23 characters - ctaglia
http://patshaughnessy.net/2012/1/4/never-create-ruby-strings-longer-than-23-characters
======
nly
This is known as the "small string optimisation" in C++, so you can see a
similar implementation in Clangs libc++[1].

One interesting corollary is that _moving_ short strings in an implementation
that does this could actually be ever so slightly (negligibly) slower than
moving long ones (since byte copies are slower than word copies). But
generally, this is a free lunch optimisation and can save you hundreds of megs
of memory when writing programs dealing with millions of short strings.

[1] [http://llvm.org/svn/llvm-
project/libcxx/trunk/include/string](http://llvm.org/svn/llvm-
project/libcxx/trunk/include/string) \- search for "union"

------
Someone
[http://www.slideshare.net/nirusuma/what-lies-beneath-the-
bea...](http://www.slideshare.net/nirusuma/what-lies-beneath-the-beautiful-
code) (from march 2012) also discusses this.

Also (pedantic):

    
    
       #define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))
    

sizeof(char) is always 1, so that division is superfluous.

~~~
BudVVeezer
sizeof(char) is implementation defined; see limits.h for more information on
the probable size of char for your target. If the sizeof(char) is 1, the
division will be optimized away, so there's no loss by keeping the code
portable.

~~~
Sharlin
No, the size of char (in bits) is implementation defined, but sizeof(char) is
defined to be 1, no matter what its size in bits.

------
danielweber
More like "ruby optimizes for short strings, and chose 23 at the cut-off point
for Reasons."

~~~
yapcguy
Can't wait for someone to write a new faster better string class which handles
strings of any length by internally chopping them into 23 character
portions....

~~~
fat0wl
lol yes....... how reasonable. ahah when i saw the title of the article all i
could think was "click comments to tune in for the most amusing flamewar this
week" but so many of these comments are like "so... its 23 characters.... why
not?!"

comeon peopleeeee, a _bit_ of an arbitrary internal standard, no?

i understand the point about "CONCLUSION: it doesn't matter for a few
strings!" but.... comeonnnn it must matter on some level, otherwise why is
Rails such a pain in the ass to optimize? these things must add up...

~~~
EpicEng
What non-arbitrary "small string" length would you chose?

Btw, the fact that you don't understand it doesn't make it arbitrary. Hint:
it's not arbitrary at all.

~~~
fat0wl
seriously... this?

#define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))

explain "3" in terms of your comp sci 101 divine computer law... Ok I get
itttt "len/ptr/capa/shared". I've never written an interpreter so I can't
rewrite the implementation myself, but I've worked with Rails enough to know
that Ruby's is terribly inefficient, and these are problems created in it's
abstraction from C, not inherent to C. Maybe this RString case is a minor
issue, but it seems to me to lack some rigorousness, to be a performance leak.
It's one of those things that makes you wonder "What Would Linus Do?".
Although I do understand the points about including it AT ALL to be "free
gain, little loss" and the language can't be C, but sometimes I don't know....

Just admit that maybe something is wrong here. Please. you'll be the first
Rubyist ever, the dawn of a golden era. I've been intrigued more and more by
the Lisp community because many of those devs seem to have a philosophical
unwillingness to compromise performance and proper engineering, whereas in
Ruby people get mad at you for not embracing the compromises!! Geez.

Hint: writing "Hint:" and then saying something really obvious that you
learned from the article we all just read makes you sound craaaaaazy obnoxious

(Incidentally, it's issues like this that make me thank my lucky stars I do my
web work on the JVM now.)

~~~
EpicEng
Haha, oh, I'm not a rubyist. Far from it; I'm a systems programmer. My ruby
experience comes from writing add-ons for a hobbiest video game engine. I
agree that ruby is often quite inefficient.

However, in this case, the number 23 comes from the amount of space that would
be needed to store the smallest string in heap memory after taking into
account malloc's bookkeeping info and whatnot.

As an aside, if you already knew the answer, why go on a rant that makes you
seem ignorant?

~~~
fat0wl
I mean I understand the logic of optimizing for cases when you are going to be
less than the lower-bound of data you can throw to the heap as you explain. I
guess my question (I am honestly asking this one) is why not use the struct
char array approach or some variant of it to optimize in the other direction
as well (using 256 as the limit, for example, since it is closely associated
with DB applications)? Is there a concern for wasted memory, or would it screw
up their implementation of "union" or something? (I will, btw, willfully admit
ignorance on some super in-depth C mechanics but I don't think it invalidates
my point.)

I've heard of cases where Ruby performance supposedly increased tremendously
because of raising some memory / garbage collection limits, and in today's
world I always wondered why they don't try harder to make the sacrifice on
memory footprint in order to gain speed. As a platform that is to this day
routinely blasted for being so damn slow (I'm sure a lot of this falls on
Rails' shoulders especially since ActiveRecord still was glacial when I
stopped following... ~1.3 seconds of Ruby processing to render simple pages
IIRC) why not start inserting some performance hacks or re-implementing with
performance as top priority?

The answer to your question is that I'm sociopathic on the internet.

~~~
vidarh
It would be a tremendous waste of memory, as a large percently of strings
_are_ small. It would also be a massive _slowdown_ for any situations where
you are copying a big string unchanged (in which case Ruby "just" marks the
string as "copy on write") or where you are copying a small string.

As for slowness, there are many, many obvious targets much higher up the list
than String. Slow method dispatch being the primary culprit.

~~~
fat0wl
I feel you..... I was more wondering if there was some better compromise point
between pre-allocated / dynamic but as you say, there are tradeoffs.

You're right, there must be a lot more low-hanging fruit out there. Something
interesting though... isn't the Ruby interpreter just a standard (RubySpec)?
JRuby (and Rubinius I think?) are other, better-performing implementations,
correct? I've been thinking about trying JRuby since I was looking into Grails
and thought "eh why not just try JRuby on Rails".

but i've kinda gone in the other direction anyway and am more interested in
Clojure > Ruby, especially since Clojure "frameworks" are just dead simple
libs. It's weird because RoR was around first (I think, maybe some overlap in
development...) but it seems that now that Clojure and Scala exist, RoR is
kindof like a bad compromise between the two.

Though Clojure is kindof minimal and not everyone is singing Scala's
praises...

------
ben0x539
There's some discussion at
[https://news.ycombinator.com/item?id=3425164](https://news.ycombinator.com/item?id=3425164)
, including some interesting technical/benchmarky comments.

------
ra88it
Title: "Never create Ruby strings longer than 23 characters"

Conclusion: "Don’t worry! I don’t think you should refactor all your code to
be sure you have strings of length 23 or less."

------
spoiler
This is MRI (C Ruby) behaviour and _not_ Ruby - specific , though. However,
this is still interesting information.

------
anon4
Wouldn't it be better to use this declaration though:

    
    
        struct RString {
    
          struct RBasic basic;
    
          union {
            struct {
              long len;
              char *ptr;
              union {
                long capa;
                VALUE shared;
              } aux;
            } heap;
        
            char ary[];
          } as;
        };
    
        /* apologies if I messed up the syntax here */
        #define RSTRING_EMBED_LEN_MAX (sizeof(((RString*)(0))->as) - 1)
    

Then you can even use the padding the compiler added, if any, plus you can add
more things to heap and the embed length will grow automatically.

------
markburns
For anyone interested, he points to an older translation of the Ruby Hacking
Guide, there is a pretty much complete translation at

[http://ruby-hacking-guide.github.com](http://ruby-hacking-guide.github.com)

~~~
alecdbrooks
Thanks for the link! I'm not interested in Ruby per se, but it's fascinating
nonetheless from the perspective of data structures and how they are
implemented in C.

On a related note, I've found a much less comprehensive (but still useful)
guide to Python internals: [http://tech.blog.aknin.name/category/my-
projects/pythons-inn...](http://tech.blog.aknin.name/category/my-
projects/pythons-innards/).

------
gaius
I suppose the thing to do is analyse your app for the average string length,
and just recompile your Ruby with that. Would be even better of it was a
command line parameter.

~~~
throwaway0094
This isn't quite right. Even if your average string length is 1k+, you
shouldn't change the embedded string size to 1k+. I think these objects sit on
the C stack internally, which doesn't handle large objects like this well.

Also, I would guess the performance gains (from skipping malloc) would wash
out the longer your average string gets -- even if the huge stack use doesn't
kill your performance for some other reason (blowing the d-cache?).

~~~
ben0x539
I don't think these strings ever sit on the C stack, except maybe if some C
code/extension is being really clever. The standard representation for
variables is a tagged pointer as far as I know, so I would assume that is all
that goes on the stack. This optimization probably just saves another level of
indirection.

~~~
throwaway0094
Hm, ok. Still, at some level fragmentation is eating gobs of space even if
correct for the "average".

------
pedrocr
Why does "str2 = str" actually allocate a new RString instead of just pointing
both str and str2 to the same RString?

~~~
alecdbrooks
That's what it is doing. The additional RString structure associates the label
"str2" with the characters (on the heap) allocated for the original string.

Ruby experts can correct me if I'm wrong, but when Ruby sees a name like
"str2" it looks it up in a table, which points it to the RString structure.
From there, it can follow the pointer to the actual array of characters, which
in this case is only stored once.

~~~
pedrocr
According to the article both str and str2 will point to the same char[] on
the heap, but they are represented by two different RString objects. As you
said when you want to access str and str2 you need to look them up in a table.
So why not have both entries on the table point to the same RString, instead
of pointing to two different RString's that point to the same char[]?

~~~
alecdbrooks
I'm sorry, I misunderstood your question. Apparently what you describe is
actually the case, as the link posted by danieldk below describes.

------
grosbisou
Extremely interesting. But I cannot quite understand why RSTRING_EMBED_LEN_MAX
is calculated that way.

VALUE seems to be unsigned int defined via "typedef uintptr_t VALUE;" and
"typedef unsigned __int64 uintptr_t;"

But why is it calculated like that I don't get. Anyone can explain?

~~~
Sharlin
The small string buffer should be the same size as the "heap" struct so as not
to waste memory -- remember, they shared the memory as they're members of a
union. The heap struct contains three members which, taking into accoult
alignment restrictions, usually add up to three times the machine word size
(which is basically what sizeof(uintptr_t) is). The "-1" is because C strings
are null-terminated, so the maximum length is one less than the size of the
buffer.

What I don't know is why they don't simply use sizeof(heap) as the buffer
size.

~~~
grosbisou
Ah that was obvious. Thanks, very clear answer.

------
gesman
I wonder why they didn't make cut-off optimization points at 33?

When programmers don't know in advance how long name/email/input/whatever
field is going to be - they just use the magic "power of two" length :)

So 32 (or 33) in this case would be more reasonable.

~~~
gliese1337
Because of this line:

    
    
        #define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))
    

23 wasn't _chosen_ , it was _calculated_ to be the size that would be required
for a struct describing a heap string, and will actually be a different number
for different architectures. Choosing to make it bigger would add unnecessary
overhead to the RString struct.

------
badman_ting
Reminds me of this Mr Show sketch :)
[https://www.youtube.com/watch?v=RkP_OGDCLY0](https://www.youtube.com/watch?v=RkP_OGDCLY0)

------
throwaway0094
Is Ruby's internal encoding UTF-8, then?

~~~
sluukkonen
Each String in Ruby has their own encoding. But by default, it is UTF-8 these
days.

------
jokoon
"never use ruby" works well for me

~~~
ctrager
The designer of a string class in any language. C++ and Java - has to deal
with the same issue - that heap allocations are slower than stack allocations.
But to do a stack allocation means reserving some fixed length memory which is
a waste if you have a lot of small strings. It's a tradeoff. The Ruby approach
is reasonable. I think in Microsoft's C++ STL library, the limit is 16 rather
than 32. Even with the low-level closer-to-the-metal power of C/C++, the
string class designer still has to make a decision about the tradeoff.

~~~
jokoon
Strings are overrated, they should never be used until you really need them.

------
drakaal
Who needs more than 23?

~~~
drakaal
This comment is also 23

------
corresation
This all sounds rather terrible for Ruby, doesn't it? It isn't so much that
the short string is faster (though I'm left unclear whether it itself is on
the stack/heap, though given the GC nature of Ruby and practical
considerations of the language, it must be the heap), but rather that the cost
of the short string is _also_ added to the long string in the heap (assumed)
allocation of the RString (which becomes larger and thus more difficult to
malloc).

If this is intended to sit on the stack, which I find highly unlikely
(especially given the timings that seem to be the delta between one malloc and
two, and would be much more significant if it were a stack allocation versus a
heap allocation. This is not comparable to small string optimizations for the
stack in C++), maybe. But otherwise it seems like a poorly considered hack.

The string type could as easily have been dynamically allocated based upon the
length of the string, where the ptr by default points inside that same
allocated block. If the string is expanded it can then be realloced and the
string alloced somewhere else. No waste, a single allocation, etc.

~~~
gliese1337

        allocation of the RString (which becomes larger and thus more difficult to malloc), and the 23 string bytes that will sit unused for longer strings.
    

I got the distinct impression (backed up by an actual code snippet defining
the max embedded string size) that the 23 byte limit was calculated to exactly
match the size of the data that would otherwise have to be stored for a heap
string anyway. Thus, it doesn't actually take any extra space in the struct,
and those 23 bytes do not go unused in other strings.

~~~
corresation
You're exactly right on the union: I tried to edit out my error on that before
someone noticed (my principal point is about the mallocs), but your comment
appeared right as I saved. Shame be upon me.

So this holds (on a 64-bit machine) 8-bytes for the pointer, 8-bytes for the
length (string not null terminated), and 8-bytes for the capacity.
Alternately, via a union, it stores 24 bytes of string (null terminated). It
knows whether it is a or b via a separate flag that it holds separately in
RBasic.

I retract my jab about memory loss, but it still sounds rather terrible. Every
bit of code dealing with strings needs to validate flags on every use to
determine what it is dealing with, alternate between length specified or null
terminated, etc. Ugh.

~~~
vidarh
Look up a C++ string implementation sometime, and you'll find that this is
almost exactly how most efficient C++ string implementations do this too.

And it doesn't need to alternate between length specified or null terminated -
they're all length specified (or how do you think short Ruby strings are also
able to store ASCII NUL)

