
Never create Ruby strings longer than 23 characters
http://patshaughnessy.net/2012/1/4/never-create-ruby-strings-longer-than-23-characters
======
minimax
He's nearly there, but the reason for the number 23 is staring you right in
the face. It's not arbitrary. A heap allocated string requires 8 bytes for the
length, 8 bytes for the pointer to the string, and 8 bytes for the capacity /
reference count. The sum is 24 bytes.

So you either use it as a length/pointer/capacity/refcount struct, or save the
malloc and use the 24 bytes directly for 23 chars and a NULL terminator.

~~~
charliesome
_use the 24 bytes directly for 23 chars and a NULL terminator._

Ruby strings aren't null terminated though

EDIT: So I went and looked at the code, they _are_ null terminated, but as far
as I can tell, Ruby doesn't rely on this directly.

~~~
iclelland
Pardon?

The code is right there in the article -- in this case, for this specific kind
of string, at the C level, they _very clearly_ are null-terminated strings.

No, you don't get to see that at the Ruby level, it's all nicely abstracted
away, but that is exactly what is happening inside the VM.

~~~
ComputerGuru
Not really. RString operates in the same way std::string does - it has a
character array and it has a member variable denoting the length.

It's not null-terminated. You can store a sequence of nulls and that will not
affect the result of std::string.size()

In C, you'll be forgiven for thinking it was null terminated, because
attempting to assign a std::string a value from a null-containing array of
characters would terminate the copy upon reaching the null, but that's only
because the original char array is null-terminated when read as a C string.

However, you can manually construct a std::string with \0 sequences in the
middle and that will _not_ terminate the string, nor affect the separate
length calculation. The same applies for Ruby's RString.

So that was the reason they're _not_ null-terminated. Now the reason why they
technically are (i.e. the reason NULLs are stored at the end of the string) is
for compatibility and optimization. At the cost of one byte per string (for
the trailing \0), we get instant compatibility with non-RString/std::string
functions. If a function needs a C string, we can just pass the internal
pointer to the character array - no need to copy the string to a temporary
buffer and append a null.

Therefore, while null-termination is absolutely NOT required when dealing with
an exclusively counted-length implementation of C strings (a la RString,
CString, std::string, etc.) if you can just pass the pointer and the length
separately, it would be a ridiculously foolish optimization for a general
string implementation to NOT have the option of directly exposing the
underlying null-terminated string to any functions that need it, with the
caveat that null-containing counted strings will obviously terminate sooner
than expected.

~~~
mbell
You seem to be confusing C and C++; there is no std::string in C, and MRI is
written in C.

A 'C String' is by quasi definition, a segment of memory that can be properly
processed by the string functions in the C standard library, which requires
null termination.

>Now the reason why they technically are (i.e. the reason NULLs are stored at
the end of the string) is for compatibility and optimization.

Really it's just because they are C strings; that is, they use the C standard
library string functions, and if you want to use those, you must null terminate.

>Therefore, while null-termination is absolutely NOT required when dealing
with an exclusively counted-length implementation of C strings (a la RString,
CString, std::string, etc.)

None of those are implementations of "C Strings", they aren't even available
for C.

The determination as to whether you're using null-terminated strings or not
comes down to the string library you're using. If you're on C, you're probably
using the C standard library and need to null-terminate your 'strings'. There
really isn't much more to it than that.

~~~
ComputerGuru
No, I'm perfectly well-versed in the differences between C and C++, having
written in one or the other for a long time. A trivial look-alike
implementation of std::string can be written in C, and would look a lot like
the RString class.

Your argument is essentially mine. The need to use the platform's string
functions heavily swings (but does not force) the choice of null-terminating
the RString members. As I mentioned, it would be really stupid but
_entirely possible_ to simply clone the non-null-terminated string into a
temporary null-terminated char array every time you want to use a function
that takes standard "C strings" if you really, truly, madly wanted to have an
RString implementation that was one byte smaller to store. But that would be
insane.

------
ComputerGuru
I read the title and thought that I was going to see some really stupid design
decisions. To the contrary, it's very clean and smart.

It's not that strings of 24 characters or more are slow; it's that strings
shorter than 24 characters get extra optimizations.

It's a great article, but I really, really despise the linkbait title.

~~~
nonrecursive
I work with Pat and kicked around some title ideas with him yesterday. It's
interesting to see a title that he created out of a sense of fun and
excitement being perceived as linkbait. I can see why that perception would
arise, but I hope people think of him as a good writer trying to have fun
rather than some dude writing linkbait titles trying to get attention.

Some other title ideas:

* 23 characters ought to be enough for anybody

* 23: How Twitter could have solved their ruby scaling problem

~~~
ComputerGuru
Heh. I would have gone with "Ruby Optimizations: Why short strings are an
order of magnitude faster" or something.

------
dpeck
Terrible title, but the content is quite good. Ruby programmers, at least
here, should have enough foundations to be able to understand these "deep"
dives into the interpreters, and the more you understand the hows and whys of
the tools you build on, the better your end product will eventually be.

~~~
endgame
Language enthusiasts in general, not just Ruby programmers, will enjoy this
article, I think. In any case, I got to see a cool little optimisation that I
hadn't thought of before.

~~~
pat_shaughnessy
Thanks a lot for the nice comments, guys! Sorry about the "link bait" - I
really was just so surprised that the limit was 23, such a strange number,
that I just had to put it in the title of the post. I didn't expect it to end
up on HN.

------
parfe
Unicode of course takes up more space and fills up your buffer sooner. Looks
like the jump happens after 8 chars.

    
    
      Benchmark.bm do |bench|
        run("と", bench)
        run("がと", bench)
        run("りがと", bench)
        run("ありがと", bench)
        run("ありがとあ", bench)
        run("ありがとあり", bench)
        run("ありがとありが", bench)
        run("ありがとありがと", bench)
        run("ありがとありがとあ", bench)
        run("ありがとありがとあり", bench)
        run("ありがとありがとありと", bench)
        run("ありがとありがとありがと", bench)
      end
    
                   user     system      total        real
      2  chars  0.210000   0.000000   0.210000 (  0.212420)
      3  chars  0.200000   0.000000   0.200000 (  0.199957)
      4  chars  0.200000   0.000000   0.200000 (  0.199356)
      5  chars  0.200000   0.000000   0.200000 (  0.199142)
      6  chars  0.200000   0.000000   0.200000 (  0.198047)
      7  chars  0.190000   0.000000   0.190000 (  0.198984)
      8  chars  0.190000   0.000000   0.190000 (  0.196917)
      9  chars  0.250000   0.000000   0.250000 (  0.245808)
      10 chars  0.240000   0.000000   0.240000 (  0.247153)
      11 chars  0.250000   0.000000   0.250000 (  0.248083)
      12 chars  0.250000   0.000000   0.250000 (  0.247753)
      13 chars  0.240000   0.000000   0.240000 (  0.250674)
    

Grabbed those unicode chars from [http://blog.trydionel.com/2010/03/23/some-
unicode-tips-for-r...](http://blog.trydionel.com/2010/03/23/some-unicode-tips-
for-ruby/) no clue what that says.

~~~
aaronblohowiak
I'm going to split some hairs, because it matters for the topic at hand.

>Unicode of course takes up more space and fills up your buffer sooner. Looks
like the jump happens after 8 chars.

It sounds like you are conflating unicode with UTF-8. There is more than one
way to represent the unicode code points, and UTF-8 is one of them. Further,
it seems like you assume that "unicode characters" have a constant size. This
is a potentially dangerous misunderstanding of how UTF-8 works. UTF-8 encodes
code points with a variable number of bytes (from one to four, IIRC). You
happen to have copied some code points that take 3 bytes each.

The UTF-8 encoding scheme is a great compromise, and the wikipedia article is
easy to follow: <http://en.wikipedia.org/wiki/UTF-8>
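The variable-width point is easy to check from Ruby itself with String#bytesize, which is what pushes the embedded buffer past 23 bytes sooner for multibyte text:

```ruby
# Each of these is a single character, but their UTF-8 encodings
# take a different number of bytes.
["a", "é", "と"].each do |ch|
  puts "#{ch.inspect}: #{ch.length} char, #{ch.bytesize} bytes"
end
# "a" is 1 byte, "é" is 2 bytes, "と" is 3 bytes in UTF-8
```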

~~~
NinetyNine
I also used to believe Unicode and UTF-8 were different types of encoding
until someone corrected me. I just remembered why I had thought such a thing
in the first place:

[http://msdn.microsoft.com/en-
us/library/system.text.encoding...](http://msdn.microsoft.com/en-
us/library/system.text.encoding\(v=vs.71\).aspx)

~~~
waterside81
You and probably everyone else the first time they encounter unicode / UTF-8.
I wonder if it's because both terms start with 'U'.

------
tptacek
Interesting deep dive; but remember that calling :+ to append to strings is a
pessimization (in both Ruby and Python); :join'ing a list of 2 strings is
about as fast as :+'ing them together, but :join'ing 3 strings is about twice
as fast.

~~~
teaspoon
I was curious, so I benchmarked this with Ruby MRI. _join('')_ beat
_reduce(:+)_ even on two strings, but is twice as fast only on four or more
strings.

    
    
       $ ruby -v
       ruby 1.8.7 (2010-01-10 patchlevel 249) [universal-darwin11.0]
       $ ruby benchmark.rb
             user     system      total        real
       add  2  0.390000   0.000000   0.390000 (  0.385910)
       join 2  0.300000   0.000000   0.300000 (  0.305812)
    
       add  3  0.530000   0.000000   0.530000 (  0.527388)
       join 3  0.320000   0.010000   0.330000 (  0.329390)
    
       add  4  0.720000   0.000000   0.720000 (  0.720500)
       join 4  0.350000   0.000000   0.350000 (  0.353237)
    

Code: <https://gist.github.com/1562617>

~~~
bodhi
In case you were wondering why, it is most likely* because `#join` is
appending to a single string, whereas the `#reduce` call is creating
intermediate strings for each step of the fold.

* I'd have to check the implementation of `#join` to be sure.

~~~
teaspoon
I thought so too, but appending to a single string does even worse than
reduce(:+):

    
    
      add 2  0.390000   0.000000   0.390000 (  0.390529)
      +=  2  0.540000   0.000000   0.540000 (  0.537450)
    
      add 3  0.530000   0.000000   0.530000 (  0.534131)
      +=  3  0.760000   0.000000   0.760000 (  0.752280)
    
      add 4  0.660000   0.000000   0.660000 (  0.668154)
      +=  4  0.960000   0.000000   0.960000 (  0.954727)
    

Code: <https://gist.github.com/1562994>

~~~
subwindow
+= is not append

    
    
      str += append_str
    

is equivalent to:

    
    
      str.dup << append_str
    

That is, it copies the string and then appends to the copy. Benchmarking the
speed of a true append is difficult because in order to preserve the original
string in the benchmark you must dup it anyhow (done here outside of the
bench.report block). However, the bigger the strings get, the more pronounced
the advantage of appending is.

    
    
                    user     system      total        real
      add  2      0.260000   0.000000   0.260000 (  0.263414)
      join 2      0.320000   0.000000   0.320000 (  0.325341)
      append 2    0.230000   0.010000   0.240000 (  0.235669)
    
      add  3      0.500000   0.000000   0.500000 (  0.497219)
      join 3      0.840000   0.020000   0.860000 (  0.866401)
      append 3    0.260000   0.010000   0.270000 (  0.268676)
    
      add  4      0.360000   0.040000   0.400000 (  0.397778)
      join 4      0.970000   0.030000   1.000000 (  0.997321)
      append 4    0.280000   0.000000   0.280000 (  0.281020)
    
      add  5      0.460000   0.000000   0.460000 (  0.454780)
      join 5      0.910000   0.030000   0.940000 (  0.946600)
      append 5    0.330000   0.000000   0.330000 (  0.336214)
    

Code: <https://gist.github.com/1563290>
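The copy-vs-append distinction above is visible from Ruby itself via object_id: `<<` mutates in place, while `+=` rebinds the variable to a freshly built string.

```ruby
s = "abc"
original_id = s.object_id

s << "def"             # true append: mutates the same object
raise unless s.object_id == original_id

s += "ghi"             # sugar for s = s + "ghi": builds a new string
raise unless s.object_id != original_id
puts s                 # => "abcdefghi"
```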

------
adriand
This was interesting and caused me to go off on a neat little tangent too.

I was curious about the VALUE declaration in:

    
    
        struct RString {
          long len;
          char *ptr;
          VALUE shared;
        };
    

From the Hacker Guide referenced, I found this definition:

    
    
        typedef unsigned long VALUE;
    

This is then cast, when needed, to a pointer to whatever type of struct you
are dealing with. How does that work?

Well, on a 32-bit machine, if I'm correct in my reading, an unsigned long is 4
bytes in size and can contain these numbers: 0 to 4294967295

That last number is 4 gigabytes, which is the size of byte-addressable memory
on a 32-bit machine. So VALUE can point anywhere. Neat!

~~~
charliesome
Fun fact: a VALUE is not only a pointer. Ruby takes advantage of the fact that
pointers are aligned to a certain boundary (so a few of the least significant
bits are always zero) and stores flags in the least significant bits.

For example, if the LSB is 1, then the VALUE isn't a pointer - it's a Fixnum.
This is why the maximum size of a Ruby Fixnum is one bit less than the pointer
size on the machine.

~~~
derleth
This is called 'type tagging' or just 'tagging'; needless to say, Lisp and
Smalltalk and other interpreted language implementations have been doing it
for decades now, and at one time there was direct hardware support for it in
some architectures.

~~~
charliesome
Right! The 68k supported type tagging by having 24 bit addresses on a 32 bit
machine. I'm not sure if allowing tagging was intentional when Motorola
designed the 68k, but Mac OS did store flags in the upper bits of a pointer at
some stage.

~~~
vidarh
It was a quirk due to the number of address lines on the low end models, and
it was a massive code smell, as 68020 and up could use 32 bit addresses so any
code that did this would fail on machines with the faster CPUs.

Amiga Basic, for example, wouldn't run on the 68020 and up because Microsoft
used the upper 8 bits of pointers (amongst a whole slew of other horrible bugs
and performance problems - it's probably the worst Microsoft product ever in
terms of code quality).

------
jbooth
This article should really have the word "stack" in it someplace.

~~~
skatenerd
Not knowing much C, I was confused about why malloc() wouldn't get called for
the RString structure. This is elucidating:

<http://www.cs.usfca.edu/~wolber/SoftwareDev/C/CStructs.htm>

Particularly this part: "// automatic allocation, all fields placed on stack"

~~~
metageek
Even if you're not putting the RString struct on the stack, the embedded
string optimization means calling malloc() just once instead of twice.

~~~
ori_b
You can allocate the string with one malloc:

    
    
        RString *s = malloc(sizeof(RString) + length); /* 'length' extra bytes past the end of the struct */
        s->data = (char *)(s + 1); /* point data at the extra memory just past the struct */

~~~
teaspoon
Ruby strings are mutable, so you need to be able to free _s->data_ without
freeing the RString itself.

~~~
ori_b

        s = realloc(s, sizeof(RString));

~~~
charliesome
MRI objects are not relocatable, so that won't work if realloc has to move the
structure in memory.

------
jeffremer
Link bait titles irk me. Nevertheless nice post on some of the MRI internals.

------
anrope
Cool dip into Ruby internals.

If you roll your own ruby, instead of redoing all your strings, you could just
change RSTRING_EMBED_LEN_MAX. This would cause more wasted memory if you have
a lot of short strings (0 < len << RSTRING_EMBED_LEN_MAX), and probably isn't
worth it since there isn't much performance improvement.

The most confusing part of this article was the actual RString struct
implementation. Are the anonymous unions and structs used to control structure
padding and alignment?

~~~
barrkel
Values in interpreters written in C are frequently implemented as (manually)
discriminated unions - i.e. unions that share a field at the start to indicate
the type and contents of the remainder - because that's a handy way of
implementing the polymorphism required for a straightforward interpreter. It's
pretty much necessary to use structs inside unions in order to have more than
one field per layout; the struct is just grouping, so it doesn't need a
type name.

So without looking at any of MRI source, I'd be willing to guess that most, if
not all, of its structures representing Ruby values start with a field of type
RBasic, and that type contains information necessary to distinguish and
interpret the remainder of the value.

~~~
ben0x539
Yeah. I just looked at the source because I was confused about how it knew how
long the embedded string was (since the length field is in the other half of
the union and ruby strings can have embedded \0 bytes), and RBasic is a struct
containing a VALUE referring to a class and a VALUE "flags" that tends to have
a lot of bit fiddling done to it.

Apparently out of the non-reserved bits, one is used to tell whether the
string is embedded or not and five more are combined to give the string's
length. Makes sense!

------
extension
Why would str2=str create a new string?

How are RStrings modified when they are referenced by other shared RStrings?

Is there any way to create a shared RString other than calling .dup?

~~~
pat_shaughnessy
Great questions! To be honest I don't have enough knowledge of the MRI
internals yet to be able to answer these 100% correctly. Maybe I'll write a
follow up post or an update to this one explaining these issues when I have
time for more research.

But for today:

1\. str2 = str doesn't actually create a new string. RString represents a
string value, not a string object. So str2=str creates a new RString because
that essentially defines what str2's value refers to… Think of RString as an
internal pointer to the value that str or str2 is referring to. Sorry - not a
good explanation :(

2\. How are shared RStrings modified? Not sure yet, but I'm curious to find
out. Will let you know on my site somehow.

3\. Yes - simply calling str2 = str will do it, as will a variety of other ways.

~~~
ben0x539
I'm 99% sure you're wrong about 1). The internal pointer is the VALUE that is
pointing to the struct RString. `str2 = str` will make another VALUE point to
the same struct RString. Consider

    
    
      str = "foo"
      str2 = str
      str.object_id == str2.object_id #=> true
    

I'm fairly sure that implies that str and str2 share everything including the
RString object, so it couldn't really have one of str/str2 have a field that
points to the other.

I am not sure about 2) and 3) but I'd expect some copy-on-write mechanism
based on a flag field in the RBasic struct, and taking substrings might be
O(1) thanks to sharing too.

------
dblock
It's too bad that Ruby requires strings to be GC-ed, otherwise you could get
away with an even faster stack string up to a certain size. Basically you
would do the same thing as RString, but when you declare a string on the
stack, there's no malloc unless you need >N chars.

<https://github.com/dblock/baseclasses/tree/master/String> for an
implementation (a bit thick).

------
rodw
I'll admit to only having skimmed much of this article, but that's a lot of
words to say this:

"It turns out that the MRI Ruby 1.9 interpreter is optimized to handle strings
containing 23 characters or less more quickly than longer strings."

The rest of the article seems to back that some with benchmarking numbers that
suggest allocating a 23 character string is about 50% faster than allocating a
24 character string, which in this particular test worked out to about 200
milliseconds difference in the time it takes to allocate 1 MILLION strings,
which makes the time savings about 0.2 microseconds (200 nanoseconds) per
allocation if I remember my SI units right.

~~~
aaronblohowiak
It is also a nice introduction to C for rubyists -- it explains the basic ruby
c object model and how different kinds of strings are laid out in memory and
the implications that has on performance. Ultimately, the author shares your
conclusion.

------
jcoder
Robert Anton Wilson would be proud (<http://en.wikipedia.org/wiki/23_enigma>).

------
endgame
Does that maximum length need to be a magic number, or could it be an
expression written in terms of `offsetof` and `sizeof`?

~~~
ars
It could be (and probably should be):

    
    
       sizeof(long) + sizeof(char *) + sizeof(long) - 1  /* -1 for the NUL */
    

But it's possible the DEFINE for RSTRING_EMBED_LEN_MAX is that and not a fixed
number. You need the DEFINE since it would be needed elsewhere to check the
length of a string before storing it.

------
wtn
Link-bait title…

Author directly contradicts the title in the tl;dr at the end.

~~~
phzbOx
The way wtn said it might be a bit provocative, but here's the last paragraph
of the post:

""" I don’t think you should refactor all your code to be sure you have
strings of length 23 or less. That would obviously be ridiculous. The 50%
speed increase sounds impressive, but actually the time differences I measured
were insignificant until I allocated 100,000s or millions of strings – how
many Ruby applications will need to create this many string values? And even
if you do need to create many string objects, the pain and confusion caused by
using only short strings would overwhelm any performance benefit you might
get. """

So, basically, "Never create Ruby strings longer than 23 characters" is not
true at all. In some _specific_ cases, it might be true. A more accurate title
might have been "Why Ruby strings with more than 23 characters are handled
differently." (Or something similar)

But then, it was an interesting post and I enjoyed it; so it doesn't really
matter I guess.

------
gte910h
I have this sinking feeling reading that article took more time than I'll ever
save by knowing this.

Computers go unimaginably fast now. Really. Humans can't intuitively
comprehend how fast it is. I doubt that this fact will save perceptible time
for more than a dozen of its readers.

~~~
jamesgeck0
Except in the most egregious cases, how many optimization articles _ever_ save
you more time than it takes you to read them? As you say, computers are
unimaginably fast.

~~~
gte910h
Indeed. I am just continually amazed at the lack of caution people show when
expressing these sentiments.

"Never use a ruby string longer than 23 characters!!!"....or you'll just take
a slightly less infinitesimal amount of time.

~~~
artursapek
I think the title is meant to be a hook.

------
pillbug88
Isn't this just the small string optimization with copy on write?

~~~
ComputerGuru
No - instead of allocating memory on the heap for the C character array aka
"string", Ruby stores the text itself directly in the RString object.

These short strings would actually NOT have COW optimization because they're
structs and cloning one would clone its embedded string as well (as there are
no pointers involved in < 24 byte strings).

~~~
pillbug88
isn't that the definition of the small string optimization?

This is how the dinkumware implementation of std::string has behaved for
years. The basic form of the structure is a buffer and a pointer. If the
pointer is filled in, it points to a heap string and follows cow semantics.
Otherwise, the buffer is used, accomplishing the small string optimization.

I suppose I should have elaborated more. It just feels like the OP is
"discovering" the wheel.

~~~
ComputerGuru
Absolutely. Keep in mind that the OP is coming from a Ruby background and TFA
targets people more familiar with high-level interpreted languages than with
C/C++.

BTW, I _think_ though I cannot be sure that both the GCC and MSVC std::string
implementations use this optimization in release mode, but I gotta dash and
don't have time to verify this fact, so take it with a grain of salt, if you
will.

------
mikehoward
good to know - thanks

