
String Interning Done Right - cronjobber
https://getkerf.wordpress.com/2016/02/22/string-interning-done-right/
======
Someone
Interning is a trade-off: you get decreased memory usage (if you use a lot of
long-lived duplicated strings) and faster string comparisons (for pairs of
strings that you do intern), at the cost of extra work to create the interned
strings.

Problem is that it is almost impossible to decide whether interning makes
sense. About the only way I know is that, if you run out of memory and use a
lot of strings, you can think “let’s give interning a try”.

Even if you know the answer, you cannot tell libraries you use to intern
strings they create. So, if your libraries create lots of long-lived strings,
you do not have control over whether they intern strings.

Also, if you do write a library, the call whether to intern or not to intern
is impossible to make because you cannot know whether your callers prefer
speed or memory usage, and whether the objects you return to them will be
long-lived.

For example, if you write an XML parser library, interning tag names may be the
best guess, but libraries that do not do that may beat you in short
benchmarks.

Because of that, I think it would be more useful to either have some way to
globally install a ‘should I intern this string?’ handler (but what kind of
information should that get, so that it can make an informed decision?), or to
have the garbage collector intern strings as it sees fit. Problem with that is
that it can change the behaviour of programs that compare strings using ‘are
the same object’ comparisons. Maybe that feature should be removed from your
language.

Also note that modern Java allows one to tweak its interning behaviour a bit
([http://java-performance.info/string-intern-in-java-6-7-8/](http://java-performance.info/string-intern-in-java-6-7-8/))

~~~
uxcn
> Interning is a trade-off: you get decreased memory usage (if you use a lot
> of long-lived duplicated strings) and faster string comparisons (for pairs
> of strings that you do intern), at the cost of extra work to create the
> interned strings.

For garbage collected languages the benefit isn't only memory consumption,
it's performance, since using canonical objects eliminates the additional
allocations and garbage collections.

The string comparison argument is a bit of a dubious one though. Comparing
length is only an integer comparison, and depending on the architecture, you
can compare up to eight (or more) characters per cycle.
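
To make the comparison argument concrete, here is a minimal sketch of the order of checks a content comparison can use. `StringCompare` and `sameContents` are hypothetical names for illustration, and the sketch assumes both arguments are non-null; it is not the JDK's implementation of `String.equals`.

```java
// Hypothetical helper, for illustration only.
final class StringCompare {
    // Assumes a and b are non-null.
    static boolean sameContents(String a, String b) {
        if (a == b) {
            return true;                 // same object: interned pairs stop here
        }
        if (a.length() != b.length()) {
            return false;                // a single integer comparison
        }
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return false;            // first differing character ends the scan
            }
        }
        return true;                     // equal strings pay for a full scan
    }
}
```

Only the last case, two distinct objects with equal contents, costs a full pass over the characters, and that is the case interning turns into a pointer comparison.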

~~~
Someone
_"since using canonical objects eliminates the additional allocations and
garbage collections"_

That requires a quite advanced compiler. Looking at Java, the process is:

    
    
      - create a new string in some way.
      - call String.intern() to create or retrieve
        the interned string with the same contents.
    

If you do

    
    
      String si = (s+t).intern();
    

or

    
    
      String si = s.replace('a', 'b').intern();
    

it would take quite a compiler to prevent the creation of an intermediate
string. You could have every function return an iterator over the characters
that would end up in the string, iterate over that to check whether the string
already is interned, and if not, iterate again to allocate a new string, but I
think it would typically be cheaper to create it and let the young generation
garbage collector collect it.
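
A rough sketch of the two-pass idea for the concatenation case, done at user level rather than by the compiler. `ConcatPool` and `internConcat` are hypothetical names, not JDK APIs, and the pool is a plain `HashMap` with no thread safety.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pool, not a JDK class: interns the concatenation s + t.
final class ConcatPool {
    // Pooled strings grouped by the hash of their characters.
    private final Map<Integer, List<String>> buckets = new HashMap<>();

    String internConcat(String s, String t) {
        // Pass 1: hash and compare the characters of s, then t, in place.
        List<String> bucket =
                buckets.computeIfAbsent(hashOf(s, t), k -> new ArrayList<>());
        for (String candidate : bucket) {
            if (candidate.length() == s.length() + t.length()
                    && candidate.startsWith(s)
                    && candidate.startsWith(t, s.length())) {
                return candidate; // already pooled, no concatenation performed
            }
        }
        // Pass 2: only now build the combined string.
        String created = s + t;
        bucket.add(created);
        return created;
    }

    // Same running hash as String.hashCode, fed from both parts in order.
    private static int hashOf(String s, String t) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h = 31 * h + s.charAt(i);
        for (int i = 0; i < t.length(); i++) h = 31 * h + t.charAt(i);
        return h;
    }
}
```

Even this sketch allocates on the lookup path (the boxed hash key, and a bucket list on first use), and it reads the characters twice on a miss, which fits the point that allocating the temporary and letting the young generation collector reclaim it is often the cheaper route.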

~~~
uxcn
> it would take quite a compiler to prevent the creation of an intermediate
> string. You could have every function return an iterator over the characters
> that would end up in the string, iterate over that to check whether the
> string already is interned, and if not, iterate again to allocate a new
> string, but I think it would typically be cheaper to create it and let the
> young generation garbage collector collect it.

I can't think of a language where you would always want to canonicalize
strings at a global scope. For example, consider the case where you have a
large number of threads and cores. Unless the strings are explicitly allocated
on their own cache lines, any thread that references a string now has to worry
about false sharing. You would also need to worry about the contention on the
canonical store.
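
For illustration, a minimal sketch of what a shared canonical store might look like at user level. `SharedPool` is a hypothetical class; it is not how `String.intern()` is implemented, and it never evicts entries.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical process-wide pool shared by all threads.
final class SharedPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for the contents of s.
    String intern(String s) {
        // putIfAbsent is atomic: exactly one of several racing threads wins,
        // and the others get the winner's instance back.
        String previous = pool.putIfAbsent(s, s);
        return previous == null ? s : previous;
    }
}
```

Every thread that interns goes through the same map, so this is where the contention shows up. `ConcurrentHashMap` spreads it out rather than removing it, and nothing here controls where the canonical strings sit relative to cache lines.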

At a user level, in languages like Java, there's generally no reason to create
any intermediate string if you're already reading from a direct byte buffer.
This covers a fairly large set of use cases. There may be other techniques
considering Java supports scalar interpolation now, but direct byte buffers
have been the most effective in my experience.
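
One way to read the byte buffer point, sketched under assumptions: look the bytes up in a pool before any `String` is built. `BufferPool` and `lookup` are hypothetical names, the text is assumed to be UTF-8, and the caller is assumed to pass a range that lies within the buffer's limit.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical pool keyed by raw bytes, not by String.
final class BufferPool {
    // Keys are read-only buffers over private arrays, so their contents stay fixed.
    private final Map<ByteBuffer, String> pool = new HashMap<>();

    // Returns the canonical String for the bytes src[offset, offset + length).
    String lookup(ByteBuffer src, int offset, int length) {
        // A duplicate shares the bytes but has its own position and limit,
        // so src itself is not disturbed.
        ByteBuffer view = src.duplicate();
        view.limit(offset + length);
        view.position(offset);

        // ByteBuffer.hashCode and equals depend only on the remaining bytes,
        // so this probe works for direct and heap buffers alike.
        String hit = pool.get(view);
        if (hit != null) {
            return hit; // no String was created for this occurrence
        }

        // Miss: copy the bytes once and decode them once.
        byte[] copy = new byte[length];
        view.get(copy);
        String created = new String(copy, StandardCharsets.UTF_8);
        pool.put(ByteBuffer.wrap(copy).asReadOnlyBuffer(), created);
        return created;
    }
}
```

A hit still allocates the small duplicate view, but not a `String` or a character array, which is the intermediate object the discussion above is about.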

------
wtetzner
> You can see why Java needs these types. With permanent default interning, any
> sort of sequence involving character-level appends, such as reading the
> contents of a book from a file, would result in a preposterous O(n²) version
> of an otherwise trivial technique.

I don't think this is true. It looks like Java only interns string literals by
default. [1] If you get a string another way (user input, for example), it's
not interned unless you call the String.intern() method.

[1]
[https://docs.oracle.com/javase/8/docs/api/java/lang/String.h...](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#intern--)
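
A small sketch of that behaviour. `LiteralInterning` and its method names are made up for the example; `String.intern()` and `StringBuilder` are the only library calls used.

```java
// Hypothetical demo class; each method returns the result of an identity (==) check.
final class LiteralInterning {
    // Two identical literals refer to the same pooled object.
    static boolean literalsShareIdentity() {
        String a = "kerf";
        String b = "kerf";
        return a == b;
    }

    // A string assembled at run time is a fresh object, not the pooled one.
    static boolean runtimeStringSharesIdentity() {
        String literal = "kerf";
        String built = new StringBuilder("ke").append("rf").toString();
        return built == literal;
    }

    // Calling intern() on it hands back the pooled instance.
    static boolean internedRuntimeStringSharesIdentity() {
        String literal = "kerf";
        String built = new StringBuilder("ke").append("rf").toString();
        return built.intern() == literal;
    }
}
```

The three methods return `true`, `false` and `true` respectively, while `.equals()` would return `true` in all three cases.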

~~~
richardwhiuk
Yeah, the explanation in this article about Java's string interning made me
lose confidence that the author knew what they were talking about. If you
compare Strings, you should use .equals() or you are risking a logic error if
you aren't very careful. Indeed, you can deliberately un-intern a string by
doing new String("my string") (with little realistic value).

------
chrisohara
I just finished a C library for string interning:
[https://github.com/chriso/intern](https://github.com/chriso/intern)

There are also some bindings for Go:
[https://github.com/chriso/go-intern](https://github.com/chriso/go-intern)

------
dgreensp
Interned strings in Java are garbage collected. "Permanent" interning, where
anything interned becomes a memory leak, is not a thing.

------
blt
Has anyone compared performance of interned and non-interned versions of a
large, string-heavy application? Seems like one of those things that might not
be worth it.

------
dllthomas
> This is most noticeable when writing objects to disk or sending them across
> the network. As soon as the process needs to communicate the scheme breaks
> down.

As another point in this space, X Windows moves things out - processes can
create "ATOMs" registered with the server.

------
hendekagon
How is Clojure's interning of keywords different from this?

~~~
Per_Bothner
Read the article. I assume Clojure uses what the article calls a "global
pool".

Kerf's idea of a per-object intern pool does not seem as useful for other
languages.
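
For readers who want a picture of the distinction, here is a rough sketch of a pool that belongs to one object instead of the whole process. This is an assumption about what "per-object" means, written in Java for illustration; `InternedColumn` is a hypothetical class and not Kerf's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical column of strings that carries its own private intern pool.
final class InternedColumn {
    private final Map<String, Integer> indexOf = new HashMap<>(); // string -> pool index
    private final List<String> strings = new ArrayList<>();       // pool index -> string
    private final List<Integer> cells = new ArrayList<>();        // one pool index per row

    void add(String s) {
        Integer index = indexOf.get(s);
        if (index == null) {
            index = strings.size();   // first occurrence: give it the next index
            strings.add(s);
            indexOf.put(s, index);
        }
        cells.add(index);
    }

    String get(int row) {
        return strings.get(cells.get(row));
    }

    // Rows hold equal strings exactly when they hold the same pool index.
    boolean sameValue(int rowA, int rowB) {
        return cells.get(rowA).intValue() == cells.get(rowB).intValue();
    }

    int distinctCount() {
        return strings.size();
    }
}
```

Because the pool lives and dies with the column, there is no global table to contend on, and the column together with its pool can be written out as one self-contained unit.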

