

Referencing Substrings Faster, without Leaking Memory - Strilanc
http://twistedoakstudios.com/blog/Post8147_referencing-substrings-faster-without-leaking-memory

======
nteon
The original behavior, where a substring pins the full original string, is
also how Go's substring slicing works. I much prefer this behavior, as it lets
you profile and add a copy yourself if you find yourself pinning large
strings. The Java change to do a copy for every substring seems like it harms
the general case to help out poorly reasoned code.

While the tracking of intervals with a tree is certainly impressive, it is a
lot of complexity to again (in my opinion) support poorly written code. If you
know you are going to only use 10 bytes of a 1 MB incoming HTTP request, make
a copy.
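
A minimal Java sketch of the "make a copy yourself" idea, using byte arrays rather than strings. The class and method names (`ExplicitCopy`, `view`, `copy`) are made up for illustration; `ByteBuffer.wrap`, `slice` and `Arrays.copyOfRange` are standard library calls.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ExplicitCopy {
    // A view: O(1), but it keeps the whole backing array reachable.
    static ByteBuffer view(byte[] big, int offset, int length) {
        return ByteBuffer.wrap(big, offset, length).slice();
    }

    // An explicit copy: O(length), and it holds no reference to the big array.
    static byte[] copy(byte[] big, int offset, int length) {
        return Arrays.copyOfRange(big, offset, offset + length);
    }

    public static void main(String[] args) {
        byte[] request = new byte[1 << 20]; // stand-in for a 1 MB HTTP request
        byte[] head = "GET /index".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(head, 0, request, 0, head.length);

        ByteBuffer pinned = view(request, 0, 10);
        byte[] detached = copy(request, 0, 10);

        System.out.println(pinned.array() == request); // true: the view shares the 1 MB array
        System.out.println(detached.length);           // 10: the copy owns only its own bytes
    }
}
```

As long as `pinned` is reachable, so is the full 1 MB array; `detached` lets it be collected.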

~~~
ape4
I think the best answer is to take the size of the parent and child strings
into account. If the parent is 1M and the child is just a few bytes then make
a copy so the parent can be freed up. Otherwise use the old behavior.

~~~
enjo
That'd be even worse. Now I have yet another variable to deal with when
tracking down performance issues in my code. One that would be non-obvious to
anyone not terribly familiar with how things are working behind the scenes.

Better would be to introduce methods that explicitly expose the type of
substring method being used.

~~~
ape4
But you wouldn't have to track down performance issues in your code if it
never was slow ;)

------
wfunction
Why not just do the copying conditionally?

If it's at least half the size of the parent, reference the parent. Otherwise
just make a copy...
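
A rough Java sketch of that rule, assuming a hypothetical `ConditionalSlice` type (not how `java.lang.String` works). One assumption beyond the comment: the threshold is measured against the backing array rather than the immediate parent, so a chain of "half of a half of a half" substrings cannot keep a huge array alive.

```java
final class ConditionalSlice {
    final char[] data;   // backing array, possibly shared with an ancestor
    final int offset;
    final int length;

    private ConditionalSlice(char[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    static ConditionalSlice of(String s) {
        return new ConditionalSlice(s.toCharArray(), 0, s.length());
    }

    ConditionalSlice substring(int begin, int end) {
        if (begin < 0 || end > length || begin > end) {
            throw new IndexOutOfBoundsException("begin=" + begin + ", end=" + end);
        }
        int n = end - begin;
        if (2L * n >= data.length) {
            // Covers at least half of the backing array: share it, O(1).
            return new ConditionalSlice(data, offset + begin, n);
        }
        // Otherwise copy, so the large array is not pinned by a small slice.
        char[] copy = new char[n];
        System.arraycopy(data, offset + begin, copy, 0, n);
        return new ConditionalSlice(copy, 0, n);
    }

    @Override
    public String toString() {
        return new String(data, offset, length);
    }
}
```

With this rule a slice never retains more than twice its own length, at the cost of `substring` being O(n) in the copying case.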

~~~
emn13
Part of the reason for the change was likely to allow better optimizations of
the strings. The new strings are basically equivalent to arrays and might be
optimized as such; even without that they certainly don't need an offset nor
length field anymore since the underlying array can provide that.

If you conditionally used one implementation or the other you'd lose the
optimization opportunity. You could alternatively make String non-final, but
then your method resolution is more expensive, and that would really be a
pretty big change in any case.

I think Java made the right choice here, since it's trivial to wrap a string in
another structure whenever you need a fast substring. It's just that the
custom string won't be passable to lots of other APIs, which makes it a
potentially nasty surprise for existing codebases.
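
A minimal sketch of such a wrapper, assuming a hypothetical `StringSlice` class. It implements the standard `CharSequence` interface, so it works with APIs that accept `CharSequence`, but anything that demands a `String` still needs `toString()`, which copies.

```java
final class StringSlice implements CharSequence {
    private final String source; // the full original string stays referenced
    private final int start;     // inclusive
    private final int end;       // exclusive

    StringSlice(String source) {
        this(source, 0, source.length());
    }

    private StringSlice(String source, int start, int end) {
        this.source = source;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return source.charAt(start + index);
    }

    // O(1): only the bounds change, no characters are copied.
    @Override
    public StringSlice subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException("from=" + from + ", to=" + to);
        }
        return new StringSlice(source, start + from, start + to);
    }

    // Materializes a real String when an API insists on one.
    @Override
    public String toString() {
        return source.substring(start, end);
    }
}
```

Usage would look like `new StringSlice(text).subSequence(6, 11)`, with the usual caveat that the slice pins `text` for as long as it is reachable.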

~~~
cpeterso
Couldn't the substring copying be deferred until the parent string is ready to
be GC'd? In the common case of a short-lived substring, no string copies are
necessary. Only if the substring outlives the parent string would a copy be
necessary.

------
voidlogic
This change is going to destroy the performance of a lot of apps that have not
been modified in years.... There are lots of programs that might substring
every word in a large block of text...

Letting a programmer use too much memory, and then learn why and when to
copy, is much better than drastically changing the runtime characteristics of
many existing programs. Not copying is way more efficient; you just need to be
in the know.

There are even solutions that can do both, like conditional copying, or keeping
track of the parent and doing a conditional copy if needed at GC time.

~~~
wfunction
Substringing every word in a block of text is actually the ideal situation for
the change here...

~~~
Groxx
I think they're referring to what Java's doing:

> _Last month, Java changed how it computes substrings._
>
> ... [tl;dr: previously always kept whole string, substring just stored
> indexes]
>
> _Java now uses a different method: always copying the substring._

------
binarycrusader
The author of the change in Java discusses it in more detail here (from the
article):
[http://www.reddit.com/r/programming/comments/1qw73v/til_orac...](http://www.reddit.com/r/programming/comments/1qw73v/til_oracle_changed_the_internal_string/cdhb77f)

------
VanillaCafe
Alternative: on substring, reference the original string through a
SoftReference registered with a global ReferenceQueue. Have one global system
thread poll that reference queue and convert substrings to hard copies
(requiring a reverse map from the parent to its children via WeakReference)
when the parent is about to be collected.

Problem is this is still reasonably complicated (though much less so than the
article). A lot of memory churn and possible OOMEs will occur at the time of
GC instead of the time of substring allocation. That makes the overall system
less predictable and less debuggable, as will any deferred substring
reallocation.

------
robryk
JVM has a garbage collector, so if we are talking about modifying JVM itself,
we wouldn't need anything that works online: you could determine which parts
are referenced in a simpler way during each GC.
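
To make "which parts are referenced" concrete, here is a small offline sketch in Java: given the (start, end) ranges of all live substrings of one backing array, merge them into the disjoint ranges that are still in use. `LiveRanges` and its methods are hypothetical names, and this is only the bookkeeping step, not a description of anything the JVM collector does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LiveRanges {
    // Each slice is {start, end} with end exclusive. Returns disjoint,
    // sorted ranges covering every index used by at least one slice.
    static List<int[]> merge(int[][] slices) {
        int[][] sorted = slices.clone();
        Arrays.sort(sorted, (a, b) -> Integer.compare(a[0], b[0]));

        List<int[]> merged = new ArrayList<>();
        for (int[] s : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && s[0] <= last[1]) {
                // Overlaps or touches the previous range: extend it.
                last[1] = Math.max(last[1], s[1]);
            } else {
                merged.add(new int[] {s[0], s[1]});
            }
        }
        return merged;
    }

    // Total number of array elements that are still referenced.
    static int liveLength(int[][] slices) {
        int total = 0;
        for (int[] r : merge(slices)) {
            total += r[1] - r[0];
        }
        return total;
    }
}
```

Everything outside the merged ranges is unreferenced, so a collector with this information could in principle shrink or split the array during a collection pass.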

~~~
Strilanc
The JVM garbage collector is not capable of tracking intervals nested inside
other objects like arrays. At least, not that I'm aware of.

------
jheriko
this is all very interesting, especially to see the end result, but as a rule
i don't leak memory or use garbage collection.

again i am left feeling that garbage collection causes more problems than it
solves. certainly every instance of debugging a memory 'leak' in that context
is much harder than it needs to be (don't use gc or refcounting) - and also a
slap in the face considering this is the problem it's meant to solve anyway.

if you care this much about performance your biggest bottleneck is the gc -
not just in terms of performance cost, but the amount of your time you will
waste solving a problem like this instead of optimising your code.

this is a problem i don't need to solve and pretty much everything i write is
high performance code... whilst i know this isn't targeted optimisation, but
a fun experiment, i would worry that people will read some kind of best
practice here where it really isn't...

~~~
icebraining
It's not the gc that causes this problem. The problem is: you allocated a
block of, say, 100KB. Then you needed two 10KB subsets of that block. While
you're using those, you may stop needing the rest of the block. How do you
create those subsets? Do you just point to the original memory? Then you can't
free() the 80KB you don't need anymore. Do you copy those 20KB? Then you're
potentially using more memory than necessary.
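
The arithmetic for those numbers, as a tiny Java sketch (the `SubsetCost` class and its method names are invented for illustration, sizes in KB):

```java
final class SubsetCost {
    // Sharing: the subsets point into the block, so the whole block stays allocated.
    static int retainedWhenSharing(int blockKb) {
        return blockKb;
    }

    // Copying: once the block is freed, only the copies remain.
    static int retainedAfterCopying(int... subsetKb) {
        int total = 0;
        for (int kb : subsetKb) {
            total += kb;
        }
        return total;
    }

    // Copying: the block and the copies coexist until the block is freed.
    static int peakWhileCopying(int blockKb, int... subsetKb) {
        return blockKb + retainedAfterCopying(subsetKb);
    }

    public static void main(String[] args) {
        System.out.println(retainedWhenSharing(100));      // 100 (80 of it unused)
        System.out.println(retainedAfterCopying(10, 10));  // 20
        System.out.println(peakWhileCopying(100, 10, 10)); // 120
    }
}
```

So pointing into the block holds 100KB to use 20KB, while copying ends at 20KB but briefly needs 120KB.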

The problem holds even if you're not using a gc or refcounting at all.

------
ape4
This proposed new method seems way overcomplicated to me.

~~~
involans
the author appears to have rediscovered nested containment lists
[http://www.ncbi.nlm.nih.gov/m/pubmed/17234640/](http://www.ncbi.nlm.nih.gov/m/pubmed/17234640/)

~~~
brazzy
Which only proves him right when he says

> I also dislike trying to find existing implementations of trees because, for
> some odd reason, trees tend to be named by abbreviations that only make
> sense in hindsight

~~~
Strilanc
I'd consider "nested containment list" a pretty good name, actually. Of course
then they abbreviate it to NCList...

