
Strings and the CLR – A Special Relationship - matthewwarren
http://mattwarren.org/2016/05/31/Strings-and-the-CLR-a-Special-Relationship/
======
jsingleton
Nice post.

There are many things that aren't possible in C# but can be expressed in pure
IL. Sigil [0] is an IL generator used in the high-performance JSON serializer
Jil [1], which takes advantage of this.

[0]: [https://github.com/kevin-montrose/Sigil](https://github.com/kevin-montrose/Sigil)

[1]: [https://github.com/kevin-montrose/Jil](https://github.com/kevin-montrose/Jil)

~~~
matthewwarren
I've seen Jil/Sigil before, but not thought about it in this way.

Are you saying that you could exactly replicate a C# System.String using IL,
or am I misunderstanding?

~~~
jsingleton
No, I wasn't suggesting that you could implement the same high-performance
System.String in IL, especially considering the assembly language used. I was
just saying that there are many other things that can't be done in C# but are
possible in .NET, using other languages.

BTW there is a typo: "passed in my the calling code" should probably be
"passed in _by_ the calling code".

I wonder if the new C# 6 string interpolation also uses StringBuilder like
String.Format does, e.g.

    
    
      $"Name = {name}, hours = {hours:hh}"
    

[https://msdn.microsoft.com/en-us/library/Dn961160.aspx](https://msdn.microsoft.com/en-us/library/Dn961160.aspx)

This post seems to suggest it uses concatenation, which may be bad for
performance in some situations due to the immutability.

[https://weblogs.asp.net/bleroy/c-6-string-interpolation-is-n...](https://weblogs.asp.net/bleroy/c-6-string-interpolation-is-not-a-templating-engine-and-it-s-not-the-new-string-format)

~~~
matthewwarren
I just did a quick test with Reflector and it looks like string interpolation
is turned into String.Format(..) calls, which in turn will use StringBuilder.
This makes sense as String.Format(..) understands all the string formatting
placeholders.

> BTW there is a typo: "passed in my the calling code" should probably be
> "passed in by the calling code".

Thanks for spotting that, I'll fix it in a bit

------
MarkSweep
Before the CLR was open sourced you could investigate how things were
implemented by using decompilers on the MSIL code or, later, the shared source,
but investigations often stopped at an InternalCall just as things started to
get interesting. Now you can see the whole story of how abstractions are
implemented, part in C# and part in C++, when the runtime needs to "bend the
rules" of what C# would normally allow. The corert[0] project is interesting
because a larger portion of the runtime is implemented in C# (or as compilation
passes instead of runtime code generation). For example, you can find
exception dispatch and type casting implemented in C# there [1].

One example from the article that may be a little off is getting the length of
a string. It looks like the call to string.get_Length is probably replaced with
a JIT intrinsic[2]. The importer then turns that intrinsic call into a
GT_ARR_LENGTH[3] node that is eventually lowered into something like[4]:

    
    
        *(stringAddr + offsetOfStringLengthField)
    

It's a bit more complicated than that, as there are several references to
GT_ARR_LENGTH in various optimization passes.

[0]: [https://github.com/dotnet/corert/](https://github.com/dotnet/corert/)
[1]:
[https://github.com/dotnet/corert/tree/master/src/Runtime.Bas...](https://github.com/dotnet/corert/tree/master/src/Runtime.Base/src/System/Runtime)
[2]:
[https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa18...](https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa1810fde0c7ffb842fa31b/src/vm/ecalllist.h#L226)
[3]: (warning, big file)
[https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa18...](https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa1810fde0c7ffb842fa31b/src/jit/importer.cpp#L3119)
[4]: (warning, big file)
[https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa18...](https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa1810fde0c7ffb842fa31b/src/jit/flowgraph.cpp#L8828)
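
To make the "length is just a load from a fixed offset" point concrete, here is a rough sketch in C rather than C#, since the point is the memory layout. The `ToyString` struct, its field names, and both functions are made up for illustration; the real CLR object layout differs in its details.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy layout, NOT the real CLR one: header pointer, length, inline UTF-16 data. */
typedef struct {
    void    *type_handle;  /* stand-in for the object header */
    int32_t  length;       /* count of UTF-16 code units, excluding terminator */
    uint16_t chars[];      /* code units stored inline, plus a trailing 0 */
} ToyString;

/* Allocate a ToyString holding a copy of `n` code units from `src`. */
ToyString *toy_string_new(const uint16_t *src, int32_t n) {
    assert(src != NULL && n >= 0);
    ToyString *s = malloc(sizeof(ToyString) + ((size_t)n + 1) * sizeof(uint16_t));
    if (s == NULL) return NULL;
    s->type_handle = NULL;
    s->length = n;
    memcpy(s->chars, src, (size_t)n * sizeof(uint16_t));
    s->chars[n] = 0;
    return s;
}

/* What the lowered node boils down to: *(stringAddr + offsetOfStringLengthField). */
int32_t toy_string_length(const void *string_addr) {
    int32_t len;
    memcpy(&len, (const unsigned char *)string_addr + offsetof(ToyString, length),
           sizeof len);
    return len;
}
```

No loop and no call into the string are needed: once the address is known, the length is a single memory read.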

~~~
kaushiks
Not necessarily. As long as it wasn't related to the JIT or the GC, one could
always look up the SSCLI (ROTOR) code to understand how something was
implemented.
[https://en.wikipedia.org/wiki/Shared_Source_Common_Language_...](https://en.wikipedia.org/wiki/Shared_Source_Common_Language_Infrastructure)

~~~
MarkSweep
That's a good point. It's fun to diff Rotor and CoreCLR to see how much has
changed and what has stayed the same. The JIT and GC seem to not match up at
all, but good chunks of the vm folder match up nicely.

------
ygra
Another point for this, besides optimization, is easier interop with existing
technologies. The memory layout is identical to Windows' BSTR, as far as I
know, and also includes a terminating U+0000. All of that makes it possible to
marshal a CLR string as is in many circumstances without the need to copy or
convert the characters around.
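
For anyone who hasn't met the BSTR convention, a minimal sketch in C: a 4-byte byte count sits in front of the characters and a terminating zero sits after them, and the pointer handed around addresses the first character. The `toy_bstr_*` names are hypothetical and this is not the Windows allocator, just the shape of the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* BSTR-style block: [4-byte byte count][UTF-16 code units][0 terminator].
   The returned pointer addresses the first code unit, not the prefix. */
uint16_t *toy_bstr_alloc(const uint16_t *src, uint32_t n) {
    assert(src != NULL);
    unsigned char *block = malloc(sizeof(uint32_t) + ((size_t)n + 1) * sizeof(uint16_t));
    if (block == NULL) return NULL;
    uint32_t byte_len = n * (uint32_t)sizeof(uint16_t);
    memcpy(block, &byte_len, sizeof byte_len);
    uint16_t *chars = (uint16_t *)(block + sizeof(uint32_t));
    memcpy(chars, src, (size_t)n * sizeof(uint16_t));
    chars[n] = 0;
    return chars;
}

/* Length in code units, read from the prefix just before the characters. */
uint32_t toy_bstr_len(const uint16_t *s) {
    uint32_t byte_len;
    memcpy(&byte_len, (const unsigned char *)s - sizeof(uint32_t), sizeof byte_len);
    return byte_len / (uint32_t)sizeof(uint16_t);
}

/* Free the whole block, starting from the hidden prefix. */
void toy_bstr_free(uint16_t *s) {
    if (s != NULL) free((unsigned char *)s - sizeof(uint32_t));
}

/* A consumer that only understands 0-terminated UTF-16, like many native APIs. */
size_t count_until_terminator(const uint16_t *s) {
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}
```

The same pointer serves both kinds of caller: one reads the prefix, the other scans for the terminator, and neither needs a copy.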

~~~
MarkSweep
There is an amusing comment in the CLR about its support for odd length
BSTRs[0]. There can be an extra wchar after the null terminator to store the
extra byte.

It looks like if you use the automatic BSTR marshaler it creates a copy[1],
though the CLR is so big it's hard for me to say that with confidence. Taking
the address of a string with the fixed statement in C# of course won't
allocate, and that pointer can be passed to APIs that expect Unicode strings.

[0]:
[https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa18...](https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa1810fde0c7ffb842fa31b/src/vm/object.cpp#L1853-L1869)
[1]:
[https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa18...](https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa1810fde0c7ffb842fa31b/src/vm/olevariant.cpp#L5250-L5275)

~~~
matthewwarren
I love comments like that, it's almost a history lesson on VB and the CLR!

~~~
adzm
Then you'd probably love to know why the OLE DATE epoch is 1899-12-30...

Also, the B in BSTR is for Basic!

[https://blogs.msdn.microsoft.com/ericlippert/2003/09/16/eric...](https://blogs.msdn.microsoft.com/ericlippert/2003/09/16/erics-complete-guide-to-vt_date/)

------
Someone
So, are there high-level languages that make it easy to write classes that
include length-varies-by-instance member(s) in their memory layout?

That could be a huge efficiency gain, because keeping the data inline avoids
the extra cache miss of chasing a pointer to it.
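
For reference, this is what the layout in question looks like in C (not a high-level language, but it shows the shape): a flexible array member puts the variable-length part directly after the fixed fields, in one allocation. The `Reading` type and its functions are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Fixed fields followed by a payload whose size varies per instance. */
typedef struct {
    int    id;
    size_t count;
    double samples[];   /* flexible array member: stored inline after the fields */
} Reading;

/* One allocation covers both the fixed fields and `count` samples. */
Reading *reading_new(int id, const double *samples, size_t count) {
    assert(samples != NULL);
    Reading *r = malloc(sizeof(Reading) + count * sizeof(double));
    if (r == NULL) return NULL;
    r->id = id;
    r->count = count;
    memcpy(r->samples, samples, count * sizeof(double));
    return r;
}

/* Walks the inline payload; no second pointer has to be followed. */
double reading_sum(const Reading *r) {
    double total = 0.0;
    for (size_t i = 0; i < r->count; i++) total += r->samples[i];
    return total;
}
```

The alternative, a `double *samples` field pointing at a separate allocation, costs an extra pointer dereference and usually lands the data on a different cache line from the header.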

~~~
twic
Smalltalk has that:

[https://www.gnu.org/software/smalltalk/manual-base/html_node...](https://www.gnu.org/software/smalltalk/manual-base/html_node/Class_002dinstance-creation.html)

[https://www.gnu.org/software/smalltalk/manual/html_node/Insi...](https://www.gnu.org/software/smalltalk/manual/html_node/Inside-Arrays.html)

Although see the footnote - "This is not always true for other Smalltalk
implementations, who don't allow instance variables in variableByteSubclasses
and variableWordSubclasses". So, in some implementations, you can have either
normal fields or array slots, but not both. That said, I think if you have a
base class with fields, you can always make a subclass with slots.

------
nblumhardt
Great post. So the follow-up question is, how much work would be involved in
mirroring all that in a native UTF-8 encoded string type? :-)

Windows interop and 8/16 conversions would obviously be an expense, but
~half the storage requirements of UTF-16 has to represent a substantial
CPU/RAM saving.

Guess it's too much cruft and complexity to introduce into .NET now, but who
knows?
[https://twitter.com/terrajobst/status/717935598904807424](https://twitter.com/terrajobst/status/717935598904807424)
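
A quick way to see where the "~half" figure holds is to count bytes for the same text in both encodings. This is a sketch in C with made-up function names, and it assumes well-formed UTF-16 input (every high surrogate followed by a low one).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store `n` UTF-16 code units as UTF-8 (no terminator).
   Assumes well-formed input: every high surrogate is followed by a low one. */
size_t utf8_size_of_utf16(const uint16_t *s, size_t n) {
    assert(s != NULL);
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t u = s[i];
        if (u < 0x80) {
            bytes += 1;                      /* ASCII */
        } else if (u < 0x800) {
            bytes += 2;                      /* e.g. accented Latin letters */
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n) {
            bytes += 4;                      /* surrogate pair: one code point */
            i++;                             /* skip the low surrogate */
        } else {
            bytes += 3;                      /* rest of the BMP, e.g. CJK */
        }
    }
    return bytes;
}

/* Bytes needed to store `n` UTF-16 code units as-is (no terminator). */
size_t utf16_size(size_t n) {
    return n * sizeof(uint16_t);
}
```

ASCII text comes out at exactly half the size, accented Latin letters and surrogate pairs break even, and most CJK characters take 3 bytes in UTF-8 against 2 in UTF-16, so the saving depends heavily on the data.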

~~~
matthewwarren
Yeah I think the legacy interop is the tricky part, see "Why does C# use
UTF-16 for strings?"
([http://blog.coverity.com/2014/04/09/why-utf-16/#.V1AT9vkguUl](http://blog.coverity.com/2014/04/09/why-utf-16/#.V1AT9vkguUl))
for example.

Interestingly enough, Java just implemented compact strings that are stored
either as ISO-8859-1/Latin-1 (one byte per character) or as UTF-16 (two bytes
per character). See
[http://openjdk.java.net/jeps/254](http://openjdk.java.net/jeps/254) and
[https://www.infoq.com/news/2016/02/compact-strings-Java-JDK9](https://www.infoq.com/news/2016/02/compact-strings-Java-JDK9)
for more info.
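
The core decision behind a compact-string scheme can be sketched in a few lines of C. This is only the idea, not Java's actual implementation, and the function names are hypothetical: if every code unit fits in one byte, store one byte per character, otherwise fall back to two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every code unit fits in a single Latin-1 byte, else 0. */
int fits_latin1(const uint16_t *s, size_t n) {
    assert(s != NULL);
    for (size_t i = 0; i < n; i++) {
        if (s[i] > 0xFF) return 0;
    }
    return 1;
}

/* Storage needed for the character data under the compact scheme. */
size_t compact_size(const uint16_t *s, size_t n) {
    return fits_latin1(s, n) ? n : n * sizeof(uint16_t);
}
```

A single character outside Latin-1 pushes the whole string back to two bytes per character, so the benefit depends on how much of an application's text is plain Latin-1.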

------
ed_blackburn
This makes me wonder..

Can one extend the CLR (modularly) and access the extension via extern and
[MethodImplAttribute(MethodImplOptions.InternalCall)]?

Native UTF-8 is an obvious example, but the scope could be far greater, such
as native interop for a hardware device (IoT), funky filesystems or
hypervisors?

~~~
matthewwarren
Good question, I don't know how much (if at all) it can be extended.

Those [MethodImplAttribute(MethodImplOptions.InternalCall)] calls are still
within the CoreCLR codebase; they just cross from the managed part into the
unmanaged part.

