Hacker News
Strings and the CLR – A Special Relationship (mattwarren.org)
83 points by matthewwarren on June 1, 2016 | 29 comments

Nice post.

There are many things that aren't possible in C# but can be expressed in pure IL. Sigil [0] is an IL generator used in the high-performance JSON serializer Jil [1], which takes advantage of this.



I've seen Jil/Sigil before, but not thought about it in this way.

Are you saying that you could exactly replicate a C# System.String using IL, or am I misunderstanding?

No, I wasn't suggesting that you could implement the same high-performance System.String in IL, especially considering the assembly language used. I was just saying that there are many other things that can't be done in C# but are possible in .NET, using other languages.

BTW there is a typo: "passed in my the calling code" should probably be "passed in by the calling code".

I wonder if the new C# 6 string interpolation also uses StringBuilder like String.Format does, e.g.

  $"Name = {name}, hours = {hours:hh}"

This post seems to suggest it uses concatenation, which may be bad for performance in some situations due to the immutability.


I just did a quick test with Reflector and it looks like string interpolation is turned into String.Format(..) calls, which in turn will use StringBuilder. This makes sense as String.Format(..) understands all the string formatting placeholders.

> BTW there is a typo: "passed in my the calling code" should probably be "passed in by the calling code".

Thanks for spotting that, I'll fix it in a bit

I'm not sure if you could replicate System.String, because the oddity is happening at the metadata level: apart from special cases like String, .NET types must have a compile-time constant length (if I remember correctly).

However, at the instruction level you can do some pretty awesome stuff. This slice[1] library is able to verifiably and safely grab a slice of data out of any arbitrary structure by using an MSIL feature that was invented for C++/CLI. The call site[2] is not unsafe.

[1]: https://github.com/joeduffy/slice.net/blob/master/src/PtrUti...

[2]: https://github.com/joeduffy/slice.net/blob/master/src/Slice....

Yeah the slice stuff is pretty nice, it's actually now being developed on the CoreFX Labs[1], so it might eventually make it into the Core CLR.

[1]: https://github.com/dotnet/corefxlab/tree/master/src/System.S...

Given that IL needs to be expressive enough to represent C++/CLI, I would guess it is possible.

But C++/CLI results in a mixed-mode half-native, half-managed assembly. It doesn't compile down to only IL.

Not if you use the /clr:pure flag

Before the CLR was open sourced you could investigate how things were implemented by using decompilers on the MSIL code or, later, the shared source, but investigations often stopped at an InternalCall just as things started to get interesting. Now you can see the whole story of how abstractions are implemented, part in C# and part in C++, when the runtime needs to "bend the rules" of what C# would normally allow. The corert[0] project is interesting because a larger portion of the runtime is implemented in C# (or as compilation passes instead of runtime code generation). For example, you can find exception dispatch and type casting implemented in C# there [1].

One example from the article that may be a little off is getting the length of a string. It looks like the call to string.get_Length is probably replaced with a JIT intrinsic[2]. The importer then turns that intrinsic call into a GT_ARR_LENGTH[3] node that is eventually lowered into something like[4]:

    *(stringAddr + offsetOfStringLengthField)

It's a bit more complicated than that, as there are several references to GT_ARR_LENGTH in various optimization passes.
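As a rough illustration of what that lowered load amounts to, here is a C sketch. The `fake_string` struct, its field names and `string_length` are made up for this example and are not the CLR's actual definitions; the only point is that reading the length becomes a single load at a fixed offset from the object address, with no method call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a string object: header, length, then characters.
   Field names and sizes are assumptions for illustration, not CoreCLR's layout. */
struct fake_string {
    void    *type_handle;  /* placeholder for the object header */
    int32_t  length;       /* number of UTF-16 code units */
    uint16_t first_char;   /* in a real object the characters continue inline */
};

/* Equivalent of *(stringAddr + offsetOfStringLengthField): one load, no call. */
int32_t string_length(const void *string_addr)
{
    int32_t len;
    assert(string_addr != NULL);
    memcpy(&len,
           (const unsigned char *)string_addr + offsetof(struct fake_string, length),
           sizeof len);
    return len;
}
```

Passing the address of a `fake_string` whose `length` is 5 returns 5; nothing else in the object is touched.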

[0]: https://github.com/dotnet/corert/

[1]: https://github.com/dotnet/corert/tree/master/src/Runtime.Bas...

[2]: https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa18...

[3]: (warning, big file) https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa18...

[4]: (warning, big file) https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa18...

> Before the CLR was open sourced you could investigate how things were implemented by using decompilers on the MSIL code or, later, the shared source, but investigations often stopped at an InternalCall just as things started to get interesting. Now you can see the whole story of how abstractions are implemented, part in C# and part in C++, when the runtime needs to "bend the rules" of what C# would normally allow.

Yep, that's a large part of the motivation for me writing this post, although after looking at the source, I then went back to WinDBG and the Microsoft Symbol Server, which I could've done before it was open sourced!!

Thanks for the clarification on what the JIT is doing to optimise the string length. So far I've not managed to find the time to look at its source; it took me long enough to find my way round CoreCLR!

In the post, I sort-of hand-waved about what was happening there, it's nice to have some more concrete details.

Not necessarily. As long as it wasn't related to the JIT or the GC, one could always look up the SSCLI (ROTOR) code to understand how something was implemented. https://en.wikipedia.org/wiki/Shared_Source_Common_Language_...

That's a good point. It's fun to diff Rotor and CoreCLR to see how much has changed and what has stayed the same. The JIT and GC seem to not match up at all, but good chunks of the vm folder match up nicely.

Another point for this, besides optimization, is easier interop with existing technologies. The memory layout is identical to Windows' BSTR, as far as I know, and also includes a terminating U+0000. All of that makes it possible to marshal a CLR string as is in many circumstances without the need to copy or convert the characters around.
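To make the layout argument concrete, here is a C sketch of a length-prefixed, null-terminated UTF-16 buffer in the style of a BSTR, where a 4-byte byte count sits immediately before the characters. `lp_alloc`, `lp_byte_len` and `lp_free` are hypothetical helpers written for this comment (real BSTRs come from SysAllocString); the sketch only shows why a pointer to the first character can be handed to code that expects a plain null-terminated wide string.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate [4-byte byte count][UTF-16 units][U+0000]
   and return a pointer to the first unit, the way a BSTR pointer does. */
uint16_t *lp_alloc(const uint16_t *units, uint32_t count)
{
    assert(units != NULL);
    uint32_t byte_len = count * (uint32_t)sizeof(uint16_t);
    unsigned char *block = malloc(sizeof(uint32_t) + byte_len + sizeof(uint16_t));
    if (block == NULL)
        return NULL;
    memcpy(block, &byte_len, sizeof byte_len);                         /* length prefix  */
    memcpy(block + sizeof(uint32_t), units, byte_len);                 /* character data */
    memset(block + sizeof(uint32_t) + byte_len, 0, sizeof(uint16_t));  /* terminator     */
    return (uint16_t *)(block + sizeof(uint32_t));
}

/* Read the prefix stored just before the characters. */
uint32_t lp_byte_len(const uint16_t *s)
{
    uint32_t byte_len;
    memcpy(&byte_len, (const unsigned char *)s - sizeof(uint32_t), sizeof byte_len);
    return byte_len;
}

/* Free the whole block, starting from the hidden prefix. */
void lp_free(uint16_t *s)
{
    if (s != NULL)
        free((unsigned char *)s - sizeof(uint32_t));
}
```

Because the terminator is always present, a consumer that ignores the prefix still sees a well-formed string, and one that knows about the prefix gets the length without scanning.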

There is an amusing comment in the CLR about its support for odd-length BSTRs[0]. There can be an extra wchar after the null terminator to store the extra byte.

It looks like if you use the automatic BSTR marshaler it creates a copy[1], though the CLR is so big it's hard for me to say that with confidence. Taking the address of a string with the fixed statement in C# of course won't allocate, and that pointer can be passed to APIs that expect Unicode strings.

[0]: https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa18...

[1]: https://github.com/dotnet/coreclr/blob/a123c3972c447c8bcfa18...

I would think, especially for out parameters and return values, you would need to copy to interop with BSTRs because they are supposed to live on the COM heap (SysAllocString) and CLR wouldn't use that by default.

I love comments like that, it's almost a history lesson on VB and the CLR!

Then you'd probably love to know why the OLE DATE epoch is 1899-12-30...

Also, the B in BSTR is for Basic!


Yeah, that's a good point. I completely steered clear of talking about marshalling and/or interop (mostly due to lack of time); maybe I'll do a Part 2 and include that.

So, are there high-level languages that make it easy to write classes that include length-varies-by-instance member(s) in their memory layout?

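That could be a huge efficiency gain, because keeping the variable-length data inline with the object avoids the extra cache miss of following a pointer to a separate allocation.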
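For reference, the layout being asked about is what a C flexible array member gives you: fixed fields and a per-instance variable-length tail in one allocation. The `record` type and `record_new` below are invented for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: fixed fields followed by an inline, per-instance-sized tail. */
struct record {
    int    id;
    size_t len;     /* number of bytes in data[] */
    char   data[];  /* flexible array member (C99), must be the last field */
};

/* One allocation holds both the header fields and the payload. */
struct record *record_new(int id, const char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    struct record *r = malloc(sizeof *r + len);
    if (r == NULL)
        return NULL;
    r->id = id;
    r->len = len;
    if (len > 0)
        memcpy(r->data, bytes, len);
    return r;
}
```

A `record` is released with a single `free`, and `data` always starts at a fixed offset from the start of the object, so there is no second pointer to follow.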

Smalltalk has that:



Although see the footnote - "This is not always true for other Smalltalk implementations, who don't allow instance variables in variableByteSubclasses and variableWordSubclasses". So, in some implementations, you can have either normal fields or array slots, but not both. That said, I think if you have a base class with fields, you can always make a subclass with slots.

Rust has unsized (aka dynamically-sized) types; trait objects and slices are probably the most commonly encountered DSTs.

There are two main limitations to them:

* by default most generic code cannot operate over them; generics require the special `?Sized` bound to allow DSTs (essentially, generic parameters have an implicit `Sized` bound)

* Rust doesn't currently support dynamic stack allocation so DSTs can't be stack-allocated (https://github.com/rust-lang/rfcs/issues/618)

There is a proposal for Java: http://objectlayout.org/

Thanks for asking this, I didn't even think about other languages and whether they made this possible.

I imagine Eiffel, Modula-3 and Ada are possible candidates, but would need to research it.

Great post. So the follow-up question is, how much work would be involved in mirroring all that in a native UTF-8 encoded string type? :-)

Windows interop, and 8/16 conversions, would obviously be an expense, but ~half the storage requirements of UTF-16 have to represent a substantial CPU/RAM saving.

Guess it's too much cruft and complexity to introduce into .NET now, but who knows? https://twitter.com/terrajobst/status/717935598904807424

Yeah I think the legacy interop is the tricky part, see "Why does C# use UTF-16 for strings?" (http://blog.coverity.com/2014/04/09/why-utf-16/#.V1AT9vkguUl) for example.

Interestingly enough, Java just implemented compact strings that can be stored either as ISO-8859-1/Latin-1 (one byte per character) or as UTF-16 (two bytes per character). See http://openjdk.java.net/jeps/254 and https://www.infoq.com/news/2016/02/compact-strings-Java-JDK9 for more info.
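The idea behind compact strings can be sketched in a few lines of C: pick a one-byte encoding when every UTF-16 code unit fits in Latin-1, otherwise keep two bytes per unit, and record the choice in a flag. The names here (`compact_str`, `compact_new`, `compact_byte_size`, `CODER_LATIN1`, `CODER_UTF16`) are made up for the example and are not taken from the JDK sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum coder { CODER_LATIN1 = 0, CODER_UTF16 = 1 };

/* Hypothetical compact string: a flag saying how wide each character is,
   followed by the bytes themselves. */
struct compact_str {
    enum coder    coder;
    size_t        length;   /* in UTF-16 code units */
    unsigned char bytes[];  /* length * 1 or length * 2 bytes */
};

struct compact_str *compact_new(const uint16_t *units, size_t length)
{
    assert(units != NULL || length == 0);

    /* Latin-1 is only usable if no code unit exceeds 0xFF. */
    enum coder coder = CODER_LATIN1;
    for (size_t i = 0; i < length; i++) {
        if (units[i] > 0xFF) {
            coder = CODER_UTF16;
            break;
        }
    }

    size_t width = (coder == CODER_LATIN1) ? 1 : 2;
    struct compact_str *s = malloc(sizeof *s + length * width);
    if (s == NULL)
        return NULL;
    s->coder = coder;
    s->length = length;

    if (coder == CODER_LATIN1) {
        for (size_t i = 0; i < length; i++)
            s->bytes[i] = (unsigned char)units[i];  /* narrow each unit */
    } else if (length > 0) {
        memcpy(s->bytes, units, length * 2);        /* keep UTF-16 as is */
    }
    return s;
}

/* Storage used by the character data alone. */
size_t compact_byte_size(const struct compact_str *s)
{
    return s->length * (s->coder == CODER_LATIN1 ? 1 : 2);
}
```

For text that is mostly ASCII or Western European, this halves the character storage at the cost of a branch on the flag whenever characters are read.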

This makes me wonder..

Can one extend the CLR (modularly) and access the extension via extern and [MethodImplAttribute(MethodImplOptions.InternalCall)]?

Native UTF-8 is an obvious example, but the scope could be far greater, such as native interop for a hardware device (IoT), funky filesystems or hypervisors?

Good question, I don't know how much (if at all) it can be extended.

Those [MethodImplAttribute(MethodImplOptions.InternalCall)] calls are still within the CoreCLR codebase; they just cross from the managed part to the unmanaged part.
