
How Microsoft rewrote its C# compiler in C# and made it open source (2017) - nudpiedo
https://medium.com/microsoft-open-source-stories/how-microsoft-rewrote-its-c-compiler-in-c-and-made-it-open-source-4ebed5646f98
======
_hardwaregeek
Roslyn's parser and syntax tree are pretty amazing. You can recreate the
precise source text from the parse tree up to whitespace. This sort of
"bijective parsing" is truly incredible and probably one of the cooler
innovations I've seen in parsing technology. I can see a bunch of really
interesting ideas that you could do with bijective parsing. For instance,
imagine Rails style boilerplate generation but done at a semantics aware,
within file basis. You could conceivably have a code generator that finds a
class, introspects if it has the corresponding method, then generates it if
not.

Or imagine syntax reformatting but purely locally. Or semantically aware git
diffs that can actually compare the underlying parse trees instead of just raw
text. There's so much cool stuff you can do. I wish every language had a
parser like this.
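
To make the round-trip idea concrete, here is a minimal Python sketch. It is not Roslyn's implementation: the `Token` class, the `lex`, `to_source` and `rename` helpers, and the toy token rules are all made up for illustration. The only point is that if every token keeps the whitespace and comments in front of it ("trivia"), the original text can be rebuilt exactly, and a local edit leaves all other formatting alone.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    leading_trivia: str  # whitespace and // comments that precede the token
    text: str            # the token itself ("" for the end-of-file token)


# Toy lexical rules (assumptions for this sketch, not C#'s real grammar).
TRIVIA_RE = re.compile(r"(?:\s+|//[^\n]*)*")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def lex(source):
    """Split source into tokens without throwing any character away."""
    tokens, pos = [], 0
    while True:
        trivia = TRIVIA_RE.match(source, pos)  # always matches, possibly empty
        pos = trivia.end()
        if pos == len(source):
            # A final empty token carries any trailing trivia.
            tokens.append(Token(trivia.group(0), ""))
            return tokens
        tok = TOKEN_RE.match(source, pos)
        tokens.append(Token(trivia.group(0), tok.group(0)))
        pos = tok.end()


def to_source(tokens):
    """Rebuild the exact original text from the token list."""
    return "".join(t.leading_trivia + t.text for t in tokens)


def rename(tokens, old, new):
    """A purely local edit: change matching tokens, keep all trivia as-is."""
    return [replace(t, text=new) if t.text == old else t for t in tokens]


src = "class  Foo {\n  // note\n  int x ;\n}\n"
assert to_source(lex(src)) == src
print(to_source(rename(lex(src), "Foo", "Bar")))
```

The odd spacing and the comment survive the rename untouched, which is what makes formatting-preserving refactorings and tree-based diffs feasible.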

~~~
Sharlin
Pardon me if I'm missing something, but doesn't just about any AST
implementation allow recovering the source text up to whitespace – or indeed
including whitespace, given that most real-world parsers have to retain
lexical information in order to support reasonable diagnostic messages anyway?

~~~
Karliss
There is a reason an AST is called an abstract syntax tree, not just a syntax
tree. Many syntax details, like parentheses or the choice between alternative
syntax forms, are there only to avoid syntactic ambiguity, to improve
readability, or for historical reasons, and don't affect the semantics or even
error messages. It's not surprising for compiler authors to discard some of
this unnecessary information as early as possible to save memory.
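
As a small illustration of that point, here is a sketch using Python's standard `ast` module as a stand-in for "a typical abstract syntax tree" (it says nothing about how any C# compiler behaves). Redundant parentheses and comments never reach the tree, so the original text cannot be recovered from it. `ast.unparse` needs Python 3.9 or later.

```python
import ast

with_parens = "x = (a + b)  # sum"
without_parens = "x = a + b"

# Both spellings produce structurally identical trees.
same_tree = ast.dump(ast.parse(with_parens)) == ast.dump(ast.parse(without_parens))

# Printing the tree back out yields the normalized form: the parentheses
# and the comment are gone.
regenerated = ast.unparse(ast.parse(with_parens))

print(same_tree)
print(regenerated)
```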

~~~
lonelappde
It's called an abstract syntax tree because the concrete syntax is not a tree,
it's a list of characters. Not because it makes _semantic_-invariant
transformations.

~~~
yatac42
> It's called abstract syntax tree because the concrete syntax is not a tree,
> it's a list of characters.

It's called an abstract syntax tree to distinguish it from parse trees (a.k.a.
concrete syntax trees).

------
ablekh
How would you compare special features, if any, as well as advantages (and
disadvantages) of using C# for developing an _embedded_ DSL versus using Julia
macros for the same purpose versus using specialized toolsets (e.g., MPS) for
developing an _external_ DSL? Please note that I'm aware of the Modeling SDK
for Visual Studio. However, since it only allows integration with / targets
the Visual Studio environment, it is not a good general approach, hence the
question.

~~~
nudpiedo
Maybe look at F#; it compiles for Android, iOS, Windows, Linux, macOS,
JavaScript frontends and server side. The feature set is different from
Julia’s, but it is very powerful and versatile.

~~~
ablekh
Thank you for the suggestion. Will definitely take a more detailed look at F#.
I have read about it some time ago, including some very positive general
feedback. However, even if F# is an excellent fit feature-wise, I can see two
potential issues: 1) lack of a decent package ecosystem (since the planned DSL
is only one part of multi-feature multi-aspect platform puzzle) and 2) lack of
a significant enough pool of experienced developers, which would make building
a good team a challenge (due to its relative popularity, C# is IMO much better
than F# in this regard).

------
nudpiedo
I find it amusing how Microsoft went from being the closed source reference
point of the industry to such an open source player. It now even allows hooking
some of its tools and platforms into competing platforms and tools.

~~~
m0xte
Their intent is still entirely market share. We need to be vigilant about how
they get there. History has taught us a lot of lessons we seem to have
forgotten because of a shiny new layer of marketing. There is still a massive
cultural and technical impedance mismatch.

~~~
kqr
I agree. A lot of what they are doing now looks like EEE when you peel back a
couple of layers.

I say this as someone who happily uses a lot of the stuff they produce in the
process. I really, really, intensely hope I'm wrong and my fears aren't
realised.

~~~
JamesBarney
It's the same corporate entity, but Microsoft's incentives have changed and the
majority of executives have turned over. So I don't see any reason why
Microsoft would be any more likely to EEE than another company. In fact I think
they would be less likely because they have more to risk reputationally.

------
swasheck
I've always loved C# but can't quite get past having to target specific .NET
frameworks and keeping the different frameworks on my different servers
straight and targeting each of them differently. Maybe that'll change with
.NET Core (and once I'm able to get that on my servers), but for now I've
discovered a personal love for Go as a way around this for my small utilities.

~~~
WorldMaker
.NET Core 3.1 is a good LTS release and no time like the present to start
migrating to it. Unlike .NET Framework, there's a lot less emphasis in .NET
Core on machine-wide installs and a lot more capabilities for application-
specific framework deployments. .NET Core even has tools to bundle all of your
framework dependencies into a single "EXE", and tree-shaking that down to at
least something of a minimal bundle size. There's a small but growing world of
"go-like" small utilities that bundle entirely self-contained .NET Core
dependencies.

[https://www.hanselman.com/blog/MakingATinyNETCore30EntirelyS...](https://www.hanselman.com/blog/MakingATinyNETCore30EntirelySelfcontainedSingleExecutable.aspx)

There are even fun experiments with AOT compiling to get interesting "EXE golf"
results from .NET Core applications, such as getting them below 8 KB or running
on Windows 3.1 (because why not):

[https://www.hanselman.com/blog/NETEverywhereApparentlyAlsoMe...](https://www.hanselman.com/blog/NETEverywhereApparentlyAlsoMeansWindows311AndDOS.aspx)

.NET 5 will integrate further AOT capabilities as the Mono world is merging
in, in addition to the raw marketing advantage that 5 > 4 for anyone still
struggling to convince non-technical managers that .NET Core is a better
investment in 2020 than .NET Framework.

------
jaked89
While the rewrite enabled faster development of the language as a whole, it
also gradually destroyed the IDE's performance. The editor in VS 2019 is
simply unworkable.

I blame this directly on the immutable AST. While a nice concept in theory, it
causes too many allocations, and is cumbersome to work with.
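
For anyone wondering where the allocations come from, here is a minimal Python sketch of editing an immutable tree by path copying. It is a generic persistent-tree illustration, not Roslyn's actual data structure; `Node` and `replace_at` are made-up names. Every edit allocates a new node for each level between the root and the changed node, while untouched subtrees are shared.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    kind: str
    children: Tuple["Node", ...] = ()


def replace_at(root, path, new_node):
    """Return a new tree with the node at `path` replaced.

    `path` is a tuple of child indexes from the root. The old tree is left
    untouched; only the nodes along the path are re-created.
    """
    if not path:
        return new_node
    i = path[0]
    new_child = replace_at(root.children[i], path[1:], new_node)
    children = root.children[:i] + (new_child,) + root.children[i + 1:]
    return Node(root.kind, children)  # one fresh allocation per level


tree = Node("block", (Node("decl"), Node("return", (Node("name"),))))
edited = replace_at(tree, (1, 0), Node("literal"))

print(tree.children[1].children[0].kind)      # old tree still says "name"
print(edited.children[1].children[0].kind)    # new tree says "literal"
print(edited.children[0] is tree.children[0])  # untouched subtree is shared
```

The upside is that old snapshots stay valid for anything still analyzing them; the cost is a steady stream of short-lived objects on every keystroke.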

I predict another rewrite in 2 or 3 years.

~~~
alkonaut
I had to ditch ReSharper to get to 2019 because together with the performance
of the IDE itself, it just wasn't usable. R# ate gigs of RAM and Roslyn does
the same. It's not surprising since they basically do the same thing, in
managed code! But I can't pay the CPU time and memory to analyze my code TWICE
on every edit. I also suspect things like switching build configuration offers
thousands of opportunities to have some IDE widget hold references to old
compilation data structures which are never garbage collected. On the bright
side at least 2019 works pretty well without R#.

~~~
to11mtm
> I had to ditch ReSharper to get to 2019 because together with the
> performance of the IDE itself, it just wasn't usable. R# ate gigs of RAM and
> Roslyn does the same. It's not surprising since they basically do the same
> thing, in managed code! But I can't pay the CPU time and memory to analyze
> my code TWICE on every edit.

I'm holding off on 2019 as much as I can. Between 'forcing' an upgrade for
.NET Core 3.0 and the fact ReSharper slows it down too much, I decided to give
Rider a try.

I'm finding myself not missing VS a whole lot; on one hand Rider is taking way
more RAM to start and load, but it stays pretty constant after the first debug
session, winds up staying under VS for memory on longer loads (Especially if
I've got multiple solutions open) and it's smoother than VS the whole time.

~~~
mycall
Is Rider 64-bit?

~~~
ygra
It is, but that's not too relevant, as the ReSharper component runs in another
process (but that's managed code as well, so probably also runs as a 64-bit
process).

------
fultonfaknem42
I still can't believe more people aren't leveraging the Roslyn APIs to write
compiler extensions or additional tools around C#. It's conceptually powerful.

~~~
eropple
Personally speaking, I'd love to, but I'd then have to figure out how to make
them work with an IDE. I'm very tired of waiting for record classes to show up
and I could have written a (to be clear: _inferior_ ) set of stuff around
regular classes that magics one into "everything is readonly and we
autogenerate a `copy` method", much like Kotlin does for its data
classes...but my IDE isn't gonna understand it unless I do a lot more work,
and so I never bothered.
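
For readers who haven't seen the Kotlin feature, the behaviour being described ("everything is readonly" plus a generated copy method) looks roughly like this in Python, where frozen dataclasses provide it out of the box. This is only an analogy to show the semantics, not a C# or Roslyn solution; `Point` is a made-up example type.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Point:
    x: int
    y: int


p = Point(1, 2)

# "copy with changes": builds a new object, the original is untouched.
q = replace(p, y=5)

# Fields are read-only after construction.
try:
    p.x = 10
    mutated = True
except FrozenInstanceError:
    mutated = False

print(p, q, mutated)
```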

~~~
moron4hire
I'm not sure that it would be that difficult with Roslyn and Visual Studio
these days. There _are_ Roslyn-based syntax-highlighters and linters and
snippet-generators and code-transformers. I use one that not only makes sure
all of my code is formatted the way I want it, but also auto-inserts "readonly"
modifiers for fields that don't get reassigned outside of the constructor, one
that treats not implementing IDisposable correctly as an error (it can also
track the lifetime of a Disposable object and warn when it detects that it
never gets disposed, which is super cool), and another one that
rainbow-highlights code blocks. The tooling is there to support just such a
thing; it just needs someone to put all the pieces together.

~~~
thrower123
What is the name of that IDisposable checking one? That sounds very, very
useful.

~~~
moron4hire
I think the easiest way to set it all up is to manually edit your CSProj
files, especially if you are still building .NET Framework projects. By
default, Visual Studio will only create the new SDK Style project file format
for .NET Standard and .NET Core projects, but it's still usable for .NET
Framework projects if you manually change the format. Once you change it, it
sticks, so you can use VS to edit the config after that, but it's still pretty
easy to edit by hand now.

So here is my base project config:
[https://github.com/capnmidnight/Juniper/blob/master/Juniper....](https://github.com/capnmidnight/Juniper/blob/master/Juniper.props)

The most important part is the first PropertyGroup, which sets values for all
build configs, in particular setting LangVersion to 8.0. Framework 4.8
defaults to C# 7.3, but you can use most of the C# 8.0 features, including
async streams, if you manually set the language version. Features that aren't
available are some minor things like the array ranges and indexing:
[https://docs.microsoft.com/en-us/dotnet/csharp/language-refe...](https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/proposals/csharp-8.0/ranges)

And here is my base targets file with the analyzers I use:
[https://github.com/capnmidnight/Juniper/blob/master/Juniper....](https://github.com/capnmidnight/Juniper/blob/master/Juniper.targets)

They're all ones provided directly from Microsoft, though there are a bunch
more from other vendors:
[https://www.nuget.org/packages?q=analyzer](https://www.nuget.org/packages?q=analyzer)

Then here is an example project using that targets file:
[https://github.com/capnmidnight/Juniper/blob/master/src/Juni...](https://github.com/capnmidnight/Juniper/blob/master/src/Juniper.Server/Juniper.Server.csproj)

You can see just how much the new SDK Style project file format simplifies
things. There is no importing of any base Targets files hidden deep in Visual
Studio's install directory anymore.

I manually import the .props and .targets files instead of using
Directory.Build.props and Directory.Build.targets because I have other
projects that use these configs, included via a git submodule.

Here is my .editorconfig file, where I set most of the rules related to
Disposable types to errors:
[https://github.com/capnmidnight/Juniper/blob/master/.editorc...](https://github.com/capnmidnight/Juniper/blob/master/.editorconfig)

And this Visual Studio extension makes .editorconfig files a lot nicer to work
with:
[https://marketplace.visualstudio.com/items?itemName=MadsKris...](https://marketplace.visualstudio.com/items?itemName=MadsKristensen.EditorConfig)

(BTW, I pretty much install all of Mads Kristensen's extensions)

And while I'm here, I'll give a shout-out to Viasfora for its syntax
highlighting modifications that rainbow-highlight code blocks:
[https://marketplace.visualstudio.com/items?itemName=TomasRes...](https://marketplace.visualstudio.com/items?itemName=TomasRestrepo.Viasfora)

And VSColorOutput for making the Output window in Visual Studio actually
readable:
[https://marketplace.visualstudio.com/items?itemName=MikeWard...](https://marketplace.visualstudio.com/items?itemName=MikeWard-AnnArbor.VSColorOutput)

------
chrisseaton
Why isn't the native code compiler also now written in C#, like Java is doing?

~~~
kevingadd
People have done research before into having the C#->native JIT be written in
C#, I recall seeing prototypes. There are a ton of barriers between a
prototype and a shipping implementation though, so I'm not sure we'll ever see
it. In particular you risk regressions in startup time or memory usage since
the amount of infrastructure needed to run C# (pre-jitted?) to generate all
your jitcode is much higher than a small blob of hand-written C/C++ that's
spitting out jitcode. See
[https://www.mono-project.com/news/2018/09/11/csharp-jit/](https://www.mono-project.com/news/2018/09/11/csharp-jit/)
for one example.

There's an old saying (from one of Unity's developers, I think?) that you
can't run Hello World in C# or Java without an XML parser because so much
configuration for things like locales ends up pulling in serialization
libraries, reflection, etc. Things have improved in this area for both
platforms but it's definitely still very difficult to trim managed code down
as far as you can trim native code.

For something as critical as the JIT you also want to keep the generated code
small (for cache efficiency) and if possible, PGO it - things that existing
managed code generators aren't especially great at compared to clang or modern
MSVC.

From my past experience writing/maintaining C# compiler and runtime code, I
would never bother trying to port the JIT to C#. I don't think the returns are
worth the massive investment. I'd sooner port it to a language like Rust for
safety benefits or to some sort of hypothetical language that enables
producing smaller/faster code to improve JIT performance. I'm not sure I'd
invest in that either though because for most workloads JIT time is not the
bottleneck (and you can optimize that out in many cases by pre-generating the
JITcode). EDIT: Also, for workloads where JIT is the bottleneck, it's
questionable whether it's possible to extract big gains out of optimizing the
JIT because of the nature of JIT workloads - you may just be bottlenecked on
memory bandwidth or instructions per clock. A JIT converting IR to machine
code is not trivially vectorized.

People are figuring this out (again, from scratch) now with WebAssembly as all
the modern browsers go through the churn and angst involved in answering the
question 'can we actually JIT this whole 50 MB executable from scratch at load
time?' even though we already knew the answer back when WebAssembly wasn't a
spec yet.

[DISCLOSURE: I get paid to work on Mono right now and was paid to help draft
the initial WebAssembly spec, and before that my work on the JSIL MSIL->JS
compiler was sponsored. So I have some massive biases here.]

~~~
chrisseaton
The way Java does it is their JIT written in Java is optionally AOT compiled
to native code, so it doesn't have a startup time or warmup time problem and
it can be PGOd.

We do know that the returns are worth it, because people are able to develop
new optimisations in the Java version that people just won't attempt in the
C++ version because the code is so much harder to work with (but possibly just
due to the age of the C++ version), and it's achieving a 13% speedup over the
C++ version in practice at places like Twitter, which is worth millions of
dollars.

~~~
kevingadd
This is fascinating. Is the claim here that the C++ version could have never
been as fast as the current Java one or just that people have been optimizing
the Java one and getting improvements past the C++ one?

Also, it sounds like the old C++ JIT has to be maintained and shipped in the
event that the Java-based JIT isn't available at startup, right? If so that
makes sense as a stop-gap and it would be more of a tiered JIT, not a port.
Tiered JITs are definitely proven, successful technology.

~~~
chrisseaton
> Is the claim here that the C++ version could have never been as fast as the
> current Java one

The claim is that the work to add new optimisations to the old C++ code is so
difficult that people aren't prepared to do it. The Java code is easier to
write and debug, enough so that people are managing to add new optimisations
that they haven't added to the C++ code.

> the old C++ JIT has to be maintained and shipped in the event that the Java-
> based JIT isn't available at startup

Well only until the new Java JIT is mature. The AOT build of the Java JIT can
be shipped in the binaries so it's always there.

> If so that makes sense as a stop-gap and it would be more of a tiered JIT,
> not a port. Tiered JITs are definitely proven, successful technology.

It's used as a new top-tier yes, replacing the C++ top tier. The C++ JIT is
likely to go I think, in the medium-to-long term.

------
The_rationalist
It should be noted they are now contributing to openjdk too!

------
pcunite
If .NET/C# could output native binaries for Windows Desktop apps, I could
seriously think about switching to C# over C++. How do people deal with
securing their source code otherwise? Quite trivial to get the source from a
decompiled C# exe file.

~~~
LeifCarrotson
It's quite trivial to get the source from a C# exe file in the same way as a
decompiler makes it trivial to get the source from a C++ exe file.

Obfuscators are common if you're concerned about the symbols leaking.

If you don't include the .PDB debugging symbol files it's much the same as
native binary code. The .NET virtual machine code is a little more expressive
but is superficially similar to x86 native code.

In my opinion, if you're concerned about people reading your compiled machine
code, the only solution is to run your app on a server and give users an API.

~~~
pcunite
A C# binary can be decompiled back to easy to read source code. Show me how
this can be done with a compiled C++ binary. It is a valid concern.

~~~
pjmlp
With IDA, HexRays, Hopper and a couple of Python scripts.

