
Query-Based Compiler Architectures - matt_d
https://ollef.github.io/blog/posts/query-based-compilers.html
======
sly010
I have some negative feelings about this trend (of increased integration in
compilers), but I can't quite put my finger on the reason.

Before the language server idea came along all compilers were pure functions.
Dependency management and caching were the responsibility of the build system.
A flexible build system could handle things the language designers hadn't
thought of, like code generation, mixed-language projects, linking between
different languages, etc. Things are very composable and extendable, and
everything can be modeled as a DAG.
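The build-system-as-DAG model described above can be sketched like this (a toy Python illustration of my own, with made-up stand-ins for the compiler and linker, not any real build tool):

```python
# Toy sketch: a build modeled as a DAG of pure steps, with caching done
# by the build system rather than the compiler.

def compile_unit(source: str) -> str:
    """Stand-in for a pure compiler: same source text in, same object out."""
    return f"obj({source})"

def link(objects: list) -> str:
    """Stand-in for a linker combining object files."""
    return "bin[" + "+".join(objects) + "]"

# The DAG: each target lists the step that produces it and its inputs.
dag = {
    "a.o": (compile_unit, ["a.c"]),
    "b.o": (compile_unit, ["b.c"]),
}

cache = {}  # target -> (inputs it was built from, cached output)

def build(target: str, sources: dict) -> str:
    step, deps = dag[target]
    inputs = tuple(sources[d] for d in deps)
    hit = cache.get(target)
    if hit and hit[0] == inputs:   # inputs unchanged: skip recompilation
        return hit[1]
    out = step(*inputs)
    cache[target] = (inputs, out)
    return out

sources = {"a.c": "int a;", "b.c": "int b;"}
print(link([build("a.o", sources), build("b.o", sources)]))
# -> bin[obj(int a;)+obj(int b;)]
```

Because each step depends only on its declared inputs, the caching decision lives entirely outside the compiler, which is the separation of responsibilities the comment argues for.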

With language servers and incremental compilation becoming part of some
long-running compiler process, the responsibilities are blurred. It all leads
to increased integration and less flexibility, and when things break, you will
not be able to tell why.

Aren't we giving up too much to get faster recommendations in IDEs?

~~~
ratmice
The assertion "all compilers were pure functions" is a strange one, because it
is almost entirely backwards.

The purity of compilers was abandoned almost immediately: when they started
creating a file a.out and writing to that instead of writing binaries to
stdout, when #include was added to the C preprocessor, and when the assembler
gained the .incbin directive. If compilers were pure, there would be zero need
for Makefile-style build systems which stat files to see if they have changed.

While it's true that Makefiles and their ilk are modeled as a DAG, the only
reason an external file/DAG is actually necessary is the impurity of the
compilation process.

There have been very few compilers with even a relatively pure core (TeX is
the only one I can actually think of). Language servers are, if anything,
moving them to a more pure model, simply because they send sources through
some file descriptor rather than having to construct some graph out of
filenames.

Long story short: "purity" in the sense of a compiler means a function from
source text -> binary text. "foo.c" is not source text, and a bunch of errors
is not binary text.

At least language servers take in source text as input.

~~~
rumanator
> the purity of compilers was abandoned almost immediately (when they started
> creating a file a.out and writing to that instead of writing binaries to
> stdout

I don't understand your point. A function doesn't cease to be a function if it
sends its output somewhere else.

> and in the c-preprocessor when #include was added,

The C preprocessor is not the compiler. It's a macro processor that expands
all macros to generate the translation unit, which is the input that compilers
use to generate their output.

And that is clearly a function.

> If compilers were pure, there would be zero need for Makefile style build
> systems which stat files to see if they have changed.

That assertion makes no sense at all. Compilers take source code as input and
output binaries. That's it. The feature you're mentioning is just a convenient
trick to cut down build times by avoiding recompiling source files that
haven't changed. That's not the responsibility of the compiler. That's a
function whose input is the source files' attributes and whose output is a DAG
of files, used to run a workflow in which each step invokes the compiler on a
specific source file to generate a binary.

It's functions all the way down, but the compiler is just a layer in the
middle.

> while Makefiles and their ilk are modeled as a dag is true, The only reason
> an external file/dag is actually necessary is due to impurity in the
> compilation process.

You have it entirely backwards: build systems exist because compilers are pure
functions with specific and isolated responsibilities. Compilers take source
code as input and generate binaries as output. That's it. And they are just a
component in the whole build system, which comprises multiple tools that are
designed as pure functions as well.

~~~
ratmice
> I don't understand your point. A function doesn't cease to be a function if
> it sends it's output somewhere else.

I think here lies the miscommunication: I'm talking about pure functions. It
doesn't cease to be a function, but it does cease to be a pure one if sending
its output somewhere else is done by side effect.

~~~
haakonhr
I guess there is pure and then there is pure: pure in the sense of no side
effects at all (such as writing to a file), and pure in the sense of not
relying on state.
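The two senses can be sketched like so (a toy Python illustration of my own; all names are made up):

```python
# Toy illustration of the two senses of "pure": no side effects at all,
# vs. not depending on hidden state.

counter = 0  # hidden mutable state

def fully_pure(x: int) -> int:
    # neither reads nor writes anything outside its argument
    return x * 2

def reads_state(x: int) -> int:
    # performs no writes, but its result depends on `counter`
    return x + counter

def writes_state(x: int) -> int:
    # returns a value determined by x alone, but mutates `counter`
    global counter
    counter += 1
    return x * 2

print(fully_pure(3))   # always 6, no matter what happened before
```

A compiler that writes a.out fails the first sense; one that stats files on disk to decide what to reread fails the second.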

------
hobo_mark
Reminded me of this lecture from last year:

Responsive compilers - Nicholas Matsakis - PLISS 2019

[https://youtube.com/watch?v=N6b44kMS6OM](https://youtube.com/watch?v=N6b44kMS6OM)

(Of course it's based on Rust, but the same principles would be applicable
elsewhere)

~~~
fluffything
The blog post does cite salsa, the framework that was created to build Lark,
the language created to prototype Rust's implementation of query-based
compilation.

[https://github.com/lark-exploration/lark](https://github.com/lark-exploration/lark)

[https://github.com/salsa-rs/salsa](https://github.com/salsa-rs/salsa)

------
_bxg1
Really interesting ideas, though I think the content is hampered a bit by the
use of what I think is Haskell in the examples. It hinders accessibility when
you're having to learn two unrelated things in parallel, and I don't think the
average audience for this piece can be expected to know Haskell well enough to
follow along effectively.

~~~
vajrabum
This is from a blog that has 2 posts. Both discuss a compiler for an
experimental dependently typed language. That's a fairly specialized topic.

------
mshockwave
Good...but why does the author think modern compilers / language servers DON'T
do caching? Or, put another way: why does the author think the caching
mechanisms in modern compilers / language servers are insufficient? I think
the author is proposing a caching mechanism with finer granularity and
designing the whole system around this idea from day one. But first, many
components in LLVM have been doing caching for a long time (e.g. OrcJIT has a
pretty decent caching layer, and libTooling also supports incremental parsing
with AST caches). Second, what is the memory overhead of this fine-grained
caching design when it faces larger real-world input programs (e.g. an OS
kernel)? Does it scale well?

I know it's refurbished from an old post, so I probably shouldn't be so harsh.
But it would be better to compare the old ideas against state-of-the-art work
and find some insights, rather than doing pure archaeology.

~~~
matklad
I am not the author, but work in the same domain. Empirically, existing
compilers are impossible to turn into good IDEs, unless the language has
header files and forward declarations. Otherwise, you get a decent IDE by
doing one of the following:

* writing two compilers (C#, Dart, Java, and, in some sense, every major language supported in JetBrains tooling)

* starting with an IDE compiler from the start (Kotlin & TypeScript)

The two examples where I’ve heard a batch compiler was successfully used to
build a language server are C++ and OCaml (I haven’t tried these language
servers myself, though). Curiously, they both use header files, which means
that it’s the user who does the fine-grained caching.

I don’t see how caching in LLVM is relevant to the task of building LSP
servers.

In terms of prior art, I would suggest studying how IntelliJ works,
specifically looking at the stubs:

[https://www.jetbrains.org/intellij/sdk/docs/basics/indexing_...](https://www.jetbrains.org/intellij/sdk/docs/basics/indexing_and_psi_stubs.html)

~~~
seanmcdirmid
There are ways to turn batch compilers into incremental IDE compilers with
some tree and effect caching on the periphery of the compiler. You can even go
all the way up to the eval level for a full live programming environment. You
don’t need module signatures if you are willing to trace dependencies
dynamically.

See [https://www.microsoft.com/en-us/research/publication/program...](https://www.microsoft.com/en-us/research/publication/programming-with-managed-time/).
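The "trace dependencies dynamically" idea above can be sketched roughly like this (my own toy Python, not code from the linked paper): each query records which inputs it actually read, and a cached result is reused as long as every traced input is unchanged.

```python
# Toy sketch of dynamically traced dependencies: only results whose
# traced inputs changed are recomputed.

inputs = {"a.src": "fn a", "b.src": "fn b"}
versions = {k: 0 for k in inputs}           # bumped on every edit
cache = {}                                  # query key -> (deps seen, result)

def read(key, trace):
    trace.add(key)                          # record the dependency as it happens
    return inputs[key]

def parse(key):
    hit = cache.get(("parse", key))
    if hit and all(versions[d] == v for d, v in hit[0].items()):
        return hit[1]                       # every traced dep unchanged: reuse
    trace = set()
    result = f"ast({read(key, trace)})"
    cache[("parse", key)] = ({d: versions[d] for d in trace}, result)
    return result

print(parse("a.src"))                       # computed: ast(fn a)
inputs["b.src"] = "fn b2"; versions["b.src"] += 1
print(parse("a.src"))                       # cache hit: editing b.src
                                            # doesn't invalidate a.src
```

Because dependencies are discovered at run time rather than declared up front, no module signatures or header files are needed to know what a result depended on.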

