
D compilation is too slow and I am forking the compiler - ingve
https://blog.thecybershadow.net/2018/11/18/d-compilation-is-too-slow-and-i-am-forking-the-compiler/
======
szemet
It is interesting to use fork() to snapshot application state. It seems so
high-level and powerful (compared to other C/POSIX-level stuff; I'm not
comparing it to Scheme/Lisp/"your favorite high/low level environment" here!).
We may collect some other examples here, just for fun:

\- I know that Redis does this too (RDB snapshots): while the child process
saves the actual snapshot, the parent can operate continuously, relying on
the system-provided copy-on-write mechanism

\- I have also found this:
[https://github.com/thomasballinger/rlundo](https://github.com/thomasballinger/rlundo)
\- it overrides the readline library to get an undo function in any REPL that
uses it
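
The core pattern is easy to sketch. A hypothetical Python example in the
spirit of the Redis RDB save (names are made up; the child sees memory frozen
at fork time thanks to copy-on-write):

```python
import os
import tempfile

state = {"counter": 0}
snap_path = os.path.join(tempfile.gettempdir(), "counter.snap")

def snapshot(path):
    # fork(): the child gets a copy-on-write view of memory, frozen at
    # this instant, and serializes it while the parent keeps running.
    pid = os.fork()
    if pid == 0:
        with open(path, "w") as f:
            f.write(str(state["counter"]))
        os._exit(0)  # child exits without running parent cleanup
    return pid

state["counter"] = 42
child = snapshot(snap_path)
state["counter"] = 99        # the parent mutates freely in the meantime...
os.waitpid(child, 0)
with open(snap_path) as f:
    print(f.read())          # → 42: the child saw the frozen copy
```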

Other examples?

~~~
chrisseaton
There is a time-travelling debugger that forks the process at each statement,
allowing you to go back to a previous statement and continue from that point.

~~~
adtac
Very interesting. Do you have a link where I can read more about this?

~~~
CyberShadow
Mozilla's rr (mentioned in the article) is a time-traveling debugger, but as
far as I know it doesn't use fork() for snapshots / time-travel.

~~~
roca
rr does use fork() internally to create checkpoints, and checkpoints are
created implicitly as the program executes to support time-travel. The COW
behaviour of fork() is extremely helpful for performance.

There's more to checkpoints than just fork(), though; for example, when two
processes are sharing memory, checkpointing the process group fork()s both
processes and then must explicitly duplicate the shared memory region,
otherwise the checkpoint would share memory with the original, which would be
disastrous. This means we don't get COW behaviour for shared memory regions;
unfortunately, the Linux/POSIX APIs are a bit deficient here.

------
modeless
I used this forking trick to speed up the ninja build system too:
[https://github.com/ninja-build/ninja/pull/1438](https://github.com/ninja-build/ninja/pull/1438)

Unfortunately the maintainers didn't like the idea so it won't be merged. But
if your project is large enough that ninja takes more than a second to parse
your build files every time you build (true for Chromium and Android at
least), you might want to try the patch.

On the topic of Windows forking, the RtlCloneUserProcess function does exist
and it works. The problem is that Microsoft's implementation of the Win32 API
keeps state (such as an IPC connection to csrss.exe) that is invalid in the
forked process and there's no easy way to clone or reinitialize that state.
The child process will run until its next Win32 API call and then probably
die. Since almost every Windows program uses Win32 APIs, RtlCloneUserProcess
is not useful for forking Windows programs.

Microsoft could fix this if they wanted to, at least for a useful subset of
Win32. But I imagine it requires a lot of hacking in deep, dark parts of
Windows that nobody wants to touch.

~~~
dblohm7
I have successfully reinitialized enough Win32 in a forked process to be able
to throw up a MessageBox, but without source I'd never be able to confidently
state that every Win32 call would work correctly.

~~~
modeless
Cool, is your implementation public somewhere? I'd be interested to see it.

------
WalterBright
Back in the 1970s, I was fascinated by the (now famous) ADVENT game and how
it worked, so I studied the Fortran source code.

When started up, it would do all kinds of slow (at the time) initialization of
its data structures. Then the genius move kicked in - it saved the memory
image to disk as the executable file! Next time it ran, it was already
initialized and started up instantly. It blew my mind.

I liked the idea so much that years later, when I worked on the text editor I
use (MicroEmacs), I did the user configuration the same way. Just change the
global variables, then patch the executable file on the disk with the changed
values. It was so, so much simpler than defining a configuration file,
serializing the data, writing/reading the file, and deserializing it. It
worked especially great on floppy disk systems, which were incredibly slow.

But then came viruses, malware, etc. Patching the exe file became a huge
"malware alert", and besides, Microsoft would mark the executable as "read
only" while it was running. Too bad.

~~~
wging
GNU Emacs does a similar-sounding trick in its build process (called 'dumping'
/ 'unexec'), but not at runtime.

[https://lwn.net/Articles/707615/](https://lwn.net/Articles/707615/)

Come to think of it, I think there are applications to my own work... I'd have
to be careful, though.

~~~
caf
If you're looking at implementing something along those lines these days, take
a look at CRIU:
[https://www.criu.org/Main_Page](https://www.criu.org/Main_Page)

------
AndyKelley
Impressive demo and lovely writeup. That was fun to watch.

I've been working on building the Zig self-hosted compiler this way from the
ground up, except with reference counting the cached stuff rather than using
fork(). This lets me do the caching at the _function_ level, even more fine
grained than the file level. Here's a 1min demo:
[https://www.youtube.com/watch?v=b_Pm29crq6Q](https://www.youtube.com/watch?v=b_Pm29crq6Q)

------
spatz
Why are the results in a video instead of a table? I don't want to wait for
dmd to know how much faster you made it.

~~~
CyberShadow
> Why are the results in a video instead of a table?

Seeing is believing! :)

> I don't want to wait for dmd to know how much faster you made it.

Here are the timings from the video in table form:

    
    
      Normal compilation (without forking):         9.163 seconds
      Full compilation + creating forks:           10.464 seconds
      Final step only (code generation + linking):  2.847 seconds
      After editing entry point file only:          3.792 seconds
      After editing deep dependency (first build):  9.562 seconds
      After editing deep dependency (second build): 4.675 seconds

~~~
sam0x17
would have preferred this format as well. I hate when the goods are buried in
a video.

------
simonsaidit
7 seconds is slow? I was happy when I decreased a 32-hour build to 4 hours.

~~~
jstimpfle
7 seconds for a rebuild after a small edit is already a little unergonomic
when you are iterating heavily. But it all depends on the project. Simple
C-style code should compile at ~1M lines of code per second on today's
machines without heavy optimization, maybe compiling on 4 cores or so. (Not
possible with common infrastructure: C with gcc/clang/msvc, large headers per
compilation unit, many small files, object-file-style linking, and the
awfully slow linkers; I wonder if those could be faster.)

~~~
vbezhenar
I knew one guy who used C++ but hated the STL and most of the standard
library. He usually implemented data structures in place as he needed them
(linked list, growable array, etc.). While it seems strange to reimplement
similar data structures over and over again, it worked for him. And one
particularly wonderful thing was how fast his projects compiled. He had one
project with a few libraries, around 100k lines, and it compiled very fast:
something like a second from scratch, and that was 10 years ago.

~~~
jki275
That's the way I learned C++. While I mostly use the STL now as it's much more
convenient to write, there is an elegance to writing one's own tools.

~~~
sam0x17
you'd also be shocked, SHOCKED, at how inefficient some of the STL
implementations of basic data structure algorithms are

~~~
chillee
Of special note are the unordered_map/unordered_set implementations.

[https://news.ycombinator.com/item?id=7849213#7849607](https://news.ycombinator.com/item?id=7849213#7849607)

~~~
sam0x17
imagine all the CS research that relies on the STL expecting it to behave
optimally

------
w8rbt
The first part of this reads like an intro to algorithms class where I think
they cover all of these things. Directed graphs, back edges (cycles), convert
to a DAG (no cycles), strongly connected components, topological sorting, etc.
Nice to see it applied to solve the problem at hand.

------
CJefferson
I wish that Jupyter used some kind of similar trick, to allow quickly
recalculating a workbook starting at some cell.

~~~
Waterluvian
If a notebook was written functionally without side effects, couldn't each
function be explicitly or implicitly decorated with an @memoized ?
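
Something like this minimal sketch (hypothetical Python; the standard
library's functools.lru_cache does essentially the same thing):

```python
import functools

def memoized(fn):
    # Caches by positional args; only safe for pure functions with
    # hashable arguments, and it won't notice if fn is later redefined.
    cache = {}
    @functools.wraps(fn)
    def wrapper(*args):
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]
    return wrapper

calls = 0

@memoized
def expensive_cell(n):
    global calls
    calls += 1            # counts real executions, not cache hits
    return n * n

expensive_cell(10)
expensive_cell(10)
print(calls)              # → 1: the second call was served from the cache
```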

~~~
dooglius
That's not really how notebooks are written, at least in my experience. Even
if there aren't side effects, work is often done at the lowest level without
defining functions at all. Also, the fact that function definitions can change
makes it tricky, and I suspect you'd run into lots of subtle issues: e.g. if
you modified a function foo and a subsequent function bar calls foo, then I
think calls to bar would return stale memoized data.

~~~
Waterluvian
Yes, agreed. I'm spitballing. I was thinking about how a hypothetical type of
notebook could be structured in frames, where each frame can have side
effects, but ultimately its input comes from the output of the previous frame.
So each frame could be memoized. The UI could indicate what frames are being
re-processed.

I'm basically thinking about ArcGIS Model Builder[1] that I used a lot in
university. It was a great way to make complex process pipelines for GIS data,
but only re-run the pieces that change. It allowed me to experiment at a very
fast pace.

[1] [http://pro.arcgis.com/en/pro-
app/help/analysis/geoprocessing...](http://pro.arcgis.com/en/pro-
app/help/analysis/geoprocessing/modelbuilder/what-is-modelbuilder-.htm)

------
bwasti
There's a single algorithm (Tarjan's) that actually solves all your graph
needs. It returns a reverse topologically sorted DAG of strongly connected
components.

~~~
CyberShadow
Yes, but I need to sort by two criteria, and Kosaraju's algorithm is simpler
to implement :)
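
For reference, Kosaraju's two passes are short to sketch (a hypothetical
Python version, not the compiler's actual D code); here the component list
comes out in topological order of the condensation, so you'd reverse it for
a compilation order:

```python
def kosaraju_scc(graph):
    # Pass 1: DFS, recording vertices in order of finish time.
    order, seen = [], set()

    def dfs1(v):
        seen.add(v)
        for w in graph.get(v, []):
            if w not in seen:
                dfs1(w)
        order.append(v)

    for v in graph:
        if v not in seen:
            dfs1(v)

    # Transpose the graph (reverse every edge).
    rev = {}
    for v, ws in graph.items():
        for w in ws:
            rev.setdefault(w, []).append(v)

    # Pass 2: DFS on the transpose in reverse finish order;
    # each tree found is one strongly connected component.
    sccs, assigned = [], set()

    def dfs2(v, comp):
        assigned.add(v)
        comp.append(v)
        for w in rev.get(v, []):
            if w not in assigned:
                dfs2(w, comp)

    for v in reversed(order):
        if v not in assigned:
            comp = []
            dfs2(v, comp)
            sccs.append(comp)
    return sccs

# A and B import each other (one SCC); both depend on C.
g = {"A": ["B", "C"], "B": ["A", "C"], "C": []}
print(kosaraju_scc(g))   # → [['A', 'B'], ['C']]
```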

------
snissn
I used to use your actionscript3 decompiler, thank you cybershadow!!! It was a
great tool :)

------
faragon
Ask Fabrice Bellard to write tdc after tcc [1] :-)

[1] [https://bellard.org/tcc/](https://bellard.org/tcc/)

------
WhitneyLand
Nice work thank you for sharing.

Tangentially it’s not clear to me what relevance D should have for any
greenfield projects where there is not already heavy investment in a code base
to consider.

First there’s the whole Rust comparison, and then the incredible impact of the
comparative size and momentum of ecosystems for various languages.

This reddit thread has a few interesting comments. It was quite surprising to
see D’s architect/designer discount the value of memory management issues in a
sub link there. Anecdotally, from what I've read and experienced, it's one of
the most important sources of bugs once quantity and time to debug and fix
are considered.

[https://www.reddit.com/r/rust/comments/5h0s2n/what_made_rust...](https://www.reddit.com/r/rust/comments/5h0s2n/what_made_rust_more_popular_than_d/)

~~~
EvenThisAcronym
> Tangentially it’s not clear to me what relevance D should have for any
> greenfield projects where there is not already heavy investment in a code
> base to consider.

Another one of Cybershadow's articles on D that blew my mind is, IMO, one of
the greatest arguments for why you should care about D. The beauty,
conciseness and flexibility of the implementation here is extremely cool. And
this is not a one-off case; there are many programmers in the D community that
create awesome stuff like this all the time. Writing in D is like having
superpowers, and going back to a less expressive language feels horribly
constraining.

~~~
p0nce
Perhaps you are referring to this article?

[https://blog.thecybershadow.net/2014/03/21/functional-image-...](https://blog.thecybershadow.net/2014/03/21/functional-image-processing-in-d/)

Vladimir has one hell of a blog.

~~~
EvenThisAcronym
Yes, thanks. Forgot to actually post the link.

------
reikonomusha
This is all so complicated, and it reminds me of the joy of doing incremental
development in Common Lisp with Emacs and SLIME. Edit-Compile-Run-Repeat
cycles feel archaic and slow compared to the much, much shorter and faster
edit-function/run-function cycles.

~~~
geokon
My only experience is with ELisp/Clojure, and you're right that the turnaround
time is much faster with editing, but at least in ELisp I find "stepping"
through code much more confusing. The call stack (or whatever the equivalent
is called) has some weird characteristics:

1 - It doesn't match the code one-to-one, and is in a funky non-S-expr form
where the function name is outside the parens

2 - As expressions are evaluated and you pop up the stack, it dynamically
fills in parts of the higher expression, so the "stack" is morphing and
changing. It's cool and convenient but also disorienting.

3 - Each line/frame can be horribly long (like a whole let or cond expression
on one line), and it's not clear which section/term/subexpression is being
called in the frames below

I'm using the normal debug-on-entry and I'm definitely no pro, so maybe there
is a more ergonomic way to debug? (In the little Clojure I've done it seems
to be the same.)

~~~
reikonomusha
I believe that this is an artifact of elisp being essentially interpreted, but
I could be wrong. With Common Lisp and SBCL, you only see function calls in
your stack trace.

~~~
geokon
Gotcha. Thanks. It's probably some issue with my workflow. I'll ask on
/r/emacs and see if they give me any tips :)

------
vectorEQ
This is a very interesting and nice post, thanks! I love this use of fork.
It should be obvious from the description of the function that it can be
used this way, but it really never even came to mind. Cool!!

------
p1necone
Did anyone else read this title as "D compilation is too slow and I am forking
the compiler (in source control because I think I can do a better job writing
it)"?

~~~
nineteen999
Everybody thought that until they read the article, which was clearly the
intention of the author.

------
qwerty456127
I wish somebody would fix GCC/G++ this way. I was stupid enough to try
installing webkitgtk from the AUR (as I couldn't find which binary Arch
package would make it work with Python). Many hours have already passed, I
have already written the program I wanted with Qt instead of Gtk while
waiting (although the whole computer works annoyingly slowly during the
process), and it still is compiling...

------
kardos
What would be involved to make ccache [1] support D?

[1] [https://ccache.samba.org/](https://ccache.samba.org/)

------
jon889
I understood it up to "Note how the compilation order is the reverse of the
topological order (we’ll get back to this in a bit)." But then I couldn't
find where you got back to it, and I can't quite understand why it's the
reverse of the topological order or how that's possible?

~~~
tpxl
>can't quite understand why it's the reverse of the topological order

For example: A depends on B, which depends on C, so the graph is A->B->C. To
compile A, B must first be compiled, and to compile B, C must first be
compiled. So the compilation order (C, B, A) is the reverse of the
topological order (A, B, C).

>how that's possible?

The compiler first determines dependencies and then compiles. The compilation
then happens per file, so that part is trivial.
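
A DFS post-order gives the compilation order directly. A hypothetical Python
sketch (no cycle detection here, which is why the article first collapses
cycles into strongly connected components):

```python
def compilation_order(deps):
    # DFS post-order over a "depends on" graph: each module is emitted
    # only after everything it depends on, i.e. the reverse of the
    # topological order.
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in deps.get(node, []):
            visit(dep)
        order.append(node)

    for node in deps:
        visit(node)
    return order

# A depends on B, B depends on C
deps = {"A": ["B"], "B": ["C"], "C": []}
print(compilation_order(deps))        # → ['C', 'B', 'A'] (compile order)
print(compilation_order(deps)[::-1])  # → ['A', 'B', 'C'] (topological order)
```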

------
auscompgeek
What's the story over in gdc and ldc land? Since the plumbing for this sort
of thing is presumably already there, they might already be doing these sorts
of AOT/incremental builds.

~~~
CyberShadow
They share the same frontend as DMD, which doesn't have any built-in capacity
to hibernate/serialize compilation results (parsed AST) to disk.

I haven't tried applying this to either compiler. Being able to perform code
generation serially would benefit those the most, because their backends are
considerably slower than DMD's, however the template instantiation problem
currently prevents that. I don't think it's insurmountable, probably just
needs someone very familiar with that part of the compiler to look at it.

~~~
WalterBright
> doesn't have any built-in capacity to hibernate/serialize compilation
> results (parsed AST) to disk

It turns out there isn't much purpose to that. DMD can build an AST from
source so fast that building an AST by parsing a serialized version would
hardly be any faster.

Vladimir sidestepped that by saving a memory image via fork.

------
anticensor
The original author of D is surprisingly missing from the conversation.

~~~
WalterBright
Walter's D9000 computer is monitoring this thread in his stead.

~~~
MaxBarraclough
Not polling, I trust.

Hacker News threads: like Hacker News fibers, but less cooperative.

------
grandinj
LibreOffice Online also uses a warmed-up process that forks child processes
to execute requests.

------
amelius
Note: in this case "fork" means fork() in the sense of creating a new process,
not as in duplicating the source tree.

~~~
zellyn
This discussion thread is hilarious (and adorable) to me. Makes me feel old
(in a good way) that there are folks for whom the default meaning of "fork" is
git(hub), rather than the unix fork syscall.

~~~
nathan_f77
OHHH. Haha I had to read your comment before I understood the title. I knew
they were talking about the fork syscall while reading the article, but I
still assumed the title referred to "forking the compiler on GitHub, so that
they could make it better and send a pull request."

It didn't click that the title was also talking about the fork() syscall. I
think it's a pun though: "To try it yourself, check out the dmdforker branch
in my dmd GitHub fork."

~~~
zellyn
Double OHHH. Just realized upon reading _your_ comment that I'd read the
title/article the same way, and then forgotten. I guess “forking the D
compiler” naturally leads to the source-forking context.

I guess I've learned to forget/ignore the titles on hackernews after using
them to decide whether to click through, since they so often mess with them.

------
bhengaij
I thought D was proud of compilation times and claimed to be useful as a
scripting language?

~~~
CyberShadow
Correct. The language is designed to allow code written in it to compile
quickly, and the reference implementation is very fast (both the front-end,
and back-end). Small, script-size programs compile quickly.

However, D's compile-time metaprogramming facilities allowed us to get
ambitious in some places... for instance, std.uni precomputes some Unicode
lookup tables during compilation, and std.regex makes heavy use of
metaprogramming to compile regular expression strings to D code, again during
compilation. As a result, making use of those features will result in heavier
load on the compiler.

~~~
Asooka
I don't like code that generates stuff at compile time precisely for this
reason - I would much rather use some template language to pre-generate the
source (I've even seen php used) and then compile it quickly. Of course that
then needs support from the build system to correctly invoke the generator
when the templates change, and you run the risk of some developer changing
the generated source instead of the template and wreaking havoc.

I would very much like to have my cake and eat it - powerful code generation
facilities in the language itself, but with a compiler that's as smart as the
build system and can cache these for reuse when recompiling the same file.
That would also allow easy integration with IDEs.

For now, I've decided that the better balance is generating code using an
external generator, checking everything into source control, and requiring
responsible people to do code review. YMMV of course, depending on what
you're compiling and who you're working with.

~~~
qznc
I agree with you. It feels natural for lexer/parser generators for example
(Yacc, Bison, AntLR, etc). Also for FFI bindings.

On the other hand, for the little things it would be weird. Would you do that
for a single regular expression?

------
lyrachord
D compilation: not only slow, but also heavy on memory. E.g. a simple program
with searching and regex eats 2-3 GB of memory. Sigh!

~~~
faissaloo
I haven't had this problem. I wrote a daemon to blend colours for my RGB
keyboard in D and it stays below 8192 bytes.

