It is interesting to use fork() to snapshot application state. It feels surprisingly high-level and powerful (compared to other C/POSIX-level facilities - I'm not comparing it to Scheme/Lisp/"your favorite high/low-level environment" here!). Let's collect some other examples here - just for fun:
- I know that Redis does this too (RDB snapshots): the child process saves the actual snapshot while the parent continues operating, relying on the kernel's copy-on-write mechanism (see the sketch just below)
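For concreteness, here is a minimal C sketch of that fork-and-snapshot pattern (not Redis's actual code; the state array and snapshot_to_disk() are hypothetical stand-ins): the child serializes the memory image it inherited at fork() time, while the parent keeps mutating its own copy and the kernel copies only the pages that actually change.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical in-memory application state and serializer. */
static int state[4] = {1, 2, 3, 4};

static void snapshot_to_disk(const char *path) {
    FILE *f = fopen(path, "w");
    if (!f) { perror("fopen"); _exit(1); }
    for (int i = 0; i < 4; i++)
        fprintf(f, "%d\n", state[i]);
    fclose(f);
}

int main(void) {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {
        /* Child: sees the state exactly as it was at fork() time
         * and can write it out at leisure. */
        snapshot_to_disk("snapshot.txt");
        _exit(0);
    }

    /* Parent: keeps serving requests and mutating state; only the
     * pages it touches get physically copied. */
    state[0] = 42;

    waitpid(pid, NULL, 0);   /* a real server would reap this asynchronously */
    return 0;
}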
This is, incidentally, one of the built-in advantages of immutable data structures (in the languages that feature them). For example, I recently had to write a test that worked with a fairly complex connection object. Executing the test changed that object, but later on I wanted to reset it to its earlier state - which was as simple as doing something like this in Elixir:
original_conn = conn
conn = do_something_expensive(conn, params)
assert <whatever>
conn = original_conn # this "grabs" the earlier state, no matter how complex it is
conn = do_something_else(conn, params)
assert <something having to do with both calls>
Basically, if your language's data is immutable, then any reference to any state (no matter how complex the data structure) is the state as it was at that point in time.
You could say that changes to the state are forked from the earlier state.
Not exactly a fork, but Emacs is famous (or notorious) for saving its initial environment by basically dumping a core image of itself and loading that dump on startup if the startup files haven't changed:
Only of historical relevance, but it sure gave me a good chuckle of the "omg, I can't believe that's how they made that work!" sort when I first read it: Ultima Online fork()ed its server process(es) to marshal in-process data into files on NFS, and used that mechanism to persist world state: https://news.ycombinator.com/item?id=9820645
More precisely, the first version of the TeX program (now called TeX78 or TeX80), which was written in the SAIL programming language at Stanford, worked this way, because it was apparently the natural thing to do and lots of programs at the time were written like that. The present version (rewrite) of the TeX program (briefly called TeX82 but now just called TeX) was written in (a language/system based on) Pascal and was meant to be portable across different OSes. So in addition to using the system's fork or dump/undump (if available) to snapshot the application state, it contains its own implementation of dumping all the relevant application state into a "format file" and loading from it. In fact, the dumping and undumping were traditionally done by different binaries:
Chrome and Android have both fork()ed warmed-up processes to speed up application startup. Although, looking at this 2011 post, it's more complicated than I thought in Chrome:
There is a time-travelling debugger that forks the process at each statement, allowing you to go back to a previous statement and continue from that point.
It seems like forking so often would use up a system's resources very quickly. I'd imagine that each fork needs at least a handful of individual pages for OS bookkeeping overhead, so you'd get into the gigabyte range fast even with perfect copy-on-write.
Yes, you can't run like that forever. But you could break, step forward and backward a bit, and then resume. I often step forward while debugging, realise that whatever is going wrong has already gone wrong, and would like to step back a few steps.
rr does use fork() internally to create checkpoints, and checkpoints are created implicitly as the program executes to support time-travel. The COW behaviour of fork() is extremely helpful for performance.
There's more to checkpoints than just fork(), though; for example, when two processes are sharing memory, checkpointing the process group fork()s both processes and then must explicitly duplicate the shared memory region, otherwise the checkpoint would share memory with the original, which would be disastrous. Unfortunately this means we don't get COW behaviour for shared memory regions; the Linux/POSIX APIs are a bit deficient here.
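To make the idea concrete, here is a toy sketch (this is not rr's implementation, just an illustration of fork()-based checkpointing): a checkpoint is a forked child that sleeps on a pipe, holding a copy-on-write image of the program state at that moment; restoring means waking that child and abandoning the current process.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* One saved checkpoint: a pipe whose read end a dormant child sleeps on.
 * A real debugger would keep a whole stack of these (one child per checkpoint). */
static int cp_wake[2];

/* Take a checkpoint. Returns 0 in the process that keeps running,
 * and 1 when the checkpoint is later resumed. fork()'s copy-on-write
 * means the dormant child costs little memory until pages diverge. */
static int checkpoint(void) {
    if (pipe(cp_wake) < 0) { perror("pipe"); exit(1); }
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); exit(1); }
    if (pid > 0) {                  /* original process: keep executing */
        close(cp_wake[0]);
        return 0;
    }
    close(cp_wake[1]);              /* child: sleep until asked to take over */
    char c;
    ssize_t n = read(cp_wake[0], &c, 1);
    close(cp_wake[0]);
    if (n <= 0) _exit(0);           /* original exited without restoring */
    return 1;                       /* execution resumes from the checkpoint */
}

/* Abandon the current state and resume the saved checkpoint. Never returns. */
static void restore(void) {
    fflush(stdout);
    write(cp_wake[1], "x", 1);
    _exit(0);
}

int main(void) {
    int x = 1;
    if (checkpoint() == 0) {
        x = 42;                             /* mutate state after the checkpoint */
        printf("before restore: x = %d\n", x);
        restore();
    }
    printf("after restore: x = %d\n", x);   /* prints 1: the old state is back */
    return 0;
}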
But I suppose immutable data structures can be stored more efficiently, using a heap with byte-sized granularity, rather than by forking, which works in pages of perhaps thousands of bytes.
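As a toy illustration of that byte-level sharing (a minimal sketch with a hypothetical persistent singly linked list in C): "updating" the list allocates one small node and shares the entire unchanged tail, and the old version stays valid, so the cost of keeping a snapshot is proportional to what changed rather than to page size.

#include <stdio.h>
#include <stdlib.h>

/* An immutable (persistent) list node: once built, it is never modified. */
typedef struct Node {
    int value;
    const struct Node *next;
} Node;

static const Node *cons(int value, const Node *next) {
    Node *n = malloc(sizeof *n);
    if (!n) abort();
    n->value = value;
    n->next = next;
    return n;
}

int main(void) {
    const Node *v1 = cons(1, cons(2, cons(3, NULL)));  /* version 1: [1, 2, 3] */
    const Node *v2 = cons(42, v1->next);               /* version 2: [42, 2, 3] */

    /* Only one new node was allocated for v2; the [2, 3] tail is shared,
     * and v1 still describes the old state in full. */
    printf("v1 head: %d, v2 head: %d, tail shared: %s\n",
           v1->value, v2->value, v1->next == v2->next ? "yes" : "no");
    return 0;
}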
There is another fairly similar design decision split:
LVM - low (block) level, hardware supported, generic, unaware of the filesystem layer above (so it may not be optimal for some use cases), but at least this gives some separation of concerns, and it can still provide some impressive features: resize, encrypt, snapshots
CoW filesystems (btrfs, zfs, ...) - specialized, carefully designed data structures/layout, can have more advanced features and be more optimized for given use cases (they have more information about what is actually happening), but they are not a generic, separable layer - all the fancy advanced features are tied to the high-level fs implementation
Why would you think so? A page is 4096 bytes and it's hard-coded into the kernel (and the hardware). With byte-sized granularity, you are just introducing extremely large overhead for page table entries and making the TLB a lot less effective.
- The Resque background job queue (popular in the Ruby world) forks itself to run each job, to isolate the scheduler from any badness in the jobs (the basic pattern is sketched after this list).
- I’ve used fork() in Ruby for similar reasons when a library had instability or memory leaks and I wanted to isolate it rather than try to find the bug (I’m looking at you, ImageMagick).
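A minimal sketch of that scheduler-isolation pattern in C (run_job() is a hypothetical stand-in for the actual work): each job runs in its own forked child, so a crash or leak dies with the child while the parent loop keeps scheduling.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical job body - a stand-in for whatever flaky work needs isolating. */
static void run_job(int job_id) {
    printf("processing job %d in pid %d\n", job_id, (int)getpid());
    /* a crash, leak, or hang here only affects this child process */
}

int main(void) {
    for (int job = 0; job < 3; job++) {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); exit(1); }
        if (pid == 0) {              /* child: do the work, never return to the loop */
            run_job(job);
            _exit(0);
        }
        int status;                  /* parent: wait and see how the job ended */
        waitpid(pid, &status, 0);
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            fprintf(stderr, "job %d failed; the scheduler keeps going\n", job);
    }
    return 0;
}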
A while back, MRI Ruby got an improved garbage collector that keeps the mark bits separate from the objects just so copy-on-write would (mostly) work in cases like this.
Unfortunately the maintainers didn't like the idea so it won't be merged. But if your project is large enough that ninja takes more than a second to parse your build files every time you build (true for Chromium and Android at least), you might want to try the patch.
On the topic of Windows forking, the RtlCloneUserProcess function does exist and it works. The problem is that Microsoft's implementation of the Win32 API keeps state (such as an IPC connection to csrss.exe) that is invalid in the forked process and there's no easy way to clone or reinitialize that state. The child process will run until its next Win32 API call and then probably die. Since almost every Windows program uses Win32 APIs, RtlCloneUserProcess is not useful for forking Windows programs.
Microsoft could fix this if they wanted to, at least for a useful subset of Win32. But I imagine it requires a lot of hacking in deep, dark parts of Windows that nobody wants to touch.
That’s an interesting note about Chromium; I wonder if it will move at some point to Bazel, given that Chromium's latest build script generator GN might be called Yet Another Build System That Looks Suspiciously Like Bazel. (Others: https://www.pantsbuild.org/, https://buckbuild.com/, https://please.build/)
Obviously this won’t happen unless Bazel adds, like, a million things that Chromium developers think are essential features but still, I wouldn’t be surprised. Bazel already has a ton of developer effort thrown at it to make it work with large code bases well, and recompile large, complicated projects quickly. The forking idea (or at least, the client/server model) is already used by Bazel, and Bazel is capable of managing subprocesses for compiling specific languages. This is how Bazel’s TypeScript support works, for example. When you run Bazel it spawns a server which hangs out in the background, and that server will run a TypeScript compiler in the background. Compiling a TypeScript project with Bazel is noticeably and significantly faster than compiling from the command line, without the hacky / fragile bits that tsc --watch gives you. And obviously none of this would be possible unless the TypeScript compiler were written to heavily cache things and run as a library (see https://youtu.be/f6TCB61fDwY).
I have successfully reinitialized enough Win32 in a forked process to be able to throw up a MessageBox, but without source I'd never be able to confidently state that every Win32 call would work correctly.
Back in the 1970s, I was fascinated by the (now famous) ADVENT game and how it worked, so I studied the Fortran source code.
When started up, it would do all kinds of slow (at the time) initialization of its data structures. Then the genius move kicked in - it saved the memory image to disk as the executable file! Next time it ran, it was already initialized and started up instantly. It blew my mind.
I liked the idea so much that years later, when I worked on the text editor I use (MicroEmacs), I did the user configuration the same way. Just change the global variables, then patch the executable file on the disk with the changed values. It was so, so much simpler than defining a configuration file, serializing the data, writing/reading the file, and deserializing it. It worked especially well on floppy disk systems, which were incredibly slow.
But then came viruses, malware, etc. Patching the exe file became a huge "malware alert", and besides, Microsoft would mark the executable as "read only" while it was running. Too bad.
It seems like someone should come up with an "executable + data storage" file format (where virus scanners would know to only check changes in the executable section). Maybe they could make one that is supported cross-platform while they're at it...
> Why are the results in a video instead of a table?
Seeing is believing! :)
> I don't want to wait for dmd to know how much faster you made it.
Here are the timings from the video in table form:
Normal compilation (without forking): 9.163 seconds
Full compilation + creating forks: 10.464 seconds
Final step only (code generation + linking): 2.847 seconds
After editing entry point file only: 3.792 seconds
After editing deep dependency (first build): 9.562 seconds
After editing deep dependency (second build): 4.675 seconds
Impressive demo and lovely writeup. That was fun to watch.
I've been working on building the Zig self-hosted compiler this way from the ground up, except with reference counting the cached stuff rather than using fork(). This lets me do the caching at the function level, even more fine grained than the file level. Here's a 1min demo: https://www.youtube.com/watch?v=b_Pm29crq6Q
7 seconds for a rebuild after a small edit is already a little unergonomic when you are iterating heavily. But it all depends on the project. Simple C-style code should compile at ~1M lines of code per second on today's machines, without heavy optimization and maybe compiling on 4 cores or so. (Not possible with common infrastructure - C with gcc/clang/msvc, large headers per compilation unit, many small files, object-file-style linking, and the awfully slow linkers; I wonder if those could be faster.)
I knew one guy who used C++ but hated the STL and most of the standard library. He usually implemented data structures in place as he needed them (linked list, growable array, etc.). While it seems strange to reimplement similar data structures over and over again, it worked for him. And one particularly wonderful thing was how fast his projects compiled. He had one project with a few libraries, around 100k lines, and it compiled very fast - something like a second from scratch, and that was 10 years ago.
I've almost purposely avoided studying what the STL does because of that. I have some of my own notions on how some of the data structures ought to be built in order to be efficient, and sometimes I implement them myself to be sure I get the behavior I want.
My own experiments: I recently bought a new PC with an i5-7600, and my own Pascal parser, written in straightforward unoptimized C, parses about 2M lines of code per second on a single thread.
Also, I'm following Jonathan Blow's streams (Jai language). With his own trivial x64 backend he has/had a very real compilation speed of 80K lines/s, and only the parser was parallelized. I think he intends to improve that to 800K lines/s. Note that his language is quite a bit more work to compile than a basic C-like language (which is about as easy as you can get if you discount the parsing model).
Seven seconds is already pretty fast. Might have warranted some profiling and a brief attempt to speed up the parts of the compiler that your usage hits, since that could potentially be useful to everybody and thus magnify the value of cutting one second off a build. But this? Was a cool learning experience, no more, no less.
7 seconds is plenty fast enough to make “watch” the cheapest and most reliable solution to the speed problem.
I'm not sure what you're talking about with respect to the C compiler. C compilers are the fastest around even before being parallelized - and they can be.
C parsing is rather slow due to the preprocessor (also: huuuge include files) and due to context-dependent syntax (need symbol table). And compiling itself need not be slow, but the popular compilers are not exactly super fast even with -O0. LLVM in particular has a reputation of being somewhat bloated. And finally, due to common practice of using a standard object file format, which is slow to link (at least with common tools), the whole build experience is not exactly blazingly fast.
With clang/gcc/msvc, I think I'm more in the ballpark of 3K-30K lines/sec (-O0, 100-1000 lines/file, not counting basic std* includes).
gold is pretty fast. I don't have a source but those numbers seem very off to me. C compilers are fast. C++ much less so, but C++ is a different language.
I tried to follow gold's author's series on linkers once but gave up. Now that I've written an ELF-64 writer and know a little more about linking I should give it another shot. In any case, I think I've heard of still faster linkers than gold, and at the time I also found gold's object oriented architecture (as it was described in the series) questionable.
The first part of this reads like an intro to algorithms class where I think they cover all of these things. Directed graphs, back edges (cycles), convert to a DAG (no cycles), strongly connected components, topological sorting, etc. Nice to see it applied to solve the problem at hand.
The Jupyter model is hard for this, but if you only allow in-order execution (like a REPL, say IPython, whose kernel is used in Jupyter) you could do something like this.
I wanted something like this for live programming and felt like I needed to write my own interpreter because the behavior you need is so different: http://dalsegno.ballingt.com
That's not really how notebooks are written, at least in my experience. Even if there aren't side effects, work is often done at the lowest level without defining functions at all. Also, the fact that function definitions can change makes it tricky, and I suspect you'd run into lots of subtle issues: e.g. if you modified a function foo and a subsequent function bar calls foo, then I think calls to bar would return stale memoized data.
Yes, agreed. I'm spitballing. I was thinking about how a hypothetical type of notebook could be structured in frames, where each frame can have side effects, but ultimately its input comes from the output of the previous frame. So each frame could be memoized. The UI could indicate what frames are being re-processed.
I'm basically thinking about ArcGIS Model Builder[1] that I used a lot in university. It was a great way to make complex process pipelines for GIS data, but only re-run the pieces that change. It allowed me to experiment at a very fast pace.
There's a single algorithm (Tarjan's) that actually solves all your graph needs. It returns a reverse topologically sorted DAG of strongly connected components.
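Here's a minimal sketch of Tarjan's algorithm in C on a small hypothetical graph (an edge x -> y meaning "x depends on y"): the components are printed in reverse topological order of the condensed DAG, i.e. dependencies before dependents, which is exactly the order a compiler wants.

#include <stdio.h>

#define N 5
static int adj[N][N] = {
    /* 0 -> 1, 1 -> 2, 2 -> 0 form a cycle; 3 -> 0; 4 -> 3 */
    {0,1,0,0,0},
    {0,0,1,0,0},
    {1,0,0,0,0},
    {1,0,0,0,0},
    {0,0,0,1,0},
};

static int index_of[N], lowlink[N], on_stack[N];
static int stack[N], sp = 0, next_index = 1;   /* index 0 means "unvisited" */

static int min(int a, int b) { return a < b ? a : b; }

static void strongconnect(int v) {
    index_of[v] = lowlink[v] = next_index++;
    stack[sp++] = v;
    on_stack[v] = 1;

    for (int w = 0; w < N; w++) {
        if (!adj[v][w]) continue;
        if (index_of[w] == 0) {                /* tree edge: recurse */
            strongconnect(w);
            lowlink[v] = min(lowlink[v], lowlink[w]);
        } else if (on_stack[w]) {              /* edge back into the current SCC */
            lowlink[v] = min(lowlink[v], index_of[w]);
        }
    }

    if (lowlink[v] == index_of[v]) {           /* v is the root of an SCC */
        printf("component:");
        int w;
        do {
            w = stack[--sp];
            on_stack[w] = 0;
            printf(" %d", w);
        } while (w != v);
        printf("\n");
    }
}

int main(void) {
    /* Prints {0,1,2} first, then 3, then 4: dependencies before dependents. */
    for (int v = 0; v < N; v++)
        if (index_of[v] == 0)
            strongconnect(v);
    return 0;
}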
Tangentially it’s not clear to me what relevance D should have for any greenfield projects where there is not already heavy investment in a code base to consider.
First there’s the whole Rust comparison, and then the incredible impact of the comparative size and momentum of ecosystems for various languages.
This Reddit thread has a few interesting comments. It was quite surprising to see D's architect/designer discount the importance of memory management issues in a sub-link there. Anecdotally, what I've read and experienced is that it's one of the most important sources of bugs when quantity and time-to-debug-and-fix are considered.
> Tangentially it’s not clear to me what relevance D should have for any greenfield projects where there is not already heavy investment in a code base to consider.
Another one of Cybershadow's articles on D that blew my mind is, IMO, one of the greatest arguments for why you should care about D. The beauty, conciseness and flexibility of the implementation here is extremely cool. And this is not a one-off case; there are many programmers in the D community that create awesome stuff like this all the time. Writing in D is like having superpowers, and going back to a less expressive language feels horribly constraining.
This is all so complicated, and reminds me of the joy of doing incremental development in Common Lisp with Emacs and SLIME. Edit-Compile-Run-Repeat cycles feel archaic and slow compared to the much, much shorter and faster EditFunction-RunFunction cycles.
My only experience is with ELisp/Clojure, and you're right that the turnaround time is much faster with editing, but at least in ELisp I find "stepping" through code much more confusing. The call stack (or whatever the equivalent is called) has some weird characteristics:
1 - It doesn't match one-to-one with the code and is in a funky non-S-expr form where the function name is outside the parens.
2 - As expressions are evaluated and you pop up the stack, it dynamically fills in parts of the higher expression, so the "stack" is morphing and changing. It's cool and convenient but also disorienting.
3 - Each line/frame can be horribly long (like a whole let or cond expression on one line), and it's not clear which section/term/subexpression is being called in the frames below.
I'm using the normal debug-on-entry and I'm definitely no pro, so maybe there is a more ergonomic way to debug? (In the little Clojure I've done it seems to be the same.)
I believe that this is an artifact of elisp being essentially interpreted, but I could be wrong. With Common Lisp and SBCL, you only see function calls in your stack trace.
This is a very interesting and nice post, thanks! I love this use of fork. It should be obvious from the description of the function that it can be used this way, but it really never came to mind. Cool!!
Did anyone else read this title as "D compilation is too slow and I am forking the compiler (in source control because I think I can do a better job writing it)"?
I wish somebody would fix GCC/G++ this way. I was stupid enough to try installing webkitgtk from the AUR (as I couldn't find which binary Arch package would make it work in Python). Many hours have already passed, I have already written the program I wanted with Qt instead of GTK while waiting (although the whole computer works annoyingly slowly during the process), and it is still compiling...
I understood it up to "Note how the compilation order is the reverse of the topological order (we’ll get back to this in a bit)." But then I couldn't find where you got back to it, and I can't quite understand why it's the reverse of the topological order, or how that's possible.
>can't quite understand why it's the reverse of the topological order
For example: A depends on B, which depends on C, so the graph is A->B->C. To compile A, B must first be compiled, and for B, C must first be compiled. So the compilation order has to be the reverse of the dependency (topological) order.
>how that's possible?
The compiler first determines dependencies and then compiles. The compilation then happens per file, so that part is trivial.
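A toy sketch of why that falls out naturally (a hypothetical three-module graph, assuming no cycles - in the real compiler, strongly connected components would be collapsed first): a depth-first postorder over the dependency edges emits each module only after all of its dependencies, which is the topological order read backwards.

#include <stdio.h>

/* A(0) depends on B(1), B depends on C(2), i.e. edges A->B->C.
 * DFS postorder emits C, B, A: the reverse of the topological order A, B, C,
 * and exactly the order in which the modules must be compiled. */

#define N 3
static const char *name[N] = {"A", "B", "C"};
static int dep[N][N] = {        /* dep[x][y] = 1 means x depends on y */
    {0,1,0},
    {0,0,1},
    {0,0,0},
};
static int visited[N];

static void emit_compile_order(int v) {
    if (visited[v]) return;
    visited[v] = 1;
    for (int w = 0; w < N; w++)        /* handle dependencies first */
        if (dep[v][w]) emit_compile_order(w);
    printf("compile %s\n", name[v]);   /* postorder: after all deps */
}

int main(void) {
    for (int v = 0; v < N; v++)
        emit_compile_order(v);          /* prints: compile C, compile B, compile A */
    return 0;
}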
What's the story over in gdc and ldc land? I presume the plumbing for this sort of thing is already there, so they might already be doing these sorts of AOT/incremental builds.
They share the same frontend as DMD, which doesn't have any built-in capacity to hibernate/serialize compilation results (parsed AST) to disk.
I haven't tried applying this to either compiler. Being able to perform code generation serially would benefit those the most, because their backends are considerably slower than DMD's; however, the template instantiation problem currently prevents that. I don't think it's insurmountable - it probably just needs someone very familiar with that part of the compiler to look at it.
> doesn't have any built-in capacity to hibernate/serialize compilation results (parsed AST) to disk
It turns out there isn't much purpose to that. DMD can build an AST from source so fast that building an AST by parsing a serialized version would hardly be any faster.
Vladimir sidestepped that by saving a memory image via fork.
This discussion thread is hilarious (and adorable) to me. Makes me feel old (in a good way) that there are folks for whom the default meaning of "fork" is git(hub), rather than the unix fork syscall.
OHHH. Haha I had to read your comment before I understood the title. I knew they were talking about the fork syscall while reading the article, but I still assumed the title referred to "forking the compiler on GitHub, so that they could make it better and send a pull request."
It didn't click that the title was also talking about the fork() syscall. I think it's a pun though: "To try it yourself, check out the dmdforker branch in my dmd GitHub fork."
Double OHHH. Just realized upon reading your comment that I'd read the title/article the same way, and then forgotten. I guess “forking the D compiler” naturally leads to the source-forking context.
I guess I've learned to forget/ignore the titles on hackernews after using them to decide whether to click through, since they so often mess with them.
Hasn't the term 'fork' been overloaded long before GitHub (or even git)? Project forks have been a thing for as long as there have been open source projects...
That reminds me that "fork" when used as a verb on GitHub's UI should more accurately be "cloud clone".
GitHub's terminology has muddied the waters between the concept of actual forks (longterm divergent development) and a particular implementation of a permission system for branches.
We used to just call it a 'branch' because it was just another branch in the distributed revision tree. There is nothing special about a branch you create and commit to, whether it's by clicking on a picture on the screen or typing a command in a terminal.
Calling a branch a fork is just a water-muddying term, probably coined by some suit in marketing who saw the buzzword somewhere in an article in Internet Explorer and thought they would re-purpose it as a product differentiator.
That's a second-order observation. We could use any noun for that if and when it's relevant - I suggest alternadoodley.
Like I said, "fork" already has a useful meaning with regard to an OSS project, so there's no need to overload it. Is the following sentence not confusing? When you fork on GitHub you create a fork but not every fork is a fork.
Put it this way: when the user presses the Fork button in the GitHub UI, is their intention to contribute to the project, or is it to create a competitor/alternative to the project?
I say the former situation is an obfuscated permission system and the latter is a fork.
[ I've gone rather off-topic from TFA so will stop commenting now! ]
Correct. The language is designed to allow code written in it to compile quickly, and the reference implementation is very fast (both the front-end, and back-end). Small, script-size programs compile quickly.
However, D's compile-time metaprogramming facilities allowed us to get ambitious in some places... for instance, std.uni precomputes some Unicode lookup tables during compilation, and std.regex makes heavy use of metaprogramming to compile regular expression strings to D code, again during compilation. As a result, making use of those features will result in heavier load on the compiler.
I don't like code that generates stuff at compile time precisely for this reason - I would much rather use some template language to pre-generate the source (I've even seen php used) and then compile it quickly. Of course that then needs support from the build system to correctly invoke the generator when the templates change and you run the risk of some developer changing the generated source instead of the template and wreaking havoc. I would very much like to have my cake and eat it - powerful code generation facilities in the language itself, but with a compiler that's as smart as the build system and can cache these for reuse when recompiling the same file. That would also allow easy integration with IDEs. For now, I've decided that the better balance is generating code using an external generator, checking everything in source control and requiring responsible people to do code review. YMMV of course, depending on what you're compiling and who you're working with.
It's pretty damn fast every time I use it, but I don't do as much metaprogramming. It doesn't surprise me that a specific portion of D takes up time - it makes total sense to me that metaprogramming would take a bit, and throw in some regex and voila. He's definitely a D wizard in my book; I've known his work for a while now. I'm laughing at the fact that 7 seconds is too long - I think it's a reasonable amount of time, but I'm usually compiling C#. The most time for me when building D is spent pulling in libraries using dub (which could potentially be improved, but I'm no expert in the matter), although if I enable parallel compilation it usually breezes right on through my code.
- I have also found this: https://github.com/thomasballinger/rlundo - it overrides the readline library to add an undo function to any REPL that uses it
Other examples?