Hacker News new | comments | show | ask | jobs | submit login
Let's make the Emacs GC safe and iterative (gnu.org)
208 points by noch 3 months ago | hide | past | web | favorite | 64 comments

It alrady handles long lists without recursion:

    case Lisp_Cons:
	register struct Lisp_Cons *ptr = XCONS (obj);
	if (CONS_MARKED_P (ptr))
	CONS_MARK (ptr);
	/* If the cdr is nil, avoid recursion for the car.  */
	if (EQ (ptr->u.s.u.cdr, Qnil))
	    obj = ptr->u.s.car;
	    cdr_count = 0;
	    goto loop;
	mark_object (ptr->u.s.car);
	obj = ptr->u.s.u.cdr;
	if (cdr_count == mark_object_loop_halt)
	  emacs_abort ();
	goto loop;
The "goto loop" chases the cdr pointer iteratively. So with regard to list structure, this will only blow the stack if you have deep car recursion. When conses are used to represent nested lists, you don't get that much depth on the CAR side.

I'm just wondering what is the exact scenario, involving what structure.

Of course there can be other kinds of linked objects. E.g. a huge graph structure, where the GC happens to traverse a very long path through the graph before bottoming out somewhere.

BTW, "[i]f the cdr is nil, avoid recursion for the car." Initial reaction: good frickin' idea and I'm stealing it immediately. Though, wait, not sure what it buys you in practical terms; who the heck builds a deep structure of conses where the linkage is through car and the cdr-s are nil.

A better test might be for the cdr being any atom at all. Though obviously some kinds of atoms are structures with pointers that can have depth behind them, many kinds of atoms are shallow. If a cons has any atom whatsoever as its cdr, then maybe we should recurse on the cdr and tail loop on the car.

Analysis required.

You can crash Emacs for example with the recipe from #2099:

   $ emacs -Q --eval "(let (v) (while t (setq v (cons v v))))"
In response to how such deep structures can arise: They may for example arise when testing Emacs with randomly generated code snippets. For testing edge cases, it is highly desirable that Emacs run robustly also in such situations.

Indeed emacs should be robust. I recently ran some fuzz-testing of Emacs, and found evaluating a similar recursive example would crash it:


My case was resolved (a file with a few thousand "`" characters), but despite running more fuzzing I couldn't get any other interesting results.

I use AFL (American Fuzzy Lop) on TXR from time to time.

And, guess what, there was also a problem with nested backquotes. Backquote is spelled ^ (caret) in TXR Lisp, so the cases that were triggering the degenerate behavior had ^^^^^^^ rather than ```````...


Small world. This is my bug-report, which meandered a little:


That loop doesn't release any of the objects that it creates; the pointer v always hangs on to every cons cell that was created. So there is nothing for the garbage collector to do here other than to confirm that everything is reachable and not actually reclaim anything. If we disable garbage collection, it will hit OOM. Die here, die there.

BTW: (setq v (cons v v)) -> (push v v)

Whatever it does, I expect Emacs to not crash.

If it cannot allocate more memory, I expect it to throw a Lisp exception that tells me so, not to crash the whole process.

I greatly appreciate Daniel's work to increase the robustness of Emacs, because it means that there will be fewer situations where a process crash leads to lost work for users. In addition, increasing the robustness of Emacs in such cases will also greatly broaden the set of test cases that can be run reliably, and hence help to detect and correct even more issues.

Without a question, if cons cannot allocate, it should throw some exception; the code with its best intentions should be there. That way, someone who can tune their operating system or run on a MMU-less system can take advantage of it. The rest of the people won't actually it though; their system will thrash and either it or they will kill the offending process before it has a chance to reach that exception throw. That's if they don't run out of patience and just reboot the machine.

> I'm just wondering what is the exact scenario, involving what structure.

> Was: Re: bug#30626

    (seq-doseq (_ (stream-range 1 1000000)) nil)


Looks like an X Y problem here. Let's fix the garbage collector (X) because of some badly implemented lazy processing (Y) where the real issue is.

I don't believe anyone has a filesystem that can't be traversed by a lazy data structure in Lisp, with a recursive GC, in 8 megs of stack. (Unless, say, the traversal is being fooled by cycles created by directory symlinks.)

One could argue that. One could also argue that it should be possible to collect any garbage successfully created.

Nope; worse is better. It has to be recognized that a garbage collection scheme which can satisfy this idealist requirement has some disadvantages relative to one which doesn't.

Should it also be possible to successfully read any well-formed S-expression that we can create in a text stream? Oops, don't use recursion in the reader, then.

It doesn't have disadvantages apart from the code to manually manage an explicit stack. The actual space used will be reduced.

And yes, it should be possible to successfully read any well formed sexp that can be created in a text stream and held in memory. That's why quality parsers use an explicit stack as well.

Recursion is a useful development tool in some cases, but when iterative algorithms fail because they use it, the implementations are buggy.

I rewrote the JSC parser to be mostly iterative many years ago - for a “real” language making an iterative parser that is readable is incredibly challenging. Especially in something like js where an expression could actually contain a statement. The current state is “as iterative as is sanely understandable” and handles all real (and some insane) programs fine. I suspect similar is true for other production hand written parsers.

Your alternative is table based parsers which can only reasonably be implemented with generators. The problem is that table based parsers are slower than hand rolled recursive descent parser by a fairl large margin, and general purpose generators (bison, yacc , etc) tend to have difficulty with the quirky syntax of large languages. Note there have been a number of papers over the years that demonstrated 2x perf improvements in yacc/bison by switching from indexed state transitions to computed go to, but even that leaves them slower than hand rolled.

> for a “real” language making an iterative parser that is readable is incredibly challenging.

Not to mention if it is defined in terms of a character-classifying read-table which dispatches functions, which can be patched by the user program (as in Common Lisp).

You can hold a S-exp in memory such that you don't have anywhere near enough memory left to scan it with even your condensed explicit stack.

> That's why quality parsers use an explicit stack as well.

Observation: that rules out the entire description of the ANSI CL reader from being quality, haha.

Interesting. So that's why emacs whigs out sometimes when trying to open very large files: because it has a recursive GC that has trouble when it runs out of stack space. Assuming I read the first statement correctly.

Seems like a welcome change. I'm not too well versed with the internals, but does 4.6MB seem reasonable for a memory footprint? I mean, compared to Slack, not bad.

> The naive version of this scheme needs about 4.6MB of overhead on my current 20MB Emacs heap, but it should be possible to reduce the overhead significantly by taking advantage of the block allocation we do for conses and other types --- we can put whole blocks on the queue instead of pointers to individual block parts, so we can get away with a much smaller queue.

So that's why emacs whigs out sometimes when trying to open very large files: because it has a recursive GC that has trouble when it runs out of stack space. Assuming I read the first statement correctly.

That doesn't seem right to me. (a) In emacs a file's contents aren't stored in a deeply nested structure, no matter what the size. It's a gap buffer. (b) Structure aside, that buffer isn't a Lisp object, but an object internal to the C level, so not subject to GC. And note that Emacs's large file behavior isn't especially different than other non-minimalist editors[1].


Not to be rude but Electron apps' level of footprint has always been a compromise and should not be used to justify extraneous resource usage. Emacs must be able to run from the console and resource-constrained systems like AWS instances, where 5mb of RAM can make a difference.

Remember when people joked that EMACS stood for "Eats Memory And CPU Superbly?"

And then each generation reinvents its state-of-the-art abstracted memory hog...

Edit: more jokes, from 1985: https://github.com/sbp/lemacs/blob/master/etc/emacs.names

Eight Megs And Constantly Swapping

Precisely relevant to the issue at hand:

Eight Megs Ain't Comfortable Stack

If 5mb of ram makes the difference wouldn't that be the time to try mg[1] or edit remotely using tramp? Certainly agree that Electron should not be used as a justification.

[1] https://en.wikipedia.org/wiki/Mg_(editor)

As of late I've been using `emacs --daemon=some-name` on instances. I attach with `emacs-client -s some-name -t` after I ssh in and detach with `C-x 5-0`. Sometimes I'll run multiple named emacs daemons. This has worked really well to keep persistent independent workspaces on remote instances. Tramp was always a headache and running emacs in screen tends to introduce terminal color and keybinding problems.

How does mg differ from Emacs proper?

It's a text editor, not a lisp environment that edits text.

I once had a decent amount of fun splicing it with Lua, before deciding that I don't actually care about text editors...

An additional overhead between 20% and 25% definitely doesn't feel reasonable to me. Hopefully this isn't linear.

It's not reasonable to me either. That's the naive version. :-) There are various tricks for reducing the overhead to insignificance.

It must strongly depend on what you've got activated; when I edit multi-gigabyte files I start up with '-q' so my init file doesn't turn on FlySpell.

If you read further in the thread you'll see that Stefan Monnier pointed out that using a stack instead of a queue for the objects would consume less memory than the current GC implementation.

> We need to fix GC being deeply recursive once and for all. Tweaking stack sizes on various platforms and trying to spot-fix GC for the occasional deeply recursive structure is annoying. Here's my proposal:

There is still a default stack size limit of 8 megabytes on Linux, even though consumer machines nowadays have RAM measured in double-digit gigabytes.

On an Ubuntu 16.x machine:

  $ ulimit -a | grep stack
  stack size              (kbytes, -s) 8192
  $ free -h
                total        used        free      shared  buff/cache   available
  Mem:            15G        4.1G        2.8G        600M        8.7G         10G
  Swap:           14G        626M         13G
This 8192 kbytes has not changed in at least 15 years, unlike that 15G. Slight disconnect there.

This 8192 is just the soft limit; applications can raise it up to the root-imposed hard limit.

Unfortunately, that tends to be only 4X larger than the 8192:

  $ ulimit -s 32769
  bash: ulimit: stack size: cannot modify limit: Operation not permitted

  $ ulimit -s 32768
  $ # OK
Clearly, the hard limit needs revision too.

Threads are another issue; if you have a "can run in any thread" garbage collector, it has to live within a thread stack. You can't have huge stacks for each of a large number of threads. A thread's stack cannot be temporarily extended; there may be something in the address space preventing it, since it's just another mmap. (Don't think Emacs has this problem. :)

> There is still a default stack size limit of 8 megabytes on Linux, even though consumer machines nowadays have RAM measured in double-digit gigabytes.

Current stack sizes are fine; storing large amounts of information on the stack usually isn't a good idea anyway, and makes tools like debuggers and perf(1) slow and awkward.

Besides, the ulimit means nothing. If you really want, you can allocate your own "stack", as large as you want, and switch to it. I just wouldn't recommend it.

> A thread's stack cannot be temporarily extended; there may be something in the address space preventing it, since it's just another mmap.

You might be interested in the split stack model, described here [1]. You definitely can grow a thread's stack as needed. The approach has various issues (which is why Rust eventually went with conventional stacks), but it certainly works.

[1] https://gcc.gnu.org/wiki/SplitStacks

How does big stacks slow down perf?

I agree putting large amounts of data on the stack isn't a great idea, but I think it's mostly because (a) default stack sizes are quite small (b) they vary by distribution and even local configuration and (c) if you happen to be on any thread other than the main thread, it's the pthread stack size that matters and this is even usually smaller than the default main thread stack.

I never heard of it causing any problems for perf. Yes, perf unwinds the stack if you ask it to, but all the mechanisms will don't depend (AFAIK) on the stack size.

Putting large data on the stack is a good idea as a standard Fortran optimization, at least (gfortran -fstack-arrays and ifort default, I think). You typically need to increase the stack limit greatly for HPC jobs. I'm not aware of that causing trouble for profiling, and would also be interested in how it does.

By the way, on glibc/Linux, you need (main thread) stack space proportional to the complexity of your shared libraries (number of symbols). The ld.so memory manager uses alloca for data structures related to symbol resolution.

Interesting. I guess this memory is also used at somewhat unpredictable times in the usual case where symbols are resolved lazily though the PLT: when you first use a function you'll suddenly get this extra stack usage while ld.so does it's thing?

Doesn't that also imply this caveat also applies to thread stacks if you end up resolving symbols off of the main thread?

Current limits are fine, but also meaningless since you can trivially bypass them? If so why not raise the limits if they're only serving to have everyone implement their own bespoke stacks.

The ulimit serves as a good reminder that excessive stack use is undesired.

It's not a reminder if Linux's default configuration is to hard limit it at 32768, that means that if you need 32769 of stack you need to hack your own implementation for your program to be portable, or otherwise the user must be able to "sudo".

Why don't they just use the Deutsch-Schorr-Waite algorithm to store the mark stack in the heap itself with pointer-reversal tricks? This would remove the additional overhead.

Is there any paper describing how this extends to objects containing N pointers without blowing up the number of bits needed?

Like say we are marking a vector containing pointers to objects. We need a state machine which tells us, on each visit to that object which tells us: "we have marked index [i]; now mark index [i+1]".

A list of 256 nodes doesn't have this problem: it has the marking bits in each node!

Also, how do we design the API for extending the heap with new kinds of objects. With the recursive traversal, you just provide a handler which traverses your objects and recurses down the pointers.

Perhaps we need some complicated protocol now whereby each object has methods like "init_traversal", "get_next_pointer", and "finish_traversal". In init_traversal, it receives a backpointer that it must store, and it returns a pointer that it would like the caller to retain (the one that is bumped to make space for the stashed one). In "finish_traversal", it gets back the retained pointer, in exchange for returning the stashed backpointer. Something like that.

Actually a protocol like this could perhaps clarify the algorithm.

I find the easiest way to think about this trick is to consider the CPS transformation of a depth-first traversal of a tree (in the case of a general graph, you use a DFS spanning tree). At each recursive call, you need to be able to produce and store the continuation upon return. Upon returning from a recursive call, you need to rematerialize the continuation.

The object pointer for the most recent continuation is stored in a global variable, and others are stored in a stack constructed by using pointers in objects that can now be inferred from the stacking nature of the recursion.

Part of the information for the code pointer for the most recent continuation is stored in the object pointer, i.e. from the object header you narrow the possibilities to process it. For an object containing N fields, you need log_2 N bits to store the field you were last processing. Since each object can appear in at most one stack frame of the recursive traversal (modulo where you put the marked check to bail out in the graph case), you need to put these log_2 N bits somewhere in the object.

You have a few options for where to put the bits:

- Find a few bits lying around where you can store them. This works pretty well for objects with few fields.

- Add an extra field with log_2 N bits to store them. This is the most straightforward solution for a larger object with an arbitrary number of fields.

- If you have an easy way of locating object headers from internal pointers, you can use internal pointers as your back pointers.

- Steal a few bits from every field that has a pointer value stored in it. This works because you actually only need log_2 M bits where M is the number of fields containing a pointer. The non-pointer fields can all be skipped.

This scheme can be generalized to implement arbitrary structural recursion on polynomial datatypes:


Huh. I wasn't aware of that trick. I'll look into it. Thanks.

Schorr-Waite works for binary graphs, storing the left/right choice in tag bits. Your block bitmap queue idea is more interesting, perhaps with two levels of bitmaps to scan 64*64 words with a single instruction.

For the naive implementation you could either use a stack and tail recurse on the last unmarked cell, or a queue with tail recursion on the first unmarked cell (that is, you push on the queue cells starting at the second unmarked cell, and tail recurse on the first to improve locality of reference).

(I wrote GNU Smalltalk's GC in another life :-)).

I believe something similar (maybe exactly that) was suggested later on in the thread.

> Why don't they just use the Deutsch-Schorr-Waite algorithm

Here is a good reason: you're debugging some problem that occurs in the middle of garbage collection. You have the world stopped and would like to look at some objects. Oops, pointers are scrambled by this algorithm and won't be put back in place until it runs to completion. (Which it might not even be able to do due to whatever bug you're trying to investigate, even if you have a way to initiate that.)

With regular recursive traversal, you can just bail garbage collection at any time: loop through all the heaps, and reset the gc bits of all objects to the "unmarked" state. Then examine whatever you want.

(If the bits aren't in the type tag, you don't even have to do that; though it's a good idea for them to be in a type tag because then an object which has been mis-handled by memory management (e.g. marked object remains in the live object graph after gc) then it has the wrong type and blows up when used.)

In practice, if you get something like this wrong and are forced to debug it, you will probably have to do some code inspection or printf debugging to find the cause. But it's still better than most concurrent GC algorithms in that regard.

What can't you debug with code inspection and printf?

Some black box with that has no printf (or not that you can insert into a custom build), and for which code isn't available.

Emacs likes pointer reversal tricks!


(Unrelated, but an amusing coincidence.)

For emacs it might be OK, usually mutating objects in the heap during marking is not a good idea, as it breaks sharing CoW pages with forks.

Much better would be a copying collector, like a twofinger collector, because emacs doesn't have much foreign pointers for political reasons, which can be handled separately. No stack, much faster, much slimmer.

Copying collectors can handle foreign pointers. Why? Because foreign pointers can be managed via handle records. Those handle records are GC objects with a type tag and all that; they hold the foreign pointer as an opaque machine word in one of their fields. The handles themselves can be moved around by a copying/compacting allocator. Of course, the foreign objects aren't. You typically don't even know their size. They could be things like a library handle returned by dlopen(): don't even think about moving that. They could also be pointers into the middle of some object; a displaced C pointer can enter the GC world as a foreign pointer.

The biggest problem are callbacks from foreign calls back into your moved heap. Called back objects need to stay put, in a separate fixed area. All major lisps with proper FFI support do that.

modules are no big deal, FFI's with callbacks are. And since RMS opposed an FFI for emacs for decades (DLL or OLE/COM support) - there's now a simple internal one for the gtk binding - a moving collector would be pretty simple.

emacs doesn't have much foreign pointers for political reasons

This is an interesting connection. Could you elaborate on this?

We support external modules these days. I specifically designed its interface to allow for a moving collector. The current implementation of the C core, however, precludes such a thing.

Did sth. similar a while ago for our not so great D GC. https://github.com/dlang/druntime/pull/1073

I wonder if a hybrid approach is possible. Run GC recursively. Arrange for the segfault to run on an alternative stack (sigaltstack). If a segfault occurs due to insufficient stack space, then do the rest of the GC using the "queue draining loop" approach, within the small amount of alternative stack memory. Then longmp to some point in the original GC to bail out the original stack.

In my TXR Lisp, the "alt stack" is used transparently to the programmer. So we can catch a stack segfault. This is exploited in the solution to the Rosetta Code "Find Limits of Recursion" task:


The implementation code is here:


Search for "setup_alt_stack" and "teardown_alt_stack".

Basically, this allows a Lisp signal handler lambda to execute even though the stack has been blown.

What if this is in the middle of gc? We can't run Lisp code with a heap that is the middle of gc, due to objects being marked and such. That situation is detected and a function called gc_cancel is called to simply cancel gc. Everything that has been marked is unmarked. To bring the gc machine to the most obvious safe state, all the structures related to generational gc are reset, and a full gc is scheduled.

With that implemented, the above Rosetta task became reliable; the random failures it exhibited were gone. Users programs (on platforms supporting sigaltstack) have a fighting chance to do something if gc runs out of stack, like save some data to disk or whatever.

I wrote an implementation of Haskell on .net many years ago that essentially did something similar for deeply recursive code - basically catch stack overflow, and then spawn a new thread and continue execution on that. This infinit stack space :)

Why would you bother supporting both approaches? It's not as if the recursive approach is any more efficient --- and whether or not you start with the recursive approach, you need to keep memory reserved for a worst case queue traversal.

For one thing, because the recursive approach is debugged. If we can show that the new code doesn't change what the old code is doing when that case is not triggered, then then we can be confident that a bug hasn't been introduced.

Super dumb question: why not just use a queue, and visit the objects iteratively?

Eg each visited object appends/prepends its children to the queue of objects to visit, and you just repeatedly drain until you have an empty list.

That’s what I had to do many years ago for the GC I was working on.

Then any special cases (like the cons case referenced elsewhere) just becomes performance optimization rather than “don’t crash” correctness

This talks about possible generational GC. Incremeantal/generational was available many years ago with a port to the Boehm collector. I never understood rejecting it and generally spending effort on a bespoke collector. (I don't remember whether the Ravenbrook system was available at the time, or if it was unsuitable for some reason.)

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact