I imagined a separate pass for the preprocessor, where the state would only be #defines and the current stack of #if blocks. The compiler would then only have to keep track of type declarations and globals (including functions). With some effort, it should be possible to encode this quite efficiently in memory, especially if strings are interned (or, better yet, if the arch can do mmap, sliced directly from the mapped source). Looking at tcc, it's somewhat profligate in that compound types are described using pointer-based trees, so e.g. function declarations can blow up in size pretty fast.
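To make that concrete, here's a rough sketch of what that preprocessor state might look like; the type and field names are made up for illustration (this isn't tcc's layout), and the fixed-size arrays are just to keep it short:

    /* Sketch of the preprocessor state: just the #define table and the
     * stack of open #if blocks. */
    #include <stddef.h>
    #include <stdio.h>

    enum if_state { IF_ACTIVE, IF_SKIPPING, IF_TAKEN };  /* TAKEN: a branch already emitted */

    struct macro {
        const char *name;   /* interned, or a slice of the mmapped source */
        size_t name_len;
        const char *body;   /* replacement text, ideally also a source slice */
        size_t body_len;
        int nparams;        /* -1 for object-like macros */
    };

    struct cpp_state {
        struct macro macros[1024];   /* the #define table */
        size_t nmacros;
        enum if_state if_stack[64];  /* nesting of #if/#ifdef/#else blocks */
        int if_depth;
    };

    int main(void) {
        static struct cpp_state st;
        /* #define PI 3.14159 */
        st.macros[st.nmacros++] = (struct macro){ "PI", 2, "3.14159", 7, -1 };
        /* entering an #ifdef whose condition held */
        st.if_stack[st.if_depth++] = IF_ACTIVE;
        printf("%zu macros, #if depth %d\n", st.nmacros, st.if_depth);
        return 0;
    }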
Yeah, running the preprocessor as a separate process definitely eases the memory pressure; I think Unix's pipeline structure was really key to getting so much functionality into a PDP-11, where each process was limited to a 16-bit address space.
Pointer-based trees seem like a natural way to handle compound types to me, but it's true that they can be bulky. Hash consing might keep that manageable. An alternative would be to represent types as some kind of stack bytecode: something like T_INT32 T_PTR T_ARRAY 32 T_PTR T_INT32 T_FN for the type of int f(int *(*)[32]) (assuming int is int32_t). That would be only 10 bytes, assuming the array length takes 4 bytes, but kind of a pain in the ass to compute with.
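For example, with made-up opcode names and a little-endian 4-byte array length (none of this is tcc's actual representation):

    #include <stdint.h>
    #include <stdio.h>

    enum { T_INT32, T_PTR, T_ARRAY, T_FN };  /* one byte per opcode */

    /* int f(int *(*)[32]): parameter built bottom-up, then the return
     * type, then T_FN combines them. */
    static const uint8_t type_of_f[] = {
        T_INT32, T_PTR,        /* int*                                */
        T_ARRAY, 32, 0, 0, 0,  /* array of 32 of that (length as u32) */
        T_PTR,                 /* pointer to that array: int *(*)[32] */
        T_INT32,               /* return type                         */
        T_FN,                  /* function taking one argument        */
    };

    int main(void) {
        printf("%zu bytes\n", sizeof type_of_f);  /* prints "10 bytes" */
        return 0;
    }

Walking such an encoding linearly is cheap; the painful part is anything that wants random access into the middle of a type, like unifying two function types during type checking.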
Interned strings, whether pointers to a symbol object or indexes into a symbol array, can be bigger than the underlying byte data, and slices of an mmap can be bigger still. Worrying about that is of course silly when you have one of them in isolation (you need a pointer to it anyway), but it starts to add up when you have a bunch of them concatenated, as in a macro definition that mixes literal text with parameter references.
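To put rough numbers on that, here's a hypothetical fragment encoding for a macro body, with literal text stored as (offset, length) slices into the mapped source and parameters as small indices; the struct and the ADD example are purely illustrative, not from any real preprocessor:

    #include <stdint.h>
    #include <stdio.h>

    enum frag_kind { FRAG_TEXT, FRAG_PARAM };

    struct frag {
        uint8_t  kind;   /* FRAG_TEXT or FRAG_PARAM           */
        uint8_t  param;  /* argument index, if FRAG_PARAM     */
        uint16_t len;    /* slice length, if FRAG_TEXT        */
        uint32_t off;    /* slice offset into the mapped file */
    };

    int main(void) {
        /* #define ADD(a, b) ((a) + (b)) would be five fragments:
         * "((", a, ") + (", b, "))" -- 40 bytes of fragments for an
         * 11-character replacement list. */
        printf("sizeof(struct frag) = %zu\n", sizeof(struct frag));  /* 8 */
        return 0;
    }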