Stack size invisibility in C and the effects on "portability" (utcc.utoronto.ca)
215 points by todsacerdoti on Sept 28, 2021 | 215 comments



I just wrestled with this. You cannot effectively handle stack overflow situations if you use the system stack on Linux, because stack segments are in magical "MAP_GROWSDOWN" regions. These regions are managed by the kernel itself. There is a base address and a size, and the kernel initially only maps some small number of pages near the upper end (high addresses). Faulting accesses to addresses that are not mapped, but in the region, just below what is currently mapped cause the kernel to map in new pages automatically, "growing down".

The customary way to handle stack overflow gracefully in userland is to install a SIGSEGV handler and check the faulting address to see if it is near the stack segment end, handling the signal on an alternate stack. Except that determining if the access was in the stack segment requires knowing the stack segment start. The signal API doesn't allow querying that; you'd have to query the process's mappings by other means. It's also not robust because a program with big frames might just stride too far and confuse the kernel's automatic growing, causing spurious SIGSEGVs when in fact, there is stack remaining.

I found it was just easier to allocate my own stack segment and immediately switch to it and just leave the loader-provided stack segment alone. Virgil does this in the first 50 or so instructions upon entry.

This also makes stack overflow very deterministic (only depends on the program) and controllable--there is a compiler option to set the stack size in KiB. There's no need for the compiler to insert strided accesses either, as the whole segment is mapped, and you can have more than a single guard page at the beginning (think: 1MiB guard region).

Stack segments in C (and Linux) are madness if you use very much and want anything to be robust to overflow.
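A minimal sketch of that kind of setup, assuming Linux and illustrative sizes (actually switching the stack pointer onto the new segment is architecture-specific assembly and is omitted here):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Sketch: reserve a 1 MiB guard region plus a fixed-size stack in one
       mapping.  The guard stays PROT_NONE, so a runaway stack faults there
       instead of silently growing into something else. */
    static void *reserve_stack(size_t stack_size) {
        const size_t guard = 1 << 20;              /* 1 MiB guard region */
        char *base = mmap(NULL, guard + stack_size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;
        if (mprotect(base + guard, stack_size, PROT_READ | PROT_WRITE) != 0)
            return NULL;
        return base + guard + stack_size;          /* new top of stack */
    }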


The way to handle stack overflow gracefully on Linux is to check in your signal handler whether the faulting address was between the beginning of the stack and the current stack pointer. You can read the stack pointer register from the signal handler's third parameter.

This is also how the kernel grows the stack. When there's a fault, it compares the faulting address to the stack pointer register. This way, big frames don't confuse the automatic growing. (On Windows, by contrast, stack growth is detected using a single 4k guard page, so compilers must be careful with big frames and insert strided accesses)
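A minimal sketch of that scheme, assuming Linux/x86-64 and glibc (the 256 KiB window around SP is an arbitrary heuristic, not something the kernel guarantees):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <ucontext.h>
    #include <unistd.h>

    static char altstack[64 * 1024];   /* handler must not use the dying stack */

    static void on_segv(int sig, siginfo_t *si, void *ctx) {
        ucontext_t *uc = ctx;
        uintptr_t sp    = (uintptr_t)uc->uc_mcontext.gregs[REG_RSP];
        uintptr_t fault = (uintptr_t)si->si_addr;
        /* heuristic: a fault near SP is treated as stack overflow */
        if (fault + 256 * 1024 > sp && fault < sp + 256 * 1024) {
            static const char msg[] = "stack overflow\n";
            write(2, msg, sizeof msg - 1);          /* async-signal-safe */
            _exit(1);
        }
        abort();                                    /* some other wild access */
    }

    int main(void) {
        stack_t ss = { .ss_sp = altstack, .ss_size = sizeof altstack };
        sigaltstack(&ss, NULL);
        struct sigaction sa = { .sa_sigaction = on_segv,
                                .sa_flags = SA_SIGINFO | SA_ONSTACK };
        sigaction(SIGSEGV, &sa, NULL);
        /* ... run code that may overflow the stack ... */
        return 0;
    }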


Managing your own pthread stacks can make threads more costly though, can't it? Use of pthread_attr_setstack requires a stack size of at least PTHREAD_STACK_MIN, which on glibc Linux tends to be 16384 -- four times larger than the initial cost of a demand-paged stack mapping. I understand that on musl this is 2048, so that's nice.


Yeah, I think there would be some overhead, because it doesn't look like pthread_create allows specifying a stack segment. But I wouldn't be using pthread_create (or any C library for that matter), because Virgil goes all the way down to the kernel; there is no C environment.

I only experimented briefly[1] with getting Virgil threads to run on Linux. You need to call SYS_clone, which forks a lightweight process (i.e. shares all virtual memory), and specify a stack segment. The segment in question would, of course, be mapped by Virgil and not have MAP_GROWSDOWN, so it could be made robust to stack overflow along the lines above.

[1] https://github.com/titzer/virgil/blob/master/apps/Multi/Mult...
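A rough sketch of that route using the glibc clone() wrapper rather than raw SYS_clone; the flags and sizes are illustrative, and a real threading runtime would pass more of them (CLONE_SIGHAND, CLONE_THREAD, TID pointers, etc.):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    enum { STACK_SIZE = 1 << 20 };

    static int child(void *arg) {
        printf("running on a caller-mapped stack\n");
        return 0;
    }

    int main(void) {
        /* plain mapping, no MAP_GROWSDOWN: the size is fixed and known */
        void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (stack == MAP_FAILED)
            return 1;
        /* clone() wants the initial stack *top* on x86-64 */
        int tid = clone(child, (char *)stack + STACK_SIZE,
                        CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, NULL);
        if (tid < 0)
            return 1;
        waitpid(tid, NULL, 0);
        return 0;
    }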


Nice to see you and the Virgil project again!! Really interesting to read about your Linux systems programming experiences. I learned a lot reading your source code but this bit about stacks escaped me.

Also I see x86_64 Linux targets are supported now and that's awesome. Do you plan to support ARM architectures?


Hey, thanks! I wrote about 1/3rd of an arm32 backend but it's languished because I've focused Virgil energy on only those things I need for my Wizard Engine project. Probably the next port I'll do will be arm64 to run on the M1 and Raspberry Pi, but that'll be some time in the future.

Nice to hear someone has read the code. Glad you like it. Cheers :)


Tangent, because this is the first I've heard of Virgil. From the main README, what do you mean when you say you would like programs written in Virgil to be dependency-free?


You can write programs that depend on no libraries at all, not even an SDK. They just need a compiler binary (and a machine to run it--x86-linux, x86-darwin, jvm, or even a Wasm engine) to compile, and they compile to a binary that also only needs a machine to run--no shared objects, external jars, etc.

The language has no builtin classes, interfaces, or functions--in contrast to, say, Java, where the language (and compiler) know about java.lang.Object, java.lang.String, etc. The compiler knows how to compile classes, but it doesn't know the name of any built-in ones.

Virgil is really designed to build the lowest level of software. Thus all my focus on low-level things like calling the kernel and mapping stacks, etc. The only thing that depends on more than just the kernel is the test suite, which uses bash and a little bit of C.


Interesting! Virgil looks really cool! I've been looking for something like it since about the time you started working on it, and I'm glad I've finally found it.

For others reading this thread, to clarify the "no builtin functions" claim, although Virgil doesn't have any builtin values which refer to methods, it does have some built-in operations: ==, !=, +, -, *, /, %, <<, >>, >>>, &, |, and ^, as well as data structuring facilities like arrays and tuples and operations to access them.


Ah, that's really cool. The clarification about it targeting the lowest level really helps explain the motivation, thanks, and all the best!


I think maybe you can do it with userfaultfd. Performance won't be as good as using a signal handler because of the context switch, but you aren't restricted in what you can do in the handler thread like you are with a signal handler.


What is Virgil?



The lack of stack space is one of my continual irritations. It makes it difficult to write tree searches in the obvious recursive way, because there is a good chance one day you will have an input big enough to overflow the stack. I wish on modern, 64-bit OSes the default thread's stack size were just made a decent size (128MB or even 1GB) -- would it really cost that much (since until the memory was touched it wouldn't really be allocated)?

Your choices are:

* Make a "stack" yourself, in heap memory, and implement your algorithm to manually push/pop from the stack you implemented (often not hard to do, but feels like reimplementing for no reason).

* Spawn a new thread, where in most languages you can set the stack size. This is what I tend to do, although it does seem like a waste to have an extra thread for no reason other than to get more stack.

You can (in principle) set the stack size when you compile a program, but how to do it seems to differ between OSes and compilers, so I prefer the new thread strategy as an in-language option.
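A minimal sketch of the second option using pthread_attr_setstacksize (the 256 MiB figure is arbitrary):

    #include <pthread.h>

    static void *deep_work(void *arg) {
        /* the recursive algorithm runs here, on the big stack */
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_t t;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 256UL * 1024 * 1024);  /* 256 MiB */
        pthread_create(&t, &attr, deep_work, NULL);
        pthread_join(t, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }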


> would it really cost that much (as until the memory was touched it wouldn't really be allocated).

The problem here is that once it is used once it is never freed. (Other than being swapped out). So you effectively cache all of the space your program needs forever.

There are some solutions like trying to mark the stack below you freeable but it is hard to do this in a way that doesn't cause huge performance concerns if you happen to be crossing whatever boundary is picked often. IIRC Golang used to do this but switched to copying, growable stacks because no matter how hard they tried they always ended up in scenarios where they were allocating and deallocating stacks in a hot loop.


> IIRC Golang used to do this but switched to copying, growable stacks because no matter how hard they tried they always ended up in scenarios where they were allocating and deallocating stacks in a hot loop.

They used to do segmented stacks. The bad overhead is not in allocation, but in switching between segments when stack pointer is oscillating around the boundary — but if you want to allocate really many stacks that could grow large, you don't have much choice: either this or tracking pointers that need to be updated on reallocation. Normal contiguously mapped stack, but with dynamic growing and shrinking of actually mapped pages, is an entirely different proposition.


In an implementation of a language that can't take the address of a local variable, you can move some stack frames to the new segment when you switch segments, 256 bytes or so. Then, to oscillate around the boundary, you need to return enough times to deallocate those 256 bytes. This will cost enough to swamp the cost of switching segments.

(If you can take the address of a local variable, but you have garbage collection and can handle memory allocation failures by exiting, you can heap-allocate whatever local variables have their address taken.)

Segmented stacks are a particularly efficient way to implement call/cc.


Ah, that makes more sense, since you could easily limit stack deallocation to once a second or whatever.


> So you effectively cache all of the space your program needs forever.

All the space your program uses at the largest stack size, you mean, which is different. Until you touch it, the huge stack is purely virtual. And it only lasts until you restart the program.

Yes, if you default to a 10 GB stack, then recursively tree-search over 9 GB of data, then close the data and spend the next week with 1 MB data that only touches the bottom of the stack, you'll have 8.999 GB swapped out, which does suck.

But that only happens if you run the program on that 9 GB of data. The alternative was to fault when you overflowed the smaller stack when that data was imported.

It does mean if there's an infinite recursion bug, instead of crashing quickly with a stack overflow, you have to wait while it pages out other programs and fills the whole 10 GB of stack...


Yes, that is basically what I mean. If you are going to use a very large amount of stack space you should consider alternatively allocating something on the heap and freeing it afterwards.

Yes, if most of your program is using roughly the same amount of stack space then it is great. However, I think if you are worried about running out then you are likely using far more stack than the rest of your program does.

So yes, it isn't intrinsically bad, but context is important when evaluating the decision.


I assume you're describing a degenerate case where a tree structure is unbalanced. When operating at that scale, the overhead from all the function calls alone is going to be significant. It's likely going to be maddeningly slower than an iterative solution, like getting an answer in minutes vs. seconds.

If it's a balanced tree, then consuming 9GB of stack to do a search would imply that the data structure is what, on the order of 2^1000000 in size? Yikes :-)


Not every unbalanced tree is necessarily a degenerate situation. Syntax trees aren't necessarily balanced naturally, for example. (They typically are somewhat balanced, of course, but if you do have one that's unbalanced, there's no way of balancing it.) Search trees for NP-hard problems may be an even better example.


My statement was within the context of there somehow being 10GB of node metadata (which could mean different things, but I'm assuming it means 10GB of stack usage to recursively visit each node.)

If you're up to 10GB worth of nodes in a syntax tree, you're probably doing something elaborate enough that you'd want to squeeze out every ounce of performance, and you're possibly a domain expert. Maybe you're making a static analysis tool? On the other hand, a lot of static analysis tools are painfully slow, so maybe there's a lot of recursive searching on deep data structures going on!

Likewise, if you're trying to solve an NP-hard problem and you're recursing through 10GB of anything, you're probably already asking the computer to do something unreasonable by many orders of magnitude, and all a larger stack would do is delay an eventual out of memory error. Or, you're doing something clever and knowingly pushing the computer to its limits, and so you once again would want to squeeze out every ounce of performance and go with an iterative solution.


> The problem here is that once it is used once it is never freed. (Other than being swapped out). So you effectively cache all of the space your program needs forever.

What do you mean? It will get freed once you pop that stack-frame won't it?


No. It was never allocated either.

It was virtual memory before it was used, but once you use it, it's mapped into physical memory and marked as dirty. So it may be swapped out or live there, those are the only two options.

It would take your OS being fully aware of your usage schema for it to be able to free the memory.


Zig appears to have a proposed solution to this similar to your option 1, move recursion to the heap.

https://github.com/ziglang/zig/issues/1639

They're also looking at static analysis of the maximum stack depth which might help in some cases.


Being able to make recursive functions use the heap would be great. I assume it would need to be opt-in since you'll change performance characteristics, but I almost always rearrange recursive algorithms to use a loop and a heap allocated stack structure instead. There's no reason the compiler couldn't help do that for me.


I'm highly skeptical that static analysis of maximum stack depth will work, because of higher-order functions.


In rust, unsafe operations can be marked such to silence the compiler when you know better; one could imagine something similar for a stack depth analyzer in zig. Or, for example, zig could use a finite stack depth when it can prove that's feasible, and emit code that will spill to the heap for cases it can't.


Any recursive tree traversal function is incorrect. You need to write the function in an iterative fashion. Hoping that a recursive function won't run out of stack space is like allocating 500 characters to an input field and hoping that the user never will enter longer inputs. Consider a degenerate tree consisting of a linked list of N nodes. If N is large (100m perhaps) you will inevitably run out of stack.
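For illustration, a sketch of the iterative shape, with the recursion stack replaced by an explicit, growable heap array (the node type and visit() are placeholders):

    #include <stdlib.h>

    struct node { struct node *left, *right; };
    void visit(struct node *n);                /* assumed to exist */

    void dfs(struct node *root) {
        size_t cap = 64, top = 0;
        struct node **stk = malloc(cap * sizeof *stk);
        if (!stk || !root) { free(stk); return; }
        stk[top++] = root;
        while (top) {
            struct node *n = stk[--top];
            visit(n);
            if (top + 2 > cap) {               /* grow the explicit stack */
                cap *= 2;
                stk = realloc(stk, cap * sizeof *stk);
                if (!stk) abort();
            }
            if (n->right) stk[top++] = n->right;
            if (n->left)  stk[top++] = n->left;
        }
        free(stk);
    }

Depth is then bounded by available heap memory, so even the degenerate linked-list case only grows the array, not the call stack.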


I agree where there isn't inherently a stack. Many of the introductory examples of recursion are trivially transformed to an iteration in constant space.

However, there are plenty of interesting cases where if you don't use recursion, you end up having to implement your own stack as a resizable vector.

The question "why should the busy work of hand-rolling your own stack be necessary?" is the kind of question that you don't want to waste too much time on as a working programmer, but unless some people think deeply about such questions, the field as a whole stops progressing.

As others point out in this discussion, with a 64-bit address space one could organize the stack(s) in such a way that recursion could use all of physical memory if necessary.


As much as I love recursive algorithms, I couldn't agree with you more. Why do the clever and pretty thing, when it is better to do the less elegant but reliable thing.


+1, though in general designing an iterative solution is more effort for problems that have an intuitive recursive solution, so OP's argument has some merit.


The initial/main thread stack size is dynamic: it grows until it runs into another allocation (and these days there are guard pages and such). You can already use 128MB on the initial/main stack, even with musl-libc / Alpine Linux (it needs to be linearly probed if it's unreasonably huge, so the OS isn't surprised by a random memory access out in the middle of nowhere, but the compiler should do that automatically these days).

You can also request whatever stack size you need for additional threads when you create them, this has always been a good idea.

It's only the default stack size for non-main threads, used when you don't specify one, that might be considered limited.


> The initial/main thread stack size is dynamic

That seems doubtful. AFAIK all OSes have a static allocation for the main thread, just one that is commonly (though not necessarily) much larger than what they give off-threads. As https://ariadne.space/2021/06/25/understanding-thread-stack-... notes, Windows gives 1MB to all threads, while macOS and OpenBSD 4.6+ give 8MB to the main thread and 512k to off-threads.

The former would not merit consideration if it was automatically increased.

> it grows until it runs into another allocation

Which it won't, because of virtual memory.

> and these days there are guard pages and such

That just makes the access segfault, unless the stack allocation is large enough to jump straight over the guard page.


I'm not so familiar with other operating systems, but on Linux and presumably other unixes, the only thing holding back the initial/main "process stack" is ulimit. Try "ulimit -s unlimited". (On macOS that increases it from 8MB to 64MB, but on linux it seems to really go unlimited)


Honestly, there is almost no difference between a purely recursive implementation on the stack and one that uses a passed-in stack structure (whether extensible or fixed) in terms of complexity. Though I can see why it would be nice if the languages supported some sugar for this.


A good tree will have something like log(n) depth so normal recursion should be fine.

Apart from recursive algorithms, stack size can be determined by static analysis.

Which reminds me: what is a good tool for analyzing / viewing a C++ programs call tree?


> Which reminds me: what is a good tool for analyzing / viewing a C++ programs call tree?

As already mentioned, perf can capture call stacks, but those are statistical samples, so not necessarily complete.

callgrind (https://www.valgrind.org/docs/manual/cl-manual.html) by contrast can capture all call-stacks.

kCacheGrind/qCacheGrind (https://kcachegrind.github.io/html/Home.html) can be used to view the output from either perf or callgrind. Sources for kcachegrind, including converters from perf, oprofile, etc., can be found at https://github.com/KDE/kcachegrind


> A good tree will have something like log(n) depth so normal recursion should be fine.

Yes, this is true for comparison/search/sort/bucketing algorithms. But in other applications (e.g. graph theory, networks, sparse matrices) a depth of n, for n elements, can be perfectly legitimate. Or it can even be an edge case, but one you don't want to handle by just segfaulting.

> Which reminds me: what is a good tool for analyzing / viewing a C++ programs call tree?

Linux perf can collect the necessary data. Then for frontends, I am not sure. KDAB's "hotspot" ( https://www.kdab.com/tag/perf/ ) can do flamegraphs, but this may be more performance-oriented than what you're looking for.

Edit: perf can collect the data statistically. It will miss many calls, the ones that are the least time-consuming.


> A good tree will have something like log(n) depth so normal recursion should be fine.

But in many contexts you have to be robust against a bad tree.

> Apart from recursive algorithms, stack size can be determined by static analysis.

Not always, in the presence of dynamically sized arrays (or alloca and friends).



Yeah. You won't be able to fit the tree in memory long before you overflow the stack while recursing down the tree.


I recall that the Rust compiler uses the stacker crate [0] to grow the stack into the heap, and it's under the rust-lang organization on GitHub, so it's probably the closest you can get to an "in-language option" without it actually being in the standard library. However, it currently isn't supported for several platforms due to the lack of a suitable OS interface.

[0] https://docs.rs/stacker/0.1.14/stacker/


I did try to enlarge the default stack size on Linux and ran into the problem that it caused the maximum number of threads to decrease. Because each thread has its own stack, the limit is (virtual memory limit)/(stack size) IIRC, so if you set the stack size to 1G you'll get a very low thread limit.


> but feels like reimplementing for no reason

In C++, std::vector::emplace_back() and pop_back() have been in the standard library for a decade now.


Well that's a stack, but it's not a call stack, which is the point.


It's even worse in the kernel. I think it's 16kB.

Yes, kilobytes.


Could also be solved if the compiler supported tail call optimization.


I don't think tail calls will help depth-first ("obvious recursive way") tree algos.


It doesn't help with breadth-first searches either. A non-degenerate tree of depth N will use O(N) space, and that space will be on the call-stack if you use a recursive algorithm to access it.

As long as you use less than 1 page per activation record though, most modern libc implementations behave safely under unbounded recursion as they will allocate a guard-page around each stack.

[edit]

It would also be useful if the compiler could warn about activation records larger than 1 page. Absent alloca (and the now-optional C99 VLAs), that should be statically knowable.


A BFS is essentially a while loop with a queue datastructure. This is (relatively) easy to implement in TCO-recursive erlang/elixir, which is what I'm currently most proficient in. And, yeah, TCO would help.


What does a recursive breadth-first search look like? I think you'd have to contort yourself to write that, which is why people are specifically mentioning depth-first where recursion is easy and natural.


Ah you're right; the most natural BFS would be tail-recursive or iterative.


The main reason to use recursion is to use the stack variables; when tail call optimization is possible, people usually just prefer loops.


TCO is an optimization. It's not always possible.


It's a misnomer. It's not just "an optimization"; in some cases, it alters program semantics.


Usually the only visible side-effect of TCO is stack trace compression and failure to exhaust the stack. You can call that semantics if you like, but I think most people wouldn't. It's absurd that Java doesn't do TCO because of the stack compression "issue" (as I understand it, that's why Java doesn't do TCO).


> You can call that semantics if you like, but I think most people wouldn't.

Tail Call Optimization (TCO) is an optimization, yes;

but if Tail Call Elimination is required in the implementation of a language, as it is in Scheme, then programmers in the language can rely on it (along with lexical scope) to write iterative loops as recursive calls instead. That is a change in programming language semantics.

see Guy Steele's "Debunking the 'Expensive Procedure Call' Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO" http://dspace.mit.edu/handle/1721.1/5753


> but if Tail Call Elimination is required in the implementation of a language as it is in Scheme

This is also supported as a non-standard extension in C, C++, and Objective-C as of Clang 13: https://clang.llvm.org/docs/AttributeReference.html#musttail (I implemented this).

> see Guy Steele's "Debunking the 'Expensive Procedure Call' Myth

Yes! :) I cited this paper in my blog article describing how we are using musttail: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
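A tiny illustration of the extension, assuming clang 13 or newer (the function itself is made up):

    /* With the attribute, the self-call below is guaranteed to be a tail
       call, so this runs in constant stack space for any n. */
    static long count_up(long n, long acc) {
        if (n == 0)
            return acc;
        __attribute__((musttail)) return count_up(n - 1, acc + 1);
    }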


My point is that adding TCO is not a breaking semantics change, as its only visibility will be through a) the fact that more code works with it than w/o it, b) elision of tail-calling frames in stack traces. In particular I don't consider (b) to be a problem. I'm aware (b) is a matter of opinion, but really, TCO is much too valuable to use (b) as an argument against TCO.


Termination instead of divergence definitely is a change in semantics.


No, failure to fail is not a change in semantics.

EDIT: "My program used to fail with a stack size exhaustion error, but now it works -- I'm so angry about this!" -no one ever.


But they'd complain about the opposite -- for example if the program were an external-event-driven state automaton.


TCO does not do the opposite. TCO helps avoid stack exhaustion without introducing new faults.

TCO might cause your program to avoid one stack exhaustion fault only to reveal a different fault (possibly a different stack exhaustion fault). This is not a semantics change. And it's not a bug either.


What do you mean by "reveal a different fault"? A program with an indefinite amount of tail calls does not have "a fault".


If you have a program that fails due to stack exhaustion, but fixing it by enabling TCO causes it to run longer and therefore run into some other problem...


But that's not a fault in the program. That's a fault in its execution environment revealed by the program.


C++ compilers are free to optimize infinite loops without side effects away and people are generally fine with that.


Well, I'm sure this is not the worst thing about C++ that C++ people are "generally fine with", so that really doesn't tell me a lot. Not quite sure if this applies to useful programs, though. Lack of tail call elimination can definitely break a class of useful programs even with side effects.


It's interesting that nobody has mentioned the fact that "stack" is a word that is not mentioned in the C standard (caveat: I don't own it, but I just checked a draft copy). In other words, there is no assumption/guarantee that a C compiler will rely on the machine stack in order to implement certain features (i.e. automatic variables, parameter passing, and function call/return).

So I guess that means that an imaginary API to query these things should not talk about "stack", since that is making an assumption that is a bit out of line. That does not make naming any easier ... and, as pointed out by vasama, it would be hard to use an API that queries about the current function only. Perhaps something like "autosizeof f" where f is a function could return, at compile-time, the projected "automatic storage size" of the given function (perhaps making it into a compilation error if f() uses VLAs, or calls functions that do), and a runtime "autoleft()" function could be used to determine how much space is left.

Uh, no, probably not. These things are hard, and I'm making assumptions about code analysis capabilities that are probably not realistic.


The stack isn't in the C-standard, but it is often minimally defined in the psABIs (e.g., the ELF ABI on Linux) for the architecture. The psABI is generally trying to balance performance of addressability (on the architecture), with security (position-independence), while constraining the toolchain developers' implementation as little as possible. As a programmer you don't have to implement to the ABI, but if you want interoperability and you want to use a C language compiler, you should (good luck creating your own ABI otherwise).

I'm not aware of any way to programmatically query the rules laid out in the psABI. The implementation of the corner cases is often based on mutual agreement (convention) between the compiler and the link-editor.

You could probably write a tool that maintains a representation of these rules and returns the results in some sort of parse-friendly EBNF, but you'd need a convention linter to make sure that the implementation of your tool tracks the change in convention in the toolchain components.

On the other hand, you could write an extension to the compiler that warns you when certain C code results in behaviors around the stack that might be undesirable, i.e., if you're passing too many parameters to a function and it's going to spill those to the stack.

The standard way to avoid unknown behavior in this regard is to explicitly tell the compiler what you want to accomplish by not relying on auto-allocation or parameter passing rules (as stated in the top thread). This means explicit allocation on the heap, rather than allowing the compiler to auto-allocate on the stack, and using pass-by-reference to avoid parameters spilling to the stack. All of this is slower since accessing values on the stack is often a single-instruction operation (small number of cycles), vs multi-instruction loads from far-away addresses (large number of cycles).
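As a small illustration of the parameter-passing point (the struct is made up):

    struct big { double m[64][64]; };        /* 32 KiB of data */

    /* by value: the caller copies the whole struct onto the stack per call */
    double sum_by_value(struct big b);

    /* by reference: only a pointer travels, typically in a register */
    double sum_by_ref(const struct big *b);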


LLVM-MOS targets the 6502 and so goes out of its way to use as little of the hardware stack as possible: it does global program optimization, figuring out what calls what and which functions don't need to be reentrant, so it can allocate as many activation records as possible statically, FORTRAN-style.


You ask for an automatic allocation of N bytes. You can use alloca() or VLAs, whichever you have available.

If you have neither, then you still have to worry that fixed sized automatic allocations might be too large, but at least this can be checked statically.

Given either alloca() or VLAs, those can be allocated on a stack or on a heap, and the spec does not say either way. The OS might over-commit memory, too, and if the C run-time has fixed-sized stacks there may or may not be a guard page at the end.

However, even with all of that, using memcpy() or similar (see elsewhere in this thread) to touch every byte or every 4KB's worth of bytes in what might be the direction of stack growth (if there's a stack, otherwise the direction is irrelevant), is something you can do to make sure that the memory allocated is committed, and that any failure due to over-committing (bounded stacks look like over-committed unbounded stacks) is detected as early as possible and without corrupting memory (e.g., by wildly exceeding stack bounds). Failure may ensue, but it should be behavior defined by the machine rather than undefined behavior.


> fixed sized automatic allocations [...] can be checked statically

May I ask how you'd do this? I need it for optimizing a C project.


You can guarantee undetected stack overflow is impossible if you do two things. First you mprotect() a 4096-byte page at the bottom of the stack. Second, you program the compiler to fail if an individual function allocates a stack frame larger than that.

    #ifndef STACK_FRAME_UNLIMITED
    #pragma GCC diagnostic error "-Wframe-larger-than=4096"
    #if __GNUC__ >= 9
    #pragma GCC diagnostic error "-Walloca-larger-than=1024"
    #pragma GCC diagnostic error "-Wvla-larger-than=1024"
    #endif /* GCC 9+ */
    #endif /* STACK_FRAME_UNLIMITED */
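A sketch of the first measure, assuming a stack region you mapped yourself (base = lowest address) and 4096-byte pages, so that together with the frame limit above no single frame can step over the guard:

    #include <sys/mman.h>

    /* Turn the bottom page of a caller-mapped stack into a guard so an
       overflow faults immediately instead of scribbling on whatever
       happens to sit below the stack. */
    static int install_guard(void *stack_base) {
        return mprotect(stack_base, 4096, PROT_NONE);
    }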


Oh! Now that's an actionable tip.

I fail to see the relation between your first measure (which seems to act as a guard?) and the second. Also their sizes don't match.

Though I'm afraid I'll encumber this thread with my naive questions.


Why not mprotect + -fstack-clash-protection?


Check every function's automatic variables' sizes, add them all up, and add whatever function call frame size the compiler and ABI are using. Getting an exact accounting of function call stack memory usage is tricky because you might have framepointer-less code, and you might have inlining of function calls, but you can figure it out assuming no inlining and assuming you don't enable framepointer-less-ness.


Thank you. It doesn't err.. look super straightforward. I was hoping for an automated way of some sort.

This stack affair of `local arrays are super fast! but keep it small, 1024 should be okay` is very puzzling to aspiring C lovers like me.


There's a way to dump DWARF debug information that you might then be able to parse and use to do just this.


Conversely, I don't think `malloc` et al. say anything about a "heap". All they promise is they'll allocate some memory, with certain size/alignment and lifetime guarantees. It's up to the platform to figure out how to fulfill those guarantees.


Yeah, I've written C for the PIC16 which has a hardware stack .. of depth 8. https://electronics.stackexchange.com/questions/158307/is-th...

I'm not entirely sure it was a standards-compliant compiler though. I think it may have denied recursion?

Mind you, I also think people have unreasonably high expectations of the C standard; in order to get things done you inevitably have to make compromises and use features that aren't available on every platform. Which leads to autoconf hell, but if you're still developing on C you've accepted its imperfections.


I can easily imagine compiler+linker being able to forbid recursion and statically calculate maximum stack needed. Probably you need to ban general function invocation via pointers... methods on objects could be fine, you just evaluate all virtual method override combinations.


The SPARK Ada language prevents recursion by only allowing you to call methods that are defined above the method you are in. It has a similar rule to prevent recursive calls between packages, although I forget the details of that. From this you can calculate the worst-case stack size. As all other memory has to be statically allocated, you can always calculate the maximum memory footprint of any program.

Obviously those restrictions mean you have to write your code without recursion, but in practice the uses that SPARK is put to don't tend to need it. A statically allocated stack in your code and a loop can solve a lot of problems (along with suitable logic for gracefully handling that stack filling up).


This article is a bit strange. It's referring to the thread stack size but acting as though it's the main stack size. It's true there's no portable C API for doing anything with the main stack size, but musl libc doesn't behave any differently than glibc here -- the main stack is allocated by the kernel. The thread stack size on musl is small, yes, but the only way to even end up in a thread involves using the pthread API. The pthread API exposes an API for configuring the stack size for a thread; it's called pthread_attr_setstacksize. You can find it here: https://pubs.opengroup.org/onlinepubs/9699919799/functions/p...

Granted, it is maddening that C does not offer any great ways to reason about what stack size is needed, but it is simply nonsense to imply (as people seem to be inferring here) that there's no way for an application to handle running into musl's stack size limits.


My understanding is that the author is talking about stack usage at runtime, not the size of an entire stack, which can be set by the pthread API you mentioned. It's confusing because the article starts with a discussion of how the default stack size differs depending on the platform.


I once worked on a project where we were building a browser plugin in IE (back when plugins existed) and found ourselves needing to use Open Dynamics Engine within the plugin. Under the hood, ODE uses `alloca` a lot... Once per frame to set aside enough RAM to compute a Jacobian that solves for all active collisions. But on Windows, you can configure "redzone protection" to fault the program if the stack takes a flying leap of a (relatively) paltry few kB off the current top of the stack, relative to the default configuration of redzone protection on most single-threaded Windows apps. From ODE's point of view this was fine; they didn't really care about the "compiling on Windows" use case at all and considered our problem to be ours. But this proved disastrous in the context of running on Windows in a thread, and since we were running in IE's process context, we couldn't change the thread allocation rules because they're baked into the process itself.

The nice thing about C++ is that the preprocessor is a monster. We #define'd all references to alloca in the ODE codebase away into calls to our own function that returned, instead of a bare pointer, an object with the right pointer-dereference semantics to act like a bare pointer but that was really a reference to a slice of a colossal chunk of heap memory. When the context of the alloca call closed, the returned object was destroyed, and its destructor moved our imaginary "stack" pointer in the big chunk of heap memory back into place.

Ugliest thing I ever wrote. ;)


As long as there's a guard page and as long as you can assume that page is at least 4KB in size, you can detect C stack boundaries safely. Basically, don't use alloca() or C99 VLAs, or huge automatics, or, if you do, touch a byte every 4KB in the direction of stack growth before you use bulky stack allocations. I have a macro for this somewhere.

Also, this isn't limited to C... The real problem is not that the stack size is invisible to the program, or that you can run out. The real problem is that you can make a stack allocation so large that you might miss the end of the stack and scribble over some other memory. This is why you need to touch a byte every guard-page-size bytes (in the direction of stack growth) in large stack allocations before using those allocations.

What's specific to C is alloca() and C99 VLAs. Not that other languages don't have those, but that in their absence it's extremely unlikely that any function would have a frame so large as to run past the guard page. A maximum frame size can be computed by the compiler in the absence of alloca() or VLAs.

Also, the compiler could (and arguably should) generate code to do this touching of a byte every guard-size-bytes thing for dynamically sized stack allocations, which would make the problem disappear. (You could still run out of stack space, but more safely.)
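A sketch of such a probe, assuming 4 KiB pages and a downward-growing stack; the volatile accesses are there so the compiler can't elide the touches:

    #include <stddef.h>

    /* Touch one byte per page of a fresh stack allocation, from the end
       nearest the old stack top toward the far end, so any guard page is
       hit before it could be skipped over. */
    static void stack_probe(void *base, size_t len) {
        volatile unsigned char *p = base;
        for (size_t off = 0; off < len; off += 4096)
            p[len - 1 - off] = 0;
        if (len)
            p[0] = 0;
    }

Usage would be along the lines of: char *buf = alloca(n); stack_probe(buf, n);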


> touch a byte every 4KB in the direction of stack growth before you use bulky stack allocations. I have a macro for this somewhere.

> you need to touch a byte ever guard-page-size bytes (in the direction of stack growth) in large stack allocations before using those allocations

> Also, the compiler could (and arguably should) generate code to do this touching of a byte every guard-size-bytes thing for dynamically sized stack allocations, which would make the problem disappear.

That's precisely what MSVC does since 1995 or so: inserting the calls to the _chkstk() function that touches the guard pages. I believe gcc also does this when it targets Windows.



This doesn't work in practice; see my comment above. You can't put guard pages into the middle of a stack segment (using mprotect()--the kernel will just override that mapping, happily growing right over the top of it). Even putting a guard page at the end doesn't work, as you can still have spurious SIGSEGVs, as you mention, because programs might just stride more than 4KiB at a time; you can't trust a C compiler to insert strided accesses for you, and that's just overhead.

The only way you can reliably do this, I've found, is to allocate your own stack segment.


Have you considered --split-stack? It provides infinitely growable stacks (up to the heap limit I guess) and a defined ABI to find the stack boundaries.

The cost is slower function calls and an ABI break.


Of course you can have a SIGSEGV when you exceed the max stack size. That's fine and to be expected.

What's NOT fine is for a large stack allocation to cause the program to silently scribble over another stack or heap.


> Of course you can have a SIGSEGV when you exceed the max stack size. That's fine and to be expected.

It's more that you can't expect anything better, having an actual stack overflow error would be a lot more helpful.


Right. With 64-bit ISAs you can address 2^64 bytes, but... you can't exactly have 2^64 bytes. At some point the system will fail. How it fails is important. A ran-out-of-resources errors is a lot better than undefined behavior resulting from random memory corruption.

And yes, portable C has finite size_t. Therefore portable C must fail in some way when limits are exceeded.


> As long as there's a guard page and as long as you can assume that page is at least 4KB in size, you can detect C stack boundaries safely.

How do you do this in pure C?


You're gonna make me dig it out, ain't ya. I don't know where I put it, but I've written a macro and function that does this. Basically, given a pointer and allocation size: if it's smaller than 4KB, do nothing; if it's larger, loop for allocation size div 4KB, touching a byte from one end of the allocation to the other. The tricky things are determining in which direction the stack grows, and the page size (if you really don't want to just hard-code 4KB).

You can determine stack growth direction by having one function call another with a pointer to a local variable in the former and then compare that to a pointer to a local variable in the latter. To do this correctly you have to cast to uintptr_t.

And, yes, pure C does not even assume a stack, much less a bounded stack. In pure C it's possible to have every call frame allocated on the heap, which means that assuming that they are allocated contiguously is not safe. Moreover, even with stacks, the run-time could allocate a second stack when the first is exhausted and push new call frames on the second rather than die. In pure C the only assumptions about limits have to do with the size of size_t. The real world intrudes though, and in practice all C run-times allocate call frames on bounded stacks.


None of this is possible in a well-defined pure C program. Pages with special memory protection aren't a thing C knows anything about.

That's the point - you can't do it without sacrificing portability.


What I'm describing is portable. It may be unnecessary in C run-times that allocate call frames on the heap, but those aren't realistic C run-times.


Say I ran your code on a special C implementation that tracks all memory access and disallows it when you try to read from stack memory that doesn't contain a live object.

That'd be a completely standards-compliant C implementation.

Your code wouldn't run on it. Your code would not be portable to my standards-complaint implementation. Therefore your code is not portable.


You can almost certainly make it portable. After all, the allocation has a pointer, size, and type, and if it is at all an array, you could dereference every Nth element as needed, possibly with memcpy(), and avoid aliasing rules via casting to unsigned char *.


Wouldn't the compiler be allowed to elide the memory accesses that are being used to poke the memory region? If you are doing the poking before the memory has been initialized, then I'm not sure there are any guarantees that the compiler will read from the actual memory region (since it's uninitialized). And if you do writes to do the poking then there might be similar issues if the result of the write is never used. Even for reads you would have to make sure that the result gets used. Also not sure if casting to volatile is allowed if the object itself is not volatile. This is all true regardless of UB.

It seems to me like it is impossible to do poking in a way that could not be elided by the compiler, because removing poking never alters the semantics of the program (stack overflow or guard pages are not part of the semantics of C, I believe).

Of course, in practice you'll probably get the right behavior, but I'm personally always interested in language lawyery questions. I do believe that there are cases of security issues introduced by compilers eliding memset operations on password buffers, because the compiler can tell that the result of memset is never used and so it doesn't actually clear the memory. These sorts of things are tricky to get right.


Yes, you need to find a way to have the compiler not elide memset() or this particular function. That part is probably not portable.

You can cast to volatile, but casting away volatile-ness and dereferencing the resulting pointer is UB.


> That part is probably not portable.

If you need that part, and that part is not portable, then your solution is not portable is it!


This wouldn't work for exotic architectures like a Symbolics Lisp machine where pointers aren't numeric indices into memory. C has been kept loosey-goosey about platform requirements to avoid ruling out future architectural possibilities that haven't been anticipated yet. You can't assume anything about memory organization and still be truly portable.


So you're going to touch every byte of the memory, to check it's been successfully allocated.

What happens when you access memory which has not been successfully allocated?


> So you're going to touch every byte of the memory, to check it's been successfully allocated.

Strictly less (not more than every 4KB), but, yes :)

> What happens when you access memory which has not been successfully allocated?

Portable C has finite size_t. Actual machines, virtual or otherwise, have finite resources. At some point you may run out of resources. With a dynamic stack allocation checker you'll fail with a segmentation fault when you run out. Without a dynamic stack allocation checker (but with dynamic stack allocations) you may well fail in random ways due to corruption of other stacks, heap, mmaps, etc.

If your C run-time doesn't have stacks then a dynamic stack allocation checker may be pointless... unless your OS is over-committing memory, then it may cause failure to happen sooner than it might otherwise.


> you'll fail with a segmentation fault

Who says you will?

Nothing in the C specification says that.


The C specs assume unlimited (up to size_t) resources.

Real life has no machines with 64-bit ISAs and 2^64 bytes' worth of resources.

Real life isn't portable. Who knew.

If there's no guard pages, no segmentation faults, no OOM killers, etc., then an allocation checker will cause failure in much the same kinds of undefined ways as not having the checker.

A segmentation fault is what you can expect in real systems, and it's absolutely better than undefined behavior.

You can do this portably, in C. It may not do you any good, and it may waste CPU cycles and memory bus usage. But it shouldn't harm. If the alternative is to not use alloca() or VLAs, and to make it so you can statically know your program's resource needs (which is very limiting, and not really attempted outside contexts where such limitations are very strongly desired), then do that.


You're saying 'if run on a system that does things in a way I expect then it works' but the point of the article is 'but what about if you want to run on a system that's not as you expect and you want to run pure C'.

We all know what you're saying works in practice on most machines.

But it's not pure C and it's not portable if you follow the spec.


It's portable: it assumes only that you can cast an allocation's pointer to a pointer to unsigned char. It's like assuming you can memset(), since that's effectively what it's doing. In the absence of memory over-commit, this will have no effect beyond wasting resources. In the presence of over-commit (bounded stacks resemble memory over-commitment!) it will have the effect of committing the memory OR failing in whatever way the machine fails when resources are exhausted. However, in the latter case, and when the machine uses stacks for C function calls, the failure will be defined by the machine rather than undefined (due to memory corruption).

Also, in the absence of stacks, there's no "direction in which the stack grows", and any attempt to determine that direction will yield a nonsensical result, but doing memset() in one direction or the other will also then be equivalent anyways.

EDIT: There is one non-portable thing here: the assumption that machines are more limited than size_t implies. However, that's a distinction without a difference for any 64-bit (or larger) ABI, because all such machines must be more limited than size_t implies. That resource exhaustion has any particular failure mode is indeed a non-portable assumption, but uselessly so: not being able to assume defined resource-exhaustion failure modes would be destructive, and the fact that the C spec does not define them is irrelevant.


> determine stack growth direction by having one function call another with a pointer to a local variable in the former and then compare that to a pointer to a local variable in the latter.

This is undefined behavior.

See n1570.pdf, Annex J page 560, J.2 Undefined behavior

Pointers that do not point to the same aggregate or union (nor just beyond the same array object) are compared using relational operators (6.5.8).


Comparing the uintptr_t cast of two disparate pointers is not undefined behavior. A comparison may not yield a portable result because the integer representation of pointers may not be as plain memory addresses.

So such a simplistic stack growth detector may get the wrong answer, but it's not undefined behavior:

  char stack_grows_up(uintptr_t callers_local) {
    char my_local = (uintptr_t)&my_local > callers_local;
    return my_local;
  }
EDIT: Ah, and uintptr_t is an optional type. If the platform doesn't have it (e.g., because it's a segmented memory architecture) then this won't work, so it's not portable, but aside from that it's not undefined behavior, it's just that the computation of stack growth direction may well be wrong.


> A segmentation fault is what you can expect in real systems,

Nope. Just today I've seen a stack corruption bug that instead of producing a segmentation fault caused the process to go into uninterruptible sleep, somehow.


In pure C, there is no stack abstraction. Technically speaking, the stack is an extremely common implementation detail shared by almost all C runtimes.

There's no real way to answer this question at that layer of abstraction; the minute you add any library calls that deal with a stack at all (such as alloca), we're no longer talking about pure C.


Correct. Note that alloca() can be an alias for malloc() in stackless Cs. If the stack assumption is invalid, then a checker will not trip over a guard page, and it will at best trip over memory over-commit (which may still be desirable). If the guard page assumption is invalid, then a checker will cause failure just as surely as not checking (provided you intended to use the whole allocation anyways), and either way the failure will be undefined, therefore this is still portable C!


I was going to reply that alloca is a standard function, but then I checked and it's actually not is it!


However VLAs are in C99 and up. They're rather similar to alloca() in that whatever mechanism you'd use to implement alloca() (automatic stack allocation, or heap allocation), you'd also use to implement VLAs.


Except alloca'd space is only dropped at the end of a function. VLAs are dropped at the end of a block. This makes a huge difference in loops as alloca can quickly exhaust the stack.


No real difference:

  for (size_t i = 1; i < WHATEVER; i <<= 1) {
    char vla[i];
    
    vla[0] = 0;
    vla[i - 1] = 0;
    ...
  }
that is, you can still exhaust your stack trivially with VLAs.

Sure, the two things have a different "API", but as far as allocating arbitrary amounts of memory goes, they both let you do it.


I'm not sure what your point is. You could replace the VLA with malloc in that example and exhaust your entire VM space.

In the rare cases where VLAs might prove useful, e.g. a vector math library (IIRC, a major reason for VLAs was to help FORTRAN programmers transition to C), this difference absolutely does matter in reality.


Maybe I missed the point of your comment. Mine is that the scoping difference between VLAs and alloca() isn't relevant to the question at hand of how to protect against memory corruption due to large stack allocations.


Notably, C11 moved VLAs to an optional feature, not a required one.


> touch a byte every 4KB in the direction of stack growth before you use bulky stack allocations

I don’t think you can portably (1) do that. You have to allocate before you can touch those bytes.

Also, if you touch those bytes beforehand, I don't think there's a guarantee that you won't hit some volatile memory that interacts with hardware, and of course, there's no portable way to get the stack pointer in the first place (if only because it, conceptually, doesn't exist in C).

(1) specific compilers can safely do that because they know how they are implemented, and many compilers are fairly similar, so one can make this “quite portable”, but not really portable.


One way is to allocate first. alloca(1GB) typically isn't going to cause a SEGV, so you can check the allocation after making it. With VLAs you have to make sure (this is not portable) that the compiler will put your loop iterator state variables ahead of the VLA.

If you can assume a linear stack (which is kinda the whole point of this discussion) then you can call a function that will repeatedly alloca() in small chunks and touch a byte and return on success or SEGV on failure, and then when it returns it will release that memory but then you know you can safely alloca() that much.
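A sketch of such a probing function; the page-sized chunks and use of alloca() are illustrative, and the volatile store keeps each touch from being optimized away:

    #include <alloca.h>
    #include <stddef.h>

    /* Claim `want` bytes of stack one page at a time, touching each page.
       If the stack is too small this faults at a page boundary; if it
       returns, the space is released again and a later alloca() of `want`
       bytes is known to be safe. */
    static void probe_stack(size_t want) {
        for (size_t done = 0; done < want; done += 4096) {
            volatile char *p = alloca(4096);
            p[0] = 0;
        }
    }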


Kids these days - let grandpa tell you a story about writing OS/2 1.x device drivers pre-virtual memory. Stack was obsessively monitored, especially in interrupt handlers, where you were called with just a few hundred bytes of stack or so. If running low you had to manually switch to your own stack and keep track of which one you were on.

Excellent point being made about C, though. Amazing how we've lived with this limitation for so long. I get that C is supposed to be a portable assembler, but the cases where the stack wasn't used for return addresses and local variable frames are rare, so some support should have been added early on.


I believe even later, userspace code had to grow stack page by page, no skipping.


Wanna feel that thrill again? Do some bare-metal embedded.


Yeah, I do that quite a bit actually, when I write code for smaller embedded systems I get the same rush I got back on my old Apple ][.


Stack size is visible and settable. This article describes a non-issue. A programmer just has to be knowledgeable about the tools at his disposal. Here's an example:

cat <<EOF >demo.c
#include <sys/resource.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[], char *envp[]) {
    struct rlimit rlp = { 0 };
    getrlimit(RLIMIT_STACK, &rlp);
    printf("stack size=%llu\n", (unsigned long long)rlp.rlim_cur);
    if (rlp.rlim_cur < rlp.rlim_max) {
        printf("setting stack size to %llu\n", (unsigned long long)rlp.rlim_max);
        rlp.rlim_cur = rlp.rlim_max;
        setrlimit(RLIMIT_STACK, &rlp);
        execve(argv[0], argv, envp);
    }
}
EOF
make demo
./demo

Similarly, there is pthread_attr_setstacksize(3), which does the obvious thing for threads.


TFA is right, sadly.

One of the projects I want to do in the future is to implement boringcc [1], and this sounds like something I would want in it.

The question then becomes: would it even be possible to do?

[1]: https://groups.google.com/g/boring-crypto/c/48qa1kWignU/m/o8...


The spam on that thread is really something.


Wouldn't the majority of it just be, effectively, "all pointers are now fat"?

It could disable some (many?) optimizations, but I don't see it as a particularly hard thing to do (at least conceptually).


No, not a hard thing to make a boringcc compiler, but I was referring specifically to querying stack size.


TFA is wrong. The C run-time's call frame allocation method and actual resource limits leak past the C function call and variable-length-array (VLA) abstractions. You can write portable code to force defined (by the platform, not by the C spec) resource exhaustion behavior sooner than undefined behavior caused by memory corruption caused by use of much-larger-than-the-stack stack allocations.


Can you really? If you depend on a SIGSEGV handler, that handler needs stack space unless you set up a separate stack for it, and I am not sure that is portable.


The point is to cause a defined (by the platform) stack exhaustion failure mode before possibly causing undefined behavior by corrupting memory.

The C specs do not define any resource exhaustion failure modes. But actual operating systems, ABIs, platforms -- they do.

If you are on a system that gives you a SIGSEGV but no way to catch it on an alternate stack -or if you don't catch or ignore it- then your process will crash. This is fine. It's better than memory corruption.


Can you give me an example of a platform where your program cannot cause memory corruption if you access an array on the stack out of bounds far enough to corrupt actual existing memory on the heap? Because a SIGSEGV is the best case scenario; the other scenarios are not helped at all by platforms.


Say you alloca(1GB), and alloca() is naive and gives it to you even though that 1GB spans several other stacks, and now you write somewhere in the middle. If you're lucky you'll hit unmapped space and trigger a segfault. If you're unlucky you'll scribble over another thread's stack or whatever.

But if you progressively write into the alloca()ed area in the direction of stack growth then you cannot corrupt other mappings, and you can only segfault if you exceed your stack's maximum size. Of course, there can be platforms with no SIGSEGV or whatever.
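
A rough sketch of that "write in the direction of stack growth" idea, assuming a downward-growing stack and a 4 KiB page size (both platform assumptions, not guarantees):

    #include <alloca.h>
    #include <stddef.h>
    #include <string.h>

    /* Touch a freshly alloca()ed region one page at a time, starting at the
     * end nearest the already-mapped stack, so any fault lands just past the
     * mapped pages instead of somewhere in the middle of other mappings. */
    void use_big_alloca(size_t bytes) {
        const size_t page = 4096;           /* assumed page size */
        unsigned char *p = alloca(bytes);   /* extends the current frame downward */

        for (size_t off = 0; off < bytes; off += page)
            p[bytes - 1 - off] = 0;         /* walk from high addresses downward */

        memset(p, 0, bytes);                /* now safe(ish) to use normally */
    }

On a guard-page platform this keeps every fault within one page of the already-mapped stack, which is exactly what the guard-page mechanism can detect.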


Probably the only way is to use platform-specific quirks, but that's basically the same with everything else. The portability only works for some generic code, pretty much any non-trivial/real-world software uses a lot of platform-specific code.


I don't get Alpine's reasoning here. In order to be portable does every C program have to work with any stack size limit? How could that even work? Without a guaranteed minimum there's nothing to do except "portable to common platforms-ish."


Do you know about the shortest fully conforming C compiler?

   #include <stdio.h>
   int main(int argc, char*argv[]) {
      printf("ERROR: resources exhausted!");
      return -1;
   }


How do functional languages, and in general languages where recursion is pervasive, deal with this?


So, you can't answer "in general" because things are handled differently depending on the language.

I believe Scheme basically only uses the heap (which is necessary because of call/cc iirc).

Haskell uses a variation of a G machine (looks like a spineless tagless g machine to be specific). The way that works is that they convert your program into a giant lambda term and then they work on reducing it. This of course looks nothing like a traditional call stack, so you can replace that whole thing with the G machine. Recursion (at least of the non-tail recursive variety) will look like a lambda term slowly getting bigger in heap space.

And then I also heard of someone (I forget which language) talking about setting up some logic to look for stack overflows and then throw all the current stack onto a heap and then start over.


The part about detecting stack overflows probably refers to an implementation strategy in some Scheme implementations (IIRC Chicken Scheme does this) that uses the platform/machine stack as, effectively, a hardware-assisted nursery generation for a generational copying GC. When it gets full (usually well before true overflow), a minor GC cycle is run and the stack is completely emptied. In these designs the "stack" is usually used for all allocations, not only for activation records.



That's right.


In many cases, with tail call optimization, turning specific types of recursion into iteration (which doesn’t use more than a constant amount of stack space).


Not tail call optimisation, tail call elimination.

If the language is based on recursion, you can't just hope it happens often enough; it has to happen all the time, even for corecursive functions.

Except I've never been super clear on Haskell having TCE or not, because laziness means there are limited opportunities for it to even fire.


Because function evaluation is lazy, all Haskell (well, GHC) calls are tail calls. The only thing that consumes stack is case.


Tail Call Optimization & allocating their own stacks, generally.


What are the advantages of low maximum stack size limits (given growable stacks)?


One obvious area is in small embedded systems.

It's surprising how hard it can be to track down a stack-caused bug when you've got a bunch of interrupts banging on it.

Generally it's a little unnerving to just pray that you've got enough stack, given its hidden usage and the possibility for things like recursion. Kind of a shame considering what a slick way it is to deal with memory leak avoidance.


This is speculation, but maybe it catches unintended infinite recursion earlier? I’m willing to bet that the vast majority of stack overflows happen due to programming errors rather than genuine uses running out of stack.


This is the best answer I could think of. We have to assume the question is about Alpine Linux, not about embedded systems per se. Small thread stacks run out of space quickly (fail fast), help prevent a malicious process from gobbling up lots of VM, and force you to program in a style that doesn't use much stack space, although that's questionable because most memory leaks are heap-based.

It's worth noting that the limits are adjustable on a per-package basis in Alpine via the compile-time link options.


Being able to build web servers with 10,000 threads w/ clean imperative code rather than event-driven yo-yo pattern finite state machines. It's what Google does. They limit stack size to 60kb for just that reason.


Kernel threads have tons of switching overhead so state machines are still better. Get your compiler to compile clean imperative code into them :)

(okay, by "clean" I do mean littered with `.await`, but it's still structurally clean)


You don't need kernel scheduling for stackful threading. So-called green threading was common long before event-oriented asynchronous programming became popular with the advent of kqueue and epoll. See, e.g., https://www.gnu.org/software/pth/

That said, you can make kernel-aware user-controlled context switching much faster than it is currently: https://www.theregister.com/2020/08/10/google_scheduling_cod...


Time to abandon the stack as a 'display'? That's why they overflow (mostly): lots of large items, or lots of recursion and local variables.

Also, having the addressable method arguments in the same space as the return address allows a whole host of other bugs and cracks.

I have long thought that the return stack should be a memory region unaddressable by code except via call and return.

Local method arguments could then revert to ordinary allocations, or some other checkable allocation, and have normal error recovery.


Never thought about it, but yes, what a huge (and easily implementable) oversight: stackSize(), stackUsed(), and stackTopPtr() would be a few asm instructions in most cases.


There is no useful way to employ these functions. It's equivalent to asking if a mutex is locked. That piece of information is worthless.


What are you talking about? For lots of functions it is perfectly possible to statically determine a maximum stack size (not all functions obviously, that would be the halting problem). A trivial example would be a function that calls no other functions, but you can imagine others: functions with static call trees (i.e. no function pointers) that don't use VLAs or alloca and are not recursive (mutually, indirectly, or otherwise) will have a static limit on possible stack usage.

If you’ve formally verified that function X uses at most Y pages of stack, and the remaining stack is smaller than that, you could return an OUT_OF_STACK error instead of calling it. Most libraries/functions wouldn’t need this, but there are C libraries which place an extraordinary importance on reliability, and it seems like it would be a useful feature for them.

There might be other reasons why these functions shouldn’t be in the standard (or are impossible on some platforms), but ”the information is worthless” is not one of them. Of course this information can be useful.


The information is indeed worthless unless the standard were to also define such things as the stack layout and a closed set of operations which are allowed to consume more stack, such as defining a VLA, or calling a function.


https://news.ycombinator.com/item?id=28683773 gives a useful way to employ them; Chicken Scheme has to run a minor garbage collection and longjmp to its trampoline before the stack runs out, but because there's no way to find out what the stack size limit is at runtime, you have to configure a stack size limit when you compile it. The CPython interpreter could also use them usefully: to ensure that recursive loops in Python programs raise a Python exception instead of segfaulting, it defaults to a limit of only 1000 levels of Python recursion, because some platforms it ran on were segfaulting at 2000. If it could query the stack size limit, it could raise an exception when it's getting close to the real limit instead of guessing. Unlike access(), the answer wouldn't be stale by the time you acted on it.
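
As a non-standard illustration of what such an API could replace: on glibc-based Linux a runtime can already approximate the stack bounds with pthread_getattr_np (a sketch, compiled with -pthread; the arithmetic assumes a downward-growing stack):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>

    /* Rough estimate of the stack remaining below the current frame. */
    static size_t stack_remaining(void) {
        pthread_attr_t attr;
        void *stack_base;
        size_t stack_size;
        char marker;                     /* lives in the current frame */

        pthread_getattr_np(pthread_self(), &attr);
        pthread_attr_getstack(&attr, &stack_base, &stack_size);
        pthread_attr_destroy(&attr);

        /* stack_base is the lowest address of the stack mapping. */
        return (size_t)((char *)&marker - (char *)stack_base);
    }

    int main(void) {
        printf("roughly %zu bytes of stack left\n", stack_remaining());
        return 0;
    }

A CPython- or Chicken-style runtime could refuse to recurse further once this drops below some margin, rather than guessing at a fixed depth.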

There are more hacks in Heaven and Earth, Horatio, than are dreamed of in your philosophy.


I believe we were discussing a potential addition to the C standard. It would however be useless unless the standard also defined such things as stack layout and a closed set of operations which may consume more stack space, such as calling a function.


It's true that this potential addition to the C standard would not permit things like CPython or Chicken to depend only on the C standard, but then every C program already depends on things other than the C standard, such as the example given in TFA: any C implementation is entitled to abort any C program because it ran out of stack.

Moreover, every nontrivial C program in practice contains UB.

Nevertheless, it does matter what is and isn't standardized; standardizing such functions would greatly improve the situation for programs like those I mention, because they would be able to rely on the C standard to find out when they're about to run out of stack, an event they already have code for handling. It's true that a pathological C library implementation could still totally break their stack-exhaustion-handling code, and in fact that's already true, but fortunately there are lots of C library implementations that behave well enough in practice.


If writing a recursive algorithm on arbitrary input, check remaining stack size upon recursive function entry. If space is depleted, allocate a new fiber (aka a brand new clean stack), switch to it, and resume. Switch back to calling fiber on stack unwind.
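
A sketch of that hand-off using the old ucontext API (getcontext/makecontext/swapcontext); deep_work, the 1 MiB size, and the missing error handling are all placeholders:

    #include <stdlib.h>
    #include <ucontext.h>

    #define FIBER_STACK (1 << 20)        /* 1 MiB per fresh stack, arbitrary */

    static ucontext_t caller_ctx, fiber_ctx;

    static void deep_work(void) {
        /* stand-in for the next stretch of the recursive algorithm */
    }

    static void fiber_entry(void) {
        deep_work();                     /* keep recursing on the new stack */
        /* returning resumes caller_ctx via uc_link below */
    }

    /* Called when the current stack is nearly exhausted: allocate a brand-new
     * stack on the heap, continue there, and come back when it unwinds. */
    void continue_on_new_stack(void) {
        getcontext(&fiber_ctx);
        fiber_ctx.uc_stack.ss_sp = malloc(FIBER_STACK);
        fiber_ctx.uc_stack.ss_size = FIBER_STACK;
        fiber_ctx.uc_link = &caller_ctx; /* resume here when fiber_entry returns */
        makecontext(&fiber_ctx, fiber_entry, 0);
        swapcontext(&caller_ctx, &fiber_ctx);
        free(fiber_ctx.uc_stack.ss_sp);
    }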


If you’re writing a recursive algorithm on arbitrary input, you should be using a stack container to store your state, not the call stack.


Why do you think so? A stack container ordinarily contains a single data type. You can jump through some hoops to let it store completely arbitrary types, but when you consider a function's stack frame as a distinct type, the call stack naturally does that for you.


1. A stack frame usually takes up far more memory than the state that actually needs to be persisted, due to local variables and padding from alignment requirements. With a stack container, you can specify exactly what gets stored, and save on your memory footprint (see the sketch after this list).

2. There are very few knobs you can tweak to control how your call stack space is allocated or represented in memory. With a stack container, you have much finer control over that - you can do reallocations or define custom error handling when you overflow the container, you can deallocate/shrink the container when you no longer need the space, you can serialize the container to disk to make your processing resumable, etc.

3. A call stack has a very limited set of operations. You can only access data from the current stack frame, and you can only push/pop (call/return) stack frames. But with your own stack-like data structure, you could extend it to do far more, e.g. accessing data from the first or last n traversals, or insert/remove multiple elements at once.
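
As a concrete illustration of the first point, an iterative tree traversal whose explicit stack holds one pointer per pending subtree instead of an entire call frame (names invented, error handling omitted):

    #include <stdlib.h>

    struct node { long value; struct node *left, *right; };

    long sum_tree(const struct node *root) {
        size_t cap = 64, top = 0;
        const struct node **stk = malloc(cap * sizeof *stk);
        long total = 0;

        if (root) stk[top++] = root;
        while (top > 0) {
            const struct node *n = stk[--top];
            total += n->value;
            if (top + 2 > cap) {                  /* grow instead of overflowing */
                cap *= 2;
                stk = realloc(stk, cap * sizeof *stk);
            }
            if (n->left)  stk[top++] = n->left;
            if (n->right) stk[top++] = n->right;
        }
        free(stk);
        return total;
    }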


Most commonly you handle this with a stack of records.


you mean std::stack?


"I can't complete this recursive algorithm because I ran out of stack at this particular step" is a much nicer error message than "segmentation fault". For that it would be useful if you could check remaining stack space before recursing.


Whatever values these functions returned, there would be no way for you to determine whether another step could be performed.


I could see that sort of thing being very useful for a debugger.


Debuggers don't need to restrict themselves to what the C standard provides, and never do.


Could you explain this please?


Just like with the mutex, this piece of information is out of date as soon as it reaches you. Preparations for whatever action you perform afterwards, or the very act of retrieving this information, may invalidate it.


It seems to me that in the case of the x86, the virtual memory system could detect overrunning the end of the stack and announce it to you.

The meta-problem is what to do with the information.


A clean/clear error would already be better than just segfaulting with no indication (that's why e.g. Ruby and Python have an explicit stack depth check and will raise an exception, hopefully before they run out of C stack).


As mentioned in the OP though, there is zero guarantee about the stack usage by library functions.


There is also no guarantee that the stack is contiguous—see the "-fsplit-stack" GCC option, or search for "segmented stacks". Libraries for coroutines and delimited continuations also routinely segment the stack. A simple API based around easy access to "top of stack", "bottom of stack", and "current stack pointer" addresses is thus non-universal. There may not be a meaningful limit to the amount of stack available, aside from available heap memory or address space. And this can change over time in response to actions other than allocating or freeing memory on the stack.


No, but that's a pretty weak argument not to have them: they're better than nothing. Not all code uses libraries. And if it was part of the C standard, libraries could use them as well: libraries could fail gracefully instead of blowing the stack. Many libraries already do this when malloc fails, after all.


I have yet to see a library that fails gracefully when malloc fails. Many claim to be able to do so, but usually keel over when actually presented with a failure.


...or much of anything else. I've seen commercial libraries monkey with processor state in some nasty ways.


Well, there's still libsigsegv and its stackoverflow_install_handler. You won't get the size, but you do get the overflows.


Seems like a moot point on any 64 bit platform, where you can effectively reserve GBytes of virtual memory for each of thousands of threads, and page in physical memory as required.

It's the only sane interpretation I can give to a standard that hides any way to manage stack usage - the compiler/OS layer takes responsibility.


Gigabytes for thousands of threads is just barely true; if you reserved say 8GB each for 16k threads, you would just about use up the 47 bits of address space reserved for userland on Linux leaving nothing for text or heap.

Certainly a few MB per thread isn't too crazy though; at 8MB per thread, you can have a million threads and use only a tiny fraction of the address space available.


You could reserve, say, 46 bits for heap and 46 for stack; 70 terabytes each ought to be enough for anybody. Then start allocating thread stacks in successive halvings: 2 threads get 35 TB each, 65536 threads get 1 GB each, etc.

Effectively, you can make sure the program runs out of available memory long before a stack exhaustion issue, unless the stack usage pattern is severely unbalanced for some thread.


Virtual memory isn't free. 16k threads with 8 GB stacks technically needs something like 250 GB of page tables. And you must be using some strange memory model and microprocessor that lets you have 92 bits of address space.


1. 2^46 + 2^46 is 2^47

2. The actual size of the page table will depend on the page sizes supported by the architecture. On modern x86_64, for programs with fewer than 32k threads, it might make sense to use 1 GB stacks with 1 GB guard pages, for just 2 PTEs per thread, less than the 5 needed for an 8 MB stack built from 2 MB pages.


If it's the OS's responsibility then Alpine Linux shouldn't blame developers or say their code isn't "portable."


Linux kernel APIs are interesting. A couple of days ago I was writing code which needs sendmsg/recvmsg. The manual page says this:

Ancillary data should only be accessed by the macros defined in cmsg(3).

The problem being, the code I was writing was in C# which can't consume C headers or evaluate these macros.


Can't you write C functions that invoke the cmsg macros and call them from C#?
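
Something along these lines would work as the C side of that shim; the function name is made up, but the macros are the standard ones from cmsg(3):

    #include <string.h>
    #include <sys/socket.h>

    /* Copy the payload of the first control message matching (level, type)
     * into out; all of the alignment logic stays behind the cmsg macros. */
    int first_cmsg_payload(struct msghdr *msg, int level, int type,
                           void *out, size_t out_len) {
        for (struct cmsghdr *c = CMSG_FIRSTHDR(msg); c != NULL;
             c = CMSG_NXTHDR(msg, c)) {
            if (c->cmsg_level == level && c->cmsg_type == type) {
                size_t n = c->cmsg_len - CMSG_LEN(0);   /* payload length */
                if (n > out_len)
                    n = out_len;
                memcpy(out, CMSG_DATA(c), n);
                return (int)n;
            }
        }
        return -1;   /* no matching control message */
    }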


Sure, but that complicates things. I have reverse-engineered the macros instead. They aren't terribly complicated; they just round some of the pointers up to a multiple of sizeof(void*).


That's probably reasonable. The Linux kernel has historically been very good about not breaking existing compiled code by changing ABIs like this.


This also makes Chicken Scheme basically guess the stack size: it guesses it's 1 MiB on 64-bit platforms, and 256 KiB on 32-bit ones. Of course, you can also manually override it, at your own risk.


What’s the point of a low stack limit, apart from places like the kernel?


A microcontroller programmer says: "What kernel?"


Any environment with restricted memory resources compared to needs, and a need for reliable behavior

Embedded applications such as automotive, or industrial control come to mind (or even consumer electronics, with less life-or-death consequences possible)


… and embedded, of course. The main problem is that there IS a limit, and you have no idea what the limit is. The only way you find out is when you crash.


People have differing opinions of "embedded", but I would claim that in most embedded use cases the stack is actually specified by the programmer via linker scripts and/or manually set up during early bootstrap. Plus, it's not exactly hard in most of these environments to write target- and/or compiler-specific macros that detect the top of the stack and can check it at runtime via DEBUG()-like macros or modified function-entry macros. None of this is portable, of course, but neither is code designed to run in 1-16k of RAM and a few more K of flash.
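
For example, such a target-specific check can be as small as the sketch below, assuming the linker script exports a __StackLimit symbol at the low end of the stack region and that handle_stack_low is a user-provided hook (both names are made up; conventions vary by toolchain):

    #include <stdint.h>

    extern uint32_t __StackLimit;    /* low end of the stack region (assumed symbol) */
    void handle_stack_low(void);     /* user-provided, hypothetical */

    /* Bytes between the current frame and the bottom of the stack region.
     * The address of a local is used as a cheap stand-in for reading SP. */
    static inline uint32_t stack_bytes_free(void) {
        uint32_t marker;
        return (uint32_t)((uintptr_t)&marker - (uintptr_t)&__StackLimit);
    }

    #define STACK_CHECK(min_free) \
        do { if (stack_bytes_free() < (min_free)) handle_stack_low(); } while (0)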

OTOH, there seem to be a lot of "we don't control anything" IoT devices, which are random overly powerful processors with random u-boot + Linux kernel + rootfs combinations running my_first_c_program.c-level applications. All of it is tossed together with very little engineering effort, and frequently less low-level knowledge, and left to rot because it's basically unmaintainable, even if it happens to be among the 1% of devices that can get over-the-air updates.


gcc's -fstack-usage works well, at least for the kind of C I write on small microcontrollers. You still have to add it up manually across the callgraph and of course take ISRs into account.


> The only way you find out is when you crash.

You can fill the stack region with sentinel data and periodically scan it to warn of impending doom.
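
A sketch of that sentinel ("stack painting") approach for a bare-metal target, again assuming a linker-provided __StackLimit symbol at the low end of the stack (the symbol name and word size are assumptions):

    #include <stdint.h>

    extern uint32_t __StackLimit;    /* low end of the stack region (assumed symbol) */

    #define PAINT 0xA5A5A5A5u

    /* Call very early, before the stack is deep: fill everything below the
     * current frame with a sentinel pattern. */
    void stack_paint(void) {
        uint32_t marker;             /* approximates the current stack pointer */
        for (uint32_t *p = &__StackLimit; (uintptr_t)(p + 16) < (uintptr_t)&marker; ++p)
            *p = PAINT;              /* leave a small margin below this frame */
    }

    /* Call periodically: count how much paint is still untouched, i.e. how
     * much stack has never been used since stack_paint() ran. */
    uint32_t stack_untouched_bytes(void) {
        uint32_t untouched = 0;
        for (uint32_t *p = &__StackLimit; *p == PAINT; ++p)
            untouched += sizeof *p;
        return untouched;
    }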


There is in embedded (and in the kernel), but not in general-purpose systems like the Unix userspace. And the same piece of software very rarely runs on both. So, again, what's the point of this exercise?


Virtual memory is not exactly free. To work fast, modern CPUs require quite a lot of data to be available in the on-chip caches. TLB cache is one of these pieces of data. When you reserve many GB of address space for stacks of many threads, the code running on these threads is less likely to hit that TLB cache.

I've heard that huge pages might help, unfortunately they're somewhat harder to use.


Yet another feature of Amiga that we don't have in more "modern" systems. You can actually tell the OS how big of a stack to provide to a new task and it'll provide it.


On Linux you do that with ulimit -s.


How can you predict how much stack you need? Isn't this basically the halting problem?


As far as I know, there aren't any languages which provide a way to do this out of the box. If a language wants to provide an environment where no stack overflows can happen, then it needs to provide a runtime where the stack programs use is transplanted over to the heap as it gets too large (or maybe it just starts out there). At this point the answer is that you only need a stack big enough to hold your runtime (which you can engineer to be sufficiently small always).

Of course you can imagine a language where your function signature includes how much stack space it takes up. The stack size being recursively calculated by looking at all the functions your function calls etc.

If recursion is included, then you're correct that this would require solving the halting problem. The obvious solution would be to just not allow recursion. Although, if you also consider languages like Idris (which are technically sometimes not turing complete) you can setup a termination checker and then certain rules can give you an upper bound on how long a recursive function will run (and therefore how much stack it takes up).


With the exception of recursion, static analysis can answer that question.


A brilliantly written article. Couldn't have said it better myself.


How does something like Ada help you measure and control stack usage?


Measuring stack requirements is typically implemented by walking the call graph and adding up the functions' stack sizes (for ordinary functions these are well known at compile time). There are “only” a couple of problematic cases where this algorithm fails: recursion, `alloca` and the like, and indirect calls.

For all of these problems there are more complex analyses that can be done to determine e.g. maximum recursion depth or possible functions that an indirect call could resolve to. And failing all of these there is usually a way for an engineer to override or provide this information.



