The OpenBSD project has been fighting the ecosystem on this for a long time; there are no executable stacks on OpenBSD.
The pmap support appeared in subsequent commits for multiple platforms almost 18 years ago, even before support for the amd64 architecture landed in-tree in 2004. amd64 had non-exec stacks from day 1.
Historically we didn't include a section in the binary indicating noexec stack, because that was the default.
The OpenBSD kernel and ld.so (runtime linker) have always ignored PT_GNU_STACK. On other platforms, its absence is taken to imply an executable stack; the mupdf-x11 example in this article highlights that. The default should be a non-exec stack, but the flag is instead treated like the NX bit. It's a fail-open design.
If you compare OpenBSD against a modern Linux distro, you'll be surprised.
How does FreeBSD compare vs OpenBSD on that aspect?
Edit: I give FreeBSD too much credit.
The kern.elf64.nxstack / kern.elf32.nxstack sysctls were introduced in FreeBSD 9.0 and default to 0; setting them to 1 enables PT_GNU_STACK behaviour. Apparently it's not implemented for all platforms. So you get to choose between executable stacks by default and the same nasty behaviour Linux has.
This whole chain of hacks is infuriating. Unloading code is a good and useful thing to do. Silently disabling dlclose to fix a bad fix for a C++ problem that shouldn't have been a problem in the first place is, well, why we can't have nice things at an ABI level.
I'm convinced that the root of the problem is the ELF insistence on a single namespace for all symbols. Windows and macOS got it right: the linker namespace should be two-dimensional. You don't import a symbol X from the ether: you import X from some specific libx.so. If some other library, say libx2.so, also provides symbol X, that's not a problem: because an import is not a bare symbol name but a (library, symbol name) pair, no collision is possible even though libx.so and libx2.so provide the same symbol.
Why? What are you doing that requires unloading code?
Other operating systems manage to support library isolation and safe unloading just fine. Windows has done it for decades. There's no reason except a long history of bad decisions that the ELF world should have a hard time with a concept that comes easily everywhere else.
> But with dynamic libraries you’re not necessarily the only user of that library, so depending on dlclose to actually unload the library is not really something you can do.
Libraries are reference counted. They get unloaded when the last reference disappears. (And if necessary, we should be able to create island namespaces with independent copies of loaded libraries.)
There is a separate library, or a part of libgcc, that can be mapped in multiple times. This allows trampolines without any limit other than memory. Alternately, just decide how many you want in advance, and place them into the executable.
Trampolines are conceptually in an array, to be allocated one by one, but there are actually two arrays because we'd want to separate the executable part from the non-executable part. On a 64-bit architecture with 4096-byte pages we might have a read-write page with 256 pairs of pointers, an execute-only page with 256 trampoline functions (which use those pairs of pointers), and possibly another execute-only page with helper code. The trampoline functions can use the return address on the stack to determine location, though x86_64 also offers RIP-relative addressing that could be used.
These arrays could be done as thread-private data or done with locking. Pick your poison. The allocation changes accordingly, along with the space requirements and cache-line bouncing and all. Be sure to update setjmp and longjmp as required.
There is also a much simpler answer to the usual security problems. If the stack will become executable and the option hasn't been given to explicitly allow this, linking should always fail at every step. Propagating the lack of a no-exec flag should be impossible without confirmation of intent.
That ought to be a tagline at the bottom of every man page in section 2 and 3.
It really disappoints me that making fundamental improvements to the GNU/ELF/etc.-family programming model is so difficult. I've already commented about the benefits of a two-level namespace. The article we're discussing is about a bad default in linker interfaces. There are tons of other systemic problems that nobody is interested in solving because the default response to every suggestion is "no".
Consider, for example, symbol interposition: why should programs go through the GOT to access their own symbols? Yes, yes, in theory, you can use LD_PRELOAD to interpose symbols, except you actually can't, because Clang aggressively inlines, which means that whether your interposed function is actually called in a given context is a crapshoot. We're paying for intra-library symbol interposition even though it's become unreliable. Yes, you can set symbol visibility to hidden by default or use a linker script to trim your export list, but most people don't know to do that. Why is the default so bad? Why do we have to live with bad defaults for decades?
The expertise of yourself and others on this subject might be welcome in upcoming Unix-like OSes that are being written, such as:
I am involved in neither of those but I follow them both with interest.
each closure is a function which takes a number of saved arguments and a number of new arguments. construct a set of macros to define a closure structure; this structure contains a function pointer to the handler, plus all the saved arguments. another macro defines a function that takes the new arguments and constructs a complete call to the handler function with the saved arguments. this solution carries the type information for all the arguments and flags, at compile time, attempts to call the closure with non-unifying new arguments. this general approach is from Sergei T and is used quite heavily here: https://github.com/nanovms/nanos
dynamically synthesize a function that shifts the new arguments to the right in the call and fills in the saved arguments from immediates. this is faster and the application can use normal function calls, but it's not type safe. i'm working on a little c->c front end that maintains type safety
these are both really implementations of partial application rather than general closures. they are both heap based, which allows them to be used for general callbacks without worrying about lifetime, but they become a big memory management problem.
with the exception of the memory issues - i think c+closures is a great environment for systems programming.
You should be using the whole C++ language. There's no such thing as a "sane subset", because all of the language has "sane" uses. What you want is to create good programs.
Anyway, normally in C interfaces that take a callback also take a pointer-sized value to pass to the callback, precisely to avoid having to generate code dynamically like this, so this feature shouldn't be that prevalent.
For example, glibc has a qsort_r function that works like that.
If the compiler could generate code that allocates executable memory, it could generate a stub off in executable heap, and free it when the called function returns. This would handle nested functions in the downward funarg situation (the only one that's handled by GCC C (and Standard Pascal)).
Things would be complicated by longjmp (the stubs wouldn't be freed), but it could be handled similarly to how C++ compilers handle destructors for stack frames that are skipped by a longjmp; per the standard, the destructors aren't called, but the compiler may do so if appropriately configured.
But without a C runtime library function to allocate executable memory, the compiler is a bit stuck.
If GCC had a special function pointer type for closures, they would have worked just fine.
Whilst this can't be done in a platform-independent way, C does allow you to do this, doesn't it?
void* virtualCodeAddress = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
I believe these appeared at least in glibc 2.27 (and POSIX 2001), if not earlier, and they run at runtime.
I'm not sure how legal it would be for the compiler to handle things in this way (apart from in "implementation defined" areas), but there isn't a technical reason they can't allocate executable memory at runtime.
(I also believe Windows handles things similarly with VirtualAlloc & VirtualProtect).
quadmath isn't supported by every architecture or platform. That doesn't mean the compiler is stuck and can't make use of it on the platforms that do support it.
> Whilst this can't be done in a platform-independent way, C does allow you to do this, doesn't it?
I'm curious: Is there a programming language or a compiler that is actually smart enough to make that distinction? In principle, C++ compilers wouldn't have to monomorphize template instantiations all the time: it's in principle possible to compile templates in a way that adds a hidden "type" parameter to the function which essentially acts as an out-of-band vtable to provide the operations that the template uses (yes, there are ABI issues).
For example, due to operator overloading, given "T a, b;", you don't know whether "a + b" should be a simple addition or a function call. For full genericity, you'd want to compile this as a vtable'd addition, but that's going to severely harm performance for primitive types. So you'll have to monomorphize for primitive types. This applies to every operator in C++, of which there are a lot.
But that's just performance. Worse, the semantics of operators can depend on whether they are overloaded or not. For example, "a && b" short-circuits normally, but does not short-circuit if "operator&&" is overloaded. So now you need a variant for whether or not operator&& is overloaded (similar for "operator||", and even "operator,").
I'm fairly sure I've only scratched the surface of weird things that come up with compiling C++ templates. There's loads of other fun examples of template weirdness, like for example the program that throws a syntax error if and only if an integer template parameter is composite: https://stackoverflow.com/a/14589567/1204143
Monomorphization would ideally be treated the same way.
Yes, there are aspects of C++ that make that so difficult that it's just not realistic, but really I was only bringing up C++ because the parent poster certainly meant the C++ std::sort. The same argument could also be applied to other languages that have less baggage, e.g. Rust comes to mind.
If this question interests you, it is worth watching.
If the comparisons are expensive enough to warrant not inlining them, then I very strongly doubt you'll find a meaningful speed difference between qsort and std::sort. I don't know if you're concerned about code size and instruction cache etc., but I doubt that matters if comparisons are expensive.