
Infectious Executable Stacks - pantalaimon
https://nullprogram.com/blog/2019/11/15/
======
brynet
Indeed, this is a very serious problem on other systems. Even more so than
this article suggests. This has impacted portable versions of OpenSSH on
several occasions, due to certain major PAM modules, meaning that sshd on
Linux would always have a writable/executable stack.

The OpenBSD project has been fighting the ecosystem on this for a long time;
there are no executable stacks on OpenBSD.

[https://github.com/openbsd/src/commit/32c3fa3f83422ae0dec73c...](https://github.com/openbsd/src/commit/32c3fa3f83422ae0dec73cbfe03bee1501cbb25b)

The pmap support appeared in subsequent commits for multiple platforms, almost
18 years ago, even before support for the amd64 architecture landed in-tree in
2004. It had non-exec stacks from day 1.

Historically we didn't include a section in the binary indicating noexec
stack, because that was the default.

[https://marc.info/?l=openbsd-cvs&m=149606868308439&w=2](https://marc.info/?l=openbsd-cvs&m=149606868308439&w=2)

The OpenBSD kernel and ld.so (runtime-linker) have always ignored
PT_GNU_STACK. On other platforms, its absence is taken to mean an
executable stack. The mupdf-x11 example in this article highlights that. The
default should be a non-exec stack, but is instead treated like the NX bit.
It's a fail-open design.
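
To make the fail-open versus fail-closed distinction concrete, here is a minimal sketch in C. It models only the policy as described in this comment, not the actual logic of any particular kernel or loader (real behaviour varies by architecture and version). The function names and the `stack_policy` enum are made up for illustration; `Elf64_Phdr`, `PT_GNU_STACK` and `PF_X` come from the Linux `<elf.h>` header.

```c
#include <elf.h>     /* Elf64_Phdr, PT_GNU_STACK, PF_X (Linux/glibc header) */
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum stack_policy { STACK_NOEXEC = 0, STACK_EXEC = 1 };

/*
 * Fail-open model: PT_GNU_STACK acts as an opt-out, so a binary that
 * lacks the header entirely ends up with an executable stack.
 */
enum stack_policy fail_open_policy(const Elf64_Phdr *ph, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ph[i].p_type == PT_GNU_STACK)
            return (ph[i].p_flags & PF_X) ? STACK_EXEC : STACK_NOEXEC;
    }
    return STACK_EXEC; /* header absent: assume executable */
}

/*
 * Fail-closed model: the header is ignored and the stack is never
 * executable, which is how the comment describes OpenBSD.
 */
enum stack_policy fail_closed_policy(const Elf64_Phdr *ph, size_t n)
{
    (void)ph;
    (void)n;
    return STACK_NOEXEC;
}
```

Under the fail-open model, one object file that forgot the marker is enough to change the result for the whole program; under the fail-closed model, nothing in the headers can.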

If you compare OpenBSD against a modern Linux distro, you'll be surprised.

~~~
ekianjo
> If you compare OpenBSD against a modern Linux distro, you'll be surprised.

How does FreeBSD compare vs OpenBSD on that aspect?

~~~
notaplumber
Probably the same (or worse) as Linux. IIRC FreeBSD still had executable
stacks by default until at least 2016?

Edit: I give FreeBSD too much credit.

The kern.elf64.nxstack / kern.elf32.nxstack sysctls were introduced in FreeBSD
9.0, and are 0 by default. 1 enables PT_GNU_STACK behaviour. Apparently not
implemented for all platforms. So you get to choose between executable stacks
by default and the same nasty behaviour Linux has.

------
quotemstr
The GNU executable and linking model has a ton of warts and I wish we could
start from scratch. I recently found out about another ugly wart:
STB_GNU_UNIQUE. Over a decade ago, someone noticed programs crashing due to
symbol collisions causing use-after-free bugs in std::string, so instead of
fixing the symbol collision, this person added a flag (STB_GNU_UNIQUE) that
would force the colliding symbols ("vague linkage" ones) to be merged even
among different libraries loaded with RTLD_LOCAL. Later, someone noticed that
due to this merging, if you unloaded a library that exported a STB_GNU_UNIQUE
symbol, other, unrelated libraries would crash. The awful solution to _that_
problem was just to make these STB_GNU_UNIQUE libraries immune to dlclose.

This whole chain of hacks is infuriating. Unloading code is a good and useful
thing to do. To silently disable dlclose to fix a bad fix for a C++ problem
that shouldn't have been a problem in the first place is, well, why we can't
have nice things at an ABI level.

I'm convinced that the root of the problem is the ELF insistence on a single
namespace for all symbols. Windows and macOS got it right: the linker
namespace should be two-dimensional. You don't import a symbol X from the
ether: you import X _from some specific_ libx.so. If some other library
provides symbol X, say, libx2.so, that's not a problem, because due to the
import being not a symbol name but a (library, symbol name) pair, there's no
collision possible even though libx.so and libx2.so provide the same symbol.
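
The difference between the two lookup schemes can be shown with a toy model in C. This is not how any real dynamic linker is implemented; the table, the `exported_sym` struct and both resolver functions are hypothetical and exist only to show why a (library, symbol) key removes the collision.

```c
#include <stddef.h>
#include <string.h>

/* One exported definition: which library provides which symbol. */
struct exported_sym {
    const char *library;
    const char *symbol;
    int value;              /* stands in for the symbol's address */
};

/*
 * Flat namespace: only the symbol name is the key, so the first
 * definition in load order wins, whichever library it came from.
 */
const struct exported_sym *resolve_flat(const struct exported_sym *tab,
                                        size_t n, const char *symbol)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tab[i].symbol, symbol) == 0)
            return &tab[i];
    }
    return NULL;
}

/*
 * Two-dimensional namespace: the import names a (library, symbol)
 * pair, so identical names in different libraries never collide.
 */
const struct exported_sym *resolve_two_level(const struct exported_sym *tab,
                                             size_t n, const char *library,
                                             const char *symbol)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tab[i].library, library) == 0 &&
            strcmp(tab[i].symbol, symbol) == 0)
            return &tab[i];
    }
    return NULL;
}
```

With both libx.so and libx2.so exporting X, `resolve_flat` can only ever hand back one of them, while `resolve_two_level` returns whichever one the importer asked for.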

~~~
saagarjha
> Unloading code is a good and useful thing to do.

Why? What are you doing that requires unloading code?

~~~
cesarb
A simple example from work (though ours is in Java): providing a plugin
interface, where the plugins can be updated to a new version by the user
without taking down the whole server. (And yes, we've had our share of class
loader leaks, I personally fixed a couple of them.)

~~~
saagarjha
Ok, so the use case you have seems to differ from how dynamic libraries are
currently designed: if you have a plugin interface, you’re the only one who
controls loading (and hence unloading) so it makes sense to have those
operations actually revolve around your whims. But with dynamic libraries
you’re not necessarily the only user of that library, so depending on dlclose
to actually unload the library is not really something you can do.

~~~
quotemstr
> use case you have seems to differ from how dynamic libraries are currently
> designed

Other operating systems manage to support library isolation and safe unloading
just fine. Windows has done it for decades. There's no reason except a long
history of bad decisions that the ELF world should have a hard time with a
concept that comes easily everywhere else.

> But with dynamic libraries you’re not necessarily the only user of that
> library, so depending on dlclose to actually unload the library is not
> really something you can do.

Libraries are reference counted. They get unloaded when the last reference
disappears. (And if necessary, we should be able to create island namespaces
with independent copies of loaded libraries.)

~~~
saagarjha
I'm not too familiar with dynamic loading on Windows, but looking at the API
(LoadLibrary/FreeLibrary) how is it any different? It maintains a reference
count and unloads when it reaches zero.

------
burfog
It didn't have to be done this way. In case somebody wants to fix this, here
is an alternate implementation:

There is a separate library, or a part of libgcc, that can be mapped in
multiple times. This allows trampolines without any limit other than memory.
Alternately, just decide how many you want in advance, and place them into the
executable.

Trampolines are conceptually in an array, to be allocated one by one, but
there are actually two arrays because we'd want to separate the executable
part from the non-executable part. On a 64-bit architecture with 4096-byte
pages we might have a read-write page with 256 pairs of pointers, an execute-
only page with 256 trampoline functions (which use those pairs of pointers),
and possibly another execute-only page with helper code. The trampoline
functions can use the return address on the stack to determine location,
though x86_64 also offers RIP-relative addressing that could be used.

These arrays could be done as thread-private data or done with locking. Pick
your poison. The allocation changes accordingly, along with the space
requirements and cache-line bouncing and all. Be sure to update setjmp and
longjmp as required.
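
The "decide how many you want in advance, and place them into the executable" variant can be sketched in portable C. This is only an illustration under several assumptions: the pool holds 4 trampolines rather than 256, each trampoline finds its slot by a compile-time index instead of by return address or RIP-relative addressing, there is no locking or thread-private storage, and all names (`tramp_alloc`, `tramp_free`, `sort_ints`, and so on) are invented here.

```c
#include <stddef.h>
#include <stdlib.h>

typedef int (*cmp_fn)(const void *, const void *);
typedef int (*cmp_ctx_fn)(const void *, const void *, void *);

/* Read-write half: one (function, context) pair per trampoline. */
struct slot { cmp_ctx_fn fn; void *ctx; };

#define POOL_SIZE 4
static struct slot slots[POOL_SIZE];

/* Executable half: ordinary compiled functions, one per slot. */
#define DEFINE_TRAMPOLINE(i)                                \
    static int tramp_##i(const void *a, const void *b)      \
    {                                                       \
        return slots[i].fn(a, b, slots[i].ctx);             \
    }
DEFINE_TRAMPOLINE(0)
DEFINE_TRAMPOLINE(1)
DEFINE_TRAMPOLINE(2)
DEFINE_TRAMPOLINE(3)

static const cmp_fn tramps[POOL_SIZE] = { tramp_0, tramp_1, tramp_2, tramp_3 };

/* Claim a free slot; returns NULL when the pool is exhausted. */
cmp_fn tramp_alloc(cmp_ctx_fn fn, void *ctx)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (slots[i].fn == NULL) {
            slots[i].fn = fn;
            slots[i].ctx = ctx;
            return tramps[i];
        }
    }
    return NULL;
}

/* Release the slot that belongs to trampoline t. */
void tramp_free(cmp_fn t)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (tramps[i] == t) {
            slots[i].fn = NULL;
            slots[i].ctx = NULL;
        }
    }
}

/* Comparator that needs context: ctx points to +1 or -1. */
int cmp_int_dir(const void *a, const void *b, void *ctx)
{
    int x = *(const int *)a, y = *(const int *)b;
    return *(const int *)ctx * ((x > y) - (x < y));
}

/* Plain qsort, no executable stack: the context lives in a slot. */
int sort_ints(int *v, size_t n, int descending)
{
    int dir = descending ? -1 : 1;
    cmp_fn t = tramp_alloc(cmp_int_dir, &dir);
    if (t == NULL)
        return -1;
    qsort(v, n, sizeof *v, t);
    tramp_free(t);
    return 0;
}
```

Nothing here is generated at run time, so no page is ever both writable and executable. The cost is the fixed pool size and the need to free slots explicitly, including on any longjmp path.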

There is also a much simpler answer to the usual security problems. If the
stack will become executable and the option hasn't been given to explicitly
allow this, linking should always fail at every step. Propagating the lack of
a no-exec flag should be impossible without confirmation of intent.

~~~
Gibbon1
I use them and I don't care about the trampolines because I don't have
protected memory. Having used them, they are really useful. Leading me to
think that the biggest thing missing from C is closures. Fix that and C
becomes a lot better language.

~~~
psyclobe
That and destructors, maybe some constexpr evaluated stuff like C++ is doing
now, and then really C would become way funner to code in.

~~~
quotemstr
So just write in C++ and don't use the parts you don't like. I've never
understood the almost-reflexive aversion to C++ you see in some low-level
circles. If you don't want huge Java-style class hierarchies, don't make them.
If you don't want templates to bloat your code, use templates sparingly and
take advantage of type erasure. You can make a C++ program look like anything
you want.

~~~
convolvatron
yes you can. but if you walk into a C++ shop, every developer has thrown in a
few of their favorite tricks. everyone says they are using a 'sane subset',
but it always seems to turn out to be 'most of it'.

~~~
quotemstr
The people who use the phrase "sane subset" are those who in my experience
distort C++ beyond recognition by banning particular features of the language
wholesale rather than shaping the _programs_ that they write to fit the
environment.

You should be using the whole C++ language. There's no such thing as a "sane
subset", because all of the language has "sane" uses. What you want is to
create good programs.

------
devit
They could just use a secondary manually allocated executable stack with
pointers and data stored in thread-local variables.

Anyway, normally in C interfaces that take a callback also take a pointer-
sized value to pass to the callback, precisely to avoid having to generate
code dynamically like this, so this feature shouldn't be that prevalent.

For example, glibc has a qsort_r function that works like that.
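
As a rough illustration of that callback-plus-context shape, here is a small sketch in C. To stay portable it does not call glibc's qsort_r itself; `isort_r` is a hypothetical insertion sort over ints whose comparator takes a third context argument in the same style, and `by_distance` is an invented example comparator.

```c
#include <stddef.h>

/* Comparator that receives a caller-supplied context pointer. */
typedef int (*cmp_r_fn)(const void *a, const void *b, void *ctx);

/* Hypothetical helper: stable insertion sort that forwards ctx untouched. */
void isort_r(int *v, size_t n, cmp_r_fn cmp, void *ctx)
{
    for (size_t i = 1; i < n; i++) {
        int key = v[i];
        size_t j = i;
        while (j > 0 && cmp(&v[j - 1], &key, ctx) > 0) {
            v[j] = v[j - 1];
            j--;
        }
        v[j] = key;
    }
}

static int distance(int x, int pivot)
{
    return x > pivot ? x - pivot : pivot - x;
}

/* Orders ints by their distance from the pivot that ctx points to. */
int by_distance(const void *a, const void *b, void *ctx)
{
    int pivot = *(const int *)ctx;
    int da = distance(*(const int *)a, pivot);
    int db = distance(*(const int *)b, pivot);
    return (da > db) - (da < db);
}
```

The caller keeps the pivot in an ordinary local variable and passes its address, for example `isort_r(v, n, by_distance, &pivot)`, so no nested function and no generated code is needed.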

------
barrkel
Arguably, this is a symptom of C's anemic runtime library culture.

If the compiler could generate code that allocates executable memory, it could
generate a stub off in an executable heap, and free it when the called function
returns. This would handle nested functions in the downward funarg situation
(the only one that's handled by GCC C (and Standard Pascal)).

Things would be complicated by longjmp (the stubs wouldn't be freed), but it
could be handled similarly to how C++ compilers handle destructors for stack
frames that are skipped by a longjmp; per the standard, the destructors aren't
called, but the compiler may do so if appropriately configured.

But without a C runtime library function to allocate executable memory, the
compiler is a bit stuck.

~~~
amluto
I disagree. Generating code at runtime like this is a giant can of worms. The
right fix is to have a calling convention that allows closures to be passed a
context parameter, so a closure would be a function pointer and a data
pointer. Basically every reasonably modern language can do this. GCC’s hack to
shoehorn closures into C couldn’t.

If GCC had a special function pointer type for closures, they would have
worked just fine.
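
What such a closure type might look like can be approximated in standard C with a two-word struct. This is a hand-written sketch, not a GCC feature: `int_closure`, `closure_call`, `make_adder` and `map_ints` are hypothetical names, and a real language-level closure type would have the compiler build the struct and the environment for you.

```c
#include <stddef.h>

/* A closure as a plain value: a code pointer plus an environment pointer. */
struct int_closure {
    int (*fn)(void *env, int arg);
    void *env;
};

/* Calling a closure passes its environment as a hidden first argument. */
int closure_call(struct int_closure c, int arg)
{
    return c.fn(c.env, arg);
}

/* Variables captured by the example closure. */
struct adder_env { int amount; };

static int add_amount(void *env, int arg)
{
    return arg + ((struct adder_env *)env)->amount;
}

/* Builds a closure over env; env must outlive the closure. */
struct int_closure make_adder(struct adder_env *env)
{
    struct int_closure c = { add_amount, env };
    return c;
}

/* A consumer that accepts the closure type instead of a bare function pointer. */
void map_ints(int *v, size_t n, struct int_closure f)
{
    for (size_t i = 0; i < n; i++)
        v[i] = closure_call(f, v[i]);
}
```

Because the environment travels next to the code pointer, nothing has to be written to the stack and then executed. The catch is that existing interfaces taking a bare function pointer cannot accept this type.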

~~~
dooglius
Except functions like qsort which are in the standard library wouldn't have
such functions in the signature, so the code in the example still wouldn't
work.

~~~
amluto
glibc's qsort_r() takes an extra argument that it passes through to the
callback, which could be used for this.

------
petermcneeley
This post is concerned with security, but the performance of this code is
actually also scary. Does it execute that trampoline for each comparison? In
C++ the entire code would be fully inlined with a std::sort.

~~~
atq2119
qsort in general executes one call per comparison; this doubles that to two.
Whether the std::sort method is better depends on the cost of the comparison:
for cheap comparisons, having it inlined into the sorting routine is better.
For expensive comparisons, it's likely better to have only a single
instantiation of the sorting algorithm that uses function calls.

I'm curious: Is there a programming language or a compiler that is actually
smart enough to make that distinction? In principle, C++ compilers wouldn't
_have_ to monomorphize template instantiations all the time: it's in principle
possible to compile templates in a way that adds a hidden "type" parameter to
the function which essentially acts as an out-of-band vtable to provide the
operations that the template uses (yes, there are ABI issues).
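
The hidden-parameter idea can be written out by hand in C, where it is the only option anyway. The sketch below is an assumption about what such compiled output could resemble, not a description of any existing compiler: `type_ops` plays the role of the out-of-band vtable, and `generic_max` is a single compiled routine shared by every element type.

```c
#include <stddef.h>

/* The "hidden type parameter": what the generic code needs to know about T. */
struct type_ops {
    size_t size;                                /* sizeof(T) */
    int (*less)(const void *a, const void *b);  /* a < b for two T values */
};

/*
 * One non-monomorphized routine for all element types.
 * Returns a pointer to the largest element, or NULL when n is 0.
 */
const void *generic_max(const void *base, size_t n, const struct type_ops *ops)
{
    const unsigned char *p = base;
    const void *best = NULL;
    for (size_t i = 0; i < n; i++) {
        const void *cur = p + i * ops->size;
        if (best == NULL || ops->less(best, cur))
            best = cur;
    }
    return best;
}

int less_int(const void *a, const void *b)
{
    return *(const int *)a < *(const int *)b;
}

int less_double(const void *a, const void *b)
{
    return *(const double *)a < *(const double *)b;
}

/* One table per type instead of one copy of the algorithm per type. */
const struct type_ops int_ops = { sizeof(int), less_int };
const struct type_ops double_ops = { sizeof(double), less_double };
```

Every comparison goes through an indirect call, which is the cost being weighed against code size: worthwhile when the comparison is expensive, wasteful when it is a single machine instruction.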

~~~
nneonneo
There are a _lot_ of operations in C++ which behave wildly differently across
types. Polymorphically compiling templates would either involve some
significant performance hits, or significant limitations on how much
polymorphism could actually take place. Unfortunately, the language is not
designed to make it easy to avoid monomorphizing templates.

For example, due to operator overloading, given "T a, b;", you don't know
whether "a + b" should be a simple addition or a function call. For full
genericity, you'd want to compile this as a vtable'd addition, but that's
going to severely harm performance for primitive types. So you'll have to
monomorphize for primitive types. This applies to _every_ operator in C++, of
which there are a lot.

But that's just performance. Worse, the semantics of operators can depend on
whether they are overloaded or not. For example, "a && b" short-circuits
normally, but does not short-circuit if "operator&&" is overloaded. So now you
need a variant for whether or not operator&& is overloaded (similar for
"operator||", and even "operator,").

I'm fairly sure I've only scratched the surface of weird things that come up
with compiling C++ templates. There are loads of other fun examples of template
weirdness, such as the program that throws a syntax error if and only if an
integer template parameter is composite:
[https://stackoverflow.com/a/14589567/1204143](https://stackoverflow.com/a/14589567/1204143)

~~~
atq2119
Well yes: monomorphization is an important optimization, but I'd think of it
like inlining: whether a function is inlined is subject to a heuristic that
aims to get the majority of the benefit of inlining while reducing code size
blow-up.

Monomorphization would ideally be treated the same way.

Yes, there are aspects of C++ that make that so difficult that it's just not
realistic, but really I was only bringing up C++ because the parent poster
certainly meant the C++ std::sort. The same argument could also be applied to
other languages that have less baggage, e.g. Rust comes to mind.

~~~
saagarjha
It’s a bit orthogonal to monomorphization, but I believe C++ compilers can
occasionally deduplicate identical template instantiations, where “identical”
means they compile to the same assembly even though the types are different.

~~~
account42
One next step that I don't think C++ compilers do yet is deduplicate functions
that differ only in constants used or functions called. This could help with
std::sort-style functions where the outer algorithm can work on generic
pointers and only the comparison is a call to a different function.

------
bdarnell
This misfeature can show up in surprising places. Any use of assembly (`.s`)
files in a Go program will by default add the executable-stack flag, as we
discovered in CockroachDB:
[https://github.com/cockroachdb/cockroach/issues/37885](https://github.com/cockroachdb/cockroach/issues/37885)

