The first reason was disk space: object files could simply be smaller. Nowadays, for all intents and purposes, disk space is free and unlimited.
The second was the ability to ship fixes to system facilities. These fixes were fragile, leading to even more complex mechanisms such as versioning, so though the idea was good in principle, in practice it was very painful. I believe the Windows folks had their own name for it: "DLL Hell". The npm people still suffer from this from time to time, sometimes notoriously.
As for the fixes: software isn't updated on magtape any more, and most systems have robust upgrade systems. In addition there are all manner of quasi-hermetic isolation systems (VMs, docker images, venvs and the like) so why not just ship a static binary?
Plus with a static binary, if you really care about the order of loading/unloading etc. (which you ideally shouldn't), it's trivial to manage with a custom linker script.
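Not a linker script per se, but the related compiler-level knob for the same thing is GCC/Clang constructor priorities. A minimal sketch, assuming a single statically linked binary:

```c
/* Sketch: pinning init/fini order inside one static binary with
   constructor priorities (a GCC/Clang extension; values <= 100 are
   reserved). Constructors run lowest-priority first; destructors
   run in the reverse order. */
#include <stdio.h>

__attribute__((constructor(101))) static void log_up(void)   { puts("log up");   }
__attribute__((constructor(102))) static void net_up(void)   { puts("net up");   }
__attribute__((destructor(102)))  static void net_down(void) { puts("net down"); }
__attribute__((destructor(101)))  static void log_down(void) { puts("log down"); }

int main(void) { return 0; }  /* prints: log up, net up, net down, log down */
```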
My only exception to this is the kernel: it can be upgraded, and mostly promises not to vary system call semantics too much, so it's OK to me not to link the kernel into my binary :-).
It's not about disk space, it's about runtime RAM usage. The code for a .so is going to be slightly larger than the equivalent .a: at the very least, you'll burn extra space for the PLT and GOT, and PIC relocations tend to take more space than static-linking relocations.
The advantage of a .so you're talking about is that you can load a library shared by multiple processes at a fixed virtual address that's the same for all processes, and thus you only burn the library once in RAM. For static libraries, the embedded addresses are going to be slightly different depending on the binaries they're linked into, so even magic deduplication isn't going to keep you from having the library in multiple places in physical memory.
Ancient Linux distributions once had something called "a.out shared libs". These were mapped to fixed addresses. The distribution had to work out non-conflicting addresses for all available libraries. The build procedure for a.out shared libs involved intercepting the assembly output of GCC and doing some text processing on it.
During the (static) linking of libraries, an extra metadata item would be stored in the ELF file: the canonical pathnames of all the static libraries used, along with the base addresses of their R/X segments (code) and R/O segments (read-only data). Upon forking a process, the information would be provided to the kernel and passed over to KSM, to use as the basis for same-page lookup and possible merging of the pages.
This effectively inverts the current mechanism. Currently, when libraries are dynamically loaded, the dynloader uses the actual file pathname of the shlib as the key for the operation opposite to "merging" -- i.e., it re-uses the already-loaded R/X and R/O pages by adding the proper memory mappings.
The proposed change is three-fold:
1) extend the linker to add the metadata to ELF,
2) extend the in-kernel ELF interpreter to extract the info upon exec() and friends,
3a) extend KSM with a limited mode where it would look up & merge only the hinted memory regions, in a linear fashion, right after exec() and friends.
3b) modify the VM subsystem to extend the current swap/SHM so it provides a unique address range for each static-lib canonical pathname. Requesting pages from this address range would map in the already-shared pages, if any process has already loaded that library.
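For (1), the metadata might look something like this. Purely a hypothetical layout; the note type and all field names are invented for illustration, nothing like it exists in the ELF spec today:

```c
/* Hypothetical ELF note payload for the proposal above. */
#include <stdint.h>

#define NT_STATICLIB_HINTS 0x1000  /* made-up note type */

struct staticlib_hint {
    uint64_t rx_vaddr;   /* base of this library's R/X (code) segment   */
    uint64_t rx_size;
    uint64_t ro_vaddr;   /* base of its R/O (read-only data) segment    */
    uint64_t ro_size;
    uint16_t path_len;   /* length of the canonical .a pathname below   */
    /* followed by: char path[path_len]; NUL-terminated canonical path  */
};
```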
Also, it's quite likely that statically linked libraries wouldn't be located at the exact same page offsets every time they're linked into a binary, making page-level dedupe not useful.
Linux does have something called "KSM" that de-dupes memory pages, but it appears that it's only so a hypervisor can de-dupe guest OS pages? Odd that it's not more general than that. Though I guess there is a cost: the KSM code needs to scan through memory to find duplicate pages, which I can't imagine is fast.
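From what I can tell, KSM is actually general-purpose but opt-in: ksmd only scans regions a process has flagged via madvise(2), which is presumably why hypervisors are its main users. A minimal sketch of opting in, assuming Linux with CONFIG_KSM and ksmd running:

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1 << 20;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memset(p, 0x42, len);  /* identical pages: merge candidates */

    /* hand the region to ksmd for scanning and possible merging */
    if (madvise(p, len, MADV_MERGEABLE) != 0)
        perror("madvise(MADV_MERGEABLE)");
    return 0;
}
```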
Shared libraries are a goofy hack dating back to a time when people mistakenly thought they needed them, and they need to go away.
Err, we're running Electron apps today, with a full-blown embedded browser engine, a VM, and a hellishly complex DOM for rendering. Others run apps inside containers, with their own basic OS and standard libs -- compared to that, the above is negligible.
Some of us try not to, on the belief that they don't perform acceptably.
The real problems with static linking don't have to do with fork, except in the sense of increasing reliance on VM overcommit (which IMO is a bad idea already). It's more to do with masking reuse of the same library across unrelated processes, as an efficiency issue but even more importantly in terms of tracking dependencies and updating software. No matter how good your configuration management (or similar mechanism) is, rebuilding large-N statically linked executables is less efficient and more error-prone than using the same information to update small-N shared libraries.
I don't think we need to get into my optimization street cred, but suffice it to say I've never used Electron and couldn't really tell you exactly what it is.
Electron is an embedded web browser meets cross-platform application platform... it's bloated, slow, and awful... it makes Java look snappy, lightweight, and brilliant.
You mean for your code, deployed on your systems. Yeah, that works fine.
Now ship a binary to a client system with your static TLS implementation, JPEG decoder, LZ decompressor, whatever. And then go hire someone to watch the CVE feeds every day to know when you have to get them to install an upgrade.
Or, if you don't want it to be your fault when they get pwned, you could just link against the system libraries and tell them to stay updated.
"Static only" makes sense for web apps developed and deployed within the same organization, for embedded solutions where the whole update/upgrade process is known to be under the control of the developer, and... basically nowhere else.
This advice is bad, sorry.
Things have in fact gotten much better over the years, though. Largely because developers (even of games) have moved away from this architecture and onto managed runtimes like Java and .NET, where the runtime and system provide cleaner separation of dependencies and don't require the "static all the time" nonsense that had been the norm since the beginning of PCs.
The problem described in the article appears when you dynamically load/unload shared libraries. This only comes up in the context of plugin systems, or sometimes hot code reloading, and in those cases there is no alternative; there is nothing you can do with static binaries to emulate this behaviour.
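For the record, the pattern in question looks roughly like this; plugin.so and plugin_main are invented names for illustration:

```c
/* build: cc main.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    void *h = dlopen("./plugin.so", RTLD_NOW | RTLD_LOCAL);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* look up the plugin's entry point by name */
    int (*plugin_main)(void) = (int (*)(void))dlsym(h, "plugin_main");
    if (plugin_main)
        plugin_main();

    dlclose(h);  /* the unload step that static linking has no answer for */
    return 0;
}
```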
A primary goal that you don't really cover is conserving not just disk space, but memory. The disk space saved is a proxy for memory savings, which are (even more) valuable. Shared libraries linked in to multiple independent binaries can consume little to no additional memory due to the magic of virtual memory. The same does not hold true for static-linked libraries in multiple binaries.
The reason I ask is, even as "bloated" as software is these days, I could fit ~100,000 fat 10 megabyte Go binaries on my SSD, without bothering to reach for compression. In practice, binaries on disk always pale in comparison to the amount of disk space I spend on media files, caches, backups and imaging, etc.
Perhaps your concerns about virtual memory are true, but looking at top on a machine running Firefox with many tabs open, Discord (an Electron-based application), 3 GNOME Terminal windows, HexChat, Wine, and Telegram... I still don't crack 100 tasks. Still.*
How about a server environment? Well, saving memory is certainly valuable on a server, but I'd be hard-pressed to imagine a scenario where you want so many distinct binaries running that it would make a major difference in utilization. The shared-memory savings that come from dynamically linked libraries are effectively removed when using Docker, because each Docker container is going to have its own isolated system libraries, and many of them probably won't be the same anyway. And I've never had a Kubernetes system where the majority of memory usage came from binary sizes.
I suppose if every single system binary was statically linked, you would maybe be able to notice some kind of difference. But honestly, probably not, unless you were really searching. Usually disk usage of binaries is a much less important concern than CPU and RAM usage, especially across large fleets.
*Though, that was only checking under my local user; it turns out there are a fair number more tasks, but a good portion of them are actually kernel tasks. There's 400 counting kernel tasks, but only 191 not counting them. Still not a very large number, in my opinion.
First, there’s load order, which can mean that you’re not actually loading the DLL you think you are.
The second is WOW64 (Windows on Windows). Rather than using fat binaries like Linux and Mac OS, Microsoft decided to use a “brilliant” strategy where the file system lies to you in the “right” ways: the old 32-bit folders are moved to new paths, and the folders with the old 32-bit names hold 64-bit binaries. This means that unless you use the “I really mean it” escape mechanisms (e.g. the “sysnative” directory), 32-bit binaries on 64-bit Windows are actually loading DLLs from different paths than the code says.
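A small demonstration, assuming a 32-bit build running on 64-bit Windows (version.dll picked arbitrarily):

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* A 32-bit process asking for System32 is silently given SysWOW64: */
    HMODULE m = LoadLibraryA("C:\\Windows\\System32\\version.dll");

    /* "Sysnative" is the "I really mean it" alias for the real 64-bit
       System32; it is only visible to 32-bit processes under WOW64. */
    DWORD attrs = GetFileAttributesA("C:\\Windows\\Sysnative\\version.dll");

    printf("redirected load: %p, sysnative visible: %s\n",
           (void *)m, attrs != INVALID_FILE_ATTRIBUTES ? "yes" : "no");
    return 0;
}
```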
* Speaking in my own capacity
In the name of stability, one might say to run external processes with shared memory instead, but now imagine the resource usage of something like IntelliJ or Eclipse using only processes for their plugins.
And on an 8GB machine, which most businesses still use nowadays.
In this kind of desktop software, each feature can be its own plugin: menu entries, UI widgets, integrations with external tools, database drivers, ...
Now take that to the extreme where every class could be a possible plugin, and even 300 would be a low bar.
IDEs are only one example; there are plenty of use cases in desktop software, music and graphics editing being two other common ones.
Browsers are VM managers. Their tabs are the tabs of a VM manager, like virt-manager for KVM or VMware vSphere. Browsers boot a JS VM quickly over HTTP and then communicate over the same HTTP.
Not on iOS?
More or less the same happens with C++ header-only libraries, which are basically uncompiled static libraries. They are very nice and possibly allow strong optimization, but are rather painful to handle in a distribution.
> Plus with a static binary, if you really care about the order of loading/unloading etc. (which you ideally shouldn't), it's trivial to manage with a custom linker script.
Why? It seems to me rather sensible to deinitialize resources in the opposite order from which you initialized them. This way the lifetimes of different resources are nested one within another instead of merely overlapping. When using RAII in C++ (which is basically just atexit at a finer grain) it often makes sense to have this requirement.
Without dynamic linking you duplicate all that stuff and, even worse, cannot be certain that all your executables use the same version.
It seems odd for the author to have invested so much time in this, but not to have found or patched the glibc test cases that demonstrate the behavior he expects in the various hairy cases glibc itself is worrying about, and which form the basis for its current behavior.
1. The program crashes during termination as atexit() tries to invoke the previously-registered-but-now-invalid function.
2. The function is silently skipped as it's no longer loaded.
3. The function is invoked upon dlclose().
4. dlclose() does not actually unmap the library.
Of these options, only the 3rd actually seems reasonable. The first two are obviously bad, and the 4th seems like it rather defeats the purpose of calling dlclose() if you can't actually unload the library.
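For concreteness, a minimal reproduction of the scenario. File names are invented, and which of the four behaviors you actually get depends on your libc:

```c
/* liba.c -- build: cc -shared -fPIC liba.c -o liba.so */
#include <stdio.h>
#include <stdlib.h>

static void handler(void) { puts("liba atexit handler"); }

__attribute__((constructor))
static void init(void) { atexit(handler); }  /* registered at dlopen time */

/* main.c -- build: cc main.c -ldl */
#include <dlfcn.h>

int main(void) {
    void *h = dlopen("./liba.so", RTLD_NOW);
    if (h) dlclose(h);  /* handler's code may now be unmapped...          */
    return 0;           /* ...and exit() is about to walk the atexit list */
}
```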
In any case, image unloading is traditionally done with C libraries, not with Obj-C/Swift, and we're talking about a C function (atexit).
As for making it a no-op, they said only on platforms other than macOS, which is fine because the other platforms (iOS, watchOS, tvOS) are sufficiently constrained that there really is no reason to ever dlclose() anyway (there's barely even any reason to dlopen() besides trying to poke at Apple SPIs; I think sqlite will dlopen to load extensions, but that's about the only valid reason that comes to mind).
It's kind of confusing to me why the author wants this or believes it should be the default behavior.
That being said, none of these behaviors really make any sense. If the registered function isn't mapped into the process address space any more, what is supposed to happen?
GCC has local functions which build trampolines on the stack. If I register a trampoline with atexit() what should happen if the process exits after that stack frame is gone? Gee, let's be idiots and go on a crusade against this Important Problem: of course, GCC must be patched to generate code to look for atexit-registered trampolines every time a stack frame is popped and run them right there and then.
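To spell out the analogy (nested functions are a GCC C extension; taking cleanup's address forces GCC to build a trampoline on setup()'s stack):

```c
#include <stdio.h>
#include <stdlib.h>

void setup(void) {
    int resource = 42;
    void cleanup(void) { printf("releasing %d\n", resource); }  /* nested fn */
    atexit(cleanup);  /* registers a pointer to a trampoline in setup()'s frame */
}   /* frame gone; the registered pointer now dangles, just like after dlclose */

int main(void) { setup(); return 0; }  /* exit() calls into a dead stack frame */
```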
The spec for atexit is very clear that it's about process termination.
The spec for dlclose doesn't say that it must forcefully unmap everything:
The basic description is "The dlclose() function shall inform the system that the symbol table handle specified by handle is no longer needed by the application." The "symbol table is not needed" doesn't mean "the chunks of memory referenced by function pointers, and any data they need is no longer needed". Those are not "symbol table" material. Also: "Although a dlclose() operation is not required to remove any functions or data objects from the address space, neither is an implementation prohibited from doing so."
As far as I can tell, “neither is an implementation prohibited from doing so” does not have an exception for atexit. In other words, a POSIX compliant system is free to implement atexit in a straightforward way without any magic to check which library the function pointers come from. On such a system, if atexit is called with a function from a dlopen’d shared library, and that library is subsequently dlclose’d, the C library will cheerfully call the now-invalid function pointer at program exit, resulting in undefined behavior. As such, using atexit from libraries intended to be dlclose’d is fundamentally non-portable. Even if some system supported the “atexit keeps a library alive” behavior you’re asking for, any program relying on it would have to be very careful to ensure it only gets run on that system: unlike other non-portable features that can be tested for at compile-time or runtime, or at least fail cleanly if they’re not supported, this one would give you no warning it’s unsupported other than a segfault, if you’re lucky. And the system had better clearly document that it is guaranteeing that behavior forever, as opposed to it being an implementation detail that could change at any time. IMO, it’s a much better idea to just avoid that pattern altogether.
Problem is, whether or not a library can be dlclose'd is not usually up to that library. Some things are explicitly intended as runtime-loadable plugins; others are dependencies of those plugins, and so on. Additionally, the person packaging a library for runtime linking may not be the library author.
The upshot of this is basically an ass-backwards obligation (not an unfamiliar situation when working in this area): atexit is only valid for use if you know exactly how code containing it will be linked, which is impossible for a lot of code. The failure mode is worse than other similar situations (e.g. 'a library I linked clobbered my signal handler' is easier to debug and mitigate).
All of that is a bit beside the point, though. We do know (and test for, and are careful about) the implementation details of atexit, and code around them, with all the benefits and drawbacks that entails. That's what this thread is about. "POSIX technically allows this function to blow up the universe when you call it" is a good thing to be mindful of, though.
True. The logical conclusion is that in most cases you just shouldn't use atexit from libraries, which... seems fine to me. In my opinion, a well-behaved library should endeavor to be cleanly unloadable without leaking any memory. (That includes using __attribute__((destructor)) if necessary; it's non-portable but without the blow-up-the-universe potential. Or just link in a C++ source file if you want portability.) That way, you can repeatedly load and unload different versions of the same library into the same process, without accumulating wasted memory over time. At least, that's useful in some cases for runtime-loadable plugins, and as you noted, other libraries may end up being loaded and unloaded as dependencies of those plugins. In fact, I'd argue that this rule of etiquette is least applicable when your library is a runtime-loadable plugin itself, but for a host that is known to never unload plugins.
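Concretely, the __attribute__((destructor)) route looks like this (a sketch; the attribute is a GCC/Clang extension):

```c
/* An unload-friendly library: cleanup is tied to the module, not the
   process. */
#include <stdlib.h>

static char *scratch;

__attribute__((constructor))
static void lib_init(void) { scratch = malloc(4096); }

__attribute__((destructor))
static void lib_fini(void) { free(scratch); scratch = NULL; }
/* runs when the library is unloaded, or at process exit if it never is */
```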
Admittedly, unloading libraries is fairly rare in practice, so I can forgive a library for not bothering with global destructors. But I don't think it's good practice to go out of your way to make a library not unloadable, at least without a good reason.
Indeed, I'm curious what exactly the author's use case is. They say they "need to have a dependable form of post-process destruction for my tests"... what tests? Why unload libraries in a test program?
But of course, that can't happen, because you can have N function calls between the registered atexit function and the call to the library, and deciding if any calls to the library are anywhere in the graph and will be called is probably equivalent to the halting problem.
I agree that none of the possibilities are what we actually want.
I don't think that's a problem: a simple sweep through the atexit list to see whether any of the pointers point into this library would be enough to block the unload.
If some atexit-registered function not in that unloaded library has a secretly stashed pointer to a function in that library, and calls it during atexit, that's a programmer problem.
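A sketch of that sweep, with invented names, since the real atexit list and the library's mapping bounds are libc/loader internals:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct atexit_entry { void (*fn)(void); };

/* Return false (block the dlclose) if any registered handler points
   into the library's mapped range. */
static bool may_unload(const struct atexit_entry *list, size_t n,
                       uintptr_t lib_base, size_t lib_size) {
    for (size_t i = 0; i < n; i++) {
        uintptr_t p = (uintptr_t)list[i].fn;
        if (p >= lib_base && p < lib_base + lib_size)
            return false;  /* a registered handler lives in this library */
    }
    return true;
}
```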
Edit: no-op dlclose() was something under consideration for dyld 3 on everything but macOS. https://devstreaming-cdn.apple.com/videos/wwdc/2017/413fmx92...
On process death vs on module death. Is the function to clean up the process, or is it to clean up the module?
I think the argument for module cleanup is stronger, irrespective of the difficulty of registering code that's supposed to last longer than the calling address space. The module had nothing before it was loaded, it should leave nothing behind when it is unloaded. It's symmetrical.
The difficulty of providing an executable callback that outlives its module just seals the deal.
You could make the same argument with threads.
Main process vs. child threads: you would expect atexit functions installed in a child thread to execute before the main process's atexit-installed functions.
SIGTERM can be masked. SIGKILL and SIGSTOP are the only ones that can't be.
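For reference, a sketch of the difference; attempts to block SIGKILL or SIGSTOP are silently ignored by the kernel:

```c
#include <signal.h>
#include <unistd.h>

int main(void) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);   /* this one will actually be blocked */
    sigaddset(&set, SIGKILL);   /* silently ignored by the kernel    */
    sigprocmask(SIG_BLOCK, &set, NULL);
    pause();  /* SIGTERM is now held pending; SIGKILL still kills us */
    return 0;
}
```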